TED recipes based on --topics are not working anymore #149

Closed
kelson42 opened this issue Dec 10, 2023 · 5 comments · Fixed by #151

@kelson42
Contributor

See https://farm.openzim.org/recipes?category=ted

[ted2zim::2023-11-29 04:01:10,401] INFO:Starting scraper with:
  langs: en
  subtitles : all
  video format : webm
[ted2zim::2023-11-29 04:01:10,401] INFO:Testing S3 Optimization Cache credentials
[ted2zim::2023-11-29 04:01:11,732] INFO:Using cache: s3.us-west-1.wasabisys.com with bucket: org-kiwix-ted
[ted2zim::2023-11-29 04:01:11,732] DEBUG:Fetching video links for topic: Business
[ted2zim::2023-11-29 04:01:11,733] DEBUG:generate_search_result_and_scrape: https://ted.com/talks?topics%5B%5D=Business&language=en&page=1
[ted2zim::2023-11-29 04:01:13,359] DEBUG:0 video(s) found on current page
[ted2zim::2023-11-29 04:01:13,359] INFO:Total video links found in Business: 0
[ted2zim::2023-11-29 04:01:13,359] ERROR:FAILED. An error occurred: No videos found for any topic in the language(s) requested. Check topic(s) and/or language code supplied to --languages
[ted2zim::2023-11-29 04:01:13,359] ERROR:No videos found for any topic in the language(s) requested. Check topic(s) and/or language code supplied to --languages
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.13-py3.8.egg/ted2zim/entrypoint.py", line 190, in main
    scraper.run()
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.13-py3.8.egg/ted2zim/scraper.py", line 1058, in run
    self.remove_failed_topics_and_check_extraction(failed)
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.13-py3.8.egg/ted2zim/scraper.py", line 1027, in remove_failed_topics_and_check_extraction
    raise ValueError(
ValueError: No videos found for any topic in the language(s) requested. Check topic(s) and/or language code supplied to --languages

@kelson42 kelson42 added the bug label Dec 10, 2023
@kelson42 kelson42 added this to the 2.1.0 milestone Dec 10, 2023
@benoit74
Collaborator

The first obvious thing is that filtering by language no longer works.

It however seems to be linked to a change in UI (as far as I remember, the UI was not like this last time I visited the website), so I'm not sure the rest will work either.

@benoit74 benoit74 reopened this Dec 11, 2023
@benoit74 benoit74 changed the title All TED recipes dying since 2 months: TED recipes based on --topics are not working anymore Dec 11, 2023
@benoit74
Collaborator

I confirm that without supplying --languages, the scraper manages to retrieve the page but fails to parse it correctly.

A second issue (hidden) is that the topic page (e.g. https://www.ted.com/talks?sort=relevance&topics%5B0%5D=Design) does not accept a page parameter anymore. One has to click on "Show more" to load more videos. This won't work with urllib / requests.
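To make the difference concrete, here is a minimal sketch of the two URL shapes mentioned in this thread: the paginated one from the scraper log and the current topic page one. The function names are hypothetical and are not taken from the ted2zim codebase; only the base URLs and query parameters come from the log and the link above.

```python
from urllib.parse import urlencode

# Base URLs exactly as they appear in this thread.
OLD_BASE = "https://ted.com/talks"      # from the scraper log
NEW_BASE = "https://www.ted.com/talks"  # from the topic page link


def old_topic_url(topic, language, page):
    """Hypothetical helper: the paginated URL the scraper log shows."""
    query = urlencode({"topics[]": topic, "language": language, "page": page})
    return f"{OLD_BASE}?{query}"


def new_topic_url(topic, sort="relevance"):
    """Hypothetical helper: the current topic page URL, which has no page parameter."""
    query = urlencode({"sort": sort, "topics[0]": topic})
    return f"{NEW_BASE}?{query}"


if __name__ == "__main__":
    print(old_topic_url("Business", "en", 1))
    print(new_topic_url("Design"))
```

Since the new URL carries no page number, a plain HTTP fetch of it can only ever see the first batch of talks; the rest is loaded by the "Show more" button.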

Are you aware of any new way to retrieve this list of videos filtered by topic?

It looks like we could plug directly into the underlying API used on the page, even if this is probably as fragile as parsing the HTML.

@rgaudin
Member

rgaudin commented Dec 11, 2023

I believe the scraper uses both, because the internal API was introduced later and some info was easier to access from it, but it has already changed in the past (hence the emphasis on internal).

Still appears to be a better strategy than the DOM. It's understood that those scrapers are fragile and as long as it doesn't change multiple times per day, it's an acceptable effort to adapt.

@benoit74
Collaborator

I did not find any reference to an internal API in the current codebase. Do you remember what it was used for? (I probably simply missed it.)

I only found scraping of the playlists or tasks page + using JSON found in every video page in a special <script> tag.
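For readers unfamiliar with that technique, here is a minimal sketch of pulling JSON out of a `<script>` tag using only the standard library. It is not the ted2zim implementation: the class and function names are hypothetical, and the script id `__NEXT_DATA__` is an assumption chosen for illustration, not something confirmed in this thread.

```python
import json
from html.parser import HTMLParser


class ScriptJsonExtractor(HTMLParser):
    """Collects the text of the <script> tag whose id matches script_id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._inside:
            self._chunks.append(data)

    def result(self):
        text = "".join(self._chunks).strip()
        return json.loads(text) if text else None


def extract_embedded_json(html, script_id="__NEXT_DATA__"):
    """Return the parsed JSON of the matching script tag, or None if absent.

    The default script_id is an assumption, not taken from ted2zim.
    """
    parser = ScriptJsonExtractor(script_id)
    parser.feed(html)
    parser.close()
    return parser.result()


if __name__ == "__main__":
    sample = (
        '<html><head><script id="__NEXT_DATA__" type="application/json">'
        '{"title": "Example talk", "langs": ["en"]}'
        "</script></head><body></body></html>"
    )
    print(extract_embedded_json(sample))
```

Like any approach tied to page internals, this breaks as soon as the site renames the tag or changes the JSON layout.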

Note that we should probably not fix this until discussion on #150 has settled.

@rgaudin
Member

rgaudin commented Dec 12, 2023

> I only found scraping of the playlists or tasks page + using JSON found in every video page in a special <script> tag.

I don't recall, but if you look at the live website, you'll see that every playlist and/or talk calls a JSON file that has the details. I believe we get some data out of it, but I haven't looked at this code base in a long time.
