TED recipes based on `--topics` are not working anymore #149

kelson42 · 2023-12-10T16:44:14Z

See https://farm.openzim.org/recipes?category=ted

[ted2zim::2023-11-29 04:01:10,401] INFO:Starting scraper with:
  langs: en
  subtitles : all
  video format : webm
[ted2zim::2023-11-29 04:01:10,401] INFO:Testing S3 Optimization Cache credentials
[ted2zim::2023-11-29 04:01:11,732] INFO:Using cache: s3.us-west-1.wasabisys.com with bucket: org-kiwix-ted
[ted2zim::2023-11-29 04:01:11,732] DEBUG:Fetching video links for topic: Business
[ted2zim::2023-11-29 04:01:11,733] DEBUG:generate_search_result_and_scrape: https://ted.com/talks?topics%5B%5D=Business&language=en&page=1
[ted2zim::2023-11-29 04:01:13,359] DEBUG:0 video(s) found on current page
[ted2zim::2023-11-29 04:01:13,359] INFO:Total video links found in Business: 0
[ted2zim::2023-11-29 04:01:13,359] ERROR:FAILED. An error occurred: No videos found for any topic in the language(s) requested. Check topic(s) and/or language code supplied to --languages
[ted2zim::2023-11-29 04:01:13,359] ERROR:No videos found for any topic in the language(s) requested. Check topic(s) and/or language code supplied to --languages
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.13-py3.8.egg/ted2zim/entrypoint.py", line 190, in main
    scraper.run()
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.13-py3.8.egg/ted2zim/scraper.py", line 1058, in run
    self.remove_failed_topics_and_check_extraction(failed)
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.13-py3.8.egg/ted2zim/scraper.py", line 1027, in remove_failed_topics_and_check_extraction
    raise ValueError(
ValueError: No videos found for any topic in the language(s) requested. Check topic(s) and/or language code supplied to --languages


benoit74 · 2023-12-11T07:45:43Z

The first obvious thing is that filtering by language no longer works.

However, it seems to be linked to a UI change (as far as I remember, the UI was not like this the last time I visited the website), so I'm not sure the rest will work either.

benoit74 · 2023-12-11T08:23:14Z

I confirm that when `--languages` is not supplied, the scraper manages to retrieve the page but fails to parse it correctly.

A second, hidden issue is that the topic page (e.g. https://www.ted.com/talks?sort=relevance&topics%5B0%5D=Design) does not accept a page parameter anymore. One has to click on "Show more" to load more videos. This won't work with urllib / requests, which only fetch the HTML and cannot trigger that click.
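For reference, the page-based URL that appears in the log above can be rebuilt with the standard library. This is only a sketch of the old pagination scheme that no longer works; the helper name `legacy_topic_url` is hypothetical and not taken from the ted2zim codebase.

```python
from urllib.parse import urlencode

# Host and path exactly as they appear in the scraper log.
BASE_URL = "https://ted.com/talks"


def legacy_topic_url(topic: str, language: str, page: int) -> str:
    """Build the page-based listing URL (hypothetical helper, old scheme)."""
    # urlencode percent-encodes the brackets in "topics[]" as %5B%5D.
    query = urlencode({"topics[]": topic, "language": language, "page": page})
    return f"{BASE_URL}?{query}"


print(legacy_topic_url("Business", "en", 1))
```

With the new UI, incrementing `page` in such a URL no longer returns further results.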

Are you aware of any new way to retrieve this list of videos filtered by topic?

It looks like we could plug directly into the underlying API used on the page, even if this is probably as fragile as parsing the HTML.
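A minimal sketch of what paging through such an API could look like, assuming a response shaped like `{"results": [{"slug": ...}], "has_more": bool}`. That shape, the `slug` field and the function names are all assumptions for illustration, not the real TED API. The fetcher is injected so the loop can be exercised without network access.

```python
from typing import Callable, Dict, List

# A fetcher takes (topic, page_number) and returns the decoded JSON payload.
# The payload shape used below is an assumption, not the real API contract.
PageFetcher = Callable[[str, int], Dict]


def collect_talk_slugs(
    fetch_page: PageFetcher, topic: str, max_pages: int = 100
) -> List[str]:
    """Request successive pages until the API reports there is nothing more."""
    slugs: List[str] = []
    for page in range(max_pages):  # hard cap guards against an endless loop
        payload = fetch_page(topic, page)
        slugs.extend(item["slug"] for item in payload.get("results", []))
        if not payload.get("has_more"):
            break
    return slugs


def fake_fetch(topic: str, page: int) -> Dict:
    """Stand-in for the HTTP call: serves two canned pages of results."""
    pages = [["talk-a", "talk-b"], ["talk-c"]]
    return {
        "results": [{"slug": slug} for slug in pages[page]],
        "has_more": page < len(pages) - 1,
    }


print(collect_talk_slugs(fake_fetch, "Design"))
```

In a real scraper, `fake_fetch` would be replaced by a function that performs the HTTP request and decodes the JSON response.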

rgaudin · 2023-12-11T09:57:45Z

I believe the scraper uses both, because the internal API was introduced later and some information was easier to access from it. It has already changed in the past, though (hence the emphasis on internal).

It still appears to be a better strategy than parsing the DOM. It's understood that those scrapers are fragile, and as long as the API doesn't change multiple times per day, adapting to it is an acceptable effort.

benoit74 · 2023-12-12T10:20:47Z

I did not find any reference to an internal API in the current codebase. Do you remember what it was used for? (I probably simply missed it.)

I only found scraping of the playlists or talks page, plus use of the JSON found in every video page in a special `<script>` tag.
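As an illustration of the second technique, here is a minimal sketch of pulling JSON out of a `<script>` tag using only the standard library. The tag id `__NEXT_DATA__` and the sample payload are assumptions made for the example; the id actually used on TED pages, and the way ted2zim locates it, may differ.

```python
import json
from html.parser import HTMLParser
from typing import List, Optional


class ScriptJSONExtractor(HTMLParser):
    """Collect the text content of the <script> tag with a given id."""

    def __init__(self, script_id: str) -> None:
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)


def extract_script_json(html: str, script_id: str = "__NEXT_DATA__") -> Optional[dict]:
    """Return the decoded JSON payload, or None if the tag is absent or empty."""
    parser = ScriptJSONExtractor(script_id)
    parser.feed(html)
    parser.close()
    raw = "".join(parser._chunks).strip()
    return json.loads(raw) if raw else None


# Made-up page content, for illustration only.
sample = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"title": "Example talk", "languages": ["en", "fr"]}'
    "</script></body></html>"
)
print(extract_script_json(sample))
```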

Note that we should probably not fix this until discussion on #150 has settled.

rgaudin · 2023-12-12T10:31:37Z

> I only found scraping of the playlists or talks page + using JSON found in every video page in a special `<script>` tag.

I don't recall, but if you look at the live website, you'll see that every playlist and/or talk calls a JSON file that has the details. I believe we get some data out of it, but I haven't looked at this code base in a long time.

kelson42 added the bug label Dec 10, 2023

kelson42 assigned benoit74 Dec 10, 2023

kelson42 added this to the 2.1.0 milestone Dec 10, 2023

benoit74 closed this as completed Dec 11, 2023

benoit74 reopened this Dec 11, 2023

benoit74 changed the title ~~All TED recipes dying since 2 months:~~ TED recipes based on --topics are not working anymore Dec 11, 2023

benoit74 mentioned this issue Dec 12, 2023

TED is not pushing 6 big topics anymore #150

Closed

benoit74 mentioned this issue Dec 14, 2023

Fix fetching topics, moved to an internal search API #151

Merged

benoit74 closed this as completed in #151 Dec 18, 2023
