Add support for grabbing all videos, no matter the language #171

Closed
benoit74 opened this issue Mar 19, 2024 · 9 comments · Fixed by #174

Comments

@benoit74
Collaborator

The scraper should support the case where a user wants "all" languages.

For now this is not possible: the user has to pass the precise list of languages needed.

@JeremiahHerring

@benoit74 I have an idea of how we can approach this issue. In entrypoint.py we can add a parser argument that scrapes videos in all languages (shown in screenshot), and in the error handling we make sure the user doesn't pass both languages and all-languages at the same time. Then in scraper.py we would need to initialize all_languages in the class and make sure we convert all language queries into TED language codes. Please let me know if I'm on the right track here.
image
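
As a rough illustration of the proposal above, the "not both at the same time" check could be delegated to argparse's mutually exclusive groups. This is only a sketch: the parser name and the --all-languages flag are hypothetical, and nothing here is taken from the actual entrypoint.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real entrypoint.py defines many more options.
    parser = argparse.ArgumentParser(prog="ted2zim-sketch")
    # argparse rejects the command line if both options of the group are given.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--languages", help="comma-separated list of languages")
    group.add_argument(
        "--all-languages",
        action="store_true",
        help="hypothetical flag: scrape videos in every language",
    )
    return parser


def parse(argv):
    """Return (languages, all_languages) for the given argument list."""
    args = build_parser().parse_args(argv)
    return args.languages, args.all_languages
```

With this layout, passing both options makes argparse print a usage error and exit, so no custom error handling is needed in the scraper itself.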

@benoit74
Collaborator Author

@JeremiahHerring

I don't think we need to add a new parser argument. We can probably just make the language argument optional; if it is not set, it means we do not want a specific language but all videos available (in the selected topic(s) or playlist(s)).

Next, when the language argument is not set, we have to adapt the queries that use this argument (or the derived source_languages attribute, or any other attribute) so that they no longer filter by language. Adding all language codes is too cumbersome and risky (what if a new language appears and we do not support it yet in our list of TED codes?). It is important to adapt both run modes: by playlist and by topic. We also need to ensure that TED multi (where we create one ZIM per playlist or topic) is adapted as well.
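
A minimal sketch of that idea, under stated assumptions: parse_languages, select_videos and the "languages" key on each video are hypothetical names for illustration, not the scraper's real code, and the conversion from language names to TED codes is left out.

```python
import argparse
from typing import Optional


def parse_languages(argv: list[str]) -> Optional[list[str]]:
    """Return the requested languages, or None when the option is not set."""
    parser = argparse.ArgumentParser(prog="ted2zim-sketch")
    # Optional argument: leaving it out means "no language filter".
    parser.add_argument("--languages", default=None)
    args = parser.parse_args(argv)
    if not args.languages:
        return None
    return [lang.strip() for lang in args.languages.split(",") if lang.strip()]


def select_videos(videos: list[dict], languages: Optional[list[str]]) -> list[dict]:
    """Keep every video when languages is None, otherwise filter by language."""
    if languages is None:
        # No explicit list of all known codes: simply skip the filter,
        # so a language we have never heard of is still included.
        return list(videos)
    wanted = set(languages)
    return [video for video in videos if wanted & set(video["languages"])]
```

Using None as the "all languages" marker avoids maintaining a complete list of codes, which is the risk pointed out above.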

Is that clearer? WDYT about it?

@elfkuzco
Contributor

@benoit74, I have read your reply to the question and here's what I understand:
When no language is set, automatically download any video that is found. In essence, one should use the length of the source_language attribute as a flag before deciding to ignore or not.
Am I correct?

@benoit74
Collaborator Author

> When no language is set, automatically download any video that is found. In essence, one should use the length of the source_language attribute as a flag before deciding to ignore or not.

I would prefer to base the decision on the language value rather than on source_language, since the former is the real trigger, but you've got the point, yes.

benoit74 added this to the 3.0.0 milestone Mar 25, 2024
@elfkuzco
Contributor

elfkuzco commented Mar 25, 2024

@benoit74, I have been digging through the code and have made some fixes that should address the issue, but I need a bit of clarification. As there is no flag for a dry run, I had to dump the content of self.videos with and without languages specified, using the commands ted2zim --playlist=134 --name="the_most_popular_ted_talks_of_all_time" --debug --languages="English,French,German" and ted2zim --playlist=134 --name="the_most_popular_ted_talks_of_all_time" --debug respectively.

videos_with_lang.json
videos_without_langs.json

Where I need clarification: looking at the output, there is one video link regardless of whether the --languages option is specified. However, the entries differ in their languages and subtitles attributes. Is this the expected behaviour?

Also, I would like to propose disabling the --subtitles flag, or overriding it to ALL, when no language is specified.

@benoit74
Collaborator Author

> Where I would require clarification is looking at the output, there is one video link irrespective of if the --language attribute is specified or not. However, they differ in the languages and subtitles attributes. Is this the expected behavior?

It looks normal, yes:

  • having one video link is normal, there is only one video on TED
  • having many more subtitles and languages available when --languages is not set seems pretty logical
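
The two points above can be checked mechanically on dumps like the attached ones. The sketch below is an assumption-laden illustration: the actual structure of the JSON files is not shown in this thread, so the "video_link" and "languages" keys and the compare_dumps helper are hypothetical.

```python
def compare_dumps(with_langs: list[dict], without_langs: list[dict]) -> dict:
    """Summarise how a language-filtered dump differs from an unfiltered one."""
    # Hypothetical keys: each entry is assumed to carry a "video_link" string
    # and a "languages" list.
    links_with = {video["video_link"] for video in with_langs}
    links_without = {video["video_link"] for video in without_langs}
    langs_with = {lang for video in with_langs for lang in video["languages"]}
    langs_without = {lang for video in without_langs for lang in video["languages"]}
    return {
        # Expected to be True: there is only one video per talk on TED.
        "same_video_links": links_with == links_without,
        # Languages that only show up when no filter is applied.
        "extra_languages": sorted(langs_without - langs_with),
    }


filtered = [{"video_link": "https://example.org/talk1.mp4", "languages": ["en", "fr"]}]
unfiltered = [
    {"video_link": "https://example.org/talk1.mp4", "languages": ["en", "fr", "de", "ja"]}
]
summary = compare_dumps(filtered, unfiltered)
```

Real dumps would first need to be loaded, for example with json.load, and the key names adjusted to whatever self.videos actually contains.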

> Also, I would also like to propose disabling the --subtitles flag or override to ALL when no language is specified

What is the issue if we do not do this? I don't see the problem.

@elfkuzco
Contributor

Okay, I think I misunderstood it a little. Still wrapping my head around all the options.

@benoit74
Collaborator Author

Do not hesitate to keep asking questions or to speak up if what I'm saying makes no sense: you have the code under your eyes, I only have memories.

@elfkuzco
Contributor

Okay. Thanks for your assistance

3 participants