Fail on missing resources #160

Closed
rgaudin opened this issue Oct 26, 2020 · 5 comments · Fixed by #178

Comments

@rgaudin
Member

rgaudin commented Oct 26, 2020

As seen in #159, there are cases where we failed to download resources, yet the scraper run still completed successfully. We should fail on missing resources.

@satyamtg
Contributor

@rgaudin should we fail on all types of resources (which might be a bit tricky to implement, as we have some invalid URLs too), or should we fail only on major resources like videos?

@rgaudin
Member Author

rgaudin commented Oct 26, 2020

Can you describe the failures we currently get? Which URLs, and for what reasons?

@satyamtg
Contributor

Yep. Some of them happen during the subtitle download for some videos and fail with a 404, due to invalid links in the HTML (which can be very random). We currently check whether each download was successful and rewrite the links only for downloads that succeeded.

One solution would be to handle this explicitly for different xblocks and types of assets. A better solution would be to fail when we get errors and have exhausted all retry attempts. But then we need to make sure the URL actually exists, so that a random invalid URL in the course HTML does not fail the whole scraper.

Moreover, for some links, the content might not be available. An example would be video 8 on https://mooc.phzh.ch/courses/course-v1:PHZH+W-IB+2019_E/9a122b295d484793bbf1a33ab0217a69/ , which has been removed from YouTube, and hence youtube_dl would throw an error.
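
The "fail once all retry attempts are exhausted" option could be sketched roughly as below. This is only an illustration, not the scraper's actual code: `download_with_retries`, `MissingResourceError`, `PermanentFailure` and the `fetch` callable are all hypothetical names. The assumption is that the real download function can signal a known-invalid link (such as a 404) separately from a transient error, so that the caller can decide whether an invalid link should fail the run.

```python
import time
from typing import Callable, Optional


class MissingResourceError(Exception):
    """Hypothetical: resource still unavailable after all retry attempts."""


class PermanentFailure(Exception):
    """Hypothetical: error not worth retrying, e.g. an HTTP 404 on an invalid link."""


def download_with_retries(
    url: str,
    fetch: Callable[[str], bytes],
    attempts: int = 3,
    delay: float = 0.0,
) -> bytes:
    """Call fetch(url) up to `attempts` times and fail loudly if none succeeds."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except PermanentFailure:
            # Invalid link: retrying will not help, let the caller decide
            # whether to skip this resource or fail the whole run.
            raise
        except Exception as exc:
            # Treated as transient: remember it and try again.
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    raise MissingResourceError(
        f"{url} failed after {attempts} attempts"
    ) from last_error
```

With this split, the caller could let `MissingResourceError` abort the scraper while logging and skipping `PermanentFailure` for minor assets such as subtitles. Whether that split is right for every xblock is exactly the open question here.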

@stale

stale bot commented Dec 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Dec 26, 2020
@benoit74
Collaborator

Would it make sense to allow some failed resources, like I did on iFixit? I mean we should probably not fail on the first missing resource, but maybe an absolute and/or a relative threshold would make sense, e.g. if more than 10% of resources are missing, it means that we have a significant bug which should fail the scraper run. Does that make sense?
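
A threshold check along these lines could look roughly like the sketch below. The function and parameter names (`exceeds_failure_threshold`, `max_absolute`, `max_ratio`) are made up for illustration and are not taken from the iFixit scraper; only the 10% figure comes from the example above.

```python
from typing import Optional


def exceeds_failure_threshold(
    failed: int,
    total: int,
    max_absolute: Optional[int] = None,
    max_ratio: Optional[float] = None,
) -> bool:
    """Return True when the number of missing resources should fail the run.

    Either limit can be left as None to disable it. A limit is exceeded
    only when the count or ratio is strictly greater than it.
    """
    if total <= 0:
        # Nothing was attempted, so there is nothing to judge.
        return False
    if max_absolute is not None and failed > max_absolute:
        return True
    if max_ratio is not None and failed / total > max_ratio:
        return True
    return False


# Example: 11 missing resources out of 100 is more than 10%.
run_should_fail = exceeds_failure_threshold(11, 100, max_ratio=0.10)
```

The scraper would count failed and attempted downloads during the run and call this once at the end, exiting with a non-zero status when it returns True.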
