Implement Robots.txt support #48
robots.txt is a file that should be used by crawlers: tools that discover urls (see https://developers.google.com/search/docs/advanced/robots/robots_txt or https://datatracker.ietf.org/doc/html/draft-koster-rep ). This tool is meant to be used after a crawler has been run, on the resulting validated urls.
Hi @rom1504, is the robots.txt protocol really meant only for "tools that discover urls"?
Does this mean it's a requirement that the crawler collecting the links only keep links that are not disallowed by the robots.txt of the target site? I'm not aware of any web datasets that do this and erase such links from the HTML captures. Also, CCBot checks the site's robots.txt before accessing any HTML page on that site, but it does not remove links from WARC (and WAT) captures if the link would be disallowed by the target site's robots.txt. In other words, there are several reasons why fetching a particular image might be disallowed by robots.txt while fetching the HTML pages linking to the image was allowed:
Happy to discuss these points – maybe it's worth reopening this issue or opening a new one. Thanks! (see also #249)
Makes sense. Feel free to implement such a tool.
Or can you think of a way this could be implemented efficiently within img2dataset's current architecture?
I don't think it's practical, given the maximum cache duration required by RFC 9309, section 2.4: "Crawlers SHOULD NOT use the cached version for more than 24 hours". Another point is the requirement for crawlers to "set their own name" (user-agent product token) and send it along with the HTTP request; see section 2.2.1.
I see two ways to go:
The robots.txt location is defined in the RFC (it is /robots.txt at the root of the host).
2 is not possible for two reasons:
1 may be possible, but we'd need to find a good way to automatically deploy a KV solution to keep the tool easy to use.
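For illustration only, here is a rough sketch of what option 1 could look like, with a plain dict standing in for whatever shared key-value store would actually have to be deployed; the class, parameter names, and fetch callback are all hypothetical and not part of img2dataset.

```python
import time
import urllib.robotparser


class RobotsKVCache:
    """Hypothetical shared cache of raw robots.txt bodies, keyed by host."""

    TTL = 24 * 3600  # RFC 9309 section 2.4: don't reuse a cached copy past 24 hours

    def __init__(self, kv_store, fetch_robots):
        self.kv = kv_store                # dict-like shared store: host -> (timestamp, body)
        self.fetch_robots = fetch_robots  # callable: host -> robots.txt body as str

    def can_fetch(self, user_agent: str, host: str, url: str) -> bool:
        entry = self.kv.get(host)
        if entry is None or time.time() - entry[0] > self.TTL:
            entry = (time.time(), self.fetch_robots(host))
            self.kv[host] = entry
        # Parse locally; storing the raw body (not a parser object) keeps cache
        # entries serializable, which matters once the store is shared between
        # worker processes or machines.
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(entry[1].splitlines())
        return parser.can_fetch(user_agent, url)
```

A local dict works for a single machine; the open question raised in the comment above is how to deploy the shared equivalent without complicating the tool's setup.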
Parsing robots.txt is a solved problem in Python; it's in the stdlib: https://docs.python.org/3/library/urllib.robotparser.html. Why not start by doing it once per worker and iterate from there to a more efficient solution in an Agile fashion?
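As a starting point, a minimal per-worker sketch along those lines, using only the stdlib; the product token "Img2datasetBot" is a made-up example (not a name img2dataset actually sends), and the 24-hour cap follows the RFC requirement quoted earlier.

```python
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

USER_AGENT = "Img2datasetBot"   # made-up product token; see RFC 9309 section 2.2.1
MAX_AGE = 24 * 3600             # RFC 9309 section 2.4: re-fetch after 24 hours

_robots_cache = {}  # per-worker cache: "scheme://host" -> (fetch_time, parser)


def allowed(url: str) -> bool:
    """Check one candidate image URL against its host's robots.txt."""
    parts = urlsplit(url)
    host = f"{parts.scheme}://{parts.netloc}"
    cached = _robots_cache.get(host)
    if cached is None or time.time() - cached[0] > MAX_AGE:
        parser = urllib.robotparser.RobotFileParser()
        try:
            req = urllib.request.Request(f"{host}/robots.txt",
                                         headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req, timeout=10) as resp:
                parser.parse(resp.read().decode("utf-8", "ignore").splitlines())
        except Exception:
            # Missing or unreachable robots.txt is treated as allow-all here for
            # simplicity; RFC 9309 actually distinguishes 4xx (allow) from 5xx.
            parser.parse([])
        cached = (time.time(), parser)
        _robots_cache[host] = cached
    return cached[1].can_fetch(USER_AGENT, url)
```

This is the "once per worker" approach: each worker keeps its own in-memory cache, at the cost of every worker re-fetching robots.txt for the hosts it sees.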
Feel free to give it a try.
No. A lot of sites were indexed by Common Crawl before their index was used to train AIs. They have since opted out, but their pages live on in old copies of the index. A website I maintain is regularly hit by img2dataset bots, even though the site now disallows and even blocks Common Crawl. The site sends the "noai" X-Robots-Tag, but this is a waste of CPU and bandwidth. It makes more sense to add something to robots.txt so that these crawlers just stay away.
Scripts and software for automated scraping must follow robots.txt rules; otherwise they may make the user liable for unauthorised use of data.