
Implement Robots.txt support #48

Open
Scoppio opened this issue Sep 16, 2021 · 9 comments
Labels: enhancement (New feature or request)

Comments

Scoppio commented Sep 16, 2021

Scripts and software for automated scraping must follow robots.txt rules; otherwise they may make the user liable for unauthorised use of data.

rom1504 (Owner) commented Sep 16, 2021

robots.txt is a file that should be used by crawlers: tools that discover urls (see https://developers.google.com/search/docs/advanced/robots/robots_txt or https://datatracker.ietf.org/doc/html/draft-koster-rep )

this tool is meant to be used after a crawler has been run, on the resulting validated urls

rom1504 closed this as completed Sep 16, 2021
sebastian-nagel commented

Hi @rom1504,

is the robots.txt protocol really meant only for "tools that discover urls"?

  1. RFC 9309 is addressed to "automatic clients known as crawlers". I think we can agree that img2dataset is an "automatic client" (or uses one).
  2. robots.txt files in the wild often provide rulesets addressing image "crawlers", e.g. "Googlebot-Image", "Baiduspider-image", "YandexImages".

this tool is meant to be used after a crawler has been run, on the resulting validated urls

Does this mean that it's a requirement that the crawler collecting the links only keeps links that are not disallowed in the robots.txt of the target site? I'm not aware of any web datasets that do this and erase such links from the HTML captures. Also CCBot checks the site's robots.txt before accessing any HTML page on that site but does not remove links from WARC (and WAT) captures if the link would be disallowed by the target site's robots.txt.

In other words, there are several reasons why fetching a particular image might be disallowed by robots.txt, while fetching the HTML pages linking to the image was allowed:

  1. images may be disallowed by robots.txt while HTML pages are not, e.g., by rules such as

    Disallow: /media/
    Disallow: /images/
    Disallow: *.gif$
    
  2. the image link was found in an HTML page on another site (the robots.txt of the site where the image is hosted may disallow fetching the image)

  3. different user-agents are used when crawling the HTML and later when fetching the images

  4. time gap: the robots.txt may change between accessing the HTML page and the image
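For illustration, here is a minimal sketch of the first case above, using Python's standard urllib.robotparser with a made-up ruleset (note that the stdlib parser does not understand the wildcard form "*.gif$", so only the path-prefix rules are shown):

    from urllib import robotparser

    # Hypothetical ruleset, mirroring the path-prefix rules quoted in point 1.
    rules = [
        "User-agent: *",
        "Disallow: /media/",
        "Disallow: /images/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    # The HTML page is allowed, the image under /images/ is not.
    print(rp.can_fetch("img2dataset", "https://example.com/gallery.html"))      # True
    print(rp.can_fetch("img2dataset", "https://example.com/images/photo.jpg"))  # False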

Happy to discuss these points – maybe it's worth reopening this issue or opening a new one. Thanks!

(see also #249)

check "robots.txt", as it is out of scope, and should probably be implemented in a way that allows caching to improve performance, and avoid multiple calls to "/robots.txt" per website.

rom1504 (Owner) commented Mar 1, 2023

Makes sense.
This needs to be addressed in a previous filtering step before using img2dataset, then.

Feel free to implement such a tool.

rom1504 (Owner) commented Mar 1, 2023

Or can you think of a way this could be implemented efficiently for img2dataset's current architecture?
I can't see how this could be done without doing 2 calls for each image. (Assuming it's even possible to find the robots.txt location)

sebastian-nagel commented

in a previous filtering step before using img2dataset

I don't think it's practical, given the maximum cache duration required by RFC 9309, section 2.4: "Crawlers SHOULD NOT use the cached version for more than 24 hours".

Another point would be the requirement for crawlers to "set their own name" (user-agent product token) and send it along with the HTTP request, see section 2.2.1.
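As a purely hypothetical illustration of the product-token point, the downloader would send its token with every HTTP request; the token "img2dataset" and the use of the requests library below are assumptions for this sketch, not existing img2dataset behaviour:

    import requests  # assumes the requests library; the actual HTTP stack may differ

    # Hypothetical product token per RFC 9309, section 2.2.1.
    USER_AGENT = "img2dataset/1.0 (+https://github.com/rom1504/img2dataset)"

    resp = requests.get(
        "https://example.com/images/photo.jpg",
        headers={"User-Agent": USER_AGENT},
        timeout=10,
    )

The same product token ("img2dataset") is what a robots.txt group such as "User-agent: img2dataset" would be matched against.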

Or can you think of a way this could be implemented efficiently for img2dataset's current architecture?

I see two ways to go:

  1. a central robots.txt cache (Redis?) - somewhat similar in functionality to the recommended "caching" DNS resolver
  2. partition the input so that all URLs from a single host end up in a single partition (or very few partitions for hosts with many URLs)
    • e.g. by hash(hostname) % num_partitions
    • this allows the robots.txt rules to be cached in the worker process itself, since a single partition does not contain too many different hostnames
    • this should also reduce the load on the DNS resolver
    • this approach is implemented by Nutch and StormCrawler
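A minimal sketch of the partitioning idea in point 2; the helper name and parameters are illustrative, not existing img2dataset options:

    import hashlib
    from collections import defaultdict
    from urllib.parse import urlparse

    def partition_by_host(urls, num_partitions):
        """Group URLs so that all URLs of one host land in the same partition."""
        partitions = defaultdict(list)
        for url in urls:
            host = urlparse(url).hostname or ""
            # Use a stable hash; Python's built-in hash() is salted per process.
            bucket = int(hashlib.sha1(host.encode("utf-8")).hexdigest(), 16) % num_partitions
            partitions[bucket].append(url)
        return partitions

Within one partition, a worker can then keep the parsed robots.txt of each host in a small in-process cache.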

(Assuming it's even possible to find the robots.txt location)

The robots.txt location is defined in the RFC (it is /robots.txt), including the expected behavior if the location is redirected, or in case of a response code other than HTTP 200.
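For example, deriving the robots.txt location from any image URL is straightforward (a sketch, not existing img2dataset code):

    from urllib.parse import urlparse

    def robots_txt_url(url: str) -> str:
        # Per RFC 9309, robots.txt lives at the path "/robots.txt" of the URL's origin.
        parts = urlparse(url)
        return f"{parts.scheme}://{parts.netloc}/robots.txt"

    print(robots_txt_url("https://example.com/images/photo.jpg?x=1"))
    # https://example.com/robots.txt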

rom1504 (Owner) commented Mar 2, 2023

2 is not possible for two reasons:

  • The data needs to be randomly shuffled for training reasons
  • The data needs to be randomly shuffled to distribute the load uniformly over urls/domains, to avoid imposing too high a load on them

1 may be possible but we'd need to find good ways to automatically deploy a KV solution while keeping the tool easy to use.
There are other reasons that would justify using such a KV store per domain (e.g. rate limiting per domain), so that might be something interesting to investigate.
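A rough sketch of what option 1 could look like with Redis; the key layout, TTL handling and error handling below are assumptions, not an existing img2dataset feature:

    import urllib.request
    from urllib import robotparser

    import redis  # assumes the redis-py package and a reachable Redis instance

    ROBOTS_TTL = 24 * 3600  # RFC 9309, section 2.4: don't reuse a cached copy for more than 24 hours
    r = redis.Redis()

    def get_robots(origin: str) -> robotparser.RobotFileParser:
        """Return the parsed robots.txt for an origin such as "https://example.com"."""
        key = f"robots:{origin}"
        body = r.get(key)
        if body is None:
            try:
                with urllib.request.urlopen(f"{origin}/robots.txt", timeout=10) as resp:
                    body = resp.read()
            except Exception:
                body = b""  # unreachable robots.txt treated as "allow all" in this sketch
            r.set(key, body, ex=ROBOTS_TTL)
        rp = robotparser.RobotFileParser()
        rp.parse(body.decode("utf-8", errors="replace").splitlines())
        return rp

This glosses over the redirect and HTTP status handling the RFC specifies, and it re-parses the cached bytes on every lookup; a real implementation would want to refine both.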

rom1504 reopened this Mar 2, 2023
rom1504 changed the title from "Robots.txt not being respected" to "Implement Robots.txt support" Apr 23, 2023
samjsharpe commented

Parsing robots.txt is a solved problem in Python; it's in the stdlib: https://docs.python.org/3/library/urllib.robotparser.html

Why not start doing it once per worker and iterate from that to a more efficient solution in an Agile fashion?
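A minimal per-worker version of that, caching one parsed robots.txt per origin in a plain dict; the names are illustrative, and the 24-hour cache limit from RFC 9309 is not handled here:

    from urllib import robotparser
    from urllib.parse import urlparse

    _robots_cache = {}  # origin -> RobotFileParser, one cache per worker process

    def allowed(url: str, agent: str = "img2dataset") -> bool:
        parts = urlparse(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        rp = _robots_cache.get(origin)
        if rp is None:
            rp = robotparser.RobotFileParser(f"{origin}/robots.txt")
            try:
                rp.read()  # fetches and parses the site's /robots.txt
            except Exception:
                rp.allow_all = True  # simplistic fallback when robots.txt cannot be fetched
            _robots_cache[origin] = rp
        return rp.can_fetch(agent, url)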

Repository owner deleted a comment from PaulWay Apr 23, 2023
rom1504 (Owner) commented Apr 23, 2023

Why not start doing it once per worker and iterate from that to a more efficient solution in an Agile fashion?

Feel free to give it a try


robrwo commented Dec 14, 2023

robots.txt is a file that should be used by crawlers: tools that discover urls (see https://developers.google.com/search/docs/advanced/robots/robots_txt or https://datatracker.ietf.org/doc/html/draft-koster-rep )

this tool is meant to be used after a crawler has been run, on the resulting validated urls

No. A lot of sites were indexed by Common Crawl before their index was used to train AIs. They have since opted out, but their pages live on in old copies of the index.

A website I maintain is regularly hit by img2dataset bots, even though the site now disallows and even blocks Common Crawl. The site sends the "noai" X-Robots-Tag but this is a waste of CPU and bandwidth. It makes more sense to add something to robots.txt so that these crawlers just stay away.
