
Implement Robots.txt support #48

Open
Scoppio opened this issue Sep 16, 2021 · 9 comments
Labels: enhancement (New feature or request)

Comments

Scoppio commented Sep 16, 2021

Scripts and software for automated scraping must follow robots.txt rules; otherwise they may make the user liable for unauthorised use of data.

rom1504 (Owner) commented Sep 16, 2021

robots.txt is a file that should be used by crawlers: tools that discover urls (see https://developers.google.com/search/docs/advanced/robots/robots_txt or https://datatracker.ietf.org/doc/html/draft-koster-rep )

this tool is meant to be used after a crawler has been run, on the resulting validated urls

rom1504 closed this as completed Sep 16, 2021
sebastian-nagel commented

Hi @rom1504,

is the robots.txt protocol really meant only for "tools that discover urls"?

  1. RFC 9309 is addressed to "automatic clients known as crawlers". I think we can agree that img2dataset is an "automatic client" (or uses one).
  2. robots.txt files in the wild often provide rulesets addressing image "crawlers", e.g. "Googlebot-Image", "Baiduspider-image", "YandexImages".

this tool is meant to be used after a crawler has been run, on the resulting validated urls

Does this mean that it's a requirement that the crawler collecting the links only keeps links that are not disallowed in the robots.txt of the target site? I'm not aware of any web datasets that do this and erase such links from the HTML captures. Also CCBot checks the site's robots.txt before accessing any HTML page on that site but does not remove links from WARC (and WAT) captures if the link would be disallowed by the target site's robots.txt.

In other words, there are several reasons why fetching a particular image might be disallowed by robots.txt, while fetching the HTML pages linking to the image was allowed:

  1. images may be disallowed by robots.txt while HTML pages are not, e.g., by rules such as

    Disallow: /media/
    Disallow: /images/
    Disallow: *.gif$
    
  2. the image link was found in an HTML page on another site (the robots.txt of the site where the image is hosted may disallow fetching the image)

  3. different user-agents are used when crawling the HTML and later when fetching the images

  4. time gap: the robots.txt may change between accessing the HTML page and the image
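For illustration, here is a minimal sketch of the first case above, using Python's standard urllib.robotparser with a made-up ruleset (note that the stdlib parser does not understand the wildcard form "*.gif$", so only the path-prefix rules are shown):

    from urllib import robotparser

    # Hypothetical ruleset, mirroring the path-prefix rules quoted in point 1.
    rules = [
        "User-agent: *",
        "Disallow: /media/",
        "Disallow: /images/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    # The HTML page is allowed, the image under /images/ is not.
    print(rp.can_fetch("img2dataset", "https://example.com/gallery.html"))      # True
    print(rp.can_fetch("img2dataset", "https://example.com/images/photo.jpg"))  # False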

Happy to discuss these points – maybe it's worth reopening this issue or opening a new one. Thanks!

(see also #249)

check "robots.txt", as it is out of scope, and should probably be implemented in a way that allows caching to improve performance, and avoid multiple calls to "/robots.txt" per website.

rom1504 (Owner) commented Mar 1, 2023

Makes sense.
This needs to be addressed in a previous filtering step before using img2dataset, then.

Feel free to implement such a tool.

rom1504 (Owner) commented Mar 1, 2023

Or can you think of a way this could be implemented efficiently for img2dataset's current architecture?
I can't see how this could be done without doing 2 calls for each image. (Assuming it's even possible to find the robots.txt location)

sebastian-nagel commented

in a previous filtering step before using img2dataset

I don't think it's practical, given the maximum cache duration required by RFC 9309, section 2.4: "Crawlers SHOULD NOT use the cached version for more than 24 hours".

Another point would be the requirement for crawlers to "set their own name" (user-agent product token) and send it along with the HTTP request, see section 2.2.1.
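As a purely hypothetical illustration of the product-token point, the downloader would send its token with every HTTP request; the token "img2dataset" and the use of the requests library below are assumptions for this sketch, not existing img2dataset behaviour:

    import requests  # assumes the requests library; the actual HTTP stack may differ

    # Hypothetical product token per RFC 9309, section 2.2.1.
    USER_AGENT = "img2dataset/1.0 (+https://github.com/rom1504/img2dataset)"

    resp = requests.get(
        "https://example.com/images/photo.jpg",
        headers={"User-Agent": USER_AGENT},
        timeout=10,
    )

The same product token ("img2dataset") is what a robots.txt group such as "User-agent: img2dataset" would be matched against.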

Or can you think of a way this could be implemented efficiently for img2dataset's current architecture?

I see two ways to go:

  1. a central robots.txt cache (Redis?) - somewhat similar in functionality to the recommended "caching" DNS resolver
  2. partition the input so that all URLs from a single host end up in a single partition (or very few partitions for hosts with many URLs)
    • e.g. by hash(hostname) % num_partitions
    • this allows the robots.txt rules to be cached in the worker process itself, since a single partition does not contain too many different hostnames
    • this should also reduce the load on the DNS resolver
    • this approach is implemented by Nutch and StormCrawler
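A minimal sketch of the partitioning idea in point 2; the helper name and parameters are illustrative, not existing img2dataset options:

    import hashlib
    from collections import defaultdict
    from urllib.parse import urlparse

    def partition_by_host(urls, num_partitions):
        """Group URLs so that all URLs of one host land in the same partition."""
        partitions = defaultdict(list)
        for url in urls:
            host = urlparse(url).hostname or ""
            # Use a stable hash; Python's built-in hash() is salted per process.
            bucket = int(hashlib.sha1(host.encode("utf-8")).hexdigest(), 16) % num_partitions
            partitions[bucket].append(url)
        return partitions

Within one partition, a worker can then keep the parsed robots.txt of each host in a small in-process cache.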

(Assuming it's even possible to find the robots.txt location)

The robots.txt location is defined in the RFC (it is /robots.txt), including the expected behavior if the location is redirected, or in case of a response code other than HTTP 200.
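For example, deriving the robots.txt location from any image URL is straightforward (a sketch, not existing img2dataset code):

    from urllib.parse import urlparse

    def robots_txt_url(url: str) -> str:
        # Per RFC 9309, robots.txt lives at the path "/robots.txt" of the URL's origin.
        parts = urlparse(url)
        return f"{parts.scheme}://{parts.netloc}/robots.txt"

    print(robots_txt_url("https://example.com/images/photo.jpg?x=1"))
    # https://example.com/robots.txt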

rom1504 (Owner) commented Mar 2, 2023

2 is not possible for two reasons:

  • The data needs to be randomly shuffled for training reasons
  • The data needs to be randomly shuffled to distribute the load uniformly over urls/domains, to avoid imposing too high a load on them

1 may be possible but we'd need to find good ways to automatically deploy a KV solution while keeping the tool easy to use.
There are other reasons that would justify using such a KV store per domain (e.g. rate limiting per domain), so that might be something interesting to investigate.
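A rough sketch of what option 1 could look like with Redis; the key layout, TTL handling and error handling below are assumptions, not an existing img2dataset feature:

    import urllib.request
    from urllib import robotparser

    import redis  # assumes the redis-py package and a reachable Redis instance

    ROBOTS_TTL = 24 * 3600  # RFC 9309, section 2.4: don't reuse a cached copy for more than 24 hours
    r = redis.Redis()

    def get_robots(origin: str) -> robotparser.RobotFileParser:
        """Return the parsed robots.txt for an origin such as "https://example.com"."""
        key = f"robots:{origin}"
        body = r.get(key)
        if body is None:
            try:
                with urllib.request.urlopen(f"{origin}/robots.txt", timeout=10) as resp:
                    body = resp.read()
            except Exception:
                body = b""  # unreachable robots.txt treated as "allow all" in this sketch
            r.set(key, body, ex=ROBOTS_TTL)
        rp = robotparser.RobotFileParser()
        rp.parse(body.decode("utf-8", errors="replace").splitlines())
        return rp

This glosses over the redirect and HTTP status handling the RFC specifies, and it re-parses the cached bytes on every lookup; a real implementation would want to refine both.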

rom1504 reopened this Mar 2, 2023
rom1504 changed the title from "Robots.txt not being respected" to "Implement Robots.txt support" Apr 23, 2023
samjsharpe commented

Parsing robots.txt is a solved problem in Python; it's in the stdlib: https://docs.python.org/3/library/urllib.robotparser.html

Why not start doing it once per worker and iterate from that to a more efficient solution in an Agile fashion?
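A minimal per-worker version of that, caching one parsed robots.txt per origin in a plain dict; the names are illustrative, and the 24-hour cache limit from RFC 9309 is not handled here:

    from urllib import robotparser
    from urllib.parse import urlparse

    _robots_cache = {}  # origin -> RobotFileParser, one cache per worker process

    def allowed(url: str, agent: str = "img2dataset") -> bool:
        parts = urlparse(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        rp = _robots_cache.get(origin)
        if rp is None:
            rp = robotparser.RobotFileParser(f"{origin}/robots.txt")
            try:
                rp.read()  # fetches and parses the site's /robots.txt
            except Exception:
                rp.allow_all = True  # simplistic fallback when robots.txt cannot be fetched
            _robots_cache[origin] = rp
        return rp.can_fetch(agent, url)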

Repository owner deleted a comment from PaulWay Apr 23, 2023
rom1504 (Owner) commented Apr 23, 2023

Why not start doing it once per worker and iterate from that to a more efficient solution in an Agile fashion?

Feel free to give it a try


robrwo commented Dec 14, 2023

robots.txt is a file that should be used by crawlers: tools that discover urls (see https://developers.google.com/search/docs/advanced/robots/robots_txt or https://datatracker.ietf.org/doc/html/draft-koster-rep )

this tool is meant to be used after a crawler has been run, on the resulting validated urls

No. A lot of sites were indexed by Common Crawl before their index was used to train AIs. They have since opted out, but their pages live on in old copies of the index.

A website I maintain is regularly hit by img2dataset bots, even though the site now disallows and even blocks Common Crawl. The site sends the "noai" X-Robots-Tag but this is a waste of CPU and bandwidth. It makes more sense to add something to robots.txt so that these crawlers just stay away.
