Read and cache robots.txt files for each host using thread-local storage #302
Conversation
Pretty short code, looks good! Could you:
force-pushed from bc26f88 to b1cf7b7
Benchmarking is not really feasible from my connection given the tests connect to real hosts; if you have a synthetic test case I can run that. I did see a few things that could be improved while experimenting: urllib.robotparser doesn't use a timeout in read(), which caused it to wait longer than the 10-second timeout used for downloading images. I also added some code around responses to align more closely with RFC 9309, which should help when a single host is referenced multiple times in a thread. Fixed the lint issues, and tests are running in approximately the same time as main.
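Roughly, the timeout part amounts to something like this (a minimal sketch, not the exact diff; the class name and 10-second default are illustrative, and error handling is left out here):

import urllib.request
from urllib.robotparser import RobotFileParser


class TimeoutRobotFileParser(RobotFileParser):
    """RobotFileParser whose read() passes a socket timeout to urlopen."""

    def __init__(self, url="", timeout=10.0):
        super().__init__(url)
        self.timeout = timeout

    def read(self):
        # The stdlib read() calls urlopen() without a timeout, so a slow host can
        # block the worker thread far longer than the image download timeout.
        f = urllib.request.urlopen(self.url, timeout=self.timeout)
        raw = f.read()
        self.parse(raw.decode("utf-8").splitlines())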
So I tried to run this, and I found some examples of URLs that are considered blocked by robots.txt but shouldn't be:
I see your code is trying to handle these cases already, but it seems something is not working. Could you add some tests to make sure we handle those properly? (You can follow the example of the server serving synthetic examples in the other tests.)
    raw = f.read()
    self.parse(raw.decode("utf-8").splitlines())
except urllib.error.HTTPError as err:
    if err.code in (401, 403) or 500 <= err.code <= 599:
According to section 2.3.1.3 of https://datatracker.ietf.org/doc/rfc9309/, 401 and 403 are considered "Unavailable" and hence should result in allow all.
This was based on the CPython implementation of read(), which treats authentication errors as unreachable statuses. I forgot to comment that behaviour, so I have added it now.
        self.allow_all = True
except Exception:  # pylint: disable=broad-except
    # treat other errors as meaning the server is unreachable (timeout, ssl error, dns failure)
    self.disallow_all = True
This should be measured, and probably not cached; it is not clear that one failure means the server is fully down.
Most of the errors that lead to this block are network errors (name resolution failures, timeouts). There are definitely improvements that could be made for network errors, but for this initial implementation I've just gone with the required behaviour (MUST) described in 2.3.1.4.
Another option could be to let these exceptions propagate up to be caught by the download_image try/except block; this would skip requesting the current image and try fetching robots.txt again the next time the host is encountered in the thread.
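For reference, a strict reading of sections 2.3.1.2 to 2.3.1.4 maps onto urllib.robotparser roughly like this (a sketch, not the exact code in this PR; the function name is illustrative):

import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser


def fetch_robots(robots_url, timeout=10.0):
    """Return a RobotFileParser populated per RFC 9309 sections 2.3.1.2-2.3.1.4."""
    parser = RobotFileParser(robots_url)
    try:
        response = urllib.request.urlopen(robots_url, timeout=timeout)
    except urllib.error.HTTPError as err:
        if 400 <= err.code < 500:
            # 2.3.1.3 "Unavailable" (includes 401 and 403): crawlers MAY access any resource
            parser.allow_all = True
        else:
            # 5xx treated as 2.3.1.4 "Unreachable": assume complete disallow
            parser.disallow_all = True
    except OSError:
        # URLError, timeouts, ssl and dns failures all derive from OSError:
        # the file is unreachable, so 2.3.1.4 says MUST assume complete disallow
        parser.disallow_all = True
    else:
        parser.parse(response.read().decode("utf-8").splitlines())
    return parser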
"""Returns True if the given user agent is allowed to fetch this url based on the hosts robots.txt""" | ||
robots_txt_cache = thread_context.robots_txt_cache | ||
|
||
robots_url = urllib.parse.urljoin(url, "/robots.txt") |
This does not seem correct. Section 2.3 of https://datatracker.ietf.org/doc/rfc9309/ says:

The rules MUST be accessible in a file named "/robots.txt" (all lowercase) in the top-level path of the service.

url is not the top-level path of the service; it would be necessary to keep only the domain name.
"/robots.txt" is a root-relative URL. urljoin with a root relative URL as the second argument uses the host, scheme, and port from the first argument to make a complete URL.
Ok yeah, I see. Fetching robots.txt at the root plus correctly handling 401 and 403 seems to find some true positives.
force-pushed from b1cf7b7 to f685f9c
http://kimkelly.smugmug.com/robots.txt has a default policy of … I've already added a robots.txt response to the test server with a default policy of …
@@ -22,5 +22,10 @@ async def get():
    return "hi"


@app.get("/robots.txt")
Won’t this change break test coverage for x-robots-header parsing by causing the downloader to fail early? Maybe make a different 3rd mount point that’s disallowed by robots.txt and not by HTTP headers, and add a separate test case using that?
It would've, but that test doesn't actually check whether the disallowed paths/files are skipped, or at what point in the process they were disallowed. The FastAPI server was returning a JSON response for robots.txt, causing it to be ignored (the content type was meant to be text/plain).
The test case still doesn't validate which disallow mechanism was applied, but it will at least fail if one isn't working now.
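For reference, a route that serves robots.txt as text/plain looks roughly like this (the rule shown is illustrative, not the exact one in the test suite):

from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

app = FastAPI()

@app.get("/robots.txt", response_class=PlainTextResponse)
async def robots_txt():
    # PlainTextResponse sets Content-Type: text/plain; returning a bare string
    # from a normal route would be serialised as JSON and ignored by the parser
    return "User-agent: *\nDisallow: /disallowed/\n"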
Would be better to share between threads to minimise requests to the same host, but that'll require synchronising read() calls and object initialisation.
force-pushed from f685f9c to da369f6
Hi, if you want to get this merged sooner, these tasks are necessary:
Of course you have no obligation to work on any of that. I'll get around to it eventually. Just wanted to let you know what blocks merging, in my opinion.
Would be better to share between threads to minimise requests to the same host, but that'll require synchronising read() calls and object initialisation.
This also doesn't respect the request-rate or crawl-delay directives, as that requires communicating the last access time per host across the pool. That still needs to be done to properly address #48.
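Purely as a sketch of what that could look like (not part of this PR; the class name is illustrative), a lock-protected map of host to last access time:

import threading
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Shared tracker of last access per host; sleeps to honour a crawl delay."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_access = {}  # host -> time.monotonic() of the last request

    def wait(self, url, delay):
        host = urlsplit(url).netloc
        while True:
            with self._lock:
                now = time.monotonic()
                last = self._last_access.get(host)
                if last is None or now - last >= delay:
                    self._last_access[host] = now
                    return
                remaining = delay - (now - last)
            # sleep outside the lock so requests to other hosts are not blocked
            time.sleep(remaining)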
This should be merged after #300 to allow hosts to set a policy for this tool. urllib.robotparser expects a simple useragent string and splits on the first "/". The current useragent will cause it to only apply the catch-all policy or a policy intended for browsers (User-Agent: Mozilla).
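For example, with a robots.txt containing a Mozilla group and a catch-all group ("mytool" below is just an illustrative product token, not the real useragent):

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""
User-agent: Mozilla
Disallow: /

User-agent: *
Allow: /
""".splitlines())

# The full useragent is reduced to the token before the first "/", so it
# matches the Mozilla group and everything is disallowed:
parser.can_fetch("Mozilla/5.0 (X11; Linux x86_64) mytool/1.0", "/image.jpg")  # False

# A plain product token falls through to the catch-all group instead:
parser.can_fetch("mytool", "/image.jpg")  # True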