umbrella ticket to resolve iteration / read size / chunked encoding questions #844
Comments
👍 for |
@slingamn thanks so much for stepping up with this ;) |
I'm not sure using 1024 as the default size for iter_lines is a good idea. A knowledgeable developer who wants better performance could still decide to use a larger read size, knowing that it could mean the read process will hang until the specified amount is read or the stream is closed, and that some lines could come in late. I love That said, +1 for |
I think I'm missing something but why must |
@shazow Because the synchronous network read call will not return until the specified number of bytes has been read or the stream is closed. Thus, if you use a 1K read size and the server returns lines averaging 100 bytes, you will have to wait for the server to return at least 11 lines before iter_lines can yield anything. |
Ah I see what you mean. For some reason I was thinking in local IO terms. Thanks for the clarification! I wonder if there is any way to do it timeout-based, such that it will read as much as it can in N ms and process, loop. |
@mponton these are good points. Does anyone know what browsers typically do when reading streaming text data without chunked encoding? |
I did a quick test out of curiosity. It seems to depend on the browser. In any case, I'm not sure we should compare and try to emulate a browser... We're "stuck" in synchronous I/O land, browsers aren't. We should implement what makes the most sense for Requests. |
I was poking around for something unrelated and there's quite an interesting implementation of a It doesn't respect unicode or universal newlines, so we probably can't use it out of the box, but it looks like they put quite a bit of thought into it, so maybe some of the ideas are applicable. |
It seems most of the discussion is to do with how iter_lines should work. Is there any reason for iter_content not to have a larger default chunk size? The performance hit was really hard for the 1 byte size before I figured out what was going on. |
There's no reason (at least, not for iter_content). To be honest, I think the typical use case should just be to read the entire body with the content attribute. |
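As a concrete illustration of the two patterns being contrasted here — streaming with an explicit chunk size versus just reading the whole body — a minimal sketch; the URL, filename, and the 64 KiB value are placeholders, not recommendations:

```python
import requests

# Streaming: pick a chunk_size explicitly instead of relying on the default.
r = requests.get("http://example.com/big-file", stream=True)
with open("big-file", "wb") as f:
    for chunk in r.iter_content(chunk_size=64 * 1024):
        if chunk:  # skip keep-alive chunks
            f.write(chunk)

# Typical case: read the entire body at once via the content attribute.
r = requests.get("http://example.com/page")
body = r.content
```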
I had changed it back to 10K (the original value) in #589, but the travisbot failed the build because one of the unit tests' expected results was not the same with a chunk size > 1 (see https://github.com/kennethreitz/requests/pull/589#issuecomment-5538019 for details). Kenneth proceeded to close the issue saying |
@slingamn Could you elaborate on point 3 from your initial list? |
+1 for improving the default performance of iter_content, or at least making sure the Response.content property calls iter_content with a larger chunk size. I think there's no reason to read in such small chunks there... Edit: sorry, I see this has already been fixed in the latest release. Thanks for that! |
@slingamn pointed out that there are a few issues here that are still unresolved. Sometime today I'll go through and work out which ones still haven't been done. |
Current status as of v1.1.0:
|
My view:
|
3 is an interesting problem. There is probably a minimum length you might want to consider in those cases, but I'm sure you won't be able to please everyone. Since UTF-8 is on the rise, we could probably use that as a way of deciding on a default length. |
I would like to jump in because it appears that a bit of the above discussion is based upon a false premise: that a synchronous network receive of n bytes will block until at least n bytes are received (or, presumably, the socket has closed). This is not, in fact, the case, and Unix network programming would be a shambles if it were — think of the disaster that every network program would face: applications would have to make the horrible decision to either read data byte-by-byte, or block indefinitely if they overestimated, even very slightly, the amount of data about to arrive on a socket. Network programmers would all stand impaled upon the horns of a dilemma. Everyone might use Windows for network programming instead. :)

But, fortunately, the plain normal vanilla blocking synchronous version of the recv() call returns as soon as any data at all has arrived: you ask for up to n bytes and you get back whatever is currently available, even if that is a single byte.

It is true that, for those rare exceptional cases where you really want to stay blocked because you really know that you need n bytes before you can do anything useful, there exists a POSIX flag, MSG_WAITALL, that asks recv() to keep waiting until the full count has arrived.

So what is the problem here, you ask? Well, @mponton actually lets the cat out of the bag without knowing it! Look carefully at this phrase from his reply to @shazow: "…the synchronous network read call…"

"Read" call? What? Who would do a read() on a socket instead of a recv()?

Why, the author of httplib, that's who!

Yes, that's right. Instead of simply sitting in a tidy standard recv() loop, httplib wraps the socket in a file-like object and does file-style reads on it — and a buffered file read really does keep going until it has the number of bytes you asked for, or hits end-of-file. They gain a tiny bit of convenience — and maybe, way back when it was written, C-level performance? — by having a Python file-like object watch for the end-of-line character for them. But they completely disabled the ability to stream live data from the network by making this choice, probably because they were operating in an era when people read and wrote network payloads whole anyway.

I recommend that Requests move off of httplib's file-based reading. |
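To make the distinction above concrete, a small illustrative sketch — the host, port, and byte counts are placeholders, and MSG_WAITALL is shown commented out so nothing sits blocked:

```python
import socket

sock = socket.create_connection(("example.com", 80))
sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")

# Plain blocking recv(): returns as soon as *any* data has arrived, even if
# that is far fewer than the 65536 bytes we asked for.
first = sock.recv(65536)
print("got %d bytes from a single recv()" % len(first))

# For the rare case where you really do want to wait for a full count, POSIX
# provides MSG_WAITALL:
# rest = sock.recv(4096, socket.MSG_WAITALL)

# File-style reads via makefile() are a different story: read(n) keeps going
# until it has n bytes (or EOF), and readline() until it sees a newline --
# which is what stalls line-by-line streaming through a file wrapper.
fp = sock.makefile("rb")
line = fp.readline()  # next line if any data is left, or b"" once the server closes

sock.close()
```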
I need to go have dinner, but another quick note that I'll expand on later: there's also no reason that |
I doubt we're prepared to ditch httplib (and thus urllib3) over |
@brandon-rhodes thanks very much, this was extremely illuminating. |
urllib3 would be very very happy to ditch httplib (urllib3/urllib3#58). I'm thinking of trying to find some corporate sponsors for making this happen and diving in (alongside some other high-demand issues). |
@shazow: I'm all for that! Let me know if you get off the ground with it and I'll do my best to help out. |
Some quick thoughts:
|
@shazow I guess since @slingamn On that previous If everyone agrees this is an acceptable solution, I can make a pull-request in the next few days for it. |
Agreed that ideally |
We're handling request errors in our applications in this way:
This seems to work in all cases except when an IncompleteRead is raised by httplib (see stacktrace below). Shouldn't IncompleteRead be caught by requests and converted into a RequestException?

File "singleticketprocess.py", line 257, in _post_data |
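The original snippet isn't preserved above, but based on the description (and the later comment about having to catch both exception types), the pattern presumably looks something like this sketch — the URL, payload, timeout, and handle_error helper are placeholders:

```python
import requests

try:
    import httplib                  # Python 2, as used in this thread
except ImportError:
    import http.client as httplib  # same module under its Python 3 name


def handle_error(exc):
    # Placeholder: a real application would log, retry, or re-raise here.
    print("request failed: %r" % (exc,))


def post_data(url, payload):
    try:
        return requests.post(url, data=payload, timeout=30)
    except requests.exceptions.RequestException as exc:
        # Everything requests wraps itself lands here...
        handle_error(exc)
    except httplib.IncompleteRead as exc:
        # ...but IncompleteRead currently escapes un-wrapped, so it has to be
        # caught separately -- which is the complaint in this thread.
        handle_error(exc)
```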
I'm getting the |
I'm pretty sure you should only get |
@Lukasa not unless it's the default. Try this:
|
Yeah, the server is sending chunked data:

```
>>> import requests
>>> r = requests.get('http://www.stkierans.org/', stream=True)
>>> r.status_code
200
>>> r.headers['transfer-encoding']
'chunked'
```

I haven't had time to dig into httplib right now, and I'm about to go to work, but my guess is that this gets raised if the web server doesn't send the mandatory empty chunk at the end. |
Yep, so the webserver is at fault here. It's specifying chunked encoding, but sending all of its data back in one chunk. I suggest you contact the administrator of the website and ask why they're doing crazy stuff. =) |
For what it's worth, you don't really want their content anyway. They seem to be doing some user-agent sniffing and are returning a placeholder page telling you that your browser doesn't support frames, and that you should upgrade. Frankly, I think that entire website might need an upgrade. =) |
@akavlie That page is just using frames to hide this page: 'http://www.catholicweb.com/splash/stkierens/'. I'd suggest that you just target that page in Requests, but when I do it I get a 'Connection Reset By Peer'. Which is obnoxious. |
Ok, so, they're doing user-agent sniffing on that page too, and when they find that we're not a browser they know, they just kill the TCP connection instead of sending a 4XX error. Which is even more obnoxious. You can get your data by using:

```
import requests
r = requests.get('http://www.catholicweb.com/splash/stkierens/', headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.47'})
```

Pretending to be Chrome does it just fine. =) |
@Lukasa Thanks for all the digging... I've seen lots of pointless iframes and other horrible practices on various church sites, so this does not surprise me. I'm targeting a lot of sites with requests in this application, and specifying overrides for bad behavior like this isn't very realistic at this point. Is it reasonable to expect Requests to catch and wrap an exception like this? It looked to me like Requests itself had a bug. |
I'll investigate it, but it might be that there are a few cases where this can be thrown. We should be able to wrap it though, either here or in urllib3. =) |
@Lukasa It would indeed be great if IncompleteRead was wrapped. Currently, when calling requests.get or requests.post, I need to catch both requests.exceptions.RequestException and httplib.IncompleteRead, which does not make sense. IncompleteRead should be turned into a RequestException or its subclass. |
I think urllib3/urllib3#190 might possibly be related. I've put an idea in there of how this could be approached. This ticket is quite a big read, so apologies if this is no longer relevant. |
Are there any updates on this? Can we close this out since there hasn't been much recent activity on it? |
Closed because no-one has said anything in literally more than a year. |
This ticket is intended to aggregate previous discussion from #539, #589, and #597 about the default value of `chunk_size` used by `iter_content` and `iter_lines`.

cc @mponton @gwrtheyrn @shazow

Issues:

1. The default `chunk_size` for `iter_content` is 1 byte; this is probably inefficient.
2. We can't rely on the server's chunked encoding to delimit reads for `iter_content` anyway; not all websites are standards-compliant and when this was tried it caused more problems than it solved.
3. The default `chunk_size` for `iter_lines` is 10kB. This is high enough that iteration over lines can be perceived as unresponsive --- no lines are returned until all 10kB have been read.
4. There is no single "correct" read size for `iter_lines`; using blocking I/O, we just have to bite the bullet and take a guess as to how much data we should read.
5. Spurious blank lines can be returned by `iter_lines`, I think because of the edge case where a read ends between a `\r` and a `\n`.
6. `iter_lines` is backed by `iter_content`, which operates on raw byte strings and splits at byte boundaries. I think there may be edge cases where we could split the body in the middle of a multi-byte encoding of a Unicode character.

My guess at a solution:

1. Change the default `chunk_size` to 1024 bytes, for both `iter_content` and `iter_lines`.
2. Add a new method (`iter_chunks`) for iterating over chunks of pages that are known to correctly implement chunked encoding, e.g., Twitter's firehose APIs.
3. Implement a version of `splitlines` that is deterministic with respect to our chunking boundaries, i.e., remembers if the last-read character was `\r` and suppresses a subsequent `\n`. We may also need to build in Unicode awareness at this level, i.e., decode as much of the body as is valid, then save any leftover invalid bytes to be prepended to the next chunk.

Comments and thoughts are much appreciated. Thanks for your time!
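A rough sketch of what solution 3 could look like — the function name, the 1024-byte default, and the decision to strip line endings are all illustrative, not the actual implementation. The idea is to hold any unterminated tail (including a lone trailing `\r`) in a pending buffer, so that a `\n` arriving at the start of the next chunk never produces a spurious empty line:

```python
def iter_lines_chunked(iter_content, chunk_size=1024):
    """Sketch: yield lines from an iter_content-style byte iterator without
    emitting a bogus empty line when a chunk boundary splits a CRLF pair."""
    pending = b""
    for chunk in iter_content(chunk_size):
        # If the previous chunk ended in '\r', a leading '\n' here is the
        # second half of that CRLF, not a new (empty) line.
        if pending.endswith(b"\r") and chunk.startswith(b"\n"):
            chunk = chunk[1:]
        pending += chunk
        lines = pending.splitlines(True)  # keep line endings
        # Hold back the last piece if it is unterminated or ends in a bare
        # '\r' (we need to see whether a '\n' follows in the next chunk).
        if lines and not lines[-1].endswith(b"\n"):
            pending = lines.pop()
        else:
            pending = b""
        for line in lines:
            yield line.rstrip(b"\r\n")
    if pending:
        yield pending.rstrip(b"\r\n")

# Hypothetical usage against a streaming response r:
#   for line in iter_lines_chunked(r.iter_content):
#       handle(line)
```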