
umbrella ticket to resolve iteration / read size / chunked encoding questions #844

Closed
slingamn opened this issue Sep 7, 2012 · 52 comments

@slingamn
Contributor

slingamn commented Sep 7, 2012

This ticket is intended to aggregate previous discussion from #539, #589, and #597 about the default value of chunk_size used by iter_content and iter_lines.

cc @mponton @gwrtheyrn @shazow

Issues:

  1. The default read size of iter_content is 1 byte; this is probably inefficient
  2. Requests does not expose the ability to read chunked encoding streams in the "correct" way, i.e., using the provided octet counts to tell how much to read.
  3. However, this would not be suitable as the default implementation of iter_content anyway; not all websites are standards-compliant and when this was tried it caused more problems than it solved.
  4. The current default read size for iter_lines is 10kB. This is high enough that iteration over lines can be perceived as unresponsive --- no lines are returned until all 10kB have been read.
  5. There is no "correct" way to implement iter_lines using blocking I/O, we just have to bite the bullet and take a guess as to how much data we should read.
  6. There's apparently some nondeterminism in iter_lines, I think because of the edge case where a read ends between a \r and a \n.
  7. iter_lines is backed by iter_content, which operates on raw byte strings and splits at byte boundaries. I think there may be edge cases where we could split the body in the middle of a multi-byte encoding of a Unicode character.

My guess at a solution:

  1. Set the default chunk_size to 1024 bytes, for both iter_content and iter_lines.
  2. Provide a separate interface (possibly iter_chunks) for iterating over chunks of pages that are known to correctly implement chunked encoding, e.g., Twitter's firehose APIs
  3. We may need our own implementation of splitlines that is deterministic with respect to our chunking boundaries, i.e., remembers if the last-read character was \r and suppresses a subsequent \n. We may also need to build in Unicode awareness at this level, i.e., decode as much of the body as is valid, then save any leftover invalid bytes to be prepended to the next chunk.
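To make point 3 concrete, here is a minimal sketch of such a splitter (the function name and structure are illustrative, not an existing requests API): it splits on \n, strips a trailing \r from each completed line so a \r\n pair split across two chunks is handled deterministically, and uses an incremental decoder so a multi-byte character split across chunks is never decoded half-way.

    import codecs

    def iter_decoded_lines(byte_chunks, encoding='utf-8'):
        """Yield text lines from an iterable of byte chunks, deterministically.

        The incremental decoder buffers any incomplete multi-byte sequence at
        the end of a chunk, and the unterminated tail of the text is carried
        over, so the output does not depend on where chunk boundaries fall.
        """
        decoder = codecs.getincrementaldecoder(encoding)(errors='replace')
        tail = ''
        for chunk in byte_chunks:
            text = tail + decoder.decode(chunk)
            lines = text.split('\n')
            tail = lines.pop()              # last piece has no newline yet
            for line in lines:
                yield line.rstrip('\r')     # covers '\r\n' split between chunks
        tail += decoder.decode(b'', final=True)
        if tail:
            yield tail

Fed from something like r.iter_content(chunk_size=1024), this would yield the same lines no matter how the 1024-byte reads happen to slice the body.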

Comments and thoughts are much appreciated. Thanks for your time!

@kennethreitz
Contributor

👍 for iter_chunks

@kennethreitz
Contributor

@slingamn thanks so much for stepping up with this ;)

@mponton

mponton commented Sep 7, 2012

I'm not sure using 1024 as the default size for iter_lines is appropriate, I think it should be 1 byte. Don't get me wrong, I KNOW how bad this is for performance. But I stick with my original reasoning: If someone iterates on iter_lines, lines should be processed as soon as they are available and there should be no chance of "hanging" with complete lines in the buffer. The interface is called iter_lines so it should do just that by default.

A knowledgeable developer that wants better performance could still decide to use a larger read size knowing that it could mean that the read process will hang until the specified amount is read or the stream closed and that some lines could come in late.

I love requests because it is fun and easy to use first, yet very powerful. This is why I think iter_lines should behave as expected even if it means poor performance. This decision should simply be documented. Newcomers will appreciate this (i.e. works as expected out of the box) while more experienced devs can still fiddle with the buffer size.

That said, iter_content should definitely have a larger default read size. The previous one was 10K if I remember correctly. The newly proposed one is 1K. I'm not sure which one makes a better default.

+1 for iter_chunks.
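To make the opt-in described a couple of paragraphs up concrete, a caller trading latency for throughput would pass the read size explicitly (the URL is a placeholder, and this uses the stream=True spelling that appears later in this thread; iter_lines already accepts a chunk_size argument):

    import requests

    # stream=True defers reading the body so that iteration drives the reads.
    r = requests.get('https://example.com/stream', stream=True)

    # Larger reads mean fewer syscalls, but a line may sit in the buffer until
    # chunk_size bytes have been read or the server closes the connection.
    for line in r.iter_lines(chunk_size=8192):
        print(line)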

@shazow
Contributor

shazow commented Sep 7, 2012

I think I'm missing something but why must iter_lines hang when it has complete lines in the buffer? Why can't we scan for completed lines (\n or such) and yield it just the same even if we've read less than the default size?

@mponton

mponton commented Sep 7, 2012

@shazow Because the synchronous network read call will not return until the specified amount of bytes has been read or the stream is closed. Thus, if you use a 1K read size and the server returns lines averaging 100 bytes, you will have to wait for the server to return at least 11 lines before iter_lines returns any lines and your Python code is executed. If one line is sent every minute, it will take 11 minutes before your code is even aware of the first line sent by the server.

@shazow
Contributor

shazow commented Sep 7, 2012

Ah I see what you mean. For some reason I was thinking in local IO terms. Thanks for the clarification!

I wonder if there is any way to do it timeout-based, such that it will read as much as it can in N ms and process, loop.
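A rough sketch of that timeout-driven idea, assuming direct access to the underlying blocking socket (which requests does not currently expose):

    import select

    def read_for(sock, window=0.05, max_bytes=65536):
        """Collect whatever arrives on `sock` within roughly `window` seconds."""
        pieces = []
        readable, _, _ = select.select([sock], [], [], window)
        while readable:
            data = sock.recv(max_bytes)
            if not data:                      # peer closed the connection
                break
            pieces.append(data)
            # Poll again with a zero timeout to drain anything already queued.
            readable, _, _ = select.select([sock], [], [], 0)
        return b''.join(pieces)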

@slingamn
Contributor Author

slingamn commented Sep 8, 2012

@mponton these are good points.

Does anyone know what browsers typically do when reading streaming text data without chunked encoding?

@mponton

mponton commented Sep 8, 2012

I did a quick test out of curiosity. It seems to depend on the Content-Type. For text/ascii, each line is printed as it comes in without delay (Safari 6). If content type is unspecified or HTML, the browser waits for the server to close the connection before rendering the content.

In any case, I'm not sure we should compare and try to emulate a browser... We're "stuck" in synchronous I/O land, browsers aren't. We should implement what makes the most sense for requests.

@slingamn
Contributor Author

I was poking around for something unrelated and there's quite an interesting implementation of a readline method inside the socket module. I haven't read it in detail yet, but it looks as though it reads byte-at-a-time in some cases and 8kB at a time in others.

It doesn't respect unicode or universal newlines, so we probably can't use it out of the box, but it looks like they put quite a bit of thought into it, so maybe some of the ideas are applicable.
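The general shape of that stdlib approach, heavily simplified (this is a sketch of the pattern, not the actual socket.py code): recv in large pieces into a buffer and only hand back lines once a newline is present.

    def iter_socket_lines(sock, bufsize=8192):
        """Sketch of buffered line iteration over a raw socket."""
        buf = b''
        while True:
            data = sock.recv(bufsize)    # returns as soon as *any* data arrives
            if not data:                 # connection closed by the peer
                if buf:
                    yield buf            # emit any unterminated final line
                return
            buf += data
            while b'\n' in buf:
                line, _, buf = buf.partition(b'\n')
                yield line + b'\n'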

@gazpachoking
Contributor

It seems most of the discussion is to do with how iter_lines should work. Is there any reason for iter_content not to have a larger default chunk size? The performance hit from the 1-byte size was really harsh before I figured out what was going on.

@slingamn
Contributor Author

slingamn commented Oct 4, 2012

There's no reason (at least, for iter_content's own default, as opposed to the default that iter_lines uses it with). I've been sort of reluctant to just dive in and try to change it, lest we trigger latent bugs (which is what happened the last time I changed a default).

To be honest, I think the typical use case should just be to read the entire body with the .content property. iter_content strikes me as a difficult interface to use correctly.

@mponton

mponton commented Oct 4, 2012

I had changed back to 10K (original value) in #589 but the travisbot failed the build because one of the unit tests' expected result was not the same with a chunk size > 1 (see https://github.com/kennethreitz/requests/pull/589#issuecomment-5538019 for details).

Kenneth proceeded to close the issue saying iter_content now had a proper size. At the time I thought he had made the changes on his side and would merge them. Apparently this fell through the cracks or something...

@piotr-dobrogost

@slingamn Could you elaborate on point 3 from your initial list?

@jonathanslenders

+1 for improving the default performance of iter_content, or at least, in the property Request.content, make sure to call iter_content with a larger chunk size. I think there's no reason for reading in chunks over there...

Edit: sorry. I see this has already been fixed in the latest release. Thanks for that!

@Lukasa Lukasa closed this as completed Jan 21, 2013
@Lukasa Lukasa reopened this Jan 23, 2013
@Lukasa
Member

Lukasa commented Jan 23, 2013

@slingamn pointed out that there are a few issues here that are still unresolved. Sometime today I'll go through and work out which ones still haven't been done.

@slingamn
Contributor Author

Current status as of v1.1.0:

  1. the iter_content read size is still at 1 byte
  2. the iter_lines read size has been reduced to 512 bytes (which should help responsiveness)
  3. multibyte characters and newlines can still overlap with read boundaries
  4. there's an implementation of chunked encoding for outgoing requests, but I'm not sure if there is one for incoming requests (the documentation says so, but I can't find the implementation)

@Lukasa
Member

Lukasa commented Jan 23, 2013

My view:

  1. Yeah, probably. Something small-ish though, as we generally expect users who need iter_content to specify their own values.
  2. I'd call that fixed. =D
  3. This has all the hallmarks of being an interesting problem. =) If no-one else does I'll probably take a look at seeing if we can fix this.
  4. Again, interesting problem for someone to work on.

@sigmavirus24
Contributor

3 is an interesting problem. There is probably a minimum length you might want to consider in those cases, but I'm sure you won't be able to please everyone. Since UTF-8 is on the rise, we could probably use that as a way of deciding on a default length.

@kennethreitz
Contributor

@slingamn #4's a documentation problem and needs to be corrected.

@brandon-rhodes

I would like to jump in because it appears that a bit of the above discussion is based upon a false premise: that a synchronous network receive of n bytes will block until at least n bytes are received (or, presumably, the socket has closed).

This is not, in fact, the case, and Unix network programming would be a shambles if it were — think of the disaster that every network program would face: applications would have to make the horrible decision to either read data byte-by-byte, or block indefinitely if they overestimated, even very slightly, the amount of data about to arrive on a socket. Network programmers would all stand impaled upon the horns of a dilemma. Everyone might use Windows for network programming instead. :)

But, fortunately, the plain normal vanilla blocking synchronous version of the recv() call is merciful to us: it only blocks waiting as long as no data at all is ready to be consumed! As soon as even a single byte of data arrives in an incoming packet, your beefy call to recv(1048576) will return that single byte and your program is off and running again. There is no penalty for giving recv() permission to return lots of data if a lot of data arrives in a single packet, or arrives in several packets while the operating system is still getting around to scheduling your thread again.

It is true that, for those rare exceptional cases where you really want to stay blocked because you really know that you need n bytes before you can do anything useful, there exists a POSIX flag MSG_WAITALL that you can pass to recv() so that your program really does block until lots of data arrives. But that is a very rare flag to specify, for all of these reasons that we have put on the table here.

So what is the problem here, you ask?

Well, @mponton actually lets the cat out of the bag without knowing it! Look carefully at this phrase from his reply to @shazow:

“…the synchronous network read call…”

“Read” call? What? Who would do a read() call on a socket? While POSIX does allow read() and write() on sockets for those extremely rare cases where you pass a socket to a library that is only designed for talking to files, it's not something you would ever do in a network program—you would lock yourself up waiting for n bytes to arrive even if less than n bytes were available immediately! Who would do that?

Why, the author of httplib, of course!

Yes, that's right. Instead of simply sitting in a tidy standard recv() loop and slowly filling a buffer until an end-of-line is visible, the author of httplib wraps the socket in a fake file-like object with makefile() and completely abandons all of the benefits of network programming under Unix!

They gain a tiny bit of convenience — and maybe, way back when it was written, C-level performance? — by having a Python file-like object watch for the end-of-line character for them. But they completely disabled the ability to stream live data from the network by making this choice, probably because they were operating in an era when people read and wrote network payloads whole anyway.

I recommend that Requests move off of httplib. It does very, very little for you; it could be re-implemented much more simply with recv() and send() in a day or two, given the simplicity of HTTP. There is no reason why Python or Requests shouldn't have line-based or content-based iteration that returns the moment that enough data has arrived to satisfy the caller; you just need to move off of a broken implementation that wantonly imposes the semantics of the read() call on what is really a socket capable of doing recv() if you'll just ask nicely!
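The difference in miniature (host and request line are placeholders; the read() call is left commented out because it would block):

    import socket

    sock = socket.create_connection(('example.com', 80))
    sock.sendall(b'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n')

    # recv() returns as soon as any bytes at all are available, up to the limit:
    first = sock.recv(1048576)       # does not wait for a full megabyte

    # The makefile() wrapper that httplib uses has read() semantics instead:
    f = sock.makefile('rb')
    # f.read(1048576)                # would block until 1 MiB arrived or EOF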

@brandon-rhodes

I need to go have dinner, but another quick note that I'll expand on later: there's also no reason that readline() should find UTF-8 a challenge, because the pair of bytes \r\n look exactly the same in UTF-8 — and since every byte of a multi-byte character in UTF-8 has the high bit set, no multi-byte character or sequence of them could accidentally have \r\n sitting inside. So you just need readline() to call recv() getting raw bytes until you see the sequence \r\n, and then pass the sequence of bytes — that you know already is a single line — to the decoder to be turned into a string.
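In other words (a tiny self-contained illustration, not requests code): accumulate raw bytes, split on b'\r\n', and decode each complete line afterwards. The split can never land inside a multi-byte character, because UTF-8 continuation bytes all have the high bit set.

    payload = 'naïve\r\ncafé\r\n'.encode('utf-8')   # pretend this arrived via recv()

    for raw_line in payload.split(b'\r\n'):
        if raw_line:
            print(raw_line.decode('utf-8'))         # decode only once the line is whole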

@Lukasa
Member

Lukasa commented Feb 11, 2013

I doubt we're prepared to ditch httplib (and thus urllib3) over iter_lines().

@slingamn
Contributor Author

@brandon-rhodes thanks very much, this was extremely illuminating.

@shazow
Contributor

shazow commented Feb 12, 2013

urllib3 would be very very happy to ditch httplib. urllib3/urllib3#58

I'm thinking of trying to find some corporate sponsors for making this happen and diving in (alongside some other high-demand issues).

@Lukasa
Member

Lukasa commented Feb 12, 2013

@shazow: I'm all for that! Let me know if you get off the ground with it and I'll do my best to help out.

@shazow
Contributor

shazow commented Feb 12, 2013

Some quick thoughts:

  • +1 on just using the unofficial fp member. If httplib changes in some Py4, we can adapt (as we will have to anyways). Or bonus points if we drop httplib altogether.
  • You can already get access to httplib's fp via urllib3. I don't think we need to make it any more "official" than that. Requests and urllib3 releases are pretty well synchronized and I'll make sure Requests has time to adapt if the access route changes.
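For anyone following along, the access route being described is roughly this (the leading underscore is the point: it is a non-public attribute and may change between releases):

    import requests

    r = requests.get('http://example.com/', stream=True)
    raw = r.raw        # urllib3's HTTPResponse
    fp = raw._fp       # the underlying httplib response object (non-public)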

@mponton

mponton commented Feb 12, 2013

@shazow I guess since requests and urllib3 are such good neighbours (and requests packages its own version), there is little benefit to making that official; instead, should urllib3 internals change later, only requests will have to be adapted. I just feel dirty when I have to use something._xyz :-)

@slingamn On that previous rstrip() comment: I had a brain fart. We can simply specify \n\r to rstrip(). Dhuh! (Apparently I don't use rstrip() often enough...)

If everyone agrees this is an acceptable solution, I can make a pull-request in the next few days for it.
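(For reference on the rstrip() note above: rstrip() treats its argument as a set of characters, so either spelling strips a trailing CRLF.)

>>> 'one line\r\n'.rstrip('\n\r')
'one line'
>>> 'another\n'.rstrip('\n\r')
'another'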

@slingamn
Contributor Author

Agreed that ideally iter_lines would return strings terminated by the original newline character(s), since that's how the file iterator works. @kennethreitz would this be too big of an API break to introduce?

@jpaalasm

We're handling request errors in our applications in this way:

    from requests.exceptions import RequestException, ConnectionError

    ...

    try:
        response = requests.post(url, data, headers=headers, session=_get_session())
    except ConnectionError:
        raise APIConnectionError("Could not connect to server (URL: %s)" % url)
    except RequestException:
        raise APIError("Exception when accessing API (URL: %s)" % url)

This seems to work in all cases except when an IncompleteRead is raised by httplib (see the stacktrace below). Shouldn't IncompleteRead be caught by requests and converted into a RequestException?

File "singleticketprocess.py", line 257, in _post_data
response = requests.post(url, data, headers=headers, session=_get_session())
File "requests/api.py", line 98, in post
return request('post', url, data=data, *_kwargs)
File "requests/safe_mode.py", line 39, in wrapped
return function(method, url, *_kwargs)
File "requests/api.py", line 51, in request
return session.request(method=method, url=url, **kwargs)
File "requests/sessions.py", line 241, in request
r.send(prefetch=prefetch)
File "requests/models.py", line 662, in send
self.response.content
File "requests/models.py", line 801, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "requests/utils.py", line 438, in stream_decompress
for chunk in iterator:
File "requests/models.py", line 747, in generate
chunk = self.raw.read(chunk_size)
File "requests/packages/urllib3/response.py", line 146, in read
return self._fp.read(amt)
File "python2.7/httplib.py", line 541, in read
return self._read_chunked(amt)
File "python2.7/httplib.py", line 586, in _read_chunked
raise IncompleteRead(''.join(value))

@akavlie

akavlie commented Apr 17, 2013

I'm getting the httplib.IncompleteRead error on a site when doing a plain requests.get() call; I'm not doing anything with iteration or chunked encoding. Is that covered by this ticket, or a different issue altogether?

@Lukasa
Member

Lukasa commented Apr 17, 2013

I'm pretty sure you should only get IncompleteRead when you or your server uses chunked encoding: are you sure the server isn't?

@akavlie

akavlie commented Apr 17, 2013

@Lukasa not unless it's the default.

Try this:

requests.get('http://www.stkierans.org/')

@Lukasa
Member

Lukasa commented Apr 17, 2013

Yeah, the server is sending chunked data:

>>> import requests
>>> r = requests.get('http://www.stkierans.org/', stream=True)
>>> r.status_code
200
>>> r.headers['transfer-encoding']
'chunked'

I haven't had time to dig into httplib right now, and I'm about to go to work, but my guess is that this gets raised if the web server doesn't send the mandatory empty chunk at the end.
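For context, a well-formed chunked body has to end with a zero-length chunk; if that terminator never shows up before the connection closes, httplib raises IncompleteRead. Roughly, written out as a Python bytes literal:

    well_formed_chunked_body = (
        b'4\r\n'        # chunk size in hex
        b'Wiki\r\n'     # chunk data, followed by CRLF
        b'5\r\n'
        b'pedia\r\n'
        b'0\r\n'        # mandatory zero-length terminating chunk
        b'\r\n'         # blank line ending the (empty) trailer section
    )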

@Lukasa
Member

Lukasa commented Apr 17, 2013

Yep, so the webserver is at fault here. It's specifying chunked encoding, but sending all of its data back in one chunk. I suggest you contact the administrator of the website and ask why they're doing crazy stuff. =)

@Lukasa
Member

Lukasa commented Apr 17, 2013

For what it's worth, you don't really want their content anyway. They seem to be doing some user-agent sniffing and are returning a placeholder page telling you that your browser doesn't support frames, and that you should upgrade. Frankly, I think that entire website might need an upgrade. =)

@Lukasa
Member

Lukasa commented Apr 17, 2013

@akavlie That page is just using frames to hide this page: 'http://www.catholicweb.com/splash/stkierens/'. I'd suggest that you just target that page in Requests, but when I do it I get a 'Connection Reset By Peer'. Which is obnoxious.

@Lukasa
Member

Lukasa commented Apr 17, 2013

Ok, so, they're doing user-agent sniffing on that page too, and when they find that we're not a browser they recognize, they just kill the TCP connection instead of sending a 4XX error. Which is even more obnoxious. You can get your data by using:

import requests
r = requests.get('http://www.catholicweb.com/splash/stkierens/', headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.47'})

Pretending to be Chrome does it just fine. =)

@akavlie

akavlie commented Apr 18, 2013

@Lukasa Thanks for all the digging... I've seen lots of pointless iframes and other horrible practices on various church sites, so this does not surprise me.

I'm targeting a lot of sites with requests in this application, and specifying overrides for bad behavior like this isn't very realistic at this point.

Is it reasonable to expect Requests to catch and wrap an exception like this? It looked to me like Requests itself had a bug.

@Lukasa
Member

Lukasa commented Apr 18, 2013

I'll investigate it, but it might be that there are a few cases where this can be thrown. We should be able to wrap it though, either here or in urllib3. =)

@jpaalasm

@Lukasa It would indeed be great if IncompleteRead was wrapped. Currently, when calling requests.get or requests.post, I need to catch both requests.exceptions.RequestException and httplib.IncompleteRead, which does not make sense. IncompleteRead should be turned into a RequestException or its subclass.
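A sketch of the kind of wrapping being asked for (the exception name and helper are illustrative, not a claim about what requests actually ships): catch httplib's error at the point where the body is read and re-raise it as a RequestException subclass.

    import httplib  # http.client on Python 3

    from requests.exceptions import RequestException


    class ChunkedReadError(RequestException):
        """The server closed the connection before finishing a chunked body."""


    def generate(raw, chunk_size):
        """Yield body chunks, translating httplib's error into a requests one."""
        try:
            while True:
                chunk = raw.read(chunk_size)
                if not chunk:
                    break
                yield chunk
        except httplib.IncompleteRead as exc:
            raise ChunkedReadError(exc)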

@foxx

foxx commented Jun 3, 2013

I think urllib3/urllib3#190 might possibly be related. I've put an idea in there of how this could be approached.

This ticket is quite a big read, so apologies if this is no longer relevant.

@sigmavirus24
Contributor

Are there any updates on this? Can we close this out since there hasn't been much recent activity on it?

@Lukasa
Member

Lukasa commented Jan 19, 2015

Closed because no-one has said anything in literally more than a year.
