
iter_lines small chunks #539

Closed
dbrgn opened this issue Apr 6, 2012 · 15 comments

@dbrgn
Contributor

dbrgn commented Apr 6, 2012

I'm writing a script to track keywords in the twitter realtime api.

At first, I was using requests v0.10.8. Showing the tweets was almost instant, but issue #515 occurred frequently.

Then I upgraded requests to v0.11.1. Now #515 is resolved, but apparently small chunks are not processed immediately if they don't fill the chunk_size. Setting the chunk size to a value like 10 works, but that can't be the solution :)
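For reference, a minimal sketch of the workaround described above. The URL is a placeholder (the report used the Twitter realtime API, but any line-delimited stream behaves the same), and `stream=True` is the modern spelling of the option (the 0.11-era API used `prefetch=False`):

```python
import requests

# Placeholder URL for any line-delimited streaming endpoint.
resp = requests.get("https://stream.example.com/feed", stream=True)

# The workaround: a small chunk_size makes each line surface as soon
# as it arrives, at the cost of many tiny socket reads.
for line in resp.iter_lines(chunk_size=10):
    if line:  # skip keep-alive empty lines
        print(line)
```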

@kennethreitz
Contributor

Hmm, it might be actually :)

Someone implemented "proper" chunks, but it caused way more problems than it solved, so we took it out.

@mponton

mponton commented Apr 10, 2012

So, I had opened #533, came back to update it about chunk_size and noticed this issue.

I guess there aren't many ways to read lines from a blocking socket, so chunk_size probably is the solution.

Still, I would like to raise two points:

  1. iter_lines() is supposed to iterate on lines. Should the user be expected to specify a chunk_size for it to work? (You know, HTTP for Humans ;-)). If the argument could be considered useful to some, shouldn't it have a default value of 1? (iter_lines() should return empty lines too.)
  2. iter_lines() should handle the case where the server closes the connection without raising httplib.IncompleteRead: IncompleteRead(0 bytes read). It should simply stop yielding. No? It seems wrong to provide a "readline-style" interface and expect the user to handle read exceptions related to chunk sizes.

@dbrgn
Contributor Author

dbrgn commented Apr 10, 2012

Good thoughts, @mponton :) +1 on both of them.

@kennethreitz
Contributor

1: It originally did have a default size of 1, actually, before iter_content was written. :)

2: 👍

@kennethreitz
Contributor

I still think chunk_size is valuable though. Let's just change the default.

kennethreitz pushed a commit that referenced this issue Apr 11, 2012
@mponton

mponton commented Apr 11, 2012

On 2: I was thinking a bit more on this; shouldn't this apply to iter_content too? I mean, if I iterate over chunks of content from a streaming URL and the server closes the connection, I would expect the generator to stop yielding. Then, if I really expected a number of bytes, I would check if I got everything. My interpretation of chunk_size is a maximum number of bytes read per iteration, not a guarantee that the content is either exactly that size or a multiple of that size.

I think iter_content should be the one to catch IncompleteRead exceptions and then check if self.raw.isclosed() is true. If it is, it would simply return the last chunk read and exit. If not, it should re-raise the exception (but I say that because I am not certain if a blocking HTTPResponse.read() could return less than the specified amount of bytes when a connection was not closed by the server... Please correct me if this is not possible.).

That said, all this is really only necessary for streaming connections AFAIK. A read(x) on a "normal" HTTP connection that has a remaining content size less than x will simply return the remaining content and '' on next reads. I assume this is due to the Content-Length header being present.
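The behavior proposed above can be sketched as follows. This is not the requests implementation, just an illustration of the idea: catch IncompleteRead, yield the partial data, and stop if the underlying response (assumed here to be an http.client.HTTPResponse-like object with read() and isclosed()) reports the connection closed:

```python
import http.client

def iter_content_sketch(raw, chunk_size):
    """Sketch of the proposal above: stop yielding instead of
    surfacing IncompleteRead when the server closed the connection
    mid-stream; re-raise otherwise."""
    while True:
        try:
            chunk = raw.read(chunk_size)
        except http.client.IncompleteRead as e:
            # Server hung up: yield whatever partial data arrived.
            if e.partial:
                yield e.partial
            if raw.isclosed():
                return  # clean stop, no exception reaches the caller
            raise  # something else went wrong; let the caller see it
        if not chunk:
            return
        yield chunk
```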

@mponton

mponton commented Apr 11, 2012

Agreed, leaving chunk_size doesn't hurt if the default is 1 for iter_lines.

You may want to add a short note in the documentation though. People may change the chunk_size thinking they will still get each line as it comes in for streaming requests.

@piotr-dobrogost

I see no connection between the size of chunk_size and the problems mentioned in this ticket. A default size of 1 is a very bad idea. Am I missing something here?

@mponton

mponton commented May 6, 2012

iter_lines uses iter_content which has a default of 10K bytes for its chunk size (well, it had, then was changed, see my pull request). Thus, iter_lines() would "hang" until it read at least 10K of data before returning any lines.

So using a chunk_size of 1 for iter_lines(), although not efficient, makes more sense, as it will return lines as soon as they become available, even if they are only 10 characters long, for example.
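To illustrate why chunk_size matters here, a simplified sketch of the line-splitting that iter_lines performs over iter_content chunks (not the actual requests code): complete lines can only be yielded from chunks that have already been read, so a line sits invisible until its chunk fills:

```python
def iter_lines_sketch(chunks):
    """Buffer incoming chunks and yield complete lines. With a large
    chunk_size, a short line waits in the unread chunk; with
    chunk_size=1 it surfaces as soon as its newline arrives."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        lines = pending.split(b"\n")
        pending = lines.pop()  # last piece may be an incomplete line
        for line in lines:
            yield line
    if pending:
        yield pending

# Chunk boundaries don't align with line boundaries:
print(list(iter_lines_sketch([b"hel", b"lo\nwor", b"ld\n"])))
# → [b'hello', b'world']
```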

@piotr-dobrogost

Well, if read(chunk_size) blocks until at least chunk_size bytes are read, then the whole implementation does not make sense, and changing chunk_size to 1 is not the solution.

@kennethreitz
Contributor

@mponton when chunk_size was set to 1, requests of more than a few KB were exponentially slower than with any other client.

@mponton

mponton commented May 7, 2012

@kennethreitz Well, of course it will be. You're reading "chunks" of 1 byte at a time. However, when we talked about changing the default chunk_size, it was for iter_lines(), not iter_content() (see my pull request). The reason was to allow iter_lines() to behave like most people using requests would expect it to behave: read lines as they come in. If you use iter_lines() to read from a streaming API like the Twitter API, and it uses a chunk_size of 10K, you may have to wait a freaking long time before you ever get any line.

That said, I'm open to alternatives, but unless you start polling the socket in non-blocking mode, read(chunk_size) will block until it gets chunk_size bytes or the connection is closed, AFAIK. Please let me know if I'm missing something.

@piotr-dobrogost If you know of another implementation to read lines as they come in from a socket while using blocking I/O, please feel free to let me (or us) know. I'm not trying to sell my quick fix to anyone, I'm simply trying to find a way to have iter_lines() behave like (I hope) it was meant to behave.

@kennethreitz Also, if you feel chunk_size should not be set to 1 in the distribution, please change it back; I don't want to impose on anybody. I can specify it in my code, but I'm pretty sure you'll get regular "issues" opened about this by people wondering why their "lines" are not coming in as the server sends them. You already had two soon after the initial code change, from @gwrtheyrn and me.

@slingamn
Contributor

slingamn commented May 8, 2012

@mponton would iter_lines ideally put the socket into nonblocking mode and read until it sees \n?

@mponton

mponton commented May 8, 2012

It may limit the number of reads on the socket, but you'd have to handle cases where I/O would block (EWOULDBLOCK) when no data is available. The next step would be to go completely async behind the scenes in iter_lines() and ask the system to call us back only when data is available, to minimize the number of reads while still returning data line by line, but this might be a bit overkill and would definitely make the code more complex. I'm not sure it's worth the effort.

If you expect hundreds of lines per second in the streaming response, you could still set the chunk_size to something bigger. The only issue is that if the "flow" of data slows down, your requests may start to hang with complete lines already available.
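The non-blocking direction discussed above can be sketched with the stdlib select module: wait until the socket is readable, then take whatever bytes are available rather than blocking for a fixed chunk_size. This is only an illustration of the approach, not proposed requests code; the function name and parameters are made up:

```python
import select
import socket

def read_available(sock, max_bytes=4096, timeout=1.0):
    """Wait up to `timeout` seconds for the socket to become
    readable, then return whatever bytes have arrived (up to
    max_bytes), instead of blocking for a full fixed-size chunk."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return b""  # no data yet; the caller can loop and poll again
    return sock.recv(max_bytes)  # returns only what is available
```

A line-oriented reader built on this would accumulate the returned bytes in a buffer and yield each completed line, avoiding both the 1-byte-read overhead and the hang-on-partial-chunk problem, at the cost of its own polling loop.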

@slingamn
Contributor

slingamn commented May 8, 2012

That makes sense; we'd need our own event loop, and it probably wouldn't play nice with the gevent event loop.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 9, 2021