
iter_lines small chunks #539

Closed
dbrgn opened this issue Apr 6, 2012 · 15 comments

@dbrgn
Contributor

dbrgn commented Apr 6, 2012

I'm writing a script to track keywords in the twitter realtime api.

At first, I was using requests v0.10.8. Showing the tweets was almost instant, but issue #515 occurred frequently.

Then I upgraded requests to v0.11.1. Now #515 is resolved, but apparently small chunks are not processed immediately if they don't fill the chunk_size. Setting the chunk size to a value like 10 works, but that can't be the solution :)
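For reference, a minimal sketch of the workaround described above. The URL is a placeholder (the report used the Twitter realtime API, but any line-delimited stream behaves the same), and `stream=True` is the modern spelling of the option (the 0.11-era API used `prefetch=False`):

```python
import requests

# Placeholder URL for any line-delimited streaming endpoint.
resp = requests.get("https://stream.example.com/feed", stream=True)

# The workaround: a small chunk_size makes each line surface as soon
# as it arrives, at the cost of many tiny socket reads.
for line in resp.iter_lines(chunk_size=10):
    if line:  # skip keep-alive empty lines
        print(line)
```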

@kennethreitz
Contributor

Hmm, it might be actually :)

Someone implemented "proper" chunks, but it caused way more problems than it solved, so we took it out.

@mponton

mponton commented Apr 10, 2012

So, I had opened #533, came back to update it about chunk_size and noticed this issue.

I guess there aren't many ways to read lines from a blocking socket, so chunk_size probably is the solution.

Still, I would like to raise two points:

  1. iter_lines() is supposed to iterate on lines. Should the user be expected to specify a chunk_size for it to work? (You know, HTTP for Humans ;-)). If the argument could be considered useful to some, shouldn't it have a default value of 1? (iter_lines() should return empty lines too.)
  2. iter_lines() should handle the case where the server closes the connection without raising httplib.IncompleteRead: IncompleteRead(0 bytes read). It should simply stop yielding. No? It seems wrong to provide a "readline-style" interface and expect the user to handle read exceptions related to chunk sizes.

@dbrgn
Contributor Author

dbrgn commented Apr 10, 2012

Good thoughts, @mponton :) +1 on both of them.

@kennethreitz
Contributor

1: It originally did have a default size of 1, actually, before iter_content was written. :)

2: 👍

@kennethreitz
Contributor

I still think chunk_size is valuable though. Let's just change the default.

kennethreitz pushed a commit that referenced this issue Apr 11, 2012
@mponton

mponton commented Apr 11, 2012

On 2: I was thinking a bit more on this; shouldn't this apply to iter_content too? I mean, if I iterate over chunks of content from a streaming URL and the server closes the connection, I would expect the generator to stop yielding. Then, if I really expected a number of bytes, I would check if I got everything. My interpretation of chunk_size is a maximum number of bytes read per iteration, not a guarantee that the content is either exactly that size or a multiple of that size.

I think iter_content should be the one to catch IncompleteRead exceptions and then check if self.raw.isclosed() is true. If it is, it would simply return the last chunk read and exit. If not, it should re-raise the exception (but I say that because I am not certain if a blocking HTTPResponse.read() could return less than the specified amount of bytes when a connection was not closed by the server... Please correct me if this is not possible.).

That said, all this is really only necessary for streaming connections AFAIK. A read(x) on a "normal" HTTP connection that has a remaining content size less than x will simply return the remaining content and '' on next reads. I assume this is due to the Content-Length header being present.
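The behavior proposed above can be sketched as follows. This is not the requests implementation, just an illustration of the idea: catch IncompleteRead, yield the partial data, and stop if the underlying response (assumed here to be an http.client.HTTPResponse-like object with read() and isclosed()) reports the connection closed:

```python
import http.client

def iter_content_sketch(raw, chunk_size):
    """Sketch of the proposal above: stop yielding instead of
    surfacing IncompleteRead when the server closed the connection
    mid-stream; re-raise otherwise."""
    while True:
        try:
            chunk = raw.read(chunk_size)
        except http.client.IncompleteRead as e:
            # Server hung up: yield whatever partial data arrived.
            if e.partial:
                yield e.partial
            if raw.isclosed():
                return  # clean stop, no exception reaches the caller
            raise  # something else went wrong; let the caller see it
        if not chunk:
            return
        yield chunk
```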

@mponton

mponton commented Apr 11, 2012

Agreed, leaving chunk_size doesn't hurt if the default is 1 for iter_lines.

You may want to add a short note in the documentation though. People may change the chunk_size thinking they will still get each line as it comes in for streaming requests.

@piotr-dobrogost

I see no connection between the size of chunk_size and the problems mentioned in this ticket. A default size of 1 is a very bad idea. Am I missing something here?

@mponton

mponton commented May 6, 2012

iter_lines uses iter_content which has a default of 10K bytes for its chunk size (well, it had, then was changed, see my pull request). Thus, iter_lines() would "hang" until it read at least 10K of data before returning any lines.

So using a chunk_size of 1 for iter_lines(), although not efficient, makes more sense, as it will return lines as soon as they become available, even if they are only 10 characters long, for example.
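To illustrate why chunk_size matters here, a simplified sketch of the line-splitting that iter_lines performs over iter_content chunks (not the actual requests code): complete lines can only be yielded from chunks that have already been read, so a line sits invisible until its chunk fills:

```python
def iter_lines_sketch(chunks):
    """Buffer incoming chunks and yield complete lines. With a large
    chunk_size, a short line waits in the unread chunk; with
    chunk_size=1 it surfaces as soon as its newline arrives."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        lines = pending.split(b"\n")
        pending = lines.pop()  # last piece may be an incomplete line
        for line in lines:
            yield line
    if pending:
        yield pending

# Chunk boundaries don't align with line boundaries:
print(list(iter_lines_sketch([b"hel", b"lo\nwor", b"ld\n"])))
# → [b'hello', b'world']
```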

@piotr-dobrogost

Well, if read(chunk_size) blocks until at least chunk_size bytes are read, then the whole implementation does not make sense, and changing chunk_size to 1 is not the solution.

@kennethreitz
Contributor

@mponton when chunk_size was set to 1, requests of more than a few KB were exponentially slower than with any other client.

@mponton

mponton commented May 7, 2012

@kennethreitz Well, of course it will be. You're reading "chunks" of 1 byte at a time. However, when we talked about changing the default chunk_size, it was for iter_lines(), not iter_content() (see my pull request). The reason was to allow iter_lines() to behave like most people using requests would expect it to behave: read lines as they come in. If you use iter_lines() to read from a streaming API like the Twitter API, and it uses a chunk_size of 10K, you may have to wait a freaking long time before you ever get any line.

That said, I'm open to alternatives, but unless you start polling the socket in non-blocking mode, read(chunk_size) will block until it gets chunk_size bytes or the connection is closed, AFAIK. Please let me know if I'm missing something.

@piotr-dobrogost If you know of another implementation to read lines as they come in from a socket while using blocking I/O, please feel free to let me (or us) know. I'm not trying to sell my quick fix to anyone, I'm simply trying to find a way to have iter_lines() behave like (I hope) it was meant to behave.

@kennethreitz Also, if you feel chunk_size should not be set to 1 in the distribution, please change it back; I don't want to impose on anybody. I can specify it in my code, but I'm pretty sure you'll get regular "issues" opened about this by people wondering why their "lines" are not coming in as the server sends them. You already had two soon after the initial code change, from @gwrtheyrn and me.

@slingamn
Contributor

slingamn commented May 8, 2012

@mponton would iter_lines ideally put the socket into nonblocking mode and read until it sees \n?

@mponton

mponton commented May 8, 2012

It may limit the number of reads on the socket, but you'd have to handle cases where I/O would block (EWOULDBLOCK) when no data is available. The next step would be to go completely async behind the scenes in iter_lines() and ask the system to call us back only when data is available, to minimize the number of reads while still returning data line by line, but this might be a bit overkill and would definitely make the code more complex. I'm not sure it's worth the effort.

If you expect hundreds of lines per second in the streaming response, you could still set the chunk_size to something bigger. The only issue is that if the "flow" of data slows down, your requests may start to hang with complete lines already available.
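The non-blocking direction discussed above can be sketched with the stdlib select module: wait until the socket is readable, then take whatever bytes are available rather than blocking for a fixed chunk_size. This is only an illustration of the approach, not proposed requests code; the function name and parameters are made up:

```python
import select
import socket

def read_available(sock, max_bytes=4096, timeout=1.0):
    """Wait up to `timeout` seconds for the socket to become
    readable, then return whatever bytes have arrived (up to
    max_bytes), instead of blocking for a full fixed-size chunk."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return b""  # no data yet; the caller can loop and poll again
    return sock.recv(max_bytes)  # returns only what is available
```

A line-oriented reader built on this would accumulate the returned bytes in a buffer and yield each completed line, avoiding both the 1-byte-read overhead and the hang-on-partial-chunk problem, at the cost of its own polling loop.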

@slingamn
Contributor

slingamn commented May 8, 2012

That makes sense; we'd need our own event loop, and it probably wouldn't play nice with the gevent event loop.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 9, 2021