streams: Readable highWaterMark is measured in bytes or characters #6798

mscdex · 2016-05-17T01:22:45Z

Version: all
Platform: all
Subsystem: streams

Currently the Readable streams documentation states that highWaterMark is measured in bytes for non-object mode streams. However, when .setEncoding() is called on the Readable stream, then highWaterMark is measured in characters and not bytes since state.length += chunk.length happens after chunk is overwritten with a decoded string and state.length is what is compared with highWaterMark.

This seems to be an issue since at least v0.10, so I wasn't sure if we should just update the documentation to note this behavior or if we should change the existing behavior to match the documentation.

/cc @nodejs/streams

The text was updated successfully, but these errors were encountered:

jasnell · 2016-05-17T01:31:35Z

My initial take on this is that measuring characters is a bug.

mcollina · 2016-05-17T06:41:52Z

@jasnell I agree! this is a nice issue if anybody wants to start contributing on streams and on core cc @nodejs/streams.

MylesBorins · 2016-05-17T07:05:16Z

I'd be interested in digging into this to learn more about streams.... will take a look in the morning

mscdex · 2016-06-16T07:37:26Z

One other issue to think about with this is when someone calls setEncoding() when state.length > 0, what to do then? Convert all the buffered data to strings and update state.length or just loop through the buffer using Buffer.byteLength(chunk, encoding) to update state.length ?

mcollina · 2016-06-16T13:59:28Z

I think we should just use byte length for highWaterMark. Converting values... sounds tricky.

jalafel · 2016-10-16T15:32:50Z

Hi! It seems like it's been awhile since anyone's mentioned anything on this issue. Does anyone mind if I take a go at this?

mcollina · 2016-10-16T17:01:03Z

@jessicaquynh please do! Ping @nodejs/streams if you need any help.

jalafel · 2016-10-18T18:15:11Z

@mcollina Thanks! Should I also include a test for this change?

cjihrig · 2016-10-18T18:18:54Z

@jessicaquynh yes please!

jalafel · 2016-10-18T18:24:26Z

Great, will do! Thanks @cjihrig!

jalafel · 2016-10-20T12:35:18Z

@nodejs/streams would love some guidance! The implementation that I made has caused a side-effect and I am unsure what to do to fix it.

I have changed this line to write the chunk.length (if not in objectMode) to be a converted Buffer.byteLength(chunk, encoding).

However, the most recent version of _stream_readable.js has this line that conflicts with the change.

This if statement gets entered pre-emptively when the byteLength is quite low. Say, 2 or 4 and it throws the error that state.buffer.head is undefined. The difference is in these instances, the string chunk length was 0. I notice that the file on master and the file in this issue's reference contain a different howMuchToRead function. So perhaps the new function got rewritten without this bug in mind?

Anyhow, any help or feedback would be greatly appreciated! Thanks! :)

mcollina · 2016-10-20T16:41:56Z

So perhaps the new function got rewritten without this bug in mind?

I do not think so, no.

The best approach is to write a new unit test, that exhibit the described issue, and push it on branch on your profile. I would like to have a closer look at both the change and the test.

As a side note, Buffer.byteLength(str) calls str.toString(), which might be the issue here.

jalafel · 2016-10-22T22:09:27Z

@mcollina That was a very complicated process, but I ended up learning a lot along the way! Would love your advice and feedback!

I wrote a test for this for on a lower byte level using mdash and the euro character, as their character length and byte length are not the same.

https://github.com/jessicaquynh/node/tree/issue%236798

mcollina · 2016-10-24T08:10:21Z

@jessicaquynh nice one! I think the problem is here: https://github.com/jessicaquynh/node/blob/issue%236798/lib/_stream_readable.js#L253. That state.buffer.head.data can be both a Buffer or a String, which means that we need to use Buffer.byteLength.

Let me know how it goes!

jalafel · 2016-10-26T00:27:51Z

@mcollina hey! I am not sure if I implemented your fix correctly, but it's still failing a few other tests. I wasn't exactly sure how to reproduce beyond introducing the process.on('uncaughtException') listener. Let me know if this is incorrect or bad practice!

Nonetheless, here's an updated test that reproduces the error. If you push in an exact buffer source, it works as expected. But should you split the buffer and the split is not exact, it attempts to return the empty buffer head!

mcollina · 2016-10-26T07:50:16Z

Some notes. I fear this is definitely not a good first contribution. I was wrong in tagging this as is.

Anyway, we are progressing: we need also to update https://github.com/jessicaquynh/node/blob/b587c1f2a7f21fc00b5c116d48a9c44d62feae11/lib/_stream_readable.js#L865-L884. As it's making its calculations on the length, not the bytes.

There is still a thing that worries me: characters are unsplittable, we can send the first half of a utf8 sequence through when the consumer expect it to be a string. Not sure if it's going to be an issue code-wise.

jalafel · 2016-10-27T06:45:58Z

@mcollina Ahah :) I will admit it's been a daunting (although rewarding) task. Thank you for your aid in this process.

I have implemented your recommended changes. There is still one error that happens on the parallel tests, and that has to do with a ReadLine implementation. Namely, this test which is throwing the empty buffer head error.

From my interpretation it looks like the byte order mark is causing the length to be off?

mcollina · 2016-10-27T13:20:04Z

@jessicaquynh can you push your branch? master...jessicaquynh:issue#6798 does not have all the fixes (especially the howMuchToRead ones). I want to run the tests on your branch too.

The problem is related to https://github.com/jessicaquynh/node/blob/b587c1f2a7f21fc00b5c116d48a9c44d62feae11/lib/_stream_readable.js#L865-L884. As you can see, it's still accessing the length without passing through Buffer.byteLength.

jalafel · 2016-10-27T14:09:22Z

@mcollina Sorry about that! I forgot to push up last night with fromListPartial fix. It has now been updated and pushed to be the same as my local branch.

Additionally, I am wondering I could ask for clarification:

There is still a thing that worries me: characters are unsplittable, we can send the first half of a utf8 sequence through when the consumer expect it to be a string. Not sure if it's going to be an issue code-wise.

Is this in regards to instances in the test when the Buffer splits the euro/emdash characters between different chunks?

mcollina · 2016-10-27T14:27:55Z

Is this in regards to instances in the test when the Buffer splits the euro/emdash characters between different chunks?

Yes!
In current master, the logic is exact, and it treats bytes and chars in the same way.
Meaning that, if you try to .read(16), it will read either 16 bytes or characters.

If we count only characters, what should we do? If we have 15 bytes, and then a 3-bytes utf sequence, we can either read 15 or 18 bytes, we cannot read 16 or 17. This is very likely what is happening here.

mcollina · 2016-10-27T14:47:00Z

(Maybe I am wrong about that utf story, this change is really hard).

I'm seeing two tests failing:

test/parallel/test-stream2-set-encoding.js
test/parallel/test-stream-preprocess.js

In case of 1., there is something wrong going on with hex encoding.

[ '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '61616161616161616161',
  '' ]
[ '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161' ]

All the data is there, but organized differently. There might be an issue in onEofChunk, again accessing chunk.length: https://github.com/jessicaquynh/node/blob/cedc393221c305eb87fc50fde2191664bddac697/lib/_stream_readable.js#L405

mafintosh · 2016-10-27T15:12:15Z

Maybe this is just too hard to fix in general? The use case of piping with setEncoding is kinda edge-casy anyway.

mcollina · 2016-10-27T15:23:59Z

@mafintosh probably, at this point I am wondering if it's feasible, and if it decreases throughput. Maybe we are just worrying for nothing.

jalafel · 2016-10-28T15:51:52Z

Hey @mcollina, I've added your suggestion but it actually crops up more errors with the tests.

I'm seeing two tests failing:
test/parallel/test-stream2-set-encoding.js

I noticed in my first changes that this same error was happening when I was using the state's encoding in the readableAddChunk func, here.

I am not exactly sure what the issue is. So, I tried experimenting without setting the encoding in fromListPartial, here and that seemed to work.

Though, I am not exactly sure why because the failing test/parallel/test-stream2-set-encoding.js is checking for hex encoding, and with the empty string it defaults to utf8. If you wouldn't mind explaining this difference in behaviour, I'd very much appreciate it :)

Nonetheless

test/parallel/test-stream-preprocess.js

is still failing with this error.

return Buffer.byteLength(state.buffer.head.data)

mcollina · 2016-11-02T20:51:03Z

@jessicaquynh sorry it's taking a bit of time :( this week and the next one will be a bit busy, I will try to answer as quickly as I can.

jalafel · 2016-11-02T22:04:08Z

@mcollina no worries! I appreciate all your help :)

jasnell · 2017-05-30T04:54:01Z

ping @nodejs/streams

mcollina · 2017-05-30T07:32:57Z

I think the best approach about this issue is to update the documentation. An actual fix proved to be way harder, and it might slow things down considerably.

Would anyone update the docs? Maybe @jessicaquynh?

jalafel · 2017-05-31T19:31:19Z

@mcollina definitely can; are you thinking about putting it in as a "gotcha" statement?

mcollina · 2017-06-01T07:44:28Z

@jessicaquynh yes, I think that would work, let's see what other people think when you fire the PR.

This commit documents and edge-case behavior in readable streams. It is expected that non-object streams are measured in bytes against the highWaterMark. However, it was discovered in issue nodejs#6798 that after calling .setEncoding() on the stream, it will thereafter begin to measure the buffer's length in characters.

This commit documents and edge-case behavior in readable streams. It is expected that non-object streams are measured in bytes against the highWaterMark. However, it was discovered in issue thereafter begin to measure the buffer's length in characters. PR-URL: #13442 Refs: #6798 Reviewed-By: James M Snell <[email protected]> Reviewed-By: Vse Mozhet Byt <[email protected]> Reviewed-By: Ruben Bridgewater <[email protected]>

This commit documents and edge-case behavior in readable streams. It is expected that non-object streams are measured in bytes against the highWaterMark. However, it was discovered in issue thereafter begin to measure the buffer's length in characters. PR-URL: nodejs#13442 Refs: nodejs#6798 Reviewed-By: James M Snell <[email protected]> Reviewed-By: Vse Mozhet Byt <[email protected]> Reviewed-By: Ruben Bridgewater <[email protected]>

This commit documents and edge-case behavior in readable streams. It is expected that non-object streams are measured in bytes against the highWaterMark. However, it was discovered in issue thereafter begin to measure the buffer's length in characters. PR-URL: #13442 Refs: #6798 Reviewed-By: James M Snell <[email protected]> Reviewed-By: Vse Mozhet Byt <[email protected]> Reviewed-By: Ruben Bridgewater <[email protected]>

gireeshpunathil · 2018-04-09T12:57:06Z

#13442 is landed, so close-able?
ping @mscdex

mcollina · 2018-04-09T14:53:23Z

Yes, I think this could be closed.

mscdex added the stream Issues and PRs related to the stream subsystem. label May 17, 2016

mscdex added the good first issue Issues that are suitable for first-time contributors. label May 17, 2016

MylesBorins self-assigned this May 17, 2016

nouex mentioned this issue Jun 19, 2016

stream, doc: an ambiguity in the readable.read([size]) doc #7337

Closed

MylesBorins removed their assignment Oct 12, 2016

mcollina removed the good first issue Issues that are suitable for first-time contributors. label Oct 27, 2016

jalafel mentioned this issue Jun 3, 2017

doc: document switchover from measured bytes to chars after setEncoding #13442

Closed

2 tasks

mcollina closed this as completed Apr 9, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

streams: Readable highWaterMark is measured in bytes or characters #6798

streams: Readable highWaterMark is measured in bytes or characters #6798

mscdex commented May 17, 2016

jasnell commented May 17, 2016

mcollina commented May 17, 2016

MylesBorins commented May 17, 2016

mscdex commented Jun 16, 2016 •

edited

Loading

mcollina commented Jun 16, 2016

jalafel commented Oct 16, 2016

mcollina commented Oct 16, 2016

jalafel commented Oct 18, 2016

cjihrig commented Oct 18, 2016

jalafel commented Oct 18, 2016

jalafel commented Oct 20, 2016

mcollina commented Oct 20, 2016

jalafel commented Oct 22, 2016

mcollina commented Oct 24, 2016

jalafel commented Oct 26, 2016

mcollina commented Oct 26, 2016

jalafel commented Oct 27, 2016 •

edited

Loading

mcollina commented Oct 27, 2016

jalafel commented Oct 27, 2016 •

edited

Loading

mcollina commented Oct 27, 2016

mcollina commented Oct 27, 2016

mafintosh commented Oct 27, 2016

mcollina commented Oct 27, 2016

jalafel commented Oct 28, 2016

mcollina commented Nov 2, 2016

jalafel commented Nov 2, 2016

jasnell commented May 30, 2017

mcollina commented May 30, 2017

jalafel commented May 31, 2017

mcollina commented Jun 1, 2017

gireeshpunathil commented Apr 9, 2018

mcollina commented Apr 9, 2018

streams: Readable highWaterMark is measured in bytes *or* characters #6798

streams: Readable highWaterMark is measured in bytes *or* characters #6798

Comments

mscdex commented May 17, 2016

jasnell commented May 17, 2016

mcollina commented May 17, 2016

MylesBorins commented May 17, 2016

mscdex commented Jun 16, 2016 • edited Loading

mcollina commented Jun 16, 2016

jalafel commented Oct 16, 2016

mcollina commented Oct 16, 2016

jalafel commented Oct 18, 2016

cjihrig commented Oct 18, 2016

jalafel commented Oct 18, 2016

jalafel commented Oct 20, 2016

mcollina commented Oct 20, 2016

jalafel commented Oct 22, 2016

mcollina commented Oct 24, 2016

jalafel commented Oct 26, 2016

mcollina commented Oct 26, 2016

jalafel commented Oct 27, 2016 • edited Loading

mcollina commented Oct 27, 2016

jalafel commented Oct 27, 2016 • edited Loading

mcollina commented Oct 27, 2016

mcollina commented Oct 27, 2016

mafintosh commented Oct 27, 2016

mcollina commented Oct 27, 2016

jalafel commented Oct 28, 2016

mcollina commented Nov 2, 2016

jalafel commented Nov 2, 2016

jasnell commented May 30, 2017

mcollina commented May 30, 2017

jalafel commented May 31, 2017

mcollina commented Jun 1, 2017

gireeshpunathil commented Apr 9, 2018

mcollina commented Apr 9, 2018

streams: Readable highWaterMark is measured in bytes or characters #6798

streams: Readable highWaterMark is measured in bytes or characters #6798

mscdex commented Jun 16, 2016 •

edited

Loading

jalafel commented Oct 27, 2016 •

edited

Loading

jalafel commented Oct 27, 2016 •

edited

Loading