Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

string_decoder: fix bad utf8 character handling #7310

Merged
merged 2 commits into from
Jun 24, 2016

Conversation

mscdex
Copy link
Contributor

@mscdex mscdex commented Jun 15, 2016

Checklist
  • make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
  • a test and/or benchmark is included
  • the commit message follows commit guidelines
Affected core subsystem(s)
  • string_decoder
Description of change

This commit fixes an issue when extra utf8 continuation bytes appear at the end of a chunk of data, causing miscalculations to be made when checking how many bytes are needed to decode a complete
character.

Fixes: #7308

@mscdex mscdex added the string_decoder Issues and PRs related to the string_decoder subsystem. label Jun 15, 2016
@mscdex mscdex force-pushed the string-decoder-fix-regression branch 3 times, most recently from 164ca60 to 541bcc9 Compare June 15, 2016 16:50
@mscdex
Copy link
Contributor Author

mscdex commented Jun 15, 2016

/cc @gagern

self.lastNeed = nb + 1 - (buf.length - j);
if (self.lastNeed < 0)
self.lastNeed = nb = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make these two assignments on separate lines here and below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this way describes the intention more succinctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is against the code style though, not sure why we don't lint it.

Copy link
Contributor

@silverwind silverwind Jun 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@indutny I'm not aware of a fitting rule in ESLint. Filed eslint/eslint#6424 for a rule proposal.

@cjihrig
Copy link
Contributor

cjihrig commented Jun 15, 2016

LGTM pending CI and a style fix.

@mscdex
Copy link
Contributor Author

mscdex commented Jun 15, 2016

@@ -55,9 +55,14 @@ assert.strictEqual(decoder.write(Buffer.from('\ufffd\ufffd\ufffd')),
assert.strictEqual(decoder.end(), '');

decoder = new StringDecoder('utf8');
assert.strictEqual(decoder.write(Buffer.from('efbfbde2', 'hex')), '\ufffd');
assert.strictEqual(decoder.write(Buffer.from('EFBFBDE2', 'hex')), '\ufffd');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? And if needed, maybe leave the old test too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just making capitalization consistent with all of the other hex buffers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be inclined to leave it just to make sure that lower-case is tested. (I'm sure that under the hood, string_decoder delegates to buffer or whatever. So it's probably far-fetched that we'd be in a situation where lower-case was broken but upper-case still worked. But it doesn't cost anything to have that tiny bit of extra coverage and the change seems aesthetic only.)

While I have an opinion, I don't feel strongly enough about this to protest or anything. If you do feel strongly, feel free to leave it as you have it.

Copy link
Contributor Author

@mscdex mscdex Jun 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. The hex string is being passed to the Buffer creation function (Buffer.from()), not to the StringDecoder functions. That function has plenty of lowercase and uppercase hex string tests in the appropriate test-buffer* test files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mscdex Argh! You're right, of course. Ignore me.

@mscdex
Copy link
Contributor Author

mscdex commented Jun 15, 2016

CI is green. /cc @nodejs/collaborators

@indutny
Copy link
Member

indutny commented Jun 15, 2016

One comment, otherwise LGTM.

@mscdex mscdex force-pushed the string-decoder-fix-regression branch from 541bcc9 to 52462c6 Compare June 15, 2016 21:54
gagern added a commit to gagern/node that referenced this pull request Jun 16, 2016
@gagern
Copy link
Contributor

gagern commented Jun 16, 2016

I did some more tests in gagern/node@f3d4f2a. Observations:

  • Input EC41 decodes incorrectly if decoded in two chunks. In that case, the ASCII byte gets turned into an U+FFFD as well. This is a regression compared to 6.2.0, too.
  • (At some point I had thought that EDA0B5EDB08D, which is CESU-8 for U+1D40D, were behaving incorrectly as well. But I guess I got this wrong, the result is six U+FFFD consistently.)

@gagern
Copy link
Contributor

gagern commented Jun 16, 2016

I tried some more systematic testing in gagern/node@2e4bff1: using the bytes 00 41 b8 cc e2 f0 f1 fb as a hopefully representative set of input bytes likely to cause different behavior, I tried all byte sequences of length up to four and checked whether chunking had any effect. For the following byte sequences it had:

(e2|f0|f1)(00|41)
ccccb8
f[01](b8|cc)(00|41)
f[01]ccb8
f[01]fb(00|41)
(cc|e2)e2b8b8
e2(b8|e2)ccb8
e2fbcc(01|b8)

I'm not sure how much of this can be attributed to the same bug. I'm also not sure how much of this is a regression compared to 6.2.0. If you can afford the time, it might make sense to include some variant of this systematic testing code. If not, then at least try out the indicated problematic cases, and perhaps add them to the suite to guard against regressions. Is there some special test suite where lengthy tests can be placed, to be run occasionally without slowing down the main test suite?

I'm also somewhat concerned about the use of Buffer.allocUnsafe in there. Is there some machinery to check whether some given code reads some buffer content without writing it first, so that it could be affected by the random data potentially included in such a buffer?

@gagern
Copy link
Contributor

gagern commented Jun 16, 2016

@mscdex would you care to have a look at #7318? The fact that even after your fix there are so many possible breakages here made me have a closer look at the code, and come up with an alternate solution. I'd value your opinion on that.

@mscdex mscdex force-pushed the string-decoder-fix-regression branch from 52462c6 to 023bb56 Compare June 18, 2016 19:37
@mscdex mscdex force-pushed the string-decoder-fix-regression branch from 023bb56 to 985bd75 Compare June 18, 2016 19:38
@mscdex
Copy link
Contributor Author

mscdex commented Jun 18, 2016

I've now included fixes for the other test cases that @gagern had created, which I have now included.

CI: https://ci.nodejs.org/job/node-test-pull-request/3025/

CI is green except for an unrelated test failure on Windows.

@mscdex mscdex force-pushed the string-decoder-fix-regression branch from 985bd75 to bafb77f Compare June 18, 2016 21:50
@jasnell
Copy link
Member

jasnell commented Jun 20, 2016

LGTM

@mscdex
Copy link
Contributor Author

mscdex commented Jun 20, 2016

@cjihrig @indutny Does this still LGTY after the additional changes?

@cjihrig
Copy link
Contributor

cjihrig commented Jun 20, 2016

Yea, LGTM. CI seemed happy.

if (nb >= 0) {
if (nb > 0)
self.lastNeed = nb + 1 - (buf.length - j);
self.lastNeed = nb - 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't it still go negative here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because if execution has reached here, nb would have to be a 2, 3, or 4-byte character indicator. So self.lastNeed would range from 0 to 2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, nb can't be 1.

@mscdex
Copy link
Contributor Author

mscdex commented Jun 24, 2016

One last CI before landing, just in case: https://ci.nodejs.org/job/node-test-pull-request/3064/

EDIT: CI is green, but a few CI nodes are offline at the moment. Still enough coverage IMHO.

mscdex and others added 2 commits June 23, 2016 23:18
This commit fixes an issue when extra utf8 continuation bytes appear
at the end of a chunk of data, causing miscalculations to be made
when checking how many bytes are needed to decode a complete
character.

Fixes: nodejs#7308
PR-URL: nodejs#7310
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Fedor Indutny <[email protected]>
PR-URL: nodejs#7310
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Fedor Indutny <[email protected]>
@mscdex mscdex force-pushed the string-decoder-fix-regression branch from d3cea4b to 5e8cbd7 Compare June 24, 2016 03:50
@mscdex mscdex merged commit 5e8cbd7 into nodejs:master Jun 24, 2016
@mscdex mscdex deleted the string-decoder-fix-regression branch June 24, 2016 03:55
Fishrock123 pushed a commit that referenced this pull request Jun 27, 2016
This commit fixes an issue when extra utf8 continuation bytes appear
at the end of a chunk of data, causing miscalculations to be made
when checking how many bytes are needed to decode a complete
character.

Fixes: #7308
PR-URL: #7310
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Fedor Indutny <[email protected]>
Fishrock123 pushed a commit that referenced this pull request Jun 27, 2016
PR-URL: #7310
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Fedor Indutny <[email protected]>
@Fishrock123 Fishrock123 mentioned this pull request Jun 27, 2016
Fishrock123 pushed a commit that referenced this pull request Jul 5, 2016
This commit fixes an issue when extra utf8 continuation bytes appear
at the end of a chunk of data, causing miscalculations to be made
when checking how many bytes are needed to decode a complete
character.

Fixes: #7308
PR-URL: #7310
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Fedor Indutny <[email protected]>
Fishrock123 pushed a commit that referenced this pull request Jul 5, 2016
PR-URL: #7310
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Fedor Indutny <[email protected]>
@Fishrock123 Fishrock123 mentioned this pull request Jul 5, 2016
@MylesBorins
Copy link
Contributor

@mscdex lts?

@mscdex
Copy link
Contributor Author

mscdex commented Jul 12, 2016

@thealphanerd This PR is only relevant if #6777 is also landed, however that PR is currently marked as dont-land-on-v4.x to give the rewrite some time in non-LTS. If others are comfortable with the original StringDecoder rewrite now landing on LTS, then this would need to land also.

@MylesBorins
Copy link
Contributor

@mscdex I'm going to mark this as dont-land for now as well. Would you like us to keep an issue in @nodejs/lts to keep track of the commits neccessary to backport the string_decoder changes?

@mscdex
Copy link
Contributor Author

mscdex commented Jul 12, 2016

@thealphanerd Sure.

@gagern
Copy link
Contributor

gagern commented Jul 14, 2016

Will the 6.3.1 release be derived from current master, and hence contain this fix here? Or is the fix scheduled for 6.4.0?

@targos
Copy link
Member

targos commented Jul 14, 2016

@gagern 6.3.0 already has the fix.

@gagern
Copy link
Contributor

gagern commented Jul 14, 2016

Ah thanks, @targos. I apparently only searched the changelog for #7308 and the 5e8cbd7 in master, so I missed the 962ac37 for #7310 that is listed. Sorry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
string_decoder Issues and PRs related to the string_decoder subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

string_decoder: RangeError introduced by 6.2.1
9 participants