Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

buffer: implement WHATWG Encoding Standard API #13644

Closed
wants to merge 1 commit into from

Conversation

jasnell
Copy link
Member

@jasnell jasnell commented Jun 13, 2017

Provide an (initially experimental) implementation of the WHATWG Encoding Standard API (TextDecoder and TextEncoder). The is the same API implemented on the browser side.

By default, with small-icu, only the UTF-8, UTF-16le and UTF-16be decoders are supported. With full-icu enabled, every encoding other than iso-8859-16 is supported.

This provides a basic test, but does not include the full web platform tests. Note: many of the web platform tests for this would fail by default because we ship with small-icu by default.

The implementation is added without changing any of the existing encoding support in core.

Refs: https://encoding.spec.whatwg.org/

/cc @domenic @TimothyGu @addaleax

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

buffer

@jasnell jasnell added buffer Issues and PRs related to the buffer subsystem. semver-minor PRs that contain new features and should be released in the next minor version. labels Jun 13, 2017
@nodejs-github-bot nodejs-github-bot added buffer Issues and PRs related to the buffer subsystem. build Issues and PRs related to build files or the CI. c++ Issues and PRs that require attention from people who are familiar with C++. errors Issues and PRs related to JavaScript errors originated in Node.js core. i18n-api Issues and PRs related to the i18n implementation. tools Issues and PRs related to the tools directory. labels Jun 13, 2017
@mscdex
Copy link
Contributor

mscdex commented Jun 13, 2017

What's special about iso-8859-16?

@jasnell
Copy link
Member Author

jasnell commented Jun 13, 2017

It's not implemented by ICU at all, even with full-icu

Copy link
Member

@TimothyGu TimothyGu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some preliminary comments. I'll need some time to digest the intricacies of BOM and the various flags related to it.


function normalizeEncoding(encoding) {
if (encoding === undefined)
return 'utf-8';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't have to handle defaults -- nor should it. The two places that use this function both ensure encoding !== undefined. I also somewhat prefer the signature function getEncodingFromLabel(label) to align better with the spec.

return 'utf-8';
let enc = encodings.get(encoding);
if (enc !== undefined) return enc;
enc = encodings.get(encoding.toLowerCase());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASCII whitespace (and ASCII whitespace only, not Unicode) needs to be trimmed as well per step 1 of the spec.

let enc = encodings.get(encoding);
if (enc !== undefined) return enc;
enc = encodings.get(encoding.toLowerCase());
if (enc !== undefined) return enc;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just return enc w/o if should be fine and simpler.

src/node_i18n.cc Outdated

UErrorCode status = U_ZERO_ERROR;
UConverter* conv = ucnv_open(*label, &status);
args.GetReturnValue().Set(U_SUCCESS(status) == 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just !!U_SUCCESS(status)? The == 1 looks kinda fragile.

if (handle === undefined)
throw new errors.Error('ERR_ENCODING_NOT_SUPPORTED', encoding);

Object.defineProperties(this, {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Object.defineProperties is pretty slow to be used on per object construction. I don't see any harm in making kHandle configurable, enumerable, or writable, as it's a symbol property already (and it's what we do for URL). encoding, fatal, and ignoreBOM should all be defined on the prototype as getter functions per spec.

src/node_i18n.cc Outdated
ConverterObject* converter;
ASSIGN_OR_RETURN_UNWRAP(&converter, args[0].As<Object>());
SPREAD_BUFFER_ARG(args[1], input_obj);
int flags = args[2]->Uint32Value();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the non-deprecated Uint32Value(Local<Context>) signature of the function.

}

Local<ObjectTemplate> t = ObjectTemplate::New(env->isolate());
t->SetInternalFieldCount(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent of this PR, we should find a way to cache this ObjectTemplate.

Potentially dumb question: why not use ObjectWrap?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping this question...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For no particular reason other than ObjectWrap really isn't used anywhere else in core. I've actually been contemplating whether or not it is something we could eliminate at some point.

src/node_i18n.cc Outdated
NULL, flush, &status);

if (U_SUCCESS(status)) {
result.SetLength(target - &result[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SetLength, too, takes number of entries as argument, not number of bytes.

@@ -35,7 +35,7 @@
'defines': [
# ICU cannot swap the initial data without this.
# http://bugs.icu-project.org/trac/ticket/11046
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the comment still apply?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. Was planning to remove the whole section.

Per the [WHATWG Encoding Standard][], the encodings supported by the
`TextDecoder` API are outlined in the table below. For each encoding,
one or more aliases may be used. Support for some encodings is enabled
only when Node.js is using the full ICU data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it documented anywhere how to get full ICU?

Copy link
Contributor

@vsemozhetbyt vsemozhetbyt Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be better documented.


encode(input) {
const buf = lazyBuffer().from(String(input));
return new Uint8Array(buf.buffer, buf.offset, buf.length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per spec, the returned Uint8Array must be unpooled and have an entire block of ArrayBuffer dedicated to it.

`http.get()`, if the returned charset is one of those listed in the WHATWG spec
it's possible that the server actually returned win-1252-encoded data, and
using `'latin1'` encoding may incorrectly decode the characters.
*Note*: Today's browsers follow the [WHATWG Encoding Standard] which aliases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WHATWG Encoding Standard][] (+ second pair of brackets) for consistency?

const decoder = new TextDecoder('shift_jis');
let string = '';
let buffer;
while(buffer = getNextChunkSomehow()) {
Copy link
Contributor

@vsemozhetbyt vsemozhetbyt Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while( -> while ( ?

let string = '';
let buffer;
while(buffer = getNextChunkSomehow()) {
string += decoder.decode(buffer, { stream:true });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stream:true -> stream: true ?

`false`.
* `ignoreBOM` {boolean} When `true`, the `TextDecoder` will include the byte
order mark in the decoded result. This option is only used when `encoding`
is `'utf-8'`, `'utf-16be'` or `'utf-16le'`.
Copy link
Contributor

@vsemozhetbyt vsemozhetbyt Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note about defaults (false = strip the BOM)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(false = strip the BOM)

That's correct.

var dec = new TextDecoder('utf-16be', { ignoreBOM: true });
[...dec.decode(new Uint8Array([0xfe, 0xff, 0x00, 0x20]))].map(str => str.codePointAt(0).toString(16));
  // ["feff", "20"]
var dec = new TextDecoder('utf-16be', { ignoreBOM: false });
[...dec.decode(new Uint8Array([0xfe, 0xff, 0x00, 0x20]))].map(str => str.codePointAt(0).toString(16));
  // ["20"]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean maybe it is worth to add a note about the default value.


const common = require('../common');
const assert = require('assert');
const { Buffer, TextDecoder, TextEncoder } = require('buffer');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to require Buffer if we state in the doc:

The Buffer class is a global within Node.js, making it unlikely that one would need to ever use require('buffer').Buffer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in a lint rule we use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TimothyGu Could you elaborate?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use it only in the lib/.eslintrc.yaml and not in the test/.eslintrc.yaml?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing here: tests are supposed to be self-contained scripts, while everything in lib is user-visible and -contaminatable, like a seemingly innocent var Buffer = {} line in REPL. See nodejs/node-convergence-archive#21.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TimothyGu Thank you.

@jasnell
Copy link
Member Author

jasnell commented Jun 13, 2017

Updated

one or more aliases may be used. Support for some encodings is enabled
only when Node.js is using the full ICU data.

<table>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If https://help.github.com/articles/organizing-information-with-tables/ works as advertised, github markdown tables would be easier to read in-source and to review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand (and I do not see anything in the referenced link) .. markdown tables do not support rowspan.

* `options` {object}
* `fatal` {boolean} `true` if decoding failures are fatal. Defaults to
`false`.
* `ignoreBOM` {boolean} When `true`, the `TextDecoder` will include the byte
Copy link
Contributor

@sam-github sam-github Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if withBOM (EDIT: or leaveBOM) would make more sense for this, if its not a fixed name due to the spec? I thought the docs might have had a typo and reversed the sense, because if the BOM is ignored it wouldn't be in the output, then I thought about it, and realized the sense of ignore is "don't remove from input stream if present". I'm not sure what would happen if the BOM is not in the input stream, though, would it get added to the decoded result?

Copy link
Member Author

@jasnell jasnell Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignoreBOM is fixed within the spec. To change this would require a change in the spec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I know this is all experimental, but FYI, its not clear to me from current docs whether a BOM is added if not present, or just passed through ("ignored") if present.

throw new errors.TypeError('ERR_INVALID_ARG_TYPE', 'input',
['ArrayBuffer', 'ArrayBufferView']);
}
if (options === null ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Web IDL spec treats null the same way as undefined. See https://heycam.github.io/webidl/#es-dictionary, step 4.1.2.

throw new errors.TypeError('ERR_INVALID_ARG_TYPE', 'options', 'object');
}

var flags = (options.stream === true) ? 0 : CONVERTER_FLAGS_FLUSH;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDL recognizes all truthy values as true, so just options.stream ?.

class TextDecoder {
constructor(encoding = 'utf-8', options = {}) {
if (typeof encoding !== 'string')
throw new errors.Error('ERR_INVALID_ARG_TYPE', 'encoding', 'string');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should coerce encoding to string rather than check its type, per IDL rules. See here for how it's done in URL.

constructor(encoding = 'utf-8', options = {}) {
if (typeof encoding !== 'string')
throw new errors.Error('ERR_INVALID_ARG_TYPE', 'encoding', 'string');
if (options !== undefined && typeof options !== 'object')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too: null should be treated just like undefined. Also, options cannot be undefined here due to the default parameter above.

}

decode(input = empty, options = {}) {
if (isAnyArrayBuffer(input)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SharedArrayBuffer is not supported at this moment in Web platform APIs that do not have [AllowShared] extended attribute. See https://heycam.github.io/webidl/#es-buffer-source-types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, understood. Unfortunately, the separate process.binding('util').isArrayBuffer() and process.binding('util').isSharedArrayBuffer() method no longer exist. It looks like those were condensed recently into a single IsAnyArrayBuffer(). I think this is one we can safely ignore for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems bad to ignore, as becoming spec-complaint later will be a backward-incompatible breaking change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed. I've added the isArrayBuffer() util method back in to process.binding('util')

src/node_i18n.cc Outdated
static void Decode(const FunctionCallbackInfo<Value>& args) {
Environment* env = Environment::GetCurrent(args);

CHECK_GE(args.Length(), 3); // Converter, ArrayBuffer, Flags
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/ArrayBuffer/Buffer/

src/node_i18n.cc Outdated

CHECK_GE(args.Length(), 2);
Utf8Value label(env->isolate(), args[0]);
int flags = args[1]->Uint32Value();
Copy link
Member

@TimothyGu TimothyGu Jun 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use the Local<Context> version as well.

src/node_i18n.cc Outdated
const char* source = input_obj_data;
size_t source_length = input_obj_length;

if (converter->unicode_ && !converter->bomSeen_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put && !converter->ignoreBOM_ here as well?

src/node_i18n.cc Outdated
}
source += bomOffset;
source_length -= bomOffset;
converter->bomSeen_ = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag should only be set when there actually is a BOM, i.e. bomOffset != 0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite ...

In: https://encoding.spec.whatwg.org/#interface-textdecoder

* If encoding is UTF-8, UTF-16BE, or UTF-16LE, and ignore BOM flag and BOM seen flag are unset, then:
  * If token is U+FEFF, then set BOM seen flag.
  * Otherwise, if token is not end-of-stream, then set BOM seen flag and append token to output.
  * Otherwise, return output.

UChar* target = *result;
ucnv_toUnicode(converter->conv,
&target, target + (limit * sizeof(UChar)),
&source, source + source_length,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ICU docs say:

For most Unicode charsets it is also possible to ignore the indicated number of initial stream bytes and start converting after them. However, there are stateful Unicode charsets (UTF-7 and BOCU-1) for which this will not work. Therefore, it is best to ignore the first output UChar instead of the input signature bytes.

AFAICT source is incremented by the signature bytes, so we are doing what the ICU docs advise not to do. Are the negative consequences applicable to us?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not believe so. TextDecoder does not support UTF-7 nor BOCU-1. We are only skipping the signature bytes for UTF-8, UTF-16le, and UTF-16be, none of which have this issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, in that case LGTM.

return 'utf-8';
}

encode(input) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input is an optional parameter too. See https://encoding.spec.whatwg.org/#interface-textencoder

}

encode(input) {
return new Uint8Array(lazyBuffer().from(String(input)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to use template literal syntax to prevent symbols from getting converted to strings. See

const str = `${val}`;
.

}

class TextEncoder {
constructor() {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be omitted?

@jasnell
Copy link
Member Author

jasnell commented Jun 14, 2017

Updated :-)
@TimothyGu ... as always, I'm loving the thorough review :-)

Copy link
Member

@TimothyGu TimothyGu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking much better, thanks a lot!


### textDecoder.decode([input[, options]])

* `input` {ArrayBuffer|TypedArray} An `ArrayBuffer` or `%TypedArray%` instance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also DataView. The %TypedArray% notation is generally reserved to specs, so do you think it is appropriate to use it here?


* `input` {ArrayBuffer|TypedArray} An `ArrayBuffer` or `%TypedArray%` instance
containing the encoded data.
* `options` {object}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Object

```js
const { TextEncoder } = require('buffer');
const encoder = new TextEncoder();
const int8array = encoder.encode('this is some data');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint8array?


class TextDecoder {
constructor(encoding = 'utf-8', options = {}) {
encoding = String(encoding);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Template string literal is needed here as well (to disallow symbols).


[inspect](depth, opts) {
if (this == null || this[kEncoding] === undefined) {
throw new errors.TypeError('ERR_INVALID_THIS', 'TextDecoder');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check should technically be used for the user-visible getters and the methods as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, for now I'd rather leave it just here tho as it's not a check we commonly do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, we are doing this for URLs (see Web IDL create a operation function step 2.1.2.4). I'll defer to your opinion regarding these classes though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend adding appropriate type-checking here if possible; not a big deal, but seems unnecessary to diverge from the spec and from URL in that way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@domenic ... just want to make sure I'm clear on what kind of checking you're recommending? instanceof or is the more efficient hidden-symbol check ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hidden symbol check for sure. Although ideally one that distinguishes encoder vs. decoder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping this.

// This is not perfect, but there's really nothing else to key off since
// there are no internal properties specific to TextEncoder
if (this == null || this[Symbol.toStringTag] !== 'TextEncoder') {
throw new errors.TypeError('ERR_INVALID_THIS', 'TextEncoder');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. encoding and encode should have this check too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the time being, I'd prefer not.


[inspect](depth, opts) {
// This is not perfect, but there's really nothing else to key off since
// there are no internal properties specific to TextEncoder
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:/ In this case I'd rather add a dummy property through a constructor, just so that subclasses that extend TextEncoder but override Symbol.toStringTag can work.

}

Local<ObjectTemplate> t = ObjectTemplate::New(env->isolate());
t->SetInternalFieldCount(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping this question...

var s = 0;
var e = label.length;
while (s < e && (
label[s] === '\u0009' ||
Copy link
Member

@TimothyGu TimothyGu Jun 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a regex like /[\t\n\f\r ]/ be faster?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in my experience.

@TimothyGu
Copy link
Member

TimothyGu commented Jun 15, 2017

A bug I noticed: https://encoding.spec.whatwg.org/#dom-textdecoder-decode step 1 unsets the BOM seen flag if do not flush flag resulted from a previous run is not set (i.e. this is either the first call to decode() or the first call after a decode() call with { stream: false }). However, this doesn't appear to be so:

> var dec = new buffer.TextDecoder('utf-8');
> [...dec.decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x20))].map(str => str.codePointAt(0).toString(16));
[ '20' ]
> [...dec.decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x20))].map(str => str.codePointAt(0).toString(16));
[ 'feff', '20' ]

The correct behavior seems to also be the one implemented by Chrome:

> var dec = new TextDecoder('utf-8');
> [...dec.decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x20))].map(str => str.codePointAt(0).toString(16));
[ '20' ]
> [...dec.decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x20))].map(str => str.codePointAt(0).toString(16));
[ '20' ]

Copy link
Member

@TimothyGu TimothyGu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more observation...

// ['hz-gb-2312', 'replacement'],
// ['iso-2022-cn', 'replacement'],
// ['iso-2022-cn-ext', 'replacement'],
// ['iso-2022-kr', 'replacement'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than mentioning these as "unsupported", we can in fact say they don't necessarily need to be supported, since TextDecoder errors out on 'replacement'.

throw new errors.Error('ERR_INVALID_ARG_TYPE', 'options', 'object');

const enc = getEncodingFromLabel(encoding);
if (enc === undefined || enc === 'replacement')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make sure no entries in encodings Map actually map to 'replacement' as value, we don't have to explicitly check for it here.

@jasnell
Copy link
Member Author

jasnell commented Jun 15, 2017

@TimothyGu ... updated!

Copy link
Member

@TimothyGu TimothyGu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as is, but some further observations are below. WPT integration can be worked on after this PR is merged.

I do have one more suggested edit: TimothyGu@0d0f7fc, which eliminates the need to copy from a Buffer to an Uint8Array.


[inspect](depth, opts) {
if (this == null || this[kEncoding] === undefined) {
throw new errors.TypeError('ERR_INVALID_THIS', 'TextDecoder');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, we are doing this for URLs (see Web IDL create a operation function step 2.1.2.4). I'll defer to your opinion regarding these classes though.

}

[inspect](depth, opts) {
if (this == null || this[kEncoding] === undefined) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unfortunately will not distinguish TextEncoder and TextDecoder objects, as both have this property defined.

@addaleax
Copy link
Member

This doesn’t land cleanly on 8.x; if you can, please follow the guide and raise a backport PR.

@Trott Trott removed the ctc-review label Jul 31, 2017
jasnell added a commit to jasnell/node that referenced this pull request Aug 1, 2017
Provide an (initially experimental) implementation of the WHATWG Encoding
Standard API (`TextDecoder` and `TextEncoder`). The is the same API
implemented on the browser side.

By default, with small-icu, only the UTF-8, UTF-16le and UTF-16be decoders
are supported. With full-icu enabled, every encoding other than iso-8859-16
is supported.

This provides a basic test, but does not include the full web platform
tests. Note: many of the web platform tests for this would fail by default
because we ship with small-icu by default.

A process warning will be emitted on first use to indicate that the
API is still experimental. No runtime flag is required to use the
feature.

Refs: https://encoding.spec.whatwg.org/
PR-URL: nodejs#13644
Reviewed-By: Timothy Gu <[email protected]>
Reviewed-By: Matteo Collina <[email protected]>
addaleax pushed a commit that referenced this pull request Aug 2, 2017
Provide an (initially experimental) implementation of the WHATWG Encoding
Standard API (`TextDecoder` and `TextEncoder`). The is the same API
implemented on the browser side.

By default, with small-icu, only the UTF-8, UTF-16le and UTF-16be decoders
are supported. With full-icu enabled, every encoding other than iso-8859-16
is supported.

This provides a basic test, but does not include the full web platform
tests. Note: many of the web platform tests for this would fail by default
because we ship with small-icu by default.

A process warning will be emitted on first use to indicate that the
API is still experimental. No runtime flag is required to use the
feature.

Backport-PR-URL: #14585
Backport-Reviewed-By: Anna Henningsen <[email protected]>

Refs: https://encoding.spec.whatwg.org/
PR-URL: #13644
Reviewed-By: Timothy Gu <[email protected]>
Reviewed-By: Matteo Collina <[email protected]>
@addaleax addaleax added backported-to-v8.x notable-change PRs with changes that should be highlighted in changelogs. and removed backport-requested-v8.x labels Aug 2, 2017
addaleax added a commit that referenced this pull request Aug 2, 2017
V8 6.0:

  The V8 engine has been upgraded to version 6.0, which has a significantly
  changed performance profile.
  [#14574](#14574)

  More detailed information on performance differences can be found at
  https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de

Other notable changes:

* **DNS**
  * Independent DNS resolver instances are supported now, with support for
    cancelling the corresponding requests.
    [#14518](#14518)

* **REPL**
  * Autocompletion support for `require()` has been improved.
    [#14409](#14409)

* **Utilities**
  * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has
    been implemented.
    [#13644](#13644)

* **Added new collaborators**
  * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
@addaleax addaleax mentioned this pull request Aug 2, 2017
addaleax added a commit that referenced this pull request Aug 3, 2017
V8 6.0:

  The V8 engine has been upgraded to version 6.0, which has a significantly
  changed performance profile.
  [#14574](#14574)

  More detailed information on performance differences can be found at
  https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de

Other notable changes:

* **DNS**
  * Independent DNS resolver instances are supported now, with support for
    cancelling the corresponding requests.
    [#14518](#14518)

* **N-API**
  * Multiple N-API functions for error handling have been changed to support
    assigning error codes.
    [#13988](#13988)

* **REPL**
  * Autocompletion support for `require()` has been improved.
    [#14409](#14409)

* **Utilities**
  * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has
    been implemented as an experimental feature.
    [#13644](#13644)

* **Added new collaborators**
  * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
addaleax added a commit that referenced this pull request Aug 7, 2017
V8 6.0:

  The V8 engine has been upgraded to version 6.0, which has a significantly
  changed performance profile.
  [#14574](#14574)

  More detailed information on performance differences can be found at
  https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de

Other notable changes:

* **DNS**
  * Independent DNS resolver instances are supported now, with support for
    cancelling the corresponding requests.
    [#14518](#14518)

* **N-API**
  * Multiple N-API functions for error handling have been changed to support
    assigning error codes.
    [#13988](#13988)

* **REPL**
  * Autocompletion support for `require()` has been improved.
    [#14409](#14409)

* **Utilities**
  * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has
    been implemented as an experimental feature.
    [#13644](#13644)

* **Added new collaborators**
  * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
cjihrig pushed a commit that referenced this pull request Aug 8, 2017
V8 6.0:

  The V8 engine has been upgraded to version 6.0, which has a significantly
  changed performance profile.
  [#14574](#14574)

  More detailed information on performance differences can be found at
  https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de

Other notable changes:

* **DNS**
  * Independent DNS resolver instances are supported now, with support for
    cancelling the corresponding requests.
    [#14518](#14518)

* **N-API**
  * Multiple N-API functions for error handling have been changed to support
    assigning error codes.
    [#13988](#13988)

* **REPL**
  * Autocompletion support for `require()` has been improved.
    [#14409](#14409)

* **Utilities**
  * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has
    been implemented as an experimental feature.
    [#13644](#13644)

* **Added new collaborators**
  * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
'defines': [
# ICU cannot swap the initial data without this.
# http://bugs.icu-project.org/trac/ticket/11046
'UCONFIG_NO_LEGACY_CONVERSION=1'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, Coverity is none too happy about this being re-enabled.

addaleax added a commit that referenced this pull request Aug 9, 2017
V8 6.0:

  The V8 engine has been upgraded to version 6.0, which has a significantly
  changed performance profile.
  [#14574](#14574)

  More detailed information on performance differences can be found at
  https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de

Other notable changes:

* **DNS**
  * Independent DNS resolver instances are supported now, with support for
    cancelling the corresponding requests.
    [#14518](#14518)

* **N-API**
  * Multiple N-API functions for error handling have been changed to support
    assigning error codes.
    [#13988](#13988)

* **REPL**
  * Autocompletion support for `require()` has been improved.
    [#14409](#14409)

* **Utilities**
  * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has
    been implemented as an experimental feature.
    [#13644](#13644)

* **Added new collaborators**
  * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
  * [gabrielschulhof](https://github.com/gabrielschulhof) – Gabriel Schulhof
cjihrig pushed a commit that referenced this pull request Aug 10, 2017
V8 6.0:

  The V8 engine has been upgraded to version 6.0, which has a significantly
  changed performance profile.
  [#14574](#14574)

  More detailed information on performance differences can be found at
  https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de

Other notable changes:

* **DNS**
  * Independent DNS resolver instances are supported now, with support for
    cancelling the corresponding requests.
    [#14518](#14518)

* **N-API**
  * Multiple N-API functions for error handling have been changed to support
    assigning error codes.
    [#13988](#13988)

* **REPL**
  * Autocompletion support for `require()` has been improved.
    [#14409](#14409)

* **Utilities**
  * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has
    been implemented as an experimental feature.
    [#13644](#13644)

* **Added new collaborators**
  * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
  * [gabrielschulhof](https://github.com/gabrielschulhof) – Gabriel Schulhof

Conflicts:
	src/node_version.h
TimothyGu added a commit to TimothyGu/node that referenced this pull request Sep 15, 2017
MylesBorins pushed a commit that referenced this pull request Sep 19, 2017
PR-URL: #13916
Refs: #13644 (comment)
Reviewed-By: Vse Mozhet Byt <[email protected]>
@gibfahn
Copy link
Member

gibfahn commented Jan 15, 2018

Release team decided not to land on v6.x, if you disagree let us know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. build Issues and PRs related to build files or the CI. c++ Issues and PRs that require attention from people who are familiar with C++. errors Issues and PRs related to JavaScript errors originated in Node.js core. i18n-api Issues and PRs related to the i18n implementation. notable-change PRs with changes that should be highlighted in changelogs. semver-minor PRs that contain new features and should be released in the next minor version. tools Issues and PRs related to the tools directory.
Projects
None yet
Development

Successfully merging this pull request may close these issues.