-
Notifications
You must be signed in to change notification settings - Fork 29.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discussion of implementing the Encoding Standard with ICU, and associated web platform tests #13646
Comments
first step will be to get the WPT encoding tests working. Because full-icu will be required, those will need to go into their own special bucket. Once set up, however, we can really start to get into verifying that things actually do work as spec'd. Currently, this is definitely going to be entirely limited by the extent of ICU's coverage of the encoding standard details. |
@jungshik can probably comment most authoritatively on Chromium's ICU changes. Patches are at https://cs.chromium.org/chromium/src/third_party/icu/patches/ - most changes to handle Encoding including labels have been upstreamed (behind a compile flag). |
I pushed encoding_rs to mozilla-inbound moments ago, so hopefully a Firefox nightly with an Encoding Standard-compliant back end will be available for experimentation soon. (I say compliant back end, because there are known integration bugs around BOM and EOF handling that need to be fixed as follow-ups.) As part of the landing, I adjusted some Web Platform Tests in mozilla-central. My understanding is that they will get upstreamed in due course, but @jgraham would know on what schedule. As for the pending Web Platform Tests from the W3C i18n Core WG, I believe that to the cases that Firefox with encoding_rs doesn't pass are test suite bugs (so commented on each relevant issue of the Encoding Standard repo). encoding_rs has some CC0 test cases baked into it. Many of those should be easy enough to extract mechanically for use with other implementations, but I don't currently have the available cycles to do it myself. (Example of Rust-emdebbed tests. Example of separate-file input and reference.) The obvious case where Chromium is compliant but Node using ICU only probably won't be is UTF-8 error handling, where Chromium is compliant by using WTF's UTF-8 decoder instead of ICU's. |
I changed a KOI8-U test and an ISO-2022-JP test. |
@r12a should get credit for those tests, not me. |
Oh goodness, my apologies for mixing up peoples' names! I guess I skipped the stuff beyond the first/last letter. |
@domenic no worries :) Btw, i've been a little distracted recently by other things, but i know the tests need a little attention - it's just a question of finding the time. But if others want to suggest specific changes i'd be happy to consider them. |
mozilla-central has worker-compatible non-WPT tests that might be of interest.
encoding_rs in now on Nightly. |
We've started landing the WPTs, although they are not in a nice JSON-consumable format like URL were, so getting them running in Node is going to be much harder :-/. |
We'll get that part figured out once the tests stablize a bit more. As I mentioned previously, the other thing we're going to need to figure out before we can test everything properly is the whole full-icu requirement. Currently, the only decoders supported out of the box by the experimental implementation are UTF-8, UTF-16le and UTF-16be. |
Chrome's ICU encoding changes have not been upstreamed. What's upstreamed is to exclude the ICU's encoding related C/C++ codes NOT relevant to web. ICU encoding table changes in Chromium's copy are not just labels. Encoding tables are changed a lot. See tables whose names end with html.ucm See also eucjp-gen.sh, euckr-gen.sh, sjis-gen.sh, single_byte_gen.sh and ibm866_gen.sh in |
I'm going to try to loop @TimothyGu into this issue, mostly because encodings seem to be something he cares about and WPT is something he's implemented in Node.js for the Feel absolutely free to close this if you don't think it's likely to get any traction any time in the next year. It's been dormant for the last year already. That doesn't mean it's not worthwhile, of course. |
FWIW, encoding_rs has C and C++17 bindings (and writing other kind of C++ bindings is an easy copy-paste job), so using it for Node might be easier than dealing with importing the ICU patches from Chromium if you can accept a Rust dependency.
ICU has become compliant on this particular point (though I haven't tested to verify) during the past year. |
There's been no further discussion on this in nearly 2 years. We have the |
I originally wrote this as a comment on #13644, but I realized it was getting a bit far afield, so I wanted to split it out into a separate issue.
/cc @inexorabletash @ricea @jungkees as my Chromium encoding peeps, @hsivonen from Mozilla implementations, and @annevk as spec editor; this thread is about Node.js implementing the Encoding Standard, via ICU, and also about web platform tests. Feel free to unsubscribe if not interested.
Anyway:
I'd be curious about the web platform test pass rates for #13644 with full-icu, although I understand that might reflect more on ICU's conformance to the Encoding Standard than on the code in #13644. I have heard Chromium has a fork of some kind around ICU to better comply with the Encoding Standard, although I vaguely remember it being mostly around label processing, not around actual encoding.
@ricea also wrote a bunch of web platform tests a while back that never ended up landing, see https://github.com/whatwg/encoding/labels/tests; maybe we should make a renewed push for those (although I see @hsivonen has some corrections based on his work). I am happy to help land, although I am not useful for reviewing from an encodings perspective.
The text was updated successfully, but these errors were encountered: