-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider adding TextEncoder.containsLoneSurrogates() static #174
Comments
I was thinking of a different suggestion instead: adding Currently it feels incosistent that |
It's a little different though. The contract of an encoding is to provide a mapping between a byte sequence and a scalar value sequence. In particular this excludes a code point sequence. From that perspective it makes more sense to defer that concern to a higher layer. If we don't want to do that anymore, we'd also have to support streaming I think, in case your input ends with a lone surrogate, but a subsequent call might provide the other lone surrogate that together form a scalar value. |
We already don't support streaming [well], because currently this will result in two replacement characters. It seems better to throw explicitly on such cases, and, as you said, defer handling of streaming to the upper layer. |
I'm trying to say that if we want to support code point sequence as an input, and add fatal handling for lone surrogates, we also need to handle streaming them for consistency. Or we defer both problems to a higher level as the current design tries to. |
Ah I think now I understand what you're trying to say - that right now |
Correct, it happens at the IDL layer due to |
I see. I think I'd be still in favour of handling them in Otherwise with something like |
Firefox has SIMD-accelerated code for checking if a UTF-16 string is valid UTF-16. I guess it makes sense to expose it to the Web. It's less clear that Firefox currently doesn't have a mechanism to signal errors for unpaired surrogates in UTF-16 to UTF-8 conversion, because previously there has not been a use case for such detection. So implementing that approach would involve writing some new (non-glue) code. I'd like to see if the event issue can be fixed quickly in browsers. |
Also, for the event issue, both checking the whole string for lone surrogates and making UTF-16 to UTF-8 conversion report unpaired surrogates are overkills: It's enough to check if the last code unit of the JS string is an unpaired surrogate. |
It would be symmetrical to TextDecoder though. Even if JS gains own support for string encoding/decoding in the future, it seems useful to have this level of consistency between existing APIs in WHATWG Encoding.
Sure, but I think that this usecase (asking for only valid UTF-8 output) is common enough to warrant an implementation if/when API is agreed on. Firefox can use the existing SIMD-based check implementation internally, but I don't think that the actual shape of public-facing APIs should be driven by existing internal implementations... |
s/valid UTF-8/lossless UTF-8/. I don't think it has been demonstrated that asking for losslessness is going to be a common use case. So far, experience with the Servo style engine in particular suggests that the Web, despite Hyrum's Law generally ruling everything, is remarkably free of relying on unpaired surrogates.
It shouldn't. I'm just mentioning the implementation distance for particular approaches. |
Also, regarding whether this is going to be common, I think it's a notable data point that Firefox doesn't need error-signaling UTF-16 to UTF-8 conversion internally. (It does have error-signaling potentially-invalid UTF-16 to potentially-invalid UTF-8 comparison operation, but maybe it could get away with not having even that one.) |
I'm somewhat worried that the reason is that exactly that alternative is not exposed in the API and developers are not aware that it's currently lossy by default. (I personally wasn't aware of this until running into this issue.) |
I think that would be a good thing to have, but that's orthogonal to this, because a fatal option will not help us. We want to completely ignore strings which contain unpaired surrogates, so we definitely don't want runtime exceptions!
Indeed, I agree that is very unfortunate. But for our use case we still wouldn't want
Web code may not rely upon unpaired surrogates per se, but they definitely rely on the ability for invalid JS strings to roundtrip correctly (e.g. the double Unfortunately when encoding to UTF-8 you lose the ability to roundtrip. |
@Pauan Hmm, that doesn't sound really orthogonal to my let result;
try {
result = strictEncoder.encode(someString);
} catch {
result = null;
}
// here `result` is what you're describing, still without separate API or going through the string twice |
@RReverser Catching exceptions is about ~135 times slower than returning And you might have to put in extra checking to make sure that you're catching the right exception (and not some unrelated exception). It's quite common for DOM APIs to return |
wasm-bindgen already does this try-catch in lots of places, including inside existing
Which APIs are you refering to? Usually they throw |
Yes, but we plan to move away from that (especially in areas where we know errors cannot occur). Our goal is to be as fast as possible (zero-cost ideally), at least as fast as JS, and in the future faster-than-JS.
I admit they're not quite the same as |
I'm not saying that it's unimportant, because the Web Platform doesn't have at already. I'm saying that Firefox is a large codebase that deals with UTF-16 and UTF-8 and gets away with not having it internally. It's not conclusive, but it's a data point.
The input event issue is a problem, but I think one problem shouldn't be extrapolated into an assumption that it's going to be a common case. |
Do you have data point other than the double input event bug? Indeed, I think it's better to analyze it as a bug in Chrome's and Firefox's Windows input integration than to analyze it as a fundamental unpaired surrogate dependency of the Web Platform. Notably, the new (since Firefox 57) CSS OM in Firefox does not round-trip unpaired surrogates, and that hasn't been a problem. |
@Pauan Sure, but the error case is rare and not something I'd optimise for, while the success case should be fast with either approach and dominated by the string encoding performance and not error handling. In this case having idiomatic JS APIs feels more important than relatively small speed improvements, especially since in the future WASM might switch to some host bindings encoding approach, while the API is here to stay for JS developers too.
Yeah, but all of these are not really failures (like |
@hsivonen I actually wasn't referring to the input bug - I don't often work with UIs, so it's less interesting for me - but more to the encoding of arbitrary JS strings to UTF-8 in general, both in browser and Node.js (which both have |
@RReverser in what scenarios did you encounter issues? It would help to have specific examples. |
That's a fair perspective. I also hadn't ever encountered unpaired surrogates until this issue. I think we still need to have support for unpaired surrogates in Rust (for completeness and thoroughness). However, if the This is pretty slow, but since unpaired surrogates are almost nonexistent, I think that's acceptable.
For many years However, I just now tested it, and was pleasantly surprised that you are right: In any case, since it looks like the |
Oh yeah I know, but it's been fixed in V8 for ~1.5 years now :) |
@annevk Sorry I missed this! I can't point to a specific issue because, as I said, until recently I haven't even realised I have one :) But these days I'm working on a mixed JS + WASM parser where JS API can be send strings to WASM and get ASTs with parsed strings backand the opposite direction for the codegen. This is just one particular usecase, but both in this case and in any other that involves binary formats, native code or WASM, it feels better to have explicit detection and reporting of any errors instead of silently changing them to something else. As mentioned above, this feels very similar to |
With lookbehind support in JavaScript regular expressions, this functionality can trivially be implemented in userland: function isWellFormed(string) {
const reLoneSurrogate = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;
return !reLoneSurrogate.test(string);
} With the function isWellFormed(string) {
const reLoneSurrogate = /\p{Surrogate}/u;
return !reLoneSurrogate.test(string);
} That might explain why it hadn't been considered for standardization before. Still, if this comes up often enough in standards discussions, that might be sufficient motivation to add something like a static |
Sure, there's lots of ways to implement it (such as this). The reason for this suggestion was for performance, not functionality (because the browser can implement it faster). The reason we cared about performance is because this function would be called a lot (every time a string is sent from JS to Rust). However, we managed to narrow our use case down to only checking |
cc @gibson042 |
Why? There's still plenty of places strings with lone surrogates can come from (as mentioned above). |
Are there? We've only encountered them with To be clear, unpaired surrogates are invalid UTF-16, and so anything that generates them is (in my opinion) a bug. If you want to deal with binary data, you should be using something else like |
JavaScript doesn't claim UTF-16 compatibility though, so it's not really a bug, but rather part of the language and so such strings should be taken into account in any APIs that interact with JS IMO (especially in encoders/decoders). As mentioned above, I'm usually working with parsers in JavaScript. However, even without going into parsers that parse JavaScript itself, the simplest example can be created even with built-in > JSON.parse(String.raw`"\uD800"`)
"\ud800" Here the input is perfectly valid UTF-8 / UTF-16, but the output is not, and has to be checked whether when passing string to WASM or encoding it to the disk, otherwise you're risking silent data loss that is not trivial to debug. (It's not hard to imagine user calling Hence my suggestion to enhance |
I think I'm convinced that adding this to |
I am very aware of that. That doesn't change the fact that breaking UTF-16 is a really bad idea, which is why I said that it is (in my opinion) a bug. To be clear, I'm saying it's a bug in the application. Whether it's a bug in the language or not is a different question. And as @hsivonen has said, based on Firefox's experience the web generally doesn't generate or interact with unpaired surrogates (since pretty much every JS API never produces unpaired surrogates). So even though technically JS isn't UTF-16, in practice it is, because nobody actually generates unpaired surrogates.
I think that's debatable. Sure, from the perspective of the consumer, before they run But from the perspective of the producer, they had a string which was invalid UTF-16, and then they called So I would say that that is a bug in the producer, since they should have never created an invalid string in the first place. Basically, except in contrived examples, somebody messed up and generated an invalid string. And so it's their responsibility to fix that. So I'm asking for non-contrived examples of where unpaired surrogates were generated.
Like I said earlier, I think that's a good idea, but it's orthogonal to |
Maybe, but pointing elsewhere doesn't exactly make debugging any easier (rather proves my point that, in the absence of a checked API, the debugging of such cases and finding out the root cause can quickly become a hell).
I don't mind moving this to It's just At the same time, JS on its own doesn't care about interop with external APIs or binary formats and traditionally works with lone surrogates in all built-in APIs. |
I think the notion of "data loss" is overblown. Unpaired surrogates don't have any semantics, so in that sense replacing an unpaired surrogate doesn't lose any meaning from the string. However, it is imaginable that it could be harmful for the equality relation of (erroneous) strings to differ on different sides of a boundary, so that two strings compare unequal in JS but equal in Wasm. See a similar-themed but different Chrome bug where the bug was caused by two UTF-8 decoders generating a different number of U+FFFDs for erroneous byte sequences. Round-tripping equality is the main issue for being able to round-trip file paths on Windows in Rust. The Web Platform is somewhat different in the sense that there aren't similar places where the platform gives you unpaired surrogates in a non-bug way (either on its own or giving you unpaired surrogates persisted by someone else) and expects their equality to be preserved. Specification-wise the Web Platform tries hard not to hand you unpaired surrogates. However, if you yourself create unpaired surrogates, the Web Platform most of the time preserves them for the duration of the execution of a JS program. |
@RReverser what's the JS assertion based on? E.g., https://tc39.github.io/ecma262/#sec-encode throws on lone surrogates afaict. |
For the original use case here, I think it would still be preferable to get the Gecko and Chromium keyboard event bugs fixed instead of adding facilities for optimizing workarounds. It would be good to do so before EdgeHTML fades away, as it now keeps the Web from assuming the Gecko/Chromium bug.
|
I think more often than not, it would be used to throw a validation error. |
Chatting with some TC39 folks it seems there would be support for well-formed string APIs if someone did the work. It's not clear to me we need to track that here so I'm inclined to close this, but will give it some time in case people feel otherwise. |
Draft ECMAScript proposal by @guybedford: https://github.com/guybedford/proposal-is-usv-string |
This came up in in the IDL for Wasm discussion at WebAssembly/interface-types#21 (comment) (bit before and after is also relevant context) and @Pauan suggested it might be faster than rustwasm/wasm-bindgen#1348 (comment).
cc @alexcrichton @lukewagner
The text was updated successfully, but these errors were encountered: