Encode hook not adequate #235

annevk · 2020-10-20T09:35:52Z

While working on whatwg/url#557 I realized that the URL standard will have to invoke the Encoding Standard at a lower level of abstraction as it needs to deal with erroneous output (&#...;) differently from non-erroneous output. I.e., an error that results in &#...; might have to be percent-encoded, but a non-error &#...; sequence might not have to be.

So URL basically wants to invoke the encoder's handler directly I think. I don't really see a better abstraction as it needs to deal with errors in a very different way. I suppose we could make error handling a caller defined set of steps, but I don't really like the complexity of that.

text/plain form submission could in theory still use the current high-level encode hook, but I'm not sure it's worth saving just for that.

It also seems there's potentially quite a lot of other potential cleanup that would result from this (e.g., https://encoding.spec.whatwg.org/#concept-encoding-process no longer needs to handle "html").

Having "html" in the Encoding standard as well as this high-level hook was intentional as a way of limiting the amount of badness that could be introduced by consumers, but we will have to use review rather than abstractions for that instead (UTF-8 or die). It's unfortunate, but it might also make the Encoding standard a little leaner.

cc @andreubotella @hsivonen @achristensen07 @JKingweb

The text was updated successfully, but these errors were encountered:

hsivonen · 2020-10-20T10:17:08Z

FWIW, encoding_rs provides "html" as a built-in capability and Firefox uses this for form submission. Then the URL implementation in Firefox uses the lower-level API that exposes the error code point instead of performing replacement.

In that sense, leaving "html" in the spec would leave software and the spec as corresponding to each other better than removing "html" and only providing the low-level abstraction. Within the spec, though, "html" should layer on top of the low level.

annevk · 2020-10-20T10:37:28Z

Notes:

URL parsing and application/x-www-form-urlencoded use shared infrastructure in the specification. text/plain does not. Unsure about multipart/form-data. (application/x-www-form-urlencoded could in theory still use encode/"html" as well because its percent-encode set encompasses &, #, and ;, but the way we structure it in the URL Standard seems nicer.)
If we want to directly invoke the encoder's handler, the ISO-2022-JP encoder should really do a state transition to ASCII before returning the error. (This is currently handled by the process algorithm putting these ASCII error code points on input so they get encoded, but that would no longer happen.) Presumably no other encoder is affected as they are not stateful?

This way you can invoke its handler directly without going the process algorithm as is needed to fix #235. This also avoids the need for "prepend" in the "process" algorithm. Additionally, this commit adds two clarifying asserts to the "process" algorithm documenting what error modes can be in effect when.

annevk · 2020-10-20T12:42:07Z

I think I have a proposal that works:

We change the "fatal" branch of https://encoding.spec.whatwg.org/#concept-encoding-process to return result rather than a new error. This way the error's code point information remains preserved. (Perhaps also do some editorial cleanup, e.g., "if result is [an] error")
We introduce a new specification hook, "encode or fail" that the URL Standard can use. It uses the "fatal" mode, but as opposed to "UTF-8 decode without BOM or fail" it returns what it decoded so far and an error if it hit one (as a tuple I suppose). It expects to be invoked again with an empty output buffer as long as it returns an error. This works because the only stateful encoding we have is effectively reset (at least after Editorial: make ISO-2022-JP encoder perform error state switch #236) at the point where it returns an error. (Alternatively we export "run", "encoder", and "error" so the URL Standard can use them directly, but that seems slightly less elegant.)

Thoughts appreciated!

hsivonen · 2020-10-21T07:10:21Z

Makes sense to me. For non-browser users of the spec, it would be prudent to document that ISO-2022-JP may be left in the Roman state, so the safe bytes available for replacement are the ones that have the ASCII interpretation even in the Roman state.

This avoids the need for "prepend" in the "process" algorithm as is needed to fix #235. Additionally, this commit adds two clarifying asserts to the "process" algorithm documenting what error modes can be in effect when. Related tests: web-platform-tests/wpt#26158.

annevk · 2020-10-21T09:20:03Z

Encode or fail would not work as the encoder can be in one of two states when returning an error. So I think we need to do the alternative proposal or a variant of that whereby the caller keeps the encoder alive so it can retain state.

Fixes #235.

annevk mentioned this issue Oct 20, 2020

Percent-encoding of query wrong for ISO-2022-JP whatwg/url#557

Closed

annevk mentioned this issue Oct 20, 2020

Editorial: make ISO-2022-JP encoder perform error state switch #236

Merged

annevk mentioned this issue Oct 21, 2020

Refactor query state to operate on a buffer whatwg/url#558

Merged

3 tasks

annevk closed this as completed in #236 Oct 21, 2020

annevk reopened this Oct 21, 2020

annevk added a commit that referenced this issue Oct 21, 2020

Add get an encoder and encode or fail for URLs

46594d4

Fixes #235.

annevk added a commit that referenced this issue Oct 22, 2020

Add get an encoder and encode or fail for URLs

beecec9

Fixes #235.

annevk closed this as completed in c55584b Oct 23, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Encode hook not adequate #235

Encode hook not adequate #235

annevk commented Oct 20, 2020

hsivonen commented Oct 20, 2020

annevk commented Oct 20, 2020

annevk commented Oct 20, 2020

hsivonen commented Oct 21, 2020

annevk commented Oct 21, 2020

Encode hook not adequate #235

Encode hook not adequate #235

Comments

annevk commented Oct 20, 2020

hsivonen commented Oct 20, 2020

annevk commented Oct 20, 2020

annevk commented Oct 20, 2020

hsivonen commented Oct 21, 2020

annevk commented Oct 21, 2020