Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode hook not adequate #235

Closed
annevk opened this issue Oct 20, 2020 · 5 comments · Fixed by #236
Closed

Encode hook not adequate #235

annevk opened this issue Oct 20, 2020 · 5 comments · Fixed by #236

Comments

@annevk
Copy link
Member

annevk commented Oct 20, 2020

While working on whatwg/url#557 I realized that the URL standard will have to invoke the Encoding Standard at a lower level of abstraction as it needs to deal with erroneous output (&#...;) differently from non-erroneous output. I.e., an error that results in &#...; might have to be percent-encoded, but a non-error &#...; sequence might not have to be.

So URL basically wants to invoke the encoder's handler directly I think. I don't really see a better abstraction as it needs to deal with errors in a very different way. I suppose we could make error handling a caller defined set of steps, but I don't really like the complexity of that.

text/plain form submission could in theory still use the current high-level encode hook, but I'm not sure it's worth saving just for that.

It also seems there's potentially quite a lot of other potential cleanup that would result from this (e.g., https://encoding.spec.whatwg.org/#concept-encoding-process no longer needs to handle "html").

Having "html" in the Encoding standard as well as this high-level hook was intentional as a way of limiting the amount of badness that could be introduced by consumers, but we will have to use review rather than abstractions for that instead (UTF-8 or die). It's unfortunate, but it might also make the Encoding standard a little leaner.

cc @andreubotella @hsivonen @achristensen07 @JKingweb

@hsivonen
Copy link
Member

FWIW, encoding_rs provides "html" as a built-in capability and Firefox uses this for form submission. Then the URL implementation in Firefox uses the lower-level API that exposes the error code point instead of performing replacement.

In that sense, leaving "html" in the spec would leave software and the spec as corresponding to each other better than removing "html" and only providing the low-level abstraction. Within the spec, though, "html" should layer on top of the low level.

@annevk
Copy link
Member Author

annevk commented Oct 20, 2020

Notes:

  1. URL parsing and application/x-www-form-urlencoded use shared infrastructure in the specification. text/plain does not. Unsure about multipart/form-data. (application/x-www-form-urlencoded could in theory still use encode/"html" as well because its percent-encode set encompasses &, #, and ;, but the way we structure it in the URL Standard seems nicer.)
  2. If we want to directly invoke the encoder's handler, the ISO-2022-JP encoder should really do a state transition to ASCII before returning the error. (This is currently handled by the process algorithm putting these ASCII error code points on input so they get encoded, but that would no longer happen.) Presumably no other encoder is affected as they are not stateful?

annevk added a commit that referenced this issue Oct 20, 2020
This way you can invoke its handler directly without going the process algorithm as is needed to fix #235. This also avoids the need for "prepend" in the "process" algorithm.

Additionally, this commit adds two clarifying asserts to the "process" algorithm documenting what error modes can be in effect when.
@annevk
Copy link
Member Author

annevk commented Oct 20, 2020

I think I have a proposal that works:

  1. We change the "fatal" branch of https://encoding.spec.whatwg.org/#concept-encoding-process to return result rather than a new error. This way the error's code point information remains preserved. (Perhaps also do some editorial cleanup, e.g., "if result is [an] error")
  2. We introduce a new specification hook, "encode or fail" that the URL Standard can use. It uses the "fatal" mode, but as opposed to "UTF-8 decode without BOM or fail" it returns what it decoded so far and an error if it hit one (as a tuple I suppose). It expects to be invoked again with an empty output buffer as long as it returns an error. This works because the only stateful encoding we have is effectively reset (at least after Editorial: make ISO-2022-JP encoder perform error state switch #236) at the point where it returns an error. (Alternatively we export "run", "encoder", and "error" so the URL Standard can use them directly, but that seems slightly less elegant.)

Thoughts appreciated!

@hsivonen
Copy link
Member

Makes sense to me. For non-browser users of the spec, it would be prudent to document that ISO-2022-JP may be left in the Roman state, so the safe bytes available for replacement are the ones that have the ASCII interpretation even in the Roman state.

annevk added a commit that referenced this issue Oct 21, 2020
This avoids the need for "prepend" in the "process" algorithm as is needed to fix #235. 

Additionally, this commit adds two clarifying asserts to the "process" algorithm documenting what error modes can be in effect when.

Related tests: web-platform-tests/wpt#26158.
@annevk annevk reopened this Oct 21, 2020
@annevk
Copy link
Member Author

annevk commented Oct 21, 2020

Encode or fail would not work as the encoder can be in one of two states when returning an error. So I think we need to do the alternative proposal or a variant of that whereby the caller keeps the encoder alive so it can retain state.

annevk added a commit that referenced this issue Oct 21, 2020
annevk added a commit that referenced this issue Oct 22, 2020
@annevk annevk closed this as completed in c55584b Oct 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

2 participants