BOM-aware Unicode encodings #17

Open
lifthrasiir opened this issue Nov 1, 2013 · 2 comments

Comments

@lifthrasiir
Owner

This issue was spotted during the removal of TextEncoder and TextDecoder (#4). TextDecoder can automatically strip a leading BOM (U+FEFF) from the input, if present. We need to emulate this in a separate encoding, perhaps BOMAwareUTF8Encoding (whose whatwg_name() is still utf-8)? This use case itself is better handled by decoders with a fallback encoding (#19), but we may still need BOM-attached Unicode encodings from time to time: many applications of UTF-16 require a BOM, for example.
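To make the intended behavior concrete, here is a minimal sketch of BOM-aware UTF-8 handling using only the standard library. The names `UTF8_BOM`, `decode_utf8_strip_bom` and `encode_utf8_with_bom` are hypothetical and not part of this library's API. The sketch assumes that only a single leading BOM is dropped on decoding and that the encoder always writes one.

```rust
/// UTF-8 encoding of U+FEFF, the byte order mark.
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

/// Hypothetical helper: decodes `input` as UTF-8, dropping one leading BOM if present.
fn decode_utf8_strip_bom(input: &[u8]) -> Result<&str, std::str::Utf8Error> {
    let body = if input.starts_with(UTF8_BOM) {
        &input[UTF8_BOM.len()..]
    } else {
        input
    };
    std::str::from_utf8(body)
}

/// Hypothetical helper: encodes `text` as UTF-8 and always prepends a BOM.
fn encode_utf8_with_bom(text: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(UTF8_BOM.len() + text.len());
    out.extend_from_slice(UTF8_BOM);
    out.extend_from_slice(text.as_bytes());
    out
}
```

With these helpers, decoding the output of `encode_utf8_with_bom` returns the original text, and input without a BOM decodes unchanged.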

@SimonSapin
Collaborator

I think that BOMAwareUTF8Encoding is the wrong approach. Rather, what’s needed is what the spec calls decode.

It could be a BOMDecoder (or some other name) that takes a "fallback encoding" parameter. When the input starts with a BOM, the BOM is stripped and the corresponding encoding is used. Otherwise, the fallback encoding is used.

This decoder should always be used for formats that support multiple encodings, because the BOM, by its proximity to the content, is more accurate than other metadata.
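A rough sketch of that idea, using only the standard library. Every name here (`Detected`, `sniff_bom`, `decode_with_bom`, `latin1`) is hypothetical, and the sketch makes several assumptions: the fallback is passed as a plain function pointer rather than an encoding object, malformed input is replaced with U+FFFD, and `latin1` is a simple byte-to-code-point mapping that stands in for a real fallback encoding.

```rust
/// Encodings that a BOM can identify in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Detected {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Returns the encoding indicated by a leading BOM and the BOM length in bytes.
fn sniff_bom(input: &[u8]) -> Option<(Detected, usize)> {
    if input.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((Detected::Utf8, 3))
    } else if input.starts_with(&[0xFF, 0xFE]) {
        Some((Detected::Utf16Le, 2))
    } else if input.starts_with(&[0xFE, 0xFF]) {
        Some((Detected::Utf16Be, 2))
    } else {
        None
    }
}

/// Decodes UTF-16 bytes, replacing unpaired surrogates and a trailing odd byte with U+FFFD.
fn decode_utf16(body: &[u8], big_endian: bool) -> String {
    let units = body.chunks(2).map(|pair| match pair {
        [a, b] if big_endian => u16::from_be_bytes([*a, *b]),
        [a, b] => u16::from_le_bytes([*a, *b]),
        _ => 0xFFFD, // incomplete code unit at the end of input
    });
    std::char::decode_utf16(units)
        .map(|r| r.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .collect()
}

/// If `input` starts with a BOM, strips it and uses the matching encoding;
/// otherwise hands the untouched input to `fallback`.
fn decode_with_bom(input: &[u8], fallback: fn(&[u8]) -> String) -> String {
    match sniff_bom(input) {
        Some((Detected::Utf8, n)) => String::from_utf8_lossy(&input[n..]).into_owned(),
        Some((Detected::Utf16Le, n)) => decode_utf16(&input[n..], false),
        Some((Detected::Utf16Be, n)) => decode_utf16(&input[n..], true),
        None => fallback(input),
    }
}

/// Stand-in fallback: maps each byte to the code point with the same value.
fn latin1(input: &[u8]) -> String {
    input.iter().map(|&b| b as char).collect()
}
```

A caller would write something like `decode_with_bom(bytes, latin1)`; the fallback only runs when no BOM is found, so the BOM always takes precedence over whatever encoding the caller guessed from other metadata.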

@lifthrasiir
Owner Author

@SimonSapin I have updated the description. I agree that this use case should be handled some other way; see #19 for a separate discussion. A BOM-aware encoding might still be useful on its own, though.
