BOM-aware Unicode encodings #17

lifthrasiir · 2013-11-01T17:48:57Z

This issue was spotted during the removal of TextEncoder and TextDecoder (#4). TextDecoder can automatically strip the BOM (U+FEFF) from the start of the input, if one is present. ~~We need to emulate this in a separate encoding, perhaps BOMAwareUTF8Encoding (whose whatwg_name() is still utf-8)?~~ This use case itself is better handled by decoders with a fallback encoding (#19), but we may still need BOM-attached Unicode encodings from time to time: many applications of UTF-16 require a BOM, for example.
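As a rough illustration of what a "BOM-attached" encoding would do on the encoder side, here is a minimal sketch using only the Rust standard library. The function name `encode_utf16le_with_bom` is hypothetical and is not part of this library's API; it only assumes that such an encoding emits the BOM (U+FEFF) before the payload.

```rust
/// Hypothetical helper (not an existing API of this crate): encodes `s` as
/// UTF-16LE and prepends the byte order mark U+FEFF, which serializes to the
/// bytes FF FE in little-endian order.
pub fn encode_utf16le_with_bom(s: &str) -> Vec<u8> {
    // Two bytes for the BOM plus at most two bytes per UTF-8 input byte.
    let mut out = Vec::with_capacity(2 + s.len() * 2);
    // The BOM is just U+FEFF encoded like any other code unit.
    out.extend_from_slice(&0xFEFFu16.to_le_bytes());
    for unit in s.encode_utf16() {
        out.extend_from_slice(&unit.to_le_bytes());
    }
    out
}
```

A consumer that requires the BOM can then tell the byte order from the first two bytes alone.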


SimonSapin · 2013-11-01T17:56:37Z

I think that BOMAwareUTF8Encoding is the wrong approach. Rather, what’s needed is what the spec calls decode.

It could be a BOMDecoder (or some other name) that takes a "fallback encoding" parameter. When the input starts with a BOM, the BOM is stripped and the corresponding encoding is used. Otherwise, the fallback encoding is used.

This decoder should always be used for formats that support multiple encodings, because the BOM (by proximity) is more accurate than other metadata.
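The BOM check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `select_encoding` is a hypothetical name rather than an API of this library, encodings are represented as plain label strings, and only the UTF-8, UTF-16BE, and UTF-16LE BOMs are recognized. Actual decoding of the remaining bytes is left to the chosen encoding.

```rust
/// Hypothetical sketch of BOM sniffing with a fallback encoding.
/// Returns the label of the encoding to use and the input with any
/// recognized BOM removed. If no BOM is found, the caller-supplied
/// `fallback` label is returned and the input is left untouched.
pub fn select_encoding<'a>(
    input: &'a [u8],
    fallback: &'static str,
) -> (&'static str, &'a [u8]) {
    if input.starts_with(&[0xEF, 0xBB, 0xBF]) {
        // U+FEFF encoded as UTF-8
        ("utf-8", &input[3..])
    } else if input.starts_with(&[0xFE, 0xFF]) {
        // U+FEFF encoded as UTF-16, big-endian
        ("utf-16be", &input[2..])
    } else if input.starts_with(&[0xFF, 0xFE]) {
        // U+FEFF encoded as UTF-16, little-endian
        ("utf-16le", &input[2..])
    } else {
        (fallback, input)
    }
}
```

For example, `select_encoding(b"\xEF\xBB\xBFhi", "windows-1252")` picks `"utf-8"` and hands back only the bytes of `hi`, while input without a BOM keeps the fallback label.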

lifthrasiir · 2013-11-01T18:18:28Z

@SimonSapin I have updated the description. I agree that this use case should be handled differently; see #19 for a separate discussion. A BOM-aware encoding might still be useful on its own, though.

lifthrasiir mentioned this issue Nov 1, 2013

Unicode decoder with a fallback encoding #19

Open

lifthrasiir added this to the 0.4 ("1.0" minus the language stability) milestone Nov 22, 2014
