Proposal: add option to automatically add BOM to write methods #11767
Comments
The BOM is part of the file; removing or adding it would modify the data going to/from disk, and trigger another set of bug reports along the lines of "why is the data I read in Node smaller than the file on disk?" and "why does this file I wrote have these strange bytes in front?". It's a bit awkward, but I think explicit is better than guessing the user's intentions here, though it might be possible to introduce new encoding names with
If you write as ucs2le (utf16le) and don't know that you have to manually add the BOM (even though the BOM is mandatory in that encoding), you create invalid data. I had to find a discussion (linked before) where somebody had to figure out the Unicode character to add the BOM, meaning it's not obvious.
I think that Core should have an extra affordance that takes care of this.
--
*David Gasperoni*
What is your specific suggestion? What "different write methods" would you like? /cc @srl295
I imagined an option like addBom for writing and stripBom for reading, but it could also be "bom: auto|on|off" to leave the stream as it is, either way. With "auto" it would be on for encodings like utf16le but off for UTF-8, for example.
The UTF-8 byte order mark (EF BB BF), while not common and discouraged, is in use. It's trivial to strip from incoming data, but I predict endless discussions on whether it should be added or not to outgoing data.
Exactly, at that point we're just talking defaults. To me, auto in write methods would not add BOM if the
In my experience, if you encode in utf16le and don't include a BOM, many readers won't be able to interpret it. BBEdit 11.6 on a Mac and VS Code 1.10.1, also on a Mac, either can't open the file in the latter case or misinterpret it in the former.
I think you miss my point. You should file a pull request if you feel it's a worthwhile addition, but you should be prepared for lots of discussion when it's a convenience thing like an 'auto' mode (or even an on/off mode; stripping and inserting BOMs is, after all, just a convenience).
Indeed, I see many of those previous issues are coming from a convenience point of view. The usability of a language/environment comes down to these things too.
I'll sketch up the PR to get the conversation going.
What is the use case for writing UTF-16 at all?
Unfortunately there is business software (HP SmartStream Designer) that doesn't distinguish between UTF-8 and ANSI, because the former is backward-compatible, and whose understanding of Unicode is UCS-2 (utf16le). This was my use case; I'm sure there are many more.
I think you might be better off wrapping or monkey-patching
Personally, as a user of the language, I expect the runtime to know that if I choose the
OK, so you need the BOM for UTF-8. It seems reasonable to me to have a convenient way to do that.
But you don't need to use UTF-16, correct? A bit of trivia about the BOM for UTF-16: today, per the Encoding Standard, the utf-16 decoder is more robust with respect to the BOM and encoding label, but the standard does not specify an encoder, because browsers do not need one and everyone should be using only UTF-8.
I admit that I'm not an expert in encodings, or in the standards about them. In my case, I can only speak from what I experience using them, and this is how I ran into the issue:
I'm on a Mac, and I have several text editors to try opening the test file with. In order: TextEdit, BBEdit, Visual Studio Code, Safari, Hex Fiend. Then I did the following:
And without making more screenshots, just trust me that the same apps interpreted the file just fine.
No, I'll try to explain better: I need to feed this software text files, which it uses as records. It usually happens that there are characters (like the one above in my last comment). If I save the output as
This is just anecdotal for the rest of the world, I guess, but I thought it showed a "hole" in the assumption of Node trying to write to
I think you would probably be better off using utf-8 with a BOM than using utf-16 (any variant). Alternatively use utf-8 without a BOM and configure your editors to default to utf-8 instead of "Ansi". |
The editors are not the problem… it's the specific software that expects either UCS-2 or UTF-16. My example with the various editors was to show that they can't either read or detect UTF-16 without BOM, so it's not just me 😄 |
The key challenge to writing the BOM automatically is that the stream interface is quite agnostic to the encoding right now. It would be fairly straightforward, however, to create a lightweight wrapper interface in userland that does this... something like:

```js
const BomStream = require('...');
const fs = require('fs');

const out = fs.createWriteStream('data');
const bomout = new BomStream.Utf16LeStream(out);
bomout.write('some data');
```

While I am quite sympathetic to the problem, I don't believe we should be adding support for this in core.
Hmm, maybe I misnamed this issue and created confusion along the way. What I'm proposing is a way to say
I guess it should also be smart enough to not do anything if the encoding being passed is one that doesn't use BOMs.
What's the use case for writing UTF-16 without a BOM? Shouldn't a BOM automatically be added when UTF-16 output is requested?
That's exactly what brought me here to discuss this.
This seems stalled (and seems to me like something that should be solved as a published module before consideration for adding to core, but reasonable people can disagree on that). I'm going to close this, but if that's misguided because there's active work going on or for some other reason, by all means, comment to that effect (or re-open if GitHub allows you to).
Considering @mcdado's proposal, I looked again at the BOM definition from RFC 2781. An excerpt from Section 3.2, Byte Order Mark (BOM):
According to the Node.js documentation for the fs.writeFile method, data can be a string, Buffer, TypedArray, or DataView. When the user is passing a string, the data length is fixed, or can be said to represent a single stream with known length. As the first position of a single stream that contains multibyte characters will represent the BOM, it would be handy to have an option to add the BOM (e.g.:
I stumbled into this same issue when trying to properly display East Asian characters in Excel on macOS, in a CSV file generated with Node.js. It may not be too intuitive for users if they have to manually add the BOM to the data. When the option is there, users can get better insight into how to deal with writing multibyte characters into a file properly.
I wrote an article about this issue, including the trial and error needed to properly write multibyte characters into a CSV file that can later be read across different applications.
@mikaelfs thank you for your input. I hope that at least this Issue has the SEO juice to bubble up in search results and help confused people searching for a solution. 🙂 |
Today I found out that you need to manually add a Unicode representation of the Byte Order Mark in Unicode files/streams.
The fact that you have to manually prepend it leads to confusion, IMHO. I think it would be better to add an
addBom
(or something like that) option to the different write methods, which would remove the manual step from the process.