This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

fs.readFileSync(filename, 'utf8') doesn't strip BOM markers #1918

Closed
dobesv opened this issue Oct 21, 2011 · 11 comments

@dobesv

dobesv commented Oct 21, 2011

Environment: cloud9ide.com, node version 0.4.5

If I read a file using fs.readFileSync(filename, 'utf8') that is encoded using UTF8 with BOM, the BOM is included in the resulting string.

I think the routine to decode UTF8 is supposed to automatically strip the BOM from the start of the stream before returning the string.

@dobesv
Author

dobesv commented Oct 21, 2011

Workaround:

body = body.replace(/^\uFEFF/, '');

Apply this after reading a UTF-8 file when you are not sure whether it starts with a BOM.

@koichik

koichik commented Oct 21, 2011

If fs.readFileSync() strips the BOM automatically,

var text = fs.readFileSync('foo.txt', 'utf8');
fs.writeFileSync('foo.txt', text, 'utf8');

The BOM is lost...

@dobesv
Author

dobesv commented Oct 24, 2011

Hmm maybe it is something that was fixed in a more recent version of node.js?

@koichik

koichik commented Oct 24, 2011

No, I mean the BOM was lost from a file ('foo.txt') after fs.writeFileSync().
fs.writeFileSync() cannot add the BOM automatically because it depends on the application whether the BOM is necessary.
Therefore, I think that the BOM should not be removed automatically.

@koichik koichik closed this as completed Oct 24, 2011
@dobesv
Author

dobesv commented Nov 4, 2011

@koichik - can you clarify why you closed this issue? If I read a UTF-8 file into a string, it should not have a BOM in it. That's simply how UTF-8 decoding works: the BOM is not included in the decoded string.

Applications that expect the BOM to be present can add it back on when they write out the file, or to preserve the BOM they can read/write the file as binary.
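To illustrate the "read/write the file as binary" option mentioned above, here is a minimal sketch. It assumes a current Node.js release (for `Buffer.from` and `fs.mkdtempSync`); the scratch directory and file names are made up for the example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical scratch files, created only for this demonstration.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'bom-'));
const src = path.join(dir, 'with-bom.txt');
const dst = path.join(dir, 'copy.txt');

// UTF-8 signature (EF BB BF) followed by the ASCII bytes for "hi".
fs.writeFileSync(src, Buffer.from([0xef, 0xbb, 0xbf, 0x68, 0x69]));

// With no encoding argument, readFileSync returns a Buffer of raw bytes,
// so nothing is decoded and nothing can be stripped.
const raw = fs.readFileSync(src);
fs.writeFileSync(dst, raw);

// The copy is byte-for-byte identical, signature included.
const copied = fs.readFileSync(dst);
const bomPreserved =
  copied.equals(raw) &&
  copied[0] === 0xef && copied[1] === 0xbb && copied[2] === 0xbf;
console.log(bomPreserved);
```

The trade-off is that the application never gets a string this way; it has to decode the Buffer itself if it wants to inspect or change the text.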

@dobesv
Author

dobesv commented Nov 4, 2011

OK I read a huge argument about this subject on the python mailing list and a bug report on the JVM systems and I see that it is more controversial than I had originally thought.

So, never mind ... looks like it's up to programmers to remove the BOM from UTF-8 files themselves.

What they did in Python was interesting: they added a new encoding scheme called 'utf-8-sig', which strips the BOM if present when decoding and emits a BOM when encoding to bytes. This allows the programmer to decide whether to use a BOM or not.

See http://docs.python.org/library/codecs.html:

"On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file."

Do you think that approach would be acceptable for use in node?
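For reference, the behaviour quoted from the Python docs could be approximated in JavaScript roughly as follows. This is only a sketch: `decodeUtf8Sig` and `encodeUtf8Sig` are hypothetical names, not part of Node, and the code assumes a current Node.js release where `Buffer.from` and `Buffer.prototype.subarray` are available.

```javascript
// The three-byte UTF-8 signature described in the Python codecs docs.
const UTF8_SIGNATURE = Buffer.from([0xef, 0xbb, 0xbf]);

// Hypothetical helper: skip the signature if it is the first three bytes,
// then decode the rest as UTF-8.
function decodeUtf8Sig(bytes) {
  const hasSignature =
    bytes.length >= 3 && bytes.subarray(0, 3).equals(UTF8_SIGNATURE);
  return (hasSignature ? bytes.subarray(3) : bytes).toString('utf8');
}

// Hypothetical helper: always write the signature before the encoded text.
function encodeUtf8Sig(text) {
  return Buffer.concat([UTF8_SIGNATURE, Buffer.from(text, 'utf8')]);
}

const encoded = encodeUtf8Sig('hi');
console.log(encoded.length);          // 5 (3 signature bytes + 2 text bytes)
console.log(decodeUtf8Sig(encoded));  // hi
```

With a pair like this, choosing plain 'utf8' or the signature-aware helpers is the programmer's explicit decision on both the read and the write side.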

@koichik

koichik commented Nov 4, 2011

You can easily write a utility (e.g. myfs.readUtf8FileSync()) in userland.
So... I do not think that it is necessary to include 'utf8-sig' in Node core.
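A userland utility along those lines might look like the sketch below. The function name follows the example given in the comment; its exact behaviour is an assumption, since the thread does not spell one out.

```javascript
const fs = require('fs');

// Hypothetical userland helper: read a file as UTF-8 and drop a leading
// U+FEFF, which is what the UTF-8 signature decodes to.
function readUtf8FileSync(filename) {
  const text = fs.readFileSync(filename, 'utf8');
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

module.exports = { readUtf8FileSync };
```

Because the helper lives in application code, fs.readFileSync() itself stays unchanged, and code that needs to keep the BOM is unaffected.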

@MylesPenlington

If the utf8 file has a BOM, then in the latest node (0.6.18) it leaves the first character of the string as unicode 65279 - which is 0xFE 0xFF - which is not what was read (that is the utf16 BOM?) - as the utf8 signature on a utf8 file is 0xef, 0xbb, 0xbf - so the current file reading does not really make sense at all.

@dobesv
Author

dobesv commented May 24, 2012

Hi Myles,

It is confusing, but it makes sense in a way: when you decode those three bytes using the UTF-8 decoding algorithm, you get U+FEFF (65279), the code point of the 16-bit BOM, as the first single character.
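This can be checked directly. The snippet below is a small sketch assuming a current Node.js release (for `Buffer.from`); the variable names are only illustrative.

```javascript
// The UTF-8 signature bytes as they appear at the start of a file.
const signature = Buffer.from([0xef, 0xbb, 0xbf]);

// Decoding them as UTF-8 yields exactly one character.
const decoded = signature.toString('utf8');
const codePoint = decoded.charCodeAt(0);

console.log(decoded.length);         // 1
console.log(codePoint);              // 65279
console.log(codePoint.toString(16)); // feff

// The same code point encoded as UTF-16LE gives the two-byte BOM instead.
const utf16leBytes = Buffer.from('\uFEFF', 'utf16le');
console.log(utf16leBytes);           // <Buffer ff fe>
```

So the string holds the code point U+FEFF, not the bytes 0xFE 0xFF; the byte sequence only depends on which encoding is used to write that code point out.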


@tracker1

@TimothyGu
Member

Also https://www.npmjs.org/package/strip-bom

rajkumar42 added a commit to rajkumar42/omnisharp-vscode that referenced this issue Jul 18, 2016
rajkumar42 added a commit to dotnet/vscode-csharp that referenced this issue Jul 18, 2016