Specify UTF-8 as the character set #17

Closed
sunfishcode opened this issue Apr 4, 2019 · 26 comments

Comments

@sunfishcode
Member

Continuing the discussion from bytecodealliance/wasmtime#86:

WASI currently doesn't document the character sets used for filesystem paths, command-line arguments or environment variables.

Two high-level strategies have been proposed:

  • Just use UTF-8, and say that WASI can't directly interact with non-UTF-8-encodable strings from the outside world. Where needed, provide escape-hatch features in the API (e.g., you can't open a file with an unencodable name by name, but you can get to it by iterating through a directory).
  • Use uninterpreted byte sequences, and then additional functions for translating to and from UTF-8, as described here.
@kg

kg commented Apr 4, 2019

My suggestion for dealing with per-platform/per-filesystem normalization constraints would be to expose an explicit API call for normalizing paths, so that an application can do that once and compare the strings to identify whether they were changed and avoid paying the cost of normalization on every IO call.
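A minimal sketch of that "normalize once, then compare" idea, using Python's standard `unicodedata` module. The helper name `normalize_path` and the choice of NFC as the default are assumptions made for illustration; they are not part of any WASI API.

```python
import unicodedata
from typing import Tuple


def normalize_path(path: str, form: str = "NFC") -> Tuple[str, bool]:
    """Hypothetical helper: return the normalized path and whether it changed."""
    normalized = unicodedata.normalize(form, path)
    return normalized, normalized != path


# "é" spelled as "e" followed by a combining acute accent (decomposed form).
decomposed = "cafe\u0301.txt"
normalized, changed = normalize_path(decomposed)

# A second call on the already-normalized name reports no change, so the
# application can cache the result instead of normalizing on every IO call.
_, changed_again = normalize_path(normalized)
```

An application would call the helper once per path and only fall back to slower handling when the returned flag is true.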

IIRC at least one platform requires this now? I think Apple's APFS?

@ghost

ghost commented Apr 4, 2019

I strongly vote for UTF-8. I have never seen a filesystem path non-representable by UTF-8 in practice.

@npmccallum

I also strongly agree with UTF-8 everywhere.

@Hywan

Hywan commented Apr 12, 2019

Agree with UTF-8.

acfoltzer referenced this issue in CraneStation/wasi-common Nov 6, 2019
This changes the fields on the builder to types that make the various `.arg()`, `.env()`, etc. methods
infallible, so we don't have to worry about handling any errors until we actually build. This reduces
line noise when using a builder in a downstream application.

Deferring the processing of the builder fields also has the advantage of eliminating the opening and
closing of `/dev/null` for the default stdio file descriptors unless they're actually used by the
resulting `WasiCtx`.

Unicode errors when inheriting arguments and environment variables no longer cause a panic, but
instead go through `OsString`. We return `ENOTCAPABLE` at the end if there are NULs, or if UTF-8
conversion fails on Windows.

This also changes the bounds on some of the methods from `AsRef<str>` to `AsRef<[u8]>`. This
shouldn't break any existing code, but allows more flexibility when providing arguments. Depending
on the outcome of https://github.com/WebAssembly/WASI/issues/8 we may eventually want to require
these bytes be UTF-8, so we might want to revisit this later.

Finally, this fixes a tiny bug that could arise if we had exactly the maximum number of file
descriptors when populating the preopens.
kubkon referenced this issue in CraneStation/wasi-common Nov 7, 2019
* fix Linux `isatty` implementation

* defer `WasiCtxBuilder` errors to `build()`; don't change API yet

This changes the fields on the builder to types that make the various `.arg()`, `.env()`, etc. methods
infallible, so we don't have to worry about handling any errors until we actually build. This reduces
line noise when using a builder in a downstream application.

Deferring the processing of the builder fields also has the advantage of eliminating the opening and
closing of `/dev/null` for the default stdio file descriptors unless they're actually used by the
resulting `WasiCtx`.

Unicode errors when inheriting arguments and environment variables no longer cause a panic, but
instead go through `OsString`. We return `ENOTCAPABLE` at the end if there are NULs, or if UTF-8
conversion fails on Windows.

This also changes the bounds on some of the methods from `AsRef<str>` to `AsRef<[u8]>`. This
shouldn't break any existing code, but allows more flexibility when providing arguments. Depending
on the outcome of https://github.com/WebAssembly/WASI/issues/8 we may eventually want to require
these bytes be UTF-8, so we might want to revisit this later.

Finally, this fixes a tiny bug that could arise if we had exactly the maximum number of file
descriptors when populating the preopens.

* make `WasiCtxBuilder` method types less restrictive

This is a separate commit, since it changes the interface that downstream clients have to use, and
therefore requires a different commit of `wasmtime` for testing. That `wasmtime` commit is currently
on my private fork, so this will need to be amended before merging.

Now that failures are deferred until `WasiCtxBuilder::build()`, we don't need to have `Result` types
on the other methods any longer.

Additionally, using `IntoIterator` rather than `Iterator` as the trait bound for these methods is
slightly more general, and saves the client some typing.

* enforce that arguments and environment variables are valid UTF-8

* remove now-unnecessary platform-specific OsString handling

* `ENOTCAPABLE` -> `EILSEQ` for failed arg/env string conversions

* fix up comment style

* Apply @acfoltzer's fix to isatty on Linux to BSD
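The validation these commits describe can be sketched roughly as follows. This is an illustrative Python version, not the `wasi-common` Rust code: the function name `check_arg` is hypothetical, and treating embedded NULs the same way as failed UTF-8 conversions (both reported as `EILSEQ`) is an assumption, since the commit list only names `EILSEQ` for failed string conversions.

```python
import errno


def check_arg(raw: bytes) -> int:
    """Hypothetical check: return 0 if raw is acceptable, else errno.EILSEQ."""
    # Assumption: an embedded NUL is rejected with the same error code.
    if b"\x00" in raw:
        return errno.EILSEQ
    try:
        raw.decode("utf-8")  # strict decoding rejects ill-formed sequences
    except UnicodeDecodeError:
        return errno.EILSEQ
    return 0


ok = check_arg("--name=café".encode("utf-8"))
bad_encoding = check_arg(b"--name=caf\xe9")  # Latin-1 byte, not valid UTF-8
bad_nul = check_arg(b"--name=a\x00b")
```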
@sunfishcode
Member Author

I propose we go with UTF-8, which is overwhelmingly supported by commenters here and in other feedback I've received, and has a lot of obvious advantages. To my knowledge, it's what most implementations have started doing as well.

Concerning efficiency, string-oriented APIs already don't play to WASI's strengths -- if there are APIs which involve a lot of string passing, we should look for ways to let users create and pass around handles instead, as that has several advantages, character encoding being just one.

API design that de-emphasises strings

As an example, consider iterating over the files in a large directory. With fd_readdir as it currently exists, this involves transcoding all the filenames into UTF-8, even if the names aren't needed. If we wish to avoid this overhead, we could change fd_readdir to return directory-entry handles instead, from which filenames could be requested via additional API calls, or which could be used to open files without naming them.

Accessing unnameable files

One other issue is the ability to open files which don't have encodable names. This is very uncommon, but it should be possible to do from applications willing to make a little extra effort. My idea here is to encode an arbitrary byte sequence in UTF-8 as follows:

  • replace unencodable bytes with U+FFFD, in the manner of other relevant standards which do such things
  • for strings which lose information that way, append a NUL followed by a hash of the first part of the string along with the lost information -- specific syntax TBD, but all encoded within UTF-8 bytes. NUL is not a valid filename character in Windows or Unix, so this won't collide with actual files. The hash allows us to detect if the string is modified by code that doesn't understand the encoding scheme, so we can report errors instead of doing bogus things.
  • change WASI's APIs to use ptr+length strings rather than NUL-terminated strings (most already do, but currently the command-line and environment-variable APIs don't)

This way, applications that can use ptr+length could treat these strings as black boxes and retain the ability to open any file. Applications that use C-style strings would work on all encodable names, but see lossy strings in unencodable cases. For many use cases, this will be fine, because unencodable filenames are very rare. And when they do happen, the U+FFFDs should at least make it fairly obvious what's happening.
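To make the scheme above concrete, here is a rough Python sketch. The comment leaves the exact syntax "TBD", so everything after the NUL here is invented for illustration: the suffix layout (hex of the original bytes, a colon, then a truncated SHA-256 digest) and the function names `to_lossy_name` and `from_lossy_name` are hypothetical.

```python
import hashlib


def _digest(lossy: str, payload: str) -> str:
    # Hash covers the readable part and the lost information together.
    return hashlib.sha256((lossy + "\0" + payload).encode("utf-8")).hexdigest()[:16]


def to_lossy_name(raw: bytes) -> str:
    """Hypothetical encoder: valid UTF-8 passes through; otherwise U+FFFD + NUL suffix."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        lossy = raw.decode("utf-8", errors="replace")  # bad bytes become U+FFFD
        payload = raw.hex()                            # the "lost information"
        return f"{lossy}\0{payload}:{_digest(lossy, payload)}"


def from_lossy_name(name: str) -> bytes:
    """Hypothetical decoder: recover the original bytes or raise ValueError."""
    if "\0" not in name:
        return name.encode("utf-8")
    lossy, suffix = name.split("\0", 1)
    payload, _, digest = suffix.partition(":")
    if digest != _digest(lossy, payload):
        raise ValueError("name was modified by code unaware of the escaping")
    return bytes.fromhex(payload)


name = to_lossy_name(b"report-\xff.txt")
```

The result is always encodable as UTF-8, and code that edits the visible part without understanding the suffix triggers the hash check rather than silently opening the wrong file.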

@Serentty

Serentty commented Dec 8, 2019

Both Windows and Linux have their quirks in this regard. Windows allows pretty much any string of 16-bit numbers as a filename, including unpaired UTF-16 surrogates, whereas Linux leaves the issue of actually interpreting filenames up to applications, so files don't have to be named in the same encoding throughout a drive, and can be nearly any bag of bytes.
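A short Python illustration of both quirks, using only strict standard-library codecs. The helper names are made up for this example.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if a Linux-style byte-string name is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_well_formed_utf16(units: bytes) -> bool:
    """True if a Windows-style name (little-endian 16-bit units) is valid UTF-16."""
    try:
        units.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# Linux accepts this name, but it is not UTF-8 (0xFF never appears in UTF-8).
linux_name = b"data-\xff.bin"

# Windows-style name "a", unpaired high surrogate 0xD800, "b" as 16-bit units.
windows_name = b"a\x00" + b"\x00\xd8" + b"b\x00"

linux_ok = is_valid_utf8(linux_name)
windows_ok = is_well_formed_utf16(windows_name)
```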

Interestingly, I see a lot of software bend over backwards to support unpaired surrogates in filenames for the sake of Windows, despite them not appearing often at all in practice, and yet I never see the same attempt being made for Linux.

I think the approach mentioned above of using UTF-8 with null-based hash escape sequences for unrepresentable characters is definitely worth considering. It allows WASI to give very strong guarantees to software about the encoding of filenames, which is something that most platforms can't do.

@sunfishcode changed the title from "Character Set" to "Specify UTF-8 as the character set" on Feb 19, 2020
@indolering

> I think the approach mentioned above of using UTF-8 with null-based hash escape sequences for unrepresentable characters is definitely worth considering. It allows WASI to give very strong guarantees to software about the encoding of filenames, which is something that most platforms can't do.

Prohibiting control characters would prevent some attacks as well. DWheeler's Fixing Unix/Linux/POSIX Filenames documents a lot of existing prohibitions at the application level. His desire to prohibit leading dashes - is understandable, but probably too restrictive.

@sunfishcode
Member Author

That's a great essay. I agree that it'd be good to see if we can prohibit control characters in filenames while we're here.

I also agree that prohibiting leading - feels like it might be too restrictive. Filenames are end-user facing, so a possible guideline we could use here is: how would a user who only uses computers through GUIs perceive this restriction? For control characters, it's difficult to see how they would even be able to observe the restriction. For leading -, such a user wouldn't expect filenames like "---> hello <---" or "-40 degrees in Whitefish" to be invalid, even if they are uncommon.

@indolering

> For leading -, such a user wouldn't expect filenames like "---> hello <---" or "-40 degrees in Whitefish" to be invalid, even if they are uncommon.

I'm very into the Lang Sec philosophy, but where does it end? The only time you would insert a control character on purpose is as an attack on some input sanitizer. But quotes, Unicode literals, and others are just as dangerous as a leading dash, just in a different context.

If we are going to scrub filenames, then for the sake of performance we could parameterize the function to allow developers to specify things like a blocklist, a specific Unicode normalization, or even WTF-8. That way we can provide a sane default, but allow developers to tweak it as needed.

@indolering

> Interestingly, I see a lot of software bend over backwards to support unpaired surrogates in filenames for the sake of Windows, despite them not appearing often at all in practice

Can we quantify this somehow?

@sunfishcode
Member Author

> I'm very into the Lang Sec philosophy, but where does it end? The only time you would insert a control character on purpose is as an attack on some input sanitizer. But quotes, Unicode literals, and others are just as dangerous as a leading dash, just in a different context.

I'm exploring the idea of a GUI guideline as a place we could draw the line. There probably isn't a way to input a control character, unpaired surrogate, or other invalid encoding in typical GUI filename fields. But users could plausibly type - or a quote into a filename field without even intending to be malicious.

> If we are going to scrub filenames, then for the sake of performance we could parameterize the function to allow developers to specify things like a blocklist, a specific Unicode normalization, or even WTF-8. That way we can provide a sane default, but allow developers to tweak it as needed.

Could you say more about how this might work? If the developer of a wasm module picks, e.g., a normalization that differs from the one used by the host environment the user wants to run the module in, it seems like it could be very inefficient.

@devsnek
Member

devsnek commented Oct 19, 2020

I have a weak preference for leaving the requirement at valid UTF8. Security requirements are highly circumstantial, so imo it makes sense for us to provide the weakest invariant and let others build on top.

@indolering

> I'm exploring the idea of a GUI guideline as a place we could draw the line. There probably isn't a way to input a control character, unpaired surrogate, or other invalid encoding in typical GUI filename fields. But users could plausibly type - or a quote into a filename field without even intending to be malicious.

I would agree with you if we had control over the entire stack, i.e. we were shipping our own OS. Even then you would need to interoperate with other systems, which is why Windows has a POSIX mode. The Unicode identifier standard(s) would be of use here....

> I have a weak preference for leaving the requirement at valid UTF8. Security requirements are highly circumstantial, so imo it makes sense for us to provide the weakest invariant and let others build on top.

It's a dangerous path to tread down. At a minimum, I would try to at least repair the surrogates. Even then, I believe that code written in UCS-2 languages (like JavaScript) would trigger weird errors.

FWIW, there are only ~1 million unicode points, so exhaustive testing is doable.

@sunfishcode
Member Author

I have the beginnings of an experiment using the NUL-escaping technique outlined above. I'm not yet convinced that this is the right path to take, and it's still evolving, but it illustrates the idea. What this would let us do is have filenames, command-line args and environment variables inside WASI always be valid UTF-8, but still have the ability to access any filename in any filesystem, POSIX-style or Windows-style. It doesn't yet address the control codes idea, but that could be added in.

@indolering

Apparently Raku defines a special lossless/round-trippable encoding for filenames, UTF8-C8.

@sunfishcode
Member Author

I missed UTF8-C8 when I did a survey before, so thanks for bringing that up! Looking into it, the idea of using a private-use codepoint as an escape is clever, but it's unfortunate that translation from Raku's UTF8-C8 to UTF-8 is lossy; two different byte sequences round-trip to the same byte sequence:

> say Buf.new(0xf4,0x8f,0xbf,0xbd,0x78,0x46,0x46).decode('utf8-c8').encode();
utf8:0x<F4 8F BF BD 78 46 46>
> say Buf.new(0xff).decode('utf8-c8').encode();
utf8:0x<F4 8F BF BD 78 46 46>

Raku itself probably avoids this problem because Raku code usually decodes into Raku's internal string representation, which remembers the difference, but in WASI we'll be communicating strings to other languages which won't, so we'd get filename collisions.

It would be possible to make a variant of Raku's UTF8-C8 which encodes U+10FFFD in the manner of unrecognized bytes:

    0xf4,0x8f,0xbf,0xbd,'x','F','4',
    0xf4,0x8f,0xbf,0xbd,'x','8','F',
    0xf4,0x8f,0xbf,0xbd,'x','B','F',
    0xf4,0x8f,0xbf,0xbd,'x','B','D',

which would fix the round-tripping. It's otherwise a simple and relatively readable encoding. A remaining downside is that there would exist Unicode filenames for which an application that knows the name would fail to open them by that name, though it would be very rare in practice.
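The variant described above can be sketched in Python as follows. This is an illustration of the idea only, not Raku's implementation: the function names are hypothetical, and leaning on Python's `surrogateescape` error handler to locate the undecodable bytes is just an implementation convenience.

```python
ESC = "\U0010FFFD"  # private-use code point used as the escape marker
HEX = "0123456789ABCDEF"


def _escape(byte: int) -> str:
    return f"{ESC}x{byte:02X}"


def decode_variant(raw: bytes) -> str:
    """Bytes -> valid Unicode text; undecodable bytes and literal ESC are escaped."""
    out = []
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:        # an undecodable byte
            out.append(_escape(cp - 0xDC00))
        elif ch == ESC:                   # a literal U+10FFFD in the input
            out.extend(_escape(b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def encode_variant(text: str) -> bytes:
    """Inverse of decode_variant: turn escape sequences back into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        digits = text[i + 2:i + 4]
        is_escape = (text[i] == ESC and text[i + 1:i + 2] == "x"
                     and len(digits) == 2 and all(d in HEX for d in digits))
        if is_escape:
            out.append(int(digits, 16))
            i += 4
        else:
            out.extend(text[i].encode("utf-8"))
            i += 1
    return bytes(out)


a = decode_variant(b"\xff")
b = decode_variant(b"\xf4\x8f\xbf\xbdxFF")
```

With the literal U+10FFFD escaped, the two byte sequences from the Raku session decode to different strings, and each one encodes back to its own original bytes.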

Another approach is Python's surrogateescape encoding, however this encodes undecodable bytes as lone surrogates, so it has the disadvantage of not producing valid UTF-8.
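For reference, this is how `surrogateescape` behaves in Python itself. The round trip is lossless, but the intermediate string holds a lone surrogate, so a strict UTF-8 encoder rejects it.

```python
raw = b"caf\xe9.txt"  # Latin-1 encoded name; 0xE9 is not valid UTF-8 here

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = name.encode("utf-8", errors="surrogateescape")

# Strict UTF-8 encoding fails, which is the disadvantage noted above.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```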

And there's NUL-escaping, as discussed above. This is based on the observation that neither POSIX-family platforms nor Windows permit NULs in filesystem names, so we can use NUL as an escape byte without collisions. ARF strings are an experimental prototype of this. It's trickier in the case of filenames with invalid encodings, but has no special cases for filenames with valid encodings. NUL-escaping has an element of optimism -- on filesystems which require valid Unicode filenames, NUL-escaping could be omitted entirely, leaving behind no special cases.

@indolering

I like the NUL escaping idea, I just randomly ran across Raku's thing last night.

FWIW, I believe the standard practice is to preserve encoding and fall back to bag-of-bytes to access it. Altering the filename encoding can cause issues. For example, OS X normalizing to NFD was considered a mistake because the retrieved name would not match the (vastly more common) NFC encoding.

While we are at it, the web is almost entirely in NFC form....

@programmerjake

NUL escaping seems like it would work poorly for C: C would treat the first NUL as an end-of-string marker and truncate it there.
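The truncation is easy to observe from Python through `ctypes`, which reads a `char *` the way C string functions do. The example name and the bytes after the NUL are placeholders, not a real ARF string.

```python
import ctypes

# A name with U+FFFD (EF BF BD in UTF-8), then NUL, then escape data.
escaped = b"report-\xef\xbf\xbd.txt\x00escape-data"

# c_char_p.value stops at the first NUL, just like strlen() or strcpy().
seen_by_c = ctypes.c_char_p(escaped).value
```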

@sunfishcode
Member Author

sunfishcode commented Oct 21, 2020

The ARF-string version of NUL escaping takes a "make the easy things easy, and the hard things possible" approach to C. The vast majority of paths are valid in practice, so they'll look and act completely normal to C (or better, because there's no need to guess the encoding).

In the uncommon case of an invalid path, with ARF strings, unaware C code will see a string containing U+FFFDs plus some non-printing characters to help avoid collisions, which will usually lead to it simply not finding a file by that name. C developers would have a choice of whether to (a) modify their code to be aware of ARF strings (and there's a library to help), or (b) just accept that their code can't open such files (which might be reasonable in some contexts, since they're so rare in practice).

@qwertie

qwertie commented Oct 22, 2020

If I understand correctly,

  1. Unix filenames can be almost any sequence of 8-bit values
  2. Windows filenames can be almost any sequence of 16-bit values (though with more restrictions than Unix)
  3. No major OS allows '/' or '\0' in filenames
  4. WASI programs ought to be able to open any existing file, and get a directory listing that includes non-unicode names; users should be able to click a filename to open a file even if it contains control characters or whatever
  5. It should be possible to write WASI backup software and software that is able to behave the same as native software, implying a need to be able to create files with strange names.

So, how about this?

  1. reserve '/' as a path separator (which works fine on Windows, of course) with backslash also allowed on Windows (are Macs no longer allowing ':' as a separator?)
  2. allow a filename to be any sequence of bytes that the OS allows, with UTF-8 translated to UTF-16 on Windows
  3. but allow "weird" filenames (lone surrogates and control characters) only if the API is called with an additional flag to allow it, with the default recommendation being that this flag be included when reading or writing files but not when creating files.
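A rough sketch of how the flag in this proposal might behave, assuming names are held as Python strings in which lone surrogates can appear. The function names and the `allow_weird` parameter are hypothetical; no such WASI flag exists in this thread.

```python
import unicodedata


def is_weird(name: str) -> bool:
    """True if the name contains control characters (Cc) or lone surrogates (Cs)."""
    return any(unicodedata.category(ch) in ("Cc", "Cs") for ch in name)


def check_name(name: str, allow_weird: bool = False) -> bool:
    """Hypothetical validation of a single path component."""
    # '/' is reserved as the separator and NUL is never allowed.
    if not name or "/" in name or "\0" in name:
        return False
    return allow_weird or not is_weird(name)


plain = check_name("notes.txt")
bell_default = check_name("bad\x07name")
bell_allowed = check_name("bad\x07name", allow_weird=True)
```

Following the recommendation in the list, a caller would pass `allow_weird=True` when opening existing files and leave it off when creating new ones.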

Other issues come to mind:

(6) A lot of Windows software is limited to reading/writing files whose entire path is shorter than 260 characters (MAX_PATH). There's no need for WASI to have this limitation, but when creating files it may be wiser for interoperability reasons to stick to the limit. (Note that in general it is always possible to exceed the limit, since a folder can be renamed to a longer name, which could then make a file inside exceed the limit. Also, on POSIX it's complicated.)

(7) Windows uses drive letter prefixes like "C:" and "C:/". This obviously generalizes to "protocols" like "http://", which could also, in principle, be supported by a "file system" interface. But wait, aren't "C:" and "http:" considered valid filenames on Unix? I notice that ':' is used on Unix as a separator in PATH as well as being a former path separator on MacOS, so maybe it belongs on the list of discouraged characters everywhere. This wouldn't really solve the conflict though; if the "weird filenames flag" is used then you can't tell if "c:/foo" is an absolute or relative path, except by checking which OS you are on.

@qwertie

qwertie commented Oct 22, 2020

I understand this issue goes beyond filenames to command-line arguments and environment variables, but aren't there separate issues for all of these? I noticed that Unix has "command-line arguments" as a core concept - a process receives a list of arguments - whereas on Windows a process receives a single string, not a list. Any C standard library then does something (what?) to parse that string and send it to main(char* argv[]). So if you want to pass a single command-line argument that contains spaces to a Windows program, what is the officially correct way to do that? Admittedly I've been coding for decades on Windows and never stumbled upon a good answer.

(you might say this is "separate" from the character set issue, but it's not quite separate, since issues of parsing are interrelated with issues of character set, e.g. / is excluded from the character set of "filename" in order to meet the needs of parsing paths.)

@programmerjake

Windows parses command lines using CommandLineToArgvW. Wine source:
https://github.com/wine-mirror/wine/blob/6d801377055911d914226a3c6af8d8637a63fa13/dlls/shell32/shell32_main.c#L85

@kg

kg commented Oct 22, 2020

> reserve '/' as a path separator (which works fine on Windows, of course) with backslash also allowed on Windows (are Macs no longer allowing ':' as a separator?)

I'd have to double-check, but I believe under normal circumstances : is also prohibited in filenames on Windows, because it's used for both drive letters and alternate streams (https://docs.microsoft.com/en-us/sysinternals/downloads/streams) so it would be reasonable to also prohibit it in WASI paths. On the other hand, they might work in \\?\ paths...

In general on Windows you have classic Win32 paths with MAX_PATH and other restrictions like not being able to name a file ... Then there's the File Namespace accessed with \\?\ which passes the path directly to the filesystem, which means constraints are set by the filesystem instead of Win32. However, it seems like from the documentation each filesystem can have its own limitations... which means if you're on a FAT32 drive it might enforce limits on length or acceptable characters that don't apply to NTFS, and a shared volume (SMB, etc) might have its own rules. I haven't dealt with this enough to know how consistent the rules are. I think it would be reasonable for WASI to always use the file namespace (to avoid MAX_PATH and other limitations) as long as the important behaviors end-users expect are there.

Sometimes the only way to generate a well-formed path is to keep trying options until it succeeds, which is unfortunate. I've had to write code like that recently.

> So if you want to pass a single command-line argument that contains spaces to a Windows program, what is the officially correct way to do that? Admittedly I've been coding for decades on Windows and never stumbled upon a good answer.

Typically the solutions are quoting the argument and/or using a response file. It's definitely something that varies from app to app, like how you would escape a quote inside the quoted argument isn't necessarily obvious.
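As one concrete instance of the quoting approach, Python's `subprocess` module carries a helper, `list2cmdline`, that joins an argument list into a single command-line string using the Microsoft C runtime's quoting rules. It is not part of the documented `subprocess` API, so treat this as an illustration rather than a recommendation, and note that programs which parse their command line differently may not agree with it.

```python
import subprocess

# An argument containing a space is wrapped in double quotes.
simple = subprocess.list2cmdline(["tool.exe", "my file.txt"])

# Embedded quotes are backslash-escaped inside the quoted argument.
tricky = subprocess.list2cmdline(["tool.exe", 'say "hi" now'])
```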

@programmerjake

IIRC / doesn't work on a \\?\-style path

@qwertie

qwertie commented Oct 22, 2020

@kg Yes, Windows filenames can't contain any of \ / : * ? " > < | and there are extra restrictions. See StackOverflow, MS Docs.
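A small check for the nine characters listed above. It deliberately ignores the "extra restrictions" mentioned in the comment, and the function name is made up for this example.

```python
# The characters listed above as invalid in Windows filenames.
WINDOWS_RESERVED = set('\\/:*?"<>|')


def windows_reserved_chars(name: str) -> list:
    """Return the reserved characters that appear in name, sorted by code point."""
    return sorted(set(name) & WINDOWS_RESERVED)


drive_like = windows_reserved_chars("a:b?.txt")
arrows = windows_reserved_chars("---> hello <---")
plain = windows_reserved_chars("plain.txt")
```

One of the example names from earlier in the thread, "---> hello <---", fails this check because of `<` and `>`, even though it is an ordinary name on Unix.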

@sunfishcode
Member Author

The current API uses strings for filesystem paths, which contain sequences of Unicode scalar values (USVs), which applications can work with using strings encoded in UTF-8, UTF-16, or other Unicode encodings.

This does mean that the API is unable to open files which do not have well-formed Unicode encodings, which may want separate APIs for handling such paths or may want something like the arf-strings proposal, but if we need that we should file a new issue for it.
