-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify UTF-8 as the character set #17
Comments
My suggestion for dealing with per-platform/per-filesystem normalization constraints would be to expose an explicit API call for normalizing paths, so that an application can do that once and compare the strings to identify whether they were changed and avoid paying the cost of normalization on every IO call. IIRC at least one platform requires this now? I think Apple's APFS? |
I strongly vote for UTF-8. I have never seen a filesystem path non-representable by UTF-8 in practice. |
I also strongly agree with UTF-8 everywhere. |
Agree with UTF-8. |
This changes the fields on the builder to types that let the various `.arg()`, `.env()`, etc methods infallible, so we don't have to worry about handling any errors till we actually build. This reduces line noise when using a builder in a downstream application. Deferring the processing of the builder fields also has the advantage of eliminating the opening and closing of `/dev/null` for the default stdio file descriptors unless they're actually used by the resulting `WasiCtx`. Unicode errors when inheriting arguments and environment variables no longer cause a panic, but instead go through `OsString`. We return `ENOTCAPABLE` at the end if there are NULs, or if UTF-8 conversion fails on Windows. This also changes the bounds on some of the methods from `AsRef<str>` to `AsRef<[u8]>`. This shouldn't break any existing code, but allows more flexibility when providing arguments. Depending on the outcome of https://github.com/WebAssembly/WASI/issues/8 we may eventually want to require these bytes be UTF-8, so we might want to revisit this later. Finally, this fixes a tiny bug that could arise if we had exactly the maximum number of file descriptors when populating the preopens.
This changes the fields on the builder to types that let the various `.arg()`, `.env()`, etc methods infallible, so we don't have to worry about handling any errors till we actually build. This reduces line noise when using a builder in a downstream application. Deferring the processing of the builder fields also has the advantage of eliminating the opening and closing of `/dev/null` for the default stdio file descriptors unless they're actually used by the resulting `WasiCtx`. Unicode errors when inheriting arguments and environment variables no longer cause a panic, but instead go through `OsString`. We return `ENOTCAPABLE` at the end if there are NULs, or if UTF-8 conversion fails on Windows. This also changes the bounds on some of the methods from `AsRef<str>` to `AsRef<[u8]>`. This shouldn't break any existing code, but allows more flexibility when providing arguments. Depending on the outcome of https://github.com/WebAssembly/WASI/issues/8 we may eventually want to require these bytes be UTF-8, so we might want to revisit this later. Finally, this fixes a tiny bug that could arise if we had exactly the maximum number of file descriptors when populating the preopens.
This changes the fields on the builder to types that let the various `.arg()`, `.env()`, etc methods infallible, so we don't have to worry about handling any errors till we actually build. This reduces line noise when using a builder in a downstream application. Deferring the processing of the builder fields also has the advantage of eliminating the opening and closing of `/dev/null` for the default stdio file descriptors unless they're actually used by the resulting `WasiCtx`. Unicode errors when inheriting arguments and environment variables no longer cause a panic, but instead go through `OsString`. We return `ENOTCAPABLE` at the end if there are NULs, or if UTF-8 conversion fails on Windows. This also changes the bounds on some of the methods from `AsRef<str>` to `AsRef<[u8]>`. This shouldn't break any existing code, but allows more flexibility when providing arguments. Depending on the outcome of https://github.com/WebAssembly/WASI/issues/8 we may eventually want to require these bytes be UTF-8, so we might want to revisit this later. Finally, this fixes a tiny bug that could arise if we had exactly the maximum number of file descriptors when populating the preopens.
* fix Linux `isatty` implementation * defer `WasiCtxBuilder` errors to `build()`; don't change API yet This changes the fields on the builder to types that let the various `.arg()`, `.env()`, etc methods infallible, so we don't have to worry about handling any errors till we actually build. This reduces line noise when using a builder in a downstream application. Deferring the processing of the builder fields also has the advantage of eliminating the opening and closing of `/dev/null` for the default stdio file descriptors unless they're actually used by the resulting `WasiCtx`. Unicode errors when inheriting arguments and environment variables no longer cause a panic, but instead go through `OsString`. We return `ENOTCAPABLE` at the end if there are NULs, or if UTF-8 conversion fails on Windows. This also changes the bounds on some of the methods from `AsRef<str>` to `AsRef<[u8]>`. This shouldn't break any existing code, but allows more flexibility when providing arguments. Depending on the outcome of https://github.com/WebAssembly/WASI/issues/8 we may eventually want to require these bytes be UTF-8, so we might want to revisit this later. Finally, this fixes a tiny bug that could arise if we had exactly the maximum number of file descriptors when populating the preopens. * make `WasiCtxBuilder` method types less restrictive This is a separate commit, since it changes the interface that downstream clients have to use, and therefore requires a different commit of `wasmtime` for testing. That `wasmtime` commit is currently on my private fork, so this will need to be amended before merging. Now that failures are deferred until `WasiCtxBuilder::build()`, we don't need to have `Result` types on the other methods any longer. Additionally, using `IntoIterator` rather than `Iterator` as the trait bound for these methods is slightly more general, and saves the client some typing. * enforce that arguments and environment variables are valid UTF-8 * remove now-unnecessary platform-specific OsString handling * `ENOTCAPABLE` -> `EILSEQ` for failed arg/env string conversions * fix up comment style * Apply @acfoltzer's fix to isatty on Linux to BSD
I propose we go with UTF-8, which is overwhelmingly supported by commenters here and in other feedback I've received, and has a lot of obvious advantages. To my knowledge, it's what most implementations have started doing as well. Concerning efficiency, string-oriented APIs already don't play to WASI's strengths -- if there are APIs which involve a lot of string passing, we should look for ways to let users to create and pass around handles instead, as that has several advantages, character encoding being just one. API design that de-emphasises stringsAs an example, consider iterating over the files in a large directory. With Accessing unnameable filesOne other issue is the ability to open files which don't have encodable names. This is very uncommon, but it should be possible to do from applications willing to do a little extra effort. My idea here is, to encode an arbitrary byte sequence in UTF-8:
This way, applications that can use ptr+length could treat these strings as black boxes and retain the ability to open any file. Applications that use C-style strings would work on all encodable names, but see lossy strings in unencodable cases. For many use cases, this will be fine, because unencodable filenames are very rare. And when they do happen, the U+FFFDs should at least make it fairly obvious what's happening. |
Both Windows and Linux have their quirks in this regard. Windows allows pretty much any string of 16-bit numbers as a filename, including unpaired UTF-16 surrogates, whereas Linux leaves the issue of actually interpreting filenames up to applications, so files don't have to be named in the same encoding throughout a drive, and can be nearly any bag of bytes. Interestingly, I see a lot of software bend over backwards to support unpaired surrogates in filenames for the sake of Windows, despite them not appearing often at all in practice, and yet I never see the same attempt being made for Linux. I think the approach mentioned above of using UTF-8 with null-based hash escape sequences for unrepresentable characters is definitely worth considering. It allows WASI to give very strong guarantees to software about the encoding of filenames, which is something that most platforms can't do. |
Prohibiting control characters would prevent some attacks as well. DWheeler's Fixing Unix/Linux/POSIX Filenames documents a lot of existing prohibitions at the application level. His desire to prohibit leading dashes |
That's a great essay. I agree that it'd be good to see if we can prohibit control characters in filenames while we're here. I also agree that prohibiting leading |
I'm very into the Lang Sec philosophy, but where does it end? The only time you would insert a control character on purpose is as an attack on some input sanitizer. But quotes, unicode literals, and others just as dangerous as a leading dash, just in a different context. If we are going to scrub filenames, then for the sake of performance we could parameterize the function to allow developer to specify things like a blocklist, a specific Unicode normalization, or even WTF-8. That way we can provide a sane default, but allow developers to tweak it as needed. |
Can we quantify this somehow? |
I'm exploring the idea of a GUI guideline as a place we could draw the line. There probably isn't a way to input a control character, unpaired surrogate, or other invalid encoding in typical GUI filename fields. But users could plausibly type
Could you say more about how this might work? If a developer of a wasm module picks eg. a normalization which differs from the host environment a user wants to run the module in uses, it seems like it could be very inefficient. |
I have a weak preference for leaving the requirement at valid UTF8. Security requirements are highly circumstantial, so imo it makes sense for us to provide the weakest invariant and let others build on top. |
I would agree with you if we had control over the entire stack, i.e. we were shipping our own OS. Even then you would need to interoperate with other systems, which is why Windows has a POSIX mode. The Unicode identifier standard(s) would be of use here....
It's a dangerous path to tread down. At a minimum, I would try to at least repair the surrogates. Even then, I believe that code written in FWIW, there are only ~1 million unicode points, so exhaustive testing is doable. |
I have the beginnings of an experiment using the NUL-escaping technique outlined above. I'm not yet convinced that this is the right path to take, and it's still evolving, but it illustrates the idea. What this would let us do is have filenames, command-line args and environment variables inside WASI always be valid UTF-8, but still have the ability to access any filename in any filesystem, POSIX-style or Windows-style. It doesn't yet address the control codes idea, but that could be added in. |
Apparently Raku defines a special lossless/round-trippable encoding for filenames, UTF8-C8. |
I missed UTF8-C8 when I did a survey before, so thanks for bringing that up! Looking into it, the idea of using a private-use codepoint as an escape is clever, but it's unfortunate that translation between Raku's UTF8-C8 to UTF-8 is lossy; two different byte sequences round-trip to the same byte sequence: > say Buf.new(0xf4,0x8f,0xbf,0xbd,0x78,0x46,0x46).decode('utf8-c8').encode();
utf8:0x<F4 8F BF BD 78 46 46>
> say Buf.new(0xff).decode('utf8-c8').encode();
utf8:0x<F4 8F BF BD 78 46 46> Raku itself probably avoids this problem because Raku code probably usaully decodes into Raku's internal string representation which remembers the difference, but in WASI we'll be communicating strings to other languages which won't, so we'd get filename collisions. It would be possible to make a variant of Raku's UTF8-C8 which encodes U+10FFFD in the manner of unrecognized bytes:
which would fix the round-tripping. It's otherwise a simple and relatively readable encoding. A remaining downside is that there would exist Unicode filenames for which an application that knows the name would fail to open them by that name, though it would be very rare in practice. Another approach is Python's And there's NUL-escaping, as discussed above. This is based on the observation that neither POSIX-family platforms nor Windows permit NULs in filesystem names, so we can use NUL as an escape byte without collisions. ARF strings are an experimental prototype of this. It's trickier in the case of filenames with invalid encodings, but has no special cases for filenames with valid encodings. NUL-escaping has an element of optimism -- on filesystems which require valid Unicode filenames, NUL-escaping could be omitted entirely, leaving behind no special cases. |
I like the NUL escaping idea, I just randomly ran across Raku's thing last night. FWIW, I believe the standard practice is to preserve encoding and fallback to bag-of-bytes to access it. Altering the filename encoding can cause issues, for example OS X normalizing to NFD was considered a mistake because the retrieved name would not match (the vastly more common) NFC encoding. While we are at it, the web is almost entirely in NFC form.... |
NUL escaping seems like it would work poorly for C: C would treat the first NUL as an end-of-string marker and truncate it there. |
The ARF-string version of NUL escaping takes a "make the easy things easy, and the hard things possible" approach to C. The vast majority of paths are valid in practice, so they'll look and act completely normal to C (or better, because there's no need to guess the encoding). In the uncommon case of an invalid path, with ARF strings, unaware C code will see a string containing |
If I understand correctly,
So, how about this?
Other issues come to mind: (6) a lot of Windows software is limited to reading/writing files whose entire path is shorter than 260 characters (PATH_MAX)... there's no need for WASI to have this limitation but when creating files it may be wiser for interoperability reasons to stick to the limit. (Note that in general it is always possible to exceed the limit, since a folder can be renamed to a longer name which then could make a file inside exceed the limit. Also, on POSIX it's complicated) (7) Windows uses drive letter prefixes like "C:" and "C:/". This obviously generalizes to "protocols" like "http://" which could also, in principle, be supported by a "file system" interface, but wait, isn't "C:" and "http:" considered a valid filename on Unix? I notice that ':' is used on Unix as a separator in PATH as well as being a former path separator on MacOS, so maybe it belongs on the list of discouraged characters everywhere. This wouldn't really solve the conflict though; if the "weird filenames flag" is used then you can't tell if "c:/foo" is an absolute or relative path, except by checking which OS you are on. |
I understand this issue goes beyond filenames to command-line arguments and environment variables, but aren't there separate issues for all of these? I noticed that Unix has "command-line arguments" as a core concept - a process receives a list of arguments - whereas on Windows a process receives a single string, not a list. Any C standard library then does something (what?) to parse that string and send it to (you might say this is "separate" from the character set issue, but it's not quite separate, since issues of parsing are interrelated with issues of character set, e.g. |
Windows parses command lines using |
I'd have to double-check, but I believe under normal circumstances In general on Windows you have classic Win32 paths with MAX_PATH and other restrictions like not being able to name a file Sometimes the only way to generate a well-formed path is to keep trying options until it succeeds, which is unfortunate. I've had to write code like that recently.
Typically the solutions are quoting the argument and/or using a response file. It's definitely something that varies from app to app, like how you would escape a quote inside the quoted argument isn't necessarily obvious. |
IIRC |
@kg Yes, Windows filenames can't contain any of |
The current API uses This does mean that the API is unable to open files which do not have well-formed Unicode encodings, which may want separate APIs for handling such paths or may want something like the arf-strings proposal, but if we need that we should file a new issue for it. |
Continuing the discussion from bytecodealliance/wasmtime#86:
WASI currently doesn't document the character sets used for filesystem paths, command-line arguments or environment variables.
Two high-level strategies have been proposed:
The text was updated successfully, but these errors were encountered: