Character Set #86

kg · 2019-03-29T08:45:18Z

Is the character set / encoding used by WASI specified anywhere? I don't see it in the docs.

It would be nice to see an advisory comment, or even a requirement, that all the char* textual arguments are UTF-8 unless otherwise specified, along with specifics about how textual arguments are laid out. For example, is environ a list of variables, each terminated by a NUL, with the whole list terminated by a double NUL?

sunfishcode · 2019-03-29T18:56:15Z

It isn't explicitly documented yet, but the intention is that strings are UTF-8 in general. The contents of argv and environ follow POSIX as far as NUL string termination and NULL pointer array termination go.
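To make the termination rule concrete, here is a minimal sketch of how a consumer might split a block of NUL-terminated strings and check that each one is UTF-8. The flat back-to-back buffer layout and the name `split_nul_terminated` are assumptions for illustration only; nothing in this thread specifies how WASI hands the data over.

```python
from typing import List


def split_nul_terminated(buf: bytes) -> List[str]:
    """Split a buffer of NUL-terminated strings and decode each as UTF-8.

    Assumption (not from the thread): entries are packed back to back,
    each one ending in exactly one NUL byte.
    """
    if not buf:
        return []
    if not buf.endswith(b"\x00"):
        raise ValueError("last entry is not NUL-terminated")
    # Splitting on NUL leaves one empty piece after the final terminator.
    entries = buf.split(b"\x00")[:-1]
    # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
    return [entry.decode("utf-8") for entry in entries]


env = split_nul_terminated(b"HOME=/home/me\x00LANG=C\x00")
```

In this sketch an invalid byte sequence surfaces as a `UnicodeDecodeError` rather than being silently replaced, which matches the "strings are UTF-8" intent more closely than lossy decoding would.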

Using UTF-8 does create some problems when interfacing with various existing host platforms, and we don't have all the answers yet.

One idea for filenames is to say that a file with a name that can't be translated to UTF-8 can't be accessed by name in WASI. We'd then add an API for iterating over a directory that would allow the file to be accessed without requesting it by name. Similarly, the corresponding idea for command-line arguments is to say that you can't launch a WASI program if the arguments can't be translated to UTF-8.
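A rough sketch of what that policy could look like on the host side, assuming the host sees names and arguments as raw bytes. The function names `args_for_launch` and `partition_names` are hypothetical and do not correspond to any existing WASI or Wasmtime API.

```python
from typing import List, Tuple


def args_for_launch(raw_args: List[bytes]) -> List[str]:
    """Translate host arguments to UTF-8 strings, or refuse to launch."""
    translated = []
    for index, raw in enumerate(raw_args):
        try:
            translated.append(raw.decode("utf-8"))
        except UnicodeDecodeError as err:
            raise ValueError(
                f"argument {index} is not valid UTF-8; refusing to launch"
            ) from err
    return translated


def partition_names(raw_names: List[bytes]) -> Tuple[List[str], List[bytes]]:
    """Separate names usable by name from those reachable only by iteration."""
    nameable: List[str] = []
    unnameable: List[bytes] = []
    for raw in raw_names:
        try:
            nameable.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            # Hypothetical policy: keep the entry so a directory-iteration
            # API could still hand it out, but never expose it as a name.
            unnameable.append(raw)
    return nameable, unnameable


by_name, by_iteration_only = partition_names([b"a.txt", b"\xff\xfe"])
```

The point of returning the untranslatable entries rather than dropping them is that the file still exists; it just cannot be requested by name.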

These tradeoffs obviously aren't ideal for all use cases. And, I haven't described case insensitivity, Unicode normalization, reserved characters, name length limits or other issues yet. So there's clearly more work to be done here. If people have ideas about how we should address these issues, we'd be happy for the help :-).

kg · 2019-03-30T00:51:41Z

My suggestion for dealing with platform-specific issues like normalization would be to expose separate APIs for path normalization. On platforms with no requirements those APIs can be no-ops, but on platforms like Win32 or OS X they can apply any relevant normalization rules. Software can then detect the normalization rules through probing, and can also reliably perform normalization in advance instead of paying the cost on every I/O operation.
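As a sketch of the shape such an API might take, using Python's standard `unicodedata` module to stand in for whatever rules a real host would apply. The names `make_path_normalizer` and `changes_names` are hypothetical, and the choice of a Unicode normalization form per platform is an assumption for illustration, not a statement of what any given filesystem does.

```python
import unicodedata
from typing import Callable, Optional


def make_path_normalizer(form: Optional[str]) -> Callable[[str], str]:
    """Return a normalizer; `None` models a platform with no rules (a no-op)."""
    if form is None:
        return lambda path: path
    return lambda path: unicodedata.normalize(form, path)


def changes_names(normalize: Callable[[str], str]) -> bool:
    """Probe whether a normalizer rewrites names at all."""
    probes = [
        "caf\u00e9",   # precomposed e-acute
        "cafe\u0301",  # 'e' followed by a combining acute accent
    ]
    return any(normalize(probe) != probe for probe in probes)


no_rules = make_path_normalizer(None)
decomposing = make_path_normalizer("NFD")

# Normalize once up front, then reuse the result for later I/O calls.
normalized = decomposing("caf\u00e9")
```

Probing with both a precomposed and a decomposed spelling lets the caller detect a composing or a decomposing normalizer without knowing in advance which one the platform uses.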

As for UTF-8 as the standard representation, I think it's the best option. Many runtime environments and libraries can auto-convert to UTF-8. In the case of C#, there is a mechanism to specify a custom string marshaler to handle UTF-8, and there are plans to make it a built-in interop format in the future.

sunfishcode · 2019-04-04T20:43:07Z

Continuing the discussion in https://github.com/WebAssembly/WASI/issues/8.

sunfishcode added the wasi:api Issues pertaining to the WASI API, not necessarily specific to Wasmtime. label Mar 29, 2019

sunfishcode closed this as completed Apr 4, 2019

sunfishcode mentioned this issue Dec 15, 2020

Specify UTF-8 as the character set WebAssembly/wasi-filesystem#17

Closed

grishasobol pushed a commit to grishasobol/wasmtime that referenced this issue Nov 29, 2021

Don't expand locals. (bytecodealliance#86)

730c918

pchickey added a commit to pchickey/wasmtime that referenced this issue May 12, 2023

replace fn unwrap and fn unwrap_result with postfix `.trapping_un…

11e24c2

…wrap()` (bytecodealliance#86) this is just aesthetic, but prefixed unwraps are a lot harder on the eyes than postfixed
