Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: x/sys/windows/mkwinsyscall: support UTF16 functions not ending with W #51786

Open
qmuntal opened this issue Mar 18, 2022 · 6 comments
Labels
Milestone

Comments

@qmuntal
Copy link
Contributor

qmuntal commented Mar 18, 2022

Background

Microsoft normally provides two parallel sets of APIs, one for ANSI strings, whose function names end with A, and the other for Unicode strings, whose function names end with W (source).

This pattern is well supported in x/sys/windows/mkwinsyscall: function parameters of type string are transformed using syscall.UTF16PtrFromString if the function names ends with W, and using syscall.BytePtrFromString otherwise:

https://github.com/golang/sys/blob/2edf467146b5fc89e484991587e3032c8421ae8c/windows/mkwinsyscall/mkwinsyscall.go#L616-L622

Problem

Newer libraries that were created after Windows introduced Unicode tend to provide just the UTF16 API set without the W-ending convention. In my particular case, I'm creating bindings for the BCrypt library, in which all functions accepting strings expect them to be UTF16 encoded even when the function name does not end with W.

This forces me to define the //sys functions using *uint16 instead of string and to do the conversion outside the autogenerated code.

Request

Add a knob to x/sys/windows/mkwinsyscall to decide if a string parameter should be transformed to ASCII or Unicode.

Two complementary solutions come to mind:

  • Add a cli flag, i.e. -utf16, which changes the code generation behavior from
    everything is ASCII unless the function name ends with W
    to
    everything is UTF16 unless the function name ends with A.

  • Add a new optional condition to each function definition, i.e. [encoding=utf16/ascii], which will override the encoding selected from the function name or the cli flag (if finally added).

The first solution will probably be enough to cover all cases, as new Windows APIs are Unicode by default and only legacy APIs expect ASCII strings, in which case they already provide the both API sets.

cc @zx2c4

@ianlancetaylor
Copy link
Contributor

CC @alexbrainman

@zx2c4
Copy link
Contributor

zx2c4 commented Mar 19, 2022

What if we just unconditionally change it to default to utf16? Or perhaps always require either -utf16 or -ascii, but not none. I don't think Go promises any stability w.r.t. autogenerators, right?

The reason I suggest this is that I'd hate for the behavior of mkwinsyscall to become, "it does the old bad thing unless you remember to pass this special switch." I'd rather it be "it does the new good thing, unless you tell it to do the old bad thing" or "you have to tell it which thing you want to do, so that it won't do the old bad thing without you telling it to."

I suspect changing it to default to utf16 would make most sense, considering most if not all functions are already wchar.

@alexbrainman
Copy link
Member

@qmuntal

This forces me to define the //sys functions using *uint16 instead of string and to do the conversion outside the autogenerated code.

I don't see using *uint16 instead of string as a bad thing. Users of your package would have to explicitly call windows.UTF16PtrFromString, but they could save some memory allocations because of that. I don't know if memory saving is important when using BCrypt library, but I always try not to hide memory allocations when creating low level libraries.

I suggest you copy x/sys/windows/mkwinsyscall into your package and adjust it as required (probably 1 or 2 lines change), and use it for your case. If you discover more use cases for your proposal, I am happy to reconsider.

@zx2c4

What if we just unconditionally change it to default to utf16?

That will break some users code. I suggested a reasonable workaround for @qmuntal problem. Why breaking users code is better than my workaround?

Or perhaps always require either -utf16 or -ascii, but not none

That will definitely break users code. And your proposed flags will force current users split their code into 2 different source files: ascii version and utf16 version.

I am not convinced this is a problem that we need to solve. Yet.

But feel free to ignore my opinion.

Alex

@qmuntal
Copy link
Contributor Author

qmuntal commented Mar 22, 2022

@alexbrainman

I don't see using *uint16 instead of string as a bad thing. Users of your package would have to explicitly call windows.UTF16PtrFromString, but they could save some memory allocations because of that.

windows.UTF16PtrFromString returns a pointer to the UTF16 string and an error, so we are forcing users to check that error before calling the autogenerated function, whereas if the autogenerated function would do the conversion this boilerplate code could be eliminated.

I don't know if memory saving is important when using BCrypt library, but I always try not to hide memory allocations when creating low level libraries.

I agree with your statement about not hiding allocations, but this proposal don't preclude //sys definitions to use *uint16 instead of string, it's just about making string params to be handled correctly regardless of the function naming convention. IMO we are not adding a new way to hide allocations, as functions accepting string params whose name end with W or A are already hiding allocations.

Having low memory allocation is primordial when using BCrypt library, but for my use case functions accepting UTF16 strings are not supposed to be called multiple times with the same string, so it makes no sense to cache the windows.UTF16PtrFromString result in order to reduce allocations. Passing the string directly to the autogenerated function would not have any performance impact, but would be more ergonomic.

I suggest you copy x/sys/windows/mkwinsyscall into your package and adjust it as required (probably 1 or 2 lines change), and use it for your case. If you discover more use cases for your proposal, I am happy to reconsider.

If you think this proposal covers a niche use case, I'm fine patching mkwinsyscall. But, as @zx2c4 has also mentioned, UTF16 strings are the standard for new Windows APIs, and it would be a pity to lose the capacity of passing string arguments instead of *uint16 to new Windows function.

@alexbrainman
Copy link
Member

whereas if the autogenerated function would do the conversion this boilerplate code could be eliminated.

Yes. If you don't care about allocations, then you can save your users from writing boilerplate code.

this proposal don't preclude //sys definitions to use *uint16 instead of string, it's just about making string params to be handled correctly regardless of the function naming convention. IMO we are not adding a new way to hide allocations, as functions accepting string params whose name end with W or A are already hiding allocations.

Agree.

Having low memory allocation is primordial when using BCrypt library, but for my use case functions accepting UTF16 strings are not supposed to be called multiple times with the same string, so it makes no sense to cache the windows.UTF16PtrFromString result in order to reduce allocations. Passing the string directly to the autogenerated function would not have any performance impact, but would be more ergonomic.

Thanks for explaining.

If you think this proposal covers a niche use case, ...

I do think that. Maybe my opinion will change in the future.

... I'm fine patching mkwinsyscall. But, as @zx2c4 has also mentioned, UTF16 strings are the standard for new Windows APIs, and it would be a pity to lose the capacity of passing string arguments instead of *uint16 to new Windows function.

You would have to copy mkwinsyscall program and make small changes to it. If you find this troublesome, I am happy to reconsider. I did exactly that in many of my personal projects. I do not regret making that decision after many years.

I also happy to reconsider if other projects find this functionality useful.

I am just trying to defer breaking users code till later.

Alex

@rsc rsc moved this to Incoming in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/425054 mentions this issue: syscall: rely on utf16.AppendRune

gopherbot pushed a commit that referenced this issue Aug 23, 2022
Using utf16.AppendRune instead of utf16.Encode safe a bunch
of allocations across the board, as many higher level functions
use it to call Windows syscalls, for example to `os` package:

name                old alloc/op   new alloc/op   delta
Readdirname-12        15.6kB ± 0%    15.6kB ± 0%   +0.26%  (p=0.008 n=5+5)
Readdir-12            29.4kB ± 0%    29.4kB ± 0%   +0.14%  (p=0.008 n=5+5)
ReadDir-12            29.4kB ± 0%    29.4kB ± 0%   +0.14%  (p=0.016 n=4+5)
StatDot-12              552B ± 0%      560B ± 0%   +1.45%  (p=0.008 n=5+5)
StatFile-12             512B ± 0%      336B ± 0%  -34.38%  (p=0.008 n=5+5)
StatDir-12              432B ± 0%      288B ± 0%  -33.33%  (p=0.008 n=5+5)
LstatDot-12             552B ± 0%      560B ± 0%   +1.45%  (p=0.008 n=5+5)
LstatFile-12            512B ± 0%      336B ± 0%  -34.38%  (p=0.008 n=5+5)
LstatDir-12             432B ± 0%      288B ± 0%  -33.33%  (p=0.008 n=5+5)
StatFile-12             4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.008 n=5+5)
StatDir-12              4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.008 n=5+5)
LstatFile-12            4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.008 n=5+5)
LstatDir-12             4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.008 n=5+5)

Updates #51786

Change-Id: I0a088cf1a96e9c304da9311bb3895b70443c1637
Reviewed-on: https://go-review.googlesource.com/c/go/+/425054
Reviewed-by: Tobias Klauser <[email protected]>
Run-TryBot: Ian Lance Taylor <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: David Chase <[email protected]>
Auto-Submit: Ian Lance Taylor <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

5 participants