Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UTF8 to CharSet #17000

Closed
stephentoub opened this issue Apr 17, 2016 · 56 comments
Closed

Add UTF8 to CharSet #17000

stephentoub opened this issue Apr 17, 2016 · 56 comments
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime.InteropServices
Milestone

Comments

@stephentoub
Copy link
Member

The CharSet enumeration is used to specify how strings should be marshaled:
https://github.com/dotnet/corefx/blob/master/src/System.Runtime/ref/System.Runtime.cs#L3019-L3023

public enum CharSet
{
    Ansi = 2,
    Unicode = 3,
}

Unicode specifies that UTF16 should be used, regardless of platform, but Ansi is interpreted differently based on platform: on Windows it's interpreted to mean the ANSI format, whereas on Unix it's interpreted to mean UTF8. This means that on Windows we lack the ability to specify UTF8 as the marshaling, and more generally we lack the ability to specify UTF8 marshaling regardless of platform, making writing cross-platform managed components more difficult.

We should add a new UTF8 enum value:

public enum CharSet
{
    Ansi = 2,
    Unicode = 3,
    UTF8 = 5,
}

that when used will cause the runtime's marshaling to be done with UTF8, which is the standard for modern services.

[Added-by-Yi]

We should also add a new corresponding UnmanagedType enum in UnmanagedType for UTF8 as well, for finer control and parity (between UnmanagedType and CharSet):
https://github.com/dotnet/corefx/blob/master/src/System.Runtime.InteropServices.PInvoke/ref/System.Runtime.InteropServices.PInvoke.cs

public enum UnmanagedType
{
    LPUTF8Str = 0x30
}

And new Marshal helpers while we are at it:

public class PInvokeMarshal
{
    public static string PtrToStringUTF8(System.IntPtr ptr);
    public static string PtrToStringUTF8(System.IntPtr ptr, int len);
    public static System.IntPtr StringToAllocatedMemoryUTF8(string s);
    public static System.IntPtr ZeroFreeMemoryUTF8(System.IntPtr s);
}
@migueldeicaza
Copy link
Contributor

Adding some extra information.

When these values are embedded in P/Invoke definitions on the metadata, in the "Flags for ImplMap" [PInvokeAttributes]. There is one unused bit there, that we could take (0x08) and we could take one value of those to mean Utf8, leaving some room for other things in the future as well.

@yizhang82
Copy link
Contributor

I think we should also add UnmanagedType=LpUtf8Str (or UTF8String, name TBD). All the capabilities in CharSet should be available in UnmanagedType.

@yizhang82
Copy link
Contributor

The work is the same since we have to create a new ilmarshaler anyway.

@yizhang82
Copy link
Contributor

Added UnmanagedType into @stephentoub original proposal.

@yizhang82
Copy link
Contributor

Added new PInvokeMarshal helpers into the proposal as well.

@masonwheeler
Copy link
Contributor

Yes, this has been needed for quite a while now. It definitely looks like a worthwhile proposal.

@yizhang82
Copy link
Contributor

Updated LpUTF8Str to LPUTF8Str according to @weshaggard 's feedback.

@tijoytom-zz
Copy link
Contributor

@yizhang82
cc @weshaggard @stephentoub
Does the below API names sound reasonable? I know you named them StringToAllocatedMemoryUTF8 , but StringToCoTaskMemUTF8 sounds consistent with what we have now for ANSI and UniCode(aka UTF16).

unsafe public static String PtrToStringUTF8(IntPtr ptr)
unsafe public static String PtrToStringUTF8(IntPtr ptr,int len)
unsafe public static IntPtr StringToCoTaskMemUTF8(String s)
unsafe public static void ZeroFreeCoTaskMemUTF8(IntPtr s)

@yizhang82
Copy link
Contributor

@tijoytom Please note that these are the new names we defined for PInvokeMarshal class where the naming is consistent and use AllocatedMemory (instead of CoTaskMem, to avoid windows-ness).

@whoisj
Copy link

whoisj commented May 11, 2016

Any chance we can call a duck a "duck" and change Unicode to Utf16? I'd even settle for Ucs2 at this point.

@yizhang82
Copy link
Contributor

@whoisj Unfortunately changing existing API contract is a breaking change. We can change the Unicode names in the new PInvokeMarshal class APIs, but the CharSet enum is going to be problematic. I think we can slowly introduce new API and enum with better names and slowly deprecate the old ones (LPWstr is another such example).

@bendono
Copy link
Contributor

bendono commented May 11, 2016

@yizhang82 What about leaving Unicode and adding Utf16, with both having the same value of 3?

@masonwheeler
Copy link
Contributor

@bendono That was my thought too. Or possibly even marking Unicode as [Obsolete]. (Can you do that with enum values?)

@whoisj
Copy link

whoisj commented May 11, 2016

@whoisj

Unfortunately changing existing API contract is a breaking change. We can change the Unicode names in the new PInvokeMarshal class APIs, but the CharSet enum is going to be problematic. I think we can slowly introduce new API and enum with better names and slowly deprecate the old ones (LPWstr is another such example).

I guess I do not understand why the enumeration value cannot be overloaded for "correctness" and "back compat". Something along the lines of:

public enum CharSet
{
    Ansi = 2,
    [Obsolete]
    Unicode = 3,
    UTF16 = 3,
    UTF8 = 5,
}

@yizhang82
Copy link
Contributor

@whoisj @masonwheeler Thanks for your suggestions. Yes, this is exactly what I was thinking as well (introduce new API and deprecate old ones). In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode. It is something a simple search/replace could fix, but still potentially a big impact regardless.

@whoisj
Copy link

whoisj commented May 11, 2016

@yizhang82 as a customer who would be impacted by such a change in the way you describe, I appreciate your concern.

I propose the following resolution:

  1. Add UTF16 = 3 to the enumeration, and pubically state that Unicode will be deprecated in the future.
  2. In the follow up release to NetFx 4.7 / 5 / 7, whatever it'll be named, add the [Obsolete] decorator. to the Unicode value.
  3. Sometime in the future, when appropriate, remove the Unicode value from the enumeration.

This seems to be a decent compromise.

@masonwheeler
Copy link
Contributor

@yizhang82

In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode.

But isn't that the whole point of warningaserror, declaring that they want their code to break if even something minor enough to be considered an error ends up being off about their codebase, including future changes?

@tijoytom-zz
Copy link
Contributor

@yizhang82
unsafe public static String PtrToStringUTF8(IntPtr ptr,int len)

Realized that the semantics of this method (ie for ANSI and UTF16) usually 'len' parameter is the length in number of characters , now for UTF8 the user won't know the number of character pointed to by ptr. Now even if the user somehow find the number of chars , we still need to get the nubmer of bytes to do the actual conversion.
Given this , i think it might be better to drop this overload of PtrToStringUTF8.

@whoisj
Copy link

whoisj commented May 12, 2016

@tijoytom not every uft8 block is null terminated. Many use-cases will need the len value to be the number of utf8 bytes, but absolutely not the count of characters or code points in the block.

@yizhang82
Copy link
Contributor

@tijoytom The len here is not count of characters. It is byte len. Our Documentation on PtrToStringAnsi is incorrect. Let's change the name of PtrToStringUTF8 to be PtrToStringUTF8ByteLen

@stephentoub
Copy link
Member Author

Let's change the name of PtrToStringUTF8 to be PtrToStringUTF8ByteLen

Wouldn't it be better to instead change the name of the parameter from len to byteLength or something like that?

@tijoytom-zz
Copy link
Contributor

Yes, changed it to byteLen

@AlexGhiondea
Copy link
Contributor

@yizhang82 @tijoytom is this something you are currently working on?

@jnm2
Copy link
Contributor

jnm2 commented Nov 30, 2016

@masonwheeler

@yizhang82

In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode.

But isn't that the whole point of warningaserror, declaring that they want their code to break if even something minor enough to be considered an error ends up being off about their codebase, including future changes?

This topic has come up several times here and in .NET design meetings. Being respectful of warnings-as-errors is a good thing, just as much for the sake of not introducing 1000 new build warnings even without warnings-as-errors. But at the end of the day, if people opt into warnings as errors, they should expect to have their builds broken now and then. They want micromanagement, and they will get it.

I think obsoleting public API immediately is okay. If you don't do that, at least clearly document the value as deprecated in IntelliSense.

Personally I leave it off until the implementation is past primary development. I turn it on during the stage where I don't expect to be making more changes myself besides potential dependency updating.

@danmoseley
Copy link
Member

@jnm2 the Visual Studio experience is the main concern. I can't speak for the compiler team but I understand they wish to avoid making new warnings appear in the VS error list on upgrade as over the years it has proven to deter people from upgrading their Visual Studio. Historically we have tried to encourage folks to disable specific warnings to get past this but it hasn't been sufficient. I completely agree with the sentiment because there are types we really want to steer folks away from but that's the current policy.

@whoisj
Copy link

whoisj commented Jan 3, 2017

@danmosemsft but CoreFx and VS are decoupled now, no?

@yizhang82
Copy link
Contributor

@whoisj They are decoupled - but VS needs to have a policy decision on which version of CoreFX meta package as the default in VS projects, and this has wide impact to developers using VS. At the end of the day, I think there is a trade off to be made case-by-case. In this case I don't see a strong reason. There will be some confusion, yes, but given that rest of the .NET escosystem (char, string, etc) is pretty much UTF-16 based, it is probably not worth impacting anyone that has CharSet.Unicode in their code.

We could simply add a new value UTF16 that has the same value without deprecating the Unicode value.

@danmoseley
Copy link
Member

@whoisj Visual Studio does not run on CoreFX, but it does offer a tooling experience for CoreFX which is what matters here. Possibly I misunderstand.

@whoisj
Copy link

whoisj commented Jan 3, 2017

@yizhang82 I think this:

We could simply add a new value UTF16 that has the same value without deprecating the Unicode value.

Is the right answer. For those coming from non-Windows platforms, seeing the options Ascii, Unicode, and UTF8 as options is kind of mind boggling. For many, Unicode is UTF8. 😏

@luqunl
Copy link
Contributor

luqunl commented Feb 9, 2018

assign this to @jeffschwMSFT and @luqunl

@brian-armstrong-discord
Copy link

I love that this is being looked at!

Wouldn't the correct casing be Utf8 in order to match Ansi? Keeping Ansi but adding UTF8 seems like an inconsistency.

@danmoseley
Copy link
Member

Re casing, The design guidelines require Utf8. Looking in CoreFXLabs, they consistently use Utf8.

However there is plenty of use of upper casings eg if you look in CoreFX public API and in https://apisof.net.

@stephentoub did you choose upper case to be consistent with UTF8Encoding?

@stephentoub
Copy link
Member Author

@stephentoub did you choose upper case to be consistent with UTF8Encoding?

With Encoding.UTF8

@luqunl
Copy link
Contributor

luqunl commented Jun 6, 2018

@brian-armstrong-discord , We are discussing whether it is necessary to add Charset.UTF8. Most of case(except array), User can just add MarshalAs(UnmanagedType.LPUTF8Str) to each appropriate arguments/fields to use UTF8.

@yizhang82
Copy link
Contributor

@luqun Besides char, char[], CharSet also affects default String/StringBuilder marshaling behavior.

@luqunl luqunl removed their assignment Oct 11, 2018
@joshfree
Copy link
Member

@jeffschwMSFT what's the current status of this approved API? do you need anything from the corefx team to continue making progress here?

/cc @layomia and @GrabYourPitchforks

Thanks

@adamsitnik
Copy link
Member

Any progress on this? The current behavior is confusing.

@jeffschwMSFT
Copy link
Member

This is not currently on our .NET Core 3.0 list of final items. We can take a look and see if this would align with 3.1.

cc @jkoritzinsky

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Feb 25, 2020
@jkotas jkotas modified the milestones: 5.0, Future May 31, 2020
@lahma0
Copy link

lahma0 commented Aug 20, 2021

After 5 long years, why has this feature still not been implemented? Is there some technical or policy-related issue that is holding this up or is the can just being kicked down the road as a result of other priorities? It appears that most of the necessary discussions concerning technical matters ended in 2017 yet its implementation was skipped over in .NET Core 3.0, then skipped over in .NET Core 3.1, it was then added as a "5.0 milestone", then several months later removed as a "5.0 milestone", and finally was added as a "Future milestone" (which I have to assume just means "something we are going to ignore and don't want to worry about any time in the foreseeable future"). Not having this functionality implemented unnecessarily complicates both cross-platform development and working with native libraries which use/expect UTF8 strings. UTF8 is the most common character encoding method in use and has been for a long time... Will this ever be implemented?

@jkoritzinsky
Copy link
Member

@lahma0 due to a number of implementation difficulties, we have generally been exploring other improvements. We have been working on a source-generated interop solution (see #4257 (comment) for a little more information) that should enable solutions similar to this issue without the far reaching changes in the runtime itself that the tentative implementation of this API had.

In the meantime, you can use [MarshalAs(UnmanagedType.LPUTF8Str)] for each string parameter/return value to get identical behavior to what a CharSet.UTF8 API would provide on P/Invokes

@lahma0
Copy link

lahma0 commented Aug 20, 2021

@jkoritzinsky I appreciate your very quick response and I apologize if my comment came across as a little bitter. While implementing these type of things might not look like that large of a hurdle to outsiders (such as myself), I know they tend to be much more nuanced/difficult and can create a cascade of other problems. I read about the plans for the source-generated interop stuff quite a while back but I haven't kept up to date on its progress. I will definitely check out the links your provided. Thanks again.

@stephentoub
Copy link
Member Author

@dotnet/interop-contrib, have we decided not to do this? If so, we should close it.

@AaronRobinsonMSFT
Copy link
Member

@stephentoub The plan was to wait till we get the source generator for DllImport productized and then close these issues. However, I suppose we could just point to the productization issue now.

@AaronRobinsonMSFT
Copy link
Member

Closing this request in lieu of first-class support we are adding to the DllImport source generator. See #60595

@ghost ghost locked as resolved and limited conversation to collaborators Apr 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime.InteropServices
Projects
None yet
Development

No branches or pull requests