Support non-ascii in fgVNBasedIntrinsicExpansionForCall_ReadUtf8 #89383

EgorBo · 2023-07-24T12:59:01Z

Jakob pointed me to WszWideCharToMultiByte API that seems to be enough to support non-ASCII values here too.
Closes #89375

…Utf8

ghost · 2023-07-24T12:59:13Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Jakob pointed me to WszWideCharToMultiByte API that seems to be enough to support non-ASCII values here too.
Closes #89375

Author:	EgorBo
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

src/coreclr/jit/helperexpansion.cpp

stephentoub

Thanks :)

EgorBo · 2023-07-24T14:50:36Z

Wrote a test to validate WszWideCharToMultiByte behaviour: https://gist.github.com/EgorBo/f65f1fd468a8644d33ac5224099d84d6 to match BCL's conversion

stephentoub · 2023-07-24T14:56:28Z

What does WszWideCharToMultiByte do for invalid input?

EgorBo · 2023-07-24T15:07:34Z

What does WszWideCharToMultiByte do for invalid input?

Can you provide an example - I'm a bit confused on what's expected to be done with it

stephentoub · 2023-07-24T15:25:55Z

Can you provide an example - I'm a bit confused on what's expected to be done with it

For example, this will throw an exception:

new UTF8Encoding(false, throwOnInvalidBytes: true).GetBytes("\uD800b")

because \uD800 is a high surrogate and it's not followed by a low surrogate.

Ideally the JIT simply wouldn't do anything if it detected the text was invalid and would just use the actual implementation of ReadUtf8, which would let UTF8Encoding do with it what it will, and you wouldn't need to match the handling for erroneous input.

EgorBo · 2023-07-24T16:12:27Z

Thanks, it seems that it still returns the same value as the BCL version, there is a WC_ERR_INVALID_CHARS flag for WszWideCharToMultiByte that makes it to return 0 for such input and forward it to the managed impl but not sure it's supported outside of Windows

I was also pointed to a minipal version for utf16-utf8 encoding (minipal_convert_utf16_to_utf8) that is used for evenstream, perhaps, I should use that instead? (+ extra size to clrjit I guess) cc @jkotas @elinor-fung (from #89036)

am11 · 2023-07-24T18:12:10Z

WszWideCharToMultiByte is (one of many) wrapper(s) around WideCharToMultiByte from coreclr/pal, which in turn is a wrapper around minipal/utf8. I think the consensus with @jkotas in #85558 was that we remove these PAL wrappers and directly use minipal APIs in both coreclr and mono. But as it stands, all native code in runtime repo is using the same minipal APIs for utf16<->utf8 conversion.

ps [lowkey]: there is a room to vectorize that minipal/utf8 C implementation, matching the C# one, to accelerate both runtimes paths which frequently convert strings.

EgorBo · 2023-07-24T19:02:00Z

I'm no expert in cmake, I tried to add utf8.c from minipal to jit sources and I've got:

  C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Tools\MSVC\14.37.32820\include\yvals_core.h(28): STL1003: Unexpected compiler, expected C++ compiler.
  C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Tools\MSVC\14.37.32820\include\yvals_core.h(29): fatal error C1189: #error:  Error in C++ Standard Library usage

🙁 how do I include a .c file to C++ codebase?

am11 · 2023-07-24T20:56:16Z

@EgorBo, seems like in "jit/static", we need to disable PCH for C sources. With this patch, build succeeds with VS 2022 (17.6.3):

diff --git a/src/coreclr/jit/CMakeLists.txt b/src/coreclr/jit/CMakeLists.txt
index d00b5b27fe5..daaf1aa1e38 100644
--- a/src/coreclr/jit/CMakeLists.txt
+++ b/src/coreclr/jit/CMakeLists.txt
@@ -176,6 +176,7 @@ set( JIT_SOURCES
   unwind.cpp
   utils.cpp
   valuenum.cpp
+  ${CLR_SRC_NATIVE_DIR}/minipal/utf8.c
 )

 if (CLR_CMAKE_TARGET_WIN32)
diff --git a/src/coreclr/jit/compiler.h b/src/coreclr/jit/compiler.h
index e752044acf8..38664a91e66 100644
--- a/src/coreclr/jit/compiler.h
+++ b/src/coreclr/jit/compiler.h
@@ -48,6 +48,8 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 #include "disasm.h"
 #endif

+#include <minipal/utf8.h>
+
 #include "codegeninterface.h"
 #include "regset.h"
 #include "jitgcinfo.h"
@@ -136,6 +138,7 @@ const int BAD_STK_OFFS = 0xBAADF00D; // for LclVarDsc::lvStkOffs
 //------------------------------------------------------------------------
 inline bool IsHfa(CorInfoHFAElemType kind)
 {
+    size_t test = minipal_get_length_utf8_to_utf16 (0, 0, 0);
     return kind != CORINFO_HFA_ELEM_NONE;
 }
 inline var_types HfaTypeFromElemKind(CorInfoHFAElemType kind)
diff --git a/src/coreclr/jit/static/CMakeLists.txt b/src/coreclr/jit/static/CMakeLists.txt
index 99ae15963b5..5a8052765cc 100644
--- a/src/coreclr/jit/static/CMakeLists.txt
+++ b/src/coreclr/jit/static/CMakeLists.txt
@@ -2,6 +2,10 @@ project(ClrJit)

 set_source_files_properties(${JIT_EXPORTS_FILE} PROPERTIES GENERATED TRUE)

+if(CLR_CMAKE_HOST_WIN32)
+set_source_files_properties("${CLR_SRC_NATIVE_DIR}/minipal/utf8.c" PROPERTIES SKIP_PRECOMPILE_HEADERS ON)
+endif(CLR_CMAKE_HOST_WIN32)
+
 add_library_clr(clrjit_obj
     OBJECT
     ${JIT_SOURCES}

jakobbotsch · 2023-07-24T21:04:50Z

Can we just stick to WszWideCharToMultiByte that is already used in other places in the JIT? We can convert all of them to use minipal as a separate PR.

EgorBo · 2023-07-25T00:38:42Z

@EgorBo, seems like in "jit/static", we need to disable PCH for C sources. With this patch, build succeeds with VS 2022 (17.6.3):

diff --git a/src/coreclr/jit/CMakeLists.txt b/src/coreclr/jit/CMakeLists.txt
index d00b5b27fe5..daaf1aa1e38 100644
--- a/src/coreclr/jit/CMakeLists.txt
+++ b/src/coreclr/jit/CMakeLists.txt
@@ -176,6 +176,7 @@ set( JIT_SOURCES
   unwind.cpp
   utils.cpp
   valuenum.cpp
+  ${CLR_SRC_NATIVE_DIR}/minipal/utf8.c
 )

 if (CLR_CMAKE_TARGET_WIN32)
diff --git a/src/coreclr/jit/compiler.h b/src/coreclr/jit/compiler.h
index e752044acf8..38664a91e66 100644
--- a/src/coreclr/jit/compiler.h
+++ b/src/coreclr/jit/compiler.h
@@ -48,6 +48,8 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 #include "disasm.h"
 #endif

+#include <minipal/utf8.h>
+
 #include "codegeninterface.h"
 #include "regset.h"
 #include "jitgcinfo.h"
@@ -136,6 +138,7 @@ const int BAD_STK_OFFS = 0xBAADF00D; // for LclVarDsc::lvStkOffs
 //------------------------------------------------------------------------
 inline bool IsHfa(CorInfoHFAElemType kind)
 {
+    size_t test = minipal_get_length_utf8_to_utf16 (0, 0, 0);
     return kind != CORINFO_HFA_ELEM_NONE;
 }
 inline var_types HfaTypeFromElemKind(CorInfoHFAElemType kind)
diff --git a/src/coreclr/jit/static/CMakeLists.txt b/src/coreclr/jit/static/CMakeLists.txt
index 99ae15963b5..5a8052765cc 100644
--- a/src/coreclr/jit/static/CMakeLists.txt
+++ b/src/coreclr/jit/static/CMakeLists.txt
@@ -2,6 +2,10 @@ project(ClrJit)

 set_source_files_properties(${JIT_EXPORTS_FILE} PROPERTIES GENERATED TRUE)

+if(CLR_CMAKE_HOST_WIN32)
+set_source_files_properties("${CLR_SRC_NATIVE_DIR}/minipal/utf8.c" PROPERTIES SKIP_PRECOMPILE_HEADERS ON)
+endif(CLR_CMAKE_HOST_WIN32)
+
 add_library_clr(clrjit_obj
     OBJECT
     ${JIT_SOURCES}

Thank you for educting me on this part! Agree on the point that since we already use it I'll delay that to a separate pr

jkotas · 2023-07-25T01:08:17Z

Thanks, it seems that it still returns the same value as the BCL version

Do we need tests for this?

Handling of invalid UTF8/UTF16 sequences is always a mess. I would be surprised if there are no subtle behavior differences between the managed Utf8 encoding implementation, the minipal Utf8 encoding implementation, and the Windows OS provided Utf8 encoding implementation that WszWideCharToMultiByte ends up calling on Windows.

jkotas · 2023-07-25T01:14:48Z

how do I include a .c file to C++ codebase?

I think you would want to add utf8.c to https://github.com/dotnet/runtime/blob/main/src/coreclr/minipal/Windows/CMakeLists.txt. (Adding it to JIT sources would collide with utf8.c that is part of Win32 PAL that the JIT still links with.)

Can we just stick to WszWideCharToMultiByte that is already used in other places in the JIT? We can convert all of them to use minipal as a separate PR.

WszWideCharToMultiByte has different underlying implementation on Windows vs. Unix, so it will expose you to OS-specific bugs. The existing uses of WszWideCharToMultiByte do not influence codegen directly. Something to consider.

am11 · 2023-07-25T05:21:04Z

I would be surprised if there are no subtle behavior differences between the managed Utf8 encoding implementation, the minipal Utf8 encoding implementation, and the Windows OS provided Utf8 encoding implementation that WszWideCharToMultiByte ends up calling on Windows.

Mono uses giconv->minipal on both Windows and Unix alike. Perhaps this is something we can consider for coreclr in the future, as a step to mitigate these differences?

Handling of invalid characters between minipal and managed implementation is pretty much the same (if not exactly the same). There are PAL and COM tests exercising the invalid codepoints; one slightest change in behavior and they fail. (during C++ to C conversion, I had to debug them quite a bit to achieve the precise behavior the tests were expecting; so if there still are differences in error modes, they should be minimum as a result of oversight / test gap)

…ascii

jkotas · 2023-07-25T11:11:41Z

To fix the stdbool build break, add a dummy stdbool.h to runtime\src\coreclr\pal\inc\rt\cpp, similar to other std*.h that are already there.

EgorBo · 2023-07-25T13:13:00Z

To fix the stdbool build break, add a dummy stdbool.h to runtime\src\coreclr\pal\inc\rt\cpp, similar to other std*.h that are already there.

Thanks, that works!

EgorBo · 2023-07-25T16:34:47Z

@jkotas does it look good now?

jkotas

LGTM

Use WszWideCharToMultiByte in fgVNBasedIntrinsicExpansionForCall_Read…

d7d9fe8

…Utf8

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 24, 2023

ghost assigned EgorBo Jul 24, 2023

jakobbotsch reviewed Jul 24, 2023

View reviewed changes

src/coreclr/jit/helperexpansion.cpp Show resolved Hide resolved

jakobbotsch approved these changes Jul 24, 2023

View reviewed changes

EgorBo marked this pull request as ready for review July 24, 2023 13:46

stephentoub reviewed Jul 24, 2023

View reviewed changes

EgorBo added 5 commits July 25, 2023 10:46

Move to minipal

a280163

Merge branch 'main' of github.com:dotnet/runtime into utf16-utf8-non-…

43526a6

…ascii

Add tests

3bc4bdd

Fix unix build

28e8cd0

revert

f94457d

Add dummy stdbool.h

4428f38

copy to cpp folder

ef0699e

EgorBo added 2 commits July 25, 2023 16:02

Make formatter happy

7aea0dc

Make gcc happy

3f01263

jkotas approved these changes Jul 25, 2023

View reviewed changes

EgorBo merged commit c33b304 into dotnet:main Jul 25, 2023
135 checks passed

EgorBo deleted the utf16-utf8-non-ascii branch July 25, 2023 17:59

ghost locked as resolved and limited conversation to collaborators Aug 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support non-ascii in fgVNBasedIntrinsicExpansionForCall_ReadUtf8 #89383

Support non-ascii in fgVNBasedIntrinsicExpansionForCall_ReadUtf8 #89383

EgorBo commented Jul 24, 2023

ghost commented Jul 24, 2023

stephentoub left a comment

EgorBo commented Jul 24, 2023

stephentoub commented Jul 24, 2023

EgorBo commented Jul 24, 2023

stephentoub commented Jul 24, 2023

EgorBo commented Jul 24, 2023 •

edited

Loading

am11 commented Jul 24, 2023 •

edited

Loading

EgorBo commented Jul 24, 2023

am11 commented Jul 24, 2023

jakobbotsch commented Jul 24, 2023

EgorBo commented Jul 25, 2023

jkotas commented Jul 25, 2023

jkotas commented Jul 25, 2023

am11 commented Jul 25, 2023

jkotas commented Jul 25, 2023

EgorBo commented Jul 25, 2023

EgorBo commented Jul 25, 2023

jkotas left a comment

Support non-ascii in fgVNBasedIntrinsicExpansionForCall_ReadUtf8 #89383

Support non-ascii in fgVNBasedIntrinsicExpansionForCall_ReadUtf8 #89383

Conversation

EgorBo commented Jul 24, 2023

ghost commented Jul 24, 2023

stephentoub left a comment

Choose a reason for hiding this comment

EgorBo commented Jul 24, 2023

stephentoub commented Jul 24, 2023

EgorBo commented Jul 24, 2023

stephentoub commented Jul 24, 2023

EgorBo commented Jul 24, 2023 • edited Loading

am11 commented Jul 24, 2023 • edited Loading

EgorBo commented Jul 24, 2023

am11 commented Jul 24, 2023

jakobbotsch commented Jul 24, 2023

EgorBo commented Jul 25, 2023

jkotas commented Jul 25, 2023

jkotas commented Jul 25, 2023

am11 commented Jul 25, 2023

jkotas commented Jul 25, 2023

EgorBo commented Jul 25, 2023

EgorBo commented Jul 25, 2023

jkotas left a comment

Choose a reason for hiding this comment

EgorBo commented Jul 24, 2023 •

edited

Loading

am11 commented Jul 24, 2023 •

edited

Loading