Support non-ascii in fgVNBasedIntrinsicExpansionForCall_ReadUtf8 #89383
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
|
Thanks :)
Wrote a test to validate |
What does |
Can you provide an example - I'm a bit confused on what's expected to be done with it |
For example, this will throw an exception: new UTF8Encoding(false, throwOnInvalidBytes: true).GetBytes("\uD800b"), because \uD800 is an unpaired high surrogate. Ideally, the JIT simply wouldn't do anything if it detected that the text was invalid and would just use the actual implementation of ReadUtf8, which would let UTF8Encoding do with it what it will; you wouldn't need to match the handling of erroneous input. |
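A minimal sketch of the "detect invalid text and bail out" idea from the comment above, written as standalone C++. `IsWellFormedUtf16` is a hypothetical helper name, not an existing JIT or minipal function; it only checks that every surrogate code unit is part of a proper high/low pair, which is the condition that makes "\uD800b" invalid.

```cpp
#include <cstddef>

// Hypothetical helper (an assumption for illustration, not JIT code):
// returns true when the UTF-16 buffer contains no unpaired surrogates.
bool IsWellFormedUtf16(const char16_t* str, size_t length)
{
    for (size_t i = 0; i < length; i++)
    {
        const char16_t ch = str[i];
        if (ch >= 0xD800 && ch <= 0xDBFF)
        {
            // A high surrogate must be followed by a low surrogate.
            if (i + 1 >= length || str[i + 1] < 0xDC00 || str[i + 1] > 0xDFFF)
            {
                return false;
            }
            i++; // skip the low surrogate that completes the pair
        }
        else if (ch >= 0xDC00 && ch <= 0xDFFF)
        {
            // A low surrogate with no preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

Under this sketch, an optimization could skip the constant-folding path whenever the check returns false and leave the call to the regular managed implementation.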
Thanks, it seems that it still returns the same value as the BCL version, there is a I was also pointed to a minipal version for utf16-utf8 encoding ( |
PS (low-key): there is room to vectorize that minipal/utf8 C implementation, matching the C# one, to accelerate the paths in both runtimes that frequently convert strings. |
I'm no expert in CMake; I tried to add
🙁 How do I include a .c file in a C++ codebase? |
@EgorBo, it seems that in "jit/static" we need to disable PCH for C sources. With this patch, the build succeeds with VS 2022 (17.6.3):
diff --git a/src/coreclr/jit/CMakeLists.txt b/src/coreclr/jit/CMakeLists.txt
index d00b5b27fe5..daaf1aa1e38 100644
--- a/src/coreclr/jit/CMakeLists.txt
+++ b/src/coreclr/jit/CMakeLists.txt
@@ -176,6 +176,7 @@ set( JIT_SOURCES
unwind.cpp
utils.cpp
valuenum.cpp
+ ${CLR_SRC_NATIVE_DIR}/minipal/utf8.c
)
if (CLR_CMAKE_TARGET_WIN32)
diff --git a/src/coreclr/jit/compiler.h b/src/coreclr/jit/compiler.h
index e752044acf8..38664a91e66 100644
--- a/src/coreclr/jit/compiler.h
+++ b/src/coreclr/jit/compiler.h
@@ -48,6 +48,8 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
#include "disasm.h"
#endif
+#include <minipal/utf8.h>
+
#include "codegeninterface.h"
#include "regset.h"
#include "jitgcinfo.h"
@@ -136,6 +138,7 @@ const int BAD_STK_OFFS = 0xBAADF00D; // for LclVarDsc::lvStkOffs
//------------------------------------------------------------------------
inline bool IsHfa(CorInfoHFAElemType kind)
{
+ size_t test = minipal_get_length_utf8_to_utf16 (0, 0, 0);
return kind != CORINFO_HFA_ELEM_NONE;
}
inline var_types HfaTypeFromElemKind(CorInfoHFAElemType kind)
diff --git a/src/coreclr/jit/static/CMakeLists.txt b/src/coreclr/jit/static/CMakeLists.txt
index 99ae15963b5..5a8052765cc 100644
--- a/src/coreclr/jit/static/CMakeLists.txt
+++ b/src/coreclr/jit/static/CMakeLists.txt
@@ -2,6 +2,10 @@ project(ClrJit)
set_source_files_properties(${JIT_EXPORTS_FILE} PROPERTIES GENERATED TRUE)
+if(CLR_CMAKE_HOST_WIN32)
+set_source_files_properties("${CLR_SRC_NATIVE_DIR}/minipal/utf8.c" PROPERTIES SKIP_PRECOMPILE_HEADERS ON)
+endif(CLR_CMAKE_HOST_WIN32)
+
add_library_clr(clrjit_obj
OBJECT
${JIT_SOURCES} |
Can we just stick to |
Thank you for educating me on this part! Agreed: since we already use it, I'll defer that to a separate PR. |
Do we need tests for this? Handling of invalid UTF8/UTF16 sequences is always a mess. I would be surprised if there are no subtle behavior differences between the managed Utf8 encoding implementation, the minipal Utf8 encoding implementation, and the Windows OS provided Utf8 encoding implementation that WszWideCharToMultiByte ends up calling on Windows. |
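To make the concern about invalid sequences concrete, here is a small reference converter that a test could compare other implementations against. It is a sketch under stated assumptions: `Utf16ToUtf8Replacing` is a hypothetical name, and the policy of turning each unpaired surrogate into U+FFFD is chosen for illustration; whether the managed, minipal, and Windows converters all agree with that policy is exactly the open question raised above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical reference converter (not minipal or JIT code): encodes UTF-16
// as UTF-8 and replaces every unpaired surrogate with U+FFFD.
std::string Utf16ToUtf8Replacing(const char16_t* str, size_t length)
{
    std::string out;
    for (size_t i = 0; i < length; i++)
    {
        char32_t cp = str[i];
        const bool isHigh = (cp >= 0xD800 && cp <= 0xDBFF);
        if (isHigh && i + 1 < length && str[i + 1] >= 0xDC00 && str[i + 1] <= 0xDFFF)
        {
            // Combine a valid surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (str[i + 1] - 0xDC00);
            i++;
        }
        else if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            cp = 0xFFFD; // unpaired surrogate: substitute the replacement character
        }

        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A test along these lines would feed the same inputs (lone high surrogate, lone low surrogate, truncated pair at the end of the string) to each implementation and flag any byte-level difference.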
I think you would want to add utf8.c to https://github.com/dotnet/runtime/blob/main/src/coreclr/minipal/Windows/CMakeLists.txt. (Adding it to JIT sources would collide with utf8.c that is part of Win32 PAL that the JIT still links with.)
WszWideCharToMultiByte has a different underlying implementation on Windows vs. Unix, so it will expose you to OS-specific bugs. The existing uses of WszWideCharToMultiByte do not influence codegen directly. Something to consider. |
Mono uses giconv->minipal on both Windows and Unix alike. Perhaps this is something we can consider for coreclr in the future, as a step to mitigate these differences? Handling of invalid characters in the minipal and managed implementations is pretty much the same (if not exactly the same). There are PAL and COM tests exercising the invalid code points; the slightest change in behavior and they fail. (During the C++ to C conversion, I had to debug them quite a bit to achieve the precise behavior the tests were expecting, so if there are still differences in error modes, they should be minimal and the result of an oversight or test gap.) |
To fix the stdbool build break, add a dummy stdbool.h to runtime\src\coreclr\pal\inc\rt\cpp, similar to other |
Thanks, that works! |
@jkotas does it look good now? |
LGTM
Jakob pointed me to the WszWideCharToMultiByte API, which seems to be enough to support non-ASCII values here too.
Closes #89375