JIT: Recognize 'bt' bit test idiom #72986
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Description

I've recently been developing a chess engine in C# (.NET 6), StockNemo. When I analyzed the code, RyuJIT was generating assembly far more complex than one would assume it should be. So I decided to compare it with GCC's C++ compiler (with the -O3 flag to ensure proper optimization, which I imagine is the equivalent of dotnet's Release configuration), and it turns out I was right.

Consider the following code:

    readonly ulong Internal = 0x003;
    bool GetSetBit(int i) => (Internal >> i & 1UL) == 1UL;

RyuJIT generates the following assembly for the method GetSetBit in release configuration:

    mov rax, qword ptr [rdi+8]
    mov ecx, esi
    shr rax, cl
    test al, 1
    setne al
    movzx rax, al
    ret

The similar code in C++ looks like this:

    unsigned long long internal = 0x003;
    bool get_set_bit(int i)
    {
        return (internal >> i & 1ULL) == 1ULL;
    }

GCC 12.1 x86-64 generates the following assembly for the method get_set_bit with the -O3 argument:

    mov rax, QWORD PTR internal[rip]
    bt rax, rdi
    setc al
    ret

As one can see, the GCC-generated assembly is better. There is a way to get RyuJIT to emit assembly that is the same as, or nearly as simple and fast as, the C++ output, and that's by arranging the method like so, with its C++ counterpart below:

    bool GetSetBit(int i)
    {
        byte value = (byte)(Internal >> i & 1UL);
        return Unsafe.As<byte, bool>(ref value);
    }

    typedef int boolean;
    #define true 1
    #define false 0
    boolean get_set_bit(int i)
    {
        return internal >> i & 1ULL;
    }

The generated assembly for this by RyuJIT is:

    mov rax, qword ptr [rdi+8]
    mov ecx, esi
    shr rax, cl
    and eax, 1
    ret

...and by GCC:

    mov rax, QWORD PTR internal[rip]
    mov ecx, edi
    shr rax, cl
    and eax, 1
    ret

This is just one of many functions for which RyuJIT generates much more complicated assembly than GCC does. When micro-optimization is necessary (in chess engines, it is), the generated assembly needs to be just as performant. That is not the case by default here; one has to restructure the code to get the same result. Many times, due to missing language features, this just isn't possible.

I'm not trying to shame or undermine the work done on RyuJIT; I'm requesting better code understanding and generation. I love the C# language (which is why I chose to do the project in C# despite knowing C++), and I wish for the code to be as fast as C++ (or, if possible, faster).
|
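For readers unfamiliar with the instruction in the title, here is a minimal C++ sketch of what `bt` computes in its register form, next to a model of the C# expression from the issue. The function names `bt64` and `get_set_bit_csharp` are made up for illustration. The sketch assumes the documented behaviour that `bt` with a 64-bit register operand takes the bit offset modulo 64, and that C# masks the shift count of a `ulong` shift to its low 6 bits.

```cpp
#include <cstdint>

// Model of `bt r64, r64`: the carry flag receives the selected bit.
// With register operands the bit offset is taken modulo 64.
inline bool bt64(std::uint64_t value, std::uint64_t bit_offset) {
    return ((value >> (bit_offset & 63)) & 1ULL) != 0;
}

// Model of the C# expression `(Internal >> i & 1UL) == 1UL`.
// Assumption: for a ulong left operand, C# uses only the low 6 bits
// of the shift count, so the count is masked explicitly here.
inline bool get_set_bit_csharp(std::uint64_t internal_value, std::int32_t i) {
    std::uint32_t count = static_cast<std::uint32_t>(i) & 63u;
    return ((internal_value >> count) & 1ULL) == 1ULL;
}
```

Under those assumptions the two models agree for every index, including out-of-range ones, which is what would make a single `bt` a valid replacement for the shift, mask, and compare sequence.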
Probably the best thing here is to open separate issues for each category of suboptimal code gen you encounter. |
The GCC output uses |
Thanks for the suggestion. I agree this may be the best way forward, and I shall do that. |
Note that the following pattern is properly recognized:

    static bool M(int x, int y) => (x & (1 << y)) != 0;

    C.M(Int32, Int32)
        L0000: bt ecx, edx
        L0003: setb al
        L0006: movzx eax, al
        L0009: ret
|
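The two spellings of the bit test that appear in this thread (shift the value right and mask with 1, or shift a 1 left and mask the value) give the same answer for in-range indices. A minimal C++ sketch of that equivalence follows; the function names are hypothetical, and the index is restricted to 0..63 because C++ leaves larger shifts of a 64-bit value undefined.

```cpp
#include <cstdint>

// Form used in the issue: move bit i down to position 0, then mask it.
inline bool test_bit_shift_right(std::uint64_t value, unsigned i) {
    return ((value >> i) & 1ULL) == 1ULL;
}

// Form reported as recognized: build a one-bit mask, then AND with the value.
inline bool test_bit_mask(std::uint64_t value, unsigned i) {
    return (value & (1ULL << i)) != 0;
}

// Compare both forms for every valid bit index of a 64-bit value.
inline bool forms_agree(std::uint64_t value) {
    for (unsigned i = 0; i < 64; ++i) {
        if (test_bit_shift_right(value, i) != test_bit_mask(value, i)) {
            return false;
        }
    }
    return true;
}
```

Since both forms express the same test, recognizing only one of them is a limitation of pattern matching rather than a difference in meaning.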
It seems this is only possible with integers. When translating the code to the same specifications as the issue documentation, it fails:

    readonly ulong Internal = 0x003;
    bool M(int x) => (Internal & (ulong)(1 << x)) != 0UL;

    mov eax, 1
    mov ecx, esi
    shl eax, cl
    movsxd rax, eax
    test qword ptr [rdi+8], rax
    setne al
    movzx rax, al
    ret
|
Looks like that's the x86 disassembly, it should work if you change to x64: Sharplab

    C.M3(UInt64, Int32)
        L0000: bt rcx, rdx
        L0004: setb al
        L0007: movzx eax, al
        L000a: ret

Edit: it won't work if you cast the shift apparently (
|
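A possible explanation for why the cast version is not turned into `bt`: `(ulong)(1 << x)` is not the same mask as `1UL << x`. The C++ sketch below emulates the C# semantics as I understand them, namely that a shift of an `int` uses only the low 5 bits of the count and that converting the `int` result to `ulong` sign-extends (the `movsxd` in the listing above). The function names are hypothetical.

```cpp
#include <cstdint>

// Emulates C#'s `(ulong)(1 << x)`: a 32-bit shift with the count masked
// to 5 bits, followed by sign extension of the 32-bit result to 64 bits.
inline std::uint64_t mask_shift_then_widen(unsigned x) {
    std::uint32_t bits = 1u << (x & 31u);
    // Reinterpret the 32-bit pattern as a signed value using plain arithmetic.
    std::int64_t widened = static_cast<std::int64_t>(bits);
    if (bits & 0x80000000u) {
        widened -= (1LL << 32);
    }
    return static_cast<std::uint64_t>(widened);
}

// Emulates C#'s `1UL << x`: a 64-bit shift with the count masked to 6 bits.
inline std::uint64_t mask_wide_shift(unsigned x) {
    return 1ULL << (x & 63u);
}
```

The two masks match for x from 0 to 30 and diverge from 31 onwards: at 31 the sign extension sets the upper 32 bits, and at 32 the narrow shift wraps around to bit 0. Under these assumptions a compiler cannot substitute a single-bit test without changing results.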
Indeed that works. However, I still question the necessity of the |
.NET semantics are different. The return value is always "widened" to a stack type. So, the jit will always ensure that upper bytes of small return values are properly cleared/set. |