Optimize "X == Y" to "(X ^ Y) == 0" for SIMD #93818
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details
static bool Foo(string s) =>
    s == "12345678123456781234567812345678";

; Assembly listing for method Program:Foo(System.String):ubyte (FullOpts)
vzeroupper
test rcx, rcx
je SHORT G_M28968_IG05
cmp dword ptr [rcx+0x08], 32
jne SHORT G_M28968_IG05
vmovups zmm0, zmmword ptr [rcx+0x0C]
- vpxorq zmm0, zmm0, zmmword ptr [reloc @RWD00]
- vptestmq k1, zmm0, zmm0
+ vpcmpeqq k1, zmm0, zmmword ptr [reloc @RWD00]
kortestb k1, k1
- sete al
+ setb al
movzx rax, al
jmp SHORT G_M28968_IG06
G_M28968_IG05:
xor eax, eax
G_M28968_IG06:
vzeroupper
ret
-; Total bytes of code 58
+; Total bytes of code 52
Force-pushed from d6bc8af to ea7c181
Co-authored-by: Jakob Botsch Nielsen <[email protected]>
I'm not sure this is "better". For example, […] Given that, I'd probably lean towards it being better to just keep things "as is" here and only do the […]

Also notably, here are the timings for the instructions involved here. Which shows that, at least theoretically, […]
Once you switched vector equality to use kortest even for XMM/YMM, it regressed my benchmarks on AMD; we don't have AVX512 hardware, so it was never reflected in our perflab infra as regressions. As you can see from my godbolt link, native compilers also don't use it.
I think you forgot to include RThroughput here. Also, it's smaller in terms of code size and hence potentially better. And finally, here are the uiCA stats for these on Tiger Lake (pretty much the same for others):
Benchmark:

static Benchmarks()
{
arr1 = (byte*)NativeMemory.AlignedAlloc(8 * 1024, 64);
arr2 = (byte*)NativeMemory.AlignedAlloc(8 * 1024, 64);
}
static byte* arr1;
static byte* arr2;
[Benchmark]
public bool VectorEquals()
{
ref byte a = ref Unsafe.AsRef<byte>(arr1);
ref byte b = ref Unsafe.AsRef<byte>(arr2);
    for (nuint i = 0; i < 1024; i += 16)
{
if (Vector128.LoadUnsafe(ref a, i) != Vector128.LoadUnsafe(ref b, i))
return false;
}
return true;
}
Ryzen 7950X; I'll find some modern Intel to check there, but I suspect the same results. The difference is quite noticeable.
Can you also check for YMM? Likewise, what happens in register-heavy code where we end up needing to pick XMM16-XMM31 to avoid spilling?
Ah, diffs are empty if I disable it for non-AVX512 hardware (presumably most of our SPMI collections either don't have AVX512, or it's ignored because of throttling issues), so going to close for now. We need some better coverage for AVX512 in SPMI.
Matches native compilers now https://godbolt.org/z/afaE18saG
Quick example: