
JIT: Bad codegen with Avx512DQ.VL.Range #106610

Closed
jakobbotsch opened this issue Aug 19, 2024 · 9 comments · Fixed by #106711
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), in-pr (There is an active PR which will close this issue when it is merged)
Milestone: 9.0.0
@jakobbotsch (Member):

// Generated by Fuzzlyn v2.2 on 2024-08-17 17:40:06
// Run on X86 Windows
// Seed: 1343518557351353159-vectort,vector128,vector256,vector512,x86aes,x86avx,x86avx2,x86avx512bw,x86avx512bwvl,x86avx512cd,x86avx512cdvl,x86avx512dq,x86avx512dqvl,x86avx512f,x86avx512fvl,x86avx512vbmi,x86avx512vbmivl,x86bmi1,x86bmi2,x86fma,x86lzcnt,x86pclmulqdq,x86popcnt,x86sse,x86sse2,x86sse3,x86sse41,x86sse42,x86ssse3,x86x86base
// Reduced from 171.2 KiB to 0.6 KiB in 00:06:37
// Debug: Outputs <4292870144, 0, 0, 0, 0, 0, 0, 0>
// Release: Outputs <0, 0, 0, 0, 0, 0, 0, 0>
using System;
using System.Runtime.CompilerServices;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public class C1
{
    public Vector256<float> F5;
    public C1(Vector256<float> f5)
    {
        F5 = f5;
    }
}

public class Program
{
    public static void Main()
    {
        var vr4 = Vector256.Create<float>(0);
        var vr5 = Vector256.CreateScalar(1f);
        var vr6 = Vector256.CreateScalar(-10f);
        var vr7 = Avx.Or(vr5, vr6);
        C1 vr8 = new C1(Avx512DQ.VL.Range(vr4, vr7, 0));
        System.Console.WriteLine(vr8.F5.AsUInt32());
    }
}
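For context on the Debug output above: a quick bit-arithmetic check (Python, illustrative only, outside the repro) shows that folding the `Or` of the two scalar constants yields an sNaN bit pattern, and setting its quiet bit yields exactly the value the Debug run prints.

```python
one = 0x3F800000    # bit pattern of 1.0f
neg10 = 0xC1200000  # bit pattern of -10.0f

folded = one | neg10
print(hex(folded))  # 0xffa00000, an sNaN bit pattern

quieted = folded | 0x00400000  # set the quiet bit -> qNaN
print(quieted)                 # 4292870144, the value the Debug run prints
```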

cc @dotnet/jit-contrib

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 19, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Aug 19, 2024
@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Aug 19, 2024
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Aug 19, 2024
@AndyAyersMS (Member):

Preview 7 is ok

G_M000_IG01:                ;; offset=0x0000
       sub      esp, 32

G_M000_IG02:                ;; offset=0x0003
       vxorps   ymm0, ymm0, ymm0
       vrangeps ymm0, ymm0, ymmword ptr [@RWD00], 0
       vmovups  ymmword ptr [esp], ymm0
       mov      ecx, 0x8B3C8AC
       call     CORINFO_HELP_NEWSFAST
       vmovups  ymm0, ymmword ptr [esp]
       vmovups  ymmword ptr [eax+0x04], ymm0
       mov      ecx, eax
       call     [System.Console:WriteLine(System.Object)]

G_M000_IG03:                ;; offset=0x0033
       vzeroupper
       add      esp, 32
       ret

RWD00   dq      00000000FFA00000h, 0000000000000000h, 0000000000000000h, 0000000000000000h

but recent builds are not

G_M27646_IG01:  ;; offset=0x0000
       sub      esp, 32
                                                ;; size=3 bbWeight=1 PerfScore 0.25
G_M27646_IG02:  ;; offset=0x0003
       vxorps   ymm0, ymm0, ymm0
       vrangeps ymm0, ymm0, ymmword ptr [@RWD00], 0
       vmovups  ymmword ptr [esp], ymm0
       mov      ecx, 0xB6632E0      ; System.Runtime.Intrinsics.Vector256`1[uint]
       call     CORINFO_HELP_NEWSFAST
       vmovups  ymm0, ymmword ptr [esp]
       vmovups  ymmword ptr [eax+0x04], ymm0
       mov      ecx, eax
       call     [System.Console:WriteLine(System.Object)]
                                                ;; size=48 bbWeight=1 PerfScore 16.83
G_M27646_IG03:  ;; offset=0x0033
       vzeroupper
       add      esp, 32
       ret
                                                ;; size=7 bbWeight=1 PerfScore 2.25
RWD00   dq      00000000FFE00000h, 0000000000000000h, 0000000000000000h, 0000000000000000h

Only diff is the value in RWD00.

@AndyAyersMS (Member):

***** BB01 [0000]
STMT00001 ( ??? ... 0x025 )
               [000009] DA---------                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V01 loc1         
               [000008] -----------                         \--*  HWINTRINSIC simd32 float Or
               [000003] -----------                            +--*  CNS_VEC   simd32<0x000000003f800000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000>
               [000005] -----------                            \--*  CNS_VEC   simd32<0x00000000c1200000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000>

becomes

               [000009] DA---------                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V01 loc1         
               [000005] -----+-----                         \--*  CNS_VEC   simd32<0x00000000ffe00000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000>

so somehow an extra bit gets set in the vector constant (FFE instead of FFA).

@AndyAyersMS (Member) commented Aug 20, 2024:

Looks like this is NaN normalization? 0xFFA00000 is a NaN float, and we normalize it to 0xFFE00000.

@tannergooding (Member):

Not quite NaN normalization. 0xFFA0_0000 is an sNaN (signaling NaN) because the most significant bit of its significand is zero. When the CPU encounters an sNaN it raises the "signal" (throwing an exception if IEEE 754 floating-point exception handling is enabled, which it never is for .NET), and it then suppresses the signal from being raised again by setting the bit that converts the value to a qNaN (quiet NaN).

[0xFFC0_0000, 0xFFFF_FFFF] is -qNaN
[0xFF80_0001, 0xFFBF_FFFF] is -sNaN
0xFF80_0000 is -inf

0x7F80_0000 is +inf
[0x7F80_0001, 0x7FBF_FFFF] is +sNaN
[0x7FC0_0000, 0x7FFF_FFFF] is +qNaN

In general we (RyuJIT) try and avoid any explicit normalization and instead try to explicitly propagate wherever applicable (that is, if a NaN is passed in, we return precisely that NaN back). The only places we explicitly produce canonical NaN is when neither input is a NaN (so -inf + inf produces NaN, this requires a net new NaN to be produced so we choose the canonical NaN of 0xFFC0_0000 on xarch and 0x7FC0_0000 on arm/arm64/loongarch64/riscv64).
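The bit-pattern ranges above can be checked mechanically. A minimal Python classifier (illustrative only, not RyuJIT code) that buckets a raw float32 bit pattern the same way:

```python
def classify_float32(bits):
    """Classify a 32-bit IEEE 754 single by its raw bit pattern."""
    sign = "-" if bits >> 31 else "+"
    exponent = (bits >> 23) & 0xFF
    significand = bits & 0x7FFFFF
    if exponent != 0xFF:
        return "finite"  # includes zeros and subnormals
    if significand == 0:
        return sign + "inf"
    # quiet bit = most significant bit of the significand
    return sign + ("qNaN" if significand & 0x400000 else "sNaN")

print(classify_float32(0xFFA00000))  # -sNaN: the folded constant
print(classify_float32(0xFFE00000))  # -qNaN: what the JIT emitted
```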

@tannergooding (Member):

So this is somewhat expected behavior, but it would still be interesting to find out what is causing the CPU to see it as an sNaN in this particular case. I expect there's some place where it's being seen as a scalar float; we could instead pass it through as a scalar uint32_t to ensure that operations like and, not, or, xor, left shift, and right shift don't exhibit such behavior.

@AndyAyersMS (Member):

The path we take is:

gtFoldExprHWIntrinsic

               [000008] ----------- >>>                     *  HWINTRINSIC simd32 float Or
               [000005] -----+-----                         +--*  CNS_VEC   simd32<0x00000000c1200000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000>
               [000003] -----+-----                         \--*  CNS_VEC   simd32<0x000000003f800000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000>

GenTreeVecCon::EvaluateBinaryInPlace
EvaluateBinarySimd<simd32_t>
EvaluateBinarySimd<simd32_t,float>
EvaluateBinaryScalar<float>(genTreeOps oper, float arg0, float arg1)
EvaluateBinaryScalarSpecialized<float>(genTreeOps oper, float arg0, float arg1)

uint32_t resultBits is 0xFFA0_0000

UInt32BitsToSingle produces float 0xFFE0_0000
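The quieting step in that last conversion can be reproduced outside the JIT: round-tripping the raw bits through a scalar float conversion quiets the sNaN. A sketch of the effect (assuming typical x86/arm64 hardware NaN behavior, not the JIT's actual code path):

```python
import struct

bits_in = 0xFFA00000  # the -sNaN produced by folding the Or
# unpack converts float32 -> double; on common hardware this quiets an sNaN
value = struct.unpack('<f', struct.pack('<I', bits_in))[0]
bits_out = struct.unpack('<I', struct.pack('<f', value))[0]
print(hex(bits_out))  # quiet bit is now set
```

On x86 and arm64 this prints 0xffe00000, matching the bad RWD00 constant.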

@tannergooding (Member):

👍, so we'd need to intercept this prior to extracting the scalar value out of the SIMD input (that is, prior to EvaluateBinaryScalar).

I think the simplest place to do that is EvaluateBinarySimd<simd32_t>: basically, check for a bitwise oper in the TYP_FLOAT/TYP_DOUBLE cases and dispatch instead to EvaluateBinarySimd<TSimd, int32_t/int64_t>: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simd.h#L1189-L1199

I can get a fix up for this.
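The proposed fix amounts to evaluating bitwise folds over same-width integer lanes so NaN payloads never pass through a scalar float. A rough Python sketch of the idea (hypothetical helper name, not the actual simd.h change):

```python
def fold_or_int_lanes(a_lanes, b_lanes):
    # Dispatch a float32 bitwise Or to uint32 lane arithmetic.
    # The result stays as raw bits, so sNaN payloads are preserved.
    return [(a | b) & 0xFFFFFFFF for a, b in zip(a_lanes, b_lanes)]

lanes = fold_or_int_lanes([0x3F800000] + [0] * 7,   # Vector256 of 1.0f in lane 0
                          [0xC1200000] + [0] * 7)   # Vector256 of -10.0f in lane 0
print(hex(lanes[0]))  # 0xffa00000: the sNaN bit pattern survives the fold
```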

@AndyAyersMS (Member):

Ok, I will reassign this one to you.
