Implement Vector{Size}<T>.AllBitsSet #33924

Gnbrkm41 · 2020-03-22T09:22:04Z

Resolves #30659

Note: I strongly advise you to ignore the changes made in c9bd4a9 "Auto-generate tests": they are automatically generated from the template and add a lot of lines of code.

Dotnet-GitSync-Bot · 2020-03-22T09:22:09Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Gnbrkm41 · 2020-03-22T09:22:16Z

cc @tannergooding

Gnbrkm41 · 2020-03-22T09:26:28Z

I believe we can use this in a couple places:

runtime/src/libraries/System.Collections/src/System/Collections/BitArray.cs

Lines 542 to 556 in 963b158

    
    if (Avx2.IsSupported)
    {
        Vector256<int> ones = Vector256.Create(-1);
        fixed (int* ptr = thisArray)
        {
            for (; i < count - (Vector256<int>.Count - 1); i += Vector256<int>.Count)
            {
                Vector256<int> vec = Avx.LoadVector256(ptr + i);
                Avx.Store(ptr + i, Avx2.Xor(vec, ones));
            }
        }
    }
    else if (Sse2.IsSupported)
    {
        Vector128<int> ones = Vector128.Create(-1);

Gnbrkm41 · 2020-03-22T14:09:09Z

All of the failures seem to be one of these two:

Assertion failed '!"Jump into middle of try region"' in 'JIT.HardwareIntrinsics.General.Program:Vector64AllBitsSet()' during 'Optimize layout' (IL size 50)

    File: F:\workspace\_work\1\s\src\coreclr\src\jit\flowgraph.cpp Line: 20738
    Image: C:\h\w\AC45098D\p\CoreRun.exe

https://dev.azure.com/dnceng/public/_build/results?buildId=568937&view=ms.vss-test-web.build-test-results-tab&runId=17857130&resultId=102291&paneView=debug

Assert failure(PID 1220 [0x000004c4], Thread: 5788 [0x169c]): Assertion failed 'fixedArity == 0' in 'JIT.HardwareIntrinsics.General.Program:AllBitsSetByte()' during 'Do value numbering' (IL size 38)

    File: F:\workspace\_work\1\s\src\coreclr\src\jit\valuenum.cpp Line: 8588

https://dev.azure.com/dnceng/public/_build/results?buildId=568937&view=ms.vss-test-web.build-test-results-tab&runId=17857130&resultId=102416&paneView=debug

gfoidl · 2020-03-22T14:30:29Z

Vector256<int> ones = Vector256.Create(-1);

Would it be possible for the JIT to detect such cases and emit the cmpps?
If not, or in addition, it would be great to have an analyzer and code fix that changes this to Vector256<int>.AllBitsSet.

Gnbrkm41 · 2020-03-22T15:12:59Z

~~Weird that I do not reproduce the test failures on my local machine :^(~~ Just as I said that, I had one happen.

Gnbrkm41 · 2020-03-22T16:58:50Z

So, it appears that this is happening because methods that can have different instructions depending on types need an extra VNF_SimdType arg:

runtime/src/coreclr/src/jit/hwintrinsic.cpp

Line 284 in 24c4cb1

    
           // If we see two (or more) different instructions we need the extra VNF_SimdType arg

What does this really mean?

tannergooding · 2020-03-22T17:44:37Z

So, it appears that this is happening because methods that can have different instructions depending on types need an extra VNF_SimdType arg

CC. @briansull who added the initial VN support for HWIntrinsics in #31834

tannergooding · 2020-03-22T17:44:56Z

Also CC. @echesakovMSFT and @CarolEidt

briansull · 2020-03-23T01:13:45Z

need an extra VNF_SimdType arg:

This prevents us from CSE-ing two different SIMD operations that would be implemented using different instructions.

I will investigate the assert and provide guidance

Assertion failed 'fixedArity == 0'

tannergooding · 2020-03-23T15:11:43Z

src/coreclr/src/jit/hwintrinsiccodegenxarch.cpp

+            else
+            {
+                assert(varTypeIsIntegral(baseType) || !compiler->compSupports(InstructionSet_AVX));
+                emit->emitIns_SIMD_R_R_R(ins, attr, targetReg, targetReg, targetReg);


What is used for the comparison for float/double when AVX isn't supported? Based on the instruction table, this is cmpps/cmppd still; but I don't think that is correct?

Ohhhh, good catch.

I wonder how we'd change it here, though? Hard-code the instruction here?

We would need to special-case one of the paths.

That's also possibly an interesting case for CSE... If the table gives a particular set of instructions but codegen can optionally special-case something further or treat it slightly differently, how should that be handled @briansull ?

The ValueNumber & CSE phase only uses the table to determine if the result type needs to be an input when generating the ValueNumber for the node.

If two trees have the same operation (i.e. GT_MUL or the same HW intrinsic) and all of their operands have identical value numbers, then normally we would give them the same value number.

For GT_CAST we incorporate the cast-to type as an extra operand to the value number.

I noticed that we also needed to do this for some SIMD and HW intrinsic nodes.
I determined that the safest and easiest way to do this was to examine the table of instructions and always incorporate the result type when there were two or more different valid instructions listed for a SIMD or HW intrinsic node.

This process is used for x86 and x64; I believe that we need to be more conservative on ARM64, so we always include an extra result type operand.

As long as the table has two or more different instructions we will be good.
It would be bad to list all the same instructions or an illegal instruction and then use hand-coded logic to decide on the instruction. If you want to record that an entry relies upon hand-coded logic, I would recommend using a brkpt or nop instruction as a marker for this behavior in the table. We can add a check for this and assume that different instructions could be generated. I don't think that it should matter if AVX is supported or not when deciding if we need an extra result type operand.
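The table-driven rule described above can be sketched roughly as follows. This is a simplified illustration, not the JIT's actual code: `InstructionRow`, `needsResultTypeOperand`, and `valueNumberKey` are hypothetical names, a string stands in for a real value number, and the instruction names used with it are only examples.

```cpp
#include <array>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for one row of the HW intrinsic table: one instruction
// name per base type, with an empty string marking an unsupported type.
using InstructionRow = std::array<std::string, 4>;

// True when the row lists two or more different valid instructions. In that
// case the result type has to become part of the value number.
bool needsResultTypeOperand(const InstructionRow& row)
{
    std::set<std::string> distinct;
    for (const std::string& ins : row)
    {
        if (!ins.empty())
        {
            distinct.insert(ins);
        }
    }
    return distinct.size() >= 2;
}

// Hypothetical value-number key. Two nodes with equal keys would be treated as
// the same value (and so become CSE candidates). The base type index is only
// appended when the table says different instructions may be generated.
std::string valueNumberKey(const std::string& intrinsic, const InstructionRow& row, std::size_t baseType)
{
    std::string key = intrinsic;
    if (needsResultTypeOperand(row))
    {
        key += "/type" + std::to_string(baseType);
    }
    return key;
}
```

With a row that lists the same instruction for every type, two nodes of different base types share a key; with a row that lists different instructions, the keys differ and the nodes are never merged.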

I don't think that it should matter if AVX is supported or not when deciding if we need an extra result type operand.

This would be a case where, without the VEX encoding (SSE through SSE4.2), all types would use the same instruction, but with the VEX encoding (AVX+), float/double would use different instructions that are more efficient.

Left a comment that float/double may use different instructions depending on the encoding available. Also hardcoded the instruction inside the integer / non-VEX path.
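The thread does not spell out why cmpps/cmppd would be wrong on the non-VEX path. One plausible reading, stated here as an assumption: comparing a register with itself only yields all ones if the comparison is true for every possible bit pattern, and a floating-point equality compare is false for NaN, whereas an integer equality compare is always true. The portable sketch below models a single 32-bit lane; the function names are hypothetical and it assumes `float` is IEEE 754 binary32.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Models one lane of a packed integer compare-equal: all ones when the bit
// patterns match, zero otherwise.
std::uint32_t intCompareEqualLane(std::uint32_t a, std::uint32_t b)
{
    return (a == b) ? 0xFFFFFFFFu : 0u;
}

// Models one lane of a packed floating-point compare-equal on the same bits.
// NaN never compares equal, not even to itself.
std::uint32_t floatCompareEqualLane(std::uint32_t aBits, std::uint32_t bBits)
{
    float a;
    float b;
    std::memcpy(&a, &aBits, sizeof a);
    std::memcpy(&b, &bBits, sizeof b);
    return (a == b) ? 0xFFFFFFFFu : 0u;
}
```

If the target register happens to hold a NaN pattern such as 0x7FC00000, the integer model still produces all ones while the floating-point model produces zero, which is consistent with hard-coding an integer compare on the non-VEX path.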

tannergooding · 2020-03-23T18:04:11Z

I will investigate the assert and provide guidance
Assertion failed 'fixedArity == 0'

I did some initial debugging and found that Vector128_Zero doesn't hit this issue as it has a single instruction (xorps) used by all types (and that looks to be the case for all other 0 arg intrinsics right now).

Vector128_AllBitSet has three different instructions and so it gets the additional VNF_SimdType arg and the arity becomes 1, failing the assert.

briansull · 2020-03-23T18:58:23Z

The fix is:

line 8569 in valuenum.cpp in function:
void Compiler::fgValueNumberHWIntrinsic(GenTree* tree)

    int      lookupNumArgs    = HWIntrinsicInfo::lookupNumArgs(hwIntrinsicNode->gtHWIntrinsicId);
    bool     encodeResultType = vnEncodesResultTypeForHWIntrinsic(hwIntrinsicNode->gtHWIntrinsicId);
    VNFunc   func             = GetVNFuncForNode(tree);

    ValueNumPair excSetPair = ValueNumStore::VNPForEmptyExcSet();
    ValueNumPair normalPair;
    ValueNumPair resvnp     = ValueNumPair();

    if (encodeResultType)
    {
        ValueNum vnSize = vnStore->VNForIntCon(hwIntrinsicNode->gtSIMDSize);
        ValueNum vnBaseType = vnStore->VNForIntCon(INT32(hwIntrinsicNode->gtSIMDBaseType));
        ValueNum simdTypeVN = vnStore->VNForFunc(TYP_REF, VNF_SimdType, vnSize, vnBaseType);
        resvnp.SetBoth(simdTypeVN);

#ifdef DEBUG
        if (verbose)
        {
            printf("    simdTypeVN is ");
            vnPrint(simdTypeVN, 1);
            printf("\n");
        }
#endif
    }

    // There are some HWINTRINSICS operations that have zero args, i.e.  NI_Vector128_Zero
    if (tree->AsOp()->gtOp1 == nullptr)
    {
        if (encodeResultType)
        {
            // There are zero arg HWINTRINSICS operations that encode the result type, i.e.  Vector128_AllBitSet 
            normalPair = vnStore->VNPairForFunc(tree->TypeGet(), func, resvnp);
            assert(vnStore->VNFuncArity(func) == 1);
        }
        else
        {
            normalPair = vnStore->VNPairForFunc(tree->TypeGet(), func);
            assert(vnStore->VNFuncArity(func) == 0);
        }

    }
    else if (tree->AsOp()->gtOp1->OperIs(GT_LIST) || (lookupNumArgs == -1))

Gnbrkm41 · 2020-03-24T17:37:46Z

*************** Starting PHASE Merge throw blocks

*************** In fgTailMergeThrows

*** Does not return call
               [000327] --CXG+------              *  CALL      void   System.ThrowHelper.ThrowNotSupportedException
               [000326] -----+------ arg0 in rcx  \--*  CNS_INT   int    63
    in BB04 is unique, marking it as canonical

*** Does not return call
               [000151] --CXG+------              *  CALL      void   System.ThrowHelper.ThrowNotSupportedException
               [000150] -----+------ arg0 in rcx  \--*  CNS_INT   int    63
    in BB02 can be dup'd to canonical BB04

*** found 1 merge candidates, rewriting flow

New Basic Block BB10 [0053] created.
*** BB01 now falling through to empty BB10 and then to BB04

*************** After fgTailMergeThrows(1 updates)

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1        [000..002)                                     i label target
BB10 [0053]  1       BB01                  1        [???..???)-> BB04 (always)                     internal
BB02 [0001]  0  0                          0        [002..00A)        (throw ) T0      try {       keep i try rare label gcsafe
BB03 [0019]  0  0                          1        [002..003)                 T0                  keep i label gcsafe
BB04 [0033]  2  0    BB03,BB10             0        [002..003)        (throw ) T0                  keep i rare label target gcsafe
BB05 [0046]  0  0                          1        [???..???)-> BB07 (always) T0      }           keep i internal label
BB07 [0003]  2       BB05,BB06             1        [00F..012)-> BB09 ( cond )                     i label target
BB08 [0004]  1       BB07                  0        [012..031)        (throw )                     i rare gcsafe newobj
BB09 [0005]  1       BB07                  1        [031..032)        (return)                     i label target
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ funclets follow
BB06 [0002]  1     0                       1        [00A..00F)-> BB07 ( cret )    H0 F catch { }   keep i label target flet
-----------------------------------------------------------------------------------------------------------------------------------------
*************** In fgDebugCheckBBlist
Jump into the middle of try region: BB10 branches to BB04

Assert failure(PID 11372 [0x00002c6c], Thread: 3996 [0x0f9c]): Assertion failed '!"Jump into middle of try region"' in 'JIT.HardwareIntrinsics.General.Program:Vector64AllBitsSet()' during 'Merge throw blocks' (IL size 50)

    File: C:\Users\gotos\source\repos\runtime\src\coreclr\src\jit\flowgraph.cpp Line: 20738
    Image: c:\users\gotos\source\repos\runtime\artifacts\tests\coreclr\windows_nt.x64.checked\tests\core_root\corerun.exe

Full JitDump

An interesting failure...

AndyAyersMS · 2020-03-24T17:40:32Z

@Gnbrkm41 that's a new phase I added recently, probably a missing safety check. Let me take a look.

AndyAyersMS · 2020-03-24T17:46:57Z

fgTailMergeThrowsFallThroughHelper needs to ensure that the new BB is in the same EH region as the "nonCanonicalBlock". It is not doing this properly if the nonCanonicalBlock is a try entry.

Might be simplest for now to disable throw helper merging in this case. Will keep looking.

AndyAyersMS · 2020-03-24T18:46:59Z

@Gnbrkm41 see if this patch fixes your failures. Haven't validated it yet -- trying to create a simple local repro, but no luck so far.

index 8ba6d2f8cc7..846677f482f 100644
--- a/src/coreclr/src/jit/flowgraph.cpp
+++ b/src/coreclr/src/jit/flowgraph.cpp
@@ -25869,6 +25869,14 @@ void Compiler::fgTailMergeThrows()
     // and there is less jumbled flow to sort out later.
     for (BasicBlock* block = fgLastBB; block != nullptr; block = block->bbPrev)
     {
+        // Workaround: don't consider try entry blocks as candidates
+        // for merging; if the canonical throw is later in the same try,
+        // we'll create invalid flow.
+        if ((block->bbFlags & BBF_TRY_BEG) != 0)
+        {
+            continue;
+        }
+
         // For throw helpers the block should have exactly one statement....
         // (this isn't guaranteed, but seems likely)
         Statement* stmt = block->firstStmt();

AndyAyersMS · 2020-03-24T19:46:41Z

Ok, I can repro now. I'll put up a fix.

Otherwise we may create a branch into the middle of a try. We could fix the transform, but if the first block of a try has a throw helper call, the rest of the try will subsequently be removed, so merging is not all that interesting. Addresses an issue that came up in dotnet#33924.


BruceForstall · 2020-03-30T17:15:35Z

Does this need an arm64 implementation?

tannergooding · 2020-03-30T17:35:03Z

Does this need an arm64 implementation?

It will need one for Vector64<T> and Vector128<T>. @TamarChristinaArm could you advise as to which instructions should be used to efficiently create a vector where all bits are set?
That is, should we do a set and duplicate or maybe a compare tgt, tgt, tgt like we do on x86, or maybe something else?

TamarChristinaArm · 2020-03-31T08:06:25Z

Does this need an arm64 implementation?

It will need one for Vector64<T> and Vector128<T>. @TamarChristinaArm could you advise as to which instructions should be used to efficiently create a vector where all bits are set?
That is, should we do a set and duplicate or maybe a compare tgt, tgt, tgt like we do on x86, or maybe something else?

@tannergooding your best bet is to use mvni or movi; both work in this case because the mask is simple:

mvni v0.4s, #0
movi v1.16b, #0xFF

will both give you a vector with all bits set.

mvni v0.2s, #0
movi v1.8b, #0xFF

for only the bottom half of the vector.
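To see why both forms give the same result, here is a small portable model of the two 128-bit instructions. It is a sketch of their semantics for the unshifted immediate form only, not ARM64 code; `Vec128`, `mvni4s`, `movi16b`, and `allBitsSet` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// A 128-bit vector register modelled as 16 bytes.
using Vec128 = std::array<std::uint8_t, 16>;

// Models `mvni vd.4s, #imm` (no shift): every 32-bit lane receives the
// bitwise NOT of the zero-extended 8-bit immediate.
Vec128 mvni4s(std::uint8_t imm)
{
    Vec128 v{};
    const std::uint32_t lane = ~static_cast<std::uint32_t>(imm);
    for (std::size_t i = 0; i < 4; ++i)
    {
        std::memcpy(v.data() + 4 * i, &lane, sizeof lane);
    }
    return v;
}

// Models `movi vd.16b, #imm`: every byte lane receives the immediate.
Vec128 movi16b(std::uint8_t imm)
{
    Vec128 v{};
    v.fill(imm);
    return v;
}

// True when every bit of the modelled register is set.
bool allBitsSet(const Vec128& v)
{
    for (std::uint8_t b : v)
    {
        if (b != 0xFF)
        {
            return false;
        }
    }
    return true;
}
```

In this model `mvni4s(0)` and `movi16b(0xFF)` produce identical registers, matching the statement that either instruction works when the mask is this simple.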

Gnbrkm41 · 2020-04-08T16:57:33Z

Note that I got slightly busy recently. I hope to have some time to work on intrinsifying this for ARM64, but I am not sure how quickly that can happen. Do you think it would be okay if I attempt the ARM part later (hopefully soon, next weekend?) and open a follow-up PR?


Gnbrkm41 · 2020-04-28T13:48:41Z

Just rebased, resolved conflicts and addressed feedback. I've locally checked that the changes for both ARM and xarch do in fact generate appropriate instructions and all the tests pass. Could I get a final review on this? Thanks!

tannergooding · 2020-04-28T14:51:19Z

src/coreclr/src/jit/hwintrinsicarm64.cpp

@@ -281,7 +281,9 @@ GenTree* Compiler::impSpecialIntrinsic(NamedIntrinsic        intrinsic,

        if (!varTypeIsArithmetic(baseType))
        {
-            assert((intrinsic == NI_Vector64_AsByte) || (intrinsic == NI_Vector128_As));
+            assert((intrinsic == NI_Vector64_AsByte) || (intrinsic == NI_Vector128_As) ||


@echesakovMSFT, not related to this PR, but I think the first check in this is wrong. It should be intrinsic == NI_Vector64_As, right?

It looks so, however, I don't see Vector64.As intrinsic in hwintrinsiclistarm64.h, I will update #33308 to include this.

tannergooding

LGTM.

echesakov

LGTM

echesakov · 2020-04-28T16:39:49Z

src/coreclr/src/jit/hwintrinsicarm64.cpp

@@ -281,7 +281,9 @@ GenTree* Compiler::impSpecialIntrinsic(NamedIntrinsic        intrinsic,

        if (!varTypeIsArithmetic(baseType))
        {
-            assert((intrinsic == NI_Vector64_AsByte) || (intrinsic == NI_Vector128_As));
+            assert((intrinsic == NI_Vector64_AsByte) || (intrinsic == NI_Vector128_As) ||


It looks so, however, I don't see Vector64.As intrinsic in hwintrinsiclistarm64.h, I will update #33308 to include this.

echesakov · 2020-04-29T19:32:01Z

I made four attempts to re-run testing for this PR; runtime (Installer Build and Test coreclr FreeBSD_x64 Debug) keeps failing with

/root/runtime/.dotnet/sdk/5.0.100-preview.4.20202.8/NuGet.RestoreEx.targets(10,5): error : No space left on device [/root/runtime/tools-local/tasks/installer.tasks/installer.tasks.csproj]

Everything else is green, even though it is reported as "not finished": https://dev.azure.com/dnceng/public/_build/results?buildId=621331&view=logs&j=41e34ca2-d347-5e65-d632-b45724e78141. Merging so it does not conflict with #35594.

Thanks @Gnbrkm41 for the contribution!

Dotnet-GitSync-Bot added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and new-api-needs-documentation labels Mar 22, 2020

tannergooding reviewed Mar 23, 2020


AndyAyersMS mentioned this pull request Mar 24, 2020

JIT: block throw helper merges for first block of a try #34039

Merged

Gnbrkm41 force-pushed the allbitsset branch from 19a0242 to 1dd0b11 Compare March 25, 2020 10:00

tannergooding mentioned this pull request Mar 25, 2020

[System.Runtime.Intrinsics] API suggestion: Introduce an intrinsic all bits set field to Vector intrinsic types #30659

Closed

briansull mentioned this pull request Apr 7, 2020

Fix fgValueNumberHWIntrinsic to support encodeResultType for Arity 0 nodes #34621

Merged

Gnbrkm41 and others added 19 commits April 28, 2020 22:42

Auto-generate tests

f871393

Update test projects to use wildcards

4abbb7f

Intrinsify AllBitsSet

4a357ed

Fix value numbering

1ed0a53

Co-Authored-By: Brian Sullivan <[email protected]>

Apply formatting

63c744d

Special-case the integer path

4990c3f

Revert "Fix value numbering"

616f937

This reverts commit a708e533368cb6b8f71aa0ada3d931611dea1722.

Use new ISA query methods

7082647

Use compIsaSupportedDebugOnly in asserts

e2ada30

* compExactlyDependsOn(isa) ... Should never be used in an assert.

Use opportunistically depends on instead

ff76349

Intrinsify Zero and AllBitsSet for ARM

0273451

Address PR feedback

78bfc0e

* Hard-code ivals * Remove custom importation logic (and replace with HW_Category_SimpleSIMD) * Insert newlines

Mark Zero and AllBitsSet as SpecialCodeGen

9dbaf04

Allow null op1s where needed

2afe553

Fix failures for unsupported types, mark methods as intrinsic

81def29

Fix compilation errors, edit comments per PR feedback

df14899

Use custom importation logics for AllBitsSet/Zero on ARM

7064d8e

Apply formatting

6c1398c

Address PR feedback

84719ab

Gnbrkm41 force-pushed the allbitsset branch from 7a0f7cb to 84719ab Compare April 28, 2020 13:44

tannergooding reviewed Apr 28, 2020


tannergooding approved these changes Apr 28, 2020


echesakov approved these changes Apr 28, 2020


echesakov merged commit a8ef873 into dotnet:master Apr 29, 2020

echesakov mentioned this pull request Apr 30, 2020

[Arm64] Implement Vector64/128.CreateScalar() using AdvSimd.Insert #35300

Merged

Gnbrkm41 deleted the allbitsset branch May 1, 2020 13:37

Gnbrkm41 mentioned this pull request May 1, 2020

Insert missing break in hwintrinsicarm64.cpp #35713

Merged

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
