Implement Vector{Size}<T>.AllBitsSet #33924

Merged
merged 21 commits into from
Apr 29, 2020

Gnbrkm41
Contributor

Resolves #30659

Note: I strongly advise you to ignore the changes made in c9bd4a9 "Auto-generate tests": they are automatically generated from the template and add a lot of lines of code.

@Dotnet-GitSync-Bot
Collaborator

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@Dotnet-GitSync-Bot added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and new-api-needs-documentation labels on Mar 22, 2020
@Gnbrkm41
Contributor Author

cc @tannergooding

@Gnbrkm41
Contributor Author

I believe we can use this in a couple places:

if (Avx2.IsSupported)
{
    Vector256<int> ones = Vector256.Create(-1);
    fixed (int* ptr = thisArray)
    {
        for (; i < count - (Vector256<int>.Count - 1); i += Vector256<int>.Count)
        {
            Vector256<int> vec = Avx.LoadVector256(ptr + i);
            Avx.Store(ptr + i, Avx2.Xor(vec, ones));
        }
    }
}
else if (Sse2.IsSupported)
{
    Vector128<int> ones = Vector128.Create(-1);

@Gnbrkm41
Contributor Author

All of the failures seem to be one of these two:

Assertion failed '!"Jump into middle of try region"' in 'JIT.HardwareIntrinsics.General.Program:Vector64AllBitsSet()' during 'Optimize layout' (IL size 50)
    File: F:\workspace\_work\1\s\src\coreclr\src\jit\flowgraph.cpp Line: 20738
    Image: C:\h\w\AC45098D\p\CoreRun.exe

https://dev.azure.com/dnceng/public/_build/results?buildId=568937&view=ms.vss-test-web.build-test-results-tab&runId=17857130&resultId=102291&paneView=debug

Assert failure(PID 1220 [0x000004c4], Thread: 5788 [0x169c]): Assertion failed 'fixedArity == 0' in 'JIT.HardwareIntrinsics.General.Program:AllBitsSetByte()' during 'Do value numbering' (IL size 38)
    File: F:\workspace\_work\1\s\src\coreclr\src\jit\valuenum.cpp Line: 8588

https://dev.azure.com/dnceng/public/_build/results?buildId=568937&view=ms.vss-test-web.build-test-results-tab&runId=17857130&resultId=102416&paneView=debug

@gfoidl
Member

gfoidl commented Mar 22, 2020

Vector256<int> ones = Vector256.Create(-1);

Would it be possible for the JIT to detect such cases and emit the cmpps?
If not, or in addition, it would be great to have an analyzer and code fix that changes this to Vector256<int>.AllBitsSet.
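
For context on why the two spellings are interchangeable: -1 in two's complement has every bit set, so broadcasting -1 yields the same bit pattern as an all-bits-set vector, and XOR against that pattern is a bitwise NOT. The following is only a minimal C++ sketch of that equivalence. It models the lanes with a plain array instead of real intrinsics, and the names `Lanes`, `broadcast`, `allBitsSet`, `xorLanes` and `checkEquivalence` are illustrative assumptions, not runtime or JIT APIs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Eight 32-bit lanes, standing in for Vector256<int> (assumption for illustration).
using Lanes = std::array<std::uint32_t, 8>;

// Models Vector256.Create(value): every lane receives the same value.
Lanes broadcast(std::int32_t value)
{
    Lanes v{};
    v.fill(static_cast<std::uint32_t>(value));
    return v;
}

// Models Vector256<int>.AllBitsSet: every bit of every lane is 1.
Lanes allBitsSet()
{
    Lanes v{};
    v.fill(0xFFFFFFFFu);
    return v;
}

// Lane-wise XOR, standing in for Avx2.Xor.
Lanes xorLanes(const Lanes& a, const Lanes& b)
{
    Lanes r{};
    for (std::size_t i = 0; i < r.size(); ++i)
    {
        r[i] = a[i] ^ b[i];
    }
    return r;
}

// Checks both claims: Create(-1) equals AllBitsSet, and XOR with it is NOT.
void checkEquivalence()
{
    assert(broadcast(-1) == allBitsSet());

    const Lanes x = {0u, 1u, 0x0F0F0F0Fu, 0xFFFFFFFFu, 42u, 0x80000000u, 7u, 0x12345678u};
    const Lanes y = xorLanes(x, allBitsSet());
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        assert(y[i] == ~x[i]);
    }
}
```

Calling `checkEquivalence()` from a test or from `main` exercises both properties. Whether the JIT should recognize the `Create(-1)` form itself is the open question in the comment above; the sketch takes no position on that.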

@Gnbrkm41
Contributor Author

Gnbrkm41 commented Mar 22, 2020

Weird that I cannot reproduce the test failures on my local machine :^( As I said, I just had one happen.

@Gnbrkm41
Contributor Author

So, it appears that this is happening because methods that can have different instructions depending on types need an extra VNF_SimdType arg:

// If we see two (or more) different instructions we need the extra VNF_SimdType arg

What does this really mean?

@tannergooding
Member

So, it appears that this is happening because methods that can have different instructions depending on types need an extra VNF_SimdType arg

CC. @briansull who added the initial VN support for HWIntrinsics in #31834

@tannergooding
Member

Also CC. @echesakovMSFT and @CarolEidt

@briansull
Contributor

briansull commented Mar 23, 2020

need an extra VNF_SimdType arg:

This prevents us from CSE-ing two different SIMD operations that would be implemented using different instructions.

I will investigate the assert and provide guidance

Assertion failed 'fixedArity == 0'

else
{
assert(varTypeIsIntegral(baseType) || !compiler->compSupports(InstructionSet_AVX));
emit->emitIns_SIMD_R_R_R(ins, attr, targetReg, targetReg, targetReg);
Member

What is used for the comparison for float/double when AVX isn't supported? Based on the instruction table, this is cmpps/cmppd still; but I don't think that is correct?

Contributor Author

Ohhhh, good catch.

I wonder how we'd change it here, though? hard-code the instruction here?

Member

We would need to special-case one of the paths.

That's also possibly an interesting case for CSE... If the table gives a particular set of instructions but codegen can optionally special-case something further or treat it slightly differently, how should that be handled @briansull ?

Contributor

The ValueNumber & CSE phase only uses the table to determine if the result type needs to be an input when generating the ValueNumber for the node.

If two trees have the same operation (i.e. GT_MUL or the same HW intrinsic) and all of their operands have identical value numbers, then normally we would give them the same value number.

For GT_CAST we incorporate the castto type as an extra operand to the value number.

I noticed that we also needed to do this for some SIMD and HW intrinsic nodes.
I determined that the safest and easiest way to do this was to examine the table of instructions and always incorporate the result type when there were two or more different valid instructions listed for a SIMD or HW intrinsic node.

This process is used for x86 and x64. I believe that we need to be more conservative on ARM64, so there we always include an extra result type operand.

As long as the table has two or more different instructions we will be good.
It would be bad to list all the same instructions or an illegal instruction and then use hand-coded logic to decide on the instruction. If you want to record that an entry relies upon hand-coded logic, I would recommend using a brkpt or nop instruction as a marker for this behavior in the table. We can add a check for this and assume that different instructions could be generated. I don't think that it should matter if AVX is supported or not when deciding if we need an extra result type operand.
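
The table rule described above can be sketched roughly as follows. This is not the code in valuenum.cpp; it is a small C++ model under stated assumptions: the `Ins` enum, the ten-slot `InsRow`, and the function names `needsResultTypeOperand` and `checkRows` are all hypothetical, and the treatment of `Brkpt` as a "hand-coded" marker reflects the suggestion in the comment, not behavior confirmed to exist.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-ins for JIT instruction ids; not the real enum.
enum class Ins { Invalid, Xorps, Pcmpeqd, Cmpps, Cmppd, Brkpt };

// One table row: the instruction listed for each base type.
// Ten slots (one per base type) is an assumption for illustration.
using InsRow = std::array<Ins, 10>;

// Returns true when the row lists two or more different valid instructions,
// or contains the marker meaning "codegen picks the instruction by hand".
// In either case the result type should become an extra value-number operand.
bool needsResultTypeOperand(const InsRow& row)
{
    Ins first = Ins::Invalid;
    for (Ins ins : row)
    {
        if (ins == Ins::Invalid)
        {
            continue; // slot unused for this base type
        }
        if (ins == Ins::Brkpt)
        {
            return true; // suggested marker: assume different instructions may be emitted
        }
        if (first == Ins::Invalid)
        {
            first = ins; // remember the first valid instruction seen
        }
        else if (ins != first)
        {
            return true; // a second, different instruction
        }
    }
    return false;
}

// A row with one shared instruction needs no type operand; a mixed row does.
void checkRows()
{
    InsRow zeroRow;
    zeroRow.fill(Ins::Xorps);
    assert(!needsResultTypeOperand(zeroRow));

    InsRow allBitsRow;
    allBitsRow.fill(Ins::Pcmpeqd);
    allBitsRow[8] = Ins::Cmpps; // float slot (assumed position)
    allBitsRow[9] = Ins::Cmppd; // double slot (assumed position)
    assert(needsResultTypeOperand(allBitsRow));
}
```

The point of keying the decision on the table alone is that it stays independent of which instruction sets the machine supports, which matches the last sentence of the comment.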

Member

I don't think that it should matter if AVX is supported or not when deciding if we need an extra result type operand.

This would be a case where, without the VEX encoding (SSE through SSE4.2), all types use the same instruction, but with the VEX encoding (AVX+), float/double use different instructions that are more efficient.

Contributor Author

Left a comment that float/double may use different instructions depending on the encoding available. Also hardcoded the instruction inside the integer / non-VEX path.

@tannergooding
Member

I will investigate the assert and provide guidance
Assertion failed 'fixedArity == 0'

I did some initial debugging and found that Vector128_Zero doesn't hit this issue, as it has a single instruction (xorps) used by all types (and that looks to be the case for all other 0-arg intrinsics right now).

Vector128_AllBitsSet has three different instructions, so it gets the additional VNF_SimdType arg and the arity becomes 1, failing the assert.
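
To make the arity point concrete, here is a minimal C++ model of a value-number store for zero-argument intrinsics. It is a sketch under assumptions, not the JIT's ValueNumStore: the class `VNStore`, the method `forFunc`, the string keys, and the convention that an empty type string means "no extra operand" are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

using ValueNum = int;

// Hypothetical model: a value number is looked up by (function, type operand).
// An empty type string stands for "no extra operand", i.e. arity 0.
class VNStore
{
public:
    ValueNum forFunc(const std::string& func, const std::string& simdType = "")
    {
        const Key key(func, simdType);
        auto it = m_map.find(key);
        if (it != m_map.end())
        {
            return it->second; // same key, same value number
        }
        const ValueNum vn = m_next++;
        m_map.emplace(key, vn);
        return vn;
    }

private:
    using Key = std::pair<std::string, std::string>;
    std::map<Key, ValueNum> m_map;
    ValueNum m_next = 1;
};

// Zero needs no type operand; AllBitsSet must not share a number across types.
void checkStore()
{
    VNStore store;

    // One instruction for every type: the function alone identifies the value.
    assert(store.forFunc("Vector128_Zero") == store.forFunc("Vector128_Zero"));

    // Different instructions per type: the type operand keeps them apart.
    const ValueNum asFloat = store.forFunc("Vector128_AllBitsSet", "float");
    const ValueNum asInt   = store.forFunc("Vector128_AllBitsSet", "int");
    assert(asFloat != asInt);
    assert(asFloat == store.forFunc("Vector128_AllBitsSet", "float"));
}
```

In this model the assert that fired corresponds to a caller assuming every zero-argument intrinsic uses the one-part key, while the AllBitsSet entries need the two-part key.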

@briansull
Contributor

briansull commented Mar 23, 2020

The fix is:

line 8569 in valuenum.cpp in function:
void Compiler::fgValueNumberHWIntrinsic(GenTree* tree)

    int      lookupNumArgs    = HWIntrinsicInfo::lookupNumArgs(hwIntrinsicNode->gtHWIntrinsicId);
    bool     encodeResultType = vnEncodesResultTypeForHWIntrinsic(hwIntrinsicNode->gtHWIntrinsicId);
    VNFunc   func             = GetVNFuncForNode(tree);

    ValueNumPair excSetPair = ValueNumStore::VNPForEmptyExcSet();
    ValueNumPair normalPair;
    ValueNumPair resvnp     = ValueNumPair();

    if (encodeResultType)
    {
        ValueNum vnSize = vnStore->VNForIntCon(hwIntrinsicNode->gtSIMDSize);
        ValueNum vnBaseType = vnStore->VNForIntCon(INT32(hwIntrinsicNode->gtSIMDBaseType));
        ValueNum simdTypeVN = vnStore->VNForFunc(TYP_REF, VNF_SimdType, vnSize, vnBaseType);
        resvnp.SetBoth(simdTypeVN);

#ifdef DEBUG
        if (verbose)
        {
            printf("    simdTypeVN is ");
            vnPrint(simdTypeVN, 1);
            printf("\n");
        }
#endif
    }

    // There are some HWINTRINSICS operations that have zero args, i.e.  NI_Vector128_Zero
    if (tree->AsOp()->gtOp1 == nullptr)
    {
        if (encodeResultType)
        {
            // There are zero arg HWINTRINSICS operations that encode the result type, i.e.  Vector128_AllBitsSet
            normalPair = vnStore->VNPairForFunc(tree->TypeGet(), func, resvnp);
            assert(vnStore->VNFuncArity(func) == 1);
        }
        else
        {
            normalPair = vnStore->VNPairForFunc(tree->TypeGet(), func);
            assert(vnStore->VNFuncArity(func) == 0);
        }

    }
    else if (tree->AsOp()->gtOp1->OperIs(GT_LIST) || (lookupNumArgs == -1))

@Gnbrkm41
Contributor Author

*************** Starting PHASE Merge throw blocks

*************** In fgTailMergeThrows

*** Does not return call
               [000327] --CXG+------              *  CALL      void   System.ThrowHelper.ThrowNotSupportedException
               [000326] -----+------ arg0 in rcx  \--*  CNS_INT   int    63
    in BB04 is unique, marking it as canonical

*** Does not return call
               [000151] --CXG+------              *  CALL      void   System.ThrowHelper.ThrowNotSupportedException
               [000150] -----+------ arg0 in rcx  \--*  CNS_INT   int    63
    in BB02 can be dup'd to canonical BB04

*** found 1 merge candidates, rewriting flow

New Basic Block BB10 [0053] created.
*** BB01 now falling through to empty BB10 and then to BB04

*************** After fgTailMergeThrows(1 updates)

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1        [000..002)                                     i label target
BB10 [0053]  1       BB01                  1        [???..???)-> BB04 (always)                     internal
BB02 [0001]  0  0                          0        [002..00A)        (throw ) T0      try {       keep i try rare label gcsafe
BB03 [0019]  0  0                          1        [002..003)                 T0                  keep i label gcsafe
BB04 [0033]  2  0    BB03,BB10             0        [002..003)        (throw ) T0                  keep i rare label target gcsafe
BB05 [0046]  0  0                          1        [???..???)-> BB07 (always) T0      }           keep i internal label
BB07 [0003]  2       BB05,BB06             1        [00F..012)-> BB09 ( cond )                     i label target
BB08 [0004]  1       BB07                  0        [012..031)        (throw )                     i rare gcsafe newobj
BB09 [0005]  1       BB07                  1        [031..032)        (return)                     i label target
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ funclets follow
BB06 [0002]  1     0                       1        [00A..00F)-> BB07 ( cret )    H0 F catch { }   keep i label target flet
-----------------------------------------------------------------------------------------------------------------------------------------
*************** In fgDebugCheckBBlist
Jump into the middle of try region: BB10 branches to BB04

Assert failure(PID 11372 [0x00002c6c], Thread: 3996 [0x0f9c]): Assertion failed '!"Jump into middle of try region"' in 'JIT.HardwareIntrinsics.General.Program:Vector64AllBitsSet()' during 'Merge throw blocks' (IL size 50)

    File: C:\Users\gotos\source\repos\runtime\src\coreclr\src\jit\flowgraph.cpp Line: 20738
    Image: c:\users\gotos\source\repos\runtime\artifacts\tests\coreclr\windows_nt.x64.checked\tests\core_root\corerun.exe

Full JitDump

An interesting failure...

@AndyAyersMS
Member

@Gnbrkm41 that's a new phase I added recently, probably a missing safety check. Let me take a look.

@AndyAyersMS
Member

fgTailMergeThrowsFallThroughHelper needs to ensure that the new BB is in the same EH region as the "nonCanonicalBlock". It is not doing this properly if the nonCanonicalBlock is a try entry.

Might be simplest for now to disable throw helper merging in this case. Will keep looking.

@AndyAyersMS
Member

@Gnbrkm41 see if this patch fixes your failures. Haven't validated it yet -- trying to create a simple local repro, but no luck so far.

index 8ba6d2f8cc7..846677f482f 100644
--- a/src/coreclr/src/jit/flowgraph.cpp
+++ b/src/coreclr/src/jit/flowgraph.cpp
@@ -25869,6 +25869,14 @@ void Compiler::fgTailMergeThrows()
     // and there is less jumbled flow to sort out later.
     for (BasicBlock* block = fgLastBB; block != nullptr; block = block->bbPrev)
     {
+        // Workaround: don't consider try entry blocks as candidates
+        // for merging; if the canonical throw is later in the same try,
+        // we'll create invalid flow.
+        if ((block->bbFlags & BBF_TRY_BEG) != 0)
+        {
+            continue;
+        }
+
         // For throw helpers the block should have exactly one statement....
         // (this isn't guaranteed, but seems likely)
         Statement* stmt = block->firstStmt();

@AndyAyersMS
Member

Ok, I can repro now. I'll put up a fix.

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this pull request Mar 24, 2020
Otherwise we may create a branch into the middle of a try. We could fix the
transform, but if the first block of a try has a throw helper call, the rest
of the try will subsequently be removed, so merging is not all that
interesting.

Addresses an issue that came up in dotnet#33924.
AndyAyersMS added a commit that referenced this pull request Mar 25, 2020
@BruceForstall
Member

Does this need an arm64 implementation?

@tannergooding
Member

Does this need an arm64 implementation?

It will need one for Vector64<T> and Vector128<T>. @TamarChristinaArm could you advise as to which instructions should be used to efficiently create a vector where all bits are set?
That is, should we do a set and duplicate or maybe a compare tgt, tgt, tgt like we do on x86, or maybe something else?

@TamarChristinaArm
Contributor

TamarChristinaArm commented Mar 31, 2020

Does this need an arm64 implementation?

It will need one for Vector64<T> and Vector128<T>. @TamarChristinaArm could you advise as to which instructions should be used to efficiently create a vector where all bits are set?
That is, should we do a set and duplicate or maybe a compare tgt, tgt, tgt like we do on x86, or maybe something else?

@tannergooding your best bet is to use mvni or movi; both work in this case because the mask is simple:

mvni v0.4s, #0
movi v1.16b, #0xFF

will both give you a vector with all bits set.

mvni v0.2s, #0
movi v1.8b, #0xFF

for only the bottom half of the vector.
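
Why both encodings land on the same result can be checked with a small model. The C++ below is only a sketch: it simulates the 128-bit forms on a byte array rather than executing ARM64 instructions, it assumes the unshifted immediate form of mvni, and the names `Bytes16`, `mvni4s`, `movi16b` and `checkMoviMvni` are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sixteen bytes, standing in for a 128-bit vector register.
using Bytes16 = std::array<std::uint8_t, 16>;

// Models "mvni Vd.4s, #imm" without a shift: each 32-bit lane receives
// the bitwise NOT of the zero-extended 8-bit immediate.
Bytes16 mvni4s(std::uint8_t imm)
{
    const std::uint32_t lane = ~static_cast<std::uint32_t>(imm);
    Bytes16 v{};
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        // Byte (i % 4) of its lane, least significant byte first.
        v[i] = static_cast<std::uint8_t>((lane >> (8 * (i % 4))) & 0xFFu);
    }
    return v;
}

// Models "movi Vd.16b, #imm": each byte receives the immediate.
Bytes16 movi16b(std::uint8_t imm)
{
    Bytes16 v{};
    v.fill(imm);
    return v;
}

// With imm = 0 for mvni and imm = 0xFF for movi, every bit ends up set.
void checkMoviMvni()
{
    const Bytes16 a = mvni4s(0);
    const Bytes16 b = movi16b(0xFF);
    assert(a == b);
    for (std::uint8_t byte : a)
    {
        assert(byte == 0xFF);
    }
}
```

The 64-bit forms (`.2s` and `.8b`) follow the same per-lane logic over eight bytes, so the equivalence carries over to Vector64<T>.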

@Gnbrkm41
Contributor Author

Gnbrkm41 commented Apr 8, 2020

Note that I got slightly busy recently. I hope to have some time to work on intrinsifying ARM64, but I am not sure how quickly that can happen. Do you think it would be okay if I attempt the ARM part later (hopefully soon, next weekend?) and open a follow-up PR?

@Gnbrkm41
Contributor Author

Just rebased, resolved conflicts and addressed feedback. I've locally checked that the changes for both ARM and xarch do in fact generate the appropriate instructions and that all the tests pass. Could I get a final review on this? Thanks!

@@ -281,7 +281,9 @@ GenTree* Compiler::impSpecialIntrinsic(NamedIntrinsic intrinsic,

     if (!varTypeIsArithmetic(baseType))
     {
-        assert((intrinsic == NI_Vector64_AsByte) || (intrinsic == NI_Vector128_As));
+        assert((intrinsic == NI_Vector64_AsByte) || (intrinsic == NI_Vector128_As) ||
Member

@echesakovMSFT, not related to this PR, but I think the first check in this is wrong. It should be intrinsic == NI_Vector64_As, right?

Contributor

It looks so; however, I don't see a Vector64.As intrinsic in hwintrinsiclistarm64.h. I will update #33308 to include this.

Member

@tannergooding tannergooding left a comment

LGTM.

Contributor

@echesakov echesakov left a comment

LGTM

@echesakov
Contributor

I made four attempts to re-run testing for this PR; runtime (Installer Build and Test coreclr FreeBSD_x64 Debug) keeps failing with

/root/runtime/.dotnet/sdk/5.0.100-preview.4.20202.8/NuGet.RestoreEx.targets(10,5): error : No space left on device [/root/runtime/tools-local/tasks/installer.tasks/installer.tasks.csproj]

Everything else is green, even though it is reported as "non finished" (https://dev.azure.com/dnceng/public/_build/results?buildId=621331&view=logs&j=41e34ca2-d347-5e65-d632-b45724e78141). Merging so that it does not conflict with #35594.

Thanks @Gnbrkm41 for the contribution!

@echesakov echesakov merged commit a8ef873 into dotnet:master Apr 29, 2020
@Gnbrkm41 Gnbrkm41 deleted the allbitsset branch May 1, 2020 13:37
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020