Add optimization "X & 1 == 1" to "X & 1" (#61412) #62818
Conversation
Tagging subscribers to this area: @JulieLeeMSFT
src/coreclr/jit/lower.cpp (outdated):

    // Optimizes (X & 1) == 1 to (X & 1)
    // GTF_RELOP_JMP_USED is used to make sure the optimization is used for return statements only.
I would look into moving this optimization to morph. We tend not to put general peepholes like that into lowering.
My impression is that optimizations which depend on GTF_RELOP_JMP_USED are better placed in lower, after the various copy props; at least for my recent similar opt, moving it to lower got better diffs.
Would be interesting to check what the diffs look like for this case. I cannot think off the top of my head why we would catch more cases in lower (well, modulo the "missed remorph" cases). Relops under jumps are marked NO_CSE, so, in theory, we should have about the same number of !GTF_RELOP_JMP_USED ones before and after optimizations (modulo dead code and such).
It is a bit better to catch these things in morph in general because we help CSE make better decisions that way, but if it turns out the diffs say we should do it in lower, so be it.
src/coreclr/jit/lower.cpp (outdated):

    {
        GenTree* op1 = cmp->gtGetOp1();

        if (op1->gtGetOp1()->gtOper == GT_LCL_VAR && op1->gtGetOp2()->IsIntegralConst(1) && !(cmp->gtFlags & GTF_RELOP_JMP_USED))
You don't need to check op1's operator here; it can be anything.
src/coreclr/jit/lower.cpp (outdated):

        if (op1->gtGetOp1()->gtOper == GT_LCL_VAR && op1->gtGetOp2()->IsIntegralConst(1) && !(cmp->gtFlags & GTF_RELOP_JMP_USED))
        {
            GenTree* next = cmp->gtNext;
I think you need to use LIR::Use::ReplaceWith here if you want to replace the root node.
@EgorBo @SingleAccretion I'm concerned about the failed test cases. Most of them are "Assertion failed 'OperIsSimple()'". It might mean that the optimization is applied not only to the return statements. Do you have any ideas about what went wrong? :)
I am sure you just don't replace the node correctly; you can indeed start by doing it in morph.cpp first, as it should be much easier to test, and then we'll check the diffs.
The code assumes that the user of the relop will be the next node, which is not a correct assumption; there may be intervening nodes in between (an arbitrary number of them, in fact). You'll need to use LIR::Use::ReplaceWith (I agree checking morph first would be better).
src/coreclr/jit/morph.cpp (outdated):

    // Optimizes (X & 1) == 1 to (X & 1)
    // GTF_RELOP_JMP_USED is used to make sure the optimization is used for return statements only.
    if (tree->gtGetOp2()->IsIntegralConst(1) && tree->gtGetOp1()->OperIs(GT_AND) && !(tree->gtFlags & GTF_RELOP_JMP_USED))
    {
        GenTree* op1 = tree->gtGetOp1();

        if (op1->gtGetOp2()->IsIntegralConst(1))
        {
            DEBUG_DESTROY_NODE(tree->gtGetOp2());
            DEBUG_DESTROY_NODE(tree);

            return fgMorphTree(op1);
        }
    }
This should be done in post-order, when all the constants have been discovered.
src/coreclr/jit/morph.cpp (outdated):

    // Optimizes (X & 1) == 1 to (X & 1)
    // GTF_RELOP_JMP_USED is used to make sure the optimization is used for return statements only.
    if (tree->gtGetOp2()->IsIntegralConst(1) && tree->gtGetOp1()->OperIs(GT_AND) &&
        !(tree->gtFlags & GTF_RELOP_JMP_USED))
    {
        GenTree* op1 = tree->gtGetOp1();

        if (op1->gtGetOp2()->IsIntegralConst(1))
        {
            DEBUG_DESTROY_NODE(tree->gtGetOp2());
            DEBUG_DESTROY_NODE(tree);

            return op1;
        }
    }
(In post-order) ...and inside fgOptimizeEqualityComparisonWithConst.
(Sorry for not being complete in my first comment :( )
I'm happy to hear any corrections :) feel free
(In post-order)
Do you mean the order the nodes are being destroyed?
fgOptimizeEqualityComparisonWithConst is supposed to return a comparison node, meanwhile we have to get rid of the equality node. Are you sure the optimization should be placed there?
runtime/src/coreclr/jit/morph.cpp, lines 12085 to 12086 in 99d82c2:

    tree = fgOptimizeEqualityComparisonWithConst(tree->AsOp());
    assert(tree->OperIsCompare());
Are you sure the optimization should be placed there?
Yeah, this is an artificial restriction and can be removed. You may also need to remove a few downstream asserts too.
Yeah, you are right. It works. But I'm still confused about the ordering: what exactly must be in post-order?
This is tree traversal terminology:
fgMorphSmpOp:
Pre-Order (before the operands were morphed)
Morphing the operands
Post-Order (after the operands were morphed)
As a general rule, optimizations should be done in post-order, because then we will know facts about the operands we wouldn't have known in pre-order (such as that they are constant, or have side effects).
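The pre-order/post-order distinction above can be made concrete with a toy morph routine. The sketch below uses an invented mini tree (Op, Node, morph are all illustrative names, not JIT APIs): the (X & 1) == 1 fold only fires in post-order because operand morphing first folds a constant subtree like (0 | 1) into the literal 1 the peephole looks for.

```cpp
#include <cassert>

// Invented mini-IR for illustration; not the JIT's GenTree.
enum class Op { Var, Const, Or, And, Eq };

struct Node {
    Op    op;
    long  val = 0;          // meaningful only when op == Op::Const
    Node* lhs = nullptr;
    Node* rhs = nullptr;
};

static bool isConst1(const Node* n) {
    return n != nullptr && n->op == Op::Const && n->val == 1;
}

static Node* morph(Node* n) {
    // "Morphing the operands" step: recurse into children first.
    if (n->lhs != nullptr) n->lhs = morph(n->lhs);
    if (n->rhs != nullptr) n->rhs = morph(n->rhs);

    // Post-order fold: constant OR, e.g. (0 | 1) becomes Const 1.
    if (n->op == Op::Or && n->lhs->op == Op::Const && n->rhs->op == Op::Const) {
        n->val = n->lhs->val | n->rhs->val;
        n->op  = Op::Const;
        n->lhs = n->rhs = nullptr;
        return n;
    }

    // Post-order peephole modeled on this PR: (X & 1) == 1 => X & 1.
    // A pre-order pass would have seen an unfolded (0 | 1) here and missed it.
    if (n->op == Op::Eq && isConst1(n->rhs) &&
        n->lhs->op == Op::And && isConst1(n->lhs->rhs)) {
        return n->lhs;
    }
    return n;
}
```

Running it on Eq(And(x, Or(0, 1)), 1), the Or folds to 1 first, and the Eq then collapses to the And, which is exactly the "facts about the operands" point: the constant only becomes visible after the operands were morphed.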
This is tree traversal terminology:
fgMorphSmpOp: Pre-Order (before the operands were morphed), Morphing the operands, Post-Order (after the operands were morphed)
As a general rule, optimizations should be done in post-order, because then we will know facts about the operands we wouldn't have known in pre-order (such as that they are constant, or have side effects).
Oh, I know about the terminology. Thank you for the explanation :)
src/coreclr/jit/morph.cpp (outdated):

    //   /   \
    //  x    CNS 1
    //
    // GTF_RELOP_JMP_USED is used to make sure the optimization is used for return statements only.
Suggested change:

    - // GTF_RELOP_JMP_USED is used to make sure the optimization is used for return statements only.
    + // The compiler requires jumps to have relop operands, so we do not fold that case.
Notably, relops can appear anywhere r-values can, not just under GT_RETURNs.
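To make that point concrete, here is a hypothetical C++ analogue (countOdd is an invented example, not from the PR) where a relop's 0/1 value feeds an addition rather than a return:

```cpp
#include <cstddef>

// A comparison's 0/1 value can flow anywhere an r-value can, not only
// into a return statement; here it is consumed by '+='.
static int countOdd(const int* a, std::size_t n) {
    int count = 0;
    for (std::size_t i = 0; i < n; i++) {
        count += (a[i] & 1) == 1;  // relop result used as an ordinary value
    }
    return count;
}
```

This is why the comment about "return statements only" was misleading: the fold applies wherever the relop's value (rather than a jump) is the consumer.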
src/coreclr/jit/morph.cpp (outdated):

    @@ -13550,6 +13545,26 @@ GenTree* Compiler::fgOptimizeEqualityComparisonWithConst(GenTreeOp* cmp)
        DEBUG_DESTROY_NODE(rshiftOp->gtGetOp2());
        DEBUG_DESTROY_NODE(rshiftOp);
    }

    // Optimizes (X & 1) == 1 to (X & 1)
Hmm, I think it'll be easier to reason about this if we move the (new) optimization just under the op2->IsIntegralConst(0) || op2->IsIntegralConst(1) check, before the other ones. On success it returns early; on failure it does nothing, so subsequent transformations don't need to take it into account.
Also, presumably we would need to check op2Value == 1.
src/coreclr/jit/morph.cpp (outdated):

    DEBUG_DESTROY_NODE(cmp->gtGetOp2());
    DEBUG_DESTROY_NODE(cmp);

    return op1;
Please update the function header: "Currently only returns relop trees." -> "Currently only returns GTK_SMPOP trees."
@EgorBo @SingleAccretion It seems like I placed the code in the right place, but the tests failed :( What can I do about it?
It seems that's just the build failing: I think the fix is to remove the trailing whitespace in the picture.
src/coreclr/jit/morph.cpp (outdated):

    //  x    CNS 1
    //
    // The compiler requires jumps to have relop operands, so we do not fold that case.
    if (op1->OperIs(GT_AND) && op2Value == 1 && !(cmp->gtFlags & GTF_RELOP_JMP_USED))
Suggested change:

    - if (op1->OperIs(GT_AND) && op2Value == 1 && !(cmp->gtFlags & GTF_RELOP_JMP_USED))
    + if (op1->OperIs(GT_AND) && (op2Value == 1) && !(cmp->gtFlags & GTF_RELOP_JMP_USED))
src/coreclr/jit/morph.cpp (outdated):

    @@ -13446,6 +13441,26 @@ GenTree* Compiler::fgOptimizeEqualityComparisonWithConst(GenTreeOp* cmp)
    {
        ssize_t op2Value = static_cast<ssize_t>(op2->IntegralValue());

        // Optimizes (X & 1) == 1 to (X & 1)
This also needs to check that we will be returning a properly typed tree here. X & 1 can be of TYP_LONG, while relops are always TYP_INT.
There are two ways to fix this: a) just give up if the AND is not of TYP_INT (genActualType(andNode) == relop->TypeGet()), or b) try our luck with optNarrowTree (see example of usage in this function). Note that optNarrowTree should only be called under an fgGlobalMorph guard.
The optNarrowTree solution should get better diffs; the "always check types" one is simpler. Up to you which one to implement.
Fixing this should fix the x86 failures we're seeing in CI.
(Also, for debugging, SPMI is a very useful tool.)
@SingleAccretion Well, it seems my usage of
The pattern with
That said, from reading
It seems the SPMI diffs failed; you should be able to run replay locally and debug the failures.
@EgorBo @SingleAccretion Here are the last run logs that I got.
@SkiFoD I think you need to replay the Windows x86 collection (
@SingleAccretion @EgorBo I did the replay with
@SkiFoD looks like the replay didn't pick up the
Try |
Force-pushed 60d4c0f to ed4e78a (compare)
@SingleAccretion Seems like I managed to fix the test failures. I ended up using option A, because narrowing would show me this error during replays: runtime/src/coreclr/jit/lower.cpp, lines 2273 to 2274 in 46f5b99.
I assume it happened because one of the operands had been narrowed to TYP_INT, which made the assertion fail.
I see there are regressions in the diffs. What is their cause and can they be mitigated?
src/coreclr/jit/morph.cpp (outdated):

    @@ -13446,6 +13441,29 @@ GenTree* Compiler::fgOptimizeEqualityComparisonWithConst(GenTreeOp* cmp)
    {
        ssize_t op2Value = static_cast<ssize_t>(op2->IntegralValue());

        if (fgGlobalMorph)
fgGlobalMorph can be dropped now that we don't use optNarrowTree.
(This transform preserves VNs, and the whole method is under the "not CSE" guard.)
src/coreclr/jit/morph.cpp (outdated):

    if (op1->gtGetOp2()->IsIntegralConst(1) && (genActualType(op1) == cmp->TypeGet()))
    {
        DEBUG_DESTROY_NODE(op2);
        DEBUG_DESTROY_NODE(cmp);

        return op1;
It is not a correctness requirement, but I would recommend transferring the relop's VN to the AND tree. Relop VNs are more "valuable":

    op1->SetVNsFromNode(cmp);
I got rid of the regressions, but had to return the check
keeps showing up and I can't figure out what causes it.
I wouldn't worry about it, it is "expected".
The way SPMI works, it collects all the communication between the Jit and the EE for a given session, and then replay uses that data. But of course, it is not necessary that the Jit being used for replay would be asking the same exact questions that the Jit used for collection would have. Asking less is ok; asking more - that's where the MISSING (data) problem comes from.
The collections are updated on a weekly basis (I believe), so even in normal operation MISSING errors can happen, say because a change merged in the middle of the week tweaked the Jit-EE traffic in some way where more information is now needed (and there is none in the existing collections).
It is the case that for changes not expected to alter the Jit-EE traffic, such as this one, MISSING errors are (almost) entirely ignorable. However, if a change (even a pure Jit one) needs "too much" data from the collections that is not there, that's when SPMI becomes less useful for diffs (and replay), and regular PMI/CG-en-based diffs are employed.
src/coreclr/jit/morph.cpp (outdated):

    // The compiler requires jumps to have relop operands, so we do not fold that case.
    if (op1->OperIs(GT_AND) && (op2Value == 1) && !(cmp->gtFlags & GTF_RELOP_JMP_USED))
    {
        if ((op1->gtGetOp1()->gtOper == GT_LCL_VAR) && op1->gtGetOp2()->IsIntegralConst(1) &&
Why does restricting the AND's operand to a local solve the regressions - what was their source?
@SingleAccretion Well, I got pretty many regressions, and it seemed suspicious to me that such a tiny optimization could cause this. I used jit-analyze to figure out the actual diff between the dasms, but ended up just comparing both side by side, and found out that the optimization introduced extra asm instructions (I'm not good at low-level engineering, I'm learning), so I assumed that the optimization had been applied in cases where it shouldn't have been applied.
The restriction for x to be a local variable (this is how I understand the GT_LCL_VAR type) did the trick. I'm curious why @EgorBo said that x can be anything, and I would like to dig deeper to figure out the source of the regressions when x is anything. How do you debug such cases?
Well, I got pretty many regressions and It seemed suspicious to me that such a tiny optimization could cause this.
I agree, that is why it would be good to know; the restriction on the local may just be hiding it. Some regressions cannot be avoided (it is usual that, for example, we get some due to different register allocation), but we should have a good understanding of the causes to make decisions.
How do you debug such cases?
I personally do it by: a) fully diffing with SPMI, b) analyzing the cases individually by first diffing (as in git diff, or some online tool) the assembly, then the JitDumps (these ones are usually only git diff-able...), and debugging as necessary. I believe these days you can do this with the SPMI script directly by using the -c argument to pass the method context number that had the diffs.
Thank you for the debugging advice. I've spent a few days comparing dasm files. It is very hard for me to tell what exactly is wrong, but I found some suspicious facts:
- Regressions mostly occur when 'x' is a pointer (or ref variable). Here is a diff example: https://www.diffchecker.com/fQLJQ4YW
- Regressions mostly occur in the MoveNext() method.
Thank you for looking into this. I took a look myself, I agree it is a bit of a tricky case.
We lose tail duplication because it only looks at relops for our purposes. Then because of that we lose relop forward substitution, but perhaps more importantly, the flow that RA can, apparently, better work with.
If we look at PerfScore instead of code size, things look better (just a few small regressions); this holds both for libraries.pmi and libraries_tests.pmi. The PerfScore outliers you note are indeed curious. It appears for the "diff" method the altered flow causes the compiler to use synthetic scales (instead of profile data) for more blocks, and for more general loops to exist.
(Side note: JitDisablePgo=1 doesn't appear to work well, so it is a bit hard to say why things are the way they are.)
Overall, it appears doing this in lowering will be better after all.
Force-pushed 0823b10 to 9dd21e4 (compare)
Force-pushed e06aa3e to 7e9949e (compare)
@EgorBo please review and approve this PR.
Ping @EgorBo for community PR review.
@SkiFoD could you please rebase this one as well (and tag me).
Force-pushed 7e9949e to 175bac2 (compare)
@EgorBo I updated the branch and ran SPMI once again. I got confused by the results, so I'd like to know your opinion on them.
Failure is due to #63854.
* Add optimization "X & 1 == 1" to "X & 1" (dotnet#61412)
* Moved the optimization to the morph phase (dotnet#61412)
* Done in post-order (dotnet#61412)
* Moved the optimization into fgOptimizeEqualityComparisonWithConst (dotnet#61412)
* Some corrections due the comments (dotnet#61412)
* Fix of the picture (dotnet#61412)
* Add optNarrowTree use (dotnet#61412)
* Change narrowing to the type check (dotnet#61412)
* Fix regressions (dotnet#61412)
* Moved the optimization to the lowering phase (dotnet#61412)
* Reverted Morph changes (dotnet#61412)
* Moved the optimization into OptimizeConstCompare method (dotnet#61412)
* Add GT_EQ check (dotnet#61412)
Looks like I got it working :) At least the result for a simple function

    public static bool Test(int x) => (x & 1) == 1;

looks like this. Would be glad to hear if I did something wrong and how I can make it better.