Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve div/mod by const power of 2 #5871

Merged
merged 2 commits into from
Jun 22, 2016
Merged

Improve div/mod by const power of 2 #5871

merged 2 commits into from
Jun 22, 2016

Conversation

mikedn
Copy link

@mikedn mikedn commented Jun 18, 2016

Same as #1241 with a bug fix

Fixes #1207

@mikedn
Copy link
Author

mikedn commented Jun 20, 2016

Diff summary:

Total bytes of diff: 2294
    diff is a regression.
Top file regressions by size (bytes):
        1291 : System.Private.CoreLib.dasm
         509 : System.Xml.ReaderWriter.dasm
         106 : Microsoft.CodeAnalysis.dasm
          72 : System.Linq.Parallel.dasm
          71 : System.Collections.dasm
Top file improvements by size (bytes):
          -4 : System.Console.dasm
20 total files with size differences.
Top method regessions by size (bytes):
          70 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteElementTextBlockNoFlush(long,long,byref):int:this
          63 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteCDataSectionNoFlush(ref,int,int,byref):int:this
          61 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteAttributeTextBlockNoFlush(long,long):int:this
          60 : System.Linq.Parallel.dasm - System.Linq.Parallel.SortHelper`2[__Canon,__Canon][System.__Canon,System.__Canon]:MergeSortCooperatively():this
          59 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteAttributeTextBlock(long,long):this
Top method improvements by size (bytes):
         -23 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteElementTextBlock(long,long):this
         -18 : System.Xml.ReaderWriter.dasm - System.Xml.XmlEncodedRawTextWriter:WriteCommentOrPi(ref,int):this
         -18 : System.Xml.ReaderWriter.dasm - System.Xml.XmlEncodedRawTextWriter:WriteCDataSection(ref):this
         -18 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteCommentOrPi(ref,int):this
         -18 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteCDataSection(ref):this
359 total methods with size differences.

@mikedn mikedn changed the title [WIP] Improve div/mod by const power of 2 Improve div/mod by const power of 2 Jun 20, 2016
@mikedn
Copy link
Author

mikedn commented Jun 20, 2016

I added a new function - CreateTemporary - that wraps fgInsertEmbeddedFormTemp and does the necessary work to lower embedded statements that have been moved before the current statement. The function is based on the code used in LowerArrElem with a couple of additions:

  • a check for already lowered statements to prevent double lowering
  • statement dumping
  • an assert for unexpected current statement changes

The code in LowerArrElem could be replaced with a call to CreateTemporary but this doesn't seem like something that should be done in this PR.

@RussKeldorph
Copy link

@dotnet/jit-contrib PTAL

@sejongoh
Copy link

sejongoh commented Jun 20, 2016

I am wondering if CreateTemporary is enough for #5867 or we need a more fundamental solution.

// stored in RAX := Quotient, RDX := Remainder.
// Move the result to the desired register, if necessary
if (oper == GT_DIV || oper == GT_UDIV)
// Signed divide RDX:RAX by r/m64, with result

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should update this comment, since this now applies to both signed and unsigned divide.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update the comment but this isn't code that I have changed. The GitHub diff viewer shows changes here because it lacks a "ignore whitespace changes" option.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"ignore whitespace changes"

If we add w=1 to the URL query, then whitespace is ignored:

-- see more neat/cheat tricks like ts for tab space etc. here: http://git.io/sheet

@CarolEidt
Copy link

@sejongoh - CreateTemporary seems to be a general solution to #5867. I would suggest that we leave that open, and use that issue to motivate modifying existing calls to fgInsertEmbeddedFormTemp, as @mikedn notes in his TODO-Cleanup comment in that method.

if (isDiv)
{
if ((type == TYP_INT && divisorValue == INT_MIN) ||
(type == TYP_LONG && divisorValue == INT64_MIN))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see from below that you have tests that exercise this case, but I wonder if this will require special support for 32-bit targets, since the MSIL relational operators always return an int, and I suspect the JIT assumes the same for the GT_EQ and similar operators.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this currently runs after decomposition, I don't think it will cause issues, as there should not be any long relationals at this point, for a 32-bit target. However, I guess that also means that these transformations will be less effective for longs, in general. (The woes of phase ordering.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and I suspect the JIT assumes the same for the GT_EQ and similar operators

codegenxarch considers both TYP_INT and TYP_LONG to be valid: https://github.com/dotnet/coreclr/blob/master/src/jit/codegenxarch.cpp#L7161. It generates the same code for both but that's normal given the way x64 zero fills 64 bit registers.

I'm not familiar with ARM but codegenarm64 seems to produce a 64 bit result anyway: https://github.com/dotnet/coreclr/blob/master/src/jit/instr.cpp#L307

for a 32-bit target. However, I guess that also means that these transformations will be less effective for longs, in general. (The woes of phase ordering.)

Yes, that's a bit of a problem. I think that the way things are set up now it doesn't actually matter as morph will replace all long div/mod operations with helper calls. JIT32 doesn't do any optimizations for long div by power of 2, it always uses the helper.

@CarolEidt
Copy link

The changes look good to me.

@CarolEidt
Copy link

@mikedn - I am going to run these changes against the internal tests, but at the moment I am running into some technical difficulties. I hope to get them resolved soon. I'd really like to see this merged.

@mikedn
Copy link
Author

mikedn commented Jun 21, 2016

Regarding CreateTemporary being a general solution: it tries to be that but it's worth noting there are use cases that don't use/need this "workaround" because sometimes it is known that the new statement won't be top level. Best example is in LowerArrElem - first time it makes a temporary it applies the workaround but next time it just asserts if it gets a top-level statement. We could change LowerArrElem to always use CreateTemporary but we would lose that assert, IMO that's not a problem.

I can't say I like the solution but I don't think there are many alternatives:

  • lower embedded statements before the top-level statement it contains them (might turn out to be a can of worms)
  • use the linear order instead of walking the tree (more work and probably another can of worms)

@CarolEidt
Copy link

I completed the internal tests, and it looks good. I did a very limited analysis of the diffs, and this is clearly getting more cases - mostly results in larger code, as expected, and uses more temps (due to introducing new ones in rationalize), but is actually smaller in some cases due to not forcing the operands to fixed registers.
@mikedn - Would it be possible to squash the commits?
@dotnet/jit-contrib - could I get another review so that I can merge this?

Optimizing GT_DIV/GT_UDIV/GT_MOD/GT_UMOD by power of 2 in codegen is problematic because the xarch DIV instruction has special register requirements. By the time codegen decides to perform the optimization the rax and rdx registers have been already allocated by LSRA even though they're not always needed (as it happens in the case of unsigned division where CDQ isn't used).

Since the JIT can't represent a CDQ instruction in its IR an arithmetic shift (GT_RSH) has been instead to extract the dividend sign. xarch's SAR is larger than CDQ but it has the advantage that it doesn't require specific registers. Also, arithmetic shifts are available on architectures other than xarch.

Example: method "static int foo(int x) => x / 8;" is now compiled to
mov      eax, ecx
mov      edx, eax
sar      edx, 31
and      edx, 7
add      edx, eax
mov      eax, edx
sar      eax, 3

instead of
mov      eax, ecx
cdq
and      edx, 7
add      eax, edx
sar      eax, 3

As a side-effect of this change the optimization now also works when the divisor is too large to be contained. Previously this wasn't possible because the divisor constant needed to be modified during codegen but the constant was already loaded into a register.

Example: method "static ulong foo(ulong x) => x / 4294967296;" is now compiled to
mov      rax, rcx
shr      rax, 32

whereas before a DIV instruction was used.

This change also fixes an issue in fgShouldUseMagicNumberDivide. The optimization that is done in lower can handle negative power of 2 divisors but fgShouldUseMagicNumberDivide handled those cases because it didn't check the absolute value of the divisor.

Example: method "static int foo(int x) => return x / -2;" is now compiled to

mov      eax, ecx
mov      edx, eax
shr      edx, 31
add      edx, eax
sar      edx, 1
mov      eax, edx
neg      eax

instead of
mov      eax, 0x7FFFFFFF
imul     edx:eax, ecx
mov      eax, edx
sub      eax, ecx
mov      edx, eax
shr      edx, 31
add      eax, edx
@mikedn
Copy link
Author

mikedn commented Jun 22, 2016

Would it be possible to squash the commits?

Sure, squashed down to 2 commits, one for tests and one for implementation.

mostly results in larger code, as expected, and uses more temps (due to introducing new ones in rationalize), but is actually smaller in some cases due to not forcing the operands to fixed

BTW, is there a way to ask LSRA to allocate RAX/RDX but only if they happen to be available? If we could ask the source of the sign extending shift to be in RAX and the destination in RDX then codegen could change RDX = RSH RAX, 63 to CDQ if the request is fulfilled by LSRA.

Thanks for review!

@CarolEidt
Copy link

Currently, register preferences (vs. requirements) are only set on lclVar intervals. The registerPreferences field exists on all Intervals, but is set during interval creation, based on the isTgtPref flag in the gtNodeInfo, as well as for direct copies. There isn't a convenient way to communicate preference info for tree nodes from Lowering to LSRA, though it may be something to consider in future. I've often thought it would be good to consider merging Lowering::TreeNodeInfoInit with the creation of Intervals and RefPositions (in fact I had a wild idea of eliminating RefPositions altogether for non-lclVars, but never had time to pursue it far enough).

// Perform the 'targetType' (64-bit or 32-bit) divide instruction
instruction ins;
if (oper == GT_UMOD || oper == GT_UDIV)
ins = INS_div;
Copy link

@briansull briansull Jun 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs braces (i.e.) and I would also add a comment says that INS_div is an unsigned divide (as I always found this mnemonic to be confusing on x86)
{
ins = INS_div; // unsigned divide
}
else
{
ins = INS_idiv; // signed divide
}

@briansull
Copy link

briansull commented Jun 22, 2016

Looks Good

// const divisor into equivalent but faster sequences.
//
// Arguments:
// pTree: pointer to the parent node's link to the node we care about
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function header needs to be updated as this method has another additional argument 'data'

@sivarv
Copy link
Member

sivarv commented Jun 22, 2016

Looks good.

@CarolEidt
Copy link

I ran asmdiffs for arm64, and took a brief look at them. They are similar to those for amd64. I'm going to merge this in the interest of forward progress. @mikedn - could you do a follow-up PR to address the feedback? @briansull - we may want to assess the perf implications for arm64; I don't know if the relative costs are similar.

@CarolEidt CarolEidt merged commit 3dfe85f into dotnet:master Jun 22, 2016
@mikedn
Copy link
Author

mikedn commented Jun 22, 2016

could you do a follow-up PR to address the feedback?

Sure, though I'm not sure what's the hurry with this one :)

@CarolEidt
Copy link

@mikedn - only that it's been so long, the outstanding feedback is relatively minor, and (IMO) it's easier to focus on just the new changes when it's a new PR.
BTW - thanks for all your patience on this one!

@mikedn mikedn deleted the modopt branch April 10, 2017 05:42
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Improve div/mod by const power of 2

Commit migrated from dotnet/coreclr@3dfe85f
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RyuJIT: dangling instructions after converting modulus to bitwise AND
7 participants