Improve div/mod by const power of 2 #5871

mikedn · 2016-06-18T12:24:49Z

Same as #1241 with a bug fix

Fixes #1207

mikedn · 2016-06-20T13:09:06Z

Diff summary:

Total bytes of diff: 2294
    diff is a regression.
Top file regressions by size (bytes):
        1291 : System.Private.CoreLib.dasm
         509 : System.Xml.ReaderWriter.dasm
         106 : Microsoft.CodeAnalysis.dasm
          72 : System.Linq.Parallel.dasm
          71 : System.Collections.dasm
Top file improvements by size (bytes):
          -4 : System.Console.dasm
20 total files with size differences.
Top method regessions by size (bytes):
          70 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteElementTextBlockNoFlush(long,long,byref):int:this
          63 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteCDataSectionNoFlush(ref,int,int,byref):int:this
          61 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteAttributeTextBlockNoFlush(long,long):int:this
          60 : System.Linq.Parallel.dasm - System.Linq.Parallel.SortHelper`2[__Canon,__Canon][System.__Canon,System.__Canon]:MergeSortCooperatively():this
          59 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteAttributeTextBlock(long,long):this
Top method improvements by size (bytes):
         -23 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteElementTextBlock(long,long):this
         -18 : System.Xml.ReaderWriter.dasm - System.Xml.XmlEncodedRawTextWriter:WriteCommentOrPi(ref,int):this
         -18 : System.Xml.ReaderWriter.dasm - System.Xml.XmlEncodedRawTextWriter:WriteCDataSection(ref):this
         -18 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteCommentOrPi(ref,int):this
         -18 : System.Xml.ReaderWriter.dasm - System.Xml.XmlUtf8RawTextWriter:WriteCDataSection(ref):this
359 total methods with size differences.

mikedn · 2016-06-20T13:18:36Z

I added a new function - CreateTemporary - that wraps fgInsertEmbeddedFormTemp and does the necessary work to lower embedded statements that have been moved before the current statement. The function is based on the code used in LowerArrElem with a couple of additions:

a check for already lowered statements to prevent double lowering
statement dumping
an assert for unexpected current statement changes

The code in LowerArrElem could be replaced with a call to CreateTemporary but this doesn't seem like something that should be done in this PR.

RussKeldorph · 2016-06-20T20:31:30Z

@dotnet/jit-contrib PTAL

sejongoh · 2016-06-20T23:18:15Z

I am wondering if CreateTemporary is enough for #5867 or we need a more fundamental solution.

CarolEidt · 2016-06-20T23:34:51Z

src/jit/codegenxarch.cpp

-            //    stored in RAX := Quotient, RDX := Remainder.
-            // Move the result to the desired register, if necessary
-            if (oper == GT_DIV || oper == GT_UDIV)
+        // Signed divide RDX:RAX by r/m64, with result


You should update this comment, since this now applies to both signed and unsigned divide.

I can update the comment but this isn't code that I have changed. The GitHub diff viewer shows changes here because it lacks a "ignore whitespace changes" option.

"ignore whitespace changes"

If we add w=1 to the URL query, then whitespace is ignored:

diff view: https://github.com/dotnet/coreclr/pull/5871/files?w=1

commit view: 651c419?w=1

-- see more neat/cheat tricks like ts for tab space etc. here: http://git.io/sheet

CarolEidt · 2016-06-20T23:45:20Z

@sejongoh - CreateTemporary seems to be a general solution to #5867. I would suggest that we leave that open, and use that issue to motivate modifying existing calls to fgInsertEmbeddedFormTemp, as @mikedn notes in his TODO-Cleanup comment in that method.

CarolEidt · 2016-06-20T23:57:02Z

src/jit/lower.cpp

+        if (isDiv)
+        {
+            if ((type == TYP_INT && divisorValue == INT_MIN) ||
+                (type == TYP_LONG && divisorValue == INT64_MIN))


I see from below that you have tests that exercise this case, but I wonder if this will require special support for 32-bit targets, since the MSIL relational operators always return an int, and I suspect the JIT assumes the same for the GT_EQ and similar operators.

Since this currently runs after decomposition, I don't think it will cause issues, as there should not be any long relationals at this point, for a 32-bit target. However, I guess that also means that these transformations will be less effective for longs, in general. (The woes of phase ordering.)

and I suspect the JIT assumes the same for the GT_EQ and similar operators

codegenxarch considers both TYP_INT and TYP_LONG to be valid: https://github.com/dotnet/coreclr/blob/master/src/jit/codegenxarch.cpp#L7161. It generates the same code for both but that's normal given the way x64 zero fills 64 bit registers.

I'm not familiar with ARM but codegenarm64 seems to produce a 64 bit result anyway: https://github.com/dotnet/coreclr/blob/master/src/jit/instr.cpp#L307

for a 32-bit target. However, I guess that also means that these transformations will be less effective for longs, in general. (The woes of phase ordering.)

Yes, that's a bit of a problem. I think that the way things are set up now it doesn't actually matter as morph will replace all long div/mod operations with helper calls. JIT32 doesn't do any optimizations for long div by power of 2, it always uses the helper.

CarolEidt · 2016-06-21T00:10:55Z

The changes look good to me.

CarolEidt · 2016-06-21T00:42:31Z

@mikedn - I am going to run these changes against the internal tests, but at the moment I am running into some technical difficulties. I hope to get them resolved soon. I'd really like to see this merged.

mikedn · 2016-06-21T05:07:18Z

Regarding CreateTemporary being a general solution: it tries to be that but it's worth noting there are use cases that don't use/need this "workaround" because sometimes it is known that the new statement won't be top level. Best example is in LowerArrElem - first time it makes a temporary it applies the workaround but next time it just asserts if it gets a top-level statement. We could change LowerArrElem to always use CreateTemporary but we would lose that assert, IMO that's not a problem.

I can't say I like the solution but I don't think there are many alternatives:

lower embedded statements before the top-level statement it contains them (might turn out to be a can of worms)
use the linear order instead of walking the tree (more work and probably another can of worms)

CarolEidt · 2016-06-22T15:54:00Z

I completed the internal tests, and it looks good. I did a very limited analysis of the diffs, and this is clearly getting more cases - mostly results in larger code, as expected, and uses more temps (due to introducing new ones in rationalize), but is actually smaller in some cases due to not forcing the operands to fixed registers.
@mikedn - Would it be possible to squash the commits?
@dotnet/jit-contrib - could I get another review so that I can merge this?

Optimizing GT_DIV/GT_UDIV/GT_MOD/GT_UMOD by power of 2 in codegen is problematic because the xarch DIV instruction has special register requirements. By the time codegen decides to perform the optimization the rax and rdx registers have been already allocated by LSRA even though they're not always needed (as it happens in the case of unsigned division where CDQ isn't used). Since the JIT can't represent a CDQ instruction in its IR an arithmetic shift (GT_RSH) has been instead to extract the dividend sign. xarch's SAR is larger than CDQ but it has the advantage that it doesn't require specific registers. Also, arithmetic shifts are available on architectures other than xarch. Example: method "static int foo(int x) => x / 8;" is now compiled to mov eax, ecx mov edx, eax sar edx, 31 and edx, 7 add edx, eax mov eax, edx sar eax, 3 instead of mov eax, ecx cdq and edx, 7 add eax, edx sar eax, 3 As a side-effect of this change the optimization now also works when the divisor is too large to be contained. Previously this wasn't possible because the divisor constant needed to be modified during codegen but the constant was already loaded into a register. Example: method "static ulong foo(ulong x) => x / 4294967296;" is now compiled to mov rax, rcx shr rax, 32 whereas before a DIV instruction was used. This change also fixes an issue in fgShouldUseMagicNumberDivide. The optimization that is done in lower can handle negative power of 2 divisors but fgShouldUseMagicNumberDivide handled those cases because it didn't check the absolute value of the divisor. Example: method "static int foo(int x) => return x / -2;" is now compiled to mov eax, ecx mov edx, eax shr edx, 31 add edx, eax sar edx, 1 mov eax, edx neg eax instead of mov eax, 0x7FFFFFFF imul edx:eax, ecx mov eax, edx sub eax, ecx mov edx, eax shr edx, 31 add eax, edx

mikedn · 2016-06-22T16:12:27Z

Would it be possible to squash the commits?

Sure, squashed down to 2 commits, one for tests and one for implementation.

mostly results in larger code, as expected, and uses more temps (due to introducing new ones in rationalize), but is actually smaller in some cases due to not forcing the operands to fixed

BTW, is there a way to ask LSRA to allocate RAX/RDX but only if they happen to be available? If we could ask the source of the sign extending shift to be in RAX and the destination in RDX then codegen could change RDX = RSH RAX, 63 to CDQ if the request is fulfilled by LSRA.

Thanks for review!

CarolEidt · 2016-06-22T16:34:08Z

Currently, register preferences (vs. requirements) are only set on lclVar intervals. The registerPreferences field exists on all Intervals, but is set during interval creation, based on the isTgtPref flag in the gtNodeInfo, as well as for direct copies. There isn't a convenient way to communicate preference info for tree nodes from Lowering to LSRA, though it may be something to consider in future. I've often thought it would be good to consider merging Lowering::TreeNodeInfoInit with the creation of Intervals and RefPositions (in fact I had a wild idea of eliminating RefPositions altogether for non-lclVars, but never had time to pursue it far enough).

briansull · 2016-06-22T18:07:40Z

src/jit/codegenxarch.cpp

+        // Perform the 'targetType' (64-bit or 32-bit) divide instruction
+        instruction ins;
+        if (oper == GT_UMOD || oper == GT_UDIV)
+            ins = INS_div;


Needs braces (i.e.) and I would also add a comment says that INS_div is an unsigned divide (as I always found this mnemonic to be confusing on x86)
{
ins = INS_div; // unsigned divide
}
else
{
ins = INS_idiv; // signed divide
}

briansull · 2016-06-22T18:14:30Z

Looks Good

sivarv · 2016-06-22T18:16:23Z

src/jit/lower.cpp

+// const divisor into equivalent but faster sequences.
+//
+// Arguments:
+//    pTree:   pointer to the parent node's link to the node we care about


This function header needs to be updated as this method has another additional argument 'data'

sivarv · 2016-06-22T18:34:08Z

Looks good.

CarolEidt · 2016-06-22T20:02:19Z

I ran asmdiffs for arm64, and took a brief look at them. They are similar to those for amd64. I'm going to merge this in the interest of forward progress. @mikedn - could you do a follow-up PR to address the feedback? @briansull - we may want to assess the perf implications for arm64; I don't know if the relative costs are similar.

mikedn · 2016-06-22T20:05:44Z

could you do a follow-up PR to address the feedback?

Sure, though I'm not sure what's the hurry with this one :)

CarolEidt · 2016-06-22T20:11:21Z

@mikedn - only that it's been so long, the outstanding feedback is relatively minor, and (IMO) it's easier to focus on just the new changes when it's a new PR.
BTW - thanks for all your patience on this one!

Improve div/mod by const power of 2 Commit migrated from dotnet/coreclr@3dfe85f

dnfclas added the cla-already-signed label Jun 18, 2016

mikedn force-pushed the modopt branch 2 times, most recently from 83fa0db to 71bad9a Compare June 20, 2016 11:20

mikedn changed the title ~~[WIP] Improve div/mod by const power of 2~~ Improve div/mod by const power of 2 Jun 20, 2016

CarolEidt reviewed Jun 20, 2016
View reviewed changes

CarolEidt mentioned this pull request Jun 21, 2016

Enable GT_CALL with long ret types for x86 RyuJIT #5890

Merged

mikedn force-pushed the modopt branch from 71bad9a to b487904 Compare June 21, 2016 19:52

mikedn added 2 commits June 22, 2016 18:59

Add tests for integer div/mod by const

4286b40

mikedn force-pushed the modopt branch from b487904 to babfca8 Compare June 22, 2016 16:04

briansull reviewed Jun 22, 2016
View reviewed changes

sivarv reviewed Jun 22, 2016
View reviewed changes

CarolEidt merged commit 3dfe85f into dotnet:master Jun 22, 2016

mikedn mentioned this pull request Jul 3, 2016

Address div/mod CR feedback #6106

Merged

karelz assigned mikedn Nov 9, 2016

mikedn deleted the modopt branch April 10, 2017 05:42

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022

Merge pull request dotnet/coreclr#5871 from mikedn/modopt

0012283

Improve div/mod by const power of 2 Commit migrated from dotnet/coreclr@3dfe85f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve div/mod by const power of 2 #5871

Improve div/mod by const power of 2 #5871

mikedn commented Jun 18, 2016

mikedn commented Jun 20, 2016

mikedn commented Jun 20, 2016

RussKeldorph commented Jun 20, 2016

sejongoh commented Jun 20, 2016 •

edited

Loading

CarolEidt Jun 20, 2016

mikedn Jun 21, 2016

ghost Jun 21, 2016

CarolEidt commented Jun 20, 2016

CarolEidt Jun 20, 2016

CarolEidt Jun 21, 2016

mikedn Jun 21, 2016

CarolEidt commented Jun 21, 2016

CarolEidt commented Jun 21, 2016

mikedn commented Jun 21, 2016

CarolEidt commented Jun 22, 2016

mikedn commented Jun 22, 2016

CarolEidt commented Jun 22, 2016

briansull Jun 22, 2016 •

edited

Loading

briansull commented Jun 22, 2016 •

edited

Loading

sivarv Jun 22, 2016

sivarv commented Jun 22, 2016

CarolEidt commented Jun 22, 2016

mikedn commented Jun 22, 2016

CarolEidt commented Jun 22, 2016

Improve div/mod by const power of 2 #5871

Improve div/mod by const power of 2 #5871

Conversation

mikedn commented Jun 18, 2016

mikedn commented Jun 20, 2016

mikedn commented Jun 20, 2016

RussKeldorph commented Jun 20, 2016

sejongoh commented Jun 20, 2016 • edited Loading

CarolEidt Jun 20, 2016

Choose a reason for hiding this comment

mikedn Jun 21, 2016

Choose a reason for hiding this comment

ghost Jun 21, 2016

Choose a reason for hiding this comment

CarolEidt commented Jun 20, 2016

CarolEidt Jun 20, 2016

Choose a reason for hiding this comment

CarolEidt Jun 21, 2016

Choose a reason for hiding this comment

mikedn Jun 21, 2016

Choose a reason for hiding this comment

CarolEidt commented Jun 21, 2016

CarolEidt commented Jun 21, 2016

mikedn commented Jun 21, 2016

CarolEidt commented Jun 22, 2016

mikedn commented Jun 22, 2016

CarolEidt commented Jun 22, 2016

briansull Jun 22, 2016 • edited Loading

Choose a reason for hiding this comment

briansull commented Jun 22, 2016 • edited Loading

sivarv Jun 22, 2016

Choose a reason for hiding this comment

sivarv commented Jun 22, 2016

CarolEidt commented Jun 22, 2016

mikedn commented Jun 22, 2016

CarolEidt commented Jun 22, 2016

sejongoh commented Jun 20, 2016 •

edited

Loading

briansull Jun 22, 2016 •

edited

Loading

briansull commented Jun 22, 2016 •

edited

Loading