Replace successive "ldr" and "str" instructions with "ldp" and "stp" #77540
Conversation
This change serves to address the following four Github tickets:

1. ARM64: Optimize pair of "ldr reg, [fp]" to ldp dotnet#35130
2. ARM64: Optimize pair of "ldr reg, [reg]" to ldp dotnet#35132
3. ARM64: Optimize pair of "str reg, [reg]" to stp dotnet#35133
4. ARM64: Optimize pair of "str reg, [fp]" to stp dotnet#35134

A technique was employed that involved detecting an optimisation opportunity as instruction sequences were being generated. The optimised instruction was then generated on top of the previous instruction, with no second instruction generated. Thus, there were no changes to instruction group size at “emission time” and no changes to jump instructions.
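As a rough illustration of the technique described above, here is a hedged sketch in C++; the types and names (EmittedIns, TryMergeIntoLdp, the offset and register checks) are stand-ins of my own and not the PR's actual emitter code:

```cpp
// Illustrative sketch only, not the PR's actual code: the stand-in struct models
// the information the ARM64 emitter would have about the most recently emitted
// instruction in the current instruction group. All names here are hypothetical.
enum Ins { INS_ldr, INS_str, INS_ldp, INS_stp };

struct EmittedIns
{
    Ins ins;
    int reg;            // destination (ldr) or source (str) register
    int baseReg;        // address base register, e.g. fp
    int offset;         // immediate offset, in bytes
    int secondReg = -1; // set once the instruction has been promoted to a pair
};

// Try to fold "ldr reg, [baseReg, #offset]" into the previous instruction,
// turning it into a single ldp. Returns true if no new instruction needs to be
// emitted, so the instruction group can only shrink and jumps are unaffected.
bool TryMergeIntoLdp(EmittedIns* prev, int reg, int baseReg, int offset)
{
    if (prev == nullptr || prev->ins != INS_ldr || prev->secondReg != -1)
        return false;

    // Same base register, adjacent 8-byte offsets, and the first load must not
    // have overwritten the base register that the second load depends on.
    if (prev->baseReg != baseReg || prev->reg == baseReg || offset != prev->offset + 8)
        return false;

    prev->ins       = INS_ldp; // overwrite the previous instruction in place
    prev->secondReg = reg;     // record the second destination register
    return true;
}
```

The real checks (register sizes, addressing modes, GC-ref tracking, volatility) are of course more involved than this sketch.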
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@dotnet/jit-contrib @BruceForstall
While this is definitely a good change, it feels to me that we need to have some common method to do the replacement (
Hi, Kunal. I am not sure what you mean here. It seems to me that any instruction can either be emitted and added to the instruction group or used to overwrite the last emitted instruction. I cannot see any way that this can be achieved without altering each emitting function. Can you please advise? Thanks, Andy
I think there is some GC tracking missing for the 2nd register. In the below diff, we need to report that both (windows-arm64 benchmark diff 3861.dasm) Same here: (windows-arm64 benchmark diff 26954.dasm) We see that towards the end of
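To make the concern concrete: when two loads of GC references are folded into one ldp, the surviving instruction has two destination registers, and liveness must be reported for both. A minimal sketch of that bookkeeping, with illustrative names rather than the emitter's real API:

```cpp
#include <set>

// Hypothetical GC-liveness table: registers currently holding GC references.
static std::set<int> gcLiveRegs;

// When a second ldr is folded into an ldp, liveness must be reported for both
// destination registers, not just the one recorded for the first ldr.
void ReportGcLivenessForLdp(int firstDestReg, bool firstIsGcRef,
                            int secondDestReg, bool secondIsGcRef)
{
    if (firstIsGcRef)
        gcLiveRegs.insert(firstDestReg);  // already reported when the first ldr was emitted
    if (secondIsGcRef)
        gcLiveRegs.insert(secondDestReg); // this is the report that was missing
}
```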
Looking closely at the PR, I think it should be fine to have the logic in emit_R_R_R_I(). Added some comments to point out the missing GC tracking information.
I think you should consider a different model where we support "back up" in the emitter.
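To illustrate the suggested model, here is a hedged sketch; InsGroup, Emit, and RemoveLastInstruction are hypothetical names of my own, not the real emitter API:

```cpp
#include <string>
#include <vector>

// Hypothetical model of an instruction group, not the real emitter types: the
// "back up" primitive removes the most recently emitted instruction so the
// peephole can re-emit a combined instruction through the normal emission path.
struct InsGroup
{
    std::vector<std::string> instrs; // encoded instructions, simplified to text

    void Emit(const std::string& ins) { instrs.push_back(ins); }

    // Undo the most recently emitted instruction and hand it back to the caller.
    std::string RemoveLastInstruction()
    {
        std::string last = instrs.back();
        instrs.pop_back();
        return last;
    }
};

// Usage sketch: on seeing "ldr x1, [fp, #16]" right after "ldr x0, [fp, #8]",
// the peephole backs up one instruction and emits the pair instead.
void ExamplePeephole(InsGroup& group)
{
    group.Emit("ldr x0, [fp, #8]");
    // ... the next ldr arrives and is recognised as the second half of a pair ...
    group.RemoveLastInstruction();
    group.Emit("ldp x0, x1, [fp, #8]");
}
```

The appeal of such a model would be that the combined instruction goes through the normal emission path, so bookkeeping like GC tracking does not have to be special-cased inside each emitting function.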
It would be useful to include a few examples of the improved asm diffs here in the comments. It would also be useful to have insight into what could be done to improve related codegen in the future (i.e., what cases are we still missing after this change?).
@AndyJGraham Given my comment #77540 (comment), I was curious if it would work. I implemented it with https://github.com/BruceForstall/runtime/tree/LdpStp_RemoveLastInstruction on top of this. I think it's the right way to go. It isn't completely bug free yet, though. However, while doing this, I noticed a couple things about the implementation:
but doesn't handle:
Saw them here too: #77540 (comment)
I think that one has been addressed, but I think there is a potential str/str=>stp GC hole.
consecutive STR and LDR instructions.
You might also want to take into account
@AndyJGraham Can you please fix the conflict so tests can run?
@a74nh @AndyJGraham @kunalspathak Assuming the perf regressions in #81551 are directly attributable to this work: are there cases where a single ldp/stp is slower on some platforms than two consecutive ldr/str?
ldp/stp should never be slower than two consecutive ldr/str. I ran this myself on an Altra, Ubuntu: 10 times with current head:
10 times with this ldr/str patch reverted:
The regression report shows it going from 4ns to 6ns. Looking at the 100, 1000, 10000 variations: current HEAD:
With the ldp/stp patch reverted:
Again, we're only seeing a few nanoseconds' difference on a much larger range. My gut says we are within variance and this difference should just vanish on the next run of the test suite. How easy is it to rerun the CI for those tests? I'll give IterateForEach a run to see if I get similar results.
It's possible that loop alignment or some other peephole optimization is regressing due to the differing instruction. Might be worth getting the disassembly to validate the before/after.
Assembly for the main routine under test hasn't changed at all: (The LDP and STP here are prologue/epilogue entries outside the scope of this patch)
Meanwhile, I'm getting a consistent perf drop of 0.1us for System.Collections.IterateForEach when the optimisation is enabled. Will investigate this a little more.
You might want to check the disassembly of
Narrowed down the Disabling LDP/STP optimization only on
Full assembly for MoveNext.
LDP is in G_M39015_IG07. This is outside of a loop. The only branches to code after the LDP are error cases. Code for MoveNext()
LDP is used for the load of entry.key and entry.value. Entry struct
TKey and TValue are both ints. So we shouldn't have any alignment issues within the struct. And I would hope that everything else in the dictionary is generally aligned too. I'm a little concerned about register dependencies. x1/w1 is being used as source and dest. But that shouldn't cause any issues. Still digging....
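For illustration, a rough C++ rendering of the layout being discussed; the real type is C#'s Dictionary&lt;TKey,TValue&gt;.Entry, and the exact field order here is my assumption:

```cpp
#include <cstdint>

// Rough C++ rendering of the entry being discussed; the real type is C#'s
// Dictionary<TKey,TValue>.Entry and the exact field order here is an assumption.
// With TKey = TValue = int, key and value occupy adjacent 4-byte slots, so a
// single ldp of two w-registers can load both at once.
struct Entry
{
    uint32_t hashCode;
    int32_t  next;
    int32_t  key;   // first register of the ldp
    int32_t  value; // second register of the ldp
};

static_assert(sizeof(Entry) == 16, "naturally aligned, 16 bytes total");
```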
Looks like @tannergooding was right about the alignment issues.
2 and 3 are the same because they are both doing two loads. Disassembly with addresses:
Looks like what we've now got is that some of those branch targets are misaligned addresses. When the LDP (at 0x0000ffffb74f26a8) is two LDRs instead, the misaligned addresses become aligned. I think we need to check that targets of branches are aligned, and if not insert a NOP to align them - that'll be the start of every basic block whose predecessor isn't the previous block. LLVM already has something for this (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64Subtarget.cpp#L94), and it is quite specific to the exact Arm processor. Before I start trying anything out - has aligning targets been discussed before? Is there already any code which does something similar?
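To make the idea concrete, a rough sketch of such a pass under simplified assumptions (hypothetical names; a real implementation would also have to account for changing branch distances and instruction-group sizes):

```cpp
#include <vector>

// Purely illustrative sketch of the idea floated above (coreclr's existing
// alignment support works differently): pad the start of any block that is
// reached only by branches so its first instruction lands on an aligned boundary.
struct Block
{
    int  startOffset;             // byte offset of the block's first instruction
    bool predecessorFallsThrough; // true if the previous block falls into this one
};

int InsertAlignmentNops(std::vector<Block>& blocks, int alignBytes = 8)
{
    int shift = 0; // bytes of padding inserted so far; shifts every later block
    for (Block& b : blocks)
    {
        b.startOffset += shift;
        if (b.predecessorFallsThrough)
            continue; // padding a fall-through block would just execute extra nops

        int misalignment = b.startOffset % alignBytes;
        if (misalignment != 0)
        {
            int pad = alignBytes - misalignment; // a multiple of 4 on ARM64
            b.startOffset += pad;
            shift         += pad;
        }
    }
    return shift / 4; // number of 4-byte nops inserted
}
```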
We already have this, but it is a heuristic and isn't always done. CC @kunalspathak
Are you suggesting that the loop back-edge branch
I believe the perf jobs run every night. @kunalspathak would know for sure.
Overall, if we can verify that the issue is due to loop alignment (or a cache effect or some other non-deterministic factor) and there isn't a bad interaction between the peephole optimization and the loop alignment implementation that is breaking loop alignment, then we might choose to simply ignore a regression. It's still worthwhile doing the due diligence to ensure we understand the regression to the best of our ability.
They are run every few hours on a batch of commits. However, looking at the overall history of the benchmark at https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_arm64_ubuntu%2020.04/System.Hashing.GetStringHashCode(BytesCount%3a%2010).html , the regression looks fairly stable.
We added loop alignment back in .NET 6 and you can read the detailed heuristics in https://devblogs.microsoft.com/dotnet/loop-alignment-in-net-6/. Essentially, we try to align the start of the loop (the target of the backedge) as much as possible, provided it fits various criteria like the size of the loop body, how much padding is needed, whether the loop is call-free, etc. We have seen cases in the past where the loops in code/benchmarks were aligned and, because of optimizations, they stopped being aligned and ended up regressing. Again, this usually happens because the algorithm decides that the loop body is too large for aligning to make sense, or because the amount of padding needed to align it is more than we can afford to waste.
I agree.
Just to be sure, try disabling loop alignment using
@kunalspathak Since this optimization only reduces code size, these cases shouldn't occur, right? Has the alignment padding already been committed when this peephole optimization occurs? Or will the alignment padding required be adjusted after the peep?
Ideally yes, but to confirm that, we need to really make sure that
No, the alignment padding adjustment happens after the peep.
Setting this didn't cause any difference. I then hacked coreclr so that it inserted a NOP at the start of the next block after the LDP, giving:
And this regained all the lost performance! Back to 2.18ms. Note that this is the only function in which I'm allowing peepholes to occur.
This is with LDP:
This is with LDP and a NOP:
Not quite, it would be the jumps to 0x0000ffffb74f26c4 or 0x0000ffffb74f26dc. My NOP causes both of these to become aligned.
I'd also notably expect the
Which means that the loop alignment is definitely not affecting it, although your experiment shows that aligning a few places would improve the performance (but not necessarily the performance that was lost with the ldp change). Basically, if you add
Were they aligned before your ldp change?
Before the LDP change, they were both aligned. So adding the NOP puts the rest of the instructions in the same positions as they were with two LDRs. I tried moving things around some more, and moving the NOP to after the ret (or anywhere else afterwards) drops the performance back down to 2.3ms.
Which is odd as G_M39015_IG09 is aligned and there is nothing branching to G_M39015_IG08.
I can give this a try too. The next step would be to recreate this as a standalone binary using that block of assembly and get it into a simulator. It might take a bit of time to get it showing the exact same behaviour. If we think it's important enough, then I can give it a go.
It certainly seems like there is some odd micro-architectural effect here. E.g., and this seems like grasping at straws, maybe there's an instruction prefetcher grabbing instructions at the presumably (!) cold, not-taken branch targets that are newly unaligned, causing conflicts with fetching the fall-through path? I'm not sure how much more I would invest in this investigation, although understanding more might save time the next time we see an unexplained regression.
I went through the issues and I don't see any other regressions.
Some updates..... I extracted the assembly for the entire function into a test program and set some dummy memory values for the dictionary. I ran this on a cycle-accurate simulator for the N1 and extracted the traces (including pipeline stages). I did this once for the program with LDP, and once with an LDP plus a NOP. There was nothing to suggest any difference between the two, except for the NOP adding a slight delay. Sadly, I'm unable to share any of the traces. What my test app doesn't replicate is the exact memory setup of the coreclr version (e.g. the code has the same alignment but is in a different location; the contents of the dictionary are different and live in a different location). So it's possible this is causing a difference. There are also differences from running under coreclr (e.g. GC) to take into account. As a diversion, now that I have some code to insert NOPs in arbitrary places during clr codegen, I experimented a bit more with moving the NOP around. I've annotated the code below with the benchmark result when a NOP (or 2 NOPs, or 3) is placed there.
It's hard to make firm statements here, but:
Both the regressions seem to have come back after that, even though I don't see any PR that would have improved them. The diff range is: 6ad1205...dce07a8. At this point, I won't spend much time on this, given that your experiments proved it was around general alignment (and not necessarily loop alignment). Thank you @a74nh for spending time investigating.
@a74nh That's some amazing in-depth analysis. I agree with @kunalspathak that it doesn't seem worth spending any more time on it at this point.