Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU] When allocating VGPRs, VGPR spills are not part of the prologue #109439

Merged
merged 2 commits into from
Sep 30, 2024

Conversation

jayfoad
Copy link
Contributor

@jayfoad jayfoad commented Sep 20, 2024

PRs #69924 and #72140 modified SIInstrInfo::isBasicBlockPrologue to skip
over EXEC modifications and spills when allocating VGPRs. But treating
VGPR spills as part of the prologue can confuse the register allocator
as in #109294, so restrict it to SGPR spills, which were inserted during
SGPR allocation which is done in an earlier pass.

Fixes: #109294
Fixes: SWDEV-485841

PRs llvm#69924 and llvm#72140 modified SIInstrInfo::isBasicBlockPrologue to skip
over EXEC modifications and spills when allocating VGPRs. But treating
VGPR spills as part of the prologue can confuse the register allocator
as in llvm#109294, so restrict it to SGPR spills, which were inserted during
SGPR allocation which is done in an earlier pass.

Fixes: llvm#109294
Fixes: SWDEV-485841
@llvmbot
Copy link
Collaborator

llvmbot commented Sep 20, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Jay Foad (jayfoad)

Changes

PRs #69924 and #72140 modified SIInstrInfo::isBasicBlockPrologue to skip
over EXEC modifications and spills when allocating VGPRs. But treating
VGPR spills as part of the prologue can confuse the register allocator
as in #109294, so restrict it to SGPR spills, which were inserted during
SGPR allocation which is done in an earlier pass.

Fixes: #109294
Fixes: SWDEV-485841


Full diff: https://github.com/llvm/llvm-project/pull/109439.diff

1 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+3-2)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 97e8b08270d615..509c5c56e15f57 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -8884,8 +8884,9 @@ bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
   // FIXME: Copies inserted in the block prolog for live-range split should also
   // be included.
   return IsNullOrVectorRegister &&
-         (isSpill(Opcode) || (!MI.isTerminator() && Opcode != AMDGPU::COPY &&
-                              MI.modifiesRegister(AMDGPU::EXEC, &RI)));
+         (isSGPRSpill(Opcode) ||
+          (!MI.isTerminator() && Opcode != AMDGPU::COPY &&
+           MI.modifiesRegister(AMDGPU::EXEC, &RI)));
 }
 
 MachineInstrBuilder

@jayfoad
Copy link
Contributor Author

jayfoad commented Sep 20, 2024

I don't know how to write a small test for this. The test case from #109294 is about 600K.

@cmc-rep cmc-rep self-requested a review September 20, 2024 18:03
@arsenm
Copy link
Contributor

arsenm commented Sep 20, 2024

I don't know how to write a small test for this. The test case from #109294 is about 600K.

You can try llvm-reducing mir test using a strict -stress-regalloc value, it sometimes works

@cdevadas
Copy link
Collaborator

@alex-t might have a smaller test case for this. He had the same observation and proposed the same patch.

@arsenm
Copy link
Contributor

arsenm commented Sep 21, 2024

Found #109514 while reducing this. Also ran into another edge case failure in the coalescer

@@ -8884,8 +8884,9 @@ bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
// FIXME: Copies inserted in the block prolog for live-range split should also
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This FIXME isn't relevant anymore. We are including only the SGPR spills to the prolog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@cdevadas
Copy link
Collaborator

reproducer.zip
This test has 254 lines and reproduces the error.
llc -O3 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 reproducer.ll -o out.s

I tried to reduce it further. But couldn't. Tried even -stress-regalloc. But nothing worked out to reduce it further.

@arsenm
Copy link
Contributor

arsenm commented Sep 25, 2024

reproducer.zip

There's a simpler reproducer in #109678 I'm working on reducing

@ruiling
Copy link
Contributor

ruiling commented Sep 26, 2024

As I mentioned in the other PR, I think we also need to include wwm register reload.

@cdevadas
Copy link
Collaborator

As I mentioned in the other PR, I think we also need to include wwm register reload.

The WWM reloads will happen for all lanes with the manipulated exec mask. Why do you think they should be included as well?

@alex-t
Copy link
Contributor

alex-t commented Sep 26, 2024

I have tested exactly the same change myself as an alternative to the #108596
the change has passed PSDB but it only run on MI200. Unfortunately, it causes blender barbershop scene rendering to hang on Navi21. That is why I did not published the PR for it.
I am currently trying to find out the reason for the hang.

@ruiling
Copy link
Contributor

ruiling commented Sep 27, 2024

As I mentioned in the other PR, I think we also need to include wwm register reload.

The WWM reloads will happen for all lanes with the manipulated exec mask. Why do you think they should be included as well?

I think it could be possible that the sgpr_input of the s_or_bnn exec, exec, sgpr_input was restored from wwm-vgpr. like:

wwm_vgpr_reload v0, ...
v_readlane_b32 s0, v0, 0
s_or_b32 exec, exec, s0

My point is the wwm_vgpr_reload is part of the block prologue, right?

@cdevadas
Copy link
Collaborator

My point is the wwm_vgpr_reload is part of the block prologue, right?

Yes. In such cases, the wwm-spill-restore should precede the readlane that restores the sgpr. This could frequently occur in the FastAlloc path. The liveout values are spilled at the block end and restored at the successor blocks' begin. Matt had a workaround to fix such cases in the fastalloc.
https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/RegAllocFast.cpp#L656
https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/RegAllocFast.cpp#L699
But this could be an edge case in the Greedy allocator and cause problems. The InlineSpiller and SplitKit need a similar workaround made by Matt. They seem quite ugly though.
I don't recollect exactly why I used isSpill in the original patch. This could be one of the reasons.

We could conditionally add the wwm-spill-restore to the block begin when there is already an instruction in the bb-prolog that uses this restored register. The isBasicBlockPrologue function can accommodate that.

@alex-t
Copy link
Contributor

alex-t commented Sep 28, 2024

I have tested exactly the same change myself as an alternative to the #108596 the change has passed PSDB but it only run on MI200. Unfortunately, it causes blender barbershop scene rendering to hang on Navi21. That is why I did not published the PR for it. I am currently trying to find out the reason for the hang.

The blender hang on my Navi21 was due to the old/incompatible runtime. So, we can go with this fix.

@alex-t
Copy link
Contributor

alex-t commented Sep 28, 2024

My point is the wwm_vgpr_reload is part of the block prologue, right?

Yes. In such cases, the wwm-spill-restore should precede the readlane that restores the sgpr. This could frequently occur in the FastAlloc path. The liveout values are spilled at the block end and restored at the successor blocks' begin. Matt had a workaround to fix such cases in the fastalloc. https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/RegAllocFast.cpp#L656 https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/RegAllocFast.cpp#L699 But this could be an edge case in the Greedy allocator and cause problems. The InlineSpiller and SplitKit need a similar workaround made by Matt. They seem quite ugly though. I don't recollect exactly why I used isSpill in the original patch. This could be one of the reasons.

We could conditionally add the wwm-spill-restore to the block begin when there is already an instruction in the bb-prolog that uses this restored register. The isBasicBlockPrologue function can accommodate that.

The isSGPRSpill is still too large a hummer. SGPR spills have nothing to do with the prologue. We only need SGPR reloads that define registers used by other prologue instructions. I tried a more selective algorithm but it caused a segmentation fault in the blender.
I haven't yet found the exact reason. The spill/reload pattern seems to change significantly if we exclude unnecessary spills/reloads.

@jayfoad
Copy link
Contributor Author

jayfoad commented Sep 30, 2024

The isSGPRSpill is still too large a hummer.

What do you think about committing this patch as a small step in the right direction? We can refine it more later.

@alex-t
Copy link
Contributor

alex-t commented Sep 30, 2024

My point is the wwm_vgpr_reload is part of the block prologue, right?

Yes. In such cases, the wwm-spill-restore should precede the readlane that restores the sgpr. This could frequently occur in the FastAlloc path. The liveout values are spilled at the block end and restored at the successor blocks' begin. Matt had a workaround to fix such cases in the fastalloc. https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/RegAllocFast.cpp#L656 https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/RegAllocFast.cpp#L699 But this could be an edge case in the Greedy allocator and cause problems. The InlineSpiller and SplitKit need a similar workaround made by Matt. They seem quite ugly though. I don't recollect exactly why I used isSpill in the original patch. This could be one of the reasons.

We could conditionally add the wwm-spill-restore to the block begin when there is already an instruction in the bb-prolog that uses this restored register. The isBasicBlockPrologue function can accommodate that.

The isSGPRSpill is still too large a hummer. SGPR spills have nothing to do with the prologue. We only need SGPR reloads that define registers used by other prologue instructions. I tried a more selective algorithm but it caused a segmentation fault in the blender.
I haven't yet found the exact reason. The spill/reload pattern seems to change significantly if we exclude unnecessary spills/reloads.

The isSGPRSpill is still too large a hummer.

What do you think about committing this patch as a small step in the right direction? We can refine it more later.

I agree with that. I would have enough time to sort out what is wrong with the more selective approach provided this is committed and no more app crashes happen.

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this prolog concept is a bit broken. This is also really tough to get a test out of, but I'm still trying (I'm hoping #110229 helps reduce it)

@jayfoad jayfoad merged commit 735a5f6 into llvm:main Sep 30, 2024
6 of 8 checks passed
@jayfoad jayfoad deleted the sgpr-spill-prologue branch September 30, 2024 12:25
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Oct 2, 2024
rever: hangs ocl tests at -O0
735a5f6 [AMDGPU] When allocating VGPRs, VGPR spills are not part of the prologue (llvm#109439)

Change-Id: If6452f3c5943af849b606cbe6f1262597c5e0f2f
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
…gue (llvm#109439)

PRs llvm#69924 and llvm#72140 modified SIInstrInfo::isBasicBlockPrologue to skip
over EXEC modifications and spills when allocating VGPRs. But treating
VGPR spills as part of the prologue can confuse the register allocator
as in llvm#109294, so restrict it to SGPR spills, which were inserted during
SGPR allocation which is done in an earlier pass.

Fixes: llvm#109294
Fixes: SWDEV-485841
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Oct 8, 2024
…not part of the prologue (llvm#109439)"

This is supposed to be fixed by llvm@6636f32.

More context: https://ontrack-internal.amd.com/browse/SWDEV-489621

Change-Id: I49ec81989eec8fda897dffb1bd7b4dbb76a98c46
@alex-t
Copy link
Contributor

alex-t commented Oct 9, 2024

#111496 addresses the issue caused by the scenario when WWM reload must be inserted before the block prologue because they reload operands for the prologue instructions. This breaks the prologue and consequent VGPR reloads are inserted before the EXEC restoring.
Please note, that the solution taken in #111496 takes us back to the same point where we started a while ago concerning the SplitKit assertion because of the interference.
Please correct me if I am wrong, but the WWM reload creates a new live interval defining the VGPR. This interval starts inside the prologue (given that WWM reloads belong to the prologue). So, if we're splitting some VReg interval across the VGPR defined by the WWM reload, the COPY insertion point will appear after the prologue and, hence, will interfere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[AMDGPU] Assertion `(!LeaveBefore || Idx <= LeaveBefore) && "Interference"' failed
6 participants