[AMDGPU] When allocating VGPRs, VGPR spills are not part of the prologue #109439

jayfoad · 2024-09-20T15:50:30Z

PRs #69924 and #72140 modified SIInstrInfo::isBasicBlockPrologue to skip
over EXEC modifications and spills when allocating VGPRs. But treating
VGPR spills as part of the prologue can confuse the register allocator
as in #109294, so restrict it to SGPR spills, which were inserted during
SGPR allocation which is done in an earlier pass.

Fixes: #109294
Fixes: SWDEV-485841


llvmbot · 2024-09-20T15:51:02Z

@llvm/pr-subscribers-backend-amdgpu

Author: Jay Foad (jayfoad)

Changes

PRs #69924 and #72140 modified SIInstrInfo::isBasicBlockPrologue to skip
over EXEC modifications and spills when allocating VGPRs. But treating
VGPR spills as part of the prologue can confuse the register allocator
as in #109294, so restrict it to SGPR spills, which were inserted during
SGPR allocation which is done in an earlier pass.

Fixes: #109294
Fixes: SWDEV-485841

Full diff: https://github.com/llvm/llvm-project/pull/109439.diff

1 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+3-2)

diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 97e8b08270d615..509c5c56e15f57 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -8884,8 +8884,9 @@ bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
   // FIXME: Copies inserted in the block prolog for live-range split should also
   // be included.
   return IsNullOrVectorRegister &&
-         (isSpill(Opcode) || (!MI.isTerminator() && Opcode != AMDGPU::COPY &&
-                              MI.modifiesRegister(AMDGPU::EXEC, &RI)));
+         (isSGPRSpill(Opcode) ||
+          (!MI.isTerminator() && Opcode != AMDGPU::COPY &&
+           MI.modifiesRegister(AMDGPU::EXEC, &RI)));
 }
 
 MachineInstrBuilder
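
For readers who want to see the effect of the one-line change in isolation, here is a minimal standalone sketch of the predicate before and after the patch. It is a toy model, not LLVM code: `Kind`, `ToyInstr`, `isExecSetup`, `isPrologueBefore` and `isPrologueAfter` are hypothetical names invented for illustration, and the boolean fields stand in for `MI.isTerminator()`, `Opcode != AMDGPU::COPY` and `MI.modifiesRegister(AMDGPU::EXEC, &RI)` from the diff.

```cpp
#include <cassert>

// Hypothetical classification of an instruction; not an LLVM type.
enum class Kind { SGPRSpill, VGPRSpill, Copy, Other };

// Hypothetical stand-in for the MachineInstr properties the predicate reads.
struct ToyInstr {
  Kind kind;
  bool isTerminator;
  bool modifiesExec;
};

// Shared tail of both versions: a non-terminator, non-COPY write to EXEC.
inline bool isExecSetup(const ToyInstr &mi) {
  return !mi.isTerminator && mi.kind != Kind::Copy && mi.modifiesExec;
}

// Before the patch: any spill (SGPR or VGPR) is treated as prologue.
inline bool isPrologueBefore(const ToyInstr &mi, bool nullOrVectorReg) {
  bool anySpill = mi.kind == Kind::SGPRSpill || mi.kind == Kind::VGPRSpill;
  return nullOrVectorReg && (anySpill || isExecSetup(mi));
}

// After the patch: only SGPR spills are treated as prologue.
inline bool isPrologueAfter(const ToyInstr &mi, bool nullOrVectorReg) {
  return nullOrVectorReg && (mi.kind == Kind::SGPRSpill || isExecSetup(mi));
}
```

In this model the only input whose answer changes is a VGPR spill queried with a null or vector register: it was prologue before and is not afterwards. SGPR spills and EXEC writes are classified the same way by both versions.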

jayfoad · 2024-09-20T15:51:50Z

I don't know how to write a small test for this. The test case from #109294 is about 600K.

arsenm · 2024-09-20T18:49:24Z

> I don't know how to write a small test for this. The test case from #109294 is about 600K.

You can try running llvm-reduce on a MIR test with a strict -stress-regalloc value; it sometimes works.

cdevadas · 2024-09-21T06:58:32Z

@alex-t might have a smaller test case for this. He had the same observation and proposed the same patch.

arsenm · 2024-09-21T07:01:43Z

Found #109514 while reducing this. Also ran into another edge case failure in the coalescer

cdevadas · 2024-09-23T08:04:53Z

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

@@ -8884,8 +8884,9 @@ bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
  // FIXME: Copies inserted in the block prolog for live-range split should also


This FIXME isn't relevant anymore. We now include only the SGPR spills in the prolog.

cdevadas · 2024-09-25T14:47:20Z

reproducer.zip
This test has 254 lines and reproduces the error.
llc -O3 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 reproducer.ll -o out.s

I tried to reduce it further but couldn't. Even -stress-regalloc did not help.

arsenm · 2024-09-25T15:13:45Z

> reproducer.zip

There's a simpler reproducer in #109678 that I'm working on reducing.

ruiling · 2024-09-26T06:39:02Z

As I mentioned in the other PR, I think we also need to include wwm register reload.

cdevadas · 2024-09-26T06:56:42Z

> As I mentioned in the other PR, I think we also need to include wwm register reload.

The WWM reloads will happen for all lanes with the manipulated exec mask. Why do you think they should be included as well?

alex-t · 2024-09-26T15:10:36Z

I have tested exactly the same change myself as an alternative to #108596.
The change passed PSDB, but it only ran on MI200. Unfortunately, it causes the Blender barbershop scene rendering to hang on Navi21. That is why I did not publish a PR for it.
I am currently trying to find out the reason for the hang.

ruiling · 2024-09-27T03:48:19Z

> > As I mentioned in the other PR, I think we also need to include wwm register reload.
>
> The WWM reloads will happen for all lanes with the manipulated exec mask. Why do you think they should be included as well?

I think it is possible that the sgpr_input of s_or_bnn exec, exec, sgpr_input was restored from a WWM VGPR, like:

wwm_vgpr_reload v0, ...
v_readlane_b32 s0, v0, 0
s_or_b32 exec, exec, s0

My point is the wwm_vgpr_reload is part of the block prologue, right?

cdevadas · 2024-09-27T08:07:51Z

> My point is the wwm_vgpr_reload is part of the block prologue, right?

Yes. In such cases, the wwm-spill-restore should precede the readlane that restores the SGPR. This could occur frequently in the FastAlloc path, where the live-out values are spilled at the block end and restored at the beginning of the successor blocks. Matt had a workaround to fix such cases in the fast allocator:
https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/RegAllocFast.cpp#L656
https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/RegAllocFast.cpp#L699
But this could be an edge case in the Greedy allocator and cause problems. The InlineSpiller and SplitKit would need a workaround similar to Matt's, and those seem quite ugly.
I don't recollect exactly why I used isSpill in the original patch. This could be one of the reasons.

We could conditionally add the wwm-spill-restore to the block begin when there is already an instruction in the bb-prolog that uses this restored register. The isBasicBlockPrologue function can accommodate that.
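
One way to picture the conditional rule suggested above is the following toy sketch. It is not LLVM code and not a proposed implementation: `PrologueInstr` and `markPrologue` are hypothetical names, registers are plain integers, and "reload" lumps together WWM VGPR reloads and SGPR restores. The sketch assumes that EXEC-modifying instructions in the leading run of the block are always prologue, and that a reload is prologue only if a later prologue instruction reads the register it defines.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical toy instruction; not an LLVM type.
struct PrologueInstr {
  int def;                // register defined, or -1 for none
  std::vector<int> uses;  // registers read
  bool isReload;          // spill restore (WWM VGPR or SGPR) in this toy
  bool modifiesExec;      // writes EXEC
};

// Marks which instructions of the block would count as prologue under the
// conditional rule. Only the leading run of reloads and EXEC writes is
// considered; everything after it is never prologue.
std::vector<bool> markPrologue(const std::vector<PrologueInstr> &block) {
  std::vector<bool> inPrologue(block.size(), false);

  // Find the end of the leading run of candidate instructions.
  std::size_t end = 0;
  while (end < block.size() &&
         (block[end].isReload || block[end].modifiesExec))
    ++end;

  // Walk backwards, tracking registers still needed by prologue instructions.
  std::set<int> needed;
  for (std::size_t i = end; i-- > 0;) {
    const PrologueInstr &mi = block[i];
    bool keep = mi.modifiesExec ||
                (mi.isReload && needed.count(mi.def) != 0);
    if (!keep)
      continue;
    inPrologue[i] = true;
    needed.erase(mi.def);  // this definition satisfies the later use
    needed.insert(mi.uses.begin(), mi.uses.end());
  }
  return inPrologue;
}
```

Applied to the three-instruction sequence from ruiling's comment with an unrelated VGPR reload placed in front of it, this marks the WWM reload, the readlane and the EXEC update as prologue and leaves the unrelated reload out.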

alex-t · 2024-09-28T19:06:12Z

> I have tested exactly the same change myself as an alternative to #108596. The change passed PSDB, but it only ran on MI200. Unfortunately, it causes the Blender barbershop scene rendering to hang on Navi21. That is why I did not publish a PR for it. I am currently trying to find out the reason for the hang.

The blender hang on my Navi21 was due to the old/incompatible runtime. So, we can go with this fix.

alex-t · 2024-09-28T20:19:52Z

> > My point is the wwm_vgpr_reload is part of the block prologue, right?
>
> Yes. In such cases, the wwm-spill-restore should precede the readlane that restores the sgpr. This could frequently occur in the FastAlloc path. The liveout values are spilled at the block end and restored at the successor blocks' begin. Matt had a workaround to fix such cases in the fastalloc. https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/RegAllocFast.cpp#L656 https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/RegAllocFast.cpp#L699 But this could be an edge case in the Greedy allocator and cause problems. The InlineSpiller and SplitKit need a similar workaround made by Matt. They seem quite ugly though. I don't recollect exactly why I used isSpill in the original patch. This could be one of the reasons.
>
> We could conditionally add the wwm-spill-restore to the block begin when there is already an instruction in the bb-prolog that uses this restored register. The isBasicBlockPrologue function can accommodate that.

The isSGPRSpill is still too large a hammer. SGPR spills have nothing to do with the prologue. We only need SGPR reloads that define registers used by other prologue instructions. I tried a more selective algorithm, but it caused a segmentation fault in Blender.
I haven't yet found the exact reason. The spill/reload pattern seems to change significantly if we exclude unnecessary spills/reloads.

jayfoad · 2024-09-30T09:43:26Z

> The isSGPRSpill is still too large a hammer.

What do you think about committing this patch as a small step in the right direction? We can refine it more later.

alex-t · 2024-09-30T10:54:56Z


> > The isSGPRSpill is still too large a hammer.
>
> What do you think about committing this patch as a small step in the right direction? We can refine it more later.

I agree with that. Provided this is committed and no more app crashes happen, I will have enough time to sort out what is wrong with the more selective approach.

arsenm

I still think this prolog concept is a bit broken. This is also really tough to get a test out of, but I'm still trying (I'm hoping #110229 helps reduce it)

rever: hangs ocl tests at -O0 735a5f6 [AMDGPU] When allocating VGPRs, VGPR spills are not part of the prologue (llvm#109439) Change-Id: If6452f3c5943af849b606cbe6f1262597c5e0f2f

…gue (llvm#109439) PRs llvm#69924 and llvm#72140 modified SIInstrInfo::isBasicBlockPrologue to skip over EXEC modifications and spills when allocating VGPRs. But treating VGPR spills as part of the prologue can confuse the register allocator as in llvm#109294, so restrict it to SGPR spills, which were inserted during SGPR allocation which is done in an earlier pass. Fixes: llvm#109294 Fixes: SWDEV-485841

…not part of the prologue (llvm#109439)" This is supposed to be fixed by llvm@6636f32. More context: https://ontrack-internal.amd.com/browse/SWDEV-489621 Change-Id: I49ec81989eec8fda897dffb1bd7b4dbb76a98c46

alex-t · 2024-10-09T14:54:34Z

#111496 addresses the issue caused by the scenario in which WWM reloads must be inserted before the block prologue because they reload operands for the prologue instructions. This breaks the prologue, and subsequent VGPR reloads are inserted before the EXEC restore.
Please note that the solution taken in #111496 takes us back to the point where we started a while ago concerning the SplitKit assertion caused by the interference.
Please correct me if I am wrong, but the WWM reload creates a new live interval defining the VGPR. This interval starts inside the prologue (given that WWM reloads belong to the prologue). So, if we're splitting some VReg interval across the VGPR defined by the WWM reload, the COPY insertion point will appear after the prologue and, hence, will interfere.
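
The interference concern above can be illustrated with a small toy model. This is only one reading of the comment, not LLVM code: `BlockInstr`, `copyInsertPoint` and `copyInterferes` are hypothetical names, and the model assumes that a split COPY is placed at the first instruction after the leading run of prologue instructions, and that it interferes when a prologue instruction has already defined the physical register whose incoming value the COPY would need.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy instruction; not an LLVM type.
struct BlockInstr {
  bool inPrologue;  // what a prologue predicate would answer for it
  int physDef;      // physical register it defines, or -1 for none
};

// Index where a split COPY would be placed in this model: the first
// instruction after the leading run of prologue instructions.
std::size_t copyInsertPoint(const std::vector<BlockInstr> &block) {
  std::size_t i = 0;
  while (i < block.size() && block[i].inPrologue)
    ++i;
  return i;
}

// True if a prologue instruction ahead of the insertion point defines
// physReg, so the live range that starts there overlaps the value the COPY
// was supposed to read from physReg.
bool copyInterferes(const std::vector<BlockInstr> &block, int physReg) {
  std::size_t end = copyInsertPoint(block);
  for (std::size_t i = 0; i < end; ++i)
    if (block[i].physDef == physReg)
      return true;
  return false;
}
```

With a WWM reload that defines register 5 counted as prologue, `copyInterferes(block, 5)` is true in this model. If the same reload is not counted as prologue, the insertion point moves in front of it and the check returns false.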

llvmbot added the backend:AMDGPU label Sep 20, 2024

jayfoad requested review from alex-t, nhaehnle and cdevadas September 20, 2024 15:51

cmc-rep self-requested a review September 20, 2024 18:03

cdevadas reviewed Sep 23, 2024

jayfoad requested a review from ruiling September 30, 2024 10:59

Remove irrelevant FIXME

7796e26

arsenm approved these changes Sep 30, 2024

jayfoad merged commit 735a5f6 into llvm:main Sep 30, 2024
6 of 8 checks passed

jayfoad deleted the sgpr-spill-prologue branch September 30, 2024 12:25
