[AMDGPU] Fix negative immediate offset for unbuffered smem loads #79553

Closed

Conversation

vangthao95
Contributor

For unbuffered smem loads, it is illegal and undefined for the immediate offset to be negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is negative. As a workaround for this issue, if there is no SGPR[Offset] and the immediate offset is negative, subtract the absolute value of the immediate offset from the base address, then change the immediate offset to 0.
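
For illustration, here is a minimal standalone sketch of that rewrite (the concrete values and the program itself are hypothetical, not part of the patch); the point is that folding the negative immediate into the base leaves the effective address unchanged:

```cpp
// Sketch: fold a negative immediate offset into the base address.
#include <cassert>
#include <cstdint>
#include <cstdlib>

int main() {
  uint64_t SBase = 0x1000; // hypothetical 64-bit base address (SGPR pair)
  int64_t IOffset = -16;   // negative immediate offset (not legal by itself)

  // Workaround: subtract |IOffset| from the base and clear the immediate.
  uint64_t NewBase = SBase - static_cast<uint64_t>(std::abs(IOffset));
  int64_t NewIOffset = 0;

  // The effective address is preserved.
  assert(SBase + IOffset == NewBase + NewIOffset);
  return 0;
}
```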

@llvmbot
Collaborator

llvmbot commented Jan 26, 2024

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: None (vangthao95)

Changes

For unbuffered smem loads, it is illegal and undefined for the immediate offset to be negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is negative. As a workaround for this issue, if there is no SGPR[Offset] and the immediate offset is negative, subtract the absolute value of the immediate offset from the base address, then change the immediate offset to 0.


Patch is 41.48 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/79553.diff

10 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUGISel.td (+4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (+42-7)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h (+9-5)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+34-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h (+6-2)
  • (modified) llvm/lib/Target/AMDGPU/SMInstructions.td (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-constant.mir (+202-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-smem.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-load.ll (+14-7)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
index 152f495a452ba2..0017f51ee5d925 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
@@ -116,6 +116,10 @@ def gi_smrd_sgpr_imm :
     GIComplexOperandMatcher<s64, "selectSmrdSgprImm">,
     GIComplexPatternEquiv<SMRDSgprImm>;
 
+def gi_smrd_prefetch_imm :
+    GIComplexOperandMatcher<s64, "selectSmrdPrefetchImm">,
+    GIComplexPatternEquiv<SMRDPrefetchImm>;
+
 def gi_flat_offset :
     GIComplexOperandMatcher<s64, "selectFlatOffset">,
     GIComplexPatternEquiv<FlatOffset>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 4f7bf3f7d35e71..d57314310184e3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -2072,13 +2072,16 @@ SDValue AMDGPUDAGToDAGISel::Expand32BitAddress(SDValue Addr) const {
 // true, match only 32-bit immediate offsets available on CI.
 bool AMDGPUDAGToDAGISel::SelectSMRDBaseOffset(SDValue Addr, SDValue &SBase,
                                               SDValue *SOffset, SDValue *Offset,
-                                              bool Imm32Only,
-                                              bool IsBuffer) const {
+                                              bool Imm32Only, bool IsBuffer,
+                                              bool IsPrefetch,
+                                              bool HasSOffset) const {
   if (SOffset && Offset) {
     assert(!Imm32Only && !IsBuffer);
     SDValue B;
-    return SelectSMRDBaseOffset(Addr, B, nullptr, Offset) &&
-           SelectSMRDBaseOffset(B, SBase, SOffset, nullptr);
+    return SelectSMRDBaseOffset(Addr, B, nullptr, Offset, false, false,
+                                IsPrefetch, true) &&
+           SelectSMRDBaseOffset(B, SBase, SOffset, nullptr, false, false,
+                                IsPrefetch, true);
   }
 
   // A 32-bit (address + offset) should not cause unsigned 32-bit integer
@@ -2097,12 +2100,39 @@ bool AMDGPUDAGToDAGISel::SelectSMRDBaseOffset(SDValue Addr, SDValue &SBase,
   }
   if (!N0 || !N1)
     return false;
+
+  bool Selected = false;
   if (SelectSMRDOffset(N1, SOffset, Offset, Imm32Only, IsBuffer)) {
     SBase = N0;
-    return true;
+    Selected = true;
   }
+
   if (SelectSMRDOffset(N0, SOffset, Offset, Imm32Only, IsBuffer)) {
     SBase = N1;
+    Selected = true;
+  }
+
+  if (Selected) {
+    // For unbuffered smem loads, it is illegal and undefined for the Immediate
+    // Offset to be negative if the resulting (Offset + (M0 or SOffset or zero)
+    // is negative. Handle the case where the Immediate Offset is negative and
+    // there is no SOffset.
+    //
+    // FIXME: Also handle M0 or SOffset case?
+    if (Offset && !HasSOffset && !IsBuffer && !IsPrefetch &&
+        Subtarget->getGeneration() >= AMDGPUSubtarget::GFX11) {
+      if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(*Offset)) {
+        if (C->getSExtValue() < 0) {
+          SDLoc SL(SBase);
+          *Offset = CurDAG->getTargetConstant(std::abs(C->getSExtValue()), SL,
+                                              MVT::i32);
+          const SDValue Ops[] = {SBase, *Offset};
+          SBase = SDValue(
+              CurDAG->getMachineNode(AMDGPU::S_SUB_U64, SL, MVT::i64, Ops), 0);
+          *Offset = CurDAG->getTargetConstant(0, SL, MVT::i32);
+        }
+      }
+    }
     return true;
   }
   return false;
@@ -2110,8 +2140,8 @@ bool AMDGPUDAGToDAGISel::SelectSMRDBaseOffset(SDValue Addr, SDValue &SBase,
 
 bool AMDGPUDAGToDAGISel::SelectSMRD(SDValue Addr, SDValue &SBase,
                                     SDValue *SOffset, SDValue *Offset,
-                                    bool Imm32Only) const {
-  if (SelectSMRDBaseOffset(Addr, SBase, SOffset, Offset, Imm32Only)) {
+                                    bool Imm32Only, bool IsPrefetch) const {
+  if (SelectSMRDBaseOffset(Addr, SBase, SOffset, Offset, Imm32Only, IsPrefetch)) {
     SBase = Expand32BitAddress(SBase);
     return true;
   }
@@ -2170,6 +2200,11 @@ bool AMDGPUDAGToDAGISel::SelectSMRDBufferSgprImm(SDValue N, SDValue &SOffset,
                               /* IsBuffer */ true);
 }
 
+bool AMDGPUDAGToDAGISel::SelectSMRDPrefetchImm(SDValue Addr, SDValue &SBase,
+                                       SDValue &Offset) const {
+  return SelectSMRD(Addr, SBase, /* SOffset */ nullptr, &Offset, false, true);
+}
+
 bool AMDGPUDAGToDAGISel::SelectMOVRELOffset(SDValue Index,
                                             SDValue &Base,
                                             SDValue &Offset) const {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
index 3b42d88df0c246..5328ba985474dd 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
@@ -194,11 +194,13 @@ class AMDGPUDAGToDAGISel : public SelectionDAGISel {
                         SDValue *Offset, bool Imm32Only = false,
                         bool IsBuffer = false) const;
   SDValue Expand32BitAddress(SDValue Addr) const;
-  bool SelectSMRDBaseOffset(SDValue Addr, SDValue &SBase, SDValue *SOffset,
-                            SDValue *Offset, bool Imm32Only = false,
-                            bool IsBuffer = false) const;
-  bool SelectSMRD(SDValue Addr, SDValue &SBase, SDValue *SOffset,
-                  SDValue *Offset, bool Imm32Only = false) const;
+  bool SelectSMRDBaseOffset(SDValue Addr, SDValue & SBase, SDValue * SOffset,
+                            SDValue * Offset, bool Imm32Only = false,
+                            bool IsBuffer = false, bool IsPrefetch = false,
+                            bool HasSOffset = false) const;
+  bool SelectSMRD(SDValue Addr, SDValue & SBase, SDValue * SOffset,
+                  SDValue * Offset, bool Imm32Only = false,
+                  bool IsPrefetch = false) const;
   bool SelectSMRDImm(SDValue Addr, SDValue &SBase, SDValue &Offset) const;
   bool SelectSMRDImm32(SDValue Addr, SDValue &SBase, SDValue &Offset) const;
   bool SelectSMRDSgpr(SDValue Addr, SDValue &SBase, SDValue &SOffset) const;
@@ -208,6 +210,8 @@ class AMDGPUDAGToDAGISel : public SelectionDAGISel {
   bool SelectSMRDBufferImm32(SDValue N, SDValue &Offset) const;
   bool SelectSMRDBufferSgprImm(SDValue N, SDValue &SOffset,
                                SDValue &Offset) const;
+  bool SelectSMRDPrefetchImm(SDValue Addr, SDValue & SBase, SDValue & Offset)
+      const;
   bool SelectMOVRELOffset(SDValue Index, SDValue &Base, SDValue &Offset) const;
 
   bool SelectVOP3ModsImpl(SDValue In, SDValue &Src, unsigned &SrcMods,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index f255d098b631c7..c7cc701c63dc54 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -4221,7 +4221,8 @@ AMDGPUInstructionSelector::selectVINTERPModsHi(MachineOperand &Root) const {
 bool AMDGPUInstructionSelector::selectSmrdOffset(MachineOperand &Root,
                                                  Register &Base,
                                                  Register *SOffset,
-                                                 int64_t *Offset) const {
+                                                 int64_t *Offset,
+                                                 bool IsPrefetch) const {
   MachineInstr *MI = Root.getParent();
   MachineBasicBlock *MBB = MI->getParent();
 
@@ -4257,6 +4258,27 @@ bool AMDGPUInstructionSelector::selectSmrdOffset(MachineOperand &Root,
   if (Offset && GEPI.SgprParts.size() == 1 && EncodedImm) {
     Base = GEPI.SgprParts[0];
     *Offset = *EncodedImm;
+    // For unbuffered smem loads, it is illegal and undefined for the Immediate
+    // Offset to be negative if the resulting (Offset + (M0 or SOffset or zero)
+    // is negative. Handle the case where the Immediate Offset is negative and
+    // there is no SOffset.
+    //
+    // FIXME: Also handle M0 or SOffset case?
+    if (!IsPrefetch && *Offset < 0 &&
+        STI.getGeneration() >= AMDGPUSubtarget::GFX11) {
+      // Subtract the absolute value of the offset from the base register and
+      // set the immediate offset to 0.
+      Register SubtractReg =
+          MRI->createVirtualRegister(&AMDGPU::SReg_64RegClass);
+
+      BuildMI(*MBB, MI, MI->getDebugLoc(), TII.get(AMDGPU::S_SUB_U64),
+              SubtractReg)
+          .addReg(Base)
+          .addImm(std::abs(*Offset));
+      Base = SubtractReg;
+      *Offset = 0;
+    }
+
     return true;
   }
 
@@ -4339,6 +4361,17 @@ AMDGPUInstructionSelector::selectSmrdSgprImm(MachineOperand &Root) const {
            [=](MachineInstrBuilder &MIB) { MIB.addImm(Offset); }}};
 }
 
+InstructionSelector::ComplexRendererFns
+AMDGPUInstructionSelector::selectSmrdPrefetchImm(MachineOperand &Root) const {
+  Register Base;
+  int64_t Offset;
+  if (!selectSmrdOffset(Root, Base, /* SOffset= */ nullptr, &Offset, true))
+    return std::nullopt;
+
+  return {{[=](MachineInstrBuilder &MIB) { MIB.addReg(Base); },
+           [=](MachineInstrBuilder &MIB) { MIB.addImm(Offset); }}};
+}
+
 std::pair<Register, int>
 AMDGPUInstructionSelector::selectFlatOffsetImpl(MachineOperand &Root,
                                                 uint64_t FlatVariant) const {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index ef7630f137aca6..573bc9260c765b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -220,8 +220,10 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
   InstructionSelector::ComplexRendererFns
   selectVINTERPModsHi(MachineOperand &Root) const;
 
-  bool selectSmrdOffset(MachineOperand &Root, Register &Base, Register *SOffset,
-                        int64_t *Offset) const;
+  bool selectSmrdOffset(MachineOperand & Root, Register & Base,
+                        Register * SOffset, int64_t * Offset,
+                        bool IsPrefetch = false) const;
+
   InstructionSelector::ComplexRendererFns
   selectSmrdImm(MachineOperand &Root) const;
   InstructionSelector::ComplexRendererFns
@@ -230,6 +232,8 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
   selectSmrdSgpr(MachineOperand &Root) const;
   InstructionSelector::ComplexRendererFns
   selectSmrdSgprImm(MachineOperand &Root) const;
+  InstructionSelector::ComplexRendererFns
+  selectSmrdPrefetchImm(MachineOperand &Root) const;
 
   std::pair<Register, int> selectFlatOffsetImpl(MachineOperand &Root,
                                                 uint64_t FlatVariant) const;
diff --git a/llvm/lib/Target/AMDGPU/SMInstructions.td b/llvm/lib/Target/AMDGPU/SMInstructions.td
index f082de35b6ae9a..8c9e3fa5b9b738 100644
--- a/llvm/lib/Target/AMDGPU/SMInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SMInstructions.td
@@ -859,6 +859,7 @@ def SMRDSgprImm     : ComplexPattern<iPTR, 3, "SelectSMRDSgprImm">;
 def SMRDBufferImm   : ComplexPattern<iPTR, 1, "SelectSMRDBufferImm">;
 def SMRDBufferImm32 : ComplexPattern<iPTR, 1, "SelectSMRDBufferImm32">;
 def SMRDBufferSgprImm : ComplexPattern<iPTR, 2, "SelectSMRDBufferSgprImm">;
+def SMRDPrefetchImm : ComplexPattern<iPTR, 2, "SelectSMRDPrefetchImm">;
 
 multiclass SMRD_Pattern <string Instr, ValueType vt, bit immci = true> {
 
@@ -1080,7 +1081,7 @@ def i32imm_one : TImmLeaf <i32, [{
 
 multiclass SMPrefetchPat<string type, TImmLeaf cache_type> {
   def : GCNPat <
-    (smrd_prefetch (SMRDImm i64:$sbase, i32:$offset), timm, timm, cache_type),
+    (smrd_prefetch (SMRDPrefetchImm i64:$sbase, i32:$offset), timm, timm, cache_type),
     (!cast<SM_Prefetch_Pseudo>("S_PREFETCH_"#type) $sbase, $offset, (i32 SGPR_NULL), (i8 0))
   >;
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-constant.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-constant.mir
index c44477273dad09..b7010e4c65beb9 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-constant.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-constant.mir
@@ -3,7 +3,7 @@
 # RUN: llc -amdgpu-global-isel-new-legality -mtriple=amdgcn -mcpu=hawaii -run-pass=instruction-select -verify-machineinstrs  -global-isel-abort=0 -o - %s | FileCheck -check-prefix=GFX7 %s
 # RUN: llc -amdgpu-global-isel-new-legality -mtriple=amdgcn -mcpu=fiji -run-pass=instruction-select -verify-machineinstrs -global-isel-abort=0 -o - %s | FileCheck -check-prefix=GFX8 %s
 # RUN: llc -amdgpu-global-isel-new-legality -mtriple=amdgcn -mcpu=gfx1010 -run-pass=instruction-select -verify-machineinstrs -global-isel-abort=0 -o - %s | FileCheck -check-prefix=GFX10 %s
-# RUN: llc -amdgpu-global-isel-new-legality -mtriple=amdgcn -mcpu=gfx1100 -run-pass=instruction-select -verify-machineinstrs -global-isel-abort=0 -o - %s | FileCheck -check-prefix=GFX10 %s
+# RUN: llc -amdgpu-global-isel-new-legality -mtriple=amdgcn -mcpu=gfx1100 -run-pass=instruction-select -verify-machineinstrs -global-isel-abort=0 -o - %s | FileCheck -check-prefix=GFX11 %s
 
 ---
 
@@ -44,6 +44,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORD_IMM:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 0, 0 :: (load (s32), addrspace 4)
     ; GFX10-NEXT: $sgpr0 = COPY [[S_LOAD_DWORD_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_s32_from_4
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORD_IMM:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 0, 0 :: (load (s32), addrspace 4)
+    ; GFX11-NEXT: $sgpr0 = COPY [[S_LOAD_DWORD_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(s32) = G_LOAD %0 :: (load (s32), align 4, addrspace 4)
     $sgpr0 = COPY %1
@@ -89,6 +96,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORD_IMM:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 0, 0 :: (load (<2 x s16>), addrspace 4)
     ; GFX10-NEXT: $sgpr0 = COPY [[S_LOAD_DWORD_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_v2s16_from_4
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORD_IMM:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 0, 0 :: (load (<2 x s16>), addrspace 4)
+    ; GFX11-NEXT: $sgpr0 = COPY [[S_LOAD_DWORD_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(<2 x s16>) = G_LOAD %0 :: (load (<2 x s16>), align 4, addrspace 4)
     $sgpr0 = COPY %1
@@ -133,6 +147,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (<2 x s32>), addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_v2s32
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (<2 x s32>), addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(<2 x s32>) = G_LOAD %0 :: (load (<2 x s32>), align 8, addrspace 4)
     $sgpr0_sgpr1 = COPY %1
@@ -176,6 +197,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (<2 x s32>), align 4, addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_v2s32_align4
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (<2 x s32>), align 4, addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(<2 x s32>) = G_LOAD %0 :: (load (<2 x s32>), align 4, addrspace 4)
     $sgpr0_sgpr1 = COPY %1
@@ -219,6 +247,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (<4 x s16>), align 4, addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_v4s16_align4
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (<4 x s16>), align 4, addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(<4 x s16>) = G_LOAD %0 :: (load (<4 x s16>), align 4, addrspace 4)
     $sgpr0_sgpr1 = COPY %1
@@ -263,6 +298,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX4_IMM:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[COPY]], 0, 0 :: (load (<4 x s32>), align 4, addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1_sgpr2_sgpr3 = COPY [[S_LOAD_DWORDX4_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_v4s32_align4
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX4_IMM:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[COPY]], 0, 0 :: (load (<4 x s32>), align 4, addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1_sgpr2_sgpr3 = COPY [[S_LOAD_DWORDX4_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(<4 x  s32>) = G_LOAD %0 :: (load (<4 x s32>), align 4, addrspace 4)
     $sgpr0_sgpr1_sgpr2_sgpr3 = COPY %1
@@ -307,6 +349,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (s64), addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_s64
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (s64), addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(s64) = G_LOAD %0 :: (load (s64), align 8, addrspace 4)
     $sgpr0_sgpr1 = COPY %1
@@ -351,6 +400,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (s64), align 4, addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_s64_align4
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (s64), align 4, addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(s64) = G_LOAD %0 :: (load (s64), align 4, addrspace 4)
     $sgpr0_sgpr1 = COPY %1
@@ -395,6 +451,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX4_IMM:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[COPY]], 0, 0 :: (load (<2 x s64>), align 4, addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1_sgpr2_sgpr3 = COPY [[S_LOAD_DWORDX4_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_v2s64
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX4_IMM:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[COPY]], 0, 0 :: (load (<2 x s64>), align 4, addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1_sgpr2_sgpr3 = COPY [[S_LOAD_...
[truncated]


github-actions bot commented Jan 26, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@vangthao95 force-pushed the vthao-fix-neg-imm-offset-smem-loads branch from 0880df2 to 8194ef1 on February 2, 2024 23:32
@vangthao95
Contributor Author

ping

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (outdated review thread, resolved)
Comment on lines 119 to 121
def gi_smrd_prefetch_imm :
GIComplexOperandMatcher<s64, "selectSmrdPrefetchImm">,
GIComplexPatternEquiv<SMRDPrefetchImm>;
Contributor

This looks like new GlobalISel support for prefetch; this should be done separately.

Contributor Author

This is splitting the handling of prefetch from the general handling of SMRD selection. There wasn't a distinction between prefetch and s_load instructions, so I had to create a way to distinguish between the two. It should be an NFC change. Should this still be done separately?

Contributor

Yes

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (outdated review thread, resolved)
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (outdated review thread, resolved)
@vangthao95
Contributor Author

Hi @jayfoad @Pierre-vh @rampitec,

Matt is on vacation this week; may I get some help reviewing this patch?

Thanks

@vangthao95
Contributor Author

ping

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (outdated review thread, resolved)
Selected = true;
}

if (Selected) {
Contributor

This is confusing: you're allowing the invalid selection and then later trying to adjust the offset. It would be clearer to adjust the initial selection to reject cases where the immediate cannot be used.

Contributor Author

I moved this code inside SelectSMRDOffset(), where the offset is selected, to make it clearer. However, when SGPR[Offset] (SOffset) is present, the immediate offset is selected before SOffset, so in that case I still had to adjust it afterwards.

@vangthao95
Contributor Author

ping

@vangthao95
Contributor Author

Hi @arsenm or @Pierre-vh ,

Can you help take a look at the latest changes?

Thanks


KnownBits SKnown = CurDAG->computeKnownBits(*SOffset);
if (ByteOffset + SKnown.getMinValue().getSExtValue() < 0)
  return subtractOffsetFromBase(&SBase, Offset);
Contributor

This is still trying to fix up illegal offsets. Can you have the pattern fail to select in that case, and let the standalone pointer operands be selected independently?

if (EncodedOffset >= 0 || IsBuffer || HasSOffset ||
    !Subtarget->hasSignedSMRDImmOffset())
  return true;
// For unbuffered smem loads, it is illegal and undefined for the Immediate
Contributor

Just say illegal, not "illegal and undefined"

Comment on lines 4228 to 4233
unsigned Opc;

if (Subtarget->hasScalarAddSub64())
Opc = AMDGPU::S_SUB_U64;
else
Opc = AMDGPU::S_SUB_U64_PSEUDO;
Contributor

Initialize Opc with ternary
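
For reference, the ternary form the comment asks for might look like this (a sketch derived only from the snippet quoted above):

```cpp
// Sketch: initialize Opc with a ternary instead of an if/else.
unsigned Opc = Subtarget->hasScalarAddSub64() ? AMDGPU::S_SUB_U64
                                              : AMDGPU::S_SUB_U64_PSEUDO;
```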

michaelmaitland and others added 10 commits April 16, 2024 21:06
…l_for specialization (llvm#85611)

This pull request implements LWG3736: move_iterator missing
disable_sized_sentinel_for specialization.
…ing::SimplifySetCC

These test cases show some miscompiles for big-endian when dealing
with non-byte-sized loads. One part of the problem is that LLVM IR
doesn't really specify where the padding goes for non-byte-sized
loads/stores, so currently TargetLowering::SimplifySetCC can't assume
anything about it. But the implementation also does not consider that
the TypeStoreSize could be larger than the TypeSize, resulting in
the offset calculation being wrong for big-endian.

Pre-commit for llvm#87646
)

The load narrowing part of TargetLowering::SimplifySetCC is updated
according to this:

1) The offset calculation (for big endian) did not work properly for
   non byte-sized types. This is basically solved by an early exit
   if the memory type isn't byte-sized. But the code is also corrected
   to use the store size when calculating the offset.
2) To still allow some optimizations for non-byte-sized types the
   TargetLowering::isPaddedAtMostSignificantBitsWhenStored hook is
   added. By default it assumes that scalar integer types are padded
   starting at the most significant bits, if the type needs padding
   when being stored to memory.
3) Allow optimizing when isPaddedAtMostSignificantBitsWhenStored is
   true, as that hook makes it possible for TargetLowering to know
   how the non byte-sized value is aligned in memory.
4) Update the algorithm to always search for a narrowed load with
   a power-of-2 byte-sized type. In the past the algorithm started
   with the width of the original load, and then divided it by
   two for each iteration. But for a type such as i48 that would
   just end up trying to narrow the load into a i24 or i12 load,
   and then we would fail sooner or later due to not finding a
   newVT that fulfilled newVT.isRound().
   With this new approach we can narrow the i48 load into either
   an i8, i16 or i32 load. By checking if such a load is allowed
(e.g. alignment wise) for any "multiple of 8 offset", then we can find
   more opportunities for the optimization to trigger. So even for a
   byte-sized type such as i32 we may now end up narrowing the load
   into loading the 16 bits starting at offset 8 (if that is allowed
   by the target). The old algorithm did not even consider that case.
5) Also start using getObjectPtrOffset instead of getMemBasePlusOffset
   when creating the new ptr. This way we get "nsw" on the add.
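
As a rough illustration of point 4 (a sketch of the search space, not code from the patch), for an i48 load the new approach would consider every power-of-2 byte-sized width at every multiple-of-8 bit offset that fits within the stored bytes:

```cpp
// Sketch: enumerate narrowed-load candidates for a load of `Bits` bits.
#include <cstdio>

int main() {
  unsigned Bits = 48;                        // e.g. an i48 load
  unsigned StoreBits = ((Bits + 7) / 8) * 8; // store size rounded up to bytes
  for (unsigned Width = 8; Width <= StoreBits; Width *= 2)
    for (unsigned Off = 0; Off + Width <= StoreBits; Off += 8)
      std::printf("candidate: i%u load at bit offset %u\n", Width, Off);
  return 0;
}
```
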
urnathan and others added 9 commits April 16, 2024 21:07
Move bitfield access clipping to bitfield access computation.
In llvm#88317, the clang resource headers was converted to an interface
library. Update LLDB and fix the Xcode standalone build. Thanks Evan for
the help!
llvm#79940 put calls to recomputeLiveIns into a loop, to repeatedly call
the function until the computation converges. However, this repeats a
lot of code. This change moves the loop into a function to simplify the
handling.

Note that this changes the order in which recomputeLiveIns is called.
For example,

```
  bool anyChange = false;
  do {
    anyChange = recomputeLiveIns(*ExitMBB) || recomputeLiveIns(*LoopMBB);
  } while (anyChange);
```

only begins to recompute the live-ins for LoopMBB after the computation
for ExitMBB has converged. With this change, the live-ins of all basic
blocks are recomputed on each loop iteration. This can result in fewer
or more calls, depending on the situation.
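
A minimal sketch of what the extracted helper might look like (the function name and signature here are illustrative assumptions, not the actual LLVM API):

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/CodeGen/LivePhysRegs.h"
#include "llvm/CodeGen/MachineBasicBlock.h"

// Recompute live-ins for all given blocks until no block changes (sketch).
static void recomputeLiveInsUntilConvergence(
    llvm::ArrayRef<llvm::MachineBasicBlock *> MBBs) {
  bool AnyChange = true;
  while (AnyChange) {
    AnyChange = false;
    for (llvm::MachineBasicBlock *MBB : MBBs)
      AnyChange |= llvm::recomputeLiveIns(*MBB);
  }
}
```
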
The test, as expected, fails with ASan on systems with 5-level page tables.
Disabling the test to allow the buildbot migration.
…ommonInvocationForModuleBuild` into its own function (llvm#88447)

The new function is about clearing out benign codegen options and can be
applied for PCH invocations as well.
Links to `llvm.mlir.global` and `llvm.mlir.addressof` in the ["Globals"
section of LLVM dialect
documentation](https://mlir.llvm.org/docs/Dialects/LLVM/#globals) are
broken.
@vangthao95
Contributor Author

Rebase got messed up. Sorry for the noise.

@vangthao95 closed this Apr 17, 2024
@Endilll removed their request for review April 17, 2024 04:29
vangthao95 added a commit to vangthao95/llvm-project that referenced this pull request Apr 18, 2024
For unbuffered smem loads, it is illegal for the immediate offset to be negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is negative.

New PR of llvm#79553.
qiaojbao pushed a commit to GPUOpen-Drivers/llvm-project that referenced this pull request Jun 5, 2024
…lim)

This is a cherry-pick of an unmerged upstream change
llvm#89165

Once the upstream change is finished and merged, this will need reverting.

Original commit message:

For unbuffered smem loads, it is illegal for the immediate offset to be negative
if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is negative.

New PR of llvm#79553.

Change-Id: I235ac5d0de5da2a1544760ab3c9749665340310d
vangthao95 added a commit that referenced this pull request Jun 24, 2024
)

For unbuffered smem loads, it is illegal for the immediate offset to be
negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is
negative.

New PR of #79553.
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jul 9, 2024
…m#89165)

For unbuffered smem loads, it is illegal for the immediate offset to be
negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is
negative.

New PR of llvm#79553.

Change-Id: Ic920b497c378cbc6a7cc2bbca300e9391893b3a0
AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this pull request Jul 9, 2024
…m#89165)

For unbuffered smem loads, it is illegal for the immediate offset to be
negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is
negative.

New PR of llvm#79553.
rocm-ci pushed a commit to ROCm/llvm-project that referenced this pull request Nov 6, 2024
…m#89165)

For unbuffered smem loads, it is illegal for the immediate offset to be
negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is
negative.

New PR of llvm#79553.

Change-Id: Ieb5681777d517fbc6ea59014af8babc9980c2b1e