[AMDGPU] Fix negative immediate offset for unbuffered smem loads #79553

Closed

Conversation

vangthao95
Contributor

For unbuffered smem loads, it is illegal and undefined for the immediate offset to be negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is negative. As a workaround for this issue, if there is no SGPR[Offset] and the immediate offset is negative, subtract the absolute value of the immediate offset from the base address, then change the immediate offset to 0.
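
For illustration, here is a minimal standalone sketch of that rewrite (the concrete values and the program itself are hypothetical, not part of the patch); the point is that folding the negative immediate into the base leaves the effective address unchanged:

```cpp
// Sketch: fold a negative immediate offset into the base address.
#include <cassert>
#include <cstdint>
#include <cstdlib>

int main() {
  uint64_t SBase = 0x1000; // hypothetical 64-bit base address (SGPR pair)
  int64_t IOffset = -16;   // negative immediate offset (not legal by itself)

  // Workaround: subtract |IOffset| from the base and clear the immediate.
  uint64_t NewBase = SBase - static_cast<uint64_t>(std::abs(IOffset));
  int64_t NewIOffset = 0;

  // The effective address is preserved.
  assert(SBase + IOffset == NewBase + NewIOffset);
  return 0;
}
```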

@llvmbot
Collaborator

llvmbot commented Jan 26, 2024

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: None (vangthao95)

Changes

For unbuffered smem loads, it is illegal and undefined for the immediate offset to be negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is negative. As a workaround for this issue, if there is no SGPR[Offset] and the immediate offset is negative, subtract the absolute value of the immediate offset from the base address, then change the immediate offset to 0.


Patch is 41.48 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/79553.diff

10 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUGISel.td (+4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (+42-7)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h (+9-5)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+34-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h (+6-2)
  • (modified) llvm/lib/Target/AMDGPU/SMInstructions.td (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-constant.mir (+202-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-smem.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll (+4-2)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-load.ll (+14-7)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
index 152f495a452ba2..0017f51ee5d925 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGISel.td
@@ -116,6 +116,10 @@ def gi_smrd_sgpr_imm :
     GIComplexOperandMatcher<s64, "selectSmrdSgprImm">,
     GIComplexPatternEquiv<SMRDSgprImm>;
 
+def gi_smrd_prefetch_imm :
+    GIComplexOperandMatcher<s64, "selectSmrdPrefetchImm">,
+    GIComplexPatternEquiv<SMRDPrefetchImm>;
+
 def gi_flat_offset :
     GIComplexOperandMatcher<s64, "selectFlatOffset">,
     GIComplexPatternEquiv<FlatOffset>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 4f7bf3f7d35e71..d57314310184e3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -2072,13 +2072,16 @@ SDValue AMDGPUDAGToDAGISel::Expand32BitAddress(SDValue Addr) const {
 // true, match only 32-bit immediate offsets available on CI.
 bool AMDGPUDAGToDAGISel::SelectSMRDBaseOffset(SDValue Addr, SDValue &SBase,
                                               SDValue *SOffset, SDValue *Offset,
-                                              bool Imm32Only,
-                                              bool IsBuffer) const {
+                                              bool Imm32Only, bool IsBuffer,
+                                              bool IsPrefetch,
+                                              bool HasSOffset) const {
   if (SOffset && Offset) {
     assert(!Imm32Only && !IsBuffer);
     SDValue B;
-    return SelectSMRDBaseOffset(Addr, B, nullptr, Offset) &&
-           SelectSMRDBaseOffset(B, SBase, SOffset, nullptr);
+    return SelectSMRDBaseOffset(Addr, B, nullptr, Offset, false, false,
+                                IsPrefetch, true) &&
+           SelectSMRDBaseOffset(B, SBase, SOffset, nullptr, false, false,
+                                IsPrefetch, true);
   }
 
   // A 32-bit (address + offset) should not cause unsigned 32-bit integer
@@ -2097,12 +2100,39 @@ bool AMDGPUDAGToDAGISel::SelectSMRDBaseOffset(SDValue Addr, SDValue &SBase,
   }
   if (!N0 || !N1)
     return false;
+
+  bool Selected = false;
   if (SelectSMRDOffset(N1, SOffset, Offset, Imm32Only, IsBuffer)) {
     SBase = N0;
-    return true;
+    Selected = true;
   }
+
   if (SelectSMRDOffset(N0, SOffset, Offset, Imm32Only, IsBuffer)) {
     SBase = N1;
+    Selected = true;
+  }
+
+  if (Selected) {
+    // For unbuffered smem loads, it is illegal and undefined for the Immediate
+    // Offset to be negative if the resulting (Offset + (M0 or SOffset or zero)
+    // is negative. Handle the case where the Immediate Offset is negative and
+    // there is no SOffset.
+    //
+    // FIXME: Also handle M0 or SOffset case?
+    if (Offset && !HasSOffset && !IsBuffer && !IsPrefetch &&
+        Subtarget->getGeneration() >= AMDGPUSubtarget::GFX11) {
+      if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(*Offset)) {
+        if (C->getSExtValue() < 0) {
+          SDLoc SL(SBase);
+          *Offset = CurDAG->getTargetConstant(std::abs(C->getSExtValue()), SL,
+                                              MVT::i32);
+          const SDValue Ops[] = {SBase, *Offset};
+          SBase = SDValue(
+              CurDAG->getMachineNode(AMDGPU::S_SUB_U64, SL, MVT::i64, Ops), 0);
+          *Offset = CurDAG->getTargetConstant(0, SL, MVT::i32);
+        }
+      }
+    }
     return true;
   }
   return false;
@@ -2110,8 +2140,8 @@ bool AMDGPUDAGToDAGISel::SelectSMRDBaseOffset(SDValue Addr, SDValue &SBase,
 
 bool AMDGPUDAGToDAGISel::SelectSMRD(SDValue Addr, SDValue &SBase,
                                     SDValue *SOffset, SDValue *Offset,
-                                    bool Imm32Only) const {
-  if (SelectSMRDBaseOffset(Addr, SBase, SOffset, Offset, Imm32Only)) {
+                                    bool Imm32Only, bool IsPrefetch) const {
+  if (SelectSMRDBaseOffset(Addr, SBase, SOffset, Offset, Imm32Only, IsPrefetch)) {
     SBase = Expand32BitAddress(SBase);
     return true;
   }
@@ -2170,6 +2200,11 @@ bool AMDGPUDAGToDAGISel::SelectSMRDBufferSgprImm(SDValue N, SDValue &SOffset,
                               /* IsBuffer */ true);
 }
 
+bool AMDGPUDAGToDAGISel::SelectSMRDPrefetchImm(SDValue Addr, SDValue &SBase,
+                                       SDValue &Offset) const {
+  return SelectSMRD(Addr, SBase, /* SOffset */ nullptr, &Offset, false, true);
+}
+
 bool AMDGPUDAGToDAGISel::SelectMOVRELOffset(SDValue Index,
                                             SDValue &Base,
                                             SDValue &Offset) const {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
index 3b42d88df0c246..5328ba985474dd 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
@@ -194,11 +194,13 @@ class AMDGPUDAGToDAGISel : public SelectionDAGISel {
                         SDValue *Offset, bool Imm32Only = false,
                         bool IsBuffer = false) const;
   SDValue Expand32BitAddress(SDValue Addr) const;
-  bool SelectSMRDBaseOffset(SDValue Addr, SDValue &SBase, SDValue *SOffset,
-                            SDValue *Offset, bool Imm32Only = false,
-                            bool IsBuffer = false) const;
-  bool SelectSMRD(SDValue Addr, SDValue &SBase, SDValue *SOffset,
-                  SDValue *Offset, bool Imm32Only = false) const;
+  bool SelectSMRDBaseOffset(SDValue Addr, SDValue & SBase, SDValue * SOffset,
+                            SDValue * Offset, bool Imm32Only = false,
+                            bool IsBuffer = false, bool IsPrefetch = false,
+                            bool HasSOffset = false) const;
+  bool SelectSMRD(SDValue Addr, SDValue & SBase, SDValue * SOffset,
+                  SDValue * Offset, bool Imm32Only = false,
+                  bool IsPrefetch = false) const;
   bool SelectSMRDImm(SDValue Addr, SDValue &SBase, SDValue &Offset) const;
   bool SelectSMRDImm32(SDValue Addr, SDValue &SBase, SDValue &Offset) const;
   bool SelectSMRDSgpr(SDValue Addr, SDValue &SBase, SDValue &SOffset) const;
@@ -208,6 +210,8 @@ class AMDGPUDAGToDAGISel : public SelectionDAGISel {
   bool SelectSMRDBufferImm32(SDValue N, SDValue &Offset) const;
   bool SelectSMRDBufferSgprImm(SDValue N, SDValue &SOffset,
                                SDValue &Offset) const;
+  bool SelectSMRDPrefetchImm(SDValue Addr, SDValue & SBase, SDValue & Offset)
+      const;
   bool SelectMOVRELOffset(SDValue Index, SDValue &Base, SDValue &Offset) const;
 
   bool SelectVOP3ModsImpl(SDValue In, SDValue &Src, unsigned &SrcMods,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index f255d098b631c7..c7cc701c63dc54 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -4221,7 +4221,8 @@ AMDGPUInstructionSelector::selectVINTERPModsHi(MachineOperand &Root) const {
 bool AMDGPUInstructionSelector::selectSmrdOffset(MachineOperand &Root,
                                                  Register &Base,
                                                  Register *SOffset,
-                                                 int64_t *Offset) const {
+                                                 int64_t *Offset,
+                                                 bool IsPrefetch) const {
   MachineInstr *MI = Root.getParent();
   MachineBasicBlock *MBB = MI->getParent();
 
@@ -4257,6 +4258,27 @@ bool AMDGPUInstructionSelector::selectSmrdOffset(MachineOperand &Root,
   if (Offset && GEPI.SgprParts.size() == 1 && EncodedImm) {
     Base = GEPI.SgprParts[0];
     *Offset = *EncodedImm;
+    // For unbuffered smem loads, it is illegal and undefined for the Immediate
+    // Offset to be negative if the resulting (Offset + (M0 or SOffset or zero)
+    // is negative. Handle the case where the Immediate Offset is negative and
+    // there is no SOffset.
+    //
+    // FIXME: Also handle M0 or SOffset case?
+    if (!IsPrefetch && *Offset < 0 &&
+        STI.getGeneration() >= AMDGPUSubtarget::GFX11) {
+      // Subtract the absolute value of the offset from the base register and
+      // set the immediate offset to 0.
+      Register SubtractReg =
+          MRI->createVirtualRegister(&AMDGPU::SReg_64RegClass);
+
+      BuildMI(*MBB, MI, MI->getDebugLoc(), TII.get(AMDGPU::S_SUB_U64),
+              SubtractReg)
+          .addReg(Base)
+          .addImm(std::abs(*Offset));
+      Base = SubtractReg;
+      *Offset = 0;
+    }
+
     return true;
   }
 
@@ -4339,6 +4361,17 @@ AMDGPUInstructionSelector::selectSmrdSgprImm(MachineOperand &Root) const {
            [=](MachineInstrBuilder &MIB) { MIB.addImm(Offset); }}};
 }
 
+InstructionSelector::ComplexRendererFns
+AMDGPUInstructionSelector::selectSmrdPrefetchImm(MachineOperand &Root) const {
+  Register Base;
+  int64_t Offset;
+  if (!selectSmrdOffset(Root, Base, /* SOffset= */ nullptr, &Offset, true))
+    return std::nullopt;
+
+  return {{[=](MachineInstrBuilder &MIB) { MIB.addReg(Base); },
+           [=](MachineInstrBuilder &MIB) { MIB.addImm(Offset); }}};
+}
+
 std::pair<Register, int>
 AMDGPUInstructionSelector::selectFlatOffsetImpl(MachineOperand &Root,
                                                 uint64_t FlatVariant) const {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index ef7630f137aca6..573bc9260c765b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -220,8 +220,10 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
   InstructionSelector::ComplexRendererFns
   selectVINTERPModsHi(MachineOperand &Root) const;
 
-  bool selectSmrdOffset(MachineOperand &Root, Register &Base, Register *SOffset,
-                        int64_t *Offset) const;
+  bool selectSmrdOffset(MachineOperand & Root, Register & Base,
+                        Register * SOffset, int64_t * Offset,
+                        bool IsPrefetch = false) const;
+
   InstructionSelector::ComplexRendererFns
   selectSmrdImm(MachineOperand &Root) const;
   InstructionSelector::ComplexRendererFns
@@ -230,6 +232,8 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
   selectSmrdSgpr(MachineOperand &Root) const;
   InstructionSelector::ComplexRendererFns
   selectSmrdSgprImm(MachineOperand &Root) const;
+  InstructionSelector::ComplexRendererFns
+  selectSmrdPrefetchImm(MachineOperand &Root) const;
 
   std::pair<Register, int> selectFlatOffsetImpl(MachineOperand &Root,
                                                 uint64_t FlatVariant) const;
diff --git a/llvm/lib/Target/AMDGPU/SMInstructions.td b/llvm/lib/Target/AMDGPU/SMInstructions.td
index f082de35b6ae9a..8c9e3fa5b9b738 100644
--- a/llvm/lib/Target/AMDGPU/SMInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SMInstructions.td
@@ -859,6 +859,7 @@ def SMRDSgprImm     : ComplexPattern<iPTR, 3, "SelectSMRDSgprImm">;
 def SMRDBufferImm   : ComplexPattern<iPTR, 1, "SelectSMRDBufferImm">;
 def SMRDBufferImm32 : ComplexPattern<iPTR, 1, "SelectSMRDBufferImm32">;
 def SMRDBufferSgprImm : ComplexPattern<iPTR, 2, "SelectSMRDBufferSgprImm">;
+def SMRDPrefetchImm : ComplexPattern<iPTR, 2, "SelectSMRDPrefetchImm">;
 
 multiclass SMRD_Pattern <string Instr, ValueType vt, bit immci = true> {
 
@@ -1080,7 +1081,7 @@ def i32imm_one : TImmLeaf <i32, [{
 
 multiclass SMPrefetchPat<string type, TImmLeaf cache_type> {
   def : GCNPat <
-    (smrd_prefetch (SMRDImm i64:$sbase, i32:$offset), timm, timm, cache_type),
+    (smrd_prefetch (SMRDPrefetchImm i64:$sbase, i32:$offset), timm, timm, cache_type),
     (!cast<SM_Prefetch_Pseudo>("S_PREFETCH_"#type) $sbase, $offset, (i32 SGPR_NULL), (i8 0))
   >;
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-constant.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-constant.mir
index c44477273dad09..b7010e4c65beb9 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-constant.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-constant.mir
@@ -3,7 +3,7 @@
 # RUN: llc -amdgpu-global-isel-new-legality -mtriple=amdgcn -mcpu=hawaii -run-pass=instruction-select -verify-machineinstrs  -global-isel-abort=0 -o - %s | FileCheck -check-prefix=GFX7 %s
 # RUN: llc -amdgpu-global-isel-new-legality -mtriple=amdgcn -mcpu=fiji -run-pass=instruction-select -verify-machineinstrs -global-isel-abort=0 -o - %s | FileCheck -check-prefix=GFX8 %s
 # RUN: llc -amdgpu-global-isel-new-legality -mtriple=amdgcn -mcpu=gfx1010 -run-pass=instruction-select -verify-machineinstrs -global-isel-abort=0 -o - %s | FileCheck -check-prefix=GFX10 %s
-# RUN: llc -amdgpu-global-isel-new-legality -mtriple=amdgcn -mcpu=gfx1100 -run-pass=instruction-select -verify-machineinstrs -global-isel-abort=0 -o - %s | FileCheck -check-prefix=GFX10 %s
+# RUN: llc -amdgpu-global-isel-new-legality -mtriple=amdgcn -mcpu=gfx1100 -run-pass=instruction-select -verify-machineinstrs -global-isel-abort=0 -o - %s | FileCheck -check-prefix=GFX11 %s
 
 ---
 
@@ -44,6 +44,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORD_IMM:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 0, 0 :: (load (s32), addrspace 4)
     ; GFX10-NEXT: $sgpr0 = COPY [[S_LOAD_DWORD_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_s32_from_4
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORD_IMM:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 0, 0 :: (load (s32), addrspace 4)
+    ; GFX11-NEXT: $sgpr0 = COPY [[S_LOAD_DWORD_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(s32) = G_LOAD %0 :: (load (s32), align 4, addrspace 4)
     $sgpr0 = COPY %1
@@ -89,6 +96,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORD_IMM:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 0, 0 :: (load (<2 x s16>), addrspace 4)
     ; GFX10-NEXT: $sgpr0 = COPY [[S_LOAD_DWORD_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_v2s16_from_4
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORD_IMM:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 0, 0 :: (load (<2 x s16>), addrspace 4)
+    ; GFX11-NEXT: $sgpr0 = COPY [[S_LOAD_DWORD_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(<2 x s16>) = G_LOAD %0 :: (load (<2 x s16>), align 4, addrspace 4)
     $sgpr0 = COPY %1
@@ -133,6 +147,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (<2 x s32>), addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_v2s32
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (<2 x s32>), addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(<2 x s32>) = G_LOAD %0 :: (load (<2 x s32>), align 8, addrspace 4)
     $sgpr0_sgpr1 = COPY %1
@@ -176,6 +197,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (<2 x s32>), align 4, addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_v2s32_align4
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (<2 x s32>), align 4, addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(<2 x s32>) = G_LOAD %0 :: (load (<2 x s32>), align 4, addrspace 4)
     $sgpr0_sgpr1 = COPY %1
@@ -219,6 +247,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (<4 x s16>), align 4, addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_v4s16_align4
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (<4 x s16>), align 4, addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(<4 x s16>) = G_LOAD %0 :: (load (<4 x s16>), align 4, addrspace 4)
     $sgpr0_sgpr1 = COPY %1
@@ -263,6 +298,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX4_IMM:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[COPY]], 0, 0 :: (load (<4 x s32>), align 4, addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1_sgpr2_sgpr3 = COPY [[S_LOAD_DWORDX4_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_v4s32_align4
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX4_IMM:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[COPY]], 0, 0 :: (load (<4 x s32>), align 4, addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1_sgpr2_sgpr3 = COPY [[S_LOAD_DWORDX4_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(<4 x  s32>) = G_LOAD %0 :: (load (<4 x s32>), align 4, addrspace 4)
     $sgpr0_sgpr1_sgpr2_sgpr3 = COPY %1
@@ -307,6 +349,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (s64), addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_s64
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (s64), addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(s64) = G_LOAD %0 :: (load (s64), align 8, addrspace 4)
     $sgpr0_sgpr1 = COPY %1
@@ -351,6 +400,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (s64), align 4, addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_s64_align4
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]], 0, 0 :: (load (s64), align 4, addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1 = COPY [[S_LOAD_DWORDX2_IMM]]
     %0:sgpr(p4) = COPY $sgpr0_sgpr1
     %1:sgpr(s64) = G_LOAD %0 :: (load (s64), align 4, addrspace 4)
     $sgpr0_sgpr1 = COPY %1
@@ -395,6 +451,13 @@ body: |
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[S_LOAD_DWORDX4_IMM:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[COPY]], 0, 0 :: (load (<2 x s64>), align 4, addrspace 4)
     ; GFX10-NEXT: $sgpr0_sgpr1_sgpr2_sgpr3 = COPY [[S_LOAD_DWORDX4_IMM]]
+    ;
+    ; GFX11-LABEL: name: load_constant_v2s64
+    ; GFX11: liveins: $sgpr0_sgpr1
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr0_sgpr1
+    ; GFX11-NEXT: [[S_LOAD_DWORDX4_IMM:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[COPY]], 0, 0 :: (load (<2 x s64>), align 4, addrspace 4)
+    ; GFX11-NEXT: $sgpr0_sgpr1_sgpr2_sgpr3 = COPY [[S_LOAD_...
[truncated]


github-actions bot commented Jan 26, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@vangthao95 force-pushed the vthao-fix-neg-imm-offset-smem-loads branch from 0880df2 to 8194ef1 on February 2, 2024 23:32
@vangthao95
Contributor Author

ping

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (outdated review thread, resolved)
Comment on lines 119 to 121
def gi_smrd_prefetch_imm :
GIComplexOperandMatcher<s64, "selectSmrdPrefetchImm">,
GIComplexPatternEquiv<SMRDPrefetchImm>;
Contributor

This looks like new GlobalISel support for prefetch; this should be done separately.

Contributor Author

This is splitting the handling of prefetch from the general handling of SMRD selection. There wasn't a distinction between prefetch and s_load instructions, so I had to create a way to distinguish between the two. It should be an NFC change. Should this still be done separately?

Contributor

Yes

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (outdated review thread, resolved)
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (outdated review thread, resolved)
@vangthao95
Contributor Author

Hi @jayfoad @Pierre-vh @rampitec,

Matt is on vacation this week; may I get some help reviewing this patch?

Thanks

@vangthao95
Contributor Author

ping

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (outdated review thread, resolved)
Selected = true;
}

if (Selected) {
Contributor

This is confusing: you're allowing the invalid selection and then later trying to adjust the offset. It would be clearer to adjust the initial selection to reject cases where the immediate cannot be used.

Contributor Author

I moved this code inside SelectSMRDOffset(), where the offset is selected, to make it clearer. However, when SGPR[Offset] (SOffset) is present, the immediate offset is selected before SOffset, so in that case I still had to adjust it afterwards.

@vangthao95
Contributor Author

ping

@vangthao95
Contributor Author

Hi @arsenm or @Pierre-vh ,

Can you help take a look at the latest changes?

Thanks


KnownBits SKnown = CurDAG->computeKnownBits(*SOffset);
if (ByteOffset + SKnown.getMinValue().getSExtValue() < 0)
  return subtractOffsetFromBase(&SBase, Offset);
Contributor

This is still trying to fix up illegal offsets. Can you have the pattern fail to select in that case, and let the standalone pointer operands be selected independently?

if (EncodedOffset >= 0 || IsBuffer || HasSOffset ||
    !Subtarget->hasSignedSMRDImmOffset())
  return true;
// For unbuffered smem loads, it is illegal and undefined for the Immediate
Contributor

Just say illegal, not "illegal and undefined"

Comment on lines 4228 to 4233
unsigned Opc;

if (Subtarget->hasScalarAddSub64())
Opc = AMDGPU::S_SUB_U64;
else
Opc = AMDGPU::S_SUB_U64_PSEUDO;
Contributor

Initialize Opc with ternary
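
For reference, the ternary form the comment asks for might look like this (a sketch derived only from the snippet quoted above):

```cpp
// Sketch: initialize Opc with a ternary instead of an if/else.
unsigned Opc = Subtarget->hasScalarAddSub64() ? AMDGPU::S_SUB_U64
                                              : AMDGPU::S_SUB_U64_PSEUDO;
```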

michaelmaitland and others added 10 commits April 16, 2024 21:06
…l_for specialization (llvm#85611)

This pull request implements LWG3736: move_iterator missing
disable_sized_sentinel_for specialization.
…ing::SimplifySetCC

These test cases show some miscompiles for big-endian when dealing
with non-byte-sized loads. One part of the problem is that LLVM IR
doesn't really specify where the padding goes for non-byte-sized
loads/stores, so currently TargetLowering::SimplifySetCC can't assume
anything about it. But the implementation also does not consider that
the TypeStoreSize could be larger than the TypeSize, resulting in
the offset calculation being wrong for big-endian.

Pre-commit for llvm#87646
)

The load narrowing part of TargetLowering::SimplifySetCC is updated
according to this:

1) The offset calculation (for big endian) did not work properly for
   non byte-sized types. This is basically solved by an early exit
   if the memory type isn't byte-sized. But the code is also corrected
   to use the store size when calculating the offset.
2) To still allow some optimizations for non-byte-sized types the
   TargetLowering::isPaddedAtMostSignificantBitsWhenStored hook is
   added. By default it assumes that scalar integer types are padded
   starting at the most significant bits, if the type needs padding
   when being stored to memory.
3) Allow optimizing when isPaddedAtMostSignificantBitsWhenStored is
   true, as that hook makes it possible for TargetLowering to know
   how the non byte-sized value is aligned in memory.
4) Update the algorithm to always search for a narrowed load with
   a power-of-2 byte-sized type. In the past the algorithm started
   with the width of the original load, and then divided it by
   two for each iteration. But for a type such as i48 that would
   just end up trying to narrow the load into a i24 or i12 load,
   and then we would fail sooner or later due to not finding a
   newVT that fulfilled newVT.isRound().
   With this new approach we can narrow the i48 load into either
   an i8, i16 or i32 load. By checking if such a load is allowed
(e.g. alignment wise) for any "multiple of 8 offset", then we can find
   more opportunities for the optimization to trigger. So even for a
   byte-sized type such as i32 we may now end up narrowing the load
   into loading the 16 bits starting at offset 8 (if that is allowed
   by the target). The old algorithm did not even consider that case.
5) Also start using getObjectPtrOffset instead of getMemBasePlusOffset
   when creating the new ptr. This way we get "nsw" on the add.
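
As a rough illustration of point 4 (a sketch of the search space, not code from the patch), for an i48 load the new approach would consider every power-of-2 byte-sized width at every multiple-of-8 bit offset that fits within the stored bytes:

```cpp
// Sketch: enumerate narrowed-load candidates for a load of `Bits` bits.
#include <cstdio>

int main() {
  unsigned Bits = 48;                        // e.g. an i48 load
  unsigned StoreBits = ((Bits + 7) / 8) * 8; // store size rounded up to bytes
  for (unsigned Width = 8; Width <= StoreBits; Width *= 2)
    for (unsigned Off = 0; Off + Width <= StoreBits; Off += 8)
      std::printf("candidate: i%u load at bit offset %u\n", Width, Off);
  return 0;
}
```
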
urnathan and others added 9 commits April 16, 2024 21:07
Move bitfield access clipping to bitfield access computation.
In llvm#88317, the clang resource headers was converted to an interface
library. Update LLDB and fix the Xcode standalone build. Thanks Evan for
the help!
llvm#79940 put calls to recomputeLiveIns into a loop, to repeatedly call
the function until the computation converges. However, this repeats a
lot of code. This change moves the loop into a function to simplify the
handling.

Note that this changes the order in which recomputeLiveIns is called.
For example,

```
  bool anyChange = false;
  do {
    anyChange = recomputeLiveIns(*ExitMBB) || recomputeLiveIns(*LoopMBB);
  } while (anyChange);
```

only begins to recompute the live-ins for LoopMBB after the computation
for ExitMBB has converged. With this change, the live-ins of all basic
blocks are recomputed on each loop iteration. This can result in fewer
or more calls, depending on the situation.
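
A minimal sketch of what the extracted helper might look like (the function name and signature here are illustrative assumptions, not the actual LLVM API):

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/CodeGen/LivePhysRegs.h"
#include "llvm/CodeGen/MachineBasicBlock.h"

// Recompute live-ins for all given blocks until no block changes (sketch).
static void recomputeLiveInsUntilConvergence(
    llvm::ArrayRef<llvm::MachineBasicBlock *> MBBs) {
  bool AnyChange = true;
  while (AnyChange) {
    AnyChange = false;
    for (llvm::MachineBasicBlock *MBB : MBBs)
      AnyChange |= llvm::recomputeLiveIns(*MBB);
  }
}
```
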
The test, as expected, fails with ASan on systems with 5-level page tables.
Disabling the test to allow the buildbot migration.
…ommonInvocationForModuleBuild` into its own function (llvm#88447)

The new function is about clearing out benign codegen options and can be
applied for PCH invocations as well.
Links to `llvm.mlir.global` and `llvm.mlir.addressof` in the ["Globals"
section of LLVM dialect
documentation](https://mlir.llvm.org/docs/Dialects/LLVM/#globals) are
broken.
@vangthao95
Contributor Author

Rebase got messed up. Sorry for the noise.

@vangthao95 closed this Apr 17, 2024
@Endilll removed their request for review April 17, 2024 04:29
vangthao95 added a commit to vangthao95/llvm-project that referenced this pull request Apr 18, 2024
For unbuffered smem loads, it is illegal for the immediate offset to be negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is negative.

New PR of llvm#79553.
qiaojbao pushed a commit to GPUOpen-Drivers/llvm-project that referenced this pull request Jun 5, 2024
…lim)

This is a cherry-pick of an unmerged upstream change
llvm#89165

Once the upstream change is finished and merged, this will need reverting.

Original commit message:

For unbuffered smem loads, it is illegal for the immediate offset to be negative
if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is negative.

New PR of llvm#79553.

Change-Id: I235ac5d0de5da2a1544760ab3c9749665340310d
vangthao95 added a commit that referenced this pull request Jun 24, 2024
)

For unbuffered smem loads, it is illegal for the immediate offset to be
negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is
negative.

New PR of #79553.
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jul 9, 2024
…m#89165)

For unbuffered smem loads, it is illegal for the immediate offset to be
negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is
negative.

New PR of llvm#79553.

Change-Id: Ic920b497c378cbc6a7cc2bbca300e9391893b3a0
AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this pull request Jul 9, 2024
…m#89165)

For unbuffered smem loads, it is illegal for the immediate offset to be
negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is
negative.

New PR of llvm#79553.
rocm-ci pushed a commit to ROCm/llvm-project that referenced this pull request Nov 6, 2024
…m#89165)

For unbuffered smem loads, it is illegal for the immediate offset to be
negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is
negative.

New PR of llvm#79553.

Change-Id: Ieb5681777d517fbc6ea59014af8babc9980c2b1e