[AMDGPU] Change CF intrinsics lowering to reconverge on predecessors. #92809

Closed
wants to merge 1 commit

Conversation

alex-t
Contributor

alex-t commented May 20, 2024

We currently lower SI_IF/ELSE, SI_LOOP, and SI_END_CF so that the wave reconverges at the beginning of the CF join basic block or on the loop exit block. This leads to numerous issues with the spill/split insertion points: LLVM's core spilling and live-range-splitting utilities consider the start of a block the best point to reload spilled registers, so the vector reloads are incorrectly masked out because exec has not yet been restored at that point. A similar issue arises when SplitKit splits a live interval at the CF join block: the spills are inserted before the exec mask is restored.

Change-Id: I8609c5abae7cd9307ffc4f6ace5011be860998e8
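For illustration, here is a minimal, hypothetical wave64 sketch (not taken from the patch or its tests; the function name and values are made up, and the exact IR that SIAnnotateControlFlow emits may differ) of annotated IR using the new llvm.amdgcn.wave.reconverge intrinsic. The reconvergence call sits in the predecessor of the join block, so exec is already restored by the time the join block (and any reloads the register allocator places at its start) executes:

```llvm
define amdgpu_kernel void @reconverge_example(i32 %x, ptr addrspace(1) %out) {
entry:
  %cond = icmp eq i32 %x, 0
  %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond)
  %guard = extractvalue { i1, i64 } %if, 0
  %mask = extractvalue { i1, i64 } %if, 1
  br i1 %guard, label %then, label %join

then:                                       ; divergent "then" path
  store i32 1, ptr addrspace(1) %out
  ; restore exec in the predecessor, before branching to the join block
  call void @llvm.amdgcn.wave.reconverge.i64(i64 %mask)
  br label %join

join:                                       ; exec is already restored here
  ret void
}

declare { i1, i64 } @llvm.amdgcn.if.i64(i1)
declare void @llvm.amdgcn.wave.reconverge.i64(i64)
```

Under the old scheme, the exec restore was expressed as a call to llvm.amdgcn.end.cf placed at the top of %join, i.e. after any spill reloads inserted at the start of that block.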
alex-t requested a review from arsenm May 20, 2024 19:08
llvmbot added the clang, backend:AMDGPU, llvm:globalisel, llvm:ir and llvm:analysis labels May 20, 2024
@llvmbot
Collaborator

llvmbot commented May 20, 2024

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-clang

Author: None (alex-t)

Changes

We currently lower SI_IF/ELSE, SI_LOOP, and SI_END_CF so that the wave reconverges at the beginning of the CF join basic block or on the loop exit block. This leads to numerous issues with the spill/split insertion points: LLVM's core spilling and live-range-splitting utilities consider the start of a block the best point to reload spilled registers, so the vector reloads are incorrectly masked out because exec has not yet been restored at that point. A similar issue arises when SplitKit splits a live interval at the CF join block: the spills are inserted before the exec mask is restored.


Patch is 5.20 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/92809.diff

327 Files Affected:

  • (modified) clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu (+1-1)
  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+5-4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h (+1-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+8-11)
  • (modified) llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp (+51-41)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+30-4)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+39-4)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+8-4)
  • (modified) llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp (+194-350)
  • (modified) llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp (+3-1)
  • (modified) llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp (+1-1)
  • (added) llvm/test/%t (+1)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/hidden-diverge-gmir.mir (+2-2)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/temporal-divergence.mir (-21)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir (+1-3)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/deprecated/hidden-diverge.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomic_optimizations_mul_one.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+14-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.mir (+3-11)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+105-74)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.mir (+37-61)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+98-62)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.mir (+46-70)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+17-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.mir (+6-18)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll (+4-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.mir (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll (+24-14)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll (+88-56)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-no-rtn.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll (+35-41)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-atomicrmw.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-function-args.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/is-safe-to-sink-bug.ll (+9-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.end.cf.i32.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.end.cf.i64.ll (+5-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+72-68)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.make.buffer.rsrc.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.add.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.cmpswap.ll (+16-40)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.format.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.ll (+24-60)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.format.f16.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.format.f32.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.ll (+24-60)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.add.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.cmpswap.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.format.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.format.f16.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.format.f32.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.load.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.f16.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.i8.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.ll (+10-25)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.load.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.load.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.f16.ll (+18-45)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.i8.ll (+18-45)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.ll (+20-50)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.s.buffer.load.ll (+96-240)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.add.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.cmpswap.ll (+16-40)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f32.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.add.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.cmpswap.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f32.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll (+214-170)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memmove.ll (+17-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll (+23-22)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+22-13)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll (+25-17)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn-s-buffer-load.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.image.load.1d.ll (+8-16)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.image.sample.1d.ll (+12-24)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.raw.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.raw.ptr.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.s.buffer.load.ll (+48-72)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.buffer.store.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.ptr.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.ptr.buffer.store.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-waterfall-agpr.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll (+262-257)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+225-218)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+80-73)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+76-69)
  • (modified) llvm/test/CodeGen/AMDGPU/amdpal-callable.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+23-13)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (+318-228)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+700-548)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+771-546)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (+210-155)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (+276-192)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (+276-192)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+117-94)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-nand.ll (+12-9)
  • (modified) llvm/test/CodeGen/AMDGPU/atomics-cas-remarks-gfx90a.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/bb-prolog-spill-during-regalloc.ll (+24-20)
  • (modified) llvm/test/CodeGen/AMDGPU/block-should-not-be-in-alive-blocks.mir (+13-14)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-condition-and.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+414-371)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation-gfx10-branch-offset-bug.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+50-35)
  • (modified) llvm/test/CodeGen/AMDGPU/bug-sdag-emitcopyfromreg.ll (+6-65)
  • (modified) llvm/test/CodeGen/AMDGPU/bypass-div.ll (+60-36)
  • (modified) llvm/test/CodeGen/AMDGPU/byval-frame-setup.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/call-skip.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-flat.ll (+140-92)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx1030.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx908.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll (+32-26)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.ll (+483-450)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.mir (+303-198)
  • (modified) llvm/test/CodeGen/AMDGPU/constant-fold-imm-immreg.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/control-flow-fastregalloc.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/control-flow-optnone.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/convergence-tokens.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/convergent-inlineasm.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/cse-convergent.ll (+14-13)
  • (modified) llvm/test/CodeGen/AMDGPU/cse-phi-incoming-val.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+12-9)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-v1i8-extractvecelt-crash.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+376-392)
  • (modified) llvm/test/CodeGen/AMDGPU/div_v2i128.ll (+503-415)
  • (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+17-8)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine_gfx11.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/early-tailduplicator-terminator.mir (+7-9)
  • (modified) llvm/test/CodeGen/AMDGPU/else.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/endcf-loop-header.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-frame-ptr-reg-copy-livein.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomic-fadd.v2f16.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+732-534)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+732-534)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.f16.ll (+22-15)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.new.ll (+12-8)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-fabs.ll (+26-11)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+88-56)
  • (modified) llvm/test/CodeGen/AMDGPU/fptoi.i128.ll (+482-348)
  • (modified) llvm/test/CodeGen/AMDGPU/frame-index-elimination.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+13-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-no-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomics-fp-wrong-subtarget.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomics-fp.ll (+440-304)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+416-272)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+732-534)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+732-534)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+1601-969)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+1198-728)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+1198-728)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+1501-907)
  • (modified) llvm/test/CodeGen/AMDGPU/hoist-cond.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/hsa.ll (-1)
  • (modified) llvm/test/CodeGen/AMDGPU/i1-copy-from-loop.ll (+16-12)
  • (modified) llvm/test/CodeGen/AMDGPU/i1-copy-phi.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/i1_copy_phi_with_phi_incoming_value.mir (+8-19)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+237-297)
  • (modified) llvm/test/CodeGen/AMDGPU/image-sample-waterfall.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-call.ll (+144-174)
  • (modified) llvm/test/CodeGen/AMDGPU/infinite-loop.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/inline-asm.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+20-10)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+136-96)
  • (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.bf.ll (+74-40)
  • (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.ll (+464-260)
  • (modified) llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll (+51-39)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll (+110-66)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.format.f16.ll (+10-15)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.format.ll (+12-18)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.ll (+70-105)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f16.ll (+22-33)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f32.ll (+24-36)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.ll (+66-99)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.f16.ll (+10-15)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.ll (+12-18)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.ll (+70-105)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f16.ll (+22-33)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f32.ll (+24-36)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.ll (+66-99)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.f16.ll (+16-24)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.ll (+18-27)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.f16.ll (+20-30)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.ll (+50-75)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.load.f16.ll (+16-24)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.load.ll (+18-27)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.f16.ll (+20-30)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.ll (+50-75)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-soffset-mbuf.ll (+64-96)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.div.fmas.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.ordered.swap.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.exp.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i32.ll (+16-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll (+16-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ps.live.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (+158-108)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (+158-108)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sendmsg.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll (+20-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.format.v3f16.ll (+22-20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.load.format.v3f16.ll (+17-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll (+204-168)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomics-fp.ll (+462-350)
  • (modified) llvm/test/CodeGen/AMDGPU/long-branch-reserve-register.ll (+11-9)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-live-out-copy-undef-subrange.ll (+4-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-on-function-argument.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_break.ll (+25-17)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_exit_with_xor.ll (+22-16)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-intervals.mir (+59-47)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-variables-update.mir (+96-101)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-variables-update.xfail.mir (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-other-terminators.mir (+33-39)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-i1-copies-clear-kills.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir (+15-14)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-lane-mask.mir (+2-6)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll (+24-16)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir (+2-6)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+226-181)
diff --git a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
index 946927d88a1ee..3ca766755a631 100644
--- a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
+++ b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
@@ -10,7 +10,7 @@
 // GFX90A-CAS: A compare and swap loop was generated for an atomic fadd operation at system memory scope
 // GFX90A-CAS-LABEL: _Z14atomic_add_casPf
 // GFX90A-CAS:  flat_atomic_cmpswap
-// GFX90A-CAS:  s_cbranch_execnz
+// GFX90A-CAS:  s_cbranch_scc1
 __device__ float atomic_add_cas(float *p) {
   return __atomic_fetch_add(p, 1.0f, memory_order_relaxed);
 }
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index be8048ca2459c..75ad7ed5e3fa2 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3172,8 +3172,8 @@ def int_amdgcn_loop : Intrinsic<[llvm_i1_ty],
   [llvm_anyint_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
 >;
 
-def int_amdgcn_end_cf : Intrinsic<[], [llvm_anyint_ty],
-  [IntrWillReturn, IntrNoCallback, IntrNoFree]>;
+def int_amdgcn_wave_reconverge : Intrinsic<[], [llvm_anyint_ty],
+  [IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
 
 // Represent unreachable in a divergent region.
 def int_amdgcn_unreachable : Intrinsic<[], [], [IntrConvergent, IntrNoCallback, IntrNoFree]>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index b48a09489653a..9374933986080 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1553,11 +1553,12 @@ bool AMDGPUInstructionSelector::selectReturnAddress(MachineInstr &I) const {
   return true;
 }
 
-bool AMDGPUInstructionSelector::selectEndCfIntrinsic(MachineInstr &MI) const {
+bool AMDGPUInstructionSelector::selectWaveReconvergeIntrinsic(
+    MachineInstr &MI) const {
   // FIXME: Manually selecting to avoid dealing with the SReg_1 trick
   // SelectionDAG uses for wave32 vs wave64.
   MachineBasicBlock *BB = MI.getParent();
-  BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_END_CF))
+  BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_WAVE_RECONVERGE))
       .add(MI.getOperand(1));
 
   Register Reg = MI.getOperand(1).getReg();
@@ -2083,8 +2084,8 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
     MachineInstr &I) const {
   unsigned IntrinsicID = cast<GIntrinsic>(I).getIntrinsicID();
   switch (IntrinsicID) {
-  case Intrinsic::amdgcn_end_cf:
-    return selectEndCfIntrinsic(I);
+  case Intrinsic::amdgcn_wave_reconverge:
+    return selectWaveReconvergeIntrinsic(I);
   case Intrinsic::amdgcn_ds_ordered_add:
   case Intrinsic::amdgcn_ds_ordered_swap:
     return selectDSOrderedIntrinsic(I, IntrinsicID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index f561d5d29efc4..44c89684893f7 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -119,7 +119,7 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
   bool selectReturnAddress(MachineInstr &I) const;
   bool selectG_INTRINSIC(MachineInstr &I) const;
 
-  bool selectEndCfIntrinsic(MachineInstr &MI) const;
+  bool selectWaveReconvergeIntrinsic(MachineInstr &MI) const;
   bool selectDSOrderedIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
   bool selectDSGWSIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
   bool selectDSAppendConsume(MachineInstr &MI, bool IsAppend) const;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 56345d14a331c..368cc98b9a585 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -785,8 +785,6 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
   const TargetRegisterClass *WaveRC = TRI->getWaveMaskRegClass();
   const unsigned MovExecOpc =
       Subtarget.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
-  const unsigned MovExecTermOpc =
-      Subtarget.isWave32() ? AMDGPU::S_MOV_B32_term : AMDGPU::S_MOV_B64_term;
 
   const unsigned XorTermOpc = Subtarget.isWave32() ?
     AMDGPU::S_XOR_B32_term : AMDGPU::S_XOR_B64_term;
@@ -949,9 +947,11 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
 
   B.setInsertPt(*BodyBB, BodyBB->end());
 
+  Register LoopMask = MRI.createVirtualRegister(
+      TRI->getRegClass(AMDGPU::SReg_1_XEXECRegClassID));
   // Update EXEC, switch all done bits to 0 and all todo bits to 1.
   B.buildInstr(XorTermOpc)
-    .addDef(ExecReg)
+    .addDef(LoopMask)
     .addReg(ExecReg)
     .addReg(NewExec);
 
@@ -959,18 +959,15 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
   // s_cbranch_scc0?
 
   // Loop back to V_READFIRSTLANE_B32 if there are still variants to cover.
-  B.buildInstr(AMDGPU::SI_WATERFALL_LOOP).addMBB(LoopBB);
+  B.buildInstr(AMDGPU::SI_WATERFALL_LOOP)
+     .addReg(LoopMask)
+     .addReg(NewExec)
+     .addMBB(LoopBB);
 
   // Save the EXEC mask before the loop.
   BuildMI(MBB, MBB.end(), DL, TII->get(MovExecOpc), SaveExecReg)
     .addReg(ExecReg);
 
-  // Restore the EXEC mask after the loop.
-  B.setMBB(*RestoreExecBB);
-  B.buildInstr(MovExecTermOpc)
-    .addDef(ExecReg)
-    .addReg(SaveExecReg);
-
   // Set the insert point after the original instruction, so any new
   // instructions will be in the remainder.
   B.setInsertPt(*RemainderBB, RemainderBB->begin());
@@ -4954,7 +4951,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
       OpdsMapping[1] = AMDGPU::getValueMapping(Bank, 32);
       break;
     }
-    case Intrinsic::amdgcn_end_cf: {
+    case Intrinsic::amdgcn_wave_reconverge: {
       unsigned Size = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
       OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
       break;
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index 08e1d6b87b0df..68d81a6ffaaff 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -15,6 +15,7 @@
 #include "GCNSubtarget.h"
 #include "llvm/Analysis/LoopInfo.h"
 #include "llvm/Analysis/UniformityAnalysis.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
 #include "llvm/IR/BasicBlock.h"
 #include "llvm/IR/Constants.h"
@@ -53,7 +54,7 @@ class SIAnnotateControlFlow : public FunctionPass {
   Function *Else;
   Function *IfBreak;
   Function *Loop;
-  Function *EndCf;
+  Function *WaveReconverge;
 
   DominatorTree *DT;
   StackVector Stack;
@@ -86,7 +87,7 @@ class SIAnnotateControlFlow : public FunctionPass {
 
   bool handleLoop(BranchInst *Term);
 
-  bool closeControlFlow(BasicBlock *BB);
+  bool tryWaveReconverge(BasicBlock *BB);
 
 public:
   static char ID;
@@ -141,7 +142,7 @@ void SIAnnotateControlFlow::initialize(Module &M, const GCNSubtarget &ST) {
   IfBreak = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_if_break,
                                       { IntMask });
   Loop = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_loop, { IntMask });
-  EndCf = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_end_cf, { IntMask });
+  WaveReconverge = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_wave_reconverge, { IntMask });
 }
 
 /// Is the branch condition uniform or did the StructurizeCFG pass
@@ -203,8 +204,6 @@ bool SIAnnotateControlFlow::eraseIfUnused(PHINode *Phi) {
 
 /// Open a new "If" block
 bool SIAnnotateControlFlow::openIf(BranchInst *Term) {
-  if (isUniform(Term))
-    return false;
 
   IRBuilder<> IRB(Term);
   Value *IfCall = IRB.CreateCall(If, {Term->getCondition()});
@@ -305,43 +304,43 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {
 }
 
 /// Close the last opened control flow
-bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {
-  llvm::Loop *L = LI->getLoopFor(BB);
+bool SIAnnotateControlFlow::tryWaveReconverge(BasicBlock *BB) {
 
-  assert(Stack.back().first == BB);
+  if (succ_empty(BB))
+    return false;
 
-  if (L && L->getHeader() == BB) {
-    // We can't insert an EndCF call into a loop header, because it will
-    // get executed on every iteration of the loop, when it should be
-    // executed only once before the loop.
-    SmallVector <BasicBlock *, 8> Latches;
-    L->getLoopLatches(Latches);
+  BranchInst *Term = dyn_cast<BranchInst>(BB->getTerminator());
+  if (Term->getNumSuccessors() == 1) {
+    // The current BB's single successor is the top of the stack. We need to
+    // reconverge over that path.
+    BasicBlock *SingleSucc = *succ_begin(BB);
+    BasicBlock::iterator InsPt = Term ? BasicBlock::iterator(Term) : BB->end();
 
-    SmallVector<BasicBlock *, 2> Preds;
-    for (BasicBlock *Pred : predecessors(BB)) {
-      if (!is_contained(Latches, Pred))
-        Preds.push_back(Pred);
+    if (isTopOfStack(SingleSucc)) {
+      Value *Exec = Stack.back().second;
+      IRBuilder<>(BB, InsPt).CreateCall(WaveReconverge, {Exec});
     }
-
-    BB = SplitBlockPredecessors(BB, Preds, "endcf.split", DT, LI, nullptr,
-                                false);
-  }
-
-  Value *Exec = popSaved();
-  BasicBlock::iterator FirstInsertionPt = BB->getFirstInsertionPt();
-  if (!isa<UndefValue>(Exec) && !isa<UnreachableInst>(FirstInsertionPt)) {
-    Instruction *ExecDef = cast<Instruction>(Exec);
-    BasicBlock *DefBB = ExecDef->getParent();
-    if (!DT->dominates(DefBB, BB)) {
-      // Split edge to make Def dominate Use
-      FirstInsertionPt = SplitEdge(DefBB, BB, DT, LI)->getFirstInsertionPt();
+  } else {
+    // We have a uniform conditional branch terminating the block.
+    // This block may be the last in the Then path of the enclosing divergent
+    // IF.
+    if (!isUniform(Term))
+      // Divergent loop is going to be further processed in another place
+      return false;
+
+    for (auto Succ : Term->successors()) {
+      if (isTopOfStack(Succ)) {
+        // Just split to make room for further WAVE_RECONVERGE insertion
+        SmallVector<BasicBlock*, 2> Preds;
+        for (auto P : predecessors(Succ)) {
+          if (DT->dominates(BB, P))
+            Preds.push_back(P);
+        }
+        DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Eager);
+        SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI,
+                                            nullptr, false);
+      }
     }
-    IRBuilder<> IRB(FirstInsertionPt->getParent(), FirstInsertionPt);
-    // TODO: StructurizeCFG 'Flow' blocks have debug locations from the
-    // condition, for now just avoid copying these DebugLocs so that stepping
-    // out of the then/else block in a debugger doesn't step to the condition.
-    IRB.SetCurrentDebugLocation(DebugLoc());
-    IRB.CreateCall(EndCf, {Exec});
   }
 
   return true;
@@ -365,14 +364,20 @@ bool SIAnnotateControlFlow::runOnFunction(Function &F) {
 
     if (!Term || Term->isUnconditional()) {
       if (isTopOfStack(BB))
-        Changed |= closeControlFlow(BB);
+        Stack.pop_back();
+
+      Changed |= tryWaveReconverge(BB);
 
       continue;
     }
 
     if (I.nodeVisited(Term->getSuccessor(1))) {
       if (isTopOfStack(BB))
-        Changed |= closeControlFlow(BB);
+        Stack.pop_back();
+
+      // Let's take care of a uniform loop latch that may be closing the Then
+      // path of the enclosing divergent branch.
+      Changed |= tryWaveReconverge(BB);
 
       if (DT->dominates(Term->getSuccessor(1), BB))
         Changed |= handleLoop(Term);
@@ -387,10 +392,15 @@ bool SIAnnotateControlFlow::runOnFunction(Function &F) {
         continue;
       }
 
-      Changed |= closeControlFlow(BB);
+      Stack.pop_back();
     }
 
-    Changed |= openIf(Term);
+    if (isUniform(Term))
+      // Uniform conditional branch may be in the block that closes the Then
+      // path of the divergent conditional branch.
+      Changed |= tryWaveReconverge(BB);
+    else
+      Changed |= openIf(Term);
   }
 
   if (!Stack.empty()) {
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index d7b6941fcf81d..ea1e7c782e02d 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6299,7 +6299,7 @@ unsigned SITargetLowering::isCFIntrinsic(const SDNode *Intr) const {
       return AMDGPUISD::ELSE;
     case Intrinsic::amdgcn_loop:
       return AMDGPUISD::LOOP;
-    case Intrinsic::amdgcn_end_cf:
+    case Intrinsic::amdgcn_wave_reconverge:
       llvm_unreachable("should not occur");
     default:
       return 0;
@@ -9940,8 +9940,8 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
 
     return SDValue(Load, 0);
   }
-  case Intrinsic::amdgcn_end_cf:
-    return SDValue(DAG.getMachineNode(AMDGPU::SI_END_CF, DL, MVT::Other,
+  case Intrinsic::amdgcn_wave_reconverge:
+    return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL, MVT::Other,
                                       Op->getOperand(2), Chain), 0);
   case Intrinsic::amdgcn_s_barrier_init:
   case Intrinsic::amdgcn_s_barrier_join:
@@ -15740,6 +15740,32 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
     }
   }
 
+  // ISel inserts copies to registers for the successor PHIs
+  // at the BB end. We need to move the SI_WAVE_RECONVERGE right before the
+  // branch.
+  for (auto &MBB : MF) {
+    for (auto &MI : MBB) {
+      if (MI.getOpcode() == AMDGPU::SI_WAVE_RECONVERGE) {
+        MachineBasicBlock::iterator I(MI);
+        MachineBasicBlock::iterator Next = std::next(I);
+        bool NeedToMove = false;
+        while (Next != MBB.end() && !Next->isBranch()) {
+          NeedToMove = true;
+          Next++;
+        }
+
+        assert((Next == MBB.end() || !Next->readsRegister(AMDGPU::SCC, TRI)) &&
+               "Malformed CFG detected!\n");
+
+        if (NeedToMove) {
+          MBB.splice(Next, &MBB, &MI);
+        }
+
+        break;
+      }
+    }
+  }
+
   // FIXME: This is a hack to fixup AGPR classes to use the properly aligned
   // classes if required. Ideally the register class constraints would differ
   // per-subtarget, but there's no easy way to achieve that right now. This is
@@ -16336,7 +16362,7 @@ static bool hasCFUser(const Value *V, SmallPtrSet<const Value *, 16> &Visited,
         default:
           Result = false;
           break;
-        case Intrinsic::amdgcn_end_cf:
+        case Intrinsic::amdgcn_wave_reconverge:
         case Intrinsic::amdgcn_loop:
           Result = true;
           break;
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 08351c49b2231..3412846a5abd9 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -2103,12 +2103,36 @@ bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {
     MI.setDesc(get(AMDGPU::S_MOV_B64));
     break;
 
+  case AMDGPU::S_CMOV_B64_term:
+    // This is only a terminator to get the correct spill code placement during
+    // register allocation.
+    MI.setDesc(get(AMDGPU::S_CMOV_B64));
+    break;
+
   case AMDGPU::S_MOV_B32_term:
     // This is only a terminator to get the correct spill code placement during
     // register allocation.
     MI.setDesc(get(AMDGPU::S_MOV_B32));
     break;
 
+  case AMDGPU::S_CMOV_B32_term:
+    // This is only a terminator to get the correct spill code placement during
+    // register allocation.
+    MI.setDesc(get(AMDGPU::S_CMOV_B32));
+    break;
+
+  case AMDGPU::S_CSELECT_B32_term:
+    // This is only a terminator to get the correct spill code placement during
+    // register allocation.
+    MI.setDesc(get(AMDGPU::S_CSELECT_B32));
+    break;
+
+  case AMDGPU::S_CSELECT_B64_term:
+    // This is only a terminator to get the correct spill code placement during
+    // register allocation.
+    MI.setDesc(get(AMDGPU::S_CSELECT_B64));
+    break;
+
   case AMDGPU::S_XOR_B64_term:
     // This is only a terminator to get the correct spill code placement during
     // register allocation.
@@ -3088,20 +3112,25 @@ bool SIInstrInfo::analyzeBranch(MachineBasicBlock &MBB, MachineBasicBlock *&TBB,
   while (I != E && !I->isBranch() && !I->isReturn()) {
     switch (I->getOpcode()) {
     case AMDGPU::S_MOV_B64_term:
+    case AMDGPU::S_CMOV_B64_term:
     case AMDGPU::S_XOR_B64_term:
     case AMDGPU::S_OR_B64_term:
     case AMDGPU::S_ANDN2_B64_term:
     case AMDGPU::S_AND_B64_term:
     case AMDGPU::S_AND_SAVEEXEC_B64_term:
+    case AMDGPU::S_CSELECT_B64_term:
     case AMDGPU::S_MOV_B32_term:
+    case AMDGPU::S_CMOV_B32_term:
     case AMDGPU::S_XOR_B32_term:
     case AMDGPU::S_OR_B32_term:
     case AMDGPU::S_ANDN2_B32_term:
     case AMDGPU::S_AND_B32_term:
     case AMDGPU::S_AND_SAVEEXEC_B32_term:
+    case AMDGPU::S_CSELECT_B32_term:
       break;
     case AMDGPU::SI_IF:
     case AMDGPU::SI_ELSE:
+    case AMDGPU::SI_WAVE_RECONVERGE:
     case AMDGPU::SI_KILL_I1_TERMINATOR:
     case AMDGPU::SI_KILL_F32_COND_IMM_TERMINATOR:
       // FIXME: It's messy that these need to be considered here at all.
@@ -6386,6 +6415,7 @@ static void emitLoadScalarOpsFromVGPRLoop(
   }
 
   Register SaveExec = MRI.createVirtualRegister(BoolXExecRC);
+  Register LoopMask = MRI.createVirtualRegister(BoolXExecRC);
   MRI.setSimpleHint(SaveExec, CondReg);
 
   // Update EXEC to matching lanes, saving original to SaveExec.
@@ -6396,11 +6426,14 @@ static void emitLoadScalarOpsFromVGPRLoop(
   I = BodyBB.end();
 
   // Update EXEC, switch all done bits to 0 and all todo bits to 1.
-  BuildMI(BodyBB, I, DL, TII.get(XorTermOpc), Exec)
+  BuildMI(BodyBB, I, DL, TII.get(XorTermOpc), LoopMask)
       .addReg(Exec)
       .addReg(SaveExec);
 
-  BuildMI(BodyBB, I, DL, TII.get(AMDGPU::SI_WATERFALL_LOOP)).addMBB(&LoopBB);
+  BuildMI(BodyBB, I, DL, TII.get(AMDGPU::SI_WATERFALL_LOOP))
+      .addReg(LoopMask)
+      .addReg(SaveExec)
+      .addMBB(&LoopBB);
 }
 
 // Build a waterfall loop around \p MI, replacing the VGPR \p ScalarOp register
@@ -6502,8 +6535,10 @@ loadMBUFScalarOperandsFromVGPR(const SIInstrInfo &TII, MachineInstr &MI,
         .addImm(0);
   }
 
+  // BuildMI(*BodyBB, BodyBB->end(), DL, TII.get(AMDGPU::S_BRANCH))
+  //       .addMBB(RemainderBB);
   // Restore the EXEC mask
-  BuildMI(*RemainderBB, First, DL, TII.get(MovExecOpc), Exec).addReg(SaveExec);
+  // BuildMI(*RemainderBB, First, DL, TII.get(MovExecOpc), Exec).addReg(SaveExec);
   return BodyBB;
 }
 
@@ -8782,7 +8817,7 @@ void SIInstrInfo::convertNonUniformIfRegion(MachineBasicBlock *IfEntry,
             .add(Branch->getOperand(0))
             .add(Branch->getOperand(1));
     MachineInstr *SIEND =
-        BuildMI(*MF, Branch->getDebugLoc(), get(AMDGPU::SI_END_CF))
+        BuildMI(*MF, Branch->getDebugLoc(), get(AMDGPU::SI_WAVE_RECONVERGE))
             .addReg(DstReg);
 
     IfEntry->erase(TI);
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index e7aeaa017306c..c526d5ad662eb 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -350,6 +350,8 @@ class WrapTerminatorInst<SOP_Pseudo base_inst> : SPseudoInstSI<
 
 let WaveSizePredicate = isWave64 in {
 def S_MOV_B64_term : WrapTerminatorInst<S_MOV_B64>;
+def S_CMOV_B64_term : WrapTerminatorInst<S_CMOV_B64>;
+def S_CSELECT_B64_term : WrapTerminatorInst<S_CSELECT_B64>;
 def S_XOR_B64_term : WrapTerminatorInst<S_XOR_B64>;
 def S_OR_B64_term : WrapTerminatorInst<S_OR_B64>;
 def S_ANDN2_B64_term : WrapTerminatorInst<S_ANDN2_B64>;
@@ -359,6 +361,8 @@ def S_AND_SAVEEXEC_B64_term : WrapTerminatorInst<S_AND_SAVEEXEC_B64>;
 
 let WaveSizePredicate = isWave32 in {
 def S_MOV_B32_term : WrapTerminatorInst<S_MOV_B32>;
+def S_CMOV_B32_term : WrapTerminatorInst<S_CMOV_B32>;
+def S_CSELECT_B32_term : WrapTerminatorInst<S_CSELECT_B32>;
 def S_XOR_B32_term : WrapTerminatorInst<S_XOR_B32>;
 def S_OR_B32_term : WrapTerminatorInst<S_OR_B32>;
 def S_ANDN2_B32_term : WrapTerminatorInst<S_ANDN2_B32>;
@@ -460,7 +464,7 @@ def SI_ELSE : CFPseudoInstSI <
 
 def SI_WATERFALL_LOOP : CFPseudoInstSI <
   (outs),
-  (ins brtarget:$target), [], 1> {
+  (ins SReg_1:$LoopMask, SReg_1:$ExitMask, brtarget:$target), [], 1> {
   let Size = 8;...
[truncated]

@llvmbot
Collaborator

llvmbot commented May 20, 2024

@llvm/pr-subscribers-llvm-globalisel

Author: None (alex-t)

Changes

We currently lower the SI_IF/ELSE, SI_LOOP, and SI_END_CF to reconverge the wave at the beginning of the CF join basic block or on the loop exit block. This leads to numerous issues related to the spill/split insertion points. LLVM core kits consider the start of the block as the best point to reload the spilled registers. As a result, the vector loads are incorrectly masked out. A similar issue arose when the split kit split the live interval on the CF joining block: the spills were inserted before the exec mask was restored.


Patch is 5.20 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/92809.diff

327 Files Affected:

  • (modified) clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu (+1-1)
  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+2-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+5-4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h (+1-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+8-11)
  • (modified) llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp (+51-41)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+30-4)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+39-4)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+8-4)
  • (modified) llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp (+194-350)
  • (modified) llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp (+3-1)
  • (modified) llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp (+1-1)
  • (added) llvm/test/%t (+1)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/hidden-diverge-gmir.mir (+2-2)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/temporal-divergence.mir (-21)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir (+1-3)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/deprecated/hidden-diverge.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomic_optimizations_mul_one.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+14-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.mir (+3-11)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+105-74)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.mir (+37-61)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+98-62)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.mir (+46-70)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+17-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.mir (+6-18)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll (+4-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.mir (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll (+24-14)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll (+88-56)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-no-rtn.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll (+35-41)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-atomicrmw.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-function-args.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/is-safe-to-sink-bug.ll (+9-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.end.cf.i32.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.end.cf.i64.ll (+5-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+72-68)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.make.buffer.rsrc.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.add.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.cmpswap.ll (+16-40)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.format.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.ll (+24-60)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.format.f16.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.format.f32.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.ll (+24-60)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.add.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.cmpswap.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.format.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.format.f16.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.format.f32.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.load.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.f16.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.i8.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.ll (+10-25)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.load.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.load.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.f16.ll (+18-45)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.i8.ll (+18-45)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.ll (+20-50)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.s.buffer.load.ll (+96-240)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.add.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.cmpswap.ll (+16-40)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f32.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.add.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.cmpswap.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f32.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll (+214-170)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memmove.ll (+17-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll (+23-22)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+22-13)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll (+25-17)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn-s-buffer-load.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.image.load.1d.ll (+8-16)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.image.sample.1d.ll (+12-24)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.raw.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.raw.ptr.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.s.buffer.load.ll (+48-72)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.buffer.store.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.ptr.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.ptr.buffer.store.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-waterfall-agpr.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll (+262-257)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+225-218)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+80-73)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+76-69)
  • (modified) llvm/test/CodeGen/AMDGPU/amdpal-callable.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+23-13)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (+318-228)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+700-548)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+771-546)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (+210-155)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (+276-192)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (+276-192)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+117-94)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-nand.ll (+12-9)
  • (modified) llvm/test/CodeGen/AMDGPU/atomics-cas-remarks-gfx90a.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/bb-prolog-spill-during-regalloc.ll (+24-20)
  • (modified) llvm/test/CodeGen/AMDGPU/block-should-not-be-in-alive-blocks.mir (+13-14)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-condition-and.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+414-371)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation-gfx10-branch-offset-bug.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+50-35)
  • (modified) llvm/test/CodeGen/AMDGPU/bug-sdag-emitcopyfromreg.ll (+6-65)
  • (modified) llvm/test/CodeGen/AMDGPU/bypass-div.ll (+60-36)
  • (modified) llvm/test/CodeGen/AMDGPU/byval-frame-setup.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/call-skip.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-flat.ll (+140-92)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx1030.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx908.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll (+32-26)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.ll (+483-450)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.mir (+303-198)
  • (modified) llvm/test/CodeGen/AMDGPU/constant-fold-imm-immreg.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/control-flow-fastregalloc.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/control-flow-optnone.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/convergence-tokens.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/convergent-inlineasm.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/cse-convergent.ll (+14-13)
  • (modified) llvm/test/CodeGen/AMDGPU/cse-phi-incoming-val.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+12-9)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-v1i8-extractvecelt-crash.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+376-392)
  • (modified) llvm/test/CodeGen/AMDGPU/div_v2i128.ll (+503-415)
  • (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+17-8)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine_gfx11.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/early-tailduplicator-terminator.mir (+7-9)
  • (modified) llvm/test/CodeGen/AMDGPU/else.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/endcf-loop-header.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-frame-ptr-reg-copy-livein.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomic-fadd.v2f16.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+732-534)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+732-534)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.f16.ll (+22-15)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.new.ll (+12-8)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-fabs.ll (+26-11)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+88-56)
  • (modified) llvm/test/CodeGen/AMDGPU/fptoi.i128.ll (+482-348)
  • (modified) llvm/test/CodeGen/AMDGPU/frame-index-elimination.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+13-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-no-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomics-fp-wrong-subtarget.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomics-fp.ll (+440-304)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+416-272)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+732-534)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+732-534)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+1601-969)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+1198-728)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+1198-728)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+1501-907)
  • (modified) llvm/test/CodeGen/AMDGPU/hoist-cond.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/hsa.ll (-1)
  • (modified) llvm/test/CodeGen/AMDGPU/i1-copy-from-loop.ll (+16-12)
  • (modified) llvm/test/CodeGen/AMDGPU/i1-copy-phi.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/i1_copy_phi_with_phi_incoming_value.mir (+8-19)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+237-297)
  • (modified) llvm/test/CodeGen/AMDGPU/image-sample-waterfall.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-call.ll (+144-174)
  • (modified) llvm/test/CodeGen/AMDGPU/infinite-loop.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/inline-asm.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+20-10)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+136-96)
  • (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.bf.ll (+74-40)
  • (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.ll (+464-260)
  • (modified) llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll (+51-39)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll (+110-66)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.format.f16.ll (+10-15)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.format.ll (+12-18)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.ll (+70-105)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f16.ll (+22-33)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f32.ll (+24-36)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.ll (+66-99)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.f16.ll (+10-15)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.ll (+12-18)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.ll (+70-105)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f16.ll (+22-33)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f32.ll (+24-36)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.ll (+66-99)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.f16.ll (+16-24)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.ll (+18-27)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.f16.ll (+20-30)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.ll (+50-75)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.load.f16.ll (+16-24)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.load.ll (+18-27)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.f16.ll (+20-30)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.ll (+50-75)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-soffset-mbuf.ll (+64-96)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.div.fmas.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.ordered.swap.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.exp.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i32.ll (+16-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll (+16-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ps.live.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (+158-108)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (+158-108)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sendmsg.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll (+20-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.format.v3f16.ll (+22-20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.load.format.v3f16.ll (+17-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll (+204-168)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomics-fp.ll (+462-350)
  • (modified) llvm/test/CodeGen/AMDGPU/long-branch-reserve-register.ll (+11-9)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-live-out-copy-undef-subrange.ll (+4-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-on-function-argument.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_break.ll (+25-17)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_exit_with_xor.ll (+22-16)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-intervals.mir (+59-47)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-variables-update.mir (+96-101)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-variables-update.xfail.mir (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-other-terminators.mir (+33-39)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-i1-copies-clear-kills.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-ignorable-exec-use.mir (+15-14)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-lane-mask.mir (+2-6)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll (+24-16)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir (+2-6)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+226-181)
diff --git a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
index 946927d88a1ee..3ca766755a631 100644
--- a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
+++ b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
@@ -10,7 +10,7 @@
 // GFX90A-CAS: A compare and swap loop was generated for an atomic fadd operation at system memory scope
 // GFX90A-CAS-LABEL: _Z14atomic_add_casPf
 // GFX90A-CAS:  flat_atomic_cmpswap
-// GFX90A-CAS:  s_cbranch_execnz
+// GFX90A-CAS:  s_cbranch_scc1
 __device__ float atomic_add_cas(float *p) {
   return __atomic_fetch_add(p, 1.0f, memory_order_relaxed);
 }
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index be8048ca2459c..75ad7ed5e3fa2 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3172,8 +3172,8 @@ def int_amdgcn_loop : Intrinsic<[llvm_i1_ty],
   [llvm_anyint_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
 >;
 
-def int_amdgcn_end_cf : Intrinsic<[], [llvm_anyint_ty],
-  [IntrWillReturn, IntrNoCallback, IntrNoFree]>;
+def int_amdgcn_wave_reconverge : Intrinsic<[], [llvm_anyint_ty],
+  [IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
 
 // Represent unreachable in a divergent region.
 def int_amdgcn_unreachable : Intrinsic<[], [], [IntrConvergent, IntrNoCallback, IntrNoFree]>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index b48a09489653a..9374933986080 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1553,11 +1553,12 @@ bool AMDGPUInstructionSelector::selectReturnAddress(MachineInstr &I) const {
   return true;
 }
 
-bool AMDGPUInstructionSelector::selectEndCfIntrinsic(MachineInstr &MI) const {
+bool AMDGPUInstructionSelector::selectWaveReconvergeIntrinsic(
+    MachineInstr &MI) const {
   // FIXME: Manually selecting to avoid dealing with the SReg_1 trick
   // SelectionDAG uses for wave32 vs wave64.
   MachineBasicBlock *BB = MI.getParent();
-  BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_END_CF))
+  BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_WAVE_RECONVERGE))
       .add(MI.getOperand(1));
 
   Register Reg = MI.getOperand(1).getReg();
@@ -2083,8 +2084,8 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
     MachineInstr &I) const {
   unsigned IntrinsicID = cast<GIntrinsic>(I).getIntrinsicID();
   switch (IntrinsicID) {
-  case Intrinsic::amdgcn_end_cf:
-    return selectEndCfIntrinsic(I);
+  case Intrinsic::amdgcn_wave_reconverge:
+    return selectWaveReconvergeIntrinsic(I);
   case Intrinsic::amdgcn_ds_ordered_add:
   case Intrinsic::amdgcn_ds_ordered_swap:
     return selectDSOrderedIntrinsic(I, IntrinsicID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index f561d5d29efc4..44c89684893f7 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -119,7 +119,7 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
   bool selectReturnAddress(MachineInstr &I) const;
   bool selectG_INTRINSIC(MachineInstr &I) const;
 
-  bool selectEndCfIntrinsic(MachineInstr &MI) const;
+  bool selectWaveReconvergeIntrinsic(MachineInstr &MI) const;
   bool selectDSOrderedIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
   bool selectDSGWSIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
   bool selectDSAppendConsume(MachineInstr &MI, bool IsAppend) const;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 56345d14a331c..368cc98b9a585 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -785,8 +785,6 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
   const TargetRegisterClass *WaveRC = TRI->getWaveMaskRegClass();
   const unsigned MovExecOpc =
       Subtarget.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
-  const unsigned MovExecTermOpc =
-      Subtarget.isWave32() ? AMDGPU::S_MOV_B32_term : AMDGPU::S_MOV_B64_term;
 
   const unsigned XorTermOpc = Subtarget.isWave32() ?
     AMDGPU::S_XOR_B32_term : AMDGPU::S_XOR_B64_term;
@@ -949,9 +947,11 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
 
   B.setInsertPt(*BodyBB, BodyBB->end());
 
+  Register LoopMask = MRI.createVirtualRegister(
+      TRI->getRegClass(AMDGPU::SReg_1_XEXECRegClassID));
   // Update EXEC, switch all done bits to 0 and all todo bits to 1.
   B.buildInstr(XorTermOpc)
-    .addDef(ExecReg)
+    .addDef(LoopMask)
     .addReg(ExecReg)
     .addReg(NewExec);
 
@@ -959,18 +959,15 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
   // s_cbranch_scc0?
 
   // Loop back to V_READFIRSTLANE_B32 if there are still variants to cover.
-  B.buildInstr(AMDGPU::SI_WATERFALL_LOOP).addMBB(LoopBB);
+  B.buildInstr(AMDGPU::SI_WATERFALL_LOOP)
+     .addReg(LoopMask)
+     .addReg(NewExec)
+     .addMBB(LoopBB);
 
   // Save the EXEC mask before the loop.
   BuildMI(MBB, MBB.end(), DL, TII->get(MovExecOpc), SaveExecReg)
     .addReg(ExecReg);
 
-  // Restore the EXEC mask after the loop.
-  B.setMBB(*RestoreExecBB);
-  B.buildInstr(MovExecTermOpc)
-    .addDef(ExecReg)
-    .addReg(SaveExecReg);
-
   // Set the insert point after the original instruction, so any new
   // instructions will be in the remainder.
   B.setInsertPt(*RemainderBB, RemainderBB->begin());
@@ -4954,7 +4951,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
       OpdsMapping[1] = AMDGPU::getValueMapping(Bank, 32);
       break;
     }
-    case Intrinsic::amdgcn_end_cf: {
+    case Intrinsic::amdgcn_wave_reconverge: {
       unsigned Size = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
       OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
       break;
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index 08e1d6b87b0df..68d81a6ffaaff 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -15,6 +15,7 @@
 #include "GCNSubtarget.h"
 #include "llvm/Analysis/LoopInfo.h"
 #include "llvm/Analysis/UniformityAnalysis.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
 #include "llvm/IR/BasicBlock.h"
 #include "llvm/IR/Constants.h"
@@ -53,7 +54,7 @@ class SIAnnotateControlFlow : public FunctionPass {
   Function *Else;
   Function *IfBreak;
   Function *Loop;
-  Function *EndCf;
+  Function *WaveReconverge;
 
   DominatorTree *DT;
   StackVector Stack;
@@ -86,7 +87,7 @@ class SIAnnotateControlFlow : public FunctionPass {
 
   bool handleLoop(BranchInst *Term);
 
-  bool closeControlFlow(BasicBlock *BB);
+  bool tryWaveReconverge(BasicBlock *BB);
 
 public:
   static char ID;
@@ -141,7 +142,7 @@ void SIAnnotateControlFlow::initialize(Module &M, const GCNSubtarget &ST) {
   IfBreak = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_if_break,
                                       { IntMask });
   Loop = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_loop, { IntMask });
-  EndCf = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_end_cf, { IntMask });
+  WaveReconverge = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_wave_reconverge, { IntMask });
 }
 
 /// Is the branch condition uniform or did the StructurizeCFG pass
@@ -203,8 +204,6 @@ bool SIAnnotateControlFlow::eraseIfUnused(PHINode *Phi) {
 
 /// Open a new "If" block
 bool SIAnnotateControlFlow::openIf(BranchInst *Term) {
-  if (isUniform(Term))
-    return false;
 
   IRBuilder<> IRB(Term);
   Value *IfCall = IRB.CreateCall(If, {Term->getCondition()});
@@ -305,43 +304,43 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {
 }
 
 /// Close the last opened control flow
-bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {
-  llvm::Loop *L = LI->getLoopFor(BB);
+bool SIAnnotateControlFlow::tryWaveReconverge(BasicBlock *BB) {
 
-  assert(Stack.back().first == BB);
+  if (succ_empty(BB))
+    return false;
 
-  if (L && L->getHeader() == BB) {
-    // We can't insert an EndCF call into a loop header, because it will
-    // get executed on every iteration of the loop, when it should be
-    // executed only once before the loop.
-    SmallVector <BasicBlock *, 8> Latches;
-    L->getLoopLatches(Latches);
+  BranchInst *Term = dyn_cast<BranchInst>(BB->getTerminator());
+  if (Term->getNumSuccessors() == 1) {
+    // The current BBs single successor is a top of the stack. We need to
+    // reconverge over thaqt path.
+    BasicBlock *SingleSucc = *succ_begin(BB);
+    BasicBlock::iterator InsPt = Term ? BasicBlock::iterator(Term) : BB->end();
 
-    SmallVector<BasicBlock *, 2> Preds;
-    for (BasicBlock *Pred : predecessors(BB)) {
-      if (!is_contained(Latches, Pred))
-        Preds.push_back(Pred);
+    if (isTopOfStack(SingleSucc)) {
+      Value *Exec = Stack.back().second;
+      IRBuilder<>(BB, InsPt).CreateCall(WaveReconverge, {Exec});
     }
-
-    BB = SplitBlockPredecessors(BB, Preds, "endcf.split", DT, LI, nullptr,
-                                false);
-  }
-
-  Value *Exec = popSaved();
-  BasicBlock::iterator FirstInsertionPt = BB->getFirstInsertionPt();
-  if (!isa<UndefValue>(Exec) && !isa<UnreachableInst>(FirstInsertionPt)) {
-    Instruction *ExecDef = cast<Instruction>(Exec);
-    BasicBlock *DefBB = ExecDef->getParent();
-    if (!DT->dominates(DefBB, BB)) {
-      // Split edge to make Def dominate Use
-      FirstInsertionPt = SplitEdge(DefBB, BB, DT, LI)->getFirstInsertionPt();
+  } else {
+    // We have a uniform conditional branch terminating the block.
+    // THis block may be the last in the Then path of the enclosing divergent
+    // IF.
+    if (!isUniform(Term))
+      // Divergent loop is going to be further processed in another place
+      return false;
+
+    for (auto Succ : Term->successors()) {
+      if (isTopOfStack(Succ)) {
+        // Just split to make a room for further WAVE_RECONVERGE insertion
+        SmallVector<BasicBlock*, 2> Preds;
+        for (auto P : predecessors(Succ)) {
+          if (DT->dominates(BB, P))
+            Preds.push_back(P);
+        }
+        DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Eager);
+        SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI,
+                                            nullptr, false);
+      }
     }
-    IRBuilder<> IRB(FirstInsertionPt->getParent(), FirstInsertionPt);
-    // TODO: StructurizeCFG 'Flow' blocks have debug locations from the
-    // condition, for now just avoid copying these DebugLocs so that stepping
-    // out of the then/else block in a debugger doesn't step to the condition.
-    IRB.SetCurrentDebugLocation(DebugLoc());
-    IRB.CreateCall(EndCf, {Exec});
   }
 
   return true;
@@ -365,14 +364,20 @@ bool SIAnnotateControlFlow::runOnFunction(Function &F) {
 
     if (!Term || Term->isUnconditional()) {
       if (isTopOfStack(BB))
-        Changed |= closeControlFlow(BB);
+        Stack.pop_back();
+
+      Changed |= tryWaveReconverge(BB);
 
       continue;
     }
 
     if (I.nodeVisited(Term->getSuccessor(1))) {
       if (isTopOfStack(BB))
-        Changed |= closeControlFlow(BB);
+        Stack.pop_back();
+
+      // Let's take care of uniform loop latch that may be closing the Then
+      // path of the enclosing divergent branch.
+      Changed |= tryWaveReconverge(BB);
 
       if (DT->dominates(Term->getSuccessor(1), BB))
         Changed |= handleLoop(Term);
@@ -387,10 +392,15 @@ bool SIAnnotateControlFlow::runOnFunction(Function &F) {
         continue;
       }
 
-      Changed |= closeControlFlow(BB);
+      Stack.pop_back();
     }
 
-    Changed |= openIf(Term);
+    if (isUniform(Term))
+      // Uniform conditional branch may be in the block that closes the Then
+      // path of the divergent conditional branch.
+      Changed |= tryWaveReconverge(BB);
+    else
+      Changed |= openIf(Term);
   }
 
   if (!Stack.empty()) {
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index d7b6941fcf81d..ea1e7c782e02d 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6299,7 +6299,7 @@ unsigned SITargetLowering::isCFIntrinsic(const SDNode *Intr) const {
       return AMDGPUISD::ELSE;
     case Intrinsic::amdgcn_loop:
       return AMDGPUISD::LOOP;
-    case Intrinsic::amdgcn_end_cf:
+    case Intrinsic::amdgcn_wave_reconverge:
       llvm_unreachable("should not occur");
     default:
       return 0;
@@ -9940,8 +9940,8 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
 
     return SDValue(Load, 0);
   }
-  case Intrinsic::amdgcn_end_cf:
-    return SDValue(DAG.getMachineNode(AMDGPU::SI_END_CF, DL, MVT::Other,
+  case Intrinsic::amdgcn_wave_reconverge:
+    return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL, MVT::Other,
                                       Op->getOperand(2), Chain), 0);
   case Intrinsic::amdgcn_s_barrier_init:
   case Intrinsic::amdgcn_s_barrier_join:
@@ -15740,6 +15740,32 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
     }
   }
 
+  // ISel inserts copy to regs for the successor PHIs
+  // at the BB end. We need to move the SI_WAVE_RECONVERGE right before the
+  // branch.
+  for (auto &MBB : MF) {
+    for (auto &MI : MBB) {
+      if (MI.getOpcode() == AMDGPU::SI_WAVE_RECONVERGE) {
+        MachineBasicBlock::iterator I(MI);
+        MachineBasicBlock::iterator Next = std::next(I);
+        bool NeedToMove = false;
+        while (Next != MBB.end() && !Next->isBranch()) {
+          NeedToMove = true;
+          Next++;
+        }
+
+        assert((Next == MBB.end() || !Next->readsRegister(AMDGPU::SCC, TRI)) &&
+               "Malformed CFG detected!\n");
+
+        if (NeedToMove) {
+          MBB.splice(Next, &MBB, &MI);
+        }
+
+        break;
+      }
+    }
+  }
+
   // FIXME: This is a hack to fixup AGPR classes to use the properly aligned
   // classes if required. Ideally the register class constraints would differ
   // per-subtarget, but there's no easy way to achieve that right now. This is
@@ -16336,7 +16362,7 @@ static bool hasCFUser(const Value *V, SmallPtrSet<const Value *, 16> &Visited,
         default:
           Result = false;
           break;
-        case Intrinsic::amdgcn_end_cf:
+        case Intrinsic::amdgcn_wave_reconverge:
         case Intrinsic::amdgcn_loop:
           Result = true;
           break;
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 08351c49b2231..3412846a5abd9 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -2103,12 +2103,36 @@ bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {
     MI.setDesc(get(AMDGPU::S_MOV_B64));
     break;
 
+  case AMDGPU::S_CMOV_B64_term:
+    // This is only a terminator to get the correct spill code placement during
+    // register allocation.
+    MI.setDesc(get(AMDGPU::S_CMOV_B64));
+    break;
+
   case AMDGPU::S_MOV_B32_term:
     // This is only a terminator to get the correct spill code placement during
     // register allocation.
     MI.setDesc(get(AMDGPU::S_MOV_B32));
     break;
 
+  case AMDGPU::S_CMOV_B32_term:
+    // This is only a terminator to get the correct spill code placement during
+    // register allocation.
+    MI.setDesc(get(AMDGPU::S_CMOV_B32));
+    break;
+
+  case AMDGPU::S_CSELECT_B32_term:
+    // This is only a terminator to get the correct spill code placement during
+    // register allocation.
+    MI.setDesc(get(AMDGPU::S_CSELECT_B32));
+    break;
+
+  case AMDGPU::S_CSELECT_B64_term:
+    // This is only a terminator to get the correct spill code placement during
+    // register allocation.
+    MI.setDesc(get(AMDGPU::S_CSELECT_B64));
+    break;
+
   case AMDGPU::S_XOR_B64_term:
     // This is only a terminator to get the correct spill code placement during
     // register allocation.
@@ -3088,20 +3112,25 @@ bool SIInstrInfo::analyzeBranch(MachineBasicBlock &MBB, MachineBasicBlock *&TBB,
   while (I != E && !I->isBranch() && !I->isReturn()) {
     switch (I->getOpcode()) {
     case AMDGPU::S_MOV_B64_term:
+    case AMDGPU::S_CMOV_B64_term:
     case AMDGPU::S_XOR_B64_term:
     case AMDGPU::S_OR_B64_term:
     case AMDGPU::S_ANDN2_B64_term:
     case AMDGPU::S_AND_B64_term:
     case AMDGPU::S_AND_SAVEEXEC_B64_term:
+    case AMDGPU::S_CSELECT_B64_term:
     case AMDGPU::S_MOV_B32_term:
+    case AMDGPU::S_CMOV_B32_term:
     case AMDGPU::S_XOR_B32_term:
     case AMDGPU::S_OR_B32_term:
     case AMDGPU::S_ANDN2_B32_term:
     case AMDGPU::S_AND_B32_term:
     case AMDGPU::S_AND_SAVEEXEC_B32_term:
+    case AMDGPU::S_CSELECT_B32_term:
       break;
     case AMDGPU::SI_IF:
     case AMDGPU::SI_ELSE:
+    case AMDGPU::SI_WAVE_RECONVERGE:
     case AMDGPU::SI_KILL_I1_TERMINATOR:
     case AMDGPU::SI_KILL_F32_COND_IMM_TERMINATOR:
       // FIXME: It's messy that these need to be considered here at all.
@@ -6386,6 +6415,7 @@ static void emitLoadScalarOpsFromVGPRLoop(
   }
 
   Register SaveExec = MRI.createVirtualRegister(BoolXExecRC);
+  Register LoopMask = MRI.createVirtualRegister(BoolXExecRC);
   MRI.setSimpleHint(SaveExec, CondReg);
 
   // Update EXEC to matching lanes, saving original to SaveExec.
@@ -6396,11 +6426,14 @@ static void emitLoadScalarOpsFromVGPRLoop(
   I = BodyBB.end();
 
   // Update EXEC, switch all done bits to 0 and all todo bits to 1.
-  BuildMI(BodyBB, I, DL, TII.get(XorTermOpc), Exec)
+  BuildMI(BodyBB, I, DL, TII.get(XorTermOpc), LoopMask)
       .addReg(Exec)
       .addReg(SaveExec);
 
-  BuildMI(BodyBB, I, DL, TII.get(AMDGPU::SI_WATERFALL_LOOP)).addMBB(&LoopBB);
+  BuildMI(BodyBB, I, DL, TII.get(AMDGPU::SI_WATERFALL_LOOP))
+      .addReg(LoopMask)
+      .addReg(SaveExec)
+      .addMBB(&LoopBB);
 }
 
 // Build a waterfall loop around \p MI, replacing the VGPR \p ScalarOp register
@@ -6502,8 +6535,10 @@ loadMBUFScalarOperandsFromVGPR(const SIInstrInfo &TII, MachineInstr &MI,
         .addImm(0);
   }
 
+  // BuildMI(*BodyBB, BodyBB->end(), DL, TII.get(AMDGPU::S_BRANCH))
+  //       .addMBB(RemainderBB);
   // Restore the EXEC mask
-  BuildMI(*RemainderBB, First, DL, TII.get(MovExecOpc), Exec).addReg(SaveExec);
+  // BuildMI(*RemainderBB, First, DL, TII.get(MovExecOpc), Exec).addReg(SaveExec);
   return BodyBB;
 }
 
@@ -8782,7 +8817,7 @@ void SIInstrInfo::convertNonUniformIfRegion(MachineBasicBlock *IfEntry,
             .add(Branch->getOperand(0))
             .add(Branch->getOperand(1));
     MachineInstr *SIEND =
-        BuildMI(*MF, Branch->getDebugLoc(), get(AMDGPU::SI_END_CF))
+        BuildMI(*MF, Branch->getDebugLoc(), get(AMDGPU::SI_WAVE_RECONVERGE))
             .addReg(DstReg);
 
     IfEntry->erase(TI);
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index e7aeaa017306c..c526d5ad662eb 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -350,6 +350,8 @@ class WrapTerminatorInst<SOP_Pseudo base_inst> : SPseudoInstSI<
 
 let WaveSizePredicate = isWave64 in {
 def S_MOV_B64_term : WrapTerminatorInst<S_MOV_B64>;
+def S_CMOV_B64_term : WrapTerminatorInst<S_CMOV_B64>;
+def S_CSELECT_B64_term : WrapTerminatorInst<S_CSELECT_B64>;
 def S_XOR_B64_term : WrapTerminatorInst<S_XOR_B64>;
 def S_OR_B64_term : WrapTerminatorInst<S_OR_B64>;
 def S_ANDN2_B64_term : WrapTerminatorInst<S_ANDN2_B64>;
@@ -359,6 +361,8 @@ def S_AND_SAVEEXEC_B64_term : WrapTerminatorInst<S_AND_SAVEEXEC_B64>;
 
 let WaveSizePredicate = isWave32 in {
 def S_MOV_B32_term : WrapTerminatorInst<S_MOV_B32>;
+def S_CMOV_B32_term : WrapTerminatorInst<S_CMOV_B32>;
+def S_CSELECT_B32_term : WrapTerminatorInst<S_CSELECT_B32>;
 def S_XOR_B32_term : WrapTerminatorInst<S_XOR_B32>;
 def S_OR_B32_term : WrapTerminatorInst<S_OR_B32>;
 def S_ANDN2_B32_term : WrapTerminatorInst<S_ANDN2_B32>;
@@ -460,7 +464,7 @@ def SI_ELSE : CFPseudoInstSI <
 
 def SI_WATERFALL_LOOP : CFPseudoInstSI <
   (outs),
-  (ins brtarget:$target), [], 1> {
+  (ins SReg_1:$LoopMask, SReg_1:$ExitMask, brtarget:$target), [], 1> {
   let Size = 8;...
[truncated]

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 586ecd75606e70a8d16cb1717809acce652ffe7f 7cda2e3ce0d180688250856566b6c75ca07d7711 -- llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp llvm/lib/Target/AMDGPU/SIISelLowering.cpp llvm/lib/Target/AMDGPU/SIInstrInfo.cpp llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 368cc98b9a..97c9e9a32b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -950,19 +950,16 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
   Register LoopMask = MRI.createVirtualRegister(
       TRI->getRegClass(AMDGPU::SReg_1_XEXECRegClassID));
   // Update EXEC, switch all done bits to 0 and all todo bits to 1.
-  B.buildInstr(XorTermOpc)
-    .addDef(LoopMask)
-    .addReg(ExecReg)
-    .addReg(NewExec);
+  B.buildInstr(XorTermOpc).addDef(LoopMask).addReg(ExecReg).addReg(NewExec);
 
   // XXX - s_xor_b64 sets scc to 1 if the result is nonzero, so can we use
   // s_cbranch_scc0?
 
   // Loop back to V_READFIRSTLANE_B32 if there are still variants to cover.
   B.buildInstr(AMDGPU::SI_WATERFALL_LOOP)
-     .addReg(LoopMask)
-     .addReg(NewExec)
-     .addMBB(LoopBB);
+      .addReg(LoopMask)
+      .addReg(NewExec)
+      .addMBB(LoopBB);
 
   // Save the EXEC mask before the loop.
   BuildMI(MBB, MBB.end(), DL, TII->get(MovExecOpc), SaveExecReg)
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index 68d81a6ffa..8e909e5afb 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -13,9 +13,9 @@
 
 #include "AMDGPU.h"
 #include "GCNSubtarget.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
 #include "llvm/Analysis/LoopInfo.h"
 #include "llvm/Analysis/UniformityAnalysis.h"
-#include "llvm/Analysis/DomTreeUpdater.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
 #include "llvm/IR/BasicBlock.h"
 #include "llvm/IR/Constants.h"
@@ -142,7 +142,8 @@ void SIAnnotateControlFlow::initialize(Module &M, const GCNSubtarget &ST) {
   IfBreak = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_if_break,
                                       { IntMask });
   Loop = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_loop, { IntMask });
-  WaveReconverge = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_wave_reconverge, { IntMask });
+  WaveReconverge = Intrinsic::getDeclaration(
+      &M, Intrinsic::amdgcn_wave_reconverge, {IntMask});
 }
 
 /// Is the branch condition uniform or did the StructurizeCFG pass
@@ -331,14 +332,14 @@ bool SIAnnotateControlFlow::tryWaveReconverge(BasicBlock *BB) {
     for (auto Succ : Term->successors()) {
       if (isTopOfStack(Succ)) {
         // Just split to make a room for further WAVE_RECONVERGE insertion
-        SmallVector<BasicBlock*, 2> Preds;
+        SmallVector<BasicBlock *, 2> Preds;
         for (auto P : predecessors(Succ)) {
           if (DT->dominates(BB, P))
             Preds.push_back(P);
         }
         DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Eager);
-        SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI,
-                                            nullptr, false);
+        SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI, nullptr,
+                               false);
       }
     }
   }
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index ea1e7c782e..b3984d4124 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -9941,8 +9941,9 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
     return SDValue(Load, 0);
   }
   case Intrinsic::amdgcn_wave_reconverge:
-    return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL, MVT::Other,
-                                      Op->getOperand(2), Chain), 0);
+    return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL,
+                                      MVT::Other, Op->getOperand(2), Chain),
+                   0);
   case Intrinsic::amdgcn_s_barrier_init:
   case Intrinsic::amdgcn_s_barrier_join:
   case Intrinsic::amdgcn_s_wakeup_barrier: {
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 3412846a5a..9786a382f6 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -6538,7 +6538,8 @@ loadMBUFScalarOperandsFromVGPR(const SIInstrInfo &TII, MachineInstr &MI,
   // BuildMI(*BodyBB, BodyBB->end(), DL, TII.get(AMDGPU::S_BRANCH))
   //       .addMBB(RemainderBB);
   // Restore the EXEC mask
-  // BuildMI(*RemainderBB, First, DL, TII.get(MovExecOpc), Exec).addReg(SaveExec);
+  // BuildMI(*RemainderBB, First, DL, TII.get(MovExecOpc),
+  // Exec).addReg(SaveExec);
   return BodyBB;
 }
 
diff --git a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
index 99ecff2d95..c494897392 100644
--- a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
@@ -216,9 +216,8 @@ void SILowerControlFlow::emitIf(MachineInstr &MI) {
   // Get rid of the garbage bits in the Cond register which might be coming from
   // the bitwise arithmetic when one of the expression operands is coming from
   // the outer scope and hence having extra bits set.
-  MachineInstr *CondFiltered = BuildMI(MBB, I, DL, TII->get(AndOpc), MaskThen)
-                                   .add(Cond)
-                                   .addReg(Exec);
+  MachineInstr *CondFiltered =
+      BuildMI(MBB, I, DL, TII->get(AndOpc), MaskThen).add(Cond).addReg(Exec);
   if (LV)
     LV->replaceKillInstruction(CondReg, MI, *CondFiltered);
 
@@ -306,9 +305,9 @@ void SILowerControlFlow::emitLoop(MachineInstr &MI) {
                                  .addReg(MaskLoop)
                                  .addImm(TestMask);
 
-  MachineInstr *SetExec= BuildMI(MBB, &MI, DL, TII->get(Select), Exec)
-                                     .addReg(MaskLoop)
-                                     .addReg(Cond);
+  MachineInstr *SetExec = BuildMI(MBB, &MI, DL, TII->get(Select), Exec)
+                              .addReg(MaskLoop)
+                              .addReg(Cond);
 
   if (LV)
     LV->replaceKillInstruction(MI.getOperand(0).getReg(), MI, *SetExec);
@@ -341,15 +340,17 @@ void SILowerControlFlow::emitWaterfallLoop(MachineInstr &MI) {
   Register AndZero = MRI->createVirtualRegister(
       TRI->getRegClass(AMDGPU::SReg_1_XEXECRegClassID));
 
-  MachineInstr *MaskZeroTest = BuildMI(*BodyBB, I, DL, TII->get(AndTermOpc), AndZero)
-      .addReg(LoopMask)
-      .addImm(TestMask);
+  MachineInstr *MaskZeroTest =
+      BuildMI(*BodyBB, I, DL, TII->get(AndTermOpc), AndZero)
+          .addReg(LoopMask)
+          .addImm(TestMask);
 
   MachineInstr *UpdateExec = BuildMI(*BodyBB, I, DL, TII->get(Select), Exec)
-      .addReg(LoopMask)
-      .addReg(ExitMask);
+                                 .addReg(LoopMask)
+                                 .addReg(ExitMask);
 
-  MachineInstr *Branch = BuildMI(*BodyBB, I, DL, TII->get(AMDGPU::S_CBRANCH_SCC1)).addMBB(LoopBB);
+  MachineInstr *Branch =
+      BuildMI(*BodyBB, I, DL, TII->get(AMDGPU::S_CBRANCH_SCC1)).addMBB(LoopBB);
 
   if (LIS) {
     RecomputeRegs.insert(MI.getOperand(0).getReg());
@@ -405,7 +406,7 @@ void SILowerControlFlow::emitWaveDiverge(MachineInstr &MI,
     MachineInstr *CopyExec =
         BuildMI(MBB, I, DL, TII->get(AMDGPU::COPY), DisableLanesMask)
             .addReg(Exec);
-    if(LIS)
+    if (LIS)
       LIS->InsertMachineInstrInMaps(*CopyExec);
   }
   Register TestResultReg = MRI->createVirtualRegister(BoolRC);
@@ -463,7 +464,7 @@ void SILowerControlFlow::emitWaveDiverge(MachineInstr &MI,
   LIS->removeAllRegUnitsForPhysReg(Exec);
 }
 
-void SILowerControlFlow::emitWaveReconverge(MachineInstr &MI) { 
+void SILowerControlFlow::emitWaveReconverge(MachineInstr &MI) {
 
   MachineBasicBlock &BB = *MI.getParent();
   Register Mask = MI.getOperand(0).getReg();

Contributor

@arsenm arsenm left a comment

There are quite a few code-quality regressions and XFAILed tests. The description needs more elaboration on what the strategy is here.

@@ -1,3 +1,4 @@
; XFAIL: *

can't just xfail tests

}

assert((Next == MBB.end() || !Next->readsRegister(AMDGPU::SCC, TRI)) &&
"Malformed CFG detected!\n");

No newline in the string; this isn't real printing.
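
i.e. the same assert with the newline dropped from the message:

assert((Next == MBB.end() || !Next->readsRegister(AMDGPU::SCC, TRI)) &&
       "Malformed CFG detected!");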

@@ -15740,6 +15740,32 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
}
}

// ISel inserts copy to regs for the successor PHIs
// at the BB end. We need to move the SI_WAVE_RECONVERGE right before the

Can you avoid this by gluing the pseudo to the root node? Also, I think you can avoid a second walk over the function by doing this in EmitInstrWithCustomInserter.
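
A rough sketch of the EmitInstrWithCustomInserter route (illustrative only, not part of the patch; it assumes SI_WAVE_RECONVERGE is flagged usesCustomInserter and that the PHI copies are already present in the block when the hook runs — both would need verifying):

// Inside the existing switch in
// SITargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
//                                               MachineBasicBlock *BB):
case AMDGPU::SI_WAVE_RECONVERGE: {
  // Move the pseudo past the COPYs ISel created for the successor PHIs so it
  // ends up immediately before the block's terminators.
  MachineBasicBlock::iterator InsPt = BB->getFirstTerminator();
  if (InsPt != MachineBasicBlock::iterator(MI))
    BB->splice(InsPt, BB, &MI);
  return BB;
}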

@@ -0,0 +1 @@
remark: <unknown>:0:0: removing function 'needs_extimg': +extended-image-insts is not supported on the current target

accidentally added file?

Comment on lines +2130 to +2134
case AMDGPU::S_CSELECT_B64_term:
// This is only a terminator to get the correct spill code placement during
// register allocation.
MI.setDesc(get(AMDGPU::S_CSELECT_B64));
break;

Can you split out the low-level operation handling like this into a separate PR?

FirstInsertionPt = SplitEdge(DefBB, BB, DT, LI)->getFirstInsertionPt();
} else {
// We have a uniform conditional branch terminating the block.
// THis block may be the last in the Then path of the enclosing divergent

Typo 'THis'

@@ -3172,8 +3172,8 @@ def int_amdgcn_loop : Intrinsic<[llvm_i1_ty],
[llvm_anyint_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
>;

def int_amdgcn_end_cf : Intrinsic<[], [llvm_anyint_ty],
[IntrWillReturn, IntrNoCallback, IntrNoFree]>;
def int_amdgcn_wave_reconverge : Intrinsic<[], [llvm_anyint_ty],

Should document what this means


I second that; all these control-flow pseudos need to have their semantics documented.
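
For reference, one possible wording for the .td comment (a suggestion only, inferred from the PR description):

// int_amdgcn_wave_reconverge(mask): placed at the end of a predecessor of a
// control-flow join block. Updates EXEC from `mask` (the value produced by
// the matching llvm.amdgcn.if/else) so that the lanes that were active before
// the divergent branch are active again when the join block executes.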

@@ -305,43 +304,43 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {
}

/// Close the last opened control flow
bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {
llvm::Loop *L = LI->getLoopFor(BB);
bool SIAnnotateControlFlow::tryWaveReconverge(BasicBlock *BB) {

This function is a core part of this change. It would be nice to have a more detailed comment, with examples, before the function header showing when and where wave_reconverge is inserted.
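
For instance, such a header comment could sketch the intended placement (a hand-written illustration, not taken from the patch or its tests; names are made up):

/// Insert llvm.amdgcn.wave.reconverge in the predecessor(s) of a control-flow
/// join block, where closeControlFlow used to place llvm.amdgcn.end.cf at the
/// start of the join block itself:
///
///   entry:
///     %r = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond)
///     %take = extractvalue { i1, i64 } %r, 0
///     %mask = extractvalue { i1, i64 } %r, 1
///     br i1 %take, label %then, label %join
///   then:
///     ...
///     call void @llvm.amdgcn.wave.reconverge.i64(i64 %mask) ; new placement
///     br label %join
///   join:
///     ...   ; previously: llvm.amdgcn.end.cf at the start of this block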

@cmc-rep
Contributor

cmc-rep commented Aug 21, 2024

@alex-t Just curious about the status of this PR. Both this PR and the register-allocation PR by CD will have a significant impact on the generated code. If we decide this is the right direction, then I feel it would be better to get it in earlier, so we can assess its impact in our downstream work.

@alex-t
Contributor Author

alex-t commented Sep 13, 2024

I am sorry, but I haven't updated this PR for too long. We recently decided that this change is worth trying to upstream.
I am going to close this one and create a new one. My apologies for the inconvenience.
Please see #108596 to continue the discussion.

@arsenm
Contributor

arsenm commented Sep 15, 2024

Replaced by #108596

@arsenm arsenm closed this Sep 15, 2024