Support generic reduction and scan cases. #14

Merged
minjang merged 1 commit into triton-lang:main from ienkovich/cpu/scan-reduce on Jun 10, 2024

Conversation

ienkovich
Collaborator

No description provided.

@ienkovich ienkovich requested a review from minjang June 4, 2024 21:49
@ienkovich ienkovich requested a review from ptillet as a code owner June 4, 2024 21:49
Collaborator

@minjang minjang left a comment

Looks good to me! I just have a naming convention suggestion (ScanReduce -> ReduceScan) and some minor code clarity comments. I will accept soon.

This is a general question about reduce/scan with vectorization: is the scan algorithm basically like std::inclusive_scan, using vector::shuffle to move data within SIMD registers and then accumulating at the end?

third_party/cpu/lib/TritonToTritonCPU/ConvertScanOp.cpp (outdated, resolved)
```cpp
    kind == vector::CombiningKind::MAXIMUMF) {
  if (elemTy.isF32())
    initVal =
        rewriter.getF32FloatAttr(std::numeric_limits<float>::quiet_NaN());
```
Collaborator

Looks good. It's like std::fmin and std::min.

```cpp
  else if (elemTy.isF64())
    initVal =
        rewriter.getF64FloatAttr(std::numeric_limits<double>::quiet_NaN());
  else
```
Collaborator

Not urgent, but maybe we can support F16/BF16 using their raw binary representations for quiet_NaN?

Collaborator Author

Yes, this needs to be added. I couldn't yet find examples of how such constants are created, used, and lowered. Did you see any examples?

Collaborator

I just saw one F16 test case in test_core.py regarding reduction. But it's not urgent at all.
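
For reference, one possible way to build such constants without hard-coding raw bit patterns is to derive the quiet NaN from the type's float semantics via APFloat. This is only a sketch of the idea, assuming an MLIR `FloatType` `elemTy` and the same rewriter as in the snippet above; it is not code from this PR:

```cpp
// Hypothetical sketch, not part of this PR: build a quiet-NaN attribute for
// any float element type (F32/F64/F16/BF16) from its IEEE semantics instead
// of spelling out per-type bit patterns.
if (auto floatTy = mlir::dyn_cast<mlir::FloatType>(elemTy))
  initVal = mlir::FloatAttr::get(
      floatTy, llvm::APFloat::getQNaN(floatTy.getFloatSemantics()));
```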

third_party/cpu/lib/TritonToTritonCPU/ScanReduceCommon.h (outdated, resolved)
third_party/cpu/lib/TritonToTritonCPU/ConvertScanOp.cpp (outdated, resolved)
third_party/cpu/lib/TritonToTritonCPU/ConvertScanOp.cpp (outdated, resolved)
third_party/cpu/lib/TritonToTritonCPU/ScanReduceCommon.h (outdated, resolved)
@ienkovich
Collaborator Author

> This is a general question about reduce/scan with vectorization: is the scan algorithm basically like std::inclusive_scan, using vector::shuffle to move data within SIMD registers and then accumulating at the end?

There are three general cases here. For scans on the trailing dimension, we use shuffles to accumulate neighbors, then increase the shuffle stride and use masks to keep already-computed elements intact. For reductions on the trailing dimension, we use a shuffle to swap the two halves of the vector and accumulate, then swap halves of halves, and so on. When a scan/reduction goes over a non-trailing dimension, we iterate through all sub-vectors and accumulate; it works like a fully unrolled loop.
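
As a rough illustration of the trailing-dimension scan described above, here is a minimal scalar model in C++ (not the actual pass output, which emits vector.shuffle ops): each round combines every lane with the lane `stride` positions before it, the stride doubles each round, and lanes below the stride are masked so already-finished prefixes stay intact.

```cpp
// Scalar model of a shuffle-based inclusive scan (Hillis-Steele style).
// The "shifted" array plays the role of the vector.shuffle result.
#include <array>
#include <cstddef>
#include <cstdio>

int main() {
  std::array<int, 8> v = {1, 2, 3, 4, 5, 6, 7, 8};
  for (std::size_t stride = 1; stride < v.size(); stride *= 2) {
    std::array<int, 8> shifted{}; // lanes below `stride` stay zero (the mask)
    for (std::size_t i = stride; i < v.size(); ++i)
      shifted[i] = v[i - stride];
    for (std::size_t i = 0; i < v.size(); ++i)
      v[i] += shifted[i];
  }
  for (int x : v)
    std::printf("%d ", x); // prints: 1 3 6 10 15 21 28 36
  std::printf("\n");
}
```

The trailing-dimension reduction is the same idea in reverse: swap the two halves with a shuffle and combine, then repeat on halves of halves until a single lane holds the result.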

@minjang minjang merged commit 5adb663 into triton-lang:main Jun 10, 2024
2 of 4 checks passed
@ienkovich ienkovich deleted the ienkovich/cpu/scan-reduce branch June 10, 2024 22:19
minjang pushed a commit to minjang/triton-cpu that referenced this pull request Jun 22, 2024
minjang pushed a commit that referenced this pull request Jun 24, 2024
When running
[convert_blocked1d_to_slice0](https://github.com/triton-lang/triton/blob/0ba5f0c3cd029d5c3d1f01b9bf29dac32c27345e/test/Conversion/tritongpu_to_llvm.mlir#L924)
Triton ends up computing the rank of a matrix with 0 columns during linear
layout lowering, which trips up f2reduce and causes undefined behavior,
detectable through
[UBSAN](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html).

Fix this by returning the rank (0) early in these cases, without calling
f2reduce.
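
A minimal sketch of the guard described above (variable names here are illustrative; the actual change lives in lib/Tools/LinearLayout.cpp):

```cpp
// Illustrative only: skip f2reduce entirely for a zero-width matrix, whose
// rank is trivially 0. On such inputs a shift amount derived from the width
// appears to underflow, which is what the UBSAN trace below reports.
if (numCols == 0)
  return 0;
// ... otherwise call f2reduce::inplace_rref_strided as before ...
```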

<details><summary>Stack trace</summary>
<p>

```
third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30: runtime error: shift exponent 18446744073709551615 is too large for 64-bit type 'unsigned long long'
    #0 0x556ee2fea3be in inplace_rref_small third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30
    #1 0x556ee2fea3be in f2reduce::inplace_rref_strided(unsigned long*, unsigned long, unsigned long, unsigned long) third_party/triton/third_party/f2reduce/f2reduce.cpp:470:9
    #2 0x556ee2ea70da in getMatrixRank third_party/triton/lib/Tools/LinearLayout.cpp:125:3
    #3 0x556ee2ea70da in mlir::triton::LinearLayout::checkInvariants(bool) third_party/triton/lib/Tools/LinearLayout.cpp:299:7
    #4 0x556ee2ea656d in mlir::triton::LinearLayout::tryCreate(llvm::MapVector<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>, llvm::DenseMap<mlir::StringAttr, unsigned int, llvm::DenseMapInfo<mlir::StringAttr, void>, llvm::detail::DenseMapPair<mlir::StringAttr, unsigned int>>, llvm::SmallVector<std::__u::pair<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>>, 0u>>, llvm::ArrayRef<std::__u::pair<mlir::StringAttr, int>>, bool) third_party/triton/lib/Tools/LinearLayout.cpp:190:41
    #5 0x556ee2eb2150 in mlir::triton::LinearLayout::divideRight(mlir::triton::LinearLayout const&) third_party/triton/lib/Tools/LinearLayout.cpp:654:51
    #6 0x556ee2ee1c39 in mlir::cvtNeedsSharedMemory(mlir::RankedTensorType, mlir::RankedTensorType) third_party/triton/lib/Analysis/Utility.cpp:652:14
    #7 0x556ee2cf38fd in mlir::triton::getRepShapeForCvtLayout(mlir::triton::gpu::ConvertLayoutOp) third_party/triton/lib/Analysis/Allocation.cpp:66:8
    #8 0x556ee2cf3efa in mlir::triton::getScratchConfigForCvtLayout(mlir::triton::gpu::ConvertLayoutOp, unsigned int&, unsigned int&) third_party/triton/lib/Analysis/Allocation.cpp:95:19
    #9 0x556ee2cf6057 in mlir::triton::AllocationAnalysis::getScratchValueSize(mlir::Operation*) third_party/triton/lib/Analysis/Allocation.cpp:272:24
    #10 0x556ee2cf5499 in operator() third_party/triton/lib/Analysis/Allocation.cpp:343:7
    #11 0x556ee2cf5499 in void llvm::function_ref<void (mlir::Operation*)>::callback_fn<mlir::triton::AllocationAnalysis::getValuesAndSizes()::'lambda'(mlir::Operation*)>(long, mlir::Operation*) third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45:12
    #12 0x556edeeee7a9 in operator() third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12
    #13 0x556edeeee7a9 in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:174:5
    #14 0x556edeeee87c in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:182:9
    #15 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), mlir::Operation *, void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:313:10
    #16 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Operation.h:794:12
    #17 0x556ee2cf49e7 in mlir::triton::AllocationAnalysis::getValuesAndSizes() third_party/triton/lib/Analysis/Allocation.cpp:341:16
    #18 0x556ee2cf4852 in run third_party/triton/lib/Analysis/Allocation.cpp:182:5
    #19 0x556ee2cf4852 in AllocationAnalysis third_party/triton/lib/Analysis/Allocation.cpp:169:5
    #20 0x556ee2cf4852 in mlir::Allocation::run(llvm::DenseMap<mlir::FunctionOpInterface, mlir::Allocation, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>, llvm::detail::DenseMapPair<mlir::FunctionOpInterface, mlir::Allocation>>&) third_party/triton/lib/Analysis/Allocation.cpp:627:3
    #21 0x556ee1677402 in operator() third_party/triton/include/triton/Analysis/Allocation.h:227:26
    #22 0x556ee1677402 in void mlir::CallGraph<mlir::Allocation>::doWalk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)>(mlir::FunctionOpInterface, llvm::DenseSet<mlir::FunctionOpInterface, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>>&, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)) third_party/triton/include/triton/Analysis/Utility.h:350:7
    #23 0x556ee16756b3 in walk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, (lambda at third_party/triton/include/triton/Analysis/Allocation.h:222:9), (lambda at third_party/triton/include/triton/Analysis/Allocation.h:224:9)> third_party/triton/include/triton/Analysis/Utility.h:242:7
    #24 0x556ee16756b3 in mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp) third_party/triton/include/triton/Analysis/Allocation.h:220:5
    #25 0x556ee2c2bf18 in (anonymous namespace)::AllocateSharedMemory::runOnOperation() third_party/triton/lib/Conversion/TritonGPUToLLVM/AllocateSharedMemory.cpp:26:22
...
UndefinedBehaviorSanitizer: invalid-shift-exponent third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30 
```
</p>
</details>
minjang pushed a commit that referenced this pull request Jun 24, 2024
Devjiu pushed a commit to Devjiu/triton-cpu that referenced this pull request Aug 13, 2024
int3 pushed a commit that referenced this pull request Aug 29, 2024
minjang pushed a commit that referenced this pull request Sep 22, 2024
minjang pushed a commit that referenced this pull request Oct 22, 2024
minjang pushed a commit that referenced this pull request Oct 24, 2024
Devjiu pushed a commit to Devjiu/triton-cpu that referenced this pull request Nov 13, 2024
Adds extra optional padding that can be used to ensure that input
matrices' strides are non-power-of-two, to improve cache behavior.

Currently, it is most useful with DYNAMIC_K_BLOCK enabled.
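
A rough sketch of the idea (names and constants here are illustrative, not the actual pass code): if a row stride would land on a power of two, pad it so consecutive rows stop aliasing to the same cache sets.

```cpp
#include <cstdint>

// Illustrative only: pad a power-of-two row stride (in elements) by one
// cache line so successive rows map to different cache sets.
constexpr int64_t kCacheLineElems = 16; // assumes 16 floats == 64 bytes
int64_t padStride(int64_t stride) {
  bool isPow2 = stride > 0 && (stride & (stride - 1)) == 0;
  return (isPow2 && stride >= kCacheLineElems) ? stride + kCacheLineElems
                                               : stride;
}
```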
Devjiu pushed a commit to Devjiu/triton-cpu that referenced this pull request Nov 13, 2024