
[CUDA codegen] add vectorization infrastructure #5278

Merged
merged 1 commit into iree-org:main on Apr 1, 2021

Conversation

ThomasRaoux (Contributor):

Enable vectorization for element-wise ops and prepare the infrastructure for more complex ops.
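As a rough illustration of what this enables (a minimal sketch, not code from this PR; the function names and the 128-bit access width are my assumptions), the sizing logic behind vectorizing element-wise ops on CUDA amounts to giving each thread one wide load/store and letting a warp cover warpSize * vectorWidth elements per step:

```cpp
// Hypothetical sketch of CUDA vectorization sizing, not the PR's actual code.
#include <algorithm>

static constexpr unsigned cudaWarpSize = 32;   // matches the constant in the diff below
static constexpr unsigned maxVectorBits = 128; // widest CUDA load/store (assumption)

// How many elements of `elementBits` width fit in one vectorized access.
unsigned chooseVectorWidth(unsigned elementBits) {
  return std::max(1u, maxVectorBits / elementBits);
}

// Elements one warp covers per step once each lane is vectorized:
// f32 -> 32 * 4 = 128 elements; f16 -> 32 * 8 = 256 elements.
unsigned warpTileSize(unsigned elementBits) {
  return cudaWarpSize * chooseVectorWidth(elementBits);
}
```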

MaheshRavishankar (Contributor) left a comment:

Just an initial comment. Will review in detail tomorrow.

@@ -24,57 +24,22 @@ using namespace mlir::iree_compiler;

static constexpr unsigned cudaWarpSize = 32;

/// Fills `inputTypes` and `outputTypes` with the original input/output types
Contributor:

I think it's fine to keep this. It's a pattern that should be easy to replace with linalg.tile when we can. I am just cleaning up the concretize... pass and wondering why we have this attribute mechanism. Actually, we should be able to use the ViewLikeInterface to get the source already. I am cleaning that up on the SPIR-V side. You could then adapt this?

Contributor Author:

I removed it because I realized it was currently not working. On the SPIR-V path it relies on the concretize pass to go up the SSA chain, analyze the subview ops, and then set those attributes (somehow I thought this was set in flow). It felt like a bit too much to pull in all this logic, so I would rather just wait for linalg.tile to be ready. What do you think? If the logic is simpler after your refactoring, maybe I can match it, but I would rather avoid adding a fragile analysis of the subview op.

Contributor:

I sent you some WIP where I am trying to solve a similar problem on the SPIR-V side. We actually don't need the attribute and the traversal. All the ops we need to traverse implement a ViewOpInterface (something like this: https://github.com/llvm/llvm-project/blob/d61b40ed27509d8a99b4d85499a8d5ca9f37f111/mlir/lib/Dialect/Linalg/Analysis/DependenceAnalysis.cpp#L69), so we can avoid the attribute-based hand-shake, which is very shaky. I was going to try this out today. If you want, you can wait for me to finish that up and use that instead.

Contributor Author:

Interesting, so it still requires going up the dependency chain to find the first subview op? I don't see how else we can handle this, since the original shape information is no longer in the tiled op.
I won't really need this for a long time, I think, so I would rather avoid going for an intermediate solution.

Contributor:

SubView implements the ViewOpInterface. It returns the source, and you can recurse to get the first subview op.
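For illustration, a minimal sketch of that recursion, assuming MLIR's ViewLikeOpInterface (which SubViewOp implements); this mirrors the approach discussed here and in the linked DependenceAnalysis.cpp helper, not the WIP code itself:

```cpp
// Sketch only: walk view-like producers (e.g. subviews) back to the source buffer.
#include "mlir/IR/Value.h"
#include "mlir/Interfaces/ViewLikeInterface.h"

// Recurse through ops implementing ViewLikeOpInterface until we reach a value
// with no view-like producer: the original memref. Each iteration peels one
// view, so the loop stops at a block argument or a non-view-like op.
static mlir::Value getViewSource(mlir::Value view) {
  while (auto viewOp = view.getDefiningOp<mlir::ViewLikeOpInterface>())
    view = viewOp.getViewSource();
  return view;
}
```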

Contributor Author:

So you are saying we can just have a function that recurses through the subview ops until it finds the original memref? Yeah, that makes sense. Thanks for sharing the code; I think I can just copy what you have. I would rather keep this change for a future PR in order to not add extra complexity to this one, if that is okay with you.

Contributor:

Sounds good. Let me do a quick review.

Enable vectorization for element-wise ops
ThomasRaoux merged commit 23861f7 into iree-org:main on Apr 1, 2021
copybara-service bot pushed a commit that referenced this pull request Apr 6, 2021
* 6bd5658 Merge google -> main (#5319)
* 2e5257d Merge branch 'main' into google-to-main
* 6936ee7 Patch VMLA performance by reserving vector size before pushing to it. (#5316)
* f2f0041 NFC: Cleanup ConcretizeTileAmongstWorkgroupsPass. (#5297)
* f96726a Add tests to run few other (smaller) models with Linalg on tensors path. (#5306)
* fd64070 Revert "Add wasm-micro-runtime submodule and get building with CMake." (#5312)
* ce0285f Continue pruning abseil usage: switch from absl::InlinedVector to std::vector...
* 71e24b6 Removing hal.buffer.fill and hal.buffer.copy. (#5307)
* 3c611d3 Add Mako benchmark config template file. (#5200)
* 4d1a394 Fix RFFT bugs in VMLA. (#5308)
* 0d55c95 Add configure_bazel.py step to TensorFlow getting started doc.
* 1386d2c Switch simple_embedding_test to include drivers explicitly. (#5304)
* 402550b Add StripAsserts pass and handle tf.Identity ops on tensor lists. (#5294)
* fbdb4ef Add new metrics to MobileNetV2 benchmarks. (#5301)
* 99c8eac Implementing Vulkan dispatch tracing. (#5287)
* 2681dff Insert clones prior to mutation and not where it originates. (#5292)
* aeafd9e Fix CUDA HAL bug and enable more execution tests (#5296)
* 2801780 [CUDA Codegen] Enable tiling and vectorization for MatMulOp (#5293)
* c61fefe Extend AffineMin canonicalization to support scf.parallel (#5289)
* e0ee3f3 Add directory for microbenchmarking (#5260)
* b8da32c Set wasm-export-name attributes on exported functions again. (#5286)
* e2a2f81 Canonicalize affine min before applying tile-and-vecotrize passes (#5285)
* 23861f7 [CUDA codegen] add vectorization infrastructure (#5278)
* 6f443c4 Drop deps on Abseil's core_headers, synchronization, macros. (#5275)
* e5b9e8a Actually run MobileNet with fake weights to check correctness (#5284)
* e56db9a Remove dead code in LinalgToSPIRV (#5281)
* 8863aa1 [NFC] Fix typos in variable names. (#5279)
* 9cd93ba Turn vectorization on by default for linalg on tensors path (#5280)
* 894dac6 Merge google -> main #5276
* b738162 Changing HAL dialect syntax to express all types. (#5239)
* 1ba4e88 Merge branch 'main' into google-to-main
* 531c73e Fix yml syntax (#5274)
* 494fe32 Bumping the tracy version to 0.7.7 (WIP). (#5272)
* 3616323 Disable Vulkan float16 tests on Pixel4 (#5273)
* ade7ff1 Disable running BERT on Vulkan (see Issue #5268) (#5269)
* 25ddc10 Add tracing to allocations made from VMA. (#5271)
* df454f4 Changing iree_vm_list_resize to grow by 2x. (#5270)
* bd9a113 Adding command buffer queue affinity. (#5265)
* de834ae Make status matcher print the message when it fails. (#5266)
* 10f5eaf Add f16 e2e tests for vulkan (#5257)
* 1bdc3a4 Actually make MobileBERT run in the test. (#5264)
* 2e05313 Add support for module almost_eq check for f16 type (#5261)

COPYBARA_INTEGRATE_REVIEW=#5321 from NatashaKnk:main-to-google 6bd5658
PiperOrigin-RevId: 366926967
GMNGeoffrey pushed a commit that referenced this pull request Apr 6, 2021
(same commit list as above)

PiperOrigin-RevId: 366926967