[LLVMCPU] Add an option for tiling reduction only to LLVMCPUTile. #13821

hanhanW · 2023-05-26T23:11:30Z

If the option is true, only the ops that have reduction loops are tiled. This is useful because it lets us tile the reduction ops first and apply tileAndFuse to the other operations later. We can greedily apply tileAndFuse to consumers because the reduction op will no longer be pulled in: the scf.for created by tiling the reduction acts as a barrier that stops fusion into reductions.

The changes to LLVMTileAndFuse are needed together with this because we follow the same pipeline behavior. We now need to use TileAndFuse in the last level of tiling for consumers. If there are no consumers, it should not be applied to reduction ops.

It is a step toward #13706 and #13474
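As a rough illustration of the filtering described above, here is a minimal sketch of a "reduction only" selection step. It is a toy model, not the pass implementation: `ToyOp`, `IteratorKind`, `hasReductionLoop`, and `selectOpsToTile` are hypothetical names introduced only for this example and do not exist in IREE or MLIR.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not IREE or MLIR types.
enum class IteratorKind { Parallel, Reduction };

struct ToyOp {
  std::string name;
  std::vector<IteratorKind> loops;  // one entry per loop of the op
};

// True if the op has at least one reduction loop.
bool hasReductionLoop(const ToyOp &op) {
  return std::any_of(op.loops.begin(), op.loops.end(), [](IteratorKind k) {
    return k == IteratorKind::Reduction;
  });
}

// Returns the names of the ops that a tiling step would visit. With
// reductionOnly set, ops without reduction loops are skipped so that a later
// tile-and-fuse step can handle them.
std::vector<std::string> selectOpsToTile(const std::vector<ToyOp> &ops,
                                         bool reductionOnly) {
  std::vector<std::string> selected;
  for (const ToyOp &op : ops) {
    if (reductionOnly && !hasReductionLoop(op)) continue;
    selected.push_back(op.name);
  }
  return selected;
}
```

For a reduction + broadcast + pack chain, calling `selectOpsToTile(ops, /*reductionOnly=*/true)` in this model selects only the reduction op, while passing `false` selects all three.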

github-actions · 2023-05-27T00:26:22Z

Abbreviated Benchmark Summary

@ commit 04f6c7ebf34013c0651096474fcb8a2055e88b96 (vs. base 8c41e5177f6a8cc33200bb91c8b01dee72bf0942)

Regressed Latencies 🚩

Benchmark Name	Average Latency (ms)	Median Latency (ms)	Latency Standard Deviation (ms)
DeepLabV3\_fp32(tflite) [armv8.2-a-generic-linux\_android29-llvm\_cpu][experimental-flags,mmt4d] local\_sync(embedded\_elf)[full-inference,default-flags] with zeros @ pixel-4[big-core]	77.736 (vs. 72.468, 7.27%↑)	77.771	0.117

Improved Latencies 🎉

Benchmark Name	Average Latency (ms)	Median Latency (ms)	Latency Standard Deviation (ms)
DeepLabV3\_fp32(tflite) [armv8.2-a-generic-linux\_android29-llvm\_cpu][default-flags] local\_sync(embedded\_elf)[full-inference,default-flags] with zeros @ pixel-4[big-core]	72.755 (vs. 81.256, 10.46%↓)	77.671	7.450
PoseNet\_fp32(tflite) [armv8.2-a-generic-linux\_android29-llvm\_cpu][default-flags] local\_task(embedded\_elf)[4-thread,full-inference,system-scheduling] with zeros @ pixel-4[big-core]	76.649 (vs. 83.602, 8.32%↓)	76.263	1.379
MobileNetV3Small\_fp32(tflite) [vmvx-generic-vmvx-vmvx][experimental-flags] local\_task(vmvx\_module)[4-thread,full-inference,system-scheduling] with zeros @ pixel-6-pro[big-core]	994.456 (vs. 1070.466, 7.10%↓)	1001.623	29.884

No improved or regressed compilation metrics 🏖️

MaheshRavishankar · 2023-05-30T16:08:39Z

compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp

@@ -447,7 +447,9 @@ void addMultiTilingExpertPassPipeline(OpPassManager &passManager,
  nestedModulePM.addNestedPass<func::FuncOp>(
      createLLVMCPUSplitReductionPass(clEnableReassociateFpReductions));
  nestedModulePM.addNestedPass<func::FuncOp>(
-      createLLVMCPUTilePass(numLevels - 1));
+      createLLVMCPUTilePass(numLevels - 1, /*reductionOnly=*/true));


This ordering might be problematic..... Once you tile the reduction, there is no real opportunity for tile and fuse... Why do we need the tile and fuse after this layer?

The order is intended. This is for the last level of tiling, i.e., the tiling level right before vectorization. We want to tile ops individually. We first tile the reduction ops and then handle the consumer ops. There is no difference between Tile and TileAndFuse if we have a single consumer op; both just tile the consumer op. But it matters if there is a chain of consumer ops, e.g., reduction + broadcast + tensor.pack/pad ops. I want to tile and fuse the broadcast + pack ops in this case.

(maybe I should add a comment, that would help others to understand that it's intended)

Sorry, I'm still not getting why we need this order. Couldn't we just use a loop and pass the right enum values for each dim as we are doing in other pipelines?

(Now that I get it) This is what I meant: https://github.com/hanhanW/iree/blob/multi-lowering-config/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp#LL438C1-L438C1

Instead of invoking the pass for all the dimensions, couldn't we just add a loop that takes care of the parallel/reduction dimensions using the enums?

I don't get the idea. I thought using enums would be done in your multi-level tiling PR.

What I did here is create different passes for the last level of tiling (which is intended to be the vector level). What do you mean by "invoking the pass for all the dimensions"? Am I doing it now?

The reduction tiling still takes the same config (e.g., [0, 0, 16]) for tiling. It only tiles the reduction loop.
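To make the `[0, 0, 16]` example concrete, here is a small sketch under the usual MLIR tiling convention that a tile size of 0 leaves a loop untiled. The helper names `tiledLoopIndices` and `numTiles` are hypothetical and exist only for this illustration; they are not part of IREE.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the indices of the loops that get tiled for a given tile-size list.
// A tile size of 0 is treated as "do not tile this loop".
std::vector<std::size_t> tiledLoopIndices(
    const std::vector<std::int64_t> &tileSizes) {
  std::vector<std::size_t> tiled;
  for (std::size_t i = 0; i < tileSizes.size(); ++i) {
    if (tileSizes[i] != 0) tiled.push_back(i);
  }
  return tiled;
}

// Number of iterations of a tiled loop with a static, positive extent and a
// positive tile size, rounding up for a partial last tile.
std::int64_t numTiles(std::int64_t extent, std::int64_t tileSize) {
  return (extent + tileSize - 1) / tileSize;
}
```

With `[0, 0, 16]`, only the loop at index 2 is tiled in this model. If that loop is the reduction loop and has an assumed extent of 64, the resulting scf.for would run 4 iterations.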

TileAndFuse for consumer ops is tricky. They reuse the same configuration. I'm still working on adding multi-config support. I see this as a transition state; it is an incremental change towards multi lowering configs.

Oh, we only have enums for workgroups right now it seems? https://github.com/openxla/iree/blob/main/compiler/src/iree/compiler/Codegen/LLVMCPU/KernelDispatch.h#L22

I was hoping we could do the dimension selection/filtering on the caller side, as we do in hanhanW/iree@multi-lowering-config/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp#LL438C1-L438C1, so that LLVMCPUTile/TileAndFuse only have to worry about the tiling and not the filtering. If that is not possible, please go ahead with this as is.

Yes, we only have enums for workgroups right now. The passes themselves only worry about the tiling. The filtering is handled by us, i.e., when we create the pass.

dcaballe · 2023-05-30T18:33:48Z

compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPUPasses.h

@@ -69,7 +69,7 @@ std::unique_ptr<OperationPass<func::FuncOp>> createLLVMCPUTileAndFusePass(

 /// Pass to tile TilingInterface ops with given tilingLevel.
 std::unique_ptr<OperationPass<func::FuncOp>> createLLVMCPUTilePass(
-    int64_t tilingLevel = -1);
+    int64_t tilingLevel = -1, bool reductionOnly = false);


I'm a bit confused. Why do we need this flag? Couldn't we use the tilingLevel to pass the specific reduction level we want to tile?

I want to tile and fuse consumer ops at the vector level. Let's take reduction + broadcast + pack as an example. What I want is:

    scf.for ... // Tiling on reduction loop
      reduction op
    scf.for ... // Tile and fuse for broadcast + pack
      broadcast
      pack

If the reductionOnly flag is not passed, the pass will tile all the ops individually, which results in

    scf.for ... // Tiling on reduction loop
      reduction op
    scf.for ... // Tiling on broadcast op
      broadcast
    scf.for ... // Tiling on pack op
      pack
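The difference between the two loop structures can be sketched with a toy model in which each group of op names stands for one scf.for loop nest. This is only an illustration of the grouping, not of how the passes are implemented; `LoopNests`, `tileIndividually`, and `tileReductionThenFuseConsumers` are hypothetical names.

```cpp
#include <string>
#include <vector>

// Hypothetical toy model: each inner vector lists the ops that share one
// scf.for loop nest. None of this is IREE code.
using LoopNests = std::vector<std::vector<std::string>>;

// Without the reduction-only option: every op is tiled on its own, so each
// op ends up in a separate loop nest.
LoopNests tileIndividually(const std::vector<std::string> &ops) {
  LoopNests nests;
  for (const std::string &op : ops) nests.push_back({op});
  return nests;
}

// With the reduction-only option followed by tile-and-fuse: the reduction op
// gets its own loop nest, and all consumers are fused into a second one.
// With no consumers, no second loop nest is created.
LoopNests tileReductionThenFuseConsumers(
    const std::string &reductionOp,
    const std::vector<std::string> &consumers) {
  LoopNests nests;
  nests.push_back({reductionOp});
  if (!consumers.empty()) nests.push_back(consumers);
  return nests;
}
```

For reduction + broadcast + pack, the first function yields three loop nests and the second yields two, matching the two listings above.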

Ok, thanks! Got it now

dcaballe · 2023-05-30T18:41:30Z

compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp

@@ -447,7 +447,9 @@ void addMultiTilingExpertPassPipeline(OpPassManager &passManager,
  nestedModulePM.addNestedPass<func::FuncOp>(
      createLLVMCPUSplitReductionPass(clEnableReassociateFpReductions));
  nestedModulePM.addNestedPass<func::FuncOp>(
-      createLLVMCPUTilePass(numLevels - 1));
+      createLLVMCPUTilePass(numLevels - 1, /*reductionOnly=*/true));


Sorry, I'm still not getting why we need this order. Couldn't we just use a loop and pass the right enum values for each dim as we are doing in other pipelines?

…ile." (#13867) Reverts #13821. It introduces a definition issue with the third tile size list. Revert the commit; we will land it in another way, with a concrete definition for each tile list.

hanhanW requested review from dcaballe and MaheshRavishankar as code owners May 26, 2023 23:11

hanhanW changed the title ~~[LLVMCPU] Add an option to tile reduction only to LLVMCPUTile.~~ [LLVMCPU] Add an option for tiling reduction only to LLVMCPUTile. May 26, 2023

hanhanW added the benchmarks:x86_64 (run default x86_64 benchmarks), benchmarks:comp-stats (run default compilation statistics benchmarks), and benchmarks:android-cpu (run default Android CPU benchmarks) labels May 26, 2023

hanhanW mentioned this pull request May 26, 2023

[CPU] Enable 'iree-llvmcpu-reassociate-fp-reductions' by default #13822

Merged

add an early-skip to TileAndFuse

f7fef7a

MaheshRavishankar requested changes May 30, 2023

MaheshRavishankar approved these changes May 30, 2023

add a comment for vector level tiling

6809dc5

hanhanW enabled auto-merge (squash) May 30, 2023 18:29

dcaballe reviewed May 30, 2023

hanhanW disabled auto-merge May 30, 2023 18:47

hanhanW merged commit c9c2e83 into iree-org:main May 30, 2023

hanhanW deleted the multi-lowering-config branch May 30, 2023 20:39

hanhanW added a commit that referenced this pull request May 31, 2023

Revert "[LLVMCPU] Add an option for tiling reduction only to LLVMCPUT…

ce3b654

…ile. (#13821)" This reverts commit c9c2e83.

hanhanW mentioned this pull request May 31, 2023

Revert "[LLVMCPU] Add an option for tiling reduction only to LLVMCPUTile." #13867

Merged
