Plumb pooling ops through Linalg on tensors path. #5136
Conversation
This will need to replace:

```cpp
ts[ts.size() - 1] = tileSizeX;
SmallVector<int64_t, 4> ts;
if (options.usingLinalgOnTensors) {
  ts.assign({0, tileSizeY, tileSizeX, 1});
```
To be honest, I don't really understand how to set this and what the constraints are...
Should the size of `ts` be the same as the number of parallel loops in the Linalg on tensors path? What is the relationship between this and `workloadPerWorkgroup`? I tried `{0, tileSizeY, tileSizeX, 0}`, but it hit an assertion:

Assertion `workloadPerWorkgroup.size() <= 3 && "workloadPerWorkgroup size greater than 3 not handled"`

Right now I just got lucky with a config that makes it work. I'm happy to learn more about this.
Yes, Hanhan, you are right that this is fairly haphazard. It's on my list of things to clean up.
The overall goal here is that you only want to partition up to 3 loops at the Flow level. The HAL layer can only support 3 processor dimensions: x, y, z. Flow could in theory handle any number of parallelism modes, but then lowering from Flow -> HAL gets tricky. So for now the assumption is to keep at most 3 levels of parallelism at the Flow level.
W.r.t. how to decide which loops to parallelize, which not to, and what tile size to use: we should uniformly say that a tile size of 0 means the loop is not parallelized, both at the Flow level when the distribution happens and here when setting the configuration.
Let me look through your change in a bit. It's the thing I want to get to next, so I will have to see what's done currently and then give you feedback here.
It's complicated for three main reasons: 1) the existing buffer path vs. the new tensor path, 2) convolution/pooling ops have more than 3 parallel dimensions, but right now we can only support <= 3 Flow tiling, 3) the tile sizes here match the linalg op's implicit loop order, which is typically the reverse of the distribution order.
In the existing buffer path we by default always tile along the first three parallel dimensions, namely N, OH, OW. That's why you see only three numbers here for the existing buffer path: they are for the first 3 implicit loops of conv/pooling ops, namely N, OH, OW. That's not very good because N is typically 1 in inference, so we are effectively dropping one dimension of parallelism. That's why we want to tile along the second three parallel dimensions instead, namely OH, OW, OC. The new tensor-based approach adopts this scheme for conv by having the special check in DispatchLinalgOnTensorsPass force tiling along OH, OW, OC. Now remember that these tiling configurations need to match the implicit loop order in linalg ops, so we need a special leading 0 to disable tiling on the first parallel dimension, namely N. Given you made pooling ops be treated like convolution ops, you need the same thing here. Hope this clarifies. :)
It will improve a bit once we can remove the existing buffer path, but the special checks in DispatchLinalgOnTensorsPass will remain. I actually think making Flow-level workgroups truly n-D (for arbitrary n) would be a good way to address those special checks. I was doing that in the initial prototype. I will send out a pull request later to show what it looks like and whether we should proceed with it.
Now I understand these much better, thanks for the detailed explanation!
```cpp
static LogicalResult convertInitTensorOp(OpBuilder &b,
                                         linalg::InitTensorOp initTensorOp,
                                         BlockAndValueMapping &bvm) {
  RankedTensorType tensorType = initTensorOp.getType();
```
You probably need to check whether the result is already mapped to a buffer here. There is a pre-processing step for Linalg operations that tries to map the result buffer (if it is already mapped) to the `outs` operand, and that would be undone here.
I see, thanks!
Hey Hanhan, this is probably needed before this can land.
I probably misunderstood something. Does `if (bvm.contains(initTensorOp.getResult())) return success();` address the comment? I thought `bvm` contains the mapping between `outs` and buffers.
Right. That works. Missed that.
@antiagainst I would push back against going down this path. There is really no reason we should have non-trivial logic to convert nD distribution to 3D distribution in the Flow to HAL conversion. It is less surprising to just expose only three levels of parallelism at the Flow level itself. As we have discussed before, eventually everything needs to map to a 3D grid because that is what hardware supports. We know what op we are targeting, and it would just be cleaner if we did not partition the operation to more than 3D to begin with, by using a tile size of 0. It makes things less surprising. The only use case I can think of for considering more-than-3D partitioning is if a HAL backend needs it.
Understood your points. Not saying I'm super attached to it and must go with it. (I didn't implement it that way when plumbing through conv support. ;-P) Note that my point was just to send a pull request to show what it looks like, given I initially had it already and I think it's not that complicated.
+1. We claim Flow workgroups can be arbitrary-D, but in reality only <= 3 can be used, so that's just a confusing point. If we aren't going to use it, it's better to drop it. :)
Clear from me. But please address Mahesh's comments.
Force-pushed from 558e436 to 7a84a8d.
Rebased and deleted the special logic that is handled in #5172.
@MaheshRavishankar please take another look. This is rebased on a commit after 9f968d0. This needs your approval because you requested changes. Thanks!
Hey Hanhan, AFAICS there is still one comment w.r.t. the InitTensor initialization that needs to be addressed?
1) Set pooling ops as root ops.
2) Convert linalg.init_tensor to an AllocOp.
3) Update the pooling ops' launch config on both codegen paths.

(2) is only done for shaped operands. The alloc op will be deleted once lowered to loops. This is the same behavior as the Linalg on buffers path.

Part of iree-org#5043
* 82d8765 Merge branch 'google' into main-to-google
* b7e2b09 Remove special casing for IreeFileCheck in iree_lit_test.cmake (#5209)
* b67af5a Clean up several CMake options configured for our CI builds. (#5197)
* ea8b82d Integrate MLIR-EmitC at iml130/mlir-emitc@68547d0 (#5205)
* 10d9ea1 Make CPU/GPU end-to-end testing in Linalg on tensors/buffers align. (#5203)
* ff69d02 Move prepration of benchmark files to python script. (#5201)
* 0c2f9e8 Bump Abseil-cpp submodule to LTS 20200923, Patch 3 (#5198)
* 6ac0023 Add runtime tracing support to passes under IREE's PassManager usage. (#5160)
* 2e41217 Bump Benchmark submodule to 39c8d58 (#5196)
* 663ab93 Adjust UnixLinkerTool flags used for Apple targets. (#5164)
* 963beb9 Plumb pooling ops through Linalg on tensors path. (#5136)
* a33f72f Plumb depthwise convolution in Linalg on tensors. (#5189)
* 2864ee4 Add restrictions for enabling generic op vectorization. (#5188)
* 4f218a8 Make sure convolution vectorization is off when the option flag is off (#5187)
* 44736b5 build_tools:cmake:riscv: Fix the system library path (#5175)
* 96e9c49 Switching from mhlo::ConvertOp to ZeroExtendI for i1->i8 conversion. (#5119)
* da4ad51 Fix and re-enable jax testing (#5183)
* 301bc15 Merge google -> main (#5182)
* 400719c Shuffling around/removing some HLO deps.
* 9ba1d57 Enable MobileBert on GPU for the linalg on tensors path. (#5180)
* 52f02af Resolve name conflicts while linking executables in LLVM. (#5179)
* 4a732d3 Removing the post-partitioning conversion pass as it is unused. We've built up..

COPYBARA_INTEGRATE_REVIEW=#5212 from ScottTodd:main-to-google 82d8765
PiperOrigin-RevId: 364658113