
Plumb pooling ops through Linalg on tensors path. #5136

Merged · 2 commits · Mar 22, 2021

Conversation

hanhanW (Contributor) commented Mar 17, 2021

  1. Set pooling ops as root ops.
  2. Convert linalg.init_tensor to an AllocOp.
  3. Set the pooling ops launch config for the Linalg on tensors path.

The alloc in (2) is only created for shaped operands. The alloc op will be deleted once lowered to loops. This is the same behavior as the Linalg on buffers path.

Part of #5043

hanhanW (Author) commented Mar 17, 2021

This will need AllocOp replaced with memref::AllocOp after the google -> main integrate happens.

```cpp
  ts[ts.size() - 1] = tileSizeX;
  SmallVector<int64_t, 4> ts;
  if (options.usingLinalgOnTensors) {
    ts.assign({0, tileSizeY, tileSizeX, 1});
```
hanhanW (Author) commented Mar 17, 2021

To be honest, I don't really understand how to set this and what the constraints are...

Should the size of ts be the same as the number of parallel loops in the Linalg on tensors path? What is the relationship between this and workloadPerWorkgroup? I tried `0, tileSizeY, tileSizeX, 0`, but it hit an assertion:

Assertion `workloadPerWorkgroup.size() <= 3 && "workloadPerWorkgroup size greater than 3 not handled"`

Right now I just got lucky with a config that makes it work. I'm happy to learn more about this.

Contributor replied:

Yes, Hanhan, you are right that this is fairly haphazard. It's on my list of things to clean up.

The overall goal is that you only want to partition up to 3 loops at the Flow level. The HAL layer can only support 3 levels of processor x, y, z. Flow could in theory handle any number of parallelism modes, but it then gets tricky to lower from Flow -> HAL. So for now the assumption is to keep at most 3 levels of parallelism at the Flow level.

W.r.t. how to decide which loops to parallelize and which not to, and what tile sizes to use: we should uniformly say that a tile size of 0 means the loop is not parallelized, both at the Flow level when the distribution happens and here when setting the configuration.

Let me look through your change in a bit. It's the thing I want to get to next, so I will have to see what's done currently and then give you feedback here.

antiagainst (Contributor) replied:

It's complicated for three main reasons: 1) the existing buffer path vs. the new tensor path, 2) convolution/pooling ops have more than 3 parallel dimensions but right now we can only support <= 3 Flow tiling, 3) the tile sizes here match the linalg op's implicit loop order, which is typically the reverse of the distribution order.

In the existing buffer path we by default always tile along the first three parallel dimensions, namely N, OH, OW. That's why you see only three numbers here for the existing buffer path: they are for the first 3 implicit loops of conv/pooling ops, namely N, OH, OW. That's not very good, because N is typically 1 in inference, so we are effectively dropping one dimension of parallelism. That's why we want to tile along the next three parallel dimensions, namely OH, OW, OC. The new tensor-based approach adopts this scheme for conv by having a special check in DispatchLinalgOnTensorsPass to force tiling along OH, OW, OC. Now remember that these tiling configurations need to match the implicit loop order in linalg ops, so we need a special leading 0 to disable tiling on the first parallel dimension, namely N. Given that you made pooling ops be treated similarly to convolution ops, you need the same thing here. Hope this clarifies. :)

It will improve a bit once we can remove the existing buffer path, but the special checks in DispatchLinalgOnTensorsPass will remain. I actually think making Flow-level workgroups really n-D (n being an arbitrary number) would be a good way to address the special checks. I was doing that in the initial prototype. Will send out a pull request later to show how it looks and whether we should proceed with it.

hanhanW (Author) replied:

Now I understand more about these, thanks for the detailed explanation!


```cpp
static LogicalResult convertInitTensorOp(OpBuilder &b,
                                         linalg::InitTensorOp initTensorOp,
                                         BlockAndValueMapping &bvm) {
  RankedTensorType tensorType = initTensorOp.getType();
```
Contributor commented:

You probably need to check whether the result is already mapped to a buffer here. There is a pre-processing step for Linalg operations that tries to map the result buffer (if it is already mapped) to the outs operand; that mapping would be undone here.

hanhanW (Author) replied:

I see, thanks!

Contributor replied:

Hey Hanhan, this is probably needed before this can land.

hanhanW (Author) replied:

Does `if (bvm.contains(initTensorOp.getResult())) return success();` address the comment? I thought bvm contains the mapping between outs and buffers.

Contributor replied:

Right. That works. Missed that.

MaheshRavishankar (Contributor) commented:

> It will improve a bit once we can remove the existing buffer path. But the special checks in DispatchLinalgOnTensorsPass will remain. I would actually think making flow level workgroups really n-D (n being arbitrary number) good to address the special checks. I was doing that in the initial prototype. Will send out a pull request later to show how it looks and whether we should proceed with it.

@antiagainst I would push back against going down this path. There is really no reason we should have non-trivial logic to convert n-D distribution to 3D distribution in the Flow to HAL conversion. It is less surprising to just expose only three levels of parallelism at the Flow level itself. As we have discussed before, eventually everything needs to map to a 3D grid because that is what hardware supports. We know what op we are targeting, and it would just be cleaner if we did not partition the operation to more than 3D to begin with, by using a tile size of 0. It makes things less surprising. The only use case I can think of for more than 3D partitioning is if a HAL backend needs it.

@hanhanW hanhanW changed the base branch from main to google March 18, 2021 12:15
@hanhanW hanhanW changed the base branch from google to main March 18, 2021 12:16

antiagainst (Contributor) commented:

> > It will improve a bit once we can remove the existing buffer path. But the special checks in DispatchLinalgOnTensorsPass will remain. I would actually think making flow level workgroups really n-D (n being arbitrary number) good to address the special checks. I was doing that in the initial prototype. Will send out a pull request later to show how it looks and whether we should proceed with it.
>
> @antiagainst I would push back against going down this path. There is really no reason we should have some non-trivial logic to convert nD distribution to 3D distribution in the Flow to HAL conversion. It is less surprising to just expose only three levels of parallelism at the Flow level itself. As we have discussed before, eventually everything needs to map to a 3D grid cause that is what hardware supports. We know what op we are targeting and would just be cleaner if we did not partition the operation to more than 3D to begin with by using the tile size of 0. It makes things less surprising. The only use case I can think of for considering more than 3D partitioning is if a HAL backends needs more than 3D partitioning.

Understood your points. I'm not saying I'm super attached to that and must go with it. (I didn't implement it that way when plumbing through conv support. ;-P) My point was just to send a pull request to show how it looks, given I initially have it already and I think it's not that complicated.

> It is less surprising to just expose only three levels of parallelism at the Flow level itself.

+1. We claim Flow workgroups can be arbitrary-D, but in reality only <= 3 can be used; that's just a confusing point. If we aren't gonna use it, it's better to drop it. :)

antiagainst (Contributor) commented:

Clear from me. But please address Mahesh's comments.

@hanhanW hanhanW force-pushed the pooling branch 2 times, most recently from 558e436 to 7a84a8d Compare March 19, 2021 16:43
hanhanW (Author) commented Mar 19, 2021

Rebased and deleted the special logic that is handled in #5172.

hanhanW (Author) commented Mar 21, 2021

@MaheshRavishankar please take another look. This is rebased on a commit after 9f968d0. It needs your approval because you requested changes. Thanks!

MaheshRavishankar (Contributor) commented:

Hey Hanhan, AFAICS there is still one comment w.r.t. the InitTensor initialization that needs to be addressed?

hanhanW (Author) commented Mar 22, 2021

I probably misunderstood something. Does `if (bvm.contains(initTensorOp.getResult())) return success();` address the comment? I thought bvm contains the mapping between outs and buffers.


1) Set pooling ops as root ops.
2) Convert linalg.init_tensor to an AllocOp.
3) Update pooling ops launch config on both codegen paths.

The alloc in (2) is only created for shaped operands. The alloc op will be deleted once lowered to loops. This is the same behavior as the Linalg on buffers path.

Part of iree-org#5043
@hanhanW hanhanW merged commit 963beb9 into iree-org:main Mar 22, 2021
@hanhanW hanhanW deleted the pooling branch March 22, 2021 17:49
@hanhanW hanhanW mentioned this pull request Mar 22, 2021
@ScottTodd ScottTodd mentioned this pull request Mar 23, 2021
copybara-service bot pushed a commit that referenced this pull request Mar 23, 2021
* 82d8765 Merge branch 'google' into main-to-google
* b7e2b09 Remove special casing for IreeFileCheck in iree_lit_test.cmake (#5209)
* b67af5a Clean up several CMake options configured for our CI builds. (#5197)
* ea8b82d Integrate MLIR-EmitC at iml130/mlir-emitc@68547d0 (#5205)
* 10d9ea1 Make CPU/GPU end-to-end testing in Linalg on tensors/buffers align. (#5203)
* ff69d02 Move prepration of benchmark files to python script. (#5201)
* 0c2f9e8 Bump Abseil-cpp submodule to LTS 20200923, Patch 3 (#5198)
* 6ac0023 Add runtime tracing support to passes under IREE's PassManager usage. (#5160)
* 2e41217 Bump Benchmark submodule to 39c8d58 (#5196)
* 663ab93 Adjust UnixLinkerTool flags used for Apple targets. (#5164)
* 963beb9 Plumb pooling ops through Linalg on tensors path. (#5136)
* a33f72f Plumb depthwise convolution in Linalg on tensors. (#5189)
* 2864ee4 Add restrictions for enabling generic op vectorization. (#5188)
* 4f218a8 Make sure convolution vectorization is off when the option flag is off (#5187)
* 44736b5 build_tools:cmake:riscv: Fix the system library path (#5175)
* 96e9c49 Switching from mhlo::ConvertOp to ZeroExtendI for i1->i8 conversion. (#5119)
* da4ad51 Fix and re-enable jax testing (#5183)
* 301bc15 Merge google -> main (#5182)
* 400719c Shuffling around/removing some HLO deps.
* 9ba1d57 Enable MobileBert on GPU for the linalg on tensors path. (#5180)
* 52f02af Resolve name conflicts while linking executables in LLVM. (#5179)
* 4a732d3 Removing the post-partitioning conversion pass as it is unused. We've built up..

COPYBARA_INTEGRATE_REVIEW=#5212 from ScottTodd:main-to-google 82d8765
PiperOrigin-RevId: 364658113
iree-github-actions-bot pushed a commit that referenced this pull request Mar 23, 2021