Plumb pooling ops through Linalg on tensors path. #5136
Conversation
This will need to replace:

```cpp
ts[ts.size() - 1] = tileSizeX;
SmallVector<int64_t, 4> ts;
if (options.usingLinalgOnTensors) {
  ts.assign({0, tileSizeY, tileSizeX, 1});
```
To be honest, I don't really understand how to set this and what the constraints are...
Should the size of `ts` be the same as the number of parallel loops in the Linalg on tensors path? What is the relationship between this and `workloadPerWorkgroup`? I tried `{0, tileSizeY, tileSizeX, 0}`, but it hit an assertion:

Assertion `workloadPerWorkgroup.size() <= 3 && "workloadPerWorkgroup size greater than 3 not handled"`

Right now I just got lucky with a config that makes it work. I'm happy to learn more about this.
Yes, Hanhan, you are right that this is fairly haphazard. It's on my list of things to clean up.
The overall goal here is that you only want to partition up to 3 loops at the Flow level. The HAL layer can only support 3 processor dimensions: x, y, z. Flow could in theory handle any number of parallelism modes, but then lowering from Flow -> HAL gets tricky. So for now the assumption is to keep at most 3 levels of parallelism at the Flow level.
W.r.t. how to decide which loops to parallelize, which not to, and what tile size to use: we should uniformly say that a tile size of 0 means the loop is not parallelized, both at the Flow level when the distribution happens and here when setting the configuration.
Let me look through your change in a bit. It's the thing I want to get to next, so I will have to see what's done currently and then give you feedback here.
It's complicated for three main reasons: 1) the existing buffer path vs. the new tensor path, 2) convolution/pooling ops have more than 3 parallel dimensions, but right now we can only support <= 3 Flow tiling, 3) the tile sizes here match the linalg op's implicit loop order, which is typically the reverse of the distribution order.
In the existing buffer path we by default always tile along the first three parallel dimensions, namely N, OH, OW. That's why you see only three numbers here for the existing buffer path: they are for the first 3 implicit loops of conv/pooling ops, namely N, OH, OW. That's not very good because N is typically 1 in inference, so we are effectively dropping one dimension of parallelism. That's why we want to tile along the second three parallel dimensions instead, namely OH, OW, OC. The new tensor-based approach adopts this scheme for conv by having the special check in DispatchLinalgOnTensorsPass force tiling along OH, OW, OC. Now remember that these tiling configurations need to match the implicit loop order in linalg ops, so we need a special leading 0 to disable tiling on the first parallel dimension, namely N. Given you made pooling ops be treated like convolution ops, you need the same thing here. Hope this clarifies. :)
It will improve a bit once we can remove the existing buffer path, but the special checks in DispatchLinalgOnTensorsPass will remain. I actually think making Flow-level workgroups truly n-D (for arbitrary n) would be a good way to address those special checks. I was doing that in the initial prototype. I will send out a pull request later to show what it looks like and whether we should proceed with it.
Now I understand these much better, thanks for the detailed explanation!
```cpp
static LogicalResult convertInitTensorOp(OpBuilder &b,
                                         linalg::InitTensorOp initTensorOp,
                                         BlockAndValueMapping &bvm) {
  RankedTensorType tensorType = initTensorOp.getType();
```
You probably need to check whether the result is already mapped to a buffer here. There is a pre-processing step for Linalg operations that tries to map the result buffer (if it is already mapped) to the `outs` operand, and that would be undone here.
I see, thanks!
Hey Hanhan, this is probably needed before this can land.
I probably misunderstood something. Does `if (bvm.contains(initTensorOp.getResult())) return success();` address the comment? I thought `bvm` contains the mapping between `outs` and buffers.
Right. That works. Missed that.
@antiagainst I would push back against going down this path. There is really no reason we should have non-trivial logic to convert nD distribution to 3D distribution in the Flow to HAL conversion. It is less surprising to just expose only three levels of parallelism at the Flow level itself. As we have discussed before, eventually everything needs to map to a 3D grid because that is what hardware supports. We know what op we are targeting, and it would just be cleaner if we did not partition the operation to more than 3D to begin with, by using a tile size of 0. It makes things less surprising. The only use case I can think of for considering more-than-3D partitioning is if a HAL backend needs it.
Understood your points. Not saying I'm super attached to it and must go with it. (I didn't implement it that way when plumbing through conv support. ;-P) Note that my point was just to send a pull request to show what it looks like, given I initially had it already and I think it's not that complicated.
+1. We claim Flow workgroups can be arbitrary-D, but in reality only <= 3 can be used, so that's just a confusing point. If we aren't going to use it, it's better to drop it. :)
Clear from me. But please address Mahesh's comments.
Force-pushed from 558e436 to 7a84a8d.
Rebased and deleted the special logic that is handled in #5172.
@MaheshRavishankar please take another look. This is rebased on a commit after 9f968d0. This needs your approval because you requested changes. Thanks!
Hey Hanhan, AFAICS there is still one comment w.r.t. the InitTensor initialization that needs to be addressed?
1) Set pooling ops as root ops.
2) Convert linalg.init_tensor to an AllocOp.
3) Update the pooling ops' launch config on both codegen paths.

(2) is only done for shaped operands. The alloc op will be deleted once lowered to loops. This is the same behavior as the Linalg on buffers path.

Part of iree-org#5043
* 82d8765 Merge branch 'google' into main-to-google
* b7e2b09 Remove special casing for IreeFileCheck in iree_lit_test.cmake (#5209)
* b67af5a Clean up several CMake options configured for our CI builds. (#5197)
* ea8b82d Integrate MLIR-EmitC at iml130/mlir-emitc@68547d0 (#5205)
* 10d9ea1 Make CPU/GPU end-to-end testing in Linalg on tensors/buffers align. (#5203)
* ff69d02 Move prepration of benchmark files to python script. (#5201)
* 0c2f9e8 Bump Abseil-cpp submodule to LTS 20200923, Patch 3 (#5198)
* 6ac0023 Add runtime tracing support to passes under IREE's PassManager usage. (#5160)
* 2e41217 Bump Benchmark submodule to 39c8d58 (#5196)
* 663ab93 Adjust UnixLinkerTool flags used for Apple targets. (#5164)
* 963beb9 Plumb pooling ops through Linalg on tensors path. (#5136)
* a33f72f Plumb depthwise convolution in Linalg on tensors. (#5189)
* 2864ee4 Add restrictions for enabling generic op vectorization. (#5188)
* 4f218a8 Make sure convolution vectorization is off when the option flag is off (#5187)
* 44736b5 build_tools:cmake:riscv: Fix the system library path (#5175)
* 96e9c49 Switching from mhlo::ConvertOp to ZeroExtendI for i1->i8 conversion. (#5119)
* da4ad51 Fix and re-enable jax testing (#5183)
* 301bc15 Merge google -> main (#5182)
* 400719c Shuffling around/removing some HLO deps.
* 9ba1d57 Enable MobileBert on GPU for the linalg on tensors path. (#5180)
* 52f02af Resolve name conflicts while linking executables in LLVM. (#5179)
* 4a732d3 Removing the post-partitioning conversion pass as it is unused. We've built up..

COPYBARA_INTEGRATE_REVIEW=#5212 from ScottTodd:main-to-google 82d8765
PiperOrigin-RevId: 364658113