
Avoid unnecessary padding in iree-flow-set-encoding #11632

Closed
bjacob opened this issue Dec 21, 2022 · 2 comments · Fixed by #14349
Assignees
Labels
codegen/llvm LLVM code generation compiler backend codegen/vmvx VMVX (IREE VM) code generation compiler backend performance ⚡ Performance/optimization related work across the compiler and runtime

Comments

bjacob (Contributor) commented Dec 21, 2022

(Updated July 2023)

When data-tiling, we don't yet know the exact tile sizes during Flow, so we don't know exactly how much padding is needed to ensure that buffers are allocated large enough to cover rounding each dimension up to the next multiple of the actual tile size.

The current code simply uses a hardcoded tile size of 16 along every dimension in Flow.

Problems with that:

  1. It assumes that no tile size will ever be larger than 16, which is not a safe assumption with ever-wider SIMD. Intel AMX, Apple AMX, Arm SME, and the SiFive (RISC-V) Intelligence extensions all defeat it. In fact, with narrow data types such as float16, even AVX-512, Arm SVE, or RISC-V RVV can already defeat it, with tile sizes of 32. Resolving this is a prerequisite for targeting all of these near-term SIMD targets.
  2. When matrix shapes are very narrow, meaning that some dimension size is less than 16, padding to the next multiple of 16 can add large overhead. In particular, for matrix-times-vector multiplications, the widening is from 1 to 16 columns, immediately rephrasing the problem at hand into one involving 16x more arithmetic.
  3. Incomplete tiles are costly to deal with in packing code. Normally there would be at most one incomplete tile at the end of each dimension, but because of this, there can be as many as 16.
  4. This requires the SetEncoding pass to be parameterized by a defaultPadding option, whereas it could otherwise be unparameterized.
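To make the overhead in point 2 concrete, here is a small sketch (plain Python, not IREE code) contrasting the hardcoded pad-to-16 behavior with padding to the actual tile size for a matrix-times-vector product:

```python
# Illustrative sketch (not IREE code): compare padding to a hardcoded 16
# against padding to the actual tile size chosen later in codegen.

def pad_to_multiple(size, pad):
    # Round size up to the next multiple of pad.
    return ((size + pad - 1) // pad) * pad

# Matrix-times-vector: the "vector" right-hand side is a Kx1 matrix.
n = 1                      # original number of columns
actual_tile_n = 1          # a narrow kernel needs no widening along N
hardcoded_pad = 16         # what iree-flow-set-encoding assumed

padded_n_hardcoded = pad_to_multiple(n, hardcoded_pad)   # 16
padded_n_actual = pad_to_multiple(n, actual_tile_n)      # 1

# The hardcoded padding multiplies the arithmetic cost along N by 16x.
overhead = padded_n_hardcoded / padded_n_actual
print(overhead)  # 16.0
```

The same ratio applies to any dimension whose original size is below 16, which is why narrow matmuls are the worst-affected case.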
@bjacob bjacob added codegen/llvm LLVM code generation compiler backend codegen/vmvx VMVX (IREE VM) code generation compiler backend performance ⚡ Performance/optimization related work across the compiler and runtime labels Dec 21, 2022
MaheshRavishankar (Contributor) commented: I would need @benvanik's suggestions on this. I am passing it to him for the time being.

bjacob added a commit that referenced this issue Jan 12, 2023
This is part 3/9 of #11755: Finish the transition to
`--iree-flow-enable-data-tiling`.

Currently (unless the defaultPadding is changed to a non-default value),
dynamic-size set_encoding shapes get rounded up to the next multiple of
16. This causes padding by potentially more than one tile size. This
invalidates an assumption that the `iree_uk_pack` ukernel was relying
on, that at most one last tile along each dimension would require
padding.

I wish the assumption could resume holding. This is tracked in Issue
#11632.
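A sketch of why pre-padding to 16 breaks the at-most-one-incomplete-tile assumption (illustrative Python, not the actual `iree_uk_pack` ukernel; the tile size of 6 is an arbitrary example value):

```python
# Illustrative sketch (not the iree_uk_pack ukernel): packing assumes at
# most one incomplete (padding-containing) tile at the end of each
# dimension. If the source was pre-padded to a multiple of 16 at Flow but
# the actual tile size chosen later is, say, 6, several trailing tiles
# can consist mostly or entirely of padding.

import math

def num_incomplete_tiles(orig_size, padded_size, tile):
    # Tiles that extend past orig_size contain padding.
    total_tiles = math.ceil(padded_size / tile)
    full_data_tiles = orig_size // tile
    return total_tiles - full_data_tiles

# Original N = 1, padded to 16 at Flow, then to the tile multiple 18:
print(num_incomplete_tiles(1, 18, 6))   # 3 tiles contain padding, not 1
# If padding had been to the actual tile size (6) from the start:
print(num_incomplete_tiles(1, 6, 6))    # 1
```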
jpienaar pushed a commit that referenced this issue May 1, 2023
@bjacob bjacob changed the title Explore ways to remove defaultPadding (current default value 16) in iree-flow-set-encoding Avoid unnecessary padding in iree-flow-set-encoding Jul 7, 2023
@bjacob bjacob assigned bjacob and unassigned benvanik Jul 10, 2023
bjacob (Contributor, Author) commented Jul 10, 2023

Tentative fix: #14349

bjacob added a commit that referenced this issue Jul 11, 2023
This has long been discussed, as the enum was just an initial design
shortcut, but the number of enum cases was already getting uncomfortable
due to the multiple dimensions there.

This is a step in a chain towards fixing
#11632. The reason is that in
order to properly specialize for narrow matmul cases in
`MaterializeEncoding`, selecting adequately narrow matmul kernels that
avoid widening the entire matmul problem at hand, we will need to know
the original (pre-padding) matrix shape. Since
`MaterializeEncoding` is a type-conversion, not just an ordinary rewrite
pattern, this information will need to be encoded in types; we won't
just be able to walk from a value to its defining op to find the
pre-padding value, because there are no values there. So I will want to add
the pre-padding shape (or type) to EncodingAttr. This is a step towards
that: by making EncodingAttr a data structure, it's easy then to add
another field. By contrast, if it's still an enum, the combinatorics get
out of hand.
bjacob added a commit that referenced this issue Jul 17, 2023
…ze choice to MaterializeEncoding. (#14349)

This fixes #11632, by introducing a materializable
`upper_bound_tile_size` instead of hardcoding a fixed padding amount at
Flow, and fixes it in sufficient generality to also solve the problem
for narrow matmuls; let's explain that in more detail, as it is an
important part of what this PR does.

For each combination of element types and each target, the
MaterializeEncoding pass selects appropriate matmul tile shapes. Input
tensors get padded to the next multiple of the tile size. The
padding increases the inherent arithmetic cost of the problem at hand.
When, along some dimension, the original tensor size is smaller than the
tile size, that can result in particularly large overhead. The extreme
case, which is also a very common case, is matrix-times-vector
multiplication. The "vector" right-hand side is really a matrix with one
dimension size equal to 1, so if the general matmul tile shape along
that dimension is 8 or 16, as is usually the case, that can be an 8x or
16x increase in the inherent arithmetic cost of the matmul op.

The solution to that is to adjust MaterializeEncoding tile shapes to
narrow dimensions. We had some logic in place to deal with that, but
#11632 was leaving it moot: the flow-level padding of everything to the
next multiple of 16 meant that our logic there never really had a chance
of kicking in. With #11632 being fixed, this PR was the opportunity to
also fix that along the way, and to ensure that the solution to #11632
worked also in that respect. As matrix-times-vector products were the
common case that suffered the most from #11632, it would have been too
bad to "solve" #11632 without addressing that. By the way,
matrix-times-vector is only the extreme case, but other narrow cases
matter too. When, e.g. on AVX-512, the general matmul tile size is 16,
even width-8 matmuls (MxKx8) were suffering from 2x-widening. So the
solution in this PR is making sure to address all narrow cases, defined
as whenever a tensor dimension size is less than the general tile size.
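The narrow-case adjustment described above can be sketched as follows (an assumed heuristic in plain Python, not the exact MaterializeEncoding logic):

```python
# Illustrative sketch (assumed heuristic, not the exact IREE logic):
# shrink a general matmul tile size along a dimension that is narrower
# than the tile, e.g. to the next power of two covering the dim size,
# so padding never widens the problem past the original dimension much.

def adjust_tile_for_narrow_dim(general_tile, dim_size):
    if dim_size >= general_tile:
        return general_tile
    tile = 1
    while tile < dim_size:
        tile *= 2
    return tile

print(adjust_tile_for_narrow_dim(16, 1))    # 1  (matrix-times-vector)
print(adjust_tile_for_narrow_dim(16, 8))    # 8  (avoids 2x widening)
print(adjust_tile_for_narrow_dim(16, 100))  # 16 (general case unchanged)
```

With tile shapes adjusted this way, a width-8 matmul on AVX-512 gets an 8-wide kernel instead of being padded to 16.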

The difficulty was that when MaterializeEncoding runs on a dispatch
function, it runs on an already-padded tensor; even as this PR
introduces `upper_bound_tile_size`, that only makes it possible to
select the right padding amount, but there's still a `tensor.pad` op and
it's still getting in the way of knowing the actual, original tensor
shape for the purpose of adjusting tile shapes for narrow cases.
Moreover, as `MaterializeEncoding` is a type-converter pass, it can't
just walk from a Value up to its defining-op to find the pre-padding
tensor. There are no values there, only types. So the information about
the pre-padding tensor shape has to be part of the tensor type that
`MaterializeEncoding` sees, that is, the padded tensor type.

The solution to that problem in this PR is to add an `original_type`
field to `EncodingAttr`.
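To illustrate the idea (hypothetical Python dataclasses, not IREE's actual attribute definitions): since a type converter only sees types, the pre-padding shape has to ride along in the encoding attribute of the tensor type itself:

```python
# Minimal sketch (hypothetical names, not IREE's actual TableGen/C++
# definitions): because MaterializeEncoding is a type conversion, the
# pre-padding shape must travel in the type, not via def-use chains.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class EncodingAttr:
    role: str                                   # e.g. "matmul_rhs"
    original_shape: Optional[Tuple[int, ...]]   # pre-padding shape

@dataclass(frozen=True)
class RankedTensorType:
    shape: Tuple[int, ...]                      # padded shape
    encoding: Optional[EncodingAttr] = None

padded = RankedTensorType(
    shape=(1008, 16),
    encoding=EncodingAttr(role="matmul_rhs", original_shape=(1000, 1)),
)

# The converter can now recover the narrow dimension (N == 1) from the
# type alone and select a narrow kernel accordingly.
print(padded.encoding.original_shape)  # (1000, 1)
```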

Fixes #11632.

Fixes a compiler issue encountered in #14398 but not the originally
reported runtime crash by itself.

This now also includes the removal of a now-useless VMVX pass, which was
originally split into #14383.
nhasabni pushed a commit to plaidml/iree that referenced this issue Aug 24, 2023
nhasabni pushed a commit to plaidml/iree that referenced this issue Aug 24, 2023