
Avoid unnecessary padding in iree-flow-set-encoding #11632

Closed
bjacob opened this issue Dec 21, 2022 · 2 comments · Fixed by #14349
Assignees
Labels
codegen/llvm LLVM code generation compiler backend codegen/vmvx VMVX (IREE VM) code generation compiler backend performance ⚡ Performance/optimization related work across the compiler and runtime

Comments

bjacob (Contributor) commented Dec 21, 2022

(Updated July 2023)

When data-tiling, we don't yet know the exact tile sizes during Flow, so we don't know exactly how much padding is needed to ensure that buffers are allocated large enough to cover rounding each dimension up to the next multiple of the actual tile size.

The current code simply uses a hardcoded tile size of 16 along every dimension in Flow.

Problems with that:

  1. It assumes that no tile size will ever be larger than 16, which is not a safe assumption with ever-wider SIMD. Intel AMX, Apple AMX, Arm SME, and the SiFive (RISC-V) Intelligence extensions all defeat it. In fact, with narrow data types such as float16, even AVX-512, Arm SVE, or RISC-V RVV can already defeat it, with tile sizes of 32. Resolving this is a prerequisite for targeting all of these near-term SIMD targets.
  2. When matrix shapes are very narrow, meaning that some dimension size is less than 16, padding to the next multiple of 16 can add large overhead. In particular, for matrix-times-vector multiplications, the widening is from 1 to 16 columns, immediately rephrasing the problem at hand into one involving 16x more arithmetic.
  3. Incomplete tiles are costly to deal with in packing code. Normally there would be at most one incomplete tile at the end of each dimension, but because of this, there can be as many as 16.
  4. This requires the SetEncoding pass to be parameterized by a defaultPadding option, whereas it could otherwise be unparameterized.
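To make the overhead in point 2 concrete, here is a small sketch (plain Python, not IREE code) contrasting the hardcoded pad-to-16 behavior with padding to the actual tile size for a matrix-times-vector product:

```python
# Illustrative sketch (not IREE code): compare padding to a hardcoded 16
# against padding to the actual tile size chosen later in codegen.

def pad_to_multiple(size, pad):
    # Round size up to the next multiple of pad.
    return ((size + pad - 1) // pad) * pad

# Matrix-times-vector: the "vector" right-hand side is a Kx1 matrix.
n = 1                      # original number of columns
actual_tile_n = 1          # a narrow kernel needs no widening along N
hardcoded_pad = 16         # what iree-flow-set-encoding assumed

padded_n_hardcoded = pad_to_multiple(n, hardcoded_pad)   # 16
padded_n_actual = pad_to_multiple(n, actual_tile_n)      # 1

# The hardcoded padding multiplies the arithmetic cost along N by 16x.
overhead = padded_n_hardcoded / padded_n_actual
print(overhead)  # 16.0
```

The same ratio applies to any dimension whose original size is below 16, which is why narrow matmuls are the worst-affected case.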
@bjacob bjacob added codegen/llvm LLVM code generation compiler backend codegen/vmvx VMVX (IREE VM) code generation compiler backend performance ⚡ Performance/optimization related work across the compiler and runtime labels Dec 21, 2022
MaheshRavishankar (Contributor) commented: I would need @benvanik's suggestions on this. I am passing it to him for the time being.

bjacob added a commit that referenced this issue Jan 12, 2023
This is part 3/9 of #11755: Finish the transition to
`--iree-flow-enable-data-tiling`.

Currently (unless the defaultPadding is changed to a non-default value),
dynamic-size set_encoding shapes get rounded up to the next multiple of
16. This causes padding by potentially more than one tile size. This
invalidates an assumption that the `iree_uk_pack` ukernel was relying
on, that at most one last tile along each dimension would require
padding.

I wish the assumption could resume holding. This is tracked in Issue
#11632.
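A sketch of why pre-padding to 16 breaks the at-most-one-incomplete-tile assumption (illustrative Python, not the actual `iree_uk_pack` ukernel; the tile size of 6 is an arbitrary example value):

```python
# Illustrative sketch (not the iree_uk_pack ukernel): packing assumes at
# most one incomplete (padding-containing) tile at the end of each
# dimension. If the source was pre-padded to a multiple of 16 at Flow but
# the actual tile size chosen later is, say, 6, several trailing tiles
# can consist mostly or entirely of padding.

import math

def num_incomplete_tiles(orig_size, padded_size, tile):
    # Tiles that extend past orig_size contain padding.
    total_tiles = math.ceil(padded_size / tile)
    full_data_tiles = orig_size // tile
    return total_tiles - full_data_tiles

# Original N = 1, padded to 16 at Flow, then to the tile multiple 18:
print(num_incomplete_tiles(1, 18, 6))   # 3 tiles contain padding, not 1
# If padding had been to the actual tile size (6) from the start:
print(num_incomplete_tiles(1, 6, 6))    # 1
```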
jpienaar pushed a commit that referenced this issue May 1, 2023
@bjacob bjacob changed the title Explore ways to remove defaultPadding (current default value 16) in iree-flow-set-encoding Avoid unnecessary padding in iree-flow-set-encoding Jul 7, 2023
@bjacob bjacob assigned bjacob and unassigned benvanik Jul 10, 2023
bjacob (Contributor, Author) commented Jul 10, 2023

Tentative fix: #14349

bjacob added a commit that referenced this issue Jul 11, 2023
This has long been discussed, as the enum was just an initial design
shortcut, but the number of enum cases was already getting uncomfortable
due to the multiple dimensions there.

This is a step in a chain towards fixing
#11632. The reason is that in
order to properly specialize for narrow matmul cases in
`MaterializeEncoding`, selecting adequately narrow matmul kernels that
avoid widening the entire matmul problem at hand, we will need to know
the original (pre-padding) matrix shape. Since
`MaterializeEncoding` is a type-conversion, not just an ordinary rewrite
pattern, this information will need to be encoded in types; we won't
just be able to walk from a value to its defining op to find the
pre-padding value, because there are no values there. So I will want to add
the pre-padding shape (or type) to EncodingAttr. This is a step towards
that: by making EncodingAttr a data structure, it's easy then to add
another field. By contrast, if it's still an enum, the combinatorics get
out of hand.
bjacob added a commit that referenced this issue Jul 17, 2023
…ze choice to MaterializeEncoding. (#14349)

This fixes #11632, by introducing a materializable
`upper_bound_tile_size` instead of hardcoding a fixed padding amount at
Flow, and fixes it in sufficient generality to also solve the problem
for narrow matmuls; let's explain that in more detail, as it is an
important part of what this PR does.

For each combination of element types and each target, the
MaterializeEncoding pass selects appropriate matmul tile shapes. Input
tensors get padded to the next multiple of the tile size. The
padding increases the inherent arithmetic cost of the problem at hand.
When, along some dimension, the original tensor size is smaller than the
tile size, that can result in particularly large overhead. The extreme
case, which is also a very common case, is matrix-times-vector
multiplication. The "vector" right-hand side is really a matrix with one
dimension size equal to 1, so if the general matmul tile shape along
that dimension is 8 or 16, as is usually the case, that can be an 8x or
16x increase in the inherent arithmetic cost of the matmul op.

The solution to that is to adjust MaterializeEncoding tile shapes to
narrow dimensions. We had some logic in place to deal with that, but
#11632 was leaving it moot: the flow-level padding of everything to the
next multiple of 16 meant that our logic there never really had a chance
of kicking in. With #11632 being fixed, this PR was the opportunity to
also fix that along the way, and to ensure that the solution to #11632
worked also in that respect. As matrix-times-vector products were the
common case that suffered the most from #11632, it would have been too
bad to "solve" #11632 without addressing that. By the way,
matrix-times-vector is only the extreme case, but other narrow cases
matter too. When, e.g. on AVX-512, the general matmul tile size is 16,
even width-8 matmuls (MxKx8) were suffering from 2x-widening. So the
solution in this PR is making sure to address all narrow cases, defined
as whenever a tensor dimension size is less than the general tile size.
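The narrow-case adjustment described above can be sketched as follows (an assumed heuristic in plain Python, not the exact MaterializeEncoding logic):

```python
# Illustrative sketch (assumed heuristic, not the exact IREE logic):
# shrink a general matmul tile size along a dimension that is narrower
# than the tile, e.g. to the next power of two covering the dim size,
# so padding never widens the problem past the original dimension much.

def adjust_tile_for_narrow_dim(general_tile, dim_size):
    if dim_size >= general_tile:
        return general_tile
    tile = 1
    while tile < dim_size:
        tile *= 2
    return tile

print(adjust_tile_for_narrow_dim(16, 1))    # 1  (matrix-times-vector)
print(adjust_tile_for_narrow_dim(16, 8))    # 8  (avoids 2x widening)
print(adjust_tile_for_narrow_dim(16, 100))  # 16 (general case unchanged)
```

With tile shapes adjusted this way, a width-8 matmul on AVX-512 gets an 8-wide kernel instead of being padded to 16.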

The difficulty was that when MaterializeEncoding runs on a dispatch
function, it runs on an already-padded tensor; even as this PR
introduces `upper_bound_tile_size`, that only makes it possible to
select the right padding amount, but there's still a `tensor.pad` op and
it's still getting in the way of knowing the actual, original tensor
shape for the purpose of adjusting tile shapes for narrow cases.
Moreover, as `MaterializeEncoding` is a type-converter pass, it can't
just walk from a Value up to its defining-op to find the pre-padding
tensor. There are no values there, only types. So the information about
the pre-padding tensor shape has to be part of the tensor type that
`MaterializeEncoding` sees, that is, the padded tensor type.

The solution to that problem in this PR is to add an `original_type`
field to `EncodingAttr`.
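To illustrate the idea (hypothetical Python dataclasses, not IREE's actual attribute definitions): since a type converter only sees types, the pre-padding shape has to ride along in the encoding attribute of the tensor type itself:

```python
# Minimal sketch (hypothetical names, not IREE's actual TableGen/C++
# definitions): because MaterializeEncoding is a type conversion, the
# pre-padding shape must travel in the type, not via def-use chains.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class EncodingAttr:
    role: str                                   # e.g. "matmul_rhs"
    original_shape: Optional[Tuple[int, ...]]   # pre-padding shape

@dataclass(frozen=True)
class RankedTensorType:
    shape: Tuple[int, ...]                      # padded shape
    encoding: Optional[EncodingAttr] = None

padded = RankedTensorType(
    shape=(1008, 16),
    encoding=EncodingAttr(role="matmul_rhs", original_shape=(1000, 1)),
)

# The converter can now recover the narrow dimension (N == 1) from the
# type alone and select a narrow kernel accordingly.
print(padded.encoding.original_shape)  # (1000, 1)
```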

Fixes #11632.

Fixes a compiler issue encountered in #14398 but not the originally
reported runtime crash by itself.

This now also includes the removal of a now-useless VMVX pass, which was
originally split into #14383.
nhasabni pushed a commit to plaidml/iree that referenced this issue Aug 24, 2023
nhasabni pushed a commit to plaidml/iree that referenced this issue Aug 24, 2023