
Low hanging fruit optimizations in VMLA kernels #3601

Closed
ScottTodd opened this issue Oct 26, 2020 · 3 comments
@ScottTodd (Member)

A few of the reference kernels in iree/hal/vmla/op_kernels_generic.h have particularly poor performance. While we expect the LLVM ahead-of-time backend to be more viable as a deployment target, having a faster reference backend is still generally useful.

Some of the slow kernels are already labeled:

https://github.com/google/iree/blob/2edc7d648b4e5a352a055666e17f58587a0a6ad6/iree/hal/vmla/op_kernels_generic.h#L238-L241

https://github.com/google/iree/blob/2edc7d648b4e5a352a055666e17f58587a0a6ad6/iree/hal/vmla/op_kernels_generic.h#L576-L578

https://github.com/google/iree/blob/2edc7d648b4e5a352a055666e17f58587a0a6ad6/iree/hal/vmla/op_kernels_generic.h#L498-L501

Profiling IREE with Tracy on a representative model shows in detail which kernels are called frequently and which take large chunks of time. Our focus for 2020Q4 is the MobileBert model at https://github.com/google/iree/blob/main/iree/test/e2e/models/bert_encoder_unrolled_fake_weights.mlir (TODO: link to real weights / iree-translate compatible file):

[Tracy profiler screenshots: per-kernel timing breakdowns]

We don't need to jump straight to building something like ruy for these kernels (we already use ruy for matmul), but there are many easy optimizations to make without sacrificing readability. Switching from absl::InlinedVector to std::vector or C arrays is one example.

@ScottTodd ScottTodd added good first issue 🌱 Good for newcomers runtime Relating to the IREE runtime library performance ⚡ Performance/optimization related work across the compiler and runtime labels Oct 26, 2020
@benvanik (Collaborator)

In many cases these kernels should be a single (hopefully autovectorizable) loop with minimal indexing, with all stride handling hoisted out of the inner loop. Currently they vary between being recursive and performing all stride/offset calculations inside the inner loop, which prevents autovectorization (and simply does far more work).

@KoolJBlack KoolJBlack self-assigned this Nov 3, 2020
@benvanik (Collaborator)

#2863 will remove VMLA almost entirely; if any of it lives on, it will be as very tightly scoped kernels for things we can't do in wasm+simd, like ruy. It's fine to spend a day or two golfing these to get familiar with the system, performance tooling, etc., but I'd spend no longer than that.

@ScottTodd (Member, Author)

Closing this as:

  • work is ramping up on using WASM
  • linalg on tensors codegen is on track to land relatively soon
