
Low hanging fruit optimizations in VMLA kernels #3601

Closed
ScottTodd opened this issue Oct 26, 2020 · 3 comments
@ScottTodd (Member)

A few of the reference kernels in iree/hal/vmla/op_kernels_generic.h have particularly poor performance. While we expect the LLVM ahead-of-time backend to be more viable as a deployment target, having a faster reference backend is still generally useful.

Some of the slow kernels are already labeled:

https://github.com/google/iree/blob/2edc7d648b4e5a352a055666e17f58587a0a6ad6/iree/hal/vmla/op_kernels_generic.h#L238-L241

https://github.com/google/iree/blob/2edc7d648b4e5a352a055666e17f58587a0a6ad6/iree/hal/vmla/op_kernels_generic.h#L576-L578

https://github.com/google/iree/blob/2edc7d648b4e5a352a055666e17f58587a0a6ad6/iree/hal/vmla/op_kernels_generic.h#L498-L501

Profiling IREE with Tracy on a representative model shows in detail which kernels are called frequently and which take large chunks of time. Our focus for 2020Q4 is the MobileBert model at https://github.com/google/iree/blob/main/iree/test/e2e/models/bert_encoder_unrolled_fake_weights.mlir (TODO: link to real weights / iree-translate compatible file):

[Tracy profiler screenshots: per-kernel timing breakdowns]

We don't need to jump straight to building something like ruy for these kernels (we already use ruy for matmul), but there are many easy optimizations to make without sacrificing readability. Switching from absl::InlinedVector to std::vector or C arrays is one example.

@ScottTodd ScottTodd added good first issue 🌱 Good for newcomers runtime Relating to the IREE runtime library performance ⚡ Performance/optimization related work across the compiler and runtime labels Oct 26, 2020
@benvanik (Collaborator)

In many cases these kernels should be a single (hopefully autovectorizable) loop with minimal indexing, with all stride handling hoisted out of the inner loop. Currently they vary between being recursive and performing all stride/offset calculations inside the inner loop, which prevents autovectorization (and simply does far more work).

@KoolJBlack KoolJBlack self-assigned this Nov 3, 2020
@benvanik (Collaborator)

#2863 will remove VMLA almost entirely; if any of it lives on, it will be as very tightly scoped kernels for things we can't do in wasm+simd, like ruy. It's fine to spend a day or two golfing these to get familiar with the system, performance tooling, etc., but I'd spend no longer than that.

@ScottTodd (Member, Author)

Closing this as:

  • work is ramping up on using WASM
  • linalg on tensors codegen is on track to land relatively soon
