[compute/cker] Optimize BatchMatMul for x86 #14305

tomdol · 2024-11-05T17:27:28Z

This commit adds an optimized version of the BatchMatMul kernel. The optimization targets the x86 architecture; on all other architectures the code is compiled with the existing reference kernel.

The new kernel calls the optimized::Gemm function, which uses Eigen internally.

Additionally, to avoid code duplication, a new BatchMatMulParams struct is introduced and reused in both the reference and the optimized kernels.

ONE-DCO-1.0-Signed-off-by: Tomasz Dolbniak [email protected]
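For readers who have not seen the change itself, the pattern described above (one parameter struct shared by a reference kernel and an architecture-specific kernel, selected at compile time) could be sketched roughly as follows. This is only an illustration under stated assumptions: `BatchMatMulParamsSketch`, its fields, and `RunBatchMatMul` are hypothetical names that are not taken from the cker sources, and the "optimized" path is a plain reordered loop standing in for the Eigen-backed optimized::Gemm call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle shared by both kernels (not the real struct).
struct BatchMatMulParamsSketch
{
  std::size_t batch; // number of independent matrix products
  std::size_t rows;  // M: rows of each lhs matrix
  std::size_t accum; // K: shared inner dimension
  std::size_t cols;  // N: columns of each rhs matrix
};

namespace reference
{
// Naive i-j-k triple loop per batch; row-major storage is assumed.
inline void BatchMatMul(const BatchMatMulParamsSketch &p, const float *lhs, const float *rhs,
                        float *out)
{
  for (std::size_t b = 0; b < p.batch; ++b)
  {
    const float *l = lhs + b * p.rows * p.accum;
    const float *r = rhs + b * p.accum * p.cols;
    float *o = out + b * p.rows * p.cols;
    for (std::size_t i = 0; i < p.rows; ++i)
      for (std::size_t j = 0; j < p.cols; ++j)
      {
        float sum = 0.0f;
        for (std::size_t k = 0; k < p.accum; ++k)
          sum += l[i * p.accum + k] * r[k * p.cols + j];
        o[i * p.cols + j] = sum;
      }
  }
}
} // namespace reference

namespace optimized
{
// Stand-in for a GEMM-backed kernel: i-k-j order so rhs and out are read contiguously.
inline void BatchMatMul(const BatchMatMulParamsSketch &p, const float *lhs, const float *rhs,
                        float *out)
{
  for (std::size_t b = 0; b < p.batch; ++b)
  {
    const float *l = lhs + b * p.rows * p.accum;
    const float *r = rhs + b * p.accum * p.cols;
    float *o = out + b * p.rows * p.cols;
    for (std::size_t i = 0; i < p.rows; ++i)
    {
      for (std::size_t j = 0; j < p.cols; ++j)
        o[i * p.cols + j] = 0.0f;
      for (std::size_t k = 0; k < p.accum; ++k)
      {
        const float a = l[i * p.accum + k];
        for (std::size_t j = 0; j < p.cols; ++j)
          o[i * p.cols + j] += a * r[k * p.cols + j];
      }
    }
  }
}
} // namespace optimized

// Hypothetical entry point: picks the kernel at compile time based on the target.
inline std::vector<float> RunBatchMatMul(const BatchMatMulParamsSketch &p,
                                         const std::vector<float> &lhs,
                                         const std::vector<float> &rhs)
{
  assert(lhs.size() == p.batch * p.rows * p.accum);
  assert(rhs.size() == p.batch * p.accum * p.cols);
  std::vector<float> out(p.batch * p.rows * p.cols, 0.0f);
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
  optimized::BatchMatMul(p, lhs.data(), rhs.data(), out.data());
#else
  reference::BatchMatMul(p, lhs.data(), rhs.data(), out.data());
#endif
  return out;
}
```

Because both kernels take the same struct, the dispatch stays a single preprocessor branch and callers do not need to know which path was compiled in.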


tomdol · 2024-11-05T17:28:08Z

This PR is a follow-up to draft #14238 and was submitted to (partially) solve issue #12140.

glistening · 2024-11-06T04:28:41Z

@tomdol Just for your information, our main target is ARM, not x64. I guess you are already aware of that, since you wrote "(partially) solve". Also, for LLMs, we will use the GGML kernels, which provide quantized-type kernels for types narrower than 8 bits.

glistening · 2024-11-06T04:34:15Z

In addition, this PR does not include any tests. How did you test this kernel?

tomdol · 2024-11-06T08:36:34Z

@glistening

How did you test this kernel?

There are existing tests for this kernel. I considered adding more, but it seems that all use cases are already covered.

[ RUN      ] GeneratedTests.batch_matmul_ex_dynamic_nnfw
[       OK ] GeneratedTests.batch_matmul_ex_dynamic_nnfw (1 ms)
[ RUN      ] GeneratedTests.batch_matmul_ex_float_simple
[       OK ] GeneratedTests.batch_matmul_ex_float_simple (1 ms)
[ RUN      ] GeneratedTests.batch_matmul_ex_float_adj_y
[       OK ] GeneratedTests.batch_matmul_ex_float_adj_y (1 ms)
[ RUN      ] GeneratedTests.batch_matmul_ex_float_adj_x
[       OK ] GeneratedTests.batch_matmul_ex_float_adj_x (1 ms)
[ RUN      ] GeneratedTests.batch_matmul_ex_float_batch2
[       OK ] GeneratedTests.batch_matmul_ex_float_batch2 (1 ms)
[ RUN      ] GeneratedTests.batch_matmul_ex_float_broadcast
[       OK ] GeneratedTests.batch_matmul_ex_float_broadcast (1 ms)
[ RUN      ] GeneratedTests.batch_matmul_ex_float_broadcast_adj_x
[       OK ] GeneratedTests.batch_matmul_ex_float_broadcast_adj_x (1 ms)
[ RUN      ] GeneratedTests.batch_matmul_ex_float_broadcast2_adj_xy
[       OK ] GeneratedTests.batch_matmul_ex_float_broadcast2_adj_xy (1 ms)
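The `adj_x` and `adj_y` suffixes in the test names above refer to multiplying with a transposed left or right operand. As a rough illustration of what such cases exercise, here is a minimal single-matrix sketch. It assumes TensorFlow-style semantics (for real-valued data the adjoint is just the transpose) and row-major storage; `MatMulAdj` is a hypothetical helper, not code from the ONE test suite or from cker.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: computes out[m x n] = op(lhs) * op(rhs) in row-major order.
// With adj_x the lhs buffer is stored as k x m; with adj_y the rhs buffer is stored as n x k.
inline std::vector<float> MatMulAdj(const std::vector<float> &lhs, const std::vector<float> &rhs,
                                    std::size_t m, std::size_t k, std::size_t n, bool adj_x,
                                    bool adj_y)
{
  assert(lhs.size() == m * k);
  assert(rhs.size() == k * n);
  std::vector<float> out(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t kk = 0; kk < k; ++kk)
      {
        // Swap the index roles when an operand is stored transposed.
        const float a = adj_x ? lhs[kk * m + i] : lhs[i * k + kk];
        const float b = adj_y ? rhs[j * k + kk] : rhs[kk * n + j];
        out[i * n + j] += a * b;
      }
  return out;
}
```

A real kernel would typically handle the flags by transposing up front or by passing layout information to the GEMM routine rather than branching inside the inner loop; the branch is used here only to keep the sketch short.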

Regarding the GGML kernel - is someone already working on it? I was planning to write an optimized version for ARM as the next step, so I would like to know whether I should proceed.

glistening · 2024-11-07T03:03:06Z

@tomdol Thank you for the answer. So the kernel was tested via the NNAPI tests.

Regarding the GGML kernel - is someone already working on it?

I checked our model. For our model, BatchMatMul with f32 inputs (both lhs and rhs) is necessary.

glistening

LGTM

glistening · 2024-11-08T01:20:39Z

@tomdol For the ARM-optimized kernel, I am thinking of using ggml mul_mat, which is already in our repo. It supports multithreading and uses NEON-optimized code. What kernel are you thinking of?

zetwhite

LGTM 👍

tomdol · 2024-11-12T08:47:21Z

@tomdol For the ARM-optimized kernel, I am thinking of using ggml mul_mat, which is already in our repo. It supports multithreading and uses NEON-optimized code. What kernel are you thinking of?

@glistening Sorry about the delay in replying. I haven't thought about any particular kernel yet, only that an optimized ARM-targeting version was needed too. I was hoping to figure out more by discussing it in #12140.

I would appreciate some guidelines and would like to offer to help with this part of the BatchMatMul optimization work. Unless of course someone is already taking care of it :)

tomdol requested a review from glistening November 5, 2024 17:28

tomdol added 2 commits November 5, 2024 22:17

Formatting and namespace usage

1cf9198

Disable the optimized kernel in non-x86 envs

ebbbb0a

glistening approved these changes Nov 7, 2024


glistening requested a review from a team November 7, 2024 03:29

zetwhite approved these changes Nov 8, 2024


glistening merged commit 1e09707 into Samsung:master Nov 8, 2024
9 checks passed

tomdol deleted the bmm_opt_x86 branch November 12, 2024 08:47

glistening mentioned this pull request Nov 13, 2024

[onert] Optimize BatchMatMul kernel in cpu backend #12140

Open
