[compute/cker] Optimize BMM for X86 #14238

Draft · wants to merge 4 commits into base: master
Conversation

@tomdol (Contributor) commented on Oct 18, 2024:

This commit rewrites the BatchMatMul reference implementation to use optimized::Gemm.

ONE-DCO-1.0-Signed-off-by: Tomasz Dolbniak [email protected]

This draft is a follow-up to #13919 and an attempt to solve #12140, which I took over from Jan.
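In rough terms, the approach is to collapse the batch dimensions and run one optimized 2D GEMM per batch slice instead of the scalar reference loops. A minimal sketch of that idea — `GemmSlice` below is a placeholder, not the actual `optimized::Gemm` signature:

```cpp
// Sketch of the approach: one 2D GEMM per [M, K] x [K, N] batch slice.
// GemmSlice stands in for the real optimized::Gemm entry point, whose
// actual signature (GemmParams, matrix descriptors) differs in cker.
void BatchMatMulViaGemm(const float *lhs, const float *rhs, float *out,
                        int num_batches, int m, int k, int n,
                        void (*GemmSlice)(const float *, const float *,
                                          float *, int, int, int))
{
  for (int b = 0; b < num_batches; ++b)
    GemmSlice(lhs + b * m * k, // [M, K] slice of the lhs
              rhs + b * k * n, // [K, N] slice of the rhs
              out + b * m * n, // [M, N] slice of the output
              m, k, n);
}
```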

@tomdol requested a review from @ragmani on October 18, 2024 06:51.
```cpp
BMMParams(const Shape &lhs_shape, const Shape &rhs_shape)
{
  const Shape extended_lhs_shape = Shape::ExtendedShape(5, lhs_shape);
  const Shape extended_rhs_shape = Shape::ExtendedShape(5, rhs_shape);
```
@tomdol (Contributor, author) commented on this diff:
There are some checks in the code that limit the rank of the incoming tensors to no more than 4D. What is the reasoning behind extending the shape to 5D every time? In all test cases I executed, the first dimension of the 5D extended shape was always equal to 1, but since this information is only known at runtime, the compiler cannot do anything about the outermost loops below.

In other words, would it be OK to change this code to Shape::ExtendedShape(4, ...), or should 5D be kept because there is a use case that needs it?
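For reference, my understanding is that `Shape::ExtendedShape(N, shape)` pads the shape with leading 1s up to rank N (the same convention as TFLite's `RuntimeShape::ExtendedShape`), which is why the first dimension of the extended 5D shape is 1 for all inputs of rank 4 or less. A simplified model using `std::vector` instead of cker's `Shape`:

```cpp
#include <cstddef>
#include <vector>

// Simplified model of Shape::ExtendedShape (assumed behavior): pad the
// shape with leading 1s, so a 3D [B, M, K] becomes [1, 1, B, M, K] at rank 5.
std::vector<int> ExtendedShape(int new_rank, const std::vector<int> &shape)
{
  std::vector<int> extended(new_rank, 1);
  const std::size_t offset = new_rank - shape.size();
  for (std::size_t i = 0; i < shape.size(); ++i)
    extended[offset + i] = shape[i];
  return extended;
}
```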

@glistening (Contributor) commented on Nov 1, 2024:

I am also curious about the meaning of 5. As you said, we usually work with 4D.
I searched the history, and it pointed to this. The code was written by @ragmani, and he answered the following to the same question about 5 years ago:

> BatchMatMul kernel supports rank up to 5.

Why do you want to reduce it to 4? Does your kernel in this PR only support up to 4D? Also, if your code is the more optimized one, why do you want to replace the kernel in the reference code?

In addition, could you please use the full name BatchMatMul instead of BMM? I can hardly imagine that BMM means BatchMatMul.
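For readers following along, the reference kernel's iteration over the 5D-extended shapes has roughly the structure sketched below (simplified; lhs/rhs batch broadcasting omitted). With inputs of rank 4 or less, the extended leading dimension is 1, so the outermost loop runs exactly once — the loop the question above refers to:

```cpp
// Simplified sketch (not the actual cker code) of reference-style batch
// loops over rank-5 extended shapes [B0, B1, B2, M, K] x [B0, B1, B2, K, N].
void ReferenceBatchMatMulSketch(const float *lhs, const float *rhs, float *out,
                                int b0_dim, int b1_dim, int b2_dim,
                                int m, int k, int n)
{
  for (int b0 = 0; b0 < b0_dim; ++b0) // b0_dim == 1 for <=4D inputs
    for (int b1 = 0; b1 < b1_dim; ++b1)
      for (int b2 = 0; b2 < b2_dim; ++b2)
      {
        const int batch = (b0 * b1_dim + b1) * b2_dim + b2;
        const float *lhs_slice = lhs + batch * m * k;
        const float *rhs_slice = rhs + batch * k * n;
        float *out_slice = out + batch * m * n;
        // Naive per-slice [M, K] x [K, N] matmul.
        for (int i = 0; i < m; ++i)
          for (int j = 0; j < n; ++j)
          {
            float acc = 0.f;
            for (int p = 0; p < k; ++p)
              acc += lhs_slice[i * k + p] * rhs_slice[p * n + j];
            out_slice[i * n + j] = acc;
          }
      }
}
```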

@tomdol (Contributor, author) replied:

> Why do you want to reduce it to 4? Does your kernel in this PR only support up to 4D?

It currently works with 5D; I used the same approach as the reference implementation does, but I was just curious about the reason for extending the shapes to 5 dimensions. Reading the code, I still could not find a use case where a 5D shape is passed to the kernel itself, so I thought we could remove this outermost loop. Anyway, we can leave it as is and possibly get back to this topic later.

> If your code is the more optimized one, why do you want to replace the kernel in the reference code?

This was done in the draft just to present the overall approach and ask for some guidance. In this case, should I add the branching between the optimized version and the reference version somewhere around here? https://github.com/Samsung/ONE/blob/master/compute/cker/include/cker/operation/BatchMatMul.h#L105

@glistening (Contributor) replied:

> In this case, should I add the branching between the optimized version and the reference version somewhere around here? https://github.com/Samsung/ONE/blob/master/compute/cker/include/cker/operation/BatchMatMul.h#L105

Yes.

In addition, please put the optimized kernel in optimized/BatchMatMul.h.

Please use BatchMatMul instead of BMM.
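A hypothetical sketch of what that dispatch could look like — the guard macro and the parameter list are illustrative placeholders, not necessarily the actual cker definitions:

```cpp
// Hypothetical dispatch at the suggested spot in cker/operation/BatchMatMul.h:
// branch to the optimized kernel on x86 builds, keep the reference kernel as
// the fallback everywhere else.
inline void BatchMatMul(const Shape &lhs_shape, const float *lhs_data,
                        const Shape &rhs_shape, const float *rhs_data,
                        const Shape &output_shape, float *output_data)
{
#if defined(CKER_X86_PLATFORM)
  optimized::BatchMatMul(lhs_shape, lhs_data, rhs_shape, rhs_data,
                         output_shape, output_data);
#else
  reference::BatchMatMul(lhs_shape, lhs_data, rhs_shape, rhs_data,
                         output_shape, output_data);
#endif
}
```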

@tomdol (Contributor, author) commented on Oct 31, 2024:

@ragmani could you please have a look and let me know if this is the right way to go for the first step of this op's optimization?

@ragmani (Contributor) commented on Oct 31, 2024:

> @ragmani could you please have a look and let me know if this is the right way to go for the first step of this op's optimization?

@tomdol Sorry, I'm out of the office for a while. Anyway, this seems like the right way to go for the first step.
@Samsung/one_onert Could you check this draft instead?
