Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cpu: aarch64: Expand brgemm aarch64 unsupported cases handling mechanism #2099

Conversation

Radu2k
Copy link
Contributor

@Radu2k Radu2k commented Sep 18, 2024

Remove the asserts that cause segfaults in the test and handle unsupported cases based on oneDNN status type. As well remove the forgotten asserts in brgemm_matmul.cpp and modify acl_deconvolution.hpp to make op/f32/deconv_bwd_d test pass.

Description

This replaces the asserts in brgemm that make test_benchdnn_modeC_graph_ci_cpu segfault on AArch64 by replacing the asserts with status_t type in oneDNN. As well, the segfault of --graph --skip-impl=ref --case=op/f32/deconv.json test when built with gcc-10 and g++-10, which is part of the graph test batch.

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Bug fixes

  • Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
  • Have you added relevant regression tests?

Remove the asserts that cause segfaults in the test and handle
unsupported cases based on oneDNN status type. As well remove
the forgotten asserts in brgemm_matmul.cpp and modify
acl_deconvolution.hpp to make op/f32/deconv_bwd_d test pass.
@Radu2k Radu2k requested review from a team as code owners September 18, 2024 16:12
@github-actions github-actions bot added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Sep 18, 2024
Comment on lines 248 to 249
const dnn_mem_t &mem_dt = args.find(arg);
const dnn_mem_t &mem_fp = args_.find(arg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed? What's the issue with auto?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dzarukin Hi Dmitry, thanks for having a look into this. I have replaced auto with explicit data types for improved readability, consistency, and type safety. When debugging it was particularly useful to have the data type specified and I did go an extra mile checking that the function returns the same type. Should not be a problem but I am happy to review it if it is a deal breaker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve been working on this the past few days, and all tests for the Graph API should run fine. There are a few edge cases that fail but I’ll detail in an issue soon and shall be fixed in the foreseeable future.

Thanks for the approval!

Copy link
Contributor

@mgouicem mgouicem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you split the change to acl_deconvolution in a separate commit as it seems to fix a different issue (unrelated to brgemm)?

@Radu2k
Copy link
Contributor Author

Radu2k commented Sep 19, 2024

@mgouicem Hi Mourad, thanks for having a look into this. I understand your opinion but why I have added it to this PR is because --graph --skip-impl=ref --case=op/f32/deconv.json is part of test_benchdnn_modeC_graph_ci_cpu batch apparently, so it is a direct fix of the test. That being said it is the only place where we have spotted the problem.

@theComputeKid
Copy link
Contributor

@mgouicem and @dzarukin can you review on behalf of the graph team and merge?

@mgouicem
Copy link
Contributor

We are not approvers for graph component. @TaoLv could you help?

@vpirogov
Copy link
Member

The change that triggers review from Graph component is discussed here. Considering that it's purely stylistical it may make sense to just drop it.

@Radu2k
Copy link
Contributor Author

Radu2k commented Sep 27, 2024

The change that triggers review from Graph component is discussed here. Considering that it's purely stylistical it may make sense to just drop it.

Just to clarify on this topic. This is needed due to the segfaulting behaviour of the test. These are on high priority because segfaults are a security risk. The patch that has introduced this is not coming from us and we will take further action once this change has been submitted. We already have an Issue prepared.

We are promptly on a clear path to minimise our tolerance for any tests that do not guard properly any unsupported cases or any PRs that have slipped trough our fingers in the past.

@theComputeKid
Copy link
Contributor

@vpirogov and @mgouicem : removed the graph-related change, should be good to merge.

@mgouicem mgouicem removed the request for review from a team September 30, 2024 19:04
@mgouicem mgouicem merged commit 932dab8 into oneapi-src:main Sep 30, 2024
16 checks passed
@Shreyas-fuj
Copy link
Contributor

Shreyas-fuj commented Oct 4, 2024

@dzarukin, @theComputeKid, @mgouicem, @Radu2k after this PR is merged the matmul calls are going to gemm:jit:f32, instead of going to brg:sve_256 on Graviton 3. This would slow down matmuls to a great extent on aarch64 when OneDNN is not built with ACL.

@theComputeKid
Copy link
Contributor

Thanks for letting us know. @Radu2k can you please take a look?

@theComputeKid
Copy link
Contributor

@Shreyas-fuj : I can see that the issue is that the conditions of the assert were flipped:

    if (dt_a != data_type::undef && dt_b != data_type::undef)
         return status::unimplemented;
assert(dt_a != data_type::undef && dt_b != data_type::undef);

These two are not equivalent, but flipped. I'll fix. Thanks once again for picking this up.

@Ryo-not-rio
Copy link
Contributor

@Shreyas-fuj Thank you for bringing the issue to our attention. This PR is the fix for the observed behaviour

@Shreyas-fuj
Copy link
Contributor

@theComputeKid , @Ryo-not-rio , thanks for the fix!

@Radu2k Radu2k deleted the loc/github_forks/main/bugfix/brgemm_and_fp32_fix branch October 11, 2024 15:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants