[CodeGen][CUDA] Vectorization for intrinsics #5101
Conversation
Fixing missing features exposed by #4968
This seems like a great change! Have you done any tests on how it affects performance? I'd love to know how much this speeds things up.
I measured the vectorization benefit on a vector-add micro-benchmark in PR #4968. The speedup can be as high as 20%+. This PR adds a feature that #4968 needs.
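To make the benefit concrete, here is a host-side C++ sketch (hypothetical, not TVM's actual generated code) of what vectorized codegen does for the vector-add case: one `float4` load/store moves four elements per memory transaction instead of four scalar accesses, which is where the measured speedup comes from.

```cpp
#include <cstddef>

// Stand-in for CUDA's built-in float4 vector type.
struct float4 { float x, y, z, w; };

// Vectorized element-wise add: process four floats per iteration via
// float4 loads/stores (CUDA-style reinterpret casts; pointers are assumed
// 16-byte aligned and n divisible by 4).
void vec_add(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4) {
        float4 va = *reinterpret_cast<const float4*>(a + i);
        float4 vb = *reinterpret_cast<const float4*>(b + i);
        float4 vc{va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w};
        *reinterpret_cast<float4*>(c + i) = vc;
    }
}
```

On a GPU the same pattern compiles to 128-bit `ld.global.v4`/`st.global.v4` instructions, which is what the coalesced-access speedup in the micro-benchmark reflects.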
//
// Emit an unsupported vector call
//
// v = intrin_f((float4*)A[0], (float4*)B[0])
do you mean intrin_f(((float4*)A)[0], ((float4*)B)[0])?
That is the CallNode representation, which is not supported in CUDA. We are going to emit a few scalar calls here instead.
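A sketch of that fallback (an illustration, not the actual TVM emitter): when a math intrinsic has no CUDA vector overload, the codegen keeps the vectorized load/store but calls the scalar intrinsic once per lane. Here `erf_f4` is a hypothetical name for such an emitted helper.

```cpp
#include <cmath>

// Stand-in for CUDA's built-in float4 vector type.
struct float4 { float x, y, z, w; };

// There is no erf(float4) in CUDA, so the vector CallNode is lowered to
// four scalar intrinsic calls, one per lane.
float4 erf_f4(float4 v) {
    float4 r;
    r.x = std::erf(v.x);
    r.y = std::erf(v.y);
    r.z = std::erf(v.z);
    r.w = std::erf(v.w);
    return r;
}
```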
- This allows emitting vectorized loads/stores for CUDA math intrinsics.
- A few intrinsics should be lowered as CUDAMath, not CUDAFastMath, ones.
- Fixed the code block indentation.
Thanks @wpan11nv, this is merged.
This allows emitting vectorized loads/stores for CUDA math intrinsics. Fixed a few intrinsics that should be lowered as CUDAMath, not CUDAFastMath, ones.
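The CUDAMath/CUDAFastMath distinction can be illustrated with a small sketch (hypothetical helper, not TVM's actual lowering table): CUDA's fast-math approximations use a `__` prefix (e.g. `__expf`), but they exist only for a subset of operations and trade accuracy for speed, so intrinsics without a fast variant must lower to the plain CUDA math call.

```cpp
#include <string>

// Hypothetical illustration of the lowering choice: ops with a CUDA
// fast-math variant may emit the "__"-prefixed approximation; ops without
// one (e.g. erff) must emit the accurate CUDA math function.
std::string lower_intrinsic(const std::string& op, bool has_fast_variant) {
    return has_fast_variant ? "__" + op : op;
}
```

For example, `expf` has the fast variant `__expf`, while `erff` does not and must stay `erff`.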