
[CodeGen][CUDA] Vectorization for intrinsics #5101

Merged (1 commit into apache:master on Mar 22, 2020)

Conversation

wpan11nv
Contributor

  • This allows emitting vectorized loads/stores
    for CUDA math intrinsics (see the sketch below).

  • Fixed a few intrinsics that should be lowered as
    CUDAMath intrinsics, not CUDAFastMath ones.

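A minimal sketch of what this enables, assuming a hypothetical vec_exp kernel (the kernel and buffer names are illustrative, not taken from this PR): CUDA math intrinsics such as expf have no float4 overloads, so the loads and stores stay vectorized while the intrinsic is applied once per lane:

    // Hypothetical illustration only; TVM's generated code differs in detail.
    // Assumes A and B are 16-byte aligned and the grid covers the float4 count.
    __global__ void vec_exp(const float* A, float* B) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      float4 a = ((const float4*)A)[i];  // one vectorized 128-bit load
      float4 b;
      b.x = expf(a.x);  // no float4 overload of expf,
      b.y = expf(a.y);  // so one scalar call per lane
      b.z = expf(a.z);
      b.w = expf(a.w);
      ((float4*)B)[i] = b;               // one vectorized 128-bit store
    }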

@wpan11nv
Contributor Author

wpan11nv commented Mar 19, 2020

Fixes missing features exposed by #4968.

@jwfromm
Contributor

jwfromm commented Mar 19, 2020

This seems like a great change! Have you done any tests on how it affects performance? I'd love to know how much this speeds things up.

@wpan11nv
Contributor Author

wpan11nv commented Mar 19, 2020

> This seems like a great change! Have you done any tests on how it affects performance? I'd love to know how much this speeds things up.

I measured the vectorization benefit on a vector-add micro-benchmark in PR #4968; the speedup can be as high as 20+%. This PR adds a missing feature needed to enable #4968.

@wpan11nv
Contributor Author

@vinx13 @masahi could you please help review this PR?

src/target/source/codegen_cuda.cc (outdated review thread, resolved)
//
// Emit an unsupported vector call
//
// v = intrin_f((float4*)A[0], (float4*)B[0])
Member


do you mean intrin_f(((float4*)A)[0], ((float4*)B)[0])?

Contributor Author

wpan11nv commented Mar 20, 2020


That is the CallNode representation, which is not supported in CUDA. We emit a few scalar calls instead here.
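Concretely (an illustrative sketch using the intrin_f placeholder from the comment above; the variable names are made up, and this is not TVM's verbatim output), the codegen emits one scalar call per lane instead of the unsupported vector call:

    // Hypothetical scalarized output for a 4-lane call.
    float4 a = ((float4*)A)[0];
    float4 b = ((float4*)B)[0];
    float4 v;
    v.x = intrin_f(a.x, b.x);  // one scalar intrinsic call per lane
    v.y = intrin_f(a.y, b.y);
    v.z = intrin_f(a.z, b.z);
    v.w = intrin_f(a.w, b.w);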

- This allows emitting vectorized loads/stores
  for CUDA math intrinsics.

- A few intrinsics should be lowered as CUDAMath, not CUDAFastMath, ones.

- Fixed the code block indentation.
vinx13 merged commit 05b0f7e into apache:master Mar 22, 2020
@vinx13
Member

vinx13 commented Mar 22, 2020

Thanks @wpan11nv, this is merged.

@wpan11nv wpan11nv deleted the intrinsic_vectorize branch March 23, 2020 03:44
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Apr 16, 2020
zhiics pushed a commit to neo-ai/tvm that referenced this pull request Apr 17, 2020