Re-enable AVX512 ATen kernels for compute-intensive ops #104165
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104165
Note: Links to docs will display an error until the docs builds have been completed.
✅ 1 Unrelated Failure. As of commit d5002f4 with merge base 3336aa1: UNSTABLE - the following job failed, but was likely due to flakiness present on trunk, and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -424,6 +424,8 @@ void replication_pad3d_backward_kernel_impl(

} // anonymous namespace

// These kernels are slower with AVX512 than with AVX2.
#ifndef CPU_CAPABILITY_AVX512
// reflection padding
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);
Is it possible to add a default argument that decides explicitly whether this kernel is dispatched to AVX512? Probably something like:
// if you want this one to go avx512:
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl, /* enable_avx512_dispatch */ true);
// if you want to skip avx512 (use avx2 only), just use the old way:
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);
Agree with @mingfeima. I think it is a better idea to explicitly turn on AVX512 for particular kernels instead of turning off the others, since there are more kernels to turn off than to turn on. This can reduce the number of changed lines.
Sorry, @mingfeima @jgong5, I had missed your comments. I started with such an approach when I first began this task, so that the number of code changes could be reduced, but it led to symbol resolution issues, so I abandoned it.
Basically, since kernels are compiled separately for each AVX-n capability, if we only registered AVX512 dispatch for some kernels, then we would have missing definitions for the kernels that wouldn't be dispatched to AVX512.
Is it possible to always register null AVX512 kernels by default while checking if there are already existing registrations (e.g., via some macro def checks)? The real AVX512 kernels can be registered beforehand.
Thanks again for your inputs, @jgong5! Yes, such an approach is feasible.
The approach below, which improves readability and decreases the number of changes, is also similar in nature to @mingfeima's suggestion above (edited in-place below).
// if you want this one to go avx512:
ALSO_REGISTER_AVX512_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);
// if you want to skip avx512 (use avx2 only), just use the old way:
REGISTER_DISPATCH(reflection_pad1d_kernel, &reflection_pad1d_kernel_impl);
@jgong5, its implementation is a bit different from your original suggestion, though.
In the code snippet below, we leverage the fact that we compile separately for each AVX-n vectorization ISA. In DispatchStub.h:
#elif defined(CPU_CAPABILITY)
#ifdef CPU_CAPABILITY_AVX512
// REGISTER_DISPATCH now dispatches an AVX512 kernel to nullptr
// ALSO_REGISTER_AVX512_DISPATCH should be used for ensuring AVX512 dispatch
#define REGISTER_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, nullptr)
#else
#define REGISTER_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, fn)
#endif
#define ALSO_REGISTER_AVX512_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, fn)
I'll push changes & then re-request review to seek feedback on it. Thanks!
This comment was marked as off-topic.
Disable AVX512 dispatch for fmod & remainder
LGTM now.
This comment was marked as resolved.
Hi @mingfeima, I'll add benchmarking data soon. @jgong5, I was wondering if we should rename the macro
LGTM :)
Hi @ezyang,
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary
Enables AVX512 dispatch by default for some kernels for which AVX512 performs better than AVX2. For the other kernels, their AVX2 counterparts are used.
Implementation details
REGISTER_DISPATCH should now only be used for non-AVX512 dispatch. ALSO_REGISTER_AVX512_DISPATCH should be used when AVX512 dispatch should also be done for a kernel.
Benchmarking results with #104655
Raw data at GitHub Gist (click on Download ZIP). Insights gleaned through the collected data (future action-items):
Data was collected with 26 physical threads of one socket of Intel Xeon 8371HC. Intel OpenMP & tcmalloc were preloaded.
cc @jgong5 @mingfeima @XiaobingSuper @ashokei @jingxu10