gpu: intel: Optimize reusable layer normalization using work-group based reductions #1990
Conversation
make test

```c
uint32_t sg_size;

/// The number of work-items in the work group
uint32_t wg_size = 0;
```
I suspect you don't actually need this variable. It's used in two places:
- Set as a build option to get passed to `reqd_work_group_size`: you can probably just remove this, and the performance changes will be minimal.
- Computing the `nd_range_t`: reconstruct it directly in the `execute` function (based on `select_work_group_kernel`, `vector_size`, and `pd()`).

If you can remove this, the kernel will be far more reusable.
```c
/// Use the cl-intel-256-GRF-per-thread flag
bool large_grf = false;
```
I worry about using large GRF mode as a heuristic. On most Intel GPUs, switching the GRF mode requires stalling the pipeline, which can lead to performance losses. You can (probably) see this by running a benchdnn batch on layers that alternate small/large/small GRF modes; you should see performance much lower than when they're run separately.

Usually, the GRF mode is passed in by the user as a GPU attr, and the kernels are just tasked with sticking to it.
```c
#if PVT_MEM_SIZE > 1
    VECT_FLOAT_T val[PVT_MEM_SIZE];
    unroll_for_by(N_UNROLL)(int sg_idx = 0, i = 0; i < PVT_MEM_SIZE;
            sg_idx += GROUP_STRIDE, i++) {
        val[i] = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(VECT_BLOCK_READ(
                (const __global BLOCK_DATA_T *)(&src[sg_idx]))));
    }
#else
    VECT_FLOAT_T val = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(
            VECT_BLOCK_READ((const __global BLOCK_DATA_T *)(src))));
#endif
```
I think the compiler should be able to optimize this incantation. Give it a shot and let me know.
```diff
-#if PVT_MEM_SIZE > 1
-    VECT_FLOAT_T val[PVT_MEM_SIZE];
-    unroll_for_by(N_UNROLL)(int sg_idx = 0, i = 0; i < PVT_MEM_SIZE;
-            sg_idx += GROUP_STRIDE, i++) {
-        val[i] = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(VECT_BLOCK_READ(
-                (const __global BLOCK_DATA_T *)(&src[sg_idx]))));
-    }
-#else
-    VECT_FLOAT_T val = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(
-            VECT_BLOCK_READ((const __global BLOCK_DATA_T *)(src))));
-#endif
+    VECT_FLOAT_T val[PVT_MEM_SIZE];
+    int sg_idx = 0;
+    for (int i = 0; i < PVT_MEM_SIZE; i++) {
+        val[i] = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(VECT_BLOCK_READ(
+                (const __global BLOCK_DATA_T *)(&src[sg_idx]))));
+        sg_idx += GROUP_STRIDE;
+    }
```
Description
This pull request adds an alternative kernel that performs better than the previously implemented sub-group reduction kernel under certain conditions. The new kernel uses the work_group_reduce_add function to perform the mean and variance reductions instead of the sub_group based reductions. One benefit of this kernel is that it performs better for sizes that do not fully utilize the device when sub-group based reductions are used.
Optimizations
work-group based reductions vs sub-group based reductions
There are two kernels implemented for the reusable layer normalization layer. These kernels differ in how the summation is performed in the mean and variance calculation. The work-group kernel launches a work-item for each element in the lnorm axis, while the sub-group kernel launches one SIMD worth of work-items in the lnorm axis. The work-group kernel uses the work_group_reduce_add function and the sub-group version uses the sub_group_reduce_add function to perform the summation. Here is a heatmap of the two kernels and how they perform over the different shapes of the input tensor.
Use of fixed sized loops vs variable sized loops
Example: https://github.com/oneapi-src/oneDNN/compare/main...umar456:oneDNN:uarshad/reusable_vectorized_lnorm?expand=1#diff-399297f4e437e8a12e0e654089b8af8c938a7a8430c6efab4ea61a029a683f8cR53
There is a significant penalty when a runtime variable is used in the loop's exit condition. Here are heatmaps comparing a runtime condition vs. a compile-time condition for the for loop:
Use macro to avoid loops in work-group kernel
The ifdef here: https://github.com/oneapi-src/oneDNN/compare/main...umar456:oneDNN:uarshad/reusable_vectorized_lnorm?expand=1#diff-399297f4e437e8a12e0e654089b8af8c938a7a8430c6efab4ea61a029a683f8cR51
is used to avoid adding a loop in the work-group kernel. I had originally assumed that if the compiler knew the loop only iterated once, it would be able to remove the overhead of the loop, but it seems that is not the case. Here is a heatmap of using the macro to remove the loop in the work-group kernel.
Use large GRF for certain shapes in the sub-group based kernel
The large GRF flag can significantly improve the speed of the kernel in certain situations. You see the greatest speedup when the tensor is small enough to fit in the device cache and the lnorm axis is larger than 768. I suspect this is because it allows the device to queue more load transactions than without the flag. Additionally, there is a significant slowdown when the lnorm axis is small and the number of sub-groups launched is greater than one wave. I suspect this is because fewer sub-groups are active when the large GRF flag is used. Here is the heatmap of the sub-group kernel with and without the GRF flag.
Overall Speedup
512 EU PVC
Heatmap vs. Original Vectorized: