gpu: intel: Optimize reusable layer normalization using work-group based reductions #1990
Conversation
make test

```c
uint32_t sg_size;

/// The number of work-items in the work group
uint32_t wg_size = 0;
```
I suspect you don't actually need this variable. It's used in two places:
- Set as a build option to get passed to `reqd_work_group_size`: you can probably just remove this, and the performance changes will be minimal.
- Computing the `nd_range_t`: reconstruct it directly in the `execute` function (based on `select_work_group_kernel`, `vector_size`, and `pd()`).

If you can remove this, the kernel will be far more reusable.
```c
/// Use the cl-intel-256-GRF-per-thread flag
bool large_grf = false;
```
I worry about using large GRF mode as a heuristic. On most Intel GPUs, switching the GRF mode requires stalling the pipeline, which can lead to performance losses. You can (probably) see this by running a benchdnn batch on layers that alternate small/large/small GRF modes; you should see performance much lower than when they're run separately.

Usually, the GRF mode is passed in by the user as a GPU attr, and the kernels are just tasked with sticking to it.
```c
#if PVT_MEM_SIZE > 1
    VECT_FLOAT_T val[PVT_MEM_SIZE];
    unroll_for_by(N_UNROLL)(int sg_idx = 0, i = 0; i < PVT_MEM_SIZE;
            sg_idx += GROUP_STRIDE, i++) {
        val[i] = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(VECT_BLOCK_READ(
                (const __global BLOCK_DATA_T *)(&src[sg_idx]))));
    }
#else
    VECT_FLOAT_T val = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(
            VECT_BLOCK_READ((const __global BLOCK_DATA_T *)(src))));
#endif
```
I think the compiler should be able to optimize this incantation. Give it a shot and let me know.
```diff
-#if PVT_MEM_SIZE > 1
-    VECT_FLOAT_T val[PVT_MEM_SIZE];
-    unroll_for_by(N_UNROLL)(int sg_idx = 0, i = 0; i < PVT_MEM_SIZE;
-            sg_idx += GROUP_STRIDE, i++) {
-        val[i] = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(VECT_BLOCK_READ(
-                (const __global BLOCK_DATA_T *)(&src[sg_idx]))));
-    }
-#else
-    VECT_FLOAT_T val = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(
-            VECT_BLOCK_READ((const __global BLOCK_DATA_T *)(src))));
-#endif
+    VECT_FLOAT_T val[PVT_MEM_SIZE];
+    int sg_idx = 0;
+    for (int i = 0; i < PVT_MEM_SIZE; i++) {
+        val[i] = CONVERT_VECT_FLOAT_T(AS_VECT_DATA_T(VECT_BLOCK_READ(
+                (const __global BLOCK_DATA_T *)(&src[sg_idx]))));
+        sg_idx += GROUP_STRIDE;
+    }
```
Description
This pull request adds an alternative kernel that performs better than the previously implemented sub-group reduction kernel under certain conditions. The new kernel uses the work_group_reduce_add function to perform the mean and variance reductions instead of the sub_group based reductions. One benefit of this kernel is that it performs better for sizes that do not fully utilize the device when sub-group based reductions are used.
Optimizations
work-group based reductions vs sub-group based reductions
There are two kernels implemented for the reusable layer normalization layer. These kernels differ in how the summation is performed in the mean and variance calculation. The work-group kernel launches a work-item for each element in the lnorm axis, while the sub-group kernel launches one SIMD worth of work-items in the lnorm axis. The work-group kernel uses the work_group_reduce_add function and the sub-group version uses the sub_group_reduce_add function to perform the summation. Here is a heatmap of the two kernels and how they perform over the different shapes of the input tensor.
Use of fixed sized loops vs variable sized loops
Example: https://github.com/oneapi-src/oneDNN/compare/main...umar456:oneDNN:uarshad/reusable_vectorized_lnorm?expand=1#diff-399297f4e437e8a12e0e654089b8af8c938a7a8430c6efab4ea61a029a683f8cR53
There is a significant penalty when a runtime variable is used in the loop's exit condition. Here are heatmaps comparing a runtime condition vs. a compile-time condition for the for loop:
Use macro to avoid loops in work-group kernel
The ifdef here: https://github.com/oneapi-src/oneDNN/compare/main...umar456:oneDNN:uarshad/reusable_vectorized_lnorm?expand=1#diff-399297f4e437e8a12e0e654089b8af8c938a7a8430c6efab4ea61a029a683f8cR51
is used to avoid adding a loop in the work-group kernel. I had originally assumed that if the compiler knew the loop only iterated once, it would be able to remove the overhead of the loop, but it seems that is not the case. Here is a heatmap of using the macro to remove the loop in the work-group kernel.
Use large GRF for certain shapes in the sub-group based kernel
The large GRF flag can significantly improve the speed of the kernel in certain situations. You see the greatest speedup when the tensor is small enough to fit in the device cache and the lnorm axis is larger than 768. I suspect this is because it allows the device to queue more load transactions than without the flag. Additionally, there is a significant slowdown when the lnorm axis is small and the number of sub-groups launched is greater than one wave. I suspect this is because fewer sub-groups are active when the large GRF flag is used. Here is the heatmap of the sub-group kernel with and without the GRF flag.
Overall Speedup
512 EU PVC
Heatmap vs. Original Vectorized: