gpu: intel: Optimize reusable layer normalization using work-group based reductions #1990
base: main
Changes from all commits: ad4fdd1, ff83934, 2a2ce54, cbc3aa0, 89d33a4, 658c628
```diff
@@ -56,11 +56,15 @@ struct reusable_vectorized_lnorm_params_t
     compute::kernel_ctx_t get_kernel_ctx() const;

     compute::dispatch_compile_params_t gws_params;

     /// Number of work items in a sub-group
-    int sg_size;
+    uint32_t sg_size;
+
+    /// The number of work-items in the work group
+    uint32_t wg_size = 0;
```
Review comment on `wg_size`:

> I suspect you don't actually need this variable. It's used in two places:
>
> If you can remove this, the kernel will be far more reusable.
```diff
+
     /// Number of elements to process in each work-item
-    int vector_size;
+    uint32_t vector_size;

     /// The number of times the loops need to unroll
     int unroll;
```
```diff
@@ -78,7 +82,16 @@ struct reusable_vectorized_lnorm_params_t
     /// Saves the mean and variance to memory
     bool save_stats = false;

-    uint8_t padding[4] = {false};
+    /// Select the work_group based reduction kernel
+    bool select_work_group_kernel = false;
+
+    /// Use the cl-intel-256-GRF-per-thread flag
+    bool large_grf = false;
```
Review comment on lines +88 to +89:

> I worry about using large GRF mode as a heuristic. On most Intel GPUs, switching the GRF mode requires stalling the pipeline, which can lead to performance losses. You can (probably) see this by running a benchdnn batch on layers that get small/large/small GRF modes; you should see performance much lower than when they're run separately. Usually, the GRF mode is passed in by the user as a GPU attr, and the kernels are just tasked with sticking to it.
```diff
+
+    /// The number of elements to allocate in the val array in the kernel
+    uint8_t private_mem_size = 0;
+
+    uint8_t padding[5] = {false};
 };

 struct reusable_vectorized_lnorm_runtime_params_t {
```
Review comment:

> I think the compiler should be able to optimize this incantation. Give it a shot and let me know.