
AllocateSharedMemoryPass may allocate an SLM size greater than the device's max shared memory #1716

Closed
LiyangLingIntel opened this issue Jul 29, 2024 · 3 comments · Fixed by #2312
Labels: bug (Something isn't working), codegen: mlir

Comments

@LiyangLingIntel (Contributor)

Running GEMM kernels such as gemm_splitk_benchmark.py on the latest llvm-target branch fails with:

triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 266240, Hardware limit: 131072.
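For context, the check that raises this error is simply a comparison of the SLM the compiler reserved against the device limit. A minimal sketch, using the numbers from the traceback above and assuming the `OutOfResources(required, limit, name)` constructor from `triton.runtime.errors` (the constructor signature is an assumption, not verified against this branch):

```python
# Minimal sketch of the check behind the error above; the numbers come from the
# traceback, and the OutOfResources(required, limit, name) signature is an
# assumption about the triton.runtime.errors API.
from triton.runtime.errors import OutOfResources

required_slm = 266240    # bytes reserved by AllocateSharedMemoryPass for this kernel
hardware_limit = 131072  # max shared local memory per workgroup on this device

if required_slm > hardware_limit:
    raise OutOfResources(required_slm, hardware_limit, "shared memory")
```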
@chengjunlu (Contributor)

Information from @whitneywhtsang

FYI: The failures reported in CI are due to enabling the large 2D block load (a74da7d).

@LiyangLingIntel (Contributor, Author) commented Sep 3, 2024

This task is still in progress.
This issue may be partially caused by the fact that we use a larger DPAS layout with repCluster, while the allocation analysis algorithm assumes the NVIDIA MMA layout, so the calculated scratch buffer size differs.
On the other hand, the original allocation algorithm does not seem perfect either; it can allocate oversized shared memory. More investigation is needed to find an appropriate solution.
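A hedged, simplified model of why the conversion scratch grows with the layout's tile shape. This is not Triton's actual allocation analysis; the tile shapes and the plain per-dimension max are illustrative only, used to build intuition:

```python
# Simplified model (not Triton's allocation analysis): the conversion scratch
# buffer has to hold roughly one tile whose shape is the per-dimension maximum
# of the source and destination layouts' tile shapes, so enlarging the DPAS
# repCluster enlarges the scratch buffer.
def scratch_bytes(src_tile, dst_tile, elem_bytes):
    rep = [max(s, d) for s, d in zip(src_tile, dst_tile)]
    n = 1
    for r in rep:
        n *= r
    return n * elem_bytes

# Hypothetical tile shapes for illustration only:
small_dpas = scratch_bytes((32, 32), (32, 32), 4)    # 4096 bytes
large_dpas = scratch_bytes((64, 128), (32, 32), 4)   # 32768 bytes with a larger repCluster
print(small_dpas, large_dpas)
```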

@LiyangLingIntel (Contributor, Author) commented Sep 14, 2024

The root cause of this issue is that the large 2D load with a large repCluster requires a large amount of shared local memory for the ConvertLayout op when converting the DPAS layout to a blocked layout. Minor changes to the allocation pass do not help.
After syncing with @chengjunlu, we think:

  • One possible solution is to remove the ConvertLayout op for stream-k and split-k by propagating the DPAS layout to the atomic op.
  • The other is to reduce the repCluster size when we find a ConvertLayout op that cannot be removed.
    We can use this as a short-term workaround. In the long term, this issue should go away once we leverage the linear layout.

I'm prototyping the first approach. If some common ops make the ConvertLayout op irremovable, I'll switch to the second approach: add another pass that detects this case and reduces the large 2D load size, so that things work functionally first.
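For reference, a minimal sketch of the split-k pattern in question (the kernel name, signature, and block sizes are illustrative, not taken from gemm_splitk_benchmark.py; masks are omitted for brevity). The accumulator produced by tl.dot carries the DPAS/mma layout, and the tl.atomic_add at the end is the atomic_rmw op referred to above; without layout propagation, a convert_layout (with its SLM scratch buffer) is inserted between the two:

```python
# Hedged sketch of a split-k GEMM tile; assumes M, N, K are multiples of the block sizes.
import triton
import triton.language as tl

@triton.jit
def splitk_gemm_tile(a_ptr, b_ptr, c_ptr, M, N, K,
                     stride_am, stride_ak, stride_bk, stride_bn,
                     stride_cm, stride_cn,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                     BLOCK_K: tl.constexpr, SPLIT_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    pid_k = tl.program_id(2)  # split-k slice handled by this program

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K * SPLIT_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)  # lowered to DPAS; acc carries the dpas layout
        a_ptrs += BLOCK_K * SPLIT_K * stride_ak
        b_ptrs += BLOCK_K * SPLIT_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    # Each split-k slice accumulates into C. This atomic_rmw is where the
    # dpas -> blocked convert_layout (and its SLM scratch) used to be inserted.
    tl.atomic_add(c_ptrs, acc)
```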

LiyangLingIntel linked a pull request Sep 23, 2024 that will close this issue
LiyangLingIntel added a commit that referenced this issue Sep 27, 2024
This change helps resolve issue [#1716](#1716).
Propagating the mma layout from the dot op to the atomic_rmw op helps eliminate the `convert_layout` op from/to the large mma layout, which requires oversized shared memory.