
AllocateSharedMemoryPass may allocate an SLM size greater than the device's max shared memory #1716

Closed
LiyangLingIntel opened this issue Jul 29, 2024 · 3 comments · Fixed by #2312
Labels: bug (Something isn't working), codegen: mlir

Comments

@LiyangLingIntel (Contributor)

Running GEMM kernels such as gemm_splitk_benchmark.py on the latest llvm-target branch fails with:

triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 266240, Hardware limit: 131072.
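For context, the check that raises this error is simply a comparison of the SLM the compiler reserved against the device limit. A minimal sketch, using the numbers from the traceback above and assuming the `OutOfResources(required, limit, name)` constructor from `triton.runtime.errors` (the constructor signature is an assumption, not verified against this branch):

```python
# Minimal sketch of the check behind the error above; the numbers come from the
# traceback, and the OutOfResources(required, limit, name) signature is an
# assumption about the triton.runtime.errors API.
from triton.runtime.errors import OutOfResources

required_slm = 266240    # bytes reserved by AllocateSharedMemoryPass for this kernel
hardware_limit = 131072  # max shared local memory per workgroup on this device

if required_slm > hardware_limit:
    raise OutOfResources(required_slm, hardware_limit, "shared memory")
```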
@chengjunlu (Contributor)

Information from @whitneywhtsang

FYI: The failures reported in CI are due to enabling the large 2D block load (a74da7d).

@LiyangLingIntel (Contributor, Author) commented Sep 3, 2024

This task is still in progress.
This issue may be partially caused by the fact that we use a larger DPAS layout with repCluster, while the allocation analysis algorithm assumes the NVIDIA MMA layout, so the calculated scratch buffer size differs.
On the other hand, the original allocation algorithm does not seem perfect either; it can allocate oversized shared memory. More investigation is needed to find an appropriate solution.
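A hedged, simplified model of why the conversion scratch grows with the layout's tile shape. This is not Triton's actual allocation analysis; the tile shapes and the plain per-dimension max are illustrative only, used to build intuition:

```python
# Simplified model (not Triton's allocation analysis): the conversion scratch
# buffer has to hold roughly one tile whose shape is the per-dimension maximum
# of the source and destination layouts' tile shapes, so enlarging the DPAS
# repCluster enlarges the scratch buffer.
def scratch_bytes(src_tile, dst_tile, elem_bytes):
    rep = [max(s, d) for s, d in zip(src_tile, dst_tile)]
    n = 1
    for r in rep:
        n *= r
    return n * elem_bytes

# Hypothetical tile shapes for illustration only:
small_dpas = scratch_bytes((32, 32), (32, 32), 4)    # 4096 bytes
large_dpas = scratch_bytes((64, 128), (32, 32), 4)   # 32768 bytes with a larger repCluster
print(small_dpas, large_dpas)
```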

@LiyangLingIntel (Contributor, Author) commented Sep 14, 2024

The root cause of this issue is that the large 2D load with a large repCluster requires a large amount of shared local memory for the ConvertLayout op when converting the DPAS layout to a blocked layout. Minor changes to the allocation pass do not help.
After syncing with @chengjunlu, we think:

  • One possible solution is to remove the ConvertLayout op for stream-k and split-k by propagating the DPAS layout to the atomic op.
  • The other is to reduce the repCluster size when we find a ConvertLayout op that cannot be removed.
    We can use this as a short-term workaround. In the long term, this issue should go away once we leverage the linear layout.

I'm prototyping the first approach. If some common ops make the ConvertLayout op irremovable, I'll switch to the second approach: add another pass that detects this case and reduces the large 2D load size, so that things work functionally first.
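For reference, a minimal sketch of the split-k pattern in question (the kernel name, signature, and block sizes are illustrative, not taken from gemm_splitk_benchmark.py; masks are omitted for brevity). The accumulator produced by tl.dot carries the DPAS/mma layout, and the tl.atomic_add at the end is the atomic_rmw op referred to above; without layout propagation, a convert_layout (with its SLM scratch buffer) is inserted between the two:

```python
# Hedged sketch of a split-k GEMM tile; assumes M, N, K are multiples of the block sizes.
import triton
import triton.language as tl

@triton.jit
def splitk_gemm_tile(a_ptr, b_ptr, c_ptr, M, N, K,
                     stride_am, stride_ak, stride_bk, stride_bn,
                     stride_cm, stride_cn,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                     BLOCK_K: tl.constexpr, SPLIT_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    pid_k = tl.program_id(2)  # split-k slice handled by this program

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K * SPLIT_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)  # lowered to DPAS; acc carries the dpas layout
        a_ptrs += BLOCK_K * SPLIT_K * stride_ak
        b_ptrs += BLOCK_K * SPLIT_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    # Each split-k slice accumulates into C. This atomic_rmw is where the
    # dpas -> blocked convert_layout (and its SLM scratch) used to be inserted.
    tl.atomic_add(c_ptrs, acc)
```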

LiyangLingIntel linked a pull request Sep 23, 2024 that will close this issue
LiyangLingIntel added a commit that referenced this issue Sep 27, 2024
This change helps resolve issue [#1716](#1716).
Propagating the mma layout from the dot op to the atomic_rmw op helps eliminate the `convert_layout` op from/to the large mma layout, which requires oversized shared memory.