[Performance] The GEMM performance with the column major B matrix is not as good as row major B matrix. #2354

chengjunlu · 2024-09-26T00:09:55Z

The performance gap was found in #2347.

We need to investigate the root cause of the performance drop in the column-major B matrix case.
It is roughly 1.5x slower than the row-major B matrix case.

(I): Detected 7680 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 0 spills
✅ Triton and Torch match
Time for torch: 0.31633758544921875 ms
Time for triton: 0.44517597556114197 ms
Compute A x B.T
OpenCL API not available for this operation
OpenCL API not available for this operation
OpenCL API not available for this operation
OpenCL API not available for this operation
(I): Detected 7680 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 0 spills
✅ Triton and Torch match
Time for torch: 0.3375360071659088 ms
Time for triton: 0.6348815560340881 ms
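
For reference, the slowdown implied by the timings above can be computed with a short script. This is only a sketch and not part of the original benchmark. It assumes that the timings printed before the "Compute A x B.T" line belong to the row-major B run; the log itself does not label that run, and `parse_timings` is a hypothetical helper name.

```python
import re

# Timing lines copied from the log above.
LOG = """\
Time for torch: 0.31633758544921875 ms
Time for triton: 0.44517597556114197 ms
Compute A x B.T
Time for torch: 0.3375360071659088 ms
Time for triton: 0.6348815560340881 ms
"""


def parse_timings(log: str) -> dict:
    """Return {case: {backend: milliseconds}} for the two runs in the log."""
    timings = {"A x B": {}, "A x B.T": {}}
    # Assumption: everything before "Compute A x B.T" is the row-major B run.
    case = "A x B"
    for line in log.splitlines():
        if line.startswith("Compute A x B.T"):
            case = "A x B.T"
            continue
        match = re.match(r"Time for (\w+): ([\d.]+) ms", line)
        if match:
            timings[case][match.group(1)] = float(match.group(2))
    return timings


timings = parse_timings(LOG)
triton_slowdown = timings["A x B.T"]["triton"] / timings["A x B"]["triton"]
torch_slowdown = timings["A x B.T"]["torch"] / timings["A x B"]["torch"]

print(f"triton: {triton_slowdown:.2f}x slower with B.T")
print(f"torch:  {torch_slowdown:.2f}x slower with B.T")
```

Under that assumption, the Triton kernel loses noticeably more time on the transposed case than the Torch baseline does, which is the gap this issue tracks.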

Egor-Krivov · 2024-10-04T08:44:10Z

I think this issue is essential for GEMM perf. Weights are very often stored with the K dimension last. Even the PyTorch linear layer does that: weight (torch.Tensor): the learnable weights of the module, of shape (out_features, in_features)

https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
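
To make the layout point concrete, here is a minimal pure-Python sketch of the strides involved. It does not use PyTorch; `row_major_strides` and `transpose_2d` are hypothetical helpers, and the layer sizes are made up for illustration. The only fact taken from the PyTorch docs is that the weight has shape (out_features, in_features) and that the layer computes x @ weight.T.

```python
def row_major_strides(shape):
    """Element strides of a contiguous row-major array with the given shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def transpose_2d(shape, strides):
    """A 2-D transpose is a view: swap shape and strides, move no data."""
    return (shape[1], shape[0]), (strides[1], strides[0])


# Hypothetical layer sizes, chosen only for illustration.
in_features, out_features = 4096, 1024  # K, N

# The weight is stored as (N, K), so K is the contiguous (last) dimension.
weight_shape = (out_features, in_features)
weight_strides = row_major_strides(weight_shape)

# The forward pass computes x @ weight.T, so the B operand has shape (K, N).
b_shape, b_strides = transpose_2d(weight_shape, weight_strides)

print("weight:      ", weight_shape, weight_strides)
print("B = weight.T:", b_shape, b_strides)
```

The B operand ends up with unit stride along K instead of N, which is the column-major B case from this issue.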

alexbaden · 2024-10-11T02:16:16Z

Adding to this, if the A matrix is column-major we have similar problems.

Egor-Krivov · 2024-10-11T13:33:52Z

We now have microbenchmarks to track this performance. Currently the GeoMean for onednn is ~90-100 TFLOPs for both A.T@B and A@B.T.

A@B.T for triton currently stands at ~60 TFLOPs. Dashboard gemm-bt
A.T@B for triton currently stands at ~30 TFLOPs; it improved significantly and was ~15 TFLOPs recently. Dashboard gemm-at

So onednn is about 1.5 times faster for B.T and about 3 times faster for A.T.
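
The ratios follow directly from the approximate GeoMean numbers quoted in this comment. A quick sketch of the arithmetic, using the rounded figures as given (the variable names are illustrative only):

```python
# Approximate GeoMean throughput in TFLOPs, as quoted in this comment.
onednn_range = (90.0, 100.0)  # same range reported for both cases
triton = {"B.T": 60.0, "A.T": 30.0}

# Speedup of onednn over triton at the low and high end of the onednn range.
speedups = {
    case: (onednn_range[0] / tflops, onednn_range[1] / tflops)
    for case, tflops in triton.items()
}

for case, (low, high) in speedups.items():
    print(f"{case}: onednn is {low:.2f}x to {high:.2f}x faster than triton")
```

The 1.5x and 3x figures correspond to the low end of the onednn range.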

Egor-Krivov · 2024-10-11T13:35:04Z

@alexbaden Should we change the title to reflect the issue with A.T as well, or create a separate issue for that case?

chengjunlu mentioned this issue Sep 26, 2024

Improve GEMM perf when one matrix is transposed #2347

Merged

vlad-penkin added the performance and enhancement labels Sep 27, 2024

vlad-penkin added this to the 4.0 [Performance] Core milestone Sep 27, 2024

Egor-Krivov mentioned this issue Oct 4, 2024

[Benchmarks] Add microbenchmark with A@B^t #2414

Closed

alexbaden mentioned this issue Oct 11, 2024

[GEMM-perf] matmul is slower when one input needs to be transposed #1795

Closed

vlad-penkin assigned alexbaden Oct 15, 2024
