We need to investigate the root cause of the performance drop in the column-major B matrix case. It is roughly 1.5x slower than the row-major B matrix case.
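For context, the "column-major B" case is what you get when B is stored row-major with the K dimension last and the kernel consumes `B.T`. A minimal NumPy sketch of the layout (variable names here are illustrative, not from the benchmark code):

```python
import numpy as np

# Illustrative shapes only; the real benchmark uses large GEMM sizes.
M, K, N = 4, 8, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((M, K))
B = rng.standard_normal((N, K))  # row-major storage with K last

# B.T is a column-major (Fortran-ordered) view: the strides are swapped,
# so the K dimension is no longer contiguous in memory. This is the
# layout the slow kernel path has to load.
Bt = B.T
assert Bt.flags["F_CONTIGUOUS"] and not Bt.flags["C_CONTIGUOUS"]

C = A @ Bt  # the A x B.T case from the logs below
assert C.shape == (M, N)
```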
(I): Detected 7680 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 0 spills
✅ Triton and Torch match
Time for torch: 0.31633758544921875 ms
Time for triton: 0.44517597556114197 ms
Compute A x B.T
OpenCL API not available for this operation
OpenCL API not available for this operation
OpenCL API not available for this operation
OpenCL API not available for this operation
(I): Detected 7680 spills, recompiling the kernel using large GRF mode
(I): Kernel has now 0 spills
✅ Triton and Torch match
Time for torch: 0.3375360071659088 ms
Time for triton: 0.6348815560340881 ms
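The ~1.5x figure can be checked directly from the Triton timings quoted in the two logs above:

```python
# Triton timings quoted in the logs above (ms).
triton_row_major_b = 0.44517597556114197  # A x B case
triton_col_major_b = 0.6348815560340881   # A x B.T case

# Ratio of the column-major B case to the row-major B case.
slowdown = triton_col_major_b / triton_row_major_b
assert 1.4 < slowdown < 1.5  # roughly the 1.5x gap described above
```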
I think this issue is essential for GEMM perf. Very often weights are stored with the K dimension last. Even PyTorch's linear layer does that: weight (torch.Tensor) – the learnable weights of the module, of shape (out_features, in_features).
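That weight layout means a linear layer's forward pass is effectively `x @ W.T`, i.e. exactly the column-major B case. A NumPy sketch under that assumption (shapes are illustrative):

```python
import numpy as np

# nn.Linear stores weight with shape (out_features, in_features), so the
# K dimension (in_features) is last and the forward pass multiplies by
# the transpose: y = x @ W.T + b.
batch, in_features, out_features = 2, 8, 4
rng = np.random.default_rng(1)
x = rng.standard_normal((batch, in_features))
W = rng.standard_normal((out_features, in_features))  # PyTorch layout
b = rng.standard_normal(out_features)

y = x @ W.T + b  # the B.T GEMM this issue is about
assert y.shape == (batch, out_features)
```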
We now have microbenchmarks to track this performance. Currently the GeoMean for onednn is ~90-100 TFLOPs for both the A.T@B and [email protected] cases.
For Triton, [email protected] currently stands at ~60 TFLOPs (dashboard: gemm-bt). A.T@B currently stands at ~30 TFLOPs (dashboard: gemm-at); it has improved significantly and was ~15 TFLOPs recently.
So onednn is 1.5x faster for the B.T case and 3x faster for the A.T case.
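Those ratios follow from the GeoMean figures quoted above (taking ~90 TFLOPs as the onednn baseline):

```python
# Approximate GeoMean throughput figures quoted above (TFLOPs).
onednn = 90.0      # lower bound of the ~90-100 range, both cases
triton_bt = 60.0   # [email protected] for Triton
triton_at = 30.0   # A.T@B for Triton (recently up from ~15)

assert onednn / triton_bt == 1.5  # onednn 1.5x faster for B.T
assert onednn / triton_at == 3.0  # onednn 3x faster for A.T
```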
The performance gap is tracked in #2347