squash xblock for persistent inner reduction #102444
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102444. Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 235b88e. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge failed. Reason: This PR needs a […] label. If not, please add the […] label. To add a label, you can comment to pytorchbot, for example […]. For more information, see […]. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: trunk / linux-focal-rocm5.4.2-py3.8 / test (default, 1, 3, linux.rocm.gpu). Details for Dev Infra team: raised by workflow job.
cc @jataylo, I had to exclude ROCm from this optimization, because ROCm is on an old Triton version that doesn't have `tl.reduce`.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Thanks for the heads up @ngimel. We hit a blocker that stopped us updating Triton for a while to bring in `tl.reduce`. I'll remove the conditionalisation of this commit in the PR that bumps our Triton commit, and add you as a reviewer. cc: @dllehr-amd
Revert the aten.prod explicit fallback on ROCm, enabling the use of tl.reduce in triton codegen. This PR also re-enables an optimisation that was previously conditionalised out for ROCm in #102444. Pull Request resolved: #104099. Approved by: https://github.com/peterbell10, https://github.com/malfet
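For context on what `tl.reduce` provides: it is Triton's generalized reduction primitive, which folds an axis of a tensor with a user-supplied combine function (this is what lets Inductor generate a native product reduction instead of falling back for `aten.prod`). As an illustrative sketch only (not Triton code, and `reduce_like_tl_reduce` is a hypothetical name), the semantics can be mimicked in NumPy:

```python
import numpy as np

def reduce_like_tl_reduce(x, axis, combine):
    # Analogue of Triton's tl.reduce: fold the given axis with a
    # user-supplied associative combine function, applied pairwise
    # slice by slice along that axis.
    out = np.take(x, 0, axis=axis)
    for i in range(1, x.shape[axis]):
        out = combine(out, np.take(x, i, axis=axis))
    return out

x = np.arange(12.0).reshape(3, 4)
# Product reduction along the rows, like a prod codegen'd via tl.reduce.
prod = reduce_like_tl_reduce(x, axis=1, combine=lambda a, b: a * b)
assert np.allclose(prod, x.prod(axis=1))
```

In real generated Triton code the combine function is a JIT-compiled scalar function and the fold happens in parallel within a block, but the axis-plus-combine contract is the same.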
Currently layer norm kernel performance is pretty bad due to a Triton perf bug (https://gist.github.com/ngimel/c1e7f70f8268f038e710e835b0065f63), but since XBLOCK for a persistent reduction is always 1, we can simply drop this dimension and operate on 1D tensors, which improves layer norm kernel performance considerably.

Perf results: http://hud.pytorch.org/benchmark/compilers?startTime=Mon%2C%2022%20May%202023%2001%3A27%3A25%20GMT&stopTime=Mon%2C%2029%20May%202023%2001%3A27%3A25%20GMT&suite=torchbench&mode=training&dtype=amp&lBranch=ngimel/persistent_1d&lCommit=1d5175f5e682f37aae15fd217bc3767e1788bacf&rBranch=main&rCommit=c9f4f01981fd73fcc7c27676cc50230cd1b5bc22, approx 4% on hf.
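To illustrate the shape transformation (this is a NumPy sketch of the idea, not the actual Triton code Inductor emits; the function names are hypothetical), a persistent layer-norm reduction over one row can be indexed either as a [XBLOCK=1, RBLOCK] tile or, with the size-1 dimension squashed, as a plain 1D array — the results are identical, but the 1D form avoids 2D indexing in the generated kernel:

```python
import numpy as np

def layernorm_row_2d(x_row, eps=1e-5):
    # Persistent reduction over a [XBLOCK=1, RBLOCK] tile: the whole
    # reduction dimension is resident at once, indexed as 2D.
    tile = x_row[None, :]                        # shape [1, R]
    mean = tile.mean(axis=1, keepdims=True)      # shape [1, 1]
    var = ((tile - mean) ** 2).mean(axis=1, keepdims=True)
    return ((tile - mean) / np.sqrt(var + eps))[0]

def layernorm_row_1d(x_row, eps=1e-5):
    # Same computation with the size-1 XBLOCK dimension dropped:
    # everything is 1D, which is what this PR makes the codegen emit.
    mean = x_row.mean()
    var = ((x_row - mean) ** 2).mean()
    return (x_row - mean) / np.sqrt(var + eps)

row = np.random.default_rng(0).standard_normal(512).astype(np.float32)
assert np.allclose(layernorm_row_2d(row), layernorm_row_1d(row), atol=1e-5)
```

The squash is only valid because a persistent reduction holds the entire reduction axis in one block, so XBLOCK is guaranteed to be 1 and the leading dimension carries no information.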
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10