try to make a ROCm big kernel reproducer #1477
Comments
Just for future reference: this appears to depend on whether we run the primordial_chem network from the application code or standalone (https://github.com/AMReX-Astro/Microphysics/tree/main/unit_test/burn_cell_primordial_chem). The latter does not trigger the memory fault we see when running the network in Quokka.
we should try
Turning off force-inlining of all functions in the kernel appears to fix the memory issues in Castro. We could try something like
@psharda found that turning off force-inlining also fixes the memory issues in Quokka.
More context on the compiler bug here: https://discourse.llvm.org/t/how-to-verify-correct-regalloc-for-a-kernel/80811
TL;DR: the underlying issue is well understood by the compiler developers, and it is supposed to be fixed by this LLVM PR: llvm/llvm-project#93526
Should we close this issue?
Very nice. Does this mean a future ROCm version will have this fix?
Since the ROCm compiler is derived from the upstream LLVM sources, I think, in principle, yes. No idea when this will be.
closing this since it seems to be recognized to be an LLVM bug with a PR fix
ROCm seems to have trouble with large kernels, leading to memory issues. We can try to create a reproducer using `test_react`, starting with a small net and making bigger and bigger nets (via pynucastro) until we find a size that breaks things. We might also be able to strip out neutrinos, the EOS, and other bits.
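The "bigger and bigger nets" search could be driven by a small script. The following is only a sketch: `first_failing_size` and `runs_ok` are hypothetical names, and `runs_ok` is a stand-in for whatever actually generates a net of a given size with pynucastro, builds `test_react` for ROCm, runs it, and reports whether it finished without a memory fault. None of those steps are implemented here.

```python
from typing import Callable, Optional, Sequence


def first_failing_size(
    sizes: Sequence[int], runs_ok: Callable[[int], bool]
) -> Optional[int]:
    """Try each network size in order and return the first one that fails.

    `sizes` is the list of network sizes (number of nuclei) to try, smallest
    first. `runs_ok(n)` is a hypothetical callback that should return True if
    the test ran cleanly with an n-nucleus net and False if it broke.
    Returns None if every size ran cleanly.
    """
    for n in sizes:
        if not runs_ok(n):
            return n
    return None


if __name__ == "__main__":
    # Fake predicate for illustration only: pretend nets of 30 or more
    # nuclei trigger the fault. A real predicate would build and run the test.
    def fake_runs_ok(n: int) -> bool:
        return n < 30

    print(first_failing_size([8, 16, 32, 64], fake_runs_ok))
```

A plain linear scan is used rather than bisection because it is not known that failures are monotonic in network size. Once a failing size is found, the same loop can be rerun on that net with neutrinos, the EOS, and other bits stripped out one at a time.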