__shfl_sync instructions may have wrong member mask #854
This had a PR (#155). It looks like it was closed during the reorganizations. We should resurrect the PR and rebase it on …
@jglaser I know this issue is several years old, but do you happen to have some code that reproduces this or a recollection of how this situation occurred for you? I'll summarize your report to make sure I understand it:
Basically, this: https://www.godbolt.org/z/T6jrx75M6. I'm trying to find a repro where the compiler would implement the branch using an execution mask instead of branching so I can test this, but it's not clear how this would happen.
We would need to get a reproducer on this to keep this issue open. @jglaser, could you provide a recent reproducer?
I am afraid I am no longer actively working on the CUB code inside HOOMD-blue that would allow me to test the above behavior again. Feel free to close; if I remember correctly, it may have been a false positive with cuda-memcheck. If not, we can always reopen later.
When using `WarpScanShfl` from `warp_scan_shfl.cuh` inside a `while()` loop and in conjunction with a sub-warp `LOGICAL_WARP_THREADS` argument, i.e. `LOGICAL_WARP_THREADS = 2^n` with `n < 5`, I get lots of errors from `cuda-memcheck --tool synccheck` complaining about illegal `__syncwarp()` usage.
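To make the reported pattern concrete, here is a minimal sketch of the shape described above. This is not the HOOMD-blue code; it uses the public `cub::WarpScan` front end rather than the internal `WarpScanShfl` directly, and the kernel and variable names are made up for illustration.

```cuda
#include <cub/cub.cuh>

// Hypothetical reproducer shape: a sub-warp scan (LOGICAL_WARP_THREADS = 4)
// issued inside a data-dependent while() loop, so different lanes may leave
// the loop after different numbers of iterations.
__global__ void subwarp_scan_in_loop(const int* counts, int* out, int n)
{
    constexpr int LOGICAL_WARP_THREADS = 4;  // sub-warp: 2^n with n < 5
    using WarpScanT = cub::WarpScan<int, LOGICAL_WARP_THREADS>;

    // One TempStorage per logical warp in a 128-thread block.
    __shared__ typename WarpScanT::TempStorage temp_storage[128 / LOGICAL_WARP_THREADS];

    const int tid          = blockIdx.x * blockDim.x + threadIdx.x;
    const int logical_warp = threadIdx.x / LOGICAL_WARP_THREADS;

    int remaining = (tid < n) ? counts[tid] : 0;
    int acc       = 0;

    // Lanes with smaller counts[] exit this loop earlier than their sub-warp
    // neighbours; the report is about what the compiler does in that case.
    while (remaining > 0)
    {
        int scanned;
        WarpScanT(temp_storage[logical_warp]).InclusiveSum(1, scanned);
        acc += scanned;
        --remaining;
    }

    if (tid < n)
        out[tid] = acc;
}
```

Note that CUB expects every lane of a logical warp to participate in the scan together, so whether this pattern actually misbehaves depends on how the compiler lowers the divergent loop, which is what the Godbolt experiment above is probing.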
I believe the root cause is the following. `WarpScanShfl` sets its `member_mask` for the `shfl_sync` to reflect the sub-warp membership. However, if some threads exit early, the compiler may predicate off this initialization statement, leaving `member_mask` in an invalid state. Later, when the PTX instruction `shfl.sync.idx.b32` is hit, it is executed without a predicate (such as `@p`) and thus with a wrong mask. cuda-memcheck then finds that the executing warp lane performs an implicit `syncwarp` without its mask bits set, and issues an error, as documented here: https://docs.nvidia.com/cuda/cuda-memcheck/index.html#synccheck-demo-illegal-syncwarp
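For intuition, the following is a rough stand-in for how such a sub-warp member mask is typically computed and then consumed by the shuffle; it is not the literal `WarpScanShfl` source, and `subwarp_inclusive_sum` is a made-up name.

```cuda
// Simplified illustration of a sub-warp inclusive scan built on __shfl_up_sync.
// Assumes logical_warp_threads is a power of two smaller than 32.
__device__ int subwarp_inclusive_sum(int input, int logical_warp_threads)
{
    const unsigned lane         = threadIdx.x & 31u;
    const unsigned subwarp_base = lane & ~(unsigned)(logical_warp_threads - 1);

    // One bit per lane of this logical (sub-)warp, e.g. 0x000000F0 for lanes
    // 4..7 when logical_warp_threads == 4. The report's concern: if this
    // assignment is predicated off for lanes that exited early, member_mask
    // holds garbage when the unpredicated shfl.sync below executes.
    const unsigned member_mask =
        ((1u << logical_warp_threads) - 1u) << subwarp_base;

    // Kogge-Stone inclusive scan across the logical warp.
    for (int offset = 1; offset < logical_warp_threads; offset <<= 1)
    {
        const int peer = __shfl_up_sync(member_mask, input, offset, logical_warp_threads);
        if ((lane & (logical_warp_threads - 1)) >= (unsigned)offset)
            input += peer;
    }
    return input;
}
```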
The safe solution would be to always use the full mask (`0xffffffffu`) to synchronize the entire warp. I realize this may not fully take advantage of Volta's independent thread scheduling. However, if that were the goal, I think the CUB API would have to expose the `member_mask` somehow to allow the user to set it, so that it is possible to issue e.g. a `ballot_sync` outside CUB first and then pass the member mask to CUB, as discussed here: https://devblogs.nvidia.com/using-cuda-warp-level-primitives/

I will submit a pull request shortly for this solution.