Thrust: providing the error messages about the lack of GPU or a GPU with an incompatible architecture #1848
Can one of the admins verify this patch? Admins can comment
This is essentially what I was thinking for how to solve this problem. I don't know if the details are correct or if the code is as efficient as it should be, but having Thrust detect a bad GPU before invoking any kernels and throwing an exception with a good message is the correct approach.
Co-authored-by: Michael Schellenberger Costa <[email protected]>
@zkhatami, @dkolsen-pgi I've made a few adjustments to make it work on both the CPU and GPU sides. I've also changed the signature, since the semantics of a function that returns an optional but always either throws or returns a value are controversial. Please take a look and let me know if you agree with the changes.
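The signature concern can be illustrated with a small sketch. The PR's actual signatures are not shown in this conversation, so the function names and parameters below are hypothetical and only contrast the two shapes: an optional return type that never yields an empty value, versus a plain return type that makes the "value or exception" contract explicit.

```cpp
#include <optional>
#include <stdexcept>

// Hypothetical names and parameters; not the PR's real code.

// Shape A: the return type suggests "may be empty", but the function
// never returns std::nullopt. It either throws or returns a value,
// so callers are misled into checking has_value().
inline std::optional<int> matching_arch_optional(int device_cc, int compiled_cc)
{
  if (device_cc != compiled_cc) {
    throw std::runtime_error("incompatible GPU");
  }
  return device_cc;
}

// Shape B: the return type states the real contract: a value on
// success, an exception otherwise.
inline int matching_arch(int device_cc, int compiled_cc)
{
  if (device_cc != compiled_cc) {
    throw std::runtime_error("incompatible GPU");
  }
  return device_cc;
}
```

With shape B a caller writes `int cc = matching_arch(80, 80);` and handles failure only through the exception path, with no empty-value check to forget or to add needlessly.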
run tests
Looks good to me with a minor request for simplification
run tests
run tests
Looks good to me as well. Thanks!
run tests
This change gives the user some clue about what is happening when a program is compiled on a node with no GPU, or is compiled for a different compute capability than that of the GPU it runs on. Previously, neither scenario produced a useful error message. The proposed changes improve the user experience and make these problems easier to troubleshoot.
This fix addresses issue #1785 reported on Thrust (NVIDIA/cccl#818).
From issue #1785 on Thrust (NVIDIA/cccl#818), for this small test case:
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    int main() {
      thrust::device_vector<int> dv;
      thrust::sort(dv.begin(), dv.end());
    }
when it is compiled with -gpu=cc60 and then run on a system with cc80, the error message is:
    terminate called after throwing an instance of 'thrust::system::system_error'
      what():  radix_sort: failed on 1st step: cudaErrorUnsupportedPtxVersion: the provided PTX was compiled with an unsupported toolchain.
    Aborted
This doesn't help the user understand what is happening. I tried to address it in this change so that a better message shows up:
Incompatible GPU: you are trying to run this program on sm_80, different from the one that it was compiled for.
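
As a rough illustration of the idea (not the code in this PR), the check can be thought of as a small function that runs before any kernel launch and throws with a descriptive message. The helper name, its parameters, and the "No GPU detected" wording below are assumptions made for the sketch. In real code the inputs would come from CUDA runtime queries such as `cudaGetDeviceCount` and `cudaDeviceGetAttribute`; here they are passed in as plain integers so the logic stands on its own. Treating any architecture mismatch as incompatible is also a simplification.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper; not Thrust's actual implementation.
//   device_count: number of visible GPUs (e.g. from cudaGetDeviceCount)
//   device_cc:    compute capability of the current GPU, e.g. 80 for sm_80
//   compiled_cc:  compute capability the binary was built for, e.g. 60
// Simplifying assumption: any mismatch between device_cc and compiled_cc
// is reported as incompatible.
inline void check_gpu_compatibility(int device_count, int device_cc, int compiled_cc)
{
  if (device_count <= 0) {
    // Message wording is an assumption for this sketch.
    throw std::runtime_error(
        "No GPU detected: this program requires a CUDA-capable GPU.");
  }
  if (device_cc != compiled_cc) {
    throw std::runtime_error(
        "Incompatible GPU: you are trying to run this program on sm_" +
        std::to_string(device_cc) +
        ", different from the one that it was compiled for.");
  }
}
```

Calling `check_gpu_compatibility(1, 80, 60)` throws an exception whose text matches the message quoted above, while `check_gpu_compatibility(1, 80, 80)` returns normally.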