This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Thrust: providing the error messages about the lack of GPU or a GPU w… #1848

Merged
merged 6 commits into from
Jan 30, 2023


zkhatami
Contributor

This change gives the user some clue about what is happening when a program is compiled on a node with no GPU, or when it is compiled for a different compute capability than that of the GPU it runs on. Neither scenario produced a useful error message before. The proposed changes make these problems easier for users to troubleshoot.

This fix addresses issue #1785 reported on Thrust (NVIDIA/cccl#818).

Issue #1785 on Thrust (NVIDIA/cccl#818) gives this small test case:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    int main() {
      thrust::device_vector<int> dv;
      thrust::sort(dv.begin(), dv.end());
    }

When this is compiled with -gpu=cc60 and then run on a system with a cc80 GPU, the error message is:

    terminate called after throwing an instance of 'thrust::system::system_error'
      what():  radix_sort: failed on 1st step: cudaErrorUnsupportedPtxVersion: the provided PTX was compiled with an unsupported toolchain.
    Aborted

This doesn't help the user understand what's happening. With this change, a clearer message is shown instead:
Incompatible GPU: you are trying to run this program on sm_80, different from the one that it was compiled for.
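
For readers who want to see the shape of such a check, here is a minimal host-only sketch. It is not the code from this PR: `ComputeCapability`, `incompatible_gpu_message`, and `check_compatible` are hypothetical names, and the sketch assumes an exact architecture match is required, which is stricter than CUDA's real compatibility rules. It only shows how the message quoted above could be assembled from the device's compute capability.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical type for this sketch; not part of Thrust.
struct ComputeCapability {
    int cc_major;
    int cc_minor;
};

// Formats the message quoted above, e.g. {8, 0} yields "... on sm_80, ...".
std::string incompatible_gpu_message(ComputeCapability device) {
    return "Incompatible GPU: you are trying to run this program on sm_" +
           std::to_string(device.cc_major) + std::to_string(device.cc_minor) +
           ", different from the one that it was compiled for.";
}

// Throws if the device architecture differs from the compile target.
// Assumption: any mismatch counts as incompatible (a simplification).
void check_compatible(ComputeCapability compiled, ComputeCapability device) {
    if (compiled.cc_major != device.cc_major ||
        compiled.cc_minor != device.cc_minor) {
        throw std::runtime_error(incompatible_gpu_message(device));
    }
}
```

Calling `check_compatible({6, 0}, {8, 0})` mirrors the cc60-on-cc80 scenario from the issue and throws with the sm_80 message.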

@GPUtester
Collaborator

Can one of the admins verify this patch?

Admins can comment "ok to test" to allow this one PR to run, or "add to allowlist" to allow all future PRs from the same author to run.

@dkolsen-pgi
Collaborator

This is essentially what I had in mind for solving this problem. I don't know if the details are correct or if the code is as efficient as it should be, but having Thrust detect a bad GPU before invoking any kernels and throw an exception with a good message is the correct approach.
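
The "check before any kernel runs" ordering can be sketched in plain C++ as below. This is only an illustration under stated assumptions: `DeviceStatus`, `throw_if_unusable`, and `guarded_launch` are hypothetical names, the message strings are placeholders rather than the ones Thrust uses, and the callable stands in for a real kernel launch.

```cpp
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for what a runtime device query would report.
struct DeviceStatus {
    bool gpu_present;
    bool arch_supported;
};

// Throws a descriptive error if the device cannot run the program.
// The message texts are illustrative only.
inline void throw_if_unusable(const DeviceStatus& status) {
    if (!status.gpu_present) {
        throw std::runtime_error("No GPU found");
    }
    if (!status.arch_supported) {
        throw std::runtime_error("Incompatible GPU");
    }
}

// Runs the check first; `launch` (standing in for a kernel launch) is
// reached only when the check passes.
template <class Launch>
void guarded_launch(const DeviceStatus& status, Launch&& launch) {
    throw_if_unusable(status);
    std::forward<Launch>(launch)();
}
```

Because the check throws before the callable is invoked, a failing device never reaches the launch, so the user sees the descriptive exception instead of a low-level runtime error from the kernel.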

Co-authored-by: Michael Schellenberger Costa <[email protected]>
Collaborator

@gevtushenko gevtushenko left a comment


@zkhatami, @dkolsen-pgi I've made a few adjustments to make it work on both the CPU and GPU sides. I've also changed the signature, since the semantics of a function that returns an optional but always either throws or returns a value are controversial. Please take a look and let me know if you agree with the changes.
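
To illustrate the signature concern in general terms, here is a small C++17 sketch. The function names and the returned value are hypothetical and do not reflect the actual signatures in this PR. It contrasts a function whose `std::optional` return can never be empty, because the failure path throws, with one that returns the value directly.

```cpp
#include <optional>
#include <stdexcept>

// Hypothetical "before" shape: the optional is never empty on return,
// because failure throws. Callers still have to unwrap it.
std::optional<int> ptx_version_optional(bool device_ok) {
    if (!device_ok) {
        throw std::runtime_error("incompatible GPU");
    }
    return 80;
}

// Hypothetical "after" shape: a plain value on success, an exception
// on failure. The return type no longer suggests an empty state.
int ptx_version_or_throw(bool device_ok) {
    if (!device_ok) {
        throw std::runtime_error("incompatible GPU");
    }
    return 80;
}
```

With the second shape, callers handle exactly one failure channel (the exception) rather than wondering whether an empty optional is also possible.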

@gevtushenko
Collaborator

run tests

Collaborator

@miscco miscco left a comment


Looks good to me, with a minor request for simplification.

thrust/system/cuda/detail/core/util.h (outdated, resolved)
thrust/system/cuda/detail/core/util.h (resolved)
thrust/system/cuda/detail/core/util.h (outdated, resolved)
@gevtushenko
Collaborator

run tests

@gevtushenko
Collaborator

run tests

@zkhatami
Contributor Author

Looks good to me as well. Thanks!

@gevtushenko
Collaborator

run tests

@gevtushenko gevtushenko merged commit bf941ec into NVIDIA:main Jan 30, 2023
@alliepiper alliepiper added this to the 2.1.0 milestone Mar 8, 2023

6 participants