CUDA exception handling #4095

Merged: 22 commits, Jan 29, 2021

@jngrad jngrad commented Jan 15, 2021

Description of changes:

  • convert CUDA error codes into runtime errors using the message associated with the error code
  • properly handle CUDA errors by halting the flow of the program instead of ignoring them (Barnes-Hut, LB GPU)
  • remove superfluous CUDA global variables (partial fix for #2628, "Remove global variables")
  • restore cuda_gather_gpus() functionality and convert the device_list() getter to a regular function (API change)
  • remove unused libcuda dependency (fixes #4085, "-lcuda needs to be removed for cuda-11.2")

This header file can only be used in CUDA source files.
Utility CUDA functions now throw a runtime error wrapping the CUDA
error message when a primitive CUDA function returns an error code.
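
The conversion from error code to exception could look roughly like the sketch below. It is not the code from this PR: `MockCudaError`, `mock_error_string()`, `mock_set_device()`, `cuda_error_sketch` and `CHECK_MOCK_CUDA` are hypothetical stand-ins so that the example compiles without the CUDA toolkit. In a real CUDA source file the error type would be `cudaError_t` and the message would come from `cudaGetErrorString()`.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-ins for the CUDA runtime (assumption: in a real .cu file these are
// cudaError_t, cudaSuccess and cudaGetErrorString() from <cuda_runtime.h>).
enum class MockCudaError { Success = 0, InvalidDevice = 1, OutOfMemory = 2 };

inline const char *mock_error_string(MockCudaError err) {
  switch (err) {
  case MockCudaError::Success:
    return "no error";
  case MockCudaError::InvalidDevice:
    return "invalid device ordinal";
  case MockCudaError::OutOfMemory:
    return "out of memory";
  }
  return "unknown error";
}

// Hypothetical exception type: a runtime error whose what() wraps the
// message associated with the error code, plus the call site.
class cuda_error_sketch : public std::runtime_error {
public:
  cuda_error_sketch(MockCudaError err, const char *file, int line)
      : std::runtime_error(format(err, file, line)), m_code(err) {}
  MockCudaError code() const { return m_code; }

private:
  static std::string format(MockCudaError err, const char *file, int line) {
    std::ostringstream oss;
    oss << "CUDA error \"" << mock_error_string(err) << "\" at " << file
        << ":" << line;
    return oss.str();
  }
  MockCudaError m_code;
};

// Throw instead of silently ignoring a non-success return code.
inline void check_mock_cuda(MockCudaError err, const char *file, int line) {
  if (err != MockCudaError::Success)
    throw cuda_error_sketch(err, file, line);
}

#define CHECK_MOCK_CUDA(call) check_mock_cuda((call), __FILE__, __LINE__)

// Hypothetical primitive that fails for any device other than 0.
inline MockCudaError mock_set_device(int dev) {
  return dev == 0 ? MockCudaError::Success : MockCudaError::InvalidDevice;
}
```

With this in place, `CHECK_MOCK_CUDA(mock_set_device(7));` throws an exception whose `what()` contains "invalid device ordinal", so the caller can no longer continue past a failed call by accident.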

The LB GPU checking function gpu_init_particle_comm() was rewritten
to check the currently selected GPU instead of the one with ID 0.
The warning about insufficient compute capability was converted
to an error message. The warning about default GPU ID was removed.
The Python-specific message was removed. The function now calls
std::abort() upon any error since we cannot recover from it in
the Python interface (the list of actors cannot be cleared due to
a mismatch between `Actor` and `Actors` regarding active actors).
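
The difference between checking device 0 and checking the currently selected device can be illustrated with a small sketch. Everything here is an assumption made for illustration: `MockRuntime`, `MockDeviceProps`, `check_selected_gpu()` and `require_selected_gpu()` are hypothetical names, and the minimum compute capability is passed as a parameter because the description does not state the required value. A real implementation would query the CUDA runtime (for example via `cudaGetDevice()` and `cudaGetDeviceProperties()`).

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for the properties reported by the CUDA runtime.
struct MockDeviceProps {
  std::string name;
  int cc_major; // compute capability, major part
  int cc_minor; // compute capability, minor part
};

// Hypothetical stand-in for the runtime state: known devices and the
// index of the currently selected one.
struct MockRuntime {
  std::vector<MockDeviceProps> devices;
  int current = 0;
};

// Returns an empty string if the *currently selected* device is usable,
// otherwise an error message. Device 0 is never consulted implicitly.
inline std::string check_selected_gpu(const MockRuntime &rt, int min_major,
                                      int min_minor) {
  if (rt.devices.empty())
    return "no CUDA device available";
  if (rt.current < 0 || rt.current >= static_cast<int>(rt.devices.size()))
    return "selected CUDA device does not exist";
  const MockDeviceProps &p = rt.devices[rt.current];
  const bool sufficient =
      p.cc_major > min_major ||
      (p.cc_major == min_major && p.cc_minor >= min_minor);
  if (!sufficient)
    return "GPU " + std::to_string(rt.current) + " (" + p.name +
           ") has compute capability " + std::to_string(p.cc_major) + "." +
           std::to_string(p.cc_minor) + ", but at least " +
           std::to_string(min_major) + "." + std::to_string(min_minor) +
           " is required";
  return "";
}

// Unrecoverable variant: print the error and abort.
inline void require_selected_gpu(const MockRuntime &rt, int min_major,
                                 int min_minor) {
  const std::string msg = check_selected_gpu(rt, min_major, min_minor);
  if (!msg.empty()) {
    std::fprintf(stderr, "%s\n", msg.c_str());
    std::abort();
  }
}
```

Splitting the check from the abort keeps the decision logic testable, while the calling code still terminates on any error.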

The dipolar Barnes-Hut code now exits early upon any CUDA error to
avoid undefined behavior in memory allocation on the GPU.
Also fix the type of the memory field to avoid integer overflow.
Convert regular comments to doxygen comments where appropriate,
document function arguments, remove duplicate doxygen blocks.
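
To illustrate why the type of a memory-size field matters, here is a minimal sketch with hypothetical helper names (`buffer_bytes()`, `fits_in_int()`); it does not reproduce the Barnes-Hut code. A byte count for a large buffer can exceed the range of `int`, so the size is computed and stored as `std::size_t`.

```cpp
#include <cstddef>
#include <limits>

// Number of bytes needed for n_elements items of elem_size bytes each,
// computed in std::size_t rather than int.
inline std::size_t buffer_bytes(std::size_t n_elements,
                                std::size_t elem_size) {
  return n_elements * elem_size;
}

// True if the byte count could have been stored in an int without overflow.
inline bool fits_in_int(std::size_t bytes) {
  return bytes <=
         static_cast<std::size_t>(std::numeric_limits<int>::max());
}
```

For example, 600,000,000 four-byte values need 2,400,000,000 bytes, which is above the maximum of a 32-bit signed `int` (2,147,483,647); `fits_in_int()` returns `false` for that size.
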
@jngrad jngrad removed the BugFix label Jan 15, 2021
Review threads:
src/core/cuda_init.cpp (resolved)
src/core/cuda_utils.cuh (outdated, resolved)
src/core/cuda_utils.cuh (outdated, resolved)
src/python/espressomd/cuda_init.pyx (resolved)
testsuite/python/gpu_availability.py (outdated, resolved)
@jngrad jngrad self-assigned this Jan 29, 2021
@jngrad jngrad added the automerge ("Merge with kodiak") label Jan 29, 2021
@kodiakhq kodiakhq bot merged commit e4d6ecc into espressomd:python Jan 29, 2021
@jngrad jngrad deleted the cuda-exception-handling branch January 29, 2021 14:05