cuda: check for gpu instead of /dev/nvidiactl
Checking for `/dev/nvidiactl` to decide whether the CUDA plugin can be
used is unreliable because in some cases the driver is installed under
a different default path [1]. This patch instead checks whether a GPU
device is available in `/proc/driver/nvidia/gpus/`. This is a more
accurate indicator, and the subsequent check for the `--action` option
confirms whether the NVIDIA driver supports checkpoint/restore.

[1] https://github.com/NVIDIA/gpu-operator

Fixes: #2509

Signed-off-by: Radostin Stoyanov <[email protected]>
rst0git committed Nov 8, 2024
1 parent 216d804 commit d7860aa
Showing 2 changed files with 14 additions and 3 deletions.
15 changes: 13 additions & 2 deletions plugins/cuda/cuda_plugin.c
@@ -470,6 +470,17 @@ int cuda_plugin_resume_devices_late(int pid)
 }
 CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, cuda_plugin_resume_devices_late)
 
+static bool has_nvidia_gpu(void)
+{
+	const char *gpu_path = "/proc/driver/nvidia/gpus/";
+	struct stat sb;
+
+	if (stat(gpu_path, &sb) != 0)
+		return false;
+
+	return S_ISDIR(sb.st_mode);
+}
+
 int cuda_plugin_init(int stage)
 {
 	int ret;
@@ -481,8 +492,8 @@ int cuda_plugin_init(int stage)
 		}
 	}
 
-	if (!fault_injected(FI_PLUGIN_CUDA_FORCE_ENABLE) && access("/dev/nvidiactl", F_OK)) {
-		pr_info("/dev/nvidiactl doesn't exist. The CUDA plugin is disabled.\n");
+	if (!fault_injected(FI_PLUGIN_CUDA_FORCE_ENABLE) && !has_nvidia_gpu()) {
+		pr_info("No GPU device found; CUDA plugin is disabled\n");
 		plugin_disabled = true;
 		return 0;
 	}
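
For readers who want to try the detection logic outside of CRIU, here is a minimal standalone sketch. `dir_exists()` mirrors the `stat()` plus `S_ISDIR()` test used by `has_nvidia_gpu()`, but takes the path as a parameter so it can be run against any directory. `dir_has_entries()` is a hypothetical stricter variant that is not part of this commit: it assumes the driver exposes one entry per GPU under `/proc/driver/nvidia/gpus/` and reports true only when the directory contains at least one entry. Both helper names are illustrative.

```c
#include <dirent.h>
#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>

/* Same test as has_nvidia_gpu() in the patch, with the path as a parameter:
 * true when `path` exists and is a directory (stat() follows symlinks). */
static bool dir_exists(const char *path)
{
	struct stat sb;

	if (stat(path, &sb) != 0)
		return false;

	return S_ISDIR(sb.st_mode);
}

/* Hypothetical stricter variant (NOT in this commit): true only when `path`
 * can be opened as a directory and holds at least one entry other than
 * "." and "..". Assumes one entry per GPU under the NVIDIA procfs path. */
static bool dir_has_entries(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	bool found = false;

	if (!dir)
		return false;

	while ((de = readdir(dir)) != NULL) {
		if (strcmp(de->d_name, ".") != 0 && strcmp(de->d_name, "..") != 0) {
			found = true;
			break;
		}
	}

	closedir(dir);
	return found;
}
```

Calling `dir_exists("/proc/driver/nvidia/gpus/")` reproduces the check added by this patch. The stricter variant would be a drop-in replacement only if the one-entry-per-GPU assumption holds on the target driver version.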
2 changes: 1 addition & 1 deletion scripts/ci/run-ci-tests.sh
@@ -364,4 +364,4 @@ make -C plugins/amdgpu/ test_topology_remap
 ./test/zdtm.py run -t zdtm/static/maps00 -t zdtm/static/maps02 --criu-plugin amdgpu cuda
 ./test/zdtm.py run -t zdtm/static/busyloop00 --criu-plugin inventory_test_enabled inventory_test_disabled
 
-./test/zdtm.py run -t zdtm/static/sigpending -t zdtm/static/pthread00 --mocked-cuda-checkpoint --fault 138
+./test/zdtm.py run -t zdtm/static/sigpending -t zdtm/static/pthread00 --mocked-cuda-checkpoint --criu-plugin cuda --fault 138
