Problems when trying to run GPU workload on GCP's n1-standard-1 (nVidia T4 Tesla) #1918

tuommaki · 2023-08-14T07:29:26Z

When I try to run the deviceQuery sample program (from tag v11.6, compiled with CUDA Toolkit 11.7 GA on Fedora 35 x86_64) using the latest gpu-nvidia klib compiled against Nanos tag 0.1.46, I get the following error:

SeaBIOS (version 1.8.2-google)
Total RAM Size = 0x00000000f0000000 = 3840 MiB
Found pci whitelist file with size 24
CPUs found: 1     Max CPUs supported: 1
found virtio-scsi at 0:3
virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
virtio-scsi blksize=512 sectors=2097152 = 1024 MiB
drive 0x000f2890: PCHS=0/0/0 translation=lba LCHS=1024/32/63 s=2097152
Sending Seabios boot VM event.
Booting from Hard Disk 0...
en1: assigned 10.132.0.35
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  515.65.01  Release Build  (<snipped>)  Thu Aug 10 10:54:35 AM EEST 2023
Loaded the UVM driver, major device number 0.
devicequery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

I have bundled the matching nvidia/515.65.01/gsp.bin as instructed in "GPU-accelerated Computing with Nanos Unikernels", and I've also bundled lib64/libcuda.so.1, which seems to be loaded dynamically at runtime.

I'm using the following configuration:

{
  "CloudConfig" :{
    "ProjectID": "<snipped>",
    "Zone": "europe-west1-b",
    "BucketName":"<snipped>",
    "Flavor":"n1-standard-1"
  },
  "Klibs":["gcp", "gpu_nvidia", "tls"],
  "RebootOnExit": true,
  "RunConfig": {
    "GPUs":1,
    "GPUType": "nvidia-tesla-t4",
    "Memory":"2G"
  },
  "Dirs": ["lib64", "nvidia"],
  "Program":"devicequery"
}

The deviceQuery program is compiled for several target architectures, including the one matching the T4 (sm_75):

/usr/local/cuda/bin/nvcc -ccbin g++ -I../../../Common -m64 --threads 0 --std=c++11 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o deviceQuery.o -c deviceQuery.cpp

According to these tables, my understanding is that CUDA Toolkit v11.7 GA should work with gpu-nvidia driver version 515.65.01.

Despite all these version cross-checks, I can't get the program to run. What am I missing, and how can I debug this further?

francescolavra (Maintainer) · 2023-08-15T17:52:41Z

The "CUDA driver version is insufficient for CUDA runtime version" message comes from the CUDA library and is somewhat misleading: in this case it is caused by a few shared libraries that are missing from your unikernel image.
First, the libcuda.so.1 file should not be in the /lib64 folder, because it won't be found there; it should be in /lib/x86_64-linux-gnu/ instead. In addition, a few more libraries are needed: /lib/x86_64-linux-gnu/libdl.so.2, /lib/x86_64-linux-gnu/libpthread.so.0, and /lib/x86_64-linux-gnu/librt.so.1.
A useful command line flag for finding out which files may be missing from an image is --missing-files. When passed to an ops run command, it makes the Nanos kernel print the names of files that the program tried to open but couldn't find. For example, if the pthread library were missing, you would see something like:

missing_files_begin
libpthread.so.0
missing_files_end

If you want more detailed information on where a missing file should be placed in the image filesystem, you can add the --trace command line flag, which enables verbose logging. Messages such as "/lib/x86_64-linux-gnu/libcuda.so.1" - not found then point you to the correct filesystem path.
Note that when running an image locally (where there is no GPU), you have to remove the gpu_nvidia klib from the configuration, otherwise this klib fails to load and the program is not even started.
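
The missing_files_begin / missing_files_end markers shown earlier are easy to post-process. As a minimal sketch, assuming the console output has been captured into a string and that the markers appear on their own lines exactly as in the sample above, a helper along these lines could collect the reported names. parse_missing_files is a hypothetical name used for illustration; it is not part of ops or Nanos.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of ops or Nanos): collect the file names
// printed between the "missing_files_begin" and "missing_files_end" lines.
// Assumes each marker and each file name sits on its own line.
std::vector<std::string> parse_missing_files(const std::string& console_log) {
    std::vector<std::string> missing;
    std::istringstream in(console_log);
    std::string line;
    bool inside = false;
    while (std::getline(in, line)) {
        // Drop a trailing carriage return in case the log uses CRLF line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line == "missing_files_begin") { inside = true; continue; }
        if (line == "missing_files_end") { inside = false; continue; }
        if (inside && !line.empty()) missing.push_back(line);
    }
    return missing;
}
```

Each returned name could then be looked up on the build machine and added to the "Files" or "Dirs" entries of the configuration.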

In summary, a working configuration would be something like:

{
  "CloudConfig" :{
    "ProjectID": "<snipped>",
    "Zone": "europe-west1-b",
    "BucketName":"<snipped>",
    "Flavor":"n1-standard-1"
  },
  "Klibs":["gcp", "gpu_nvidia", "tls"],
  "RebootOnExit": true,
  "RunConfig": {
    "GPUs":1,
    "GPUType": "nvidia-tesla-t4",
    "Memory":"2G"
  },
  "Dirs": ["lib", "nvidia"],
  "Files": ["/lib/x86_64-linux-gnu/libdl.so.2", "/lib/x86_64-linux-gnu/libpthread.so.0", "/lib/x86_64-linux-gnu/librt.so.1"],
  "Program":"devicequery"
}

With the above configuration, the deviceQuery sample program runs successfully, with the following output:

Detected 1 CUDA Capable device(s)

Device 0: "Tesla T4"
  CUDA Driver Version / Runtime Version          11.7 / 11.7
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 14972 MBytes (15699148800 bytes)
  (040) Multiprocessors, (064) CUDA Cores/MP:    2560 CUDA Cores
  GPU Max Clock rate:                            1590 MHz (1.59 GHz)
  Memory Clock rate:                             5001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 4
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.7, CUDA Runtime Version = 11.7, NumDevs = 1
Result = PASS

Note: this sample program exits right after querying the GPU device, so the VM instance shuts down just a few seconds after starting. Since the serial console output cannot be retrieved from a stopped instance, you won't be able to see the above output via e.g. ops instance logs unless you modify the program so that it doesn't exit immediately. For example, you can insert a pause(); call right before exit(EXIT_SUCCESS); in the source file at https://github.com/NVIDIA/cuda-samples/blob/v11.6/Samples/1_Utilities/deviceQuery/deviceQuery.cpp (you will also need to add #include <unistd.h>).

tuommaki Aug 16, 2023
Author

Thanks! That helped me get the program running. There were a couple of interesting things, though:

I did need the lib64 as well. I think the dynamically loaded (during runtime, not by the linker based on ELF headers) libraries might come from a different location. Could this be the case?
"Funnily" enough, I ran into an incompatible-driver issue with CUDA v11.7, but weirdly managed to get it running with v12.2. 🤷‍♂️

With those adjustments, now I have the workload up and running. Thanks!

francescolavra (Maintainer) · 2023-08-16T13:47:23Z

I did need the lib64 as well. I think the dynamically loaded (during runtime, not by the linker based on ELF headers) libraries might come from a different location. Could this be the case?

It might be. In my setup I didn't need the lib64 folder (at runtime the program doesn't look for any files in that folder); perhaps you did because of some differences between your build environment and mine.
By the way, all the libraries needed by the program that can be detected from the ELF headers (basically, the libraries listed by the ldd tool) are automatically added to the image when you run ops image create (or ops run). The additional libraries we had to add manually to the configuration file are therefore those that the program itself loads at runtime.
Anyway, good to hear you sorted out those issues!
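
To illustrate the difference between libraries recorded in the ELF headers and libraries loaded at runtime, here is a minimal sketch. It assumes a Linux system with dlopen() available (older glibc versions need -ldl at link time); unresolved_libraries is a hypothetical helper name. It tries to load each library by name, the way a program does at runtime, and returns the names the dynamic loader could not resolve. Dependencies requested this way never appear in ldd output for the program, which is why they have to be listed in the configuration by hand.

```cpp
#include <dlfcn.h>  // dlopen(), dlclose()
#include <string>
#include <vector>

// Hypothetical helper: try to load each library by name at runtime and
// return the names that the dynamic loader could not resolve.
std::vector<std::string> unresolved_libraries(const std::vector<std::string>& names) {
    std::vector<std::string> failed;
    for (const auto& name : names) {
        void* handle = dlopen(name.c_str(), RTLD_LAZY);
        if (handle == nullptr) {
            failed.push_back(name);  // not found in the loader's search path
        } else {
            dlclose(handle);  // release the handle; we only wanted to probe
        }
    }
    return failed;
}
```

Calling it with names such as "libcuda.so.1" or "libdl.so.2" inside the environment under test would show which of them cannot be found there.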
