Unable to use GPU inside NVIDIA docker container #578

Closed
fyang93 opened this issue Oct 26, 2024 · 13 comments
Labels
bug Something isn't working


@fyang93

fyang93 commented Oct 26, 2024

Bug description

I tried to use an NVIDIA Docker container on NixOS 24.05. Following the official docs, I set hardware.nvidia-container-toolkit.enable = true; and also enabled wsl.useWindowsDriver = true;. After setting the environment variable NIX_LD_LIBRARY_PATH = "/usr/lib/wsl/lib", nvidia-smi reports GPU information that matches the output on Windows. I then confirmed that the CDI file exists and is valid by running cat /var/run/cdi/nvidia-container-toolkit.json. However, I still cannot access the NVIDIA driver inside the container; the error message is attached under Logs.
It seems the NVIDIA Container Toolkit does not correctly locate the driver files.

To Reproduce

Steps to reproduce the behavior:

  • Set wsl.useWindowsDriver = true; and hardware.nvidia-container-toolkit.enable = true; in the flake
  • Set NIX_LD_LIBRARY_PATH = "/usr/lib/wsl/lib" in shell
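For reference, the flake settings from these steps roughly correspond to the NixOS module fragment below. It is a sketch, not the reporter's actual flake: it assumes the NixOS-WSL module (which provides the wsl.* options) is already imported, and it assumes Docker is enabled through virtualisation.docker.enable, which the report implies but does not show. NIX_LD_LIBRARY_PATH is left out because the reporter set it in the shell.

```nix
{ ... }:
{
  # Assumes the NixOS-WSL module is imported elsewhere in the flake.
  # Exposes the Windows driver libraries from /usr/lib/wsl/lib to NixOS.
  wsl.useWindowsDriver = true;

  # Generates the CDI spec the reporter found at /var/run/cdi/nvidia-container-toolkit.json.
  hardware.nvidia-container-toolkit.enable = true;

  # Assumption: Docker was enabled this way; the report only shows `docker run`.
  virtualisation.docker.enable = true;
}
```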

Logs

 docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest /bin/bash
root@ac812a4a8cd7:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

WSL version

WSL Version: 2.3.24.0
Kernel version: 5.15.153.1-2
WSLg version: 1.0.59
MSRDC version: 1.2.5620
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.26100.1-240331-1435.ge-release
Windows version: 10.0.22631.4317
@fyang93 fyang93 added the bug Something isn't working label Oct 26, 2024
@SuperSandro2000
Member

Is this maybe a duplicate of #454?

@fyang93
Author

fyang93 commented Oct 26, 2024

Is this maybe duplicated with #454 ?

The solution mentioned in that issue does not solve my problem.
Setting hardware.opengl.setLdLibraryPath = true; does not make any difference in my case :\

@fyang93
Author

fyang93 commented Oct 26, 2024

╭────┬─────────────────────────────────────────────╮
│  0 │ /usr/lib/wsl/lib/libcuda.so                 │
│  1 │ /usr/lib/wsl/lib/libcuda.so.1               │
│  2 │ /usr/lib/wsl/lib/libcuda.so.1.1             │
│  3 │ /usr/lib/wsl/lib/libcudadebugger.so.1       │
│  4 │ /usr/lib/wsl/lib/libd3d12.so                │
│  5 │ /usr/lib/wsl/lib/libd3d12core.so            │
│  6 │ /usr/lib/wsl/lib/libdxcore.so               │
│  7 │ /usr/lib/wsl/lib/libnvcuvid.so              │
│  8 │ /usr/lib/wsl/lib/libnvcuvid.so.1            │
│  9 │ /usr/lib/wsl/lib/libnvdxdlkernels.so        │
│ 10 │ /usr/lib/wsl/lib/libnvidia-encode.so        │
│ 11 │ /usr/lib/wsl/lib/libnvidia-encode.so.1      │
│ 12 │ /usr/lib/wsl/lib/libnvidia-ml.so.1          │
│ 13 │ /usr/lib/wsl/lib/libnvidia-opticalflow.so   │
│ 14 │ /usr/lib/wsl/lib/libnvidia-opticalflow.so.1 │
│ 15 │ /usr/lib/wsl/lib/libnvoptix.so.1            │
│ 16 │ /usr/lib/wsl/lib/libnvwgf2umx.so            │
│ 17 │ /usr/lib/wsl/lib/nvidia-smi                 │
╰────┴─────────────────────────────────────────────╯

These are all the files inside /usr/lib/wsl/lib/. There is only libnvidia-ml.so.1, not libnvidia-ml.so. I tried creating a symlink, but with no luck.

@fyang93
Author

fyang93 commented Oct 28, 2024

I just looked into the source code, and it looks like it doesn't link /usr/lib/wsl/lib/libnvidia-ml.so at all. Isn't this a critical bug?

extraPackages = mkIf cfg.useWindowsDriver [
  (pkgs.runCommand "wsl-lib" { } ''
    mkdir -p "$out/lib"
    # we cannot just symlink the lib directory because it breaks merging with other drivers that provide the same directory
    ln -s /usr/lib/wsl/lib/libcudadebugger.so.1 "$out/lib"
    ln -s /usr/lib/wsl/lib/libcuda.so "$out/lib"
    ln -s /usr/lib/wsl/lib/libcuda.so.1 "$out/lib"
    ln -s /usr/lib/wsl/lib/libcuda.so.1.1 "$out/lib"
    ln -s /usr/lib/wsl/lib/libd3d12core.so "$out/lib"
    ln -s /usr/lib/wsl/lib/libd3d12.so "$out/lib"
    ln -s /usr/lib/wsl/lib/libdxcore.so "$out/lib"
    ln -s /usr/lib/wsl/lib/libnvcuvid.so "$out/lib"
    ln -s /usr/lib/wsl/lib/libnvcuvid.so.1 "$out/lib"
    ln -s /usr/lib/wsl/lib/libnvdxdlkernels.so "$out/lib"
    ln -s /usr/lib/wsl/lib/libnvidia-encode.so "$out/lib"
    ln -s /usr/lib/wsl/lib/libnvidia-encode.so.1 "$out/lib"
    ln -s /usr/lib/wsl/lib/libnvidia-ml.so.1 "$out/lib"
    ln -s /usr/lib/wsl/lib/libnvidia-opticalflow.so "$out/lib"
    ln -s /usr/lib/wsl/lib/libnvidia-opticalflow.so.1 "$out/lib"
    ln -s /usr/lib/wsl/lib/libnvoptix.so.1 "$out/lib"
    ln -s /usr/lib/wsl/lib/libnvwgf2umx.so "$out/lib"
    ln -s /usr/lib/wsl/lib/nvidia-smi "$out/lib"
  '')
];

@shikanime

shikanime commented Nov 6, 2024

To solve the nvidia-smi problem when nix-ld is also enabled, you need to add the "wsl-lib" to program.nix-ld.packages, as useWindowsDriver only sets it up for OpenGL. Similarly, I believe that to make Docker work you should also set up virtualisation.docker.extraPackages.

Or perhaps you didn't enable the CDI feature in the Docker daemon, as it isn't enabled by default.

@fyang93
Author

fyang93 commented Nov 8, 2024

@shikanime Thank you for the information!
I guess you mean to add wsl-lib to programs.nix-ld.libraries, but I can't find this library in pkgs. Do I need to use some library from the nixos-wsl input?

@shikanime

shikanime commented Nov 8, 2024

@fyang93 I was referring to the "wsl-lib" used by useWindowsDriver that you just mentioned.
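Since wsl-lib is built inside the NixOS-WSL module rather than exposed in pkgs, one way to hand it to nix-ld is to reuse the list it is added to. This is a minimal sketch under two assumptions: that the module appends wsl-lib to hardware.opengl.extraPackages on NixOS 24.05 (the option is named hardware.graphics.extraPackages from NixOS 24.11), and that nix-ld is what the shell relies on. Defining programs.nix-ld.libraries replaces the module's default library list, so add back any libraries you still need.

```nix
{ config, ... }:
{
  programs.nix-ld.enable = true;

  # Assumption: wsl.useWindowsDriver places the wsl-lib symlink package in this list.
  # Reusing it avoids referring to wsl-lib by name, since it is not in pkgs.
  # Note: any definition here overrides the default nix-ld library set.
  programs.nix-ld.libraries = config.hardware.opengl.extraPackages;
}
```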

@shikanime

OK, I just tested it, and it seems that the CDI spec generated by nvidia-container-toolkit doesn't really work on WSL.
You need to generate your own with nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml and use the following configuration:

virtualisation.docker = {
  enable = true;
  daemon.settings.features.cdi = true;
  daemon.settings.cdi-spec-dirs = ["/etc/cdi"];
};

@573
Copy link
Contributor

573 commented Nov 8, 2024

#487 (comment)

@shikanime

Normally it is the job of nvidia-container-toolkit to generate the CDI spec. To confirm my hypothesis, I tried it on my machine, and indeed it didn't work properly because of missing libraries. On WSL you should disable the executables provided by the generated CDI spec; with that, it looks fine so far:

hardware.nvidia-container-toolkit = {
  enable = true;
  mount-nvidia-executables = false;
};
❯ docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L     
GPU 0: NVIDIA GeForce RTX 2070 with Max-Q Design (UUID: GPU-724d28f9-c8dc-91b2-6927-e1661f089935)
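Putting the thread together, a configuration along these lines should match the working setup. It is a hedged sketch: it combines the final nvidia-container-toolkit settings with the Docker CDI flag from the earlier comment, and it assumes the toolkit's own generated spec is used (no hand-generated /etc/cdi/nvidia.yaml, so cdi-spec-dirs is left at Docker's default). Whether features.cdi must be set explicitly depends on the Docker version in use.

```nix
{ ... }:
{
  # Assumes the NixOS-WSL module is imported.
  wsl.useWindowsDriver = true;

  hardware.nvidia-container-toolkit = {
    enable = true;
    # Per the thread, disabling the executables in the generated CDI spec
    # is what made `nvidia-smi -L` work inside the container on WSL.
    mount-nvidia-executables = false;
  };

  virtualisation.docker = {
    enable = true;
    # CDI was not enabled by default in the Docker daemon discussed here.
    daemon.settings.features.cdi = true;
  };
}
```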

573 added a commit to 573/nix-config-1 that referenced this issue Nov 8, 2024
@nzbr
Member

nzbr commented Nov 10, 2024

@shikanime Would you be willing to create a PR to add this to the how-to section of the docs? Unfortunately, I can't be of much help with this topic, because the laptop I primarily use does not have an NVIDIA GPU.

@fyang93
Author

fyang93 commented Nov 10, 2024

@shikanime Sorry for the delayed response. Your solution was spot on and it worked like a charm, really appreciate the assist! 🙌

@fyang93 fyang93 closed this as completed Nov 10, 2024
@shikanime

         🌸> ' ' フ
         | . _  _l    
        /` ミ_xノ     < I'll do it somehow this week, unless I'm lazing around like my cat.
       / .     |    
      /   ヽ . ノ
     │ . |  |  |
 / ̄| .  |  |  |
 | ( ̄ヽ__ヽ_)__)
 \二つ

573 added a commit to 573/nix-config-1 that referenced this issue Nov 11, 2024

5 participants