
CUDA kernel/userspace interface issue due to cuda.so.1 #11390

Closed
twhitehead opened this issue Dec 1, 2015 · 14 comments

Comments

@twhitehead
Contributor

We are running Nix as a package manager alongside CentOS 6. It is working well except that CUDA packages aren't working. For example, caffe (from nixpkgs tag 15.09) gives

$ caffe device_query --gpu=0
I1201 12:09:52.707620 25856 caffe.cpp:73] Querying device ID = 0
F1201 12:09:52.790141 25856 common.cpp:131] Check failed: error == cudaSuccess (35 vs. 0) CUDA driver version is insufficient for CUDA runtime version

Using strace I discovered the problem is that the cudatoolkit libraries are trying to load a libcuda.so.1 library (although it is not declared as an official shared-library dependency by any of the ELF objects). Digging reveals this is a kernel/user-space shim library shipped with the CUDA kernel driver, and that it is tightly tied to the specific version of that kernel driver.

$ readlink -f /usr/lib64/libcuda.so
/usr/lib64/nvidia/libcuda.so.352.55
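
For reference, the strace step can be reproduced with something along these lines (the exact invocation is my reconstruction, not copied from the original session):

$ strace -f -e trace=open,openat caffe device_query --gpu=0 2>&1 | grep libcuda.so.1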

Using LD_PRELOAD to load this into the Nix binary makes it work as expected:

$ LD_PRELOAD=/usr/lib64/libcuda.so caffe device_query --gpu=0
I1201 12:18:10.561049 25949 caffe.cpp:73] Querying device ID = 0
I1201 12:18:10.568706 25949 common.cpp:157] Device id: 0
I1201 12:18:10.568722 25949 common.cpp:158] Major revision number: 3
I1201 12:18:10.568725 25949 common.cpp:159] Minor revision number: 5
I1201 12:18:10.568730 25949 common.cpp:160] Name: Tesla K20m
...

How should Nix properly handle this? The libcuda.so.1 could be put in the Nix store, but that quickly creates a fragile situation: any update to the host NVIDIA driver instantly breaks all installed Nix CUDA binaries, because if libcuda.so.1 does not exactly match the kernel NVIDIA driver version, CUDA executables fail and the kernel logs messages of the form

NVRM: API mismatch: the client has the version 352.55, but
NVRM: this kernel module has the version 352.39. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
NVRM: nvidia_frontend_ioctl: minor 255, module->ioctl failed, error -22

To me this suggests that, as Nix uses the host kernel, it should also use the host provided version of this library. The logic here being that libcuda.so.1 is so tightly tied to the kernel it should really be considered part of the kernel interface (provided by the host) and not userspace (provided by Nix).

What do people think, and how might this best be done?

Thanks! -Tyson

@twhitehead
Contributor Author

After sleeping on this I decided to create a var/nix/lib directory, add it to LD_LIBRARY_PATH, and symlink any host libraries I need Nix binaries to pick up into it.

Seems to work okay, and it sort of follows the same spirit as using environment variables such as SSL_CERT_FILE to interface bits of the host with Nix.
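
Concretely, the workaround looks roughly like this (the directory name is just an example, and I'm assuming the host driver package also provides the usual libcuda.so.1 soname link next to the versioned file shown above):

$ mkdir -p /nix/var/nix/lib
$ ln -s /usr/lib64/nvidia/libcuda.so.1 /nix/var/nix/lib/libcuda.so.1
$ export LD_LIBRARY_PATH=/nix/var/nix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}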

I would still be interested in hearing if anyone has better ideas, though.

@jcumming
Contributor

jcumming commented Dec 2, 2015

In this case, nix can't control the kernel ABI. A pragmatic way to manage the userspace/kernel version is to pin the linuxPackages, kernel headers, and kernel driver versions in nixpkgs to what the host system provides.

So, in top-level/all-packages.nix, have

  linuxPackages = linuxPackages_CentOS6;

Where linuxPackages_CentOS6 is what the host system is running.

Likewise the nvidia driver version would need to be pinned.

It might even be possible to do kernel ABI auto-detection?
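
A crude form of that detection is already possible from userspace, since the loaded kernel module reports its own version; for example (treat this as a sketch, the output format varies between driver releases):

$ cat /proc/driver/nvidia/version
$ modinfo -F version nvidia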

@vcunat
Member

vcunat commented Dec 3, 2015

Well, on NixOS we have $LD_LIBRARY_PATH containing /run/opengl-driver/lib/, and the OS also puts libcuda.so* there if the nvidia driver is selected. However, when installed on an arbitrary Linux machine, Nix can't in general know where the system drivers are, I believe. We also put some other libs on that path, e.g. libGL*, libvdpau_*, ... It's up to the user to set such things up, but sometimes it's non-trivial because the dependencies of those libraries are not in hard-coded paths, as discussed e.g. in #9415.

@grahamc
Member

grahamc commented Feb 17, 2016

For what it is worth, I have NixOS running on a small farm of servers with GPGPUs / CUDA, and have found it works fine as long as the package you're running is built by Nix. Here is some of my config: https://github.com/grahamc/nixos-cuda-example

@twhitehead
Contributor Author

Thanks for the feedback, everyone. Timely that vcunat mentioned /run/opengl-driver/lib and #9415 as well; I just finished running into that the other day while fighting to get OpenGL acceleration working.

@vcunat
Member

vcunat commented Feb 17, 2016

@grahamc: the point is non-NixOS here. There are some libraries that need to be supplied impurely, based on what your OS/kernel is running, and nix-built packages don't know how to find those libs on non-NixOS. So far we don't have any good/automatic solution, and in case of libGL sometimes even symlinking isn't enough, as linked above.

@twhitehead
Contributor Author

Yes. I was thinking this morning that perhaps what this ticket is really looking for is for the installer script to (see the sketch after the list):

  1. create a /nix/lib directory (as an example)
  2. attempt to symlink any required impure libraries into it from the host OS (e.g., libGL and libcuda)
  3. have the user's ~/.bash_profile add it to LD_LIBRARY_PATH
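
A minimal sketch of what such an installer step could look like (the directory name, library list, and profile line are all illustrative, not an actual proposal for the install script):

# run once at install time
mkdir -p /nix/lib
for lib in /usr/lib64/libcuda.so.1 /usr/lib64/libGL.so.1; do
    [ -e "$lib" ] && ln -sf "$lib" /nix/lib/
done
# appended to the user's ~/.bash_profile
echo 'export LD_LIBRARY_PATH=/nix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}' >> ~/.bash_profile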

Cheers! -Tyson

@blogle
Contributor

blogle commented May 2, 2017

@twhitehead I am currently going the route of installing CUDA on an Ubuntu box via Nix and running into the kernel/runtime mismatch error. Did you ever get this to work? Could you perhaps give some pointers on how to get it working?

@twhitehead
Contributor Author

@blogle Yup. The key is that only libcuda.so* has to match your kernel driver version. What you want to do is:

  • install matching kernel and runtime drivers in Ubuntu
  • create symlinks to the runtime libcuda.so* files in a directory somewhere (e.g., /nix/var/nix/lib)
  • set LD_LIBRARY_PATH to that directory

This will cause your Nix binaries to use the Ubuntu runtime libcuda.so and the Nix-provided libraries for everything else.

@twhitehead
Contributor Author

One thing that has become clearer to me from running Nix inside another distribution is that LD_PRELOAD/LD_LIBRARY_PATH were designed under the implicit standard-distribution assumption that there is only one version of the core libraries. They apply universally, without mercy, to everything, creating conflicts as soon as you have multiple versions. Even the wrappers that set LD_LIBRARY_PATH are problematic, as any child process inherits the LD_PRELOAD/LD_LIBRARY_PATH settings.

For this reason I would almost suggest that their support be removed from the Nix dynamic linker/loader (ld.so) so that resolution happens purely by RPATH. Things like /run/opengl-driver/lib/ could instead be handled by putting these impure paths into the libraries that require them as RPATHs in the postFixup phase. For the few extreme cases where a static path just cannot be made to work, the binary can be wrapped so it is loaded directly via ld.so, with the path specified via the --library-path PATH option (ld.so --library-path $ldpath $prog).
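
As a rough illustration of both options (the binary name and paths are placeholders, and neither fragment is an existing nixpkgs hook):

# option 1: bake the impure path into the binary's RPATH during postFixup
patchelf --set-rpath "/run/opengl-driver/lib:$(patchelf --print-rpath $out/bin/foo)" $out/bin/foo

# option 2: a generated wrapper that starts the program through the dynamic loader
#!/bin/sh
exec /lib64/ld-linux-x86-64.so.2 --library-path /run/opengl-driver/lib /path/to/.foo-unwrapped "$@"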

A less drastic option might be to patch the Nix dynamic linker/loader so it uses NIX_LD_PRELOAD/NIX_LD_LIBRARY_PATH instead of LD_PRELOAD/LD_LIBRARY_PATH. This is still broken though as you are effectively assuming that a particular Nix library is okay for all Nix binaries, which isn't technically true given Nix's support for multiple versions of libraries and binaries.

@blogle
Contributor

blogle commented May 3, 2017

To any future travelers: note that I also had to symlink the contents of /usr/lib/nvidia-${version}/ into /nix/var/nix/lib.
@twhitehead thank you very much for the reply; I struggled with this for about two days until I came across your help. Now everything appears to be up and running 👍

@mmahut
Member

mmahut commented Aug 19, 2019

Any news on this issue?

@stale

stale bot commented Jun 2, 2020

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.

@stale stale bot added the 2.status: stale label Jun 2, 2020
@stale stale bot removed the 2.status: stale label Apr 3, 2022
@FRidh
Member

FRidh commented Apr 9, 2022

When you use the nvidia X11 drivers you get libcuda.so in /run/opengl-driver/lib. Derivations need to use addOpenGLRunpath, which extends the RPATH with that path so libraries will be able to find libcuda.so.
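
(To check whether a given binary actually picked up that path, something like the following works; patchelf here is just a generic inspection tool, not part of addOpenGLRunpath itself.)

$ patchelf --print-rpath "$(command -v caffe)" | tr ':' '\n' | grep opengl-driver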

@FRidh FRidh closed this as completed Apr 9, 2022