
docs: Questions about maintaining CUDA-related packaging #217780

Open · ConnorBaker opened this issue Feb 23, 2023 · 11 comments

@ConnorBaker (Contributor) commented Feb 23, 2023

Problem

I'm asking these questions here because doing so gives more people a chance to see them and contribute.

For context, I used Docker to the point of tears a while back and don't ever want to go back. I like Nix and Nixpkgs, so I've been trying to contribute to the CUDA ecosystem to get things up to speed. I've made a number of PRs toward that end recently, and doing so has taught me a fair amount about Nix and about the state of CUDA support.

The following is a list of questions I've had knocking around while I've worked on those PRs. I figure this might be a good way to gather some knowledge, informally, before trying to write it up or otherwise synthesize it. I read through the big PR from last year which involved most of the CUDA refactoring (#167016) prior to writing this.

I apologize in advance if I've missed any obvious resources.

Proposal

The following questions and answers should either make their way into a FAQ or be incorporated into the CUDA documentation.

Table of Contents

  • sha256 vs. hash
  • autoAddOpenGLRunpathHook
  • Multi-output derivations
  • Supporting multiple versions of CUDA
  • Version bounds within cudaPackages
  • Version bounds for consumers of cudaPackages
  • Storing "meta" information
  • Misc questions to be organized better later

sha256 vs. hash

Nix expects hashes provided by the hash attribute (e.g., for fetchurl) to be SRI hashes [1]. SRI hashes are self-describing, so it's not necessary to specify the hash algorithm via the attribute name. Where possible, it is preferable to use hash instead of sha256 to avoid confusion and to avoid the need to specify the hash algorithm; there is an ongoing, Nixpkgs-wide effort to do so!
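
As a rough illustration (the URL and hash below are placeholders, not a real package):

```nix
{ fetchurl }:

fetchurl {
  # Placeholder URL; any fetcher that accepts `hash` behaves the same way.
  url = "https://developer.download.nvidia.com/compute/cuda/redist/example.tar.xz";

  # Preferred: `hash` with a self-describing SRI value (the algorithm is part of the string).
  hash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";

  # Older style, still common but being migrated away from:
  # sha256 = "0000000000000000000000000000000000000000000000000000";
}
```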

autoAddOpenGLRunpathHook

Background on autoAddOpenGLRunpathHook, why we need it, and answers, courtesy of @samuela:

Back in the olden days, one had to manually call patchelf to repair executable RUNPATHs. addOpenGLRunpath and subsequently autoAddOpenGLRunpathHook automate this process and remove confusion. What does this mean, and why is any of it necessary? Well, binaries that depend on NVIDIA GPU access dynamically load driver libraries at runtime, especially libcuda.so. This file is part of the kernel driver installation. On most Linux systems the kernel driver is available in a fixed, globally accessible location. The RUNPATH/RPATH is a section in the binary itself that determines where the executable will search for library files when dynamically loading, like LD_LIBRARY_PATH but embedded in the binary. So on conventional systems you just add /foo/bar/nvidia/ to RUNPATH and you're set to go.

However, NixOS operates differently: drivers live in the Nix store, no different from any other kind of package. So how do we load these libraries? The NixOS solution is to symlink a special directory, currently /run/opengl-driver/lib, which points to the current graphics driver. Then you can load libcuda.so from there. But because this differs from external convention, we need to add this path to RUNPATH in order to keep everything running smoothly.

Once upon a time, each package had to do this manually with patchelf. Now we have addOpenGLRunpath and autoAddOpenGLRunpathHook to automate the process for us.

  1. How will I know when I need to use autoAddOpenGLRunpathHook?

    If your binaries are complaining about not being able to load libraries, you probably need one of the hooks.

  2. Does the hook replace the need to manually invoke addOpenGLRunpath?

    Hopefully yes. If you come across a case where it does not work, it's likely a bug and should be reported.
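
For reference, a minimal sketch of wiring the hook into a hypothetical package (name, URL, and hash are placeholders):

```nix
{ stdenv, fetchurl, cudaPackages }:

stdenv.mkDerivation {
  pname = "my-cuda-tool"; # hypothetical
  version = "1.0";
  src = fetchurl {
    url = "https://example.com/my-cuda-tool-1.0.tar.gz"; # placeholder
    hash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
  };

  nativeBuildInputs = [
    # During fixup, appends /run/opengl-driver/lib to the RUNPATH of ELF files
    # in the outputs, so that libcuda.so can be found at runtime on NixOS.
    cudaPackages.autoAddOpenGLRunpathHook
  ];
}
```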

Multi-output derivations

Part of the goal of multi-output derivations is to allow for more granular control over what is installed. For example, if you only need the static library, you can install that without installing the shared library. This is especially useful for CUDA, where libraries are often dozens, if not hundreds, of megabytes in size.

  1. What name should be used for the output containing static libraries?
  2. Where do include files go?
    • Should they be in dev or out? Does it make sense to split at that level of granularity?
  3. Does it make sense for out to contain files which are available in other outputs?
    • Should out be standalone/able to run without being joined with other outputs from the derivation? For example, should library files required to run a CUDA program be in out, lib, or both?

Added challenges, pointed out by @SomeoneSerge:

  • NVIDIA ships pkg-config files with each of the redistributables
  • However, packages use FindCUDAToolkit.cmake (if we're lucky), FindCUDA.cmake (if we're less lucky), or bazel (if they positively decided to torture us) to find the toolkit
    • None of these are aware of the pkg-config files
    • All of these expect the toolkit to be installed in a single directory (e.g., CUDAToolkit_ROOT or CUDA_HOME)
  • Because these tools expect the toolkit to be installed in a single directory, the symlinkJoin pattern emerges (a sketch follows this list)
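
A rough sketch of that pattern (the redistributable attribute names shown are just examples of what a consumer might need):

```nix
{ symlinkJoin, cudaPackages }:

# Re-create the conventional "everything under one prefix" layout so that
# FindCUDAToolkit.cmake (via CUDAToolkit_ROOT) or CUDA_HOME-style lookups work.
symlinkJoin {
  name = "cuda-redist-joined";
  paths = with cudaPackages; [
    cuda_nvcc
    cuda_cudart
    libcublas
  ];
}
```

Downstream builds can then point CUDAToolkit_ROOT (or CUDA_HOME) at the joined tree instead of at individual outputs.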

Supporting multiple versions of CUDA

  1. What is the impetus to maintain multiple versions of CUDA?
    • Nothing beyond the trade-off of a little extra work for a larger supported install-base.
  2. Is there something Nixpkgs is committed to?
    • There is no established commitment.
  3. Does the Nixpkgs community have a clear idea of which packages rely on which versions of CUDA?
    • It can be determined mechanically by looking at Nixpkgs.

Version bounds within cudaPackages

  1. Different CUDA versions support different compute capabilities. What is the desired way to handle keeping track of this?
  2. For the Nixpkgs community at large, is there a best practice of including information about packages in the derivation vs. auxiliary files?

Version bounds for consumers of cudaPackages

  1. The source of some packages has a hard-coded list of supported compute capabilities. How should we handle such packages?

@SomeoneSerge found that magma specifically allows us to set CMAKE_CUDA_ARCHITECTURES to use architectures outside those in the CMakeLists.txt: https://github.com/NixOS/nixpkgs/pull/218265/files#diff-989b55d62898864bff7cbba951ccdcdf5ff604fc917498863d2fb567efde542fR137.
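
For illustration, an override along these lines (not the actual nixpkgs magma expression) pins the architecture list explicitly:

```nix
# Hypothetical override: build magma only for the capability we care about.
# CMake expects the value without the dot, e.g. "8.6" becomes "86".
magma.overrideAttrs (oldAttrs: {
  cmakeFlags = (oldAttrs.cmakeFlags or [ ]) ++ [
    "-DCMAKE_CUDA_ARCHITECTURES=86"
  ];
})
```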

Storing "meta" information

  1. Information about GPUs, compute capabilities, and which packages support what is extremely useful and critical to ensuring that packages are built correctly. How should we store this information?
    • In the case that we're mirroring a JSON file locally, it should stay a JSON file. (An example is NVIDIA's CUDA redistributable manifest.)
    • In the case that we're generating the information ourselves, we should use Nix.

Misc questions to be organized better later

  1. When we do override the C/C++ compilers by setting the CC/CXX environment variables, that doesn't change binutils, so (in my case) I still see ar/ranlib/ld and friends from gcc12 being used. Is that a problem? I don't know if version bumps to those tools can cause as much damage as libraries compiled with different language standards.
  2. If a package needs to link against libcuda.so specifically, what's the best way to make the linker aware of those stubs? I set LIBRARY_PATH and that seemed to do the trick: https://github.com/NixOS/nixpkgs/pull/218166/files#diff-ab3fb67b115c350953951c7c5aa868e8dd9694460710d2a99b845e7704ce0cf5R76
  3. Is it better to set environment variables in the derivation as env.BLAG = "blarg" (I saw a tree-wide change about using env because of "structuredAttrs"), or to export them in the shell in something like preConfigure? (A small sketch follows this list.)
  4. Do we have any infrastructure (like CI) besides cachix?
  5. What populates our cachix?
  6. What's the storage limit for our cachix (meaning, is the number of derivations we host a result of limited compute, storage, or both)?
  7. If it's not CI populating the cache, what's the process for getting permissions to push to it?
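
Regarding question 3, a small sketch of the two styles (package name and variable are made up):

```nix
{ stdenv }:

stdenv.mkDerivation {
  pname = "env-example"; # hypothetical
  version = "1.0";
  dontUnpack = true;

  # Style 1: the `env` attribute, which the structuredAttrs-motivated tree-wide
  # change prefers for plain string-valued variables.
  env.BLAG = "blarg";

  # Style 2: export the variable yourself in a phase hook.
  preConfigure = ''
    export BLAG=blarg
  '';

  installPhase = ''
    mkdir -p $out
    echo "$BLAG" > $out/blag
  '';
}
```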

cc @NixOS/cuda-maintainers

Footnotes

  1. https://developer.mozilla.org/en-US/docs/Web/Security/Subresource_Integrity

@ConnorBaker (Contributor, author)

@samuela I'd really appreciate your perspective and thoughts on these questions if you have the time!

@fricklerhandwerk (Contributor)

Related: https://discourse.nixos.org/t/nixpkgss-current-development-workflow-is-not-sustainable/18741

@samuela (Member) commented Feb 23, 2023

Hey @ConnorBaker, thanks for your interest in CUDA development and contributions so far! These are all fair questions. I'll try to answer as many as possible.

First of all, for some context, the @NixOS/cuda-maintainers team was created here. We maintain a few things: thanks to the generosity of @domenkozar and Cachix we have a build cache. @SomeoneSerge has built some great infrastructure that regularly builds packages like TF/JAX/PyTorch/etc with cudaSupport = true and pushes the results into the cache. I maintain nixpkgs-upkeep, a CI system that fills in some gaps left by Hydra, nixpkgs' flagship CI. Since CUDA does not fall under the strict definition of "free" software, Hydra refuses to test any packages that use it. nixpkgs-upkeep regularly builds a subset of packages with cudaSupport = true and auto-reports failures (which are more common than you'd think!). Most importantly, we review PRs, write documentation, and provide support.

Folks can reach us via @NixOS/cuda-maintainers or on matrix chat in the #cuda:nixos.org channel.

Onto the questions!

On sha256 vs. hash

There is an ongoing, Nixpkgs-wide migration to using SRI hashes and the hash argument, which is intended to be more future-proof. More info here: https://nixos.wiki/wiki/Nix_Hash

How will I know when I need to use autoAddOpenGLRunpathHook?

Back in the olden days, one had to manually call patchelf to repair executable RUNPATHs. addOpenGLRunpath and subsequently autoAddOpenGLRunpathHook automate this process and remove confusion. What does this mean, and why is any of it necessary? Well, binaries that depend on NVIDIA GPU access dynamically load driver libraries at runtime, especially libcuda.so. This file is part of the kernel driver installation. On most Linux systems the kernel driver is available in a fixed, globally accessible location. The RUNPATH/RPATH is a section in the binary itself that determines where the executable will search for library files when dynamically loading, like LD_LIBRARY_PATH but embedded in the binary. So on conventional systems you just add /foo/bar/nvidia/ to RUNPATH and you're set to go.

However, NixOS operates differently: drivers live in the Nix store, no different from any other kind of package. So how do we load these libraries? The NixOS solution is to symlink a special directory, currently /run/opengl-driver/lib, which points to the current graphics driver. Then you can load libcuda.so from there. But because this differs from external convention, we need to add this path to RUNPATH in order to keep everything running smoothly.

Once upon a time, each package had to do this manually with patchelf. Now we have addOpenGLRunpath and autoAddOpenGLRunpathHook to automate the process for us. tl;dr: if your binaries are complaining about not being able to load libraries, you probably need one of the hooks.

Does the hook replace the need to manually invoke addOpenGLRunpath?

Hopefully yes. If you come across a case where it does not work, it's likely a bug and should be reported.

Are there cases where it should not be used? See last comment on #167016 (comment).

I'm not sure I understand what you mean by the link... the last comment is from @nixos-discourse. In any case, if your package builds and runs without it, then there's no need.

On multi-output derivations

These are good questions, but outside my wheelhouse... I wonder if nixpkgs has a recommended path for these things?

What is the impetus to maintain multiple versions of CUDA? Is there something Nixpkgs is committed to? Does the Nixpkgs community have a clear idea of which packages rely on which versions of CUDA?

Personally I'm in favor of culling old versions of CUDA, but previously there was some pushback that the cost of keeping them was relatively small. You can find out which packages use which versions by searching the nixpkgs dependency tree (or source). As much as possible we try to keep packages on the mainline cudaPackages, so older CUDA versions should not have any consumers.

Different CUDA versions support different compute capabilities. What is the desired way to handle keeping track of this?

Prior art is to just keep a list, but we're open to different solutions here. We're always looking for refactors that simplify things.

Information about GPUs, compute capabilities, and which packages support what is extremely useful and critical to ensuring that packages are built correctly. How should we store this information?

This is constantly evolving. Just having everything written down in code is an improvement to begin with. I believe Nix to be the best language for that, keeping it Nix all the way down. We'll probably keep refactoring and refining things here over time. I'm sure that more opportunities for refactors will present themselves down the road!
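
For instance, a hypothetical shape for such data, kept directly in Nix (fields and entries are illustrative):

```nix
# Map compute capability -> metadata that package expressions can consume.
{
  "8.6" = {
    archName = "Ampere";
    minCudaVersion = "11.1"; # first CUDA release with sm_86 support
  };
  "9.0" = {
    archName = "Hopper";
    minCudaVersion = "11.8"; # first CUDA release with sm_90 support
  };
}
```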

@SomeoneSerge (Contributor) commented Feb 27, 2023

On multi-output derivations

Where do static libraries go?

Ideally, we would like static libraries to go into separate outputs.
Use case: a downstream package links against a shared library, its output is part of the runtime closure, but we don't need any static .a libraries at runtime. In the case of cudaPackages this means gigabytes of useless extra weight, which is particularly bad for Docker and Singularity images built with Nix.

An annoying detail: when we split static libraries out in cudaPackages, we might find we need symlinkJoin more often because of build scripts that expect "everything to be in one directory".

EDIT: Tracking in #224533
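
A rough sketch of what such a split could look like in a derivation (output names, URL, and glob are illustrative):

```nix
{ stdenv, fetchurl }:

stdenv.mkDerivation {
  pname = "example-cuda-lib"; # hypothetical
  version = "1.0";
  src = fetchurl {
    url = "https://example.com/example-cuda-lib-1.0.tar.xz"; # placeholder
    hash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
  };

  # Keep .a archives in their own output so runtime closures (and the Docker/
  # Singularity images built from them) do not have to pay for them.
  outputs = [ "out" "dev" "static" ];

  postInstall = ''
    # moveToOutput is provided by the multiple-outputs setup hook.
    moveToOutput "lib/*.a" "$static"
  '';
}
```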

@SomeoneSerge (Contributor) commented Feb 27, 2023

Further on multiple outputs and downstream packages expecting "things in one place"

It seems that NVIDIA ships at least pkg-config files with all of the redist packages.
In practice, however, people use FindCUDAToolkit.cmake if we're lucky, FindCUDA.cmake if we're less lucky, or bazel if they positively decided to torture us 🙃. They all discover CUDA components through variables that point at a single throw-everything-in-one-location directory, e.g. CUDAToolkit_ROOT, CUDA_HOME, etc.

It would be nice to have individual CMake targets/components for different pieces of CUDA so that we could avoid symlinkJoin. I think we can consider at least opening an issue at https://github.com/NVIDIA/build-system-archive-import-examples/issues if we come up with a reasonable way to organize these things.
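
Until something like that exists, the consumer side of the workaround tends to look like this sketch (package and attribute names are illustrative):

```nix
{ stdenv, cmake, symlinkJoin, cudaPackages }:

let
  # Throw the pieces we need into a single prefix, since the CMake/bazel
  # machinery only understands one root directory.
  cudaJoined = symlinkJoin {
    name = "cuda-joined";
    paths = with cudaPackages; [ cuda_nvcc cuda_cudart libcublas ];
  };
in
stdenv.mkDerivation {
  pname = "downstream-cuda-consumer"; # hypothetical
  version = "1.0";
  src = ./.; # placeholder

  nativeBuildInputs = [ cmake ];

  # Single location that FindCUDAToolkit.cmake knows how to search.
  cmakeFlags = [ "-DCUDAToolkit_ROOT=${cudaJoined}" ];
}
```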

(Maybe @kmittman would have some hints?)

@kmittman commented Feb 27, 2023

Hi @SomeoneSerge
Unfortunately I don't really have a solution for the assumption that all of the components are installed in one place, i.e. /usr/local/cuda/. It's been that way for so long that decoupling the install paths has been problematic; I can attest to it: I changed the libcublas install path in CUDA 10.1 and, due to popular demand, had to revert that change in a later version.

EDIT: a keyboard shortcut submitted this comment early.

As I was saying, it is non-trivial and there are rpath constraints as well. But yes, please go ahead and file a request in that GitHub repo. Perhaps creating a CMake "stub" per component would be beneficial? Otherwise open to suggestions.

@ConnorBaker (Contributor, author)

From @SomeoneSerge

RE: Caching "single build that supports all capabilities" vs "multiple builds that support individual cuda architectures"

Couldn't find an issue tracking this, so I'll drop a message here.
The more precise argument in favour of building for individual capabilities is easier maintenance and nixpkgs development.
When working on master it's desirable to only build for your own arch, but currently it means a cache-miss for transitive dependencies.
For example, you work on torchvision and you import nixpkgs with config.cudaCapabilities = [ "8.6" ]. Snap! You're rebuilding pytorch, you cancel, you write a custom shell that overrides torchvision specifically, you remove asserts, etc.

Alternative world: cuda-maintainers.cachix.org has a day-old pytorch build for 8.6, a build for 7.5, a build for 6.0, etc.
Extra: faster nixpkgs-review, assuming fewer default capabilities.

Individual builds:

  • More builds, but they're lighter
  • Can re-use the cache when working on master
  • Hard to choose default capabilities that would fit most users and not cost too much

All-platforms build:

  • Less compute in total, but jobs are fat and sometimes drain the build machine
  • Simpler UX for end-users
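
For reference, the per-architecture import from the torchvision example above is just (a sketch; the CUDA packages additionally need allowUnfree):

```nix
# shell.nix (sketch): build CUDA consumers only for your own GPU's capability.
let
  pkgs = import <nixpkgs> {
    config = {
      allowUnfree = true;
      cudaSupport = true;
      cudaCapabilities = [ "8.6" ];
    };
  };
in
pkgs.mkShell {
  packages = [ pkgs.python3Packages.torchvision ];
}
```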

ConnorBaker self-assigned this Mar 9, 2023
@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/tweag-nix-dev-update-45/26397/1

@SomeoneSerge (Contributor) commented Mar 17, 2023

RE: Caching "single build that supports all capabilities" vs "multiple builds that support individual cuda architectures"

CC #221564

@SomeoneSerge (Contributor) commented Mar 17, 2023

RE: Caching "single build that supports all capabilities" vs "multiple builds that support individual cuda architectures"

Since the actual motivation is that we want our binary cache to be 1) useful (minimize cache-misses), and 2) affordable, we should also:

EDIT: Being addressed in #224068

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/need-freelancer-for-nix-packaging-for-machine-learning-dependencies-with-cuda/47268/4
