
Change default nixpkgs' cudaCapabilities #221564

Open · 9 tasks
SomeoneSerge opened this issue Mar 16, 2023 · 13 comments
@SomeoneSerge (Contributor)

Subgoals:

  1. cuda-maintainers.cachix.org caches nixpkgs imported without custom config.cudaCapabilities
  2. cuda-maintainers.cachix.org caches master at a higher frequency, but with the shortest cudaCapabilities list possible
  3. Builds are "reasonably cheap"
  4. CUDA maintainers use the same short cudaCapabilities list when working on master?
  5. Users are unlikely to accidentally use packages built with wrong capabilities. E.g. A100 users don't end up using a 7.5+PTX build
  6. Users who would need to override cudaCapabilities after the change do not fail to discover config.cudaCapabilities (see the sketch after the Steps list)

Steps:

  1. Who uses CUDA in nixpkgs and how? A NixOS Discourse post with a multiple choice poll, listing possible combinations of cudaCapabilities.
    1. Ask about MKL and alternative MPI implementations
    2. Link to it from https://matrix.to/#/#datascience:nixos.org
    3. Reference https://github.com/Nix-QChem/NixOS-QChem users
    4. Reference https://github.com/numtide/nixpkgs-unfree
    5. If there's a new nixpkgs-unfree instance from Domen Kozar, reference that
    6. Link https://discourse.nixos.org/t/improving-nixos-data-science-infrastructure-ci-for-mkl-cuda/5074/11
    7. Link https://discourse.nixos.org/t/why-is-the-nix-compiled-python-slower/18717/20?u=sergek
    8. Link https://discourse.nixos.org/t/petition-to-build-and-cache-unfree-packages-on-cache-nixos-org/17440
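
For reference, a minimal sketch of the kind of override the poll would ask about and that users (subgoal 6) would need to discover; the capability value is only an example:

```nix
import <nixpkgs> {
  config = {
    allowUnfree = true;            # CUDA packages are unfree
    cudaSupport = true;            # build the CUDA-enabled package variants
    cudaCapabilities = [ "8.6" ];  # only the GPU generation actually deployed
  };
}
```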
@SomeoneSerge (Contributor, Author)

  • Do builds like [ "8.6" "7.0+PTX" ] work? Should we consider them?
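
For context, a hedged reading of what such a mixed list would mean under the TORCH_CUDA_ARCH_LIST-style convention the "+PTX" suffix comes from (not a dump of the actual cudaFlags output):

```nix
{
  # "8.6"     -> real SASS for sm_86 only
  # "7.0+PTX" -> SASS for sm_70 plus PTX for compute_70, which the driver can
  #              JIT-compile on any 7.0-or-newer GPU at some startup cost
  cudaCapabilities = [ "8.6" "7.0+PTX" ];
}
```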

@ConnorBaker (Contributor)

Potentially related: #220366 (comment)

@SomeoneSerge (Contributor, Author)

SomeoneSerge commented Mar 21, 2023

I think we want to default to the newest available capability. Users who need lower capabilities will encounter an error (rather than a cache miss, thinking of cachix), admittedly after a substantial download. The error is also likely to look confusing, but hopefully people will discover the change through GitHub and Discourse. We'd document the change and suggest they import nixpkgs with the capability they need. We'd build caches for each capability separately.

CC @ConnorBaker @samuela

@ConnorBaker (Contributor)

ConnorBaker commented Mar 21, 2023

Newest as of 11.8 is Hopper (9.0).

The way it's set up currently, we should be able to do that by taking the last element of the supported capabilities, or something similar, right?

EDIT: Should we choose versions that cause breakages? For example, the current version of torch won't work with anything newer than 8.6 (IIRC). Although, if we had packages which were aware of what we were requesting and then picked the newest architecture they could support instead of just erroring, I guess that would be easier?
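
A minimal sketch of the "take the last element" idea from above, assuming the supported list is kept sorted in ascending order; the names and the list itself are illustrative, not the current cudaPackages.cudaFlags API:

```nix
{ lib }:

let
  # Illustrative list only; the real one lives next to the cudaPackages flags.
  supportedCudaCapabilities = [ "3.5" "5.0" "6.0" "7.0" "7.5" "8.0" "8.6" "9.0" ];
in
{
  # Default to the newest capability only; everything older becomes opt-in.
  defaultCudaCapabilities = [ (lib.last supportedCudaCapabilities) ];
}
```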

@SomeoneSerge (Contributor, Author)

current version of torch won't work with anything newer than 8.6

o_0 why?

@ConnorBaker (Contributor)

ConnorBaker commented Mar 21, 2023

o_0 why?

Oh, you know, just your standard hardcoded support 🙃

https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L1751-L1753

https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L1791-L1793

EDIT: Two paths I can think of:

  1. Patch it out (and whatever else is touched by that) so we can build for arbitrary architectures.
  2. Have cudaPackages.cudaFlags provide a function which, when given a list of supported capabilities (provided by a package), selects the arch closest to the target arch (a rough sketch follows below).

Interested in your thoughts!
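
A rough sketch of what option 2 could look like; the names are hypothetical, and "closest" is interpreted here as "the newest package-supported capability that is not newer than anything requested":

```nix
{ lib }:

{
  # requested: what the user asked for via cudaCapabilities
  # packageSupported: what the package (e.g. torch) can actually build for,
  #                   assumed sorted in ascending order
  selectCapabilities = requested: packageSupported:
    let
      notNewerThanRequested = cap:
        lib.any (req: lib.versionAtLeast req cap) requested;
      exactMatches = lib.intersectLists requested packageSupported;
      fallbacks = lib.filter notNewerThanRequested packageSupported;
    in
    if exactMatches != [ ] then exactMatches
    else if fallbacks != [ ] then [ (lib.last fallbacks) ]
    else throw "package supports none of the requested CUDA capabilities";
}
```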

@SomeoneSerge (Contributor, Author)

Patch it out (and whatever else is touched by that) so we can build for arbitrary architecture

This is more or less our default approach. Clearly, pytorch may not work with older capabilities (e.g. pytorch relying on newer instructions not supported by older hardware), but the upper bound likely only means that they do not test 9.0 yet (nvidia dropping support for instructions shouldn't happen too often?)
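
For concreteness, a very rough sketch of that approach; the attribute path and the substitution are placeholders rather than a working patch for the lines linked above, since the exact strings in cpp_extension.py vary between releases:

```nix
{ python3Packages }:

python3Packages.torch.overrideAttrs (old: {
  postPatch = (old.postPatch or "") + ''
    # Hypothetical: let newer architectures (e.g. 9.0) reach nvcc instead of
    # tripping the hardcoded upper bound.
    substituteInPlace torch/utils/cpp_extension.py \
      --replace "'8.6'" "'8.6', '9.0'"
  '';
})
```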

@ConnorBaker (Contributor)

@SomeoneSerge if you're going to patch it, would you also consider patching the check for compiler versions?

https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L51-L71

@samuela (Member)

samuela commented Mar 21, 2023

Hmm I have a number of questions,

  • What is the motivation for this change? What aspects of the current CUDA experience are we hoping to improve? I'm guessing it's a compute/download thing, but I would argue that UX is more important. Giving users an experience that "just works" should be of paramount importance IMHO. Power users can always build and cache their own packages if necessary.
  • We currently have to maintain two flavors of packages: cudaSupport = false and = true. IIUC this suggestion would result in us having to support 1 + N flavors, since there may be drift between master, what maintainers build with, and what is cached on cuda-maintainers.cachix.org. Is that correct? How many possible variations are we looking at?
  • Concretely, which capabilities are we considering removing/adding support for? If there are capabilities that are very old, e.g. for hardware that is >=5 years old, then I say yeah let's totally get rid of them.
  • If cuda-maintainers.cachix.org caches packages for a capability set that differs from master, what impact will this have on UX?

So I'm a little skeptical, but honestly this aspect of the codebase is not really my expertise so ofc open to hearing other perspectives.

@SomeoneSerge (Contributor, Author)

SomeoneSerge commented Mar 25, 2023

To speak plainly, the main source of motivation is that the fat flavours of tf and torch are so heavy, they totally confuse any schedulers.

Longer version: the motivation is to optimize for easier maintenance. When we work on master, we don't usually run our test builds with all capabilities on, because it's so slow. E.g. I usually build only for 8.6. In fact, I don't even need to deploy for more than 8.0 and 8.6. Anything else is extra gigabytes of storage and extra compute. When working on a package that needs torch or tf, it's nice to have them cached. I also want to make people aware that the cudaCapabilities option exists (we need to stabilize the interface first ofc)

We already build several flavours (cuda, cuda+mkl, different sets of capabilities). What I propose is that we stop maintaining the "fat" flavour, which sets all caps on. We'd still build N smaller packages. There are benefits to that:

  • less drift between git and cachix, because the builds are more incremental: the fat tf build may take up to 10 hours, but the first "thin flavour" can become available in some 3 hours (numbers off the top of my head so don't trust them)
  • CI robustness: several fat builds running in parallel (fat tf + fat pytorch) on the same machine tend to cause throttling, whereas multiple small builds are easier to rate-limit
  • better parallelism: multiple small builds are easier to distribute over several build nodes if/when we find those

The fat flavour isn't useless. In fact, it's very desirable:

  • You need it if you don't know (or won't bother to figure out) what environment you're going to run in. E.g. it's convenient to use on a cluster
  • It is a friendly default for nixpkgs newcomers

Fat builds are nice, but within our limited budget I think we should prioritize the smaller builds. If someone wants a binary cache for the fat flavour, there's the management work of gathering more resources that they'll have to go through.

UX-wise I just want to ensure that the default that people see when they copy and paste with (import <nixpkgs> {}); does not suddenly start building tensorflow, only for them to realize it's not going to finish the same day.

Another alternative that I like is to set cudaCapabilities to an error value that would prompt users to override the option. However, this goes against the nixpkgs policy of public attributes not showing warnings etc. Nixpkgs-review would need special treatment. Another similar approach: set broken = true unless explicit cudaCapabilities are provided by the user.
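
A hedged sketch of that last "broken unless explicit" alternative; the attribute names are illustrative, and checking for an unset option is only one possible encoding:

```nix
{ config, torch }:

torch.overrideAttrs (old: {
  meta = old.meta // {
    # Refuse to build until the user explicitly picks a capability list, so
    # nobody silently gets a multi-hour fat build or the wrong architecture.
    broken = (old.meta.broken or false) || !(config ? cudaCapabilities);
  };
})
```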

@samuela (Member)

samuela commented Apr 5, 2023

Ah gotcha, thanks for explaining @SomeoneSerge! I understand better now why N smaller builds are attractive vs. our 1 "fat" build. Indeed, it would be very nice to maintain N builds and not concern ourselves with fat builds at all.

IIUC what this really boils down to is UX challenges with cudaCapabilities. In an ideal world we could magically dispatch to using the right 1-of-N package for the GPU at runtime. But making this a reality comes with UX challenges:

  • Dispatching at runtime is likely impossible(?). Asking users to set capabilities at build time is annoying, esp. in cloud/cluster environments with varying GPU types, public code releases, etc.
  • Surfacing the cudaCapabilities option to users is tricky. Defaulting to a warning, build failure, or meta.broken status unless configured otherwise is unorthodox and sub-optimal UX. Defaulting to all capabilities is easy UX but bad for build times.

This is maybe a crazy idea, but is there any way in which we could go from N single-capability builds to a single "fat" build, without building the "fat" build from source all over again? If so, that might point towards a future in which we build N single-capability builds and then have the freedom to mix and match them cheaply...

@SomeoneSerge (Contributor, Author)

Surfacing the cudaCapabilities option to users is tricky.

Maybe a solution will just present itself in time https://discourse.nixos.org/t/working-group-member-search-module-system-for-packages/26574

@samuela (Member)

samuela commented Apr 5, 2023

Ah very cool, I was not aware of that initiative!
