hydra: build CUDA packages for the CUDA team #1335
terraform/hydra-projects.tf
input {
  name = "nixpkgs"
  type = "git"
  value = "https://github.com/NixOS/nixpkgs.git nixos-unstable"
@SomeoneSerge do you want to use another branch, like a staging-cuda or something?
We tried pushing big changes to `cuda-updates` first and building it prior to merging into `master`, but we kept coming back to targeting `master` directly. In our hercules we build master + nixos-unstable + the latest release: this way, by the time CI starts a round of nixos-unstable, some of it will have been cached by the master job. This alleviates some of the pain of `nixos-unstable` advancing without testing CUDA.
A related idea would be to pull from nix-community/nixpkgs and give the CUDA team access to it. That would limit the risks compared to giving everybody push access to NixOS/nixpkgs
For now we're building nixos-unstable-small
No objection. Maybe if this ends up too much for our current hardware we could consider upgrading.
If you tell companies that all they need to do to get CUDA packages for NixOS is to pay some monthly donations, then I am sure we would get the funding pretty quickly. Also, we still haven't asked Hetzner for the discount they are offering to the NixOS foundation. This way we would probably still save money with bigger hardware.
terraform/hydra-projects.tf
input {
  name = "supportedSystems"
  type = "nix"
  value = "[ \"x86_64-linux\" \"aarch64-linux\" \"aarch64-darwin\" ]"
Linux only?
value = "[ \"x86_64-linux\" \"aarch64-linux\" ]"
For now it's "x86_64-linux" only since upstream has set that value as a default.
Yes, I mean we should restrict it here to linux only as well.
We support both x86_64 and aarch64 linux, I'll update the release file
First run: https://hydra.nix-community.org/eval/109915
0d372d9 to df44813
@zowoq so I'm considering just building the whole `import <nixpkgs> { config.cudaSupport = true; }`; this gives me something like
❯ nix-eval-jobs --expr 'import ./pkgs/top-level/release-cuda.nix { }' --force-recurse | wc -l
...
# eval errors, eval errors
...
138452
Does that sound unreasonable? I could in principle come up with a smaller, curated set of jobs.
Hexa also raises the concern that this would be effectively mirroring the NixOS Hydra:
hexa (UTC+1)
so all of them
if there was a cache behind nix-community hydra, then you'd be mirroring cache.nixos.org effectively
SomeoneSerge (UTC+3)
Yeah... Ideally we'd have a solution that evaluates the full DAGs for vanilla and CUDA nixpkgs, starts building CUDA from the leaves (ehhh, the roots), and always suspends the build if its hash matches the vanilla hash
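A minimal sketch of that idea, not an existing tool: assume both package sets have already been evaluated into plain data, with the CUDA dependency graph as a mapping from each derivation to its direct dependencies and the vanilla set as a collection of derivation paths. The function `build_plan` and all `.drv` names below are hypothetical placeholders. Dependencies are visited first, and anything whose derivation is identical to a vanilla one is skipped, since the NixOS Hydra would build it anyway.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_plan(cuda_deps, vanilla_drvs):
    """Return the CUDA derivations to build, dependencies first.

    cuda_deps:    dict mapping a derivation to the set of its direct dependencies
    vanilla_drvs: set of derivations that also occur in the vanilla evaluation
    """
    # static_order() yields every node after all of its dependencies.
    order = TopologicalSorter(cuda_deps).static_order()
    # Skip anything that hashes to the same derivation as vanilla nixpkgs.
    return [drv for drv in order if drv not in vanilla_drvs]


# Hypothetical evaluation results (names are illustrative only).
cuda_deps = {
    "torch-cuda.drv": {"cudatoolkit.drv", "python3.drv"},
    "cudatoolkit.drv": {"gcc.drv"},
    "python3.drv": {"gcc.drv"},
    "gcc.drv": set(),
}
vanilla_drvs = {"gcc.drv", "python3.drv"}

plan = build_plan(cuda_deps, vanilla_drvs)
print(plan)
```

This only covers the "skip what vanilla already has" half. It does not address the timing concern of one Hydra starting a given derivation before the other.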
> `import <nixpkgs> { config.cudaSupport = true; }`
TBH I doubt we have the capacity to handle that without it being detrimental to the other projects using this CI. Currently we only have two builders for linux that are shared by buildbot, hercules, hydra:
### `build03`
- Provider: Hetzner
- CPU: AMD Ryzen 9 3900 12-Core Processor
- RAM: 128GB DDR4 ECC
- Drives: 2 x 1.92 TB NVME in RAID 1
### `build04`
- Provider: Hetzner
- Instance type: [RX170](https://www.hetzner.com/dedicated-rootserver/rx170)
- CPU: Ampere Altra Q80-30 80-Core Processor
- RAM: 128GB DDR4 ECC
- Drives: 2 x 960 GB NVME in RAID 0
If you do want to build everything maybe we could have dedicated machines just for this package set? Not sure if raising the money for that is feasible?
> if there was a cache behind nix-community hydra, then you'd be mirroring cache.nixos.org effectively
Not sure if I've misunderstood or not, we push everything to cachix but it skips existing nixos cache paths.
> Not sure if I've misunderstood or not, we push everything to cachix but it skips existing nixos cache paths.
The concern is that if there is a phase shift between NixOS and the Community Hydras, and the latter starts building a certain derivation from a certain commit before the former, we'll have wasted some storage and compute
> TBH I doubt we have the capacity to handle that without it being detrimental to the other projects using this CI. Currently we only have two builders for linux that are shared by buildbot, hercules, hydra:
Roger that. I'll push a smaller jobset tomorrow, based on what we've been building in https://github.com/SomeoneSerge/nixpkgs-cuda-ci.
How is the community builder funded? @ConnorBaker was asking on matrix if there's an opencollective
> The concern is that if there is a phase shift between NixOS and the Community Hydras, and the latter starts building a certain derivation from a certain commit before the former, we'll have wasted some storage and compute

Yeah, we can't really avoid that with this approach. Could try adding the free deps as blockers for nixos-unstable or could try doing something similar in a repo here, flake update PRs with `max-jobs = 0` so merging is blocked if they aren't cached?

> How is the community builder funded? ConnorBaker was asking on matrix if there's an opencollective
We have an opencollective: https://opencollective.com/nix-community
They offer discounts? My Hetzner bill for part of the CUDA CI is like $400; I’d love to consolidate some of that stuff under the community, especially if you can get a discount and we can all benefit from it!
Merging the current state. We can still do follow-up PRs afterwards!
Yes, but they ran out of the discount budget for this year. We'll have to contact them again next year.
If you want to donate hardware to the cause, we are discussing what the requirements would be in #1343
Could you go into some detail please? What hardware, what is built and what is the utilization like?
I've reverted this as it had been interfering with our other CI builds. I'll see if I can find a way of running these builds without causing problems for our other users.
I don't know how much resources that would take, but we could give it a go.
CUDA packages are slow to build, and not cached by upstream due to upstream not building unfree packages. This could help the team quite a bit.