Change default nixpkgs' cudaCapabilities #221564
Potentially related: #220366 (comment)
I think we want to default to the newest available capability. Users who need lower caps will encounter an error rather than a cache miss (thinking of cachix), admittedly after a substantial download. The error is also likely going to look confusing, but hopefully people will discover the change through GitHub and Discourse. We'd document the change and suggest they import nixpkgs with the capability they need, as in the sketch below. We'd build cache for all capabilities separately.
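A minimal sketch of what that documented override could look like, using nixpkgs' existing `config.cudaCapabilities` convention (the `"8.6"` value is just an example):

```nix
# Import nixpkgs pinned to a single, explicitly chosen capability.
import <nixpkgs> {
  config = {
    allowUnfree = true;            # CUDA packages are unfree
    cudaSupport = true;
    cudaCapabilities = [ "8.6" ];  # pick the capability your GPU supports
  };
}
```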
The newest as of CUDA 11.8 is Hopper (9.0). The way it's set up currently, we should be able to do that by taking the last element of the supported capabilities, or something similar, right? See the sketch below. EDIT: Should we choose versions that cause breakages? For example, the current version of torch won't work with anything newer than 8.6 (IIRC). Although, if we had packages that were aware of what we were requesting and picked the newest architecture they could support instead of just erroring, I guess that would be easier?
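Roughly, assuming the supported capabilities are kept in ascending order (`supportedCapabilities` is a stand-in name for whatever list the CUDA package set actually exposes):

```nix
# Sketch: default to the newest capability by taking the last element of
# an ascending list. `supportedCapabilities` is a hypothetical name.
{ lib ? (import <nixpkgs> { }).lib
, supportedCapabilities ? [ "7.0" "7.5" "8.0" "8.6" "9.0" ]
}:
{
  cudaCapabilities = [ (lib.last supportedCapabilities) ];  # -> [ "9.0" ]
}
```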
o_0 why?
Oh, you know, just your standard hardcoded support 🙃
https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L1751-L1753
https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L1791-L1793
EDIT: Two paths I can think of:
Interested in your thoughts!
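If the patching path were taken (the next comment picks it up), it could be carried as a nixpkgs overlay rather than a fork. A very rough, hypothetical sketch; the string passed to `--replace` is only an assumption about what the linked cpp_extension.py lines contain:

```nix
# Hypothetical overlay: extend torch's hardcoded list of supported
# arches rather than letting it error on newer capabilities. The real
# patch would target the cpp_extension.py ranges linked above.
final: prev: {
  python3 = prev.python3.override {
    packageOverrides = pyFinal: pyPrev: {
      torch = pyPrev.torch.overridePythonAttrs (old: {
        postPatch = (old.postPatch or "") + ''
          substituteInPlace torch/utils/cpp_extension.py \
            --replace "'8.0', '8.6'" "'8.0', '8.6', '9.0'"
        '';
      });
    };
  };
}
```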
This is more or less our default approach. Clearly, pytorch may not work with older capabilities (e.g. pytorch relying on newer instructions not supported by old hardware), but the upper bound likely only means that they do not test newer capabilities.
@SomeoneSerge if you're going to patch it, would you also consider patching the check for compiler versions? https://github.com/pytorch/pytorch/blob/v1.13.1/torch/utils/cpp_extension.py#L51-L71
Hmm, I have a number of questions:
So I'm a little skeptical, but honestly this aspect of the codebase is not really my area of expertise, so ofc I'm open to hearing other perspectives.
To speak plainly, the main source of motivation is that the fat flavours of tf and torch are so heavy that they totally confuse any schedulers.

Longer version: the motivation is to optimize for easier maintenance. When we work on master, we don't usually run our test builds with all capabilities on, because it's so slow; e.g. I usually build only for a single capability.

We already build several flavours (cuda, cuda+mkl, different sets of capabilities). What I propose is that we stop maintaining the "fat" flavour, which sets all caps on. We'd still build N smaller packages (a sketch follows below), and there are benefits to that.
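For concreteness, a sketch of the N-smaller-packages idea (the capability list and attribute layout are illustrative):

```nix
# One nixpkgs instantiation per capability, so CI can build and cache
# each slim flavour separately instead of one fat build.
let
  lib = (import <nixpkgs> { }).lib;
  capabilities = [ "7.5" "8.0" "8.6" "9.0" ];
in
lib.genAttrs capabilities (cap:
  import <nixpkgs> {
    config = {
      allowUnfree = true;
      cudaSupport = true;
      cudaCapabilities = [ cap ];
    };
  })
```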
The fat flavour isn't useless. In fact, it's very desirable.
Fat builds are nice, but within our limited budget I think we should prioritize the smaller builds. If someone wants a binary cache for the fat flavour, they'll have to go through the management work of gathering more resources.

UX-wise, I just want to ensure that the default that people see when they copy and paste is a reasonable one.

Another alternative that I like is to set cudaCapabilities to an error value that would prompt users to override the option (see the sketch below). However, this goes against the nixpkgs policy of public attributes not showing warnings etc., and nixpkgs-review would need special treatment. Another, similar approach: set broken=true unless explicit cudaCapabilities are provided by the user.
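A rough, illustrative sketch of the error-value variant:

```nix
# Evaluation aborts with an actionable message unless the user set
# config.cudaCapabilities explicitly.
{ config ? { } }:
{
  cudaCapabilities = config.cudaCapabilities or (throw ''
    No cudaCapabilities configured. Import nixpkgs with, e.g.:
      import <nixpkgs> { config.cudaCapabilities = [ "8.6" ]; }
  '');
}
```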
Ah gotcha, thanks for explaining @SomeoneSerge! I understand better now why N smaller builds are attractive vs. our one "fat" build. Indeed, it would be very nice to maintain N builds and not concern ourselves with fat builds at all. IIUC, what this really boils down to is UX challenges with the cudaCapabilities defaults.
This is maybe a crazy idea, but is there any way in which we could go from N single-capability builds to a single "fat" build, without building the "fat" build from source all over again? If so, that might point towards a future in which we build N single-capability builds and then have the freedom to mix and match them cheaply...
Maybe a solution will just present itself in time: https://discourse.nixos.org/t/working-group-member-search-module-system-for-packages/26574
Ah, very cool, I was not aware of that initiative!
Subgoals:

- `config.cudaCapabilities`
- `cudaCapabilities`
- list possible `cudaCapabilities`
- list when working on master? `7.5+PTX`
- build `cudaCapabilities`
- do not fail to discover `config.cudaCapabilities`

Steps:

- `cudaCapabilities`