feat: enable module unloading and memory hotplug for nvidia gpu-operator #1031
base: main
Conversation
This patch enables module unloading and memory hotplug to support the nvidia gpu-operator.

Module unloading is required by driver containers. These containers are managed by the operator and will load and unload the nvidia kernel modules as part of their lifecycle.

Memory hotplug is required to provide `/sys/devices/system/memory`. In particular, driver containers use `/sys/devices/system/memory/auto_online_blocks` to manage nvidia uvm (unified virtual memory).

Signed-off-by: Jean-Francois Roy <[email protected]>
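As a sketch, the options involved would presumably look like this in the kernel config (option names taken from mainline Kconfig; the exact set this patch touches is in the diff, not reproduced here):

```
# Allow loaded modules to be removed (needed by nvidia driver containers)
CONFIG_MODULE_UNLOAD=y
# Provide /sys/devices/system/memory, including auto_online_blocks
CONFIG_MEMORY_HOTPLUG=y
```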
How would the module containers work with Talos, given that Talos requires all modules to be signed with each kernel build (release)? Just trying to understand the full picture here. Would we (the Talos team) build and publish them, and the NVIDIA GPU operator would pull and load them on the fly? (By the way, module loading is prohibited for any workloads in Talos as another security measure.)
See siderolabs/talos#9339 for my rationale. We can probably focus the conversation there.
I guess the reason the operator needs module unload is if the operator changes to a different nvidia driver version; is that assumption right?
Yes, basically. The gpu-operator can use a host-installed driver or a driver container. For driver containers, it schedules a DaemonSet for that container. The main image for that DaemonSet is derived from information in the gpu-operator's ClusterPolicy and node annotations.

Driver containers are basically an install rootfs plus a script that unloads the kernel modules, makes the userspace components available at a well-known location, and reloads the modules.

One thing of note is that the gpu-operator doesn't really know or care about the details of the driver container. The script could be changed from bash to Python or replaced by a Go program, so long as the semantics are maintained. So, for example, Sidero Labs could provide Nvidia driver containers that use a Go program to communicate with machined to delegate privileged operations. It also means that a Talos driver container could cheat and skip kernel module management entirely. The script Nvidia maintains in its driver container images does encode a certain sequence of kernel module loading that would have to be replicated in the Talos machine config, along with some amount of logic to apply kernel module parameters.

Or another approach might be to stage the kernel modules in a known location on the ephemeral volume and reboot the node, with a service responsible for loading the modules from that known path. The important idea is to honor the contract with the cluster operators around what happens when the desired driver configuration changes.
Hmm, that seems more plausible, and since Talos already supports API access from pods, it could dynamically inject machine config (though I'm not sure that's a good approach considering how we want this to work from the Omni side too). I did take a look at that script, and having it as a Go program makes more sense and could also reduce the image size.
I prefer the former, since that means the driver containers could be re-used and kept somewhat separate from Talos releases, though we'd still need to build them against a kernel version, which we do anyway.
I wonder if we can upstream the bash script as a Go program, so changes get picked up upstream and we can just tweak things via CLI args or something 🤔
Given Talos's existing security approach, I would prefer to keep the modules and firmware shipped with Talos immutable and read-only. If we have to load kernel modules in a different way (e.g. order or arguments), that sounds fine. The idea of an operator that manages the whole cluster sounds a bit complicated in the case where you have a mix of different Talos versions (and Linux kernel/driver versions). So I would rather not compromise on integrity/security, at least in the first iteration.
It's complicated, but we've done it. I can remove the module unload change, but it can also remain, since without it the modules can never be unloaded at all.
I guess you can keep the module unload capability |
Additional changes are required in Talos to make use of this. A glibc extension is required by the gpu-operator; see siderolabs/extensions#473.