
feat: enable module unloading and memory hotplug for nvidia gpu-operator #1031

Open

wants to merge 1 commit into main
Conversation

jfroy
Contributor

@jfroy jfroy commented Sep 18, 2024

This patch enables module unloading and memory hotplug to support the nvidia gpu-operator.

Module unloading is required by driver containers. These containers are managed by the operator and will load and unload the nvidia kernel modules as part of their lifecycle.

Memory hotplug is required to provide /sys/devices/system/memory. In particular, driver containers use /sys/devices/system/memory/auto_online_blocks to manage nvidia uvm (unified virtual memory).
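For illustration, a minimal sketch (in Go, purely hypothetical; the actual driver containers use a shell script) of the sysfs write a driver container performs, which only works when memory hotplug support exposes /sys/devices/system/memory:

```go
// Hypothetical sketch: the write a driver container performs so that
// memory blocks hot-added by nvidia-uvm are onlined automatically.
// /sys/devices/system/memory only exists with memory hotplug enabled.
package main

import (
	"log"
	"os"
)

func main() {
	const autoOnline = "/sys/devices/system/memory/auto_online_blocks"
	if err := os.WriteFile(autoOnline, []byte("online"), 0o644); err != nil {
		log.Fatalf("setting auto_online_blocks: %v", err)
	}
}
```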

Additional changes are required in Talos to make use of this. A glibc extension is required by the gpu-operator. See siderolabs/extensions#473.

@smira
Member

smira commented Sep 18, 2024

How would the module containers work with Talos given that Talos requires all modules to be signed with each kernel build (release)? Just trying to understand the full picture here.

Would we (Talos team) build them and publish, and NVIDIA GPU operator would pull and load them on the fly?

(By the way, module loading is prohibited for any workloads in Talos as another security measure)

@jfroy
Contributor Author

jfroy commented Sep 18, 2024

How would the module containers work with Talos given that Talos requires all modules to be signed with each kernel build (release)? Just trying to understand the full picture here.

Would we (Talos team) build them and publish, and NVIDIA GPU operator would pull and load them on the fly?

(By the way, module loading is prohibited for any workloads in Talos as another security measure)

See siderolabs/talos#9339 for my rationale. We can probably focus the conversation there.

@frezbo
Member

frezbo commented Sep 18, 2024

I guess the reason the operator needs module unload is if the operator changes to use a different nvidia driver version, is that assumption right?

@jfroy
Contributor Author

jfroy commented Sep 19, 2024

I guess the reason the operator needs module unload is if the operator changes to use a different nvidia driver version, is that assumption right?

Yes, basically. The gpu operator can use a host-installed driver or a driver container. For driver containers, it schedules a daemonset for that container. The main image for that daemonset is derived from information in the gpu-operator's ClusterPolicy and node annotations. The ClusterPolicy provides the repository, image name, and driver version. The operator combines this driver version with annotations providing each node's kernel version and OS ID and version to compute the driver container image's tag. So although there can only be one ClusterPolicy object in a cluster, each node will pull a driver container image that is suited for its kernel version and operating system. If any of these things change, the operator will schedule a new driver container.
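A rough sketch of that tag computation in Go (the exact tag layout and field names here are assumptions for illustration, not the operator's actual format):

```go
// Hypothetical sketch of how a per-node driver image reference could be
// derived: repository, image name, and driver version come from the
// ClusterPolicy (or an NVIDIADriver resource), while kernel version and
// OS ID/version come from node annotations. The tag layout is assumed.
package main

import "fmt"

func driverImage(repo, image, driverVersion, kernelVersion, osTag string) string {
	// osTag would combine the node's OS ID and version, e.g. "talos1.8.0".
	return fmt.Sprintf("%s/%s:%s-%s-%s", repo, image, driverVersion, kernelVersion, osTag)
}

func main() {
	// Example values only.
	fmt.Println(driverImage("nvcr.io/nvidia", "driver", "550.90.07", "6.6.47", "talos1.8.0"))
}
```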

In addition to this single ClusterPolicy, a cluster operator can also enable the NVIDIADriver CRD. NVIDIADriver includes the same driver information as the ClusterPolicy as well as a node selector, enabling per-node driver versions.

Driver containers are basically an install rootfs and a script that unloads the kernel modules, makes the userspace components available at /run/nvidia/driver (typically by bind-mounting the root of the driver container to that path), copies firmware files to the host (typically by copying them to a known host path mounted as a volume and writing that path to /sys/module/firmware_class/parameters/path), and loads the new kernel modules. The script does other things, like starting the persistence daemon, loading other related modules, and handling related Nvidia tech like fabric manager. Once all that is done, the script sleeps, and when it receives SIGTERM (e.g. when the pod is deleted) it unloads the kernel modules.
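As a rough illustration of that lifecycle, here is a minimal Go sketch, assuming modprobe/rmmod are available in the container image and the firmware has already been copied to the host path used below; this is not the actual Nvidia script:

```go
// Minimal sketch of the driver container lifecycle described above.
// Module names, paths, and ordering are illustrative assumptions.
package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// 1. Unload any previously loaded nvidia modules (requires module unload support).
	for _, m := range []string{"nvidia_uvm", "nvidia_drm", "nvidia_modeset", "nvidia"} {
		_ = run("rmmod", m) // ignore "not loaded" errors
	}

	// 2. Point the kernel firmware loader at the firmware copied to the host.
	_ = os.WriteFile("/sys/module/firmware_class/parameters/path",
		[]byte("/run/nvidia/driver/lib/firmware"), 0o644)

	// 3. Load the new modules (userspace bits are bind-mounted at /run/nvidia/driver).
	for _, m := range []string{"nvidia", "nvidia_modeset", "nvidia_drm", "nvidia_uvm"} {
		if err := run("modprobe", m); err != nil {
			os.Exit(1)
		}
	}

	// 4. Sleep until SIGTERM (e.g. pod deletion), then unload the modules again.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	<-sig
	for _, m := range []string{"nvidia_uvm", "nvidia_drm", "nvidia_modeset", "nvidia"} {
		_ = run("rmmod", m)
	}
}
```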

One thing of note is that the gpu-operator doesn't really know or care about the details of the driver container. This script could be changed from bash to Python or replaced by a Go program, so long as the semantics are maintained. So for example, SideroLabs could provide Nvidia driver containers that use a Go program to communicate with machined to delegate privileged operations.

It also means that a Talos driver container could cheat and skip kernel module management. The script Nvidia maintains in its driver container images does encode a certain sequence of kernel module loading that would have to be replicated in the Talos machine config, as well as some logic to apply kernel module parameters from the ClusterPolicy, which would also need to be synchronized or replicated between the machine config and the ClusterPolicy.

Or another approach might be to stage the kernel modules in a known location on the ephemeral volume and reboot the node, with a service responsible for loading the modules at that known path.
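A minimal sketch of what such a service could look like (in Go, using finit_module via golang.org/x/sys/unix; the staging path and module list are assumptions):

```go
// Sketch of the alternative approach: a host-side service that loads
// modules previously staged at a known path on the ephemeral volume.
package main

import (
	"log"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func loadModule(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// finit_module loads a module from an open file descriptor.
	return unix.FinitModule(int(f.Fd()), "", 0)
}

func main() {
	staged := "/var/nvidia/modules" // hypothetical staging location
	for _, m := range []string{"nvidia.ko", "nvidia_modeset.ko", "nvidia_drm.ko", "nvidia_uvm.ko"} {
		if err := loadModule(filepath.Join(staged, m)); err != nil {
			log.Fatalf("loading %s: %v", m, err)
		}
	}
}
```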

The important idea is to honor the contract with cluster operators where changing a ClusterPolicy or NVIDIADriver resource applies the changes to the cluster (or applicable nodes). This enables powerful, narrow use cases, for example allowing a group of users to manage NVIDIADriver resources (potentially with an admission webhook to validate the node selector) without those users being able to do the equivalent of talosctl upgrade. And of course this is also a mechanism, on managed clusters where host access or configuration is not offered, to allow users to manage the driver they want, potentially with some restrictions depending on the distribution.

@frezbo
Member

frezbo commented Sep 19, 2024

One thing of note is that the gpu-operator doesn't really know or care about the details of the driver container. This script could be changed from bash to Python or replaced by a Go program, so long as the semantics are maintained. So for example, SideroLabs could provide Nvidia driver containers that use a Go program to communicate with machined to delegate privileged operations.

Hmm, that seems more plausible, and since Talos already supports API access from pods, it could dynamically inject machine config (though I'm not sure if that's a good approach considering how we want this to work from the Omni side too). I did take a look at that script, and having it as a Go program makes more sense and could also reduce the image size.

Or another approach might be to stage the kernel modules in a known-location on the ephemeral volume and reboot the node, with a service responsible for loading the modules at that known path.

I prefer the former since that means the driver containers could be re-used and kept somewhat separate from Talos releases, though we'd still need to build them against a kernel version, which we do anyway.

@frezbo
Member

frezbo commented Sep 19, 2024

Hmm, that seems more plausible, and since Talos already supports API access from pods, it could dynamically inject machine config (though I'm not sure if that's a good approach considering how we want this to work from the Omni side too). I did take a look at that script, and having it as a Go program makes more sense and could also reduce the image size.

I wonder if we can upstream the bash script as a Go program so changes get made upstream and we can just change things via CLI args or something 🤔

@smira
Member

smira commented Sep 19, 2024

From Talos' existing security approach, I would prefer to still keep the modules and firmware shipped with Talos immutable and read-only. If we have to load kernel modules in a different way (e.g. order or arguments), that sounds good.

The idea of an operator that manages the whole cluster sounds a bit complicated in the case where you have a mix of different Talos versions (and Linux kernel/driver versions).

So I would rather not compromise on the integrity/security at least in the first iteration.

@jfroy
Contributor Author

jfroy commented Sep 21, 2024

From Talos' existing security approach, I would prefer to still keep the modules and firmware shipped with Talos immutable and read-only. If we have to load kernel modules in a different way (e.g. order or arguments), that sounds good.

The idea of an operator that manages the whole cluster sounds a bit complicated in the case where you have a mix of different Talos versions (and Linux kernel/driver versions).

So I would rather not compromise on the integrity/security at least in the first iteration.

It's complicated but we've done it.

--

I can remove the module unload change, but it can also remain since without SYS_MODULE pretty much only machined will be able to issue module unloads. It's a pretty reasonable capability to have.

@frezbo
Member

frezbo commented Sep 21, 2024

I can remove the module unload change, but it can also remain since without SYS_MODULE pretty much only machined will be able to issue module unloads. It's a pretty reasonable capability to have.

I guess you can keep the module unload capability
