
feat: enable module unloading and memory hotplug for nvidia gpu-operator #1031

Open

wants to merge 1 commit into main
Conversation

jfroy
Contributor

@jfroy jfroy commented Sep 18, 2024

This patch enables module unloading and memory hotplug to support the nvidia gpu-operator.

Module unloading is required by driver containers. These containers are managed by the operator and will load and unload the nvidia kernel modules as part of their lifecycle.

Memory hotplug is required to provide /sys/devices/system/memory. In particular, driver containers use /sys/devices/system/memory/auto_online_blocks to manage nvidia uvm (unified virtual memory).
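For illustration, a minimal sketch (in Go, purely hypothetical; the actual driver containers use a shell script) of the sysfs write a driver container performs, which only works when memory hotplug support exposes /sys/devices/system/memory:

```go
// Hypothetical sketch: the write a driver container performs so that
// memory blocks hot-added by nvidia-uvm are onlined automatically.
// /sys/devices/system/memory only exists with memory hotplug enabled.
package main

import (
	"log"
	"os"
)

func main() {
	const autoOnline = "/sys/devices/system/memory/auto_online_blocks"
	if err := os.WriteFile(autoOnline, []byte("online"), 0o644); err != nil {
		log.Fatalf("setting auto_online_blocks: %v", err)
	}
}
```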

Additional changes are required in Talos to make use of this. A glibc extension is required by the gpu-operator. See siderolabs/extensions#473.

@smira
Member

smira commented Sep 18, 2024

How would the module containers work with Talos given that Talos requires all modules to be signed with each kernel build (release)? Just trying to understand the full picture here.

Would we (Talos team) build them and publish, and NVIDIA GPU operator would pull and load them on the fly?

(By the way, module loading is prohibited for any workloads in Talos as another security measure)

@jfroy
Contributor Author

jfroy commented Sep 18, 2024

How would the module containers work with Talos given that Talos requires all modules to be signed with each kernel build (release)? Just trying to understand the full picture here.

Would we (Talos team) build them and publish, and NVIDIA GPU operator would pull and load them on the fly?

(By the way, module loading is prohibited for any workloads in Talos as another security measure)

See siderolabs/talos#9339 for my rationale. We can probably focus the conversation there.

@frezbo
Member

frezbo commented Sep 18, 2024

I guess the reason the operator needs module unload is if the operator changes to use a different nvidia driver version, is that assumption right?

@jfroy
Contributor Author

jfroy commented Sep 19, 2024

I guess the reason the operator needs module unload is if the operator changes to use a different nvidia driver version, is that assumption right?

Yes, basically. The gpu operator can use a host-installed driver or a driver container. For driver containers, it schedules a daemonset for that container. The main image for that daemonset is derived from information in the gpu-operator's ClusterPolicy and node annotations. The ClusterPolicy provides the repository, image name, and driver version. The operator combines this driver version with annotations providing each node's kernel version and OS ID and version to compute the driver container image's tag. So although there can only be one ClusterPolicy object in a cluster, each node will pull a driver container image that is suited for its kernel version and operating system. If any of these things change, the operator will schedule a new driver container.
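A rough sketch of that tag computation in Go (the exact tag layout and field names here are assumptions for illustration, not the operator's actual format):

```go
// Hypothetical sketch of how a per-node driver image reference could be
// derived: repository, image name, and driver version come from the
// ClusterPolicy (or an NVIDIADriver resource), while kernel version and
// OS ID/version come from node annotations. The tag layout is assumed.
package main

import "fmt"

func driverImage(repo, image, driverVersion, kernelVersion, osTag string) string {
	// osTag would combine the node's OS ID and version, e.g. "talos1.8.0".
	return fmt.Sprintf("%s/%s:%s-%s-%s", repo, image, driverVersion, kernelVersion, osTag)
}

func main() {
	// Example values only.
	fmt.Println(driverImage("nvcr.io/nvidia", "driver", "550.90.07", "6.6.47", "talos1.8.0"))
}
```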

In addition to this single ClusterPolicy, a cluster operator can also enable the NVIDIADriver CRD. NVIDIADriver includes the same driver information as the ClusterPolicy as well as a node selector, enabling per-node driver versions.

Driver containers are basically an install rootfs and a script that unloads the kernel modules, makes the userspace components available at /run/nvidia/driver (typically by bind-mounting the root of the driver container to that path), copies firmware files to the host (typically by copying them to a known host path mounted as a volume and writing that path to /sys/module/firmware_class/parameters/path), and loads the new kernel modules. The script does other things, like starting the persistence daemon, loading other related modules, and handling related Nvidia tech like fabric manager. Once all that is done, the script sleeps, and when it receives SIGTERM (e.g. when the pod is deleted) it unloads the kernel modules.
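As a rough illustration of that lifecycle, here is a minimal Go sketch, assuming modprobe/rmmod are available in the container image and the firmware has already been copied to the host path used below; this is not the actual Nvidia script:

```go
// Minimal sketch of the driver container lifecycle described above.
// Module names, paths, and ordering are illustrative assumptions.
package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// 1. Unload any previously loaded nvidia modules (requires module unload support).
	for _, m := range []string{"nvidia_uvm", "nvidia_drm", "nvidia_modeset", "nvidia"} {
		_ = run("rmmod", m) // ignore "not loaded" errors
	}

	// 2. Point the kernel firmware loader at the firmware copied to the host.
	_ = os.WriteFile("/sys/module/firmware_class/parameters/path",
		[]byte("/run/nvidia/driver/lib/firmware"), 0o644)

	// 3. Load the new modules (userspace bits are bind-mounted at /run/nvidia/driver).
	for _, m := range []string{"nvidia", "nvidia_modeset", "nvidia_drm", "nvidia_uvm"} {
		if err := run("modprobe", m); err != nil {
			os.Exit(1)
		}
	}

	// 4. Sleep until SIGTERM (e.g. pod deletion), then unload the modules again.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	<-sig
	for _, m := range []string{"nvidia_uvm", "nvidia_drm", "nvidia_modeset", "nvidia"} {
		_ = run("rmmod", m)
	}
}
```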

One thing of note is that the gpu-operator doesn't really know or care about the details of the driver container. This script could be changed from bash to Python or replaced by a Go program, so long as the semantics are maintained. So for example, SideroLabs could provide Nvidia driver containers that use a Go program to communicate with machined to delegate privileged operations.

It also means that a Talos driver container could cheat and skip kernel module management. The script Nvidia maintains in its driver container images does encode a certain sequence of kernel module loading that would have to be replicated in the Talos machine config, as well as some logic to apply kernel module parameters from the ClusterPolicy, which would also need to be synchronized or replicated between the machine config and the ClusterPolicy.

Or another approach might be to stage the kernel modules in a known location on the ephemeral volume and reboot the node, with a service responsible for loading the modules at that known path.
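A minimal sketch of what such a service could look like (in Go, using finit_module via golang.org/x/sys/unix; the staging path and module list are assumptions):

```go
// Sketch of the alternative approach: a host-side service that loads
// modules previously staged at a known path on the ephemeral volume.
package main

import (
	"log"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func loadModule(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// finit_module loads a module from an open file descriptor.
	return unix.FinitModule(int(f.Fd()), "", 0)
}

func main() {
	staged := "/var/nvidia/modules" // hypothetical staging location
	for _, m := range []string{"nvidia.ko", "nvidia_modeset.ko", "nvidia_drm.ko", "nvidia_uvm.ko"} {
		if err := loadModule(filepath.Join(staged, m)); err != nil {
			log.Fatalf("loading %s: %v", m, err)
		}
	}
}
```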

The important idea is to honor the contract with cluster operators where changing a ClusterPolicy or NVIDIADriver resource applies the changes to the cluster (or applicable nodes). This enables powerful, narrow use cases, for example allowing a group of users to manage NVIDIADriver resources (potentially with an admission webhook to validate the node selector) without those users being able to do the equivalent of talosctl upgrade. And of course this is also a mechanism, on managed clusters where host access or configuration is not offered, to allow users to manage the driver they want, potentially with some restrictions depending on the distribution.

@frezbo
Member

frezbo commented Sep 19, 2024

One thing of note is that the gpu-operator doesn't really know or care about the details of the driver container. This script could be changed from bash to Python or replaced by a Go program, so long as the semantics are maintained. So for example, SideroLabs could provide Nvidia driver containers that use a Go program to communicate with machined to delegate privileged operations.

Hmm, that seems more plausible, and since Talos already supports API access from pods, it could dynamically inject machine config (though I'm not sure if that's a good approach considering how we want this to work from the Omni side too). I did take a look at that script, and having it as a Go program makes more sense and could also reduce the image size.

Or another approach might be to stage the kernel modules in a known-location on the ephemeral volume and reboot the node, with a service responsible for loading the modules at that known path.

I prefer the former since that means the driver containers could be re-used and kept somewhat separate from Talos releases, though we'd still need to build them against a kernel version, which we do anyway.

@frezbo
Member

frezbo commented Sep 19, 2024

Hmm, that seems more plausible, and since Talos already supports API access from pods, it could dynamically inject machine config (though I'm not sure if that's a good approach considering how we want this to work from the Omni side too). I did take a look at that script, and having it as a Go program makes more sense and could also reduce the image size.

I wonder if we can upstream the bash script as a Go program so changes get made upstream and we can just change things via CLI args or something 🤔

@smira
Member

smira commented Sep 19, 2024

From Talos' existing security approach, I would prefer to still keep the modules and firmware shipped with Talos immutable and read-only. If we have to load kernel modules in a different way (e.g. order or arguments), that sounds good.

The idea of an operator that manages the whole cluster sounds a bit complicated in the case where you have a mix of different Talos versions (and Linux kernel/driver versions).

So I would rather not compromise on the integrity/security at least in the first iteration.

@jfroy
Contributor Author

jfroy commented Sep 21, 2024

From Talos' existing security approach, I would prefer to still keep the modules and firmware shipped with Talos immutable and read-only. If we have to load kernel modules in a different way (e.g. order or arguments), that sounds good.

The idea of an operator that manages the whole cluster sounds a bit complicated in the case where you have a mix of different Talos versions (and Linux kernel/driver versions).

So I would rather not compromise on the integrity/security at least in the first iteration.

It's complicated but we've done it.

--

I can remove the module unload change, but it can also remain since without SYS_MODULE pretty much only machined will be able to issue module unloads. It's a pretty reasonable capability to have.

@frezbo
Member

frezbo commented Sep 21, 2024

I can remove the module unload change, but it can also remain since without SYS_MODULE pretty much only machined will be able to issue module unloads. It's a pretty reasonable capability to have.

I guess you can keep the module unload capability
