docs: update nvidia docs
Update NVIDIA docs to point to use pre-built extensions.

Ref: siderolabs/extensions#201

Fixes: #7611

Signed-off-by: Noel Georgi <[email protected]>
frezbo committed Aug 16, 2023
1 parent 9606e87 commit b5c0e7b
Showing 3 changed files with 74 additions and 68 deletions.
@@ -35,7 +35,63 @@ export REGISTRY=ghcr.io
```

> The examples below will use the sample variables set above.
Modify accordingly for your environment.
> Modify accordingly for your environment.
> If using Talos v1.4.8 or later, follow the steps below and then jump directly to [Verifying the NVIDIA modules and the system extension](#verifying-the-nvidia-modules-and-the-system-extension).
## Building the NVIDIA extensions

Start by cloning the `release-1.5` branch of the [extensions](https://github.com/siderolabs/extensions) repository.

```bash
git clone --depth=1 --branch=release-1.5 https://github.com/siderolabs/extensions.git
```

Look up the version of [pkgs](https://github.com/siderolabs/pkgs) used for your Talos version [here](https://github.com/siderolabs/talos/blob/v1.4.8/pkg/machinery/gendata/data/pkgs).

> Replace v1.4.8 with the Talos version you are using.
Now run the following command to build and push the custom NVIDIA extension.

```bash
make nonfree-kmod-nvidia PKGS=<pkgs-version-looked-up-above> PLATFORM=linux/amd64 PUSH=true
```
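
The lookup and build steps above could be scripted roughly as follows. This is a minimal sketch, not part of the upstream tooling: the function names `pkgs_version_url`, `platform_for_arch`, and `build_command` are hypothetical, and the raw-file URL assumes GitHub's standard `raw.githubusercontent.com` layout for the `pkgs` file linked above.

```bash
# Build the raw-file URL of the pkgs version pinned by a given Talos tag.
# Assumption: the file linked above is served under raw.githubusercontent.com.
pkgs_version_url() {
  local talos_version="$1"
  printf 'https://raw.githubusercontent.com/siderolabs/talos/%s/pkg/machinery/gendata/data/pkgs\n' "$talos_version"
}

# Map a machine architecture (as printed by `uname -m`) to a PLATFORM value.
platform_for_arch() {
  case "$1" in
    x86_64|amd64) echo "linux/amd64" ;;
    aarch64|arm64) echo "linux/arm64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Print (rather than run) the make invocation for the extension build.
build_command() {
  local pkgs_version="$1" platform="$2"
  printf 'make nonfree-kmod-nvidia PKGS=%s PLATFORM=%s PUSH=true\n' "$pkgs_version" "$platform"
}
```

For example, `curl -fsSL "$(pkgs_version_url v1.4.8)"` should print the pkgs version to pass as `PKGS`, and `build_command <pkgs-version> "$(platform_for_arch "$(uname -m)")"` prints the `make` command to run from the extensions checkout.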

> Replace the platform with `linux/arm64` if building for ARM64
> Make sure to use `talosctl` version {{< release >}} or later
First create a patch yaml `gpu-worker-patch.yaml` to update the machine config similar to below:

```yaml
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/siderolabs/nonfree-kmod-nvidia:{{< nvidia_driver_release >}}-{{< release >}}
    - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
- op: add
  path: /machine/kernel
  value:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
- op: add
  path: /machine/sysctls
  value:
    net.core.bpf_jit_harden: 1
```
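
Outside the docs site, the `{{< ... >}}` shortcodes in the patch have to be replaced with concrete versions. One way to do that, sketched here under assumptions, is to render the patch from shell variables. The function name `render_gpu_patch` and its three arguments are hypothetical, and the YAML indentation is the conventional layout for this kind of patch rather than something confirmed by the rendered diff.

```bash
# Render the machine-config patch for the given versions.
# Arguments (all supplied by the caller, none are defaults from the docs):
#   $1 NVIDIA driver version, $2 Talos release, $3 container toolkit version
render_gpu_patch() {
  local driver="$1" talos="$2" toolkit="$3"
  cat <<EOF
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/siderolabs/nonfree-kmod-nvidia:${driver}-${talos}
    - image: ghcr.io/siderolabs/nvidia-container-toolkit:${driver}-${toolkit}
- op: add
  path: /machine/kernel
  value:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
- op: add
  path: /machine/sysctls
  value:
    net.core.bpf_jit_harden: 1
EOF
}
```

Calling `render_gpu_patch <driver-version> <talos-release> <toolkit-version> > gpu-worker-patch.yaml` writes the file used in the next step.
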
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
```bash
talosctl patch mc --patch @gpu-worker-patch.yaml
```

Now we can proceed to upgrading Talos to the same version to enable the system extension:

```bash
talosctl upgrade --image=ghcr.io/siderolabs/installer:{{< release >}}
```

## Building the installer image

@@ -108,6 +164,8 @@ Now we can proceed to upgrading Talos with the installer built previously:
talosctl upgrade --image=ghcr.io/talos-user/installer:{{< release >}}-nvidia
```

## Verifying the NVIDIA modules and the system extension

Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.

This can be confirmed by running:
@@ -6,72 +6,15 @@ aliases:
---

> Enabling NVIDIA GPU support on Talos is bound by [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/).
> The Talos published NVIDIA drivers are bound to a specific Talos release.
> The extension versions also need to be updated when upgrading Talos.
These are the steps to enabling NVIDIA support in Talos.
The published versions of the NVIDIA system extensions can be found here:

- Talos pre-installed on a node with NVIDIA GPU installed.
- Building a custom Talos installer image with NVIDIA modules
- Upgrading Talos with the custom installer and enabling NVIDIA modules and the system extension
- [nonfree-kmod-nvidia](https://github.com/siderolabs/extensions/pkgs/container/nonfree-kmod-nvidia)
- [nvidia-container-toolkit](https://github.com/siderolabs/extensions/pkgs/container/nvidia-container-toolkit)

This requires that the user build and maintain their own Talos installer image.

## Prerequisites

This guide assumes the user has access to a container registry with `push` permissions, Docker is installed on the build machine, and the Talos host has `pull` access to the container registry.

Set the local registry, username and version environment variables:

```bash
export USERNAME=<username>
export REGISTRY=<registry>
export TAG={{< release >}}-nvidia
```

For example:

```bash
export USERNAME=talos-user
export REGISTRY=ghcr.io
```

> The examples below will use the sample variables set above.
Modify accordingly for your environment.

## Building the installer image

Start by cloning the [pkgs](https://github.com/siderolabs/pkgs) repository.

Now run the following command to build and push a custom Talos kernel image and the NVIDIA image, with the NVIDIA kernel modules signed by the kernel built along with it.

```bash
make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true
```

> Replace the platform with `linux/arm64` if building for ARM64
Now we need to create a custom Talos installer image.

Start by creating a `Dockerfile` with the following content:

```Dockerfile
FROM scratch as customization
# this is needed so that Talos copies base kernel modules info and default modules shipped with Talos
COPY --from=ghcr.io/talos-user/kernel:{{< release >}}-nvidia /lib/modules /kernel/lib/modules
COPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:{{< release >}}-nvidia /lib/modules /lib/modules

FROM ghcr.io/siderolabs/installer:{{< release >}}
COPY --from=ghcr.io/talos-user/kernel:{{< release >}}-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
```

Now build the image and push it to the registry.

```bash
DOCKER_BUILDKIT=0 docker build --squash -t ghcr.io/talos-user/installer:{{< release >}}-nvidia .
docker push ghcr.io/talos-user/installer:{{< release >}}-nvidia
```

> Note: BuildKit has a bug ([#816](https://github.com/moby/buildkit/issues/816)); to disable BuildKit, use `DOCKER_BUILDKIT=0`.
> Replace the platform with `linux/arm64` if building for ARM64
> To build an NVIDIA driver version not published by SideroLabs, follow the instructions [here]({{< relref "../../../v1.4/talos-guides/configuration/nvidia-gpu-proprietary" >}}).
## Upgrading Talos and enabling the NVIDIA modules and the system extension

@@ -83,6 +26,7 @@ First create a patch yaml `gpu-worker-patch.yaml` to update the machine config s
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/siderolabs/nonfree-kmod-nvidia:{{< nvidia_driver_release >}}-{{< release >}}
    - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}}
- op: add
  path: /machine/kernel
@@ -98,16 +42,20 @@ First create a patch yaml `gpu-worker-patch.yaml` to update the machine config s
    net.core.bpf_jit_harden: 1
```
> Update the driver version and Talos release in the patch YAML above to match the published versions if newer ones are available.
> Make sure the driver version matches for both the `nonfree-kmod-nvidia` and `nvidia-container-toolkit` extensions.
> The `nonfree-kmod-nvidia` extension is versioned as `<nvidia-driver-version>-<talos-release-version>` and the `nvidia-container-toolkit` extension is versioned as `<nvidia-driver-version>-<nvidia-container-toolkit-version>`.
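
The driver-version match described above can be checked before applying the patch. The following is a minimal sketch, assuming the tag formats stated in the note and that the driver version itself contains no `-`; the helper names `image_tag`, `driver_version`, and `same_driver_version` are hypothetical.

```bash
# Extract the tag (the text after the last ':') from an image reference.
image_tag() {
  printf '%s\n' "${1##*:}"
}

# Extract the driver version: the part of the tag before the first '-'.
# Assumption: the driver version itself contains no '-'.
driver_version() {
  local tag
  tag="$(image_tag "$1")"
  printf '%s\n' "${tag%%-*}"
}

# Succeed only if both image references carry the same driver version.
same_driver_version() {
  [ "$(driver_version "$1")" = "$(driver_version "$2")" ]
}
```

Running `same_driver_version <kmod-image> <toolkit-image> || echo "driver versions differ"` flags a mismatch between the two extension images.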

Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:

```bash
talosctl patch mc --patch @gpu-worker-patch.yaml
```
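
When several nodes have GPUs, the patch command can be repeated per node. The sketch below assumes the node addresses are passed as arguments (any addresses you use are your own, none come from the docs) and relies on the `--nodes` flag of `talosctl`; the function name `patch_gpu_nodes` and the `DRY_RUN` variable are hypothetical. By default it only prints the commands.

```bash
# Apply (or, by default, just print) the patch command for each GPU node.
# Usage: patch_gpu_nodes <patch-file> <node-address>...
# Set DRY_RUN=0 to actually invoke talosctl.
patch_gpu_nodes() {
  local patch_file="$1"
  shift
  local node
  for node in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "talosctl patch mc --nodes $node --patch @$patch_file"
    else
      talosctl patch mc --nodes "$node" --patch "@$patch_file"
    fi
  done
}
```

Reviewing the printed commands first, then rerunning with `DRY_RUN=0`, keeps the patch from being applied to a node without a GPU by accident.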

Now we can proceed to upgrading Talos with the installer built previously:
Now we can proceed to upgrading Talos to the same version to enable the system extension:

```bash
talosctl upgrade --image=ghcr.io/talos-user/installer:{{< release >}}-nvidia
talosctl upgrade --image=ghcr.io/siderolabs/installer:{{< release >}}
```
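
Because the `nonfree-kmod-nvidia` tag ends in the Talos release, the matching installer image can be derived from it instead of being typed by hand. This is a sketch under that assumption (tag format `<nvidia-driver-version>-<talos-release-version>`, with no `-` in the driver version); the function names `talos_release_of`, `installer_for_extension`, and `upgrade_command` are hypothetical.

```bash
# Derive the Talos release from a nonfree-kmod-nvidia image reference.
talos_release_of() {
  local tag="${1##*:}"       # text after the last ':'
  printf '%s\n' "${tag#*-}"  # text after the first '-'
}

# Build the installer image reference for the same Talos release.
installer_for_extension() {
  printf 'ghcr.io/siderolabs/installer:%s\n' "$(talos_release_of "$1")"
}

# Print (rather than run) the matching upgrade command.
upgrade_command() {
  printf 'talosctl upgrade --image=%s\n' "$(installer_for_extension "$1")"
}
```

`upgrade_command <kmod-image>` prints the upgrade command whose installer version matches the extension, which is the pairing this guide relies on.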

Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
4 changes: 2 additions & 2 deletions website/content/v1.5/talos-guides/configuration/nvidia-gpu.md
@@ -9,12 +9,12 @@ aliases:
> The Talos published NVIDIA OSS drivers are bound to a specific Talos release.
> The extension versions also need to be updated when upgrading Talos.
The published versions of the NVIDIA system extensions can be found here:
The published versions of the NVIDIA OSS system extensions can be found here:

- [nvidia-open-gpu-kernel-modules](https://github.com/siderolabs/extensions/pkgs/container/nvidia-open-gpu-kernel-modules)
- [nvidia-container-toolkit](https://github.com/siderolabs/extensions/pkgs/container/nvidia-container-toolkit)

## Upgrading Talos and enabling the NVIDIA modules and the system extension
## Upgrading Talos and enabling the NVIDIA OSS modules and the system extension

> Make sure to use `talosctl` version {{< release >}} or later