From 0cdcb2e0e8131510aab654211d3622fb17f8375e Mon Sep 17 00:00:00 2001 From: Sascha Desch Date: Thu, 17 Aug 2023 23:11:29 +0200 Subject: [PATCH] docs: restructure docs for nvidia drivers for v1.4 Re-structure docs for proprietary NVIDIA docs for Talos v1.4. Signed-off-by: Noel Georgi --- .../configuration/nvidia-gpu-proprietary.md | 161 +++++++++--------- .../configuration/nvidia-gpu-proprietary.md | 24 ++- 2 files changed, 106 insertions(+), 79 deletions(-) diff --git a/website/content/v1.4/talos-guides/configuration/nvidia-gpu-proprietary.md b/website/content/v1.4/talos-guides/configuration/nvidia-gpu-proprietary.md index 867a8fd8aa..843afd6b84 100644 --- a/website/content/v1.4/talos-guides/configuration/nvidia-gpu-proprietary.md +++ b/website/content/v1.4/talos-guides/configuration/nvidia-gpu-proprietary.md @@ -7,13 +7,12 @@ aliases: > Enabling NVIDIA GPU support on Talos is bound by [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/). -These are the steps to enabling NVIDIA support in Talos. +The steps to enable NVIDIA support in Talos in v1.4.8 and later differ from +previous versions: -- Talos pre-installed on a node with NVIDIA GPU installed. -- Building a custom Talos installer image with NVIDIA modules -- Upgrading Talos with the custom installer and enabling NVIDIA modules and the system extension - -This requires that the user build and maintain their own Talos installer image. +* For versions prior to v1.4.8 the steps require that the user build and maintain their own Talos installer image. 
+After the [Prerequisites](#prerequisites), jump to [Building the installer image (Talos prior to v1.4.8)](#building-the-installer-image-talos-prior-to-v148), and after [Upgrading Talos and enabling the NVIDIA modules and the system extension (Talos prior to v1.4.8)](#upgrading-talos-and-enabling-the-nvidia-modules-and-the-system-extension-talos-prior-to-v148), continue from [Verifying the NVIDIA modules and the system extension](#verifying-the-nvidia-modules-and-the-system-extension). +* For v1.4.8 and later versions, building a custom Talos installer image is no longer required, and the new, preferred way to enable NVIDIA support is via an extension. ## Prerequisites @@ -36,10 +35,14 @@ export REGISTRY=ghcr.io > The examples below will use the sample variables set above. > Modify accordingly for your environment. -> If using Talos v1.4.8 or later follow the steps below and directly jump to [Verifying the NVIDIA modules and the system extension](#verifying-the-nvidia-modules-and-the-system-extension) ## Building the NVIDIA extensions +> Instead of building the extensions yourself, you can use the extensions +> published by SideroLabs in the `pkgs` repo +> [here](https://github.com/siderolabs/extensions/pkgs/container/nonfree-kmod-nvidia) and +> [here](https://github.com/siderolabs/extensions/pkgs/container/nvidia-container-toolkit) + Start by cloning the `release-1.5` branch [extensions](https://github.com/siderolabs/extensions) repository. 
```bash @@ -59,6 +62,8 @@ make nonfree-kmod-nvidia PKGS= PLATFORM=linux/amd6 > Replace the platform with `linux/arm64` if building for ARM64 > Make sure to use `talosctl` version {{< release >}} or later +## Upgrading Talos and enabling the NVIDIA modules and the system extension + First create a patch yaml `gpu-worker-patch.yaml` to update the machine config similar to below: ```yaml @@ -93,77 +98,6 @@ Now we can proceed to upgrading Talos to the same version to enable the system e talosctl upgrade --image=ghcr.io/siderolabs/installer:{{< release >}} ``` -## Building the installer image - -Start by cloning the [pkgs](https://github.com/siderolabs/pkgs) repository. - -Now run the following command to build and push custom Talos kernel image and the NVIDIA image with the NVIDIA kernel modules signed by the kernel built along with it. - -```bash -make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true -``` - -> Replace the platform with `linux/arm64` if building for ARM64 - -Now we need to create a custom Talos installer image. - -Start by creating a `Dockerfile` with the following content: - -```Dockerfile -FROM scratch as customization -COPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:{{< release >}}-nvidia /lib/modules /lib/modules - -FROM ghcr.io/siderolabs/installer:{{< release >}} -COPY --from=ghcr.io/talos-user/kernel:{{< release >}}-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz -``` - -Now build the image and push it to the registry. - -```bash -DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/talos-user/installer:{{< release >}}-nvidia . 
-docker push ghcr.io/talos-user/installer:{{< release >}}-nvidia -``` - -> Note: buildkit has a bug [#816](https://github.com/moby/buildkit/issues/816), to disable it use DOCKER_BUILDKIT=0 -> Replace the platform with `linux/arm64` if building for ARM64 - -## Upgrading Talos and enabling the NVIDIA modules and the system extension - -> Make sure to use `talosctl` version {{< release >}} or later - -First create a patch yaml `gpu-worker-patch.yaml` to update the machine config similar to below: - -```yaml -- op: add - path: /machine/install/extensions - value: - - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}} -- op: add - path: /machine/kernel - value: - modules: - - name: nvidia - - name: nvidia_uvm - - name: nvidia_drm - - name: nvidia_modeset -- op: add - path: /machine/sysctls - value: - net.core.bpf_jit_harden: 1 -``` - -Now apply the patch to all Talos nodes in the cluster having NVIDIA GPU's installed: - -```bash -talosctl patch mc --patch @gpu-worker-patch.yaml -``` - -Now we can proceed to upgrading Talos with the installer built previously: - -```bash -talosctl upgrade --image=ghcr.io/talos-user/installer:{{< release >}}-nvidia -``` - ## Verifying the NVIDIA modules and the system extension Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed. @@ -268,3 +202,74 @@ kubectl run \ --overrides '{"spec": {"runtimeClassName": "nvidia"}}' \ nvidia-smi ``` + +## Building the installer image (Talos prior to v1.4.8) + +Start by cloning the [pkgs](https://github.com/siderolabs/pkgs) repository. + +Now run the following command to build and push custom Talos kernel image and the NVIDIA image with the NVIDIA kernel modules signed by the kernel built along with it. 
+ +```bash +make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true +``` + +> Replace the platform with `linux/arm64` if building for ARM64 + +Now we need to create a custom Talos installer image. + +Start by creating a `Dockerfile` with the following content: + +```Dockerfile +FROM scratch as customization +COPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:{{< release >}}-nvidia /lib/modules /lib/modules + +FROM ghcr.io/siderolabs/installer:{{< release >}} +COPY --from=ghcr.io/talos-user/kernel:{{< release >}}-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz +``` + +Now build the image and push it to the registry. + +```bash +DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/talos-user/installer:{{< release >}}-nvidia . +docker push ghcr.io/talos-user/installer:{{< release >}}-nvidia +``` + +> Note: BuildKit has a bug [#816](https://github.com/moby/buildkit/issues/816); to disable BuildKit, use `DOCKER_BUILDKIT=0` +> Replace the platform with `linux/arm64` if building for ARM64 + +## Upgrading Talos and enabling the NVIDIA modules and the system extension (Talos prior to v1.4.8) + +> Make sure to use `talosctl` version {{< release >}} or later + +First, create a YAML patch file `gpu-worker-patch.yaml` to update the machine config, similar to the one below: + +```yaml +- op: add + path: /machine/install/extensions + value: + - image: ghcr.io/siderolabs/nvidia-container-toolkit:{{< nvidia_driver_release >}}-{{< nvidia_container_toolkit_release >}} +- op: add + path: /machine/kernel + value: + modules: + - name: nvidia + - name: nvidia_uvm + - name: nvidia_drm + - name: nvidia_modeset +- op: add + path: /machine/sysctls + value: + net.core.bpf_jit_harden: 1 +``` + +Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed: + +```bash +talosctl patch mc --patch @gpu-worker-patch.yaml +``` + +Now we can proceed to upgrading Talos with the installer built previously: + +```bash +talosctl upgrade 
--image=ghcr.io/talos-user/installer:{{< release >}}-nvidia +``` diff --git a/website/content/v1.5/talos-guides/configuration/nvidia-gpu-proprietary.md b/website/content/v1.5/talos-guides/configuration/nvidia-gpu-proprietary.md index bf5f6008f3..85df0e9a7f 100644 --- a/website/content/v1.5/talos-guides/configuration/nvidia-gpu-proprietary.md +++ b/website/content/v1.5/talos-guides/configuration/nvidia-gpu-proprietary.md @@ -14,7 +14,7 @@ The published versions of the NVIDIA system extensions can be found here: - [nonfree-kmod-nvidia](https://github.com/siderolabs/extensions/pkgs/container/nonfree-kmod-nvidia) - [nvidia-container-toolkit](https://github.com/siderolabs/extensions/pkgs/container/nvidia-container-toolkit) -> To build a NVIDIA driver version not published by SideroLabs follow the instructions [here]({{< relref "../../../v1.4/talos-guides/configuration/nvidia-gpu-proprietary" >}}) +> To build an NVIDIA driver version not published by SideroLabs, jump to [Building the NVIDIA extensions](#building-the-nvidia-extensions) and then use the resulting extensions in the steps below instead of the ones published by SideroLabs ## Upgrading Talos and enabling the NVIDIA modules and the system extension @@ -160,3 +160,25 @@ kubectl run \ --overrides '{"spec": {"runtimeClassName": "nvidia"}}' \ nvidia-smi ``` + +## Building the NVIDIA extensions + +If you want to build the NVIDIA extensions yourself instead of using the extensions +published by SideroLabs, start by cloning the `release-1.5` branch of the [extensions](https://github.com/siderolabs/extensions) repository. + +```bash +git clone --depth=1 --branch=release-1.5 https://github.com/siderolabs/extensions.git +``` + +Look up the version of [pkgs](https://github.com/siderolabs/pkgs) used for the particular Talos version at `https://github.com/siderolabs/talos/blob//pkg/machinery/gendata/data/pkgs`. + +Now run the following command to build and push the custom NVIDIA extension. 
+ +```bash +make nonfree-kmod-nvidia PKGS= PLATFORM=linux/amd64 PUSH=true +``` + +> Replace the platform with `linux/arm64` if building for ARM64. +> To change the NVIDIA driver version, modify the build argument in +> `nvidia-gpu/nonfree/kmod-nvidia/vars.yaml` accordingly. +> Make sure to use `talosctl` version {{< release >}} or later.
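
The version split that the v1.4 page introduces (custom installer image before v1.4.8, extension-based setup from v1.4.8 on) can be sketched as a small shell check. This is an illustration only and not part of the patch: `nvidia_install_path` is a hypothetical helper, not a Talos or `talosctl` command, and it assumes a `sort` that supports `-V` (as in GNU coreutils). Pre-release suffixes such as `-beta.0` are not handled.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Talos or talosctl): reports which of the
# two documented paths applies to a given Talos version.
# Assumes `sort -V` (version sort) is available, e.g. from GNU coreutils.
nvidia_install_path() {
  local version="${1#v}"   # strip a leading "v", e.g. v1.4.8 -> 1.4.8
  local threshold="1.4.8"  # first release documented as using the extension path
  local lowest
  # Version-sort both values; the smaller one ends up on the first line.
  lowest="$(printf '%s\n%s\n' "$version" "$threshold" | sort -V | head -n 1)"
  if [ "$lowest" = "$threshold" ]; then
    echo "extension"          # version >= 1.4.8
  else
    echo "custom-installer"   # version < 1.4.8
  fi
}
```

For example, `nvidia_install_path v1.4.7` prints `custom-installer`, while `nvidia_install_path v1.5.0` prints `extension`.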