Volume Management #8367

Open · 2 of 9 tasks · Tracked by #9249
smira opened this issue Feb 26, 2024 · 22 comments

@smira (Member) commented Feb 26, 2024

Closely related: #8016

Problem Statement

Talos Linux is not flexible in the way it manages volumes: it occupies the whole system disk, creating an EPHEMERAL partition that covers 99% of the disk space. User disk management is fragile and requires extra steps to work properly (mounting into the kubelet), doesn’t support wiping disks, etc. Talos also does not properly detect various partition types, which can lead to wiping user data (e.g. Ceph bluestore).

The following requests from users/customers can’t be addressed in the current design:

  • running Talos network booted (i.e. with STATE / EPHEMERAL on tmpfs)
  • running Talos installed e.g. to the SBC’s SD card, but mounting /var from an NVMe/SSD
  • on Azure VMs with directly attached NVMes, Talos will always be installed to the network volume, so the locally attached NVMe can’t be used for e.g. containerd state
  • splitting etcd data directory to a separate disk
  • performing wipe operations on the contents of /var:
    • wiping system directories (e.g. containerd state)
    • wiping user data (e.g. /var/mnt/foo)
  • disk encryption of the user disks (volumes)
  • read-only user disks mounts (e.g. mounting my precious photo archive to the Talos machine, making sure that Talos never touches the contents)
  • volume/disk management operations without reboot:
    • wiping user disks/volumes
    • wiping system areas
  • creating additional user-managed partitions on the system disk:
    • /data
    • swap space
  • container image cache storage
  • some log storage persistent across reboots (e.g. storing installation logs during staged upgrades)

The proposed design addresses the issues mentioned above.

Groundwork

Before we move on to volume management operations, some groundwork is needed to improve the block device management operations:

  • Talos should quickly and easily detect various filesystem/partition types, including the most common ones and those used in Kubernetes clusters. At a minimum, detecting the filesystem/partition type prevents a disk from being considered empty and eligible for allocation.
  • Block devices/partitions should be presented as resources in a way that allows rendering them as a tree, presenting the user with a view of available block devices, their types, partitions, filesystem types, available space, etc.
  • Talos should detect and reliably show information that matches standard Linux tools (e.g. blkid), to allow easier identification of storage objects.
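
For illustration, once this groundwork exists, these resources can be inspected with commands along the lines of the following (as documented in the Talos disk management guide linked later in this thread; output omitted):

    # List physical disks and the discovered filesystems/partitions as resources:
    talosctl get disks
    talosctl get discoveredvolumes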

Installation Process

Talos installation should do the bare minimum to make sure that Talos can be booted from the disk, without touching the pieces which are not strictly required to boot Talos. This might include installing Talos without having machine configuration.

So the install should only touch the following partitions/objects:

  • BOOT / EFI partitions (boot assets, boot loader itself)
  • META partition
  • bootloader-specific stuff, e.g. writing to the MBR

Any management of storage/volumes should be deferred to Talos running on the host (i.e. creating /var, /system/state, etc.).

Volumes

Let’s introduce a new concept of volumes, which solves the problems mentioned above and allows us to take storage management to the next level.

There are two kinds of volumes:

  • system volumes (i.e. required by Talos, and Talos provides default configuration for them if no other configuration is available)
  • user volumes (configured and managed by users, optional)

Every volume has several key properties:

  • lookup: Talos can find the volume by some selector, or say that the volume is not available
  • provisioning (optional): if the volume is not available, Talos can provision the volume (e.g. create a partition), so that it can be looked up, and it becomes available
  • parent: if the volume has a parent volume, it creates a dependency relationship between them
  • mount path: might create another dependency on the volume which provides the mount path
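
A minimal sketch of how these properties could combine in a volume definition (the field names here are hypothetical, for illustration only; this is not a final configuration schema):

    # Hypothetical volume definition; field names are illustrative only.
    name: EPHEMERAL
    lookup:
      partitionLabel: EPHEMERAL   # selector used to find an existing volume
    provisioning:                 # used only when the lookup finds nothing
      diskSelector: system_disk
      grow: true
    parent: ""                    # non-empty for e.g. subdirectory volumes
    mountPath: /var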

Volumes support a basic set of operations:

  • Mount/Unmount
  • Wipe (requires unmounting first)
  • Destroy (implies Wipe, but removes provisioned volumes)

Volume types:

  • disk (e.g. use the whole disk)
  • partition (allocate a partition on the disk)
  • subdirectory (a sub-path on other Volume)

Volume formats:

  • filesystem (or none)
  • encryption (or none)

Volume additional options:

  • criticality (the volume must be available for pods to be started)
  • mounted into the kubelet

System Volumes

As of today, Talos implicitly has the following volumes:

| Name | Lookup | Provisioning | Format |
| --- | --- | --- | --- |
| STATE | a partition with the label STATE | create a partition of size X MiB on the system disk | xfs, optionally encrypted |
| EPHEMERAL | a partition with the label EPHEMERAL | create a partition on the system disk occupying all remaining space | xfs, optionally encrypted |
| etcd data | — | subdirectory of EPHEMERAL, /var/lib/etcd | — |
| containerd data | — | subdirectory of EPHEMERAL, /var/lib/containerd | — |
| kubelet data | — | subdirectory of EPHEMERAL, /var/lib/kubelet | — |

Volume Lifecycle

Talos services can express their dependencies on volumes. For example, the kubelet service can only be started when the kubelet data volume is available. In the same way, if the kubelet data volume is going to be unmounted, the kubelet should be stopped first.

The boot process should naturally stop when a required volume is not available. E.g. maintenance mode in Talos implies that boot can’t proceed as long as the volume configuration is not available.

Volume Configuration

System volumes have an implicit configuration, which is applied as long as v1alpha1.Config is applied to the machine. Some properties are configurable in v1alpha1.Config, e.g. disk encryption. If an explicit volume configuration is provided, Talos uses that instead.

For example, if the user configures EPHEMERAL to be tmpfs of size 10 GiB, it will be created on each boot as instructed.
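
As a sketch of that example (again with a hypothetical schema, since the configuration format was not settled at this point):

    # Hypothetical sketch: EPHEMERAL as a 10 GiB tmpfs, re-created on each boot.
    name: EPHEMERAL
    provisioning:
      type: tmpfs      # illustrative field
      maxSize: 10GiB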

Users may provide configuration for user volumes (similar to the user disks feature today), which may be critical for pods to be started, while e.g. extension services may declare a dependency on additional volumes.

Some system volumes may be optional, i.e. only created when configured by the user - for example, the container image cache.

Upgrades and Wiping

Talos Linux upgrades should not wipe anything by default, and wiping should be an additional operation which can be done without an upgrade, or can optionally be combined with an upgrade.

The upgrade itself should only modify boot assets/the boot loader, i.e. ensure that the new version of Talos Linux can be booted from the disk device.

Wiping is volume-based, examples:

  • I want to wipe EPHEMERAL, which implies wiping all volumes which have EPHEMERAL as a parent (e.g. the subdirectory volume /var/lib/etcd); all services which depend on EPHEMERAL or its children should be stopped, but a reboot is not necessary, as EPHEMERAL will be re-provisioned after the wipe
  • I want to wipe the etcd data, which in the default configuration implies leaving the etcd cluster, stopping the etcd service, performing rm -rf /var/lib/etcd, and re-starting the etcd join process
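
For comparison, the closest mechanism existing at the time of writing is the reboot-based reset flow, e.g.:

    # Today's reboot-based equivalent of wiping EPHEMERAL; the proposal above
    # would allow a similar wipe without the reboot.
    talosctl reset --system-labels-to-wipe EPHEMERAL --reboot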

Notes

As pointed out by @utkuozdemir, EPHEMERAL might be a bad name, given that the partition is not supposed to be force-wiped by default.

@smira (Member, Author) commented Mar 12, 2024

This feature is going to be shifted to Talos 1.8.0 (only first bits might appear in Talos 1.7.0).

Talos 1.8.0 will be released as soon as this feature is ready.

@runningman84 commented

Some software like Longhorn might not respect limits and fill the whole disk. It would be great if a misbehaving pod couldn’t destroy etcd or other core parts of Talos just by claiming all the available disk space.

@andrewrynhard (Member) commented

> Some software like longhorn might not respect limits and fill the whole disk. It would be great if a misbehaving pod cannot destroy etcd or other core parts of talos by just claiming all the available disk space.

This is good to know. I always liked the separation of Talos and it using a dedicated disk to prevent unknowns/complications like this. Any ideas on how we could impose those limitations?

@runningman84 commented May 9, 2024

From my point of view, something like LVM and partitions for each part would help. I used a similar setup in k3s and never had issues like this.

LVM would also make the encryption part easy because you only have to encrypt one device…

@bplein commented May 9, 2024

Allow the choice of any block device.

A partition is also a block device. People could partition a single SSD with sufficient space for Talos and then an additional partition for general use. Filling up the general-use partition isn’t going to affect the Talos partition(s).

@PeterFalken commented

> Allow the choice of any block device.
>
> A partition is also a block device. People could partition a single SSD with sufficient space for Talos and then an additional partition for general use. Filling up the general use partition isn’t going to affect the Talos partition(s)

Similar to what the newer ESXi installer does when using the systemMediaSize option: this allows the installer to create the system & OS partitions at the beginning of the disk, while leaving free space at the end of the disk.

@runningman84 commented

I think at a minimum we would need two partitions or LVM volumes:
  • Talos (etcd and other stuff)
  • general purpose (container data and so on)

It would be great if we could have an option to say, okay, we also need 100 GB of Longhorn space and 50 GB of local-path space. Those are just examples; we would just need a volume size and mount path. All remaining space could be assigned to the general-purpose partition. Here the default setting should be to use all space.

With something like LVM we could also allow fixing the general volume to a specific size and leaving the remaining space unused. That would allow for expansion of other volumes, or ensure that all nodes are the same even if one has a bigger disk.

@cortopy commented May 18, 2024

> This feature is going to be shifted to Talos 1.8.0 (only first bits might appear in Talos 1.7.0).
>
> Talos 1.8.0 will be released as soon as this feature is ready.

Thank you for clarifying @smira! If I set up a cluster with 1.7 today, will there be a migration path in 1.8 to have Talos managing disks as proposed in this issue?

@smira (Member, Author) commented May 20, 2024

> Thank you for clarifying @smira! If I set up a cluster with 1.7 today, will there be a migration path in 1.8 to have talos managing disks as proposed in this issue?

Talos is always backwards compatible, so an upgrade to 1.8 will always work. You will be able to start using the volume management features, but some of them (e.g. shrinking /var) might require a wipe of some volumes.

@laibe commented May 28, 2024

OpenEBS had a component called ndm (node-disk-manager) that was quite handy for managing block devices. HostPath and OS disks could be excluded with filters, e.g.:

    filterconfigs:
      - key: os-disk-exclude-filter
        name: os disk exclude filter
        state: true
        exclude: "/,/etc/hosts,/boot,/var/mnt/openebs/nvme-hostpath-xfs"

This was used by the localpv-device StorageClass, letting you assign a whole block device to a pod. Unfortunately, they stopped supporting ndm and localpv-device with the release of OpenEBS 4.0.

It would be great if talos had a similar feature!

smira added a commit to smira/talos that referenced this issue Jun 12, 2024
smira added a commit to smira/talos that referenced this issue Jul 8, 2024
@chr0n1x commented Jul 8, 2024

this is incredibly exciting, happy to give it a whirl once you get an RC/beta or something, @smira. thank you!

smira added a commit to smira/talos that referenced this issue Jul 9, 2024
This is early WIP.

See siderolabs#8367

Signed-off-by: Andrey Smirnov <[email protected]>
@PrivatePuffin commented

@smira TopoLVM requires a pvcreate and vgcreate to be able to allocate the remaining free disk space.

What I get from this issue is that we would be able to at least allocate the system disk with free space remaining, which is already 99% of the way there!

Does it also allow us to use LVM, running pvcreate and vgcreate to consume the rest of the disk space?
Note: not an LVM specialist at all.
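
(For context, the TopoLVM prerequisite mentioned here is plain LVM on spare space; a minimal sketch, where /dev/nvme0n1p5 stands in for a hypothetical unallocated partition on the system disk:)

    # Standard LVM setup on leftover space; the partition name is hypothetical.
    pvcreate /dev/nvme0n1p5
    vgcreate topolvm-vg /dev/nvme0n1p5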

smira added a commit to smira/talos that referenced this issue Aug 29, 2024
This implements the first round of changes, replacing the volume backend
with the new implementation, while keeping most of the external
interfaces intact.

See siderolabs#8367

Signed-off-by: Andrey Smirnov <[email protected]>
@smira (Member, Author) commented Aug 29, 2024

Progress Update (1.8.0-alpha.2)

As soon as #8901 is merged, we will release the last alpha version.

At this point, the internals of volume management have been completely reimplemented from scratch, while new configuration options to use the new features are not yet available.

Before 1.8.0-beta.0, we plan to add minimal configuration for EPHEMERAL: creating it on a different disk, and controlling its maximum/minimum size and growth.

All other changes have been shifted to 1.9.0.

@schneid-l (Contributor) commented

Hi @smira, huge congratulations on your incredible work on this PR 🎉

Having a quick look at the content of the PR + the documentation added to it, I couldn't find an answer to the following question:

With Talos v1.8 will we be able to do a full-ram PXE boot without having to install the OS on a disk? Or is this one of the features shifting to v1.9?

Thanks a lot 🙏

@smira (Member, Author) commented Aug 29, 2024

> With Talos v1.8 will we be able to do a full-ram PXE boot without having to install the OS on a disk? Or is this one of the features shifting to v1.9?

not in 1.8 (probably 1.9); two pieces are missing:

  • the ability to skip the install itself (will come separately via the Install API)
  • tmpfs volumes for the required EPHEMERAL/STATE (will come as part of this workflow)

@qjoly commented Aug 30, 2024

Well done on the work for 1.8.0-alpha.2, can't wait to see the results in practice 👏 🎉 (your newly added get discoveredvolumes will help us a lot)

If I understand correctly, the changes made mainly concern the new concept of volumes (which will be used for the new features in 1.9). This means that Machine Configurations are not yet impacted, am I right?

Thank you in advance 🙏

@smira (Member, Author) commented Aug 30, 2024

> If I understand correctly, the changes made mainly concern the new concept of volumes (which will be used for the new features in 1.9). This means that Machine Configurations are not yet impacted, am I right?

yes, for 1.8.0-alpha.2, there are no machine configuration changes, and everything "should just work as before".

Before 1.8.0-beta.0, there will be a new machine configuration document to configure some aspects of the EPHEMERAL volume (this is the most requested feature).

Everything else comes in 1.9.

smira added the EPIC label Sep 2, 2024
@pitabwire commented

@smira with the release of 1.8, can we have a user-defined /data directory on the primary system disk, or will this be done in the 1.9 release? I was looking at the docs and can't see how to create additional user-managed partitions on the system disk.

@smira (Member, Author) commented Sep 11, 2024

> @smira with the release of 1.8 can we have a user defined /data directory on the primary system disk or this will be done in the 1.9 release? I was looking at the docs and can't see how to create additional user-managed partitions on the system disk

With the release of 1.8 (or technically right now with v1.8.0-beta.0), you can shrink EPHEMERAL to a fixed size, located on the system disk or not.

There is no support for adding extra volumes besides the legacy machine.disks.

  1. You can add /data yourself (and Talos won't touch it).
  2. You might not have an immediate need for /data anymore, as Talos never wipes on upgrades.

More user volume support is planned for 1.9.
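
A sketch of the EPHEMERAL configuration described above, along the lines of the v1.8 disk management guide (see the link in the next comment for the authoritative schema; values are illustrative):

    # Pin EPHEMERAL to a bounded size on a selected disk.
    apiVersion: v1alpha1
    kind: VolumeConfig
    name: EPHEMERAL
    provisioning:
      diskSelector:
        match: disk.transport == "nvme"
      minSize: 2GB
      maxSize: 40GB
      grow: false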

@smira (Member, Author) commented Sep 17, 2024

Current state of things for volume/disk management: https://www.talos.dev/v1.8/talos-guides/configuration/disk-management/

@isometry commented

Given the new implementation and support for configuring the EPHEMERAL volume, is it now possible to split a disk between machine.disks and VolumeConfig?
If I configure machine.disks[0].partitions[0].size somewhat smaller than the associated NVMe drive, and then configure a VolumeConfig with diskSelector targeting the same drive, will it automagically add and use a second partition for the EPHEMERAL volume? (Use case: getting the most out of internal NVMe in a homelab Turing Pi 2 cluster with Longhorn…)

@smira (Member, Author) commented Sep 18, 2024

> Given the new implementation and support for configuring the EPHEMERAL volume, is it now possible to split a disk between machine.disks and VolumeConfig? If I configure machine.disks[0].partitions[0].size somewhat smaller than the associated NVMe drive, and then configure a VolumeConfig with diskSelector targeting the same drive, will it automagically add and use a second partition for the EPHEMERAL volume? (Use case: getting the most out of internal NVMe in a homelab Turing Pi 2 cluster with Longhorn…)

yes, it should work, I believe
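
For illustration, the combination being asked about might look like the sketch below (device path, sizes, and the selector expression are illustrative; the disk management guide linked above has the authoritative schema):

    # Legacy machine.disks partition plus a size-capped EPHEMERAL volume,
    # both targeting the same (illustrative) NVMe drive.
    version: v1alpha1
    machine:
      disks:
        - device: /dev/nvme0n1
          partitions:
            - mountpoint: /var/mnt/longhorn
              size: 500GB
    ---
    apiVersion: v1alpha1
    kind: VolumeConfig
    name: EPHEMERAL
    provisioning:
      diskSelector:
        match: disk.transport == "nvme"
      maxSize: 100GB
      grow: false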
