design-proposal: VirtualMachineInstanceMigration - Live migration to specific node

Adding a design proposal to extend the VirtualMachineInstanceMigration object with an additional API to let a cluster admin try to trigger a live migration of a VM to a specific host for maintenance reasons, overriding what is defined on the VM itself in terms of nodeSelector, affinity and anti-affinity rules, and without the need to temporarily alter the definition of the VM.

Signed-off-by: Simone Tiraboschi <[email protected]>
# Overview
We are getting asks from multiple cluster admins who would like to explicitly specify the destination node of a VM when doing a live migration.
While this may be less important in a cloud-native environment,
we get this ask from many users coming from other virtualization solutions, where this is a common practice.
The same result can already be achieved today with a few steps; this proposal is only about simplifying it with a single, direct API on the `VirtualMachineInstanceMigration` object, without the need to alter the VM spec.

## Motivation
In the ideal cloud-native design, the scheduler is supposed to always be able to correctly identify
the best node to run a pod (and so the target pod for the VMI after the live migration) on.
In the real world, we still see specific use cases where the flexibility to explicitly and directly define the target node for a live migration is a relevant nice-to-have:
- Experienced admins are used to controlling where their critical workloads are moved to
- Workload balancing solutions don't always work as expected
- Troubleshooting a node
- Validating a new node by migrating a specific VM to it

Such a capability is expected from traditional virtualization solutions but, with certain limitations, it is still pretty common across the most popular cloud providers (at least when using dedicated and not shared nodes).
- For instance, on Amazon EC2 the user can already live-migrate a `Dedicated Instance` from one `Dedicated Host` to another, explicitly choosing it from the EC2 console, see: https://repost.aws/knowledge-center/migrate-dedicated-different-host
- Also on Google Cloud Platform Compute Engine the user can easily and directly live-migrate a VM from a `sole-tenancy` node to another one via CLI or REST API, see: https://cloud.google.com/compute/docs/nodes/manually-live-migrate#gcloud

On the technical side, something like this can already be achieved indirectly by playing with node labels and affinity, but nodeSelector and affinity are defined as VM properties that persist, while here we are focusing only on making explicit the desired target of a single, point-in-time migration attempt.
The motivation is to better define the boundary between what is an absolute and long-lasting property of a VM (like affinity) and what is just an optional property of a single migration attempt.
This is also relevant in terms of personas: we could have a VM owner/developer who specifies long-lasting affinity for a VM that is part of an application composed of different VMs and pods, and a cluster admin/operator who needs to temporarily override that for maintenance reasons.
On the other hand, the VM owner is not required, or supposed, to be aware of node names.

## Goals
- A user allowed to trigger a live migration of a VM and to list the nodes in the cluster is able to rely on a simple and direct API to try to live-migrate a VM to a specific node.
- The explicit migration target overrules nodeSelector, affinity and anti-affinity rules defined by the VM owner.
- The live migration can then successfully complete or fail for various reasons, exactly as it can succeed or fail today for other reasons.
- The target node that is explicitly required for the actual live migration attempt should not influence future live migrations or the placement in case the VM is restarted. For long-lasting placement, nodeSelector or affinity/anti-affinity rules are the only way to go.

## Non Goals
- This proposal does not define a custom scheduler plugin, nor does it suggest altering how the default k8s scheduler works with `nodeName`, `nodeSelector` and affinity/anti-affinity rules. See https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/ for the relevant documentation.

## Definition of Users
- VM owner: the user who owns a VM in their namespace on a Kubernetes cluster with KubeVirt
- Cluster admin: the administrator of the cluster

## User Stories
- As a cluster admin, I want to be able to try to live-migrate a VM to a specific node for maintenance reasons, overriding, if needed, what the VM owner set
- As a VM owner, I don't want to see my VM object getting amended by another user just for maintenance reasons

## Repos
- https://github.com/kubevirt/kubevirt

# Design
We are going to add a new optional `nodeName` string field to the `VirtualMachineInstanceMigration` object.
If the `nodeName` field is not empty, the migration controller will explicitly set `nodeName` on the virt-launcher pod that is going to be used as the target endpoint for the live migration.
When `nodeName` is set on a pod, the k8s scheduler ignores that pod and the kubelet on the named node directly tries to place it there.
Using `nodeName` on the `VirtualMachineInstanceMigration` object will therefore overrule nodeSelector, affinity and anti-affinity rules defined on the VM.
If the target pod fails to start, the `VirtualMachineInstanceMigration` object will be marked as failed, as can already happen today for other reasons.
We are not going to alter the `spec` stanza of the VM or VMI objects in any way, so future migrations or the node placement after a restart of the VM are not going to be affected by a `nodeName` set on a specific `VirtualMachineInstanceMigration` object.

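For illustration only, here is a minimal, heavily trimmed sketch of the target virt-launcher pod that the migration controller would create when `nodeName` is set on the migration; the pod name, labels and image tag below are placeholders rather than actual KubeVirt-generated values:

```yaml
# Illustrative sketch only, not the exact pod generated by the migration controller;
# names, labels and the image tag are placeholders.
apiVersion: v1
kind: Pod
metadata:
  generateName: virt-launcher-vmi-fedora-
  labels:
    kubevirt.io: virt-launcher
spec:
  # Copied from VirtualMachineInstanceMigration.spec.nodeName: the kubelet on this node
  # starts the pod directly, bypassing the scheduler and therefore any nodeSelector or
  # (anti-)affinity rules defined on the VM.
  nodeName: my-new-target-node
  containers:
    - name: compute
      image: quay.io/kubevirt/virt-launcher:latest
```

Since `spec.nodeName` is resolved by the kubelet rather than by the scheduler, the scheduling constraints defined on the VM simply never come into play for this target pod.
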
## Alternative design
One of the main reasons behind this proposal is to improve the UX, making it simpler and better defining the boundary between what is a long-term placement requirement and what should simply be tried for a specific migration attempt.
According to
https://kubevirt.io/user-guide/compute/node_assignment/#live-update
changes to a VM's node selector or affinities for a VM with the LiveUpdate rollout strategy are now dynamically propagated to the VMI.

This means that, only for VMs with the LiveUpdate rollout strategy, we can already force the target of a live migration with a flow like the following (a minimal sketch of the temporary selector follows the list):
- set a (temporary?) nodeSelector/affinity on the VM
- wait for it to be propagated to the VMI thanks to the LiveUpdate rollout strategy
- trigger a live migration with the existing APIs (no need for any code change)
- wait for the migration to complete
- (finally) remove the (temporary?) nodeSelector to let the VM migrate freely to any node in the future

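As a rough sketch of the first step in that flow, and assuming the cluster is configured with the LiveUpdate VM rollout strategy, the temporary placement constraint could look like this (the VM name and node name are placeholders, and other required VM fields are omitted):

```yaml
# Trimmed sketch of the temporary placement constraint from the first step above.
# Assumes the LiveUpdate rollout strategy so the change propagates to the running VMI;
# the node name is a placeholder and the selector has to be removed again afterwards.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-fedora
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: my-new-target-node
```
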
Such a flow can already be implemented today with a pipeline, or directly from a client like `virtctl`, without any backend change.
The drawback of that strategy is that we would have to tolerate having the spec of the VM amended twice, with an unclear boundary between what was asked by the VM owner for long-lasting, application-specific reasons and what is required by a maintenance operator just for a specific migration attempt.

## API Examples
```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-job
spec:
  vmiName: vmi-fedora
  nodeName: my-new-target-node
```
## Scalability
Forcing a `nodeName` on a `VirtualMachineInstanceMigration` will cause it to be propagated to the destination virt-launcher pod. Having a `nodeName` on a pod bypasses the k8s scheduler, and this could potentially lead to an unbalanced placement across the nodes.
However, the same result can already be achieved today by specifying a `nodeSelector` or affinity and anti-affinity rules on a VM.

## Update/Rollback Compatibility
`nodeName` on `VirtualMachineInstanceMigration` will only be an optional field, so there is no impact in terms of update compatibility.

## Functional Testing Approach
- positive test 1: a VirtualMachineInstanceMigration with an explicit nodeName pointing to a node able to accommodate the VM should succeed
- positive test 2: a VirtualMachineInstanceMigration with an explicit nodeName pointing to a node able to accommodate the VM but not matching a nodeSelector already present on the VM should succeed
- negative test 1: a VirtualMachineInstanceMigration with an explicit nodeName should be refused if the requested node doesn't exist
- negative test 2: a VirtualMachineInstanceMigration with an explicit nodeName should be refused if the VM is already running on the requested node
- negative test 3: a VirtualMachineInstanceMigration with an explicit nodeName should be refused if the user is not allowed to list nodes in the cluster
- negative test 4: a VirtualMachineInstanceMigration with an explicit nodeName should fail if the selected target node is not able to accommodate the additional virt-launcher pod

# Implementation Phases
A really close attempt was already tried in the past with https://github.com/kubevirt/kubevirt/pull/10712 but the PR got some pushback.
A similar PR should be reopened and refined, and we should implement functional tests.