design-proposal: VirtualMachineInstanceMigration - Live migration to a named node #320
Follow-up and derived from: kubevirt#10712 Implements: kubevirt/community#320 TODO: add functional tests Signed-off-by: zhonglin6666 <[email protected]> Signed-off-by: Simone Tiraboschi <[email protected]>
Lovely to see this clear design proposal (even if I don't like anything that assumes a specific node is long-living). I have two questions, though.
design-proposals/migration-target.md
## Goals
- A user allowed to trigger a live-migration of a VM and list the nodes in the cluster is able to rely on a simple and direct API to try to live migrate a VM to a specific node.
- The explicit migration target overrules a nodeSelector or affinity and anti-affinity rules defined by the VM owner.
I find this odd, as the VM and the application in it may not function well (or at all) if affinity is ignored. Can you share more about the origins of this goal? I'd expect the target node to be ANDed with existing anti/affinity rules.
I tend to think that, for a cluster admin who is trying to force a VM to migrate to a named node, this is the natural and expected behaviour:
if I explicitly select a named node, I'm expecting that my VM will eventually be migrated there and nowhere else (such as on a different node selected by the scheduler according to a weighted combination of affinity criteria, resource availability and so on); I can then tolerate that the live migration fails because I chose a wrong node, but the controller should only try to live-migrate it according to what I'm explicitly asking for.
And by the way this is absolutely consistent with the native k8s behaviour for pods.
`spec.nodeName` for pods is under spec for historical reasons but it's basically controlled by the scheduler:
when a pod is going to be executed, the scheduler is going to check it and, according to available cluster resources, nodeSelectors, weighted affinity and anti-affinity rules and so on, it's going to select a node and write it on `spec.nodeName` on the pod object. At this point the kubelet on the named node will try to execute the Pod on that node.
If the user explicitly sets `spec.nodeName` on a pod (or in the template in a deployment and so on), the scheduler is not going to be involved in the process since the pod is basically already scheduled for that node and nothing else, and so the kubelet on that node will directly try to execute it there, possibly failing.
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename explicitly states:
> If the nodeName field is not empty, the scheduler ignores the Pod and the kubelet on the named node tries to place the Pod on that node.
> Using nodeName overrules using nodeSelector or affinity and anti-affinity rules.

And this in my opinion is exactly how we should treat a live migration attempt to a named node.
Let's take the following example (this is a real world use-case):
- An admin is adding a new node to a cluster to take it into prod. This node has a taint to prevent workloads from immediately landing there.
- The admin wants to migrate a VM to this node to validate that it is working properly.
If we `AND` a new selector for this node, then the migration will not take place, because there is the taint. We'd also need to add a toleration to get the VM scheduled to that node.
With `spec.nodeName` it would be no issue initially; it could become one if `Require*atRuntime` effects are used.
However, with `spec.nodeName` all other validations (CPU caps, extended, storage, and local resources etc.) will be ignored. We would be asking for a VM that does not start.
Worse: it would be really hard now to understand WHY the VM is not launching.
Thus I think we have to `AND` to the node selector, but need code to understand taints specifically (because taints keep workloads away).
Then we still need to think about a generic mechanism to deal with the reasons why a pod cannot be placed on the selected node.
I do not like taking examples from the historically-understandable `Pod.spec.nodeName`. Node identity is not something that should have typically been exposed to workload owners.
Can you summarize your reasoning into the proposal? I think I understand it now, but I am not at ease with it. For example, a cluster admin may easily violate anti/affinity rules that are important for app availability.
@fabiand with taints it is a bit more complex: the valid effects for a taint are `NoExecute`, `NoSchedule` and `PreferNoSchedule`.
Bypassing the scheduler by directly setting `spec.nodeName` will allow us to bypass taints with the `NoSchedule` and `PreferNoSchedule` effects but, AFAIK, it will still be blocked by a `NoExecute` taint, which is also enforced by the kubelet with eviction.
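A minimal Go sketch of the distinction drawn in this comment, using local stand-in types rather than the real `k8s.io/api/core/v1` ones. `blockingTaints` is a hypothetical helper and toleration matching is reduced to a key lookup; it only models the claim that a pod pinned through `spec.nodeName` skips the scheduler-enforced effects while an untolerated `NoExecute` taint still keeps it off the node.

```go
package main

import "fmt"

// TaintEffect mirrors the three effect names accepted by Kubernetes taints.
type TaintEffect string

const (
	NoSchedule       TaintEffect = "NoSchedule"
	PreferNoSchedule TaintEffect = "PreferNoSchedule"
	NoExecute        TaintEffect = "NoExecute"
)

// Taint is a simplified stand-in for a node taint (key and effect only).
type Taint struct {
	Key    string
	Effect TaintEffect
}

// blockingTaints is a hypothetical helper: it returns the taints that would
// still keep a pod off the node. tolerated holds the keys of the taints the
// pod tolerates (simplified matching); pinned models spec.nodeName being set.
func blockingTaints(taints []Taint, tolerated map[string]bool, pinned bool) []Taint {
	var blocking []Taint
	for _, t := range taints {
		if tolerated[t.Key] {
			continue
		}
		switch t.Effect {
		case PreferNoSchedule:
			// Soft preference: it never blocks placement on its own.
		case NoSchedule:
			// Honoured by the scheduler, so a pinned pod skips it.
			if !pinned {
				blocking = append(blocking, t)
			}
		case NoExecute:
			// Untolerated pods are kept off the node even when pinned.
			blocking = append(blocking, t)
		}
	}
	return blocking
}

func main() {
	// Hypothetical taint keys for a freshly added node.
	taints := []Taint{
		{Key: "example.com/new-node", Effect: NoSchedule},
		{Key: "example.com/maintenance", Effect: NoExecute},
	}
	fmt.Println("via scheduler:", blockingTaints(taints, nil, false))
	fmt.Println("pinned with nodeName:", blockingTaints(taints, nil, true))
}
```

In the new-node example from earlier in the thread, the `NoSchedule` taint stops mattering once the pod is pinned, but a `NoExecute` taint would still require a toleration.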
@dankenigsberg yes, this is a critical aspect of this design proposal so we should carefully explore and weigh the different alternatives, tracking them down in the design proposal itself as a future reference.
In my opinion the choice strictly depends on the use case and the power we want to offer to the cluster admin when creating a live migration request to a named node.
Directly setting `spec.nodeName` on the target pod will completely bypass all the scheduling hints (`spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution`) and constraints (`spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution`), meaning that the target pod will be started on the named node regardless of how the VM is actually configured.
Another option is trying to append/merge (this sub-topic deserves by itself another discussion) something like
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchFields:
              - key: metadata.name
                operator: In
                values:
                - <nodeName>
to the affinity rules already defined on the VM.
My concern with this choice is that the affinity/anti-affinity grammar is pretty complex so, if the VM owner already defined some affinity/anti-affinity rules, we can easily end up with a set of conflicting rules such that the target pod cannot be scheduled on the named node nor on any other node.
If the use case that we want to address is giving the cluster admin the right to try migrating a generic VM to a named node (for instance for maintenance/emergency reasons), this approach does not fully address it, with many possible cases where the only viable option is still manually overriding the affinity/anti-affinity rules set by the VM owner.
I still tend to think that always bypassing the scheduler with `spec.nodeName` is the K.I.S.S. approach here, if trying to force a live migration to a named node is exactly what the cluster admin is trying to do.
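One subtlety of the append/merge option can be sketched in Go. In Kubernetes node affinity, the entries of `nodeSelectorTerms` are ORed while the requirements inside a single term are ANDed, so the `metadata.name` requirement has to be added to every existing term; appending it as a separate term would widen the match instead of narrowing it. The types below are simplified local stand-ins for the `k8s.io/api/core/v1` ones, and `pinToNode` is a hypothetical helper, not code from the proposal or from KubeVirt.

```go
package main

import "fmt"

// NodeSelectorRequirement and NodeSelectorTerm are simplified stand-ins
// for the Kubernetes API types with the same names.
type NodeSelectorRequirement struct {
	Key      string
	Operator string
	Values   []string
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement
	MatchFields      []NodeSelectorRequirement
}

// pinToNode (hypothetical) returns a copy of terms in which every term also
// requires metadata.name to equal nodeName. With no existing terms it
// returns a single term carrying only that requirement.
func pinToNode(terms []NodeSelectorTerm, nodeName string) []NodeSelectorTerm {
	pin := NodeSelectorRequirement{
		Key:      "metadata.name",
		Operator: "In",
		Values:   []string{nodeName},
	}
	if len(terms) == 0 {
		return []NodeSelectorTerm{{MatchFields: []NodeSelectorRequirement{pin}}}
	}
	merged := make([]NodeSelectorTerm, 0, len(terms))
	for _, t := range terms {
		// Copy the field requirements so the caller's slice is not modified.
		fields := append([]NodeSelectorRequirement{}, t.MatchFields...)
		merged = append(merged, NodeSelectorTerm{
			MatchExpressions: t.MatchExpressions,
			MatchFields:      append(fields, pin),
		})
	}
	return merged
}

func main() {
	// Hypothetical owner rule: zone a OR zone b.
	owner := []NodeSelectorTerm{
		{MatchExpressions: []NodeSelectorRequirement{{Key: "zone", Operator: "In", Values: []string{"a"}}}},
		{MatchExpressions: []NodeSelectorRequirement{{Key: "zone", Operator: "In", Values: []string{"b"}}}},
	}
	for i, t := range pinToNode(owner, "my-new-target-node") {
		fmt.Printf("term %d: expressions=%v fields=%v\n", i, t.MatchExpressions, t.MatchFields)
	}
}
```

This only covers required node affinity; pod affinity/anti-affinity and `nodeSelector` set by the VM owner stay in force, which is where the conflicts mentioned above can come from.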
I summarized these considerations into the proposal itself, let's continue from there.
design-proposals/migration-target.md
# Implementation Phases
A really close attempt was already tried in the past with https://github.com/kubevirt/kubevirt/pull/10712 but the PR got some pushbacks.
A similar PR should be reopened, refined and we should implement functional tests.
Would you outline the nature of the pushback? Do we currently have good answers to the issues raised back then?
Trying to summarize (@EdDev please keep me honest on this), it was somehow considered a semi-imperative approach and it was pointed out that a similar behavior could already be achieved indirectly by modifying the affinity rules on the VM object on the fly and then reverting them.
see: kubevirt/kubevirt#10712 (comment)
and: kubevirt/kubevirt#10712 (comment)
How imperative this really is, is questionable: in the end we already have a `VirtualMachineInstanceMigration` object that you can use to declare that you want to trigger a live migration; this is only about letting you also declare that you want a live migration to a named host.
The alternative approach, based on amending the affinity rules on the VM object and waiting for the LiveUpdate rollout strategy to propagate them to the VMI before trying a live migration, is described, pointing out its main drawback, in the `Alternative design` section of this proposal.
Can you inline this succinctly? E.g., `that PR got some pushbacks because it was not clear why a new API for one-off migration is needed. We give here a better explanation why this one-off migration destination request is necessary`.
done
- The "one-time" operation convinced me.
- The reasoning for the real need is harder for me, but I gave feedback on this proposal about what convinces me.
design-proposals/migration-target.md
- Cluster-admin: the administrator of the cluster
## User Stories
- As a cluster admin I want to be able to try to live-migrate a VM to specific node for maintenance reasons eventually overriding what the VM owner set
I'd like to see more fleshed out user stories. It's unclear to me based on these user stories why the existing methods wouldn't suffice.
> As a cluster admin I want to be able to try to live-migrate a VM to specific node for maintenance reasons eventually overriding what the VM owner set
For example, why wouldn't the cluster admin taint the source node and live migrate the vms away using the existing methods? Why would the admin need direct control over the exact node the VM goes to? I'd like to see a solid answer for why this is necessary over existing methods.
That's where this discussion usually falls apart and why it hasn't seen progress through the years. I'm not opposed to this feature, but I do think we need to articulate clearly why the feature is necessary
I expanded this section
design-proposals/migration-target.md
## User Stories
- As a cluster admin I want to be able to try to live-migrate a VM to specific node for various possible reasons such as:
  - I just added to the cluster a new powerful node and I want to migrate a selected VM there without trying more than once according to scheduler decisions
  - I'm not using any automatic workload rebalancing mechanism and I periodically want to manually rebalance my cluster according to my observations
  - Foreseeing a peak in application load (e.g. new product announcement), I'd like to balance in advance my cluster according to my expectation and not to current observations
  - During a planned maintenance window, I'm planning to drain more than one node in a sequence, so I want to be sure that the VM is going to land on a node that is not going to be drained in a near future (needing then a second migration) and being not interested in cordoning it also for other pods
  - I just added a new node and I want to validate it trying to live migrate a specific VM there
nice! these are good reasons that hadn't been explored during previous discussions, thanks
design-proposals/migration-target.md
When a pod is going to be executed, the scheduler is going to check it and, according to available cluster resources, nodeselectors, weighted affinity and anti-affinity rules and so on,
the scheduler is going to select a node and write its name on `spec.nodeName` on the pod object. At this point the kubelet on the named node will try to execute the Pod on that node.

If `spec.nodeName` is already set on a pod object as in this approach, the scheduler is not going to be involved in the process since the pod is basically already scheduled for that node and only for the named node and so the kubelet on that node will directly try to execute it there eventually failing.
I think using `pod.spec.nodeName` is likely the most straightforward approach. This does introduce some new failure modes that might not be obvious to admins.
For example, today if a target pod is unschedulable due to lack of resources, the migration object will time out due to the pod being stuck in "pending". This information is fed back to the admin as a k8s event associated with the migration object.
However, by setting `pod.spec.nodeName` directly, we'd be bypassing the checks that ensure the required resources are available on the node (like the node having the "kvm" device available, for instance), and the pod would likely get scheduled and immediately fail. I don't think we are currently bubbling up these types of errors to the migration object, so this could leave admins wondering why their migration failed.
I guess what I'm trying to get at here is: I like this approach, but let's make sure the new failure modes get reported back on the migration object so the admin has some sort of clue as to why a migration has failed.
@davidvossel We already report the failure reason on the VMIM. This is part of the VMIM status.
`pod.spec.nodeName` entirely bypasses the scheduler, making AAQ unusable as it relies on "pod scheduling readiness".
From my pov, bypassing the scheduler is a no go.
> From my pov, bypassing the scheduler is a no go.

Luckily we also have another option, as described in:
### B. appending/merging an additional nodeAffinity rule on the target virt-launcher pod (merging it with VM owner set affinity/anti-affinity rules)
This will add an additional constraint for the scheduler, summing it up with the existing constraints/hints.
In case of mismatching/opposing rules, the destination pod will not be scheduled and the migration will fail.
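The "mismatching rules" case can be illustrated with a small Go evaluator for required node affinity. It is a sketch under stated assumptions: the types are local stand-ins for the Kubernetes ones, only the `In` operator and the `metadata.name` field are handled, and the node names, the `zone` label and the `pinned` helper are hypothetical. Terms are ORed and requirements within a term are ANDed, following Kubernetes node affinity semantics.

```go
package main

import "fmt"

// Requirement, Term and Node are simplified stand-ins for Kubernetes types.
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

type Term struct {
	MatchExpressions []Requirement // evaluated against node labels
	MatchFields      []Requirement // evaluated against metadata.name only
}

type Node struct {
	Name   string
	Labels map[string]string
}

// inValues implements only the "In" operator; anything else never matches.
func inValues(r Requirement, value string, present bool) bool {
	if r.Operator != "In" || !present {
		return false
	}
	for _, v := range r.Values {
		if v == value {
			return true
		}
	}
	return false
}

// termMatches ANDs all requirements of a term; an empty term matches nothing.
func termMatches(t Term, n Node) bool {
	if len(t.MatchExpressions) == 0 && len(t.MatchFields) == 0 {
		return false
	}
	for _, e := range t.MatchExpressions {
		v, ok := n.Labels[e.Key]
		if !inValues(e, v, ok) {
			return false
		}
	}
	for _, f := range t.MatchFields {
		if f.Key != "metadata.name" || !inValues(f, n.Name, true) {
			return false
		}
	}
	return true
}

// candidateNodes ORs the terms and returns the names of the matching nodes.
func candidateNodes(terms []Term, nodes []Node) []string {
	var names []string
	for _, n := range nodes {
		for _, t := range terms {
			if termMatches(t, n) {
				names = append(names, n.Name)
				break
			}
		}
	}
	return names
}

// pinned (hypothetical) merges one owner rule with a requested target node.
func pinned(owner Requirement, nodeName string) []Term {
	pin := Requirement{Key: "metadata.name", Operator: "In", Values: []string{nodeName}}
	return []Term{{MatchExpressions: []Requirement{owner}, MatchFields: []Requirement{pin}}}
}

func main() {
	nodes := []Node{
		{Name: "node-1", Labels: map[string]string{"zone": "a"}},
		{Name: "node-2", Labels: map[string]string{"zone": "b"}},
	}
	owner := Requirement{Key: "zone", Operator: "In", Values: []string{"a"}}
	fmt.Println(candidateNodes(pinned(owner, "node-1"), nodes)) // prints [node-1]
	fmt.Println(candidateNodes(pinned(owner, "node-2"), nodes)) // prints []
}
```

With the owner rule restricting the VM to zone `a`, a request targeting `node-2` leaves no candidate node, so the target pod would stay pending and the migration would fail, as described above.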
@vladikr @davidvossel +1.
`spec.nodeName` is a horrible field that is not being removed from Kubernetes only due to backward compatibility and causes a lot of trouble. I agree that it should be considered as a no-go.
I understand the intention behind introducing the nodeName field, but I fail to see how something like this may work at scale. It seems to me that most, if not all, of the user stories listed in the proposal can already be achieved through existing methods. Adding this field could potentially cause confusion for admins and lead to unnecessary friction with the Kubernetes scheduler and descheduler flows. I'd prefer to see solutions to the user stories aligned closely with established patterns (descheduler policies or scheduler plugins).
## User Stories
- As a cluster admin I want to be able to try to live-migrate a VM to specific node for various possible reasons such as:
  - I just added to the cluster a new powerful node and I want to migrate a selected VM there without trying more than once according to scheduler decisions
I wonder what would be so special about these VMs that cannot be handled by a descheduler?
Also, how would the admin know that the said descheduler did not remove these VMs at a later time?
The descheduler is going to decide according to its internal policy.
In the more general use case it will be a cluster admin who can decide to live migrate a VM just because he thinks it's the right thing to do.
  - I'm not using any automatic workload rebalancing mechanism and I periodically want to manually rebalance my cluster according to my observations
  - Foreseeing a peak in application load (e.g. new product announcement), I'd like to balance in advance my cluster according to my expectation and not to current observations
  - During a planned maintenance window, I'm planning to drain more than one node in a sequence, so I want to be sure that the VM is going to land on a node that is not going to be drained in a near future (needing then a second migration) and being not interested in cordoning it also for other pods
  - I just added a new node and I want to validate it trying to live migrate a specific VM there
This could be achieved today by modifying the VM's node selector or creating a new VM. New nodes will be the schedulers' very likely target for new pods already.
Right,
from a pure technical perspective this feature can already be achieved simply by directly manipulating the node affinity rules on the VM object. Now we have the `LiveUpdate` rollout strategy, so the new affinity rules will be quickly propagated to the VMI and then consumed on the target pod of the live-migration.
No doubt, on the technical side it will work.
But the central idea of this proposal is about allowing a cluster admin to do that without touching the VM object.
This for two main reasons:
- separation of personas: the VM owner can set rules on his VM, while a cluster admin could still be interested in migrating a VM without messing up or altering the configuration set by the owner on the VM object.
- separating what is a one-off configuration for the single migration attempt (so set on the `VirtualMachineInstanceMigration` object), which is relevant only for this single migration attempt and should not produce any side effect in the future, from what is a long-term configuration that is going to stay there and be applied also later on (future live migrations, restarts).
This comment applies to all the user stories here.
design-proposals/migration-target.md
## User Stories
- As a cluster admin I want to be able to try to live-migrate a VM to specific node for various possible reasons such as:
  - I just added to the cluster a new powerful node and I want to migrate a selected VM there without trying more than once according to scheduler decisions
  - I'm not using any automatic workload rebalancing mechanism and I periodically want to manually rebalance my cluster according to my observations
This is also doable today as the default scheduler will try to choose the least busy node to schedule the target pod.
- As a cluster admin I want to be able to try to live-migrate a VM to specific node for various possible reasons such as:
  - I just added to the cluster a new powerful node and I want to migrate a selected VM there without trying more than once according to scheduler decisions
  - I'm not using any automatic workload rebalancing mechanism and I periodically want to manually rebalance my cluster according to my observations
  - Foreseeing a peak in application load (e.g. new product announcement), I'd like to balance in advance my cluster according to my expectation and not to current observations
Could you please elaborate on this?
What would the cluster look like according to the admin's expectations?
Couldn't a taint be placed on some nodes to resolve capacity before the new product announcement?
Same, I do not want to argue with an admin on how the cluster should be managed, but this is surely not a recommended way we want to encourage/support.
Right, I also added this note: "Note technically all of this can be already achieved manipulating the node affinity rules on the VM object, but as a cluster admin I want to keep a clear boundary between what is a long-lasting setting for a VM, defined by the VM owner, and what is single shot requirement for a one-off migration"
I spoke with @fabiand offline. My main concern with this proposal is that it may promote a wrong assumption that manual cluster balancing is preferred instead of relying on the scheduler/descheduler - while this is just a local minimum.
I think that exposing the whole node affinity/anti-affinity (+ tolerations + ...) grammar on the
I think it's up to us to emphasize this assumption in the API documentation making absolutely clear that the
I'm proposing something like:

    // NodeName is a request to try to migrate this VMI to a specific node.
    // If it is non-empty, the migration controller simply tries to configure the target VMI pod to be started onto that node,
    // assuming that it fits resource, limits and other node placement constraints; it will override nodeSelector and affinity
    // and anti-affinity rules set on the VM.
    // If it is empty, recommended, the scheduler becomes responsible for finding the best Node to migrate the VMI to.
    // +optional
    NodeName string `json:"nodeName,omitempty"`

I'm adding it to this proposal.
Setting affinity and toleration is exactly what any other user would need to do to allow scheduling a workload on tainted node, not sure why we need to facilitate this in the migration case. Generally speaking,
From my pov, we could get away without any API changes and without advertising this option at all - making it available for special cases and not a mainstream.
I'm sorry but now I'm a bit confused.

    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstanceMigration
    metadata:
      name: migration-job
    spec:
      vmiName: vmi-fedora

or, more imperatively, executed something like:

    $ virtctl migrate vmi-fedora

that under the hood is going to create a
This proposal is now about extending it with the optional capability to try to live migrate to a named node.

    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstanceMigration
    metadata:
      name: migration-job
    spec:
      vmiName: vmi-fedora
      nodeName: my-new-target-node

or executing something like:

    $ virtctl migrate vmi-fedora --nodeName=my-new-target-node

and this because one of the key point here is that the cluster admin is not supposed to be required to amend the
The migration controller will simply notice that

    spec:
      nodeName: <nodeName>

or

    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchFields:
              - key: metadata.name
                operator: In
                values:
                - <nodeName>

on the target virt-launcher pod.
Can you please summarize what do you exactly mean with
?
@iholder101 sorry for being picky, but it is not "used to it" - There is simply no way to address the outlined use-cases from an admin perspective. And we should recognize that the use-cases are simply problems that administrators face when running virtualization platforms. Thus to some extent we could say that KubeVirt is only concerned about parts of the use-case, because some of the causes are actually platform driven, and KubeVirt is just a platform add-on. Existing functionality allows VM owners to place workloads.
- Workload balancing solution doesn't always work as expected
> I have configured my cluster with the descheduler and a load aware scheduler (trimaran), thus by default, my VMs will be regularly descheduled if utilization is not balanced, and trimaran will ensure that my VMs will be scheduled to underutilized nodes. Often this is working, however, in exceptional cases, i.e. if the load changes too quickly, or only 1 VM is suffering, and I want to avoid that all VMs on the cluster are moved, I need - for exception - a tool to move one VM, once to deal with this exceptional situation.
- Troubleshooting a node
- Validating a new node migrating there a specific VM
Such an admin should probably run a checkup [1].
[1] https://github.com/kiagnose/kubevirt-dpdk-checkup
I still see this proposed API as an enabler for such eventual automated check.
Without this additional API, the checkup code would have to handle an imperative flow, amending the spec of the named VM a few times in a sequence to trigger the behaviour to be verified.
@EdDev both have a use-case.
But the checkup tool is not helping the admin to test exactly one node in the way s/he needs to test it.
@fabiand , for me, this is an attempt to explain that the need can be solved with other means.
I agree that the solution of defining a specific target node solves this and other needs. However, some argue that it is not an ideal solution. This repeats itself on several points.
design-proposals/migration-target.md
## Goals
- A user allowed to trigger a live-migration of a VM and list the nodes in the cluster is able to rely on a simple and direct API to try to live migrate a VM to a specific node (or a node within a set of nodes identified by adding node affinity constraints).
- The live migration then can successfully complete or fail for various reasons exactly as it can succeed or fail today for other reasons.
I do not understand this goal.
It is already the case today, is there a new goal beyond what exists today?
This is stating that, in terms of exit status, a migration to a selected node via the optional API should behave exactly as a migration behaves today when the target node is selected by the scheduler: no side effects, no "try this first and then eventually let the scheduler decide", or anything like that.
Today live migrations can fail for many reasons.
With this change there is no change to this: live migrations can still fail in many ways.
Technically it's really no change at all, because live migration depends on meeting scheduling constraints. Constraints are CPU flags, memory, storage etc., and with this feature locality will simply be one additional constraint, but the failure mode is still the same.
IMO this should be dropped as nothing special is added or changed.
done, dropped
## Goals
- A user allowed to trigger a live-migration of a VM and list the nodes in the cluster is able to rely on a simple and direct API to try to live migrate a VM to a specific node (or a node within a set of nodes identified by adding node affinity constraints).
- The live migration then can successfully complete or fail for various reasons exactly as it can succeed or fail today for other reasons.
- The target node that is explicitly required for the actual live migration attempt should not influence future live migrations or the placement in case the VM is restarted. For long-lasting placement, nodeSelectors or affinity/anti-affinity rules directly set on the VM spec are the only way to go.
> For long-lasting placement, nodeSelectors or affinity/anti-affinity rules directly set on the VM spec are the only way to go.
As a goal, there is no need to explain how to do something else. I think this is already mentioned well in the motivation.
There was enough tension about this point of why setting affinity on the VM is not a solution. Thus mentioning it here sounds good.
I won't fight over this, but IMO a detailed explanation does not fit in the goal.
- I'm not using any automatic workload rebalancing mechanism and I periodically want to manually rebalance my cluster according to my observations | ||
- Foreseeing a peak in application load (e.g. new product announcement), I'd like to balance in advance my cluster according to my expectation and not to current observations | ||
- During a planned maintenance window, I'm planning to drain more than one node in a sequence, so I want to be sure that the VM is going to land on a node that is not going to be drained in a near future (needing then a second migration) and being not interested in cordoning it also for other pods | ||
- I just added a new node and I want to validate it trying to live migrate a specific VM there |
I recall this from memory, but it is a real world request from multiple end users.
The need of the admin is clear.
But we have better tooling for the admin to use: https://github.com/kiagnose/kiagnose
- I just added to the cluster a new powerful node and I want to migrate a selected VM there without trying more than once according to scheduler decisions | ||
- I'm not using any automatic workload rebalancing mechanism and I periodically want to manually rebalance my cluster according to my observations | ||
- Foreseeing a peak in application load (e.g. new product announcement), I'd like to balance in advance my cluster according to my expectation and not to current observations | ||
- During a planned maintenance window, I'm planning to drain more than one node in a sequence, so I want to be sure that the VM is going to land on a node that is not going to be drained in a near future (needing then a second migration) and being not interested in cordoning it also for other pods |
I second @iholder101 .
I do not think Kubevirt is interested in suggesting such an action to a cluster admin.
While I have no interest in arguing with an admin on how one should manage the cluster, I prefer not to try to convince ourselves that this makes sense and is a valid scenario.
- I just added a new node and I want to validate it trying to live migrate a specific VM there | ||
> [!NOTE] | ||
> technically all of this can be already achieved manipulating the node affinity rules on the VM object, but as a cluster admin I want to keep a clear boundary between what is a long-lasting setting for a VM, defined by the VM owner, and what is single shot requirement for a one-off migration | ||
- As a VM owner I don't want to see my VM object getting amended by another user just for maintenance reasons |
I do not understand this one.
Let's think about DevOps, infrastructure as code or platform engineering: as a VM owner (or as the owner of a complex application composed of VMs, regular pods, services...) I want to store and control its configuration through external means, and I do not want the admin of the KubeVirt cluster interfering with my objects to configure the target for a live migration: this should, if needed, be done on the migration object, not on my own object.
It's a matter of separation of boundaries.
- VM owner: the user who owns a VM in his namespace on a Kubernetes cluster with KubeVirt | ||
- Cluster-admin: the administrator of the cluster | ||
|
||
## User Stories |
I find the user-stories always a challenge.
As I see it, the stories (and even the goals) are supposed to describe the need of the user and avoid providing the solution. Here, the stories provide the solution upfront: Migrate a VM to a selected node.
If we drop that solution, you may see that the stories can be resolved by other means, weakening the case for the solution of migrating to a specific node.
That said, there is one user story which is not specified here and which convinces me:
As a cluster admin that managed various VMM systems, I have well defined processes and steps that proved useful when managing VMs.
One of the basic actions I require to continue following my processes and steps is to have control over the node target when migrating a VM.
With time, I expect to learn and trust Kubevirt to achieve my needs through other means, but until that time, I would like to keep the capabilities I learned to trust so far.
At its base, I acknowledge that Kubevirt is not a classic VMM, it does things very differently, however it has interest to make it easy for VMs from other platforms to be migrated to it. That means making the admins happy.
I agree @EdDev. This indeed sounds much more compelling than the other use-cases.
If that's the rationale, I do think we should make it clear documentation-wise.
No, this is not simply about laziness in learning something new.
Many if not all of the user stories presented here are not directly achievable without installing optional (or even yet-to-be-developed) scheduler and descheduler plugins able to observe real-time metrics, or without using imperative multi-step flows.
> Many if not all of the user stories presented here are not directly achievable if not installing optional or even to be developed scheduler and descheduler plugins able to observe real time metrics or using imperative multi steps flows.

As I see it, a user story should not be concerned with how it is implemented (e.g. installing add-ons, depending on features, etc.); it is only about expressing the need of a user.
I think it is problematic to decide on the solution and embed it in the user-story and goals.
As I see it, this is the main friction on the proposal. Reviewers have reservations on the solution and not on the need.
On the other side `pod.spec.nodeSelector` is only matching labels and the predefined `kubernetes.io/hostname` [label is not guaranteed to be reliable](https://kubernetes.io/docs/reference/node/node-labels/#preset-labels). | ||
`NodeSelectorTerm` offers more options, and in particular: |
- Reliability can be mitigated by using a label we control.
- Having more options is also a con. It opens up options which may be too wide, increasing the chance we will need to help users to use it correctly.
While I agree we need to keep this as generic as possible, having a simple solution is an advantage.
@iholder101, you mention that `nodeSelector` is bad and that it is left there for backward compatibility reasons. Is this written in any formal location? If it is, then this section can reference it as a strong no-go reason.
Anyway, I would add these points even if the previous option is chosen/better.
@EdDev I think you're referring to what I wrote about `.spec.nodeName`. Unlike node affinities or nodeSelectors, it is used to bypass the scheduler, which is a huge problem and breaks stuff very quickly. It's probably almost always a bad idea to use it, except for very extreme cases.
You can look at this conversation, which is interesting regarding the topic: kubernetes/enhancements#3521 (comment)
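To illustrate the difference, here is a minimal sketch (plain Python dicts standing in for a pod manifest) of pinning a pod to one node through required node affinity instead of `.spec.nodeName`. The helper name `pin_with_node_affinity` and the node name `node-1` are made up for the example, and the sketch assumes the pod has no required node affinity yet; it is not necessarily what the proposal implements.

```python
from copy import deepcopy


def pin_with_node_affinity(pod_spec, node_name):
    """Return a copy of pod_spec that requires node_name via node affinity.

    spec.nodeName is deliberately left unset, so the scheduler still
    evaluates the pod (resources, taints, other constraints).
    """
    spec = deepcopy(pod_spec)
    node_affinity = spec.setdefault("affinity", {}).setdefault("nodeAffinity", {})
    if "requiredDuringSchedulingIgnoredDuringExecution" in node_affinity:
        # Merging with existing terms is out of scope for this sketch.
        raise ValueError("pod already has required node affinity")
    node_affinity["requiredDuringSchedulingIgnoredDuringExecution"] = {
        "nodeSelectorTerms": [
            {
                "matchFields": [
                    # metadata.name matches the Node object's name directly,
                    # without depending on a label.
                    {"key": "metadata.name", "operator": "In", "values": [node_name]}
                ]
            }
        ]
    }
    return spec


# Hypothetical target pod for a migration, pinned to a hypothetical node.
target_pod = pin_with_node_affinity({"containers": [{"name": "compute"}]}, "node-1")
```

The resulting pod still goes through the scheduler, so if `node-1` cannot satisfy the pod's other requirements the pod simply stays pending instead of being forced onto the node.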
That’s it, thanks. This should be added here as the main con then.
Edit: strike that, now I see my confusion
This is already tracked in the "Exposing spec.nodeName and directly setting it to the target pod bypassing the k8s scheduler" section of the proposal.
design-proposals/migration-target.md
Outdated
|
||
# Implementation Phases | ||
A very similar attempt was already made in the past with https://github.com/kubevirt/kubevirt/pull/10712 but the PR got some pushback. | |
A similar PR should be reopened, refined and we should implement functional tests. |
- The "one-time" operation convinced me.
- The reasoning for the real need is hard for me to accept, but I did give feedback on this proposal about what convinces me.
I forgot to add to my review that I really like this design proposal. The investment in the motivation and alternative solutions is impressive. 💪
I feel we're always circling back to "the admin just wants to do that".
A descheduler plugin could support that.
I've added comments on how I think we should address these use-cases above.
This is correct, however, sometimes an admin wants to achieve a certain goal but chooses to use the wrong mechanisms in order to do so. This is more likely to happen when the admin comes from different virt platforms which makes him less experienced with the Kubernetes world. @fabiand
I believe my views are thoroughly expressed here already, so I'll shut up now and let others express their opinions as well. |
7dacf7e to f45827b
Other cases are here because they are valid by themselves and not directly addressable without relying on multi-steps imperative flows.
same.
I don't see any real implication of that regarding this API.
Honestly abusing annotations to avoid defining APIs is a bad practice.
|
I think there is a disagreement here on the solution, not on the need. Perhaps you can emphasize the troubleshooting need. I guess it will be hard to find alternative solutions to it.
It is helpful if it convinces others that there is no other alternative solution to it. |
Personally I still think that letting the cluster admin inject a
I want also to better understand the reasons why we should not accept it. Are we saying that propagating that |
whoa. I hadn't looked at this thread in awhile. So much discussion. We're hitting analysis paralysis at this point.
At the end of the day, here's the reality as I see it. Admins are asking for this feature... And they're asking for it over and over. For years. I tried to ignore it in the past, and guide people to our preferred way of doing this sort of thing, but the requests keep coming. I'm worn down.
In my opinion I'm at the point where I say let's just do this and move on. The nodeSelectorTerm on the VMIM seems fine. The alternative with nodeSelectors seems fine as well.
Here's what I'm most interested in now... Does anyone see how proceeding with this will cause any future harm or complexity to the project that impacts our ability to maintain KubeVirt?
@EdDev @vladikr I changed the proposal slightly to highlight: This is about exceptions.
The feature being proposed here is for exceptional cases. Debugging, researching, understanding, directing. As an admin, transparently to the user.
Today, I am not aware of a procedure or tool to move a VM to a specific node without modifying the VM definition itself.
This tool is for our users, for cluster administrators, yet it feels like somebody has to beg to get this in.
It feels like the laid-out user stories are not accepted, while on the other hand they come from end users.
Is our answer that: No, KubeVirt does not support this?
We make the lives of our admins harder, and I do not see this as the mission of KubeVirt.
This feature is
- small
- declarative
- not conflicting
- low risk
I'm not asking why this should be kept out.
Can you please tell me what are the next steps to get this in?
- Workload balancing solution doesn't always work as expected | ||
> I have configured my cluster with the descheduler and a load aware scheduler (trimaran), thus by default, my VMs will be regularly descheduled if utilization is not balanced, and trimaran will ensure that my VMs will be scheduled to underutilized nodes. Often this is working, however, in exceptional cases, i.e. if the load changes too quickly, or only 1 VM is suffering, and I want to avoid that all VMs on the cluster are moved, I need, as an exception, a tool to move one VM, once, to deal with this exceptional situation. | |
- Troubleshooting a node | ||
- Validating a new node migrating there a specific VM |
@EdDev both have a use-case.
But the checkup tool is not helping the admin to test exactly one node in the way s/he needs to test it.
the best node to run a pod (so the target pod for the VMI after the live-migration) on. | ||
In the real world, we still see specific use cases where the flexibility to explicitly and directly define the target node for a live migration is a relevant nice-to-have: | |
- Experienced admins are used to controlling where their critical workloads are moved to | |
> I as an admin, notice that a VM with guaranteed resources is having issues (I watched the cpu iowait metric). In order to resolve the performance issue and keep my user happy, I as admin want to move the VM, without interruption, to a node which is currently underutilized - and will make the user's vm perform better. |
We can make this even more convincing if we mention: I am a long-time cluster-admin of another, well-established VM management system. Missing this functionality, of one-off placement of a VM on a node, in KubeVirt is one of the reasons I haven't expanded my KubeVirt footprint.
I find it hard to be convinced that the project should provide manual solutions to admins instead of an automated, self managed one.
I do find it convincing to provide reasoning like the one @dankenigsberg raised here, just because the admin wants to have control like he had on other projects. It is convincing for me because I have no alternative to give for that need.
We will escalate this enhancement in the next maintainer/approver meeting and update. I still think that the way forward is to use a goal and user-story that have no better alternatives. [1] #320 (comment) |
…specific nodes Adding a design proposal to extend VirtualMachineInstanceMigration object with an additional API to let a cluster admin try to trigger a live migration of a VM injecting on the fly an additional NodeSelectorTerm constraint. The additional NodeSelectorTerm can only restrict the set of Nodes that are valid target for the migration (eventually down to a single host); in order to achieve this, all the `NodeSelectorRequirements` defined on the additional `NodeSelectorTerm` set on the VMIM object should be appended to all the `NodeSelectorTerms` already defined on the VM object. All the affinity rules defined on the VM spec are still going to be satisfied. Signed-off-by: Simone Tiraboschi <[email protected]>
Added that this is for exceptions Signed-off-by: Fabian Deutsch <[email protected]>
Signed-off-by: Simone Tiraboschi <[email protected]>
7cfd58b to 9a5f3d9
I added
to the list of goals and
to the list of user stories. |
Indeed, but on the other hand, would allow features that may hard key process in, in the requested form? I keep wondering if the same feature could not be achieved in a more "cloudy" - however, I'm not blocking on this.
This is the key issue. Migrations today are not limited to cluster admins.
@fabiand how do you define risk? This new API also opens the door to queueing up migrations and delaying important tasks that rely on live-migration without control - I think it might be a risk. Perhaps before proceeding with this feature, we should implement a priority queue for important tasks and some logic around handling important and less important migrations.
|
Honestly I don't think that this feature by itself is going to make this worse. |
Much like vladik, I don't like it that project owners can migrate a VMI (it's not "cloudy", as is the fact they can set /lgtm |
Further to the risks @vladikr has mentioned, I think this proposal poses a "philosophical" question: is it reasonable to add a feature that admins ask for although as developers we believe it has no real use-cases and is the wrong way of doing things? Is our responsibility to simply please the users in this case, or to try to educate them towards using the right mechanisms, even if they're used to doing things in a different way?
What you write here is consistent with what @tiraboschi has added to the proposal:
This text openly admits that the sole use-case here is that admins are used to this feature that's common on other platforms. So, eventually, we need to decide if this reasoning is enough:
Does being worn down by user requests justify adding a feature we believe is wrong?
No, I don't think that we can derive this. |
Most of what I am going to summarize below has been raised already above, however, I think it is useful to see the whole picture in one place. The topic has been discussed in the maintainer meeting on the 4th of November:
I think that the 1st point can be easily addressed and is probably not a major blocker. I have not re-reviewed the latest changes done in the last few days. The 2nd point can be solved at the implementation level, but it does need some research, checking if we can validate that the target node selector can only be valid for a cluster-admin. Hopefully the 3rd point will pass @vladikr.
@EdDev, can you please explain why this proposal will expand the problem? |
I think @vladikr can explain this better. |
What this PR does / why we need it:
Adding a design proposal to extend VirtualMachineInstanceMigration
object with an additional API to let a cluster admin
try to trigger a live migration of a VM injecting
on the fly an additional NodeSelector constraint.
The additional NodeSelector can only restrict the set
of Nodes that are valid target for the migration
(eventually down to a single host).
All the affinity rules defined on the VM spec are still
going to be satisfied.
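
As a rough illustration of why the extra constraint can only restrict the set of valid targets: the commit message describes appending the `NodeSelectorRequirements` of the additional term to every `NodeSelectorTerm` already defined on the VM. Since terms are ORed and the requirements inside a term are ANDed, each existing term can only become stricter. The sketch below uses plain Python dicts; the function name `restrict_terms`, the `zone` label and `node-1` are illustrative assumptions, not the actual KubeVirt code.

```python
from copy import deepcopy


def restrict_terms(vm_terms, migration_term):
    """Append the migration term's requirements to every term set on the VM.

    Terms are ORed and the requirements inside a term are ANDed, so the
    result can only narrow the set of eligible nodes.
    """
    extra_expressions = migration_term.get("matchExpressions", [])
    extra_fields = migration_term.get("matchFields", [])
    if not vm_terms:
        # Assumption: with no node affinity on the VM, the migration term
        # becomes the only constraint.
        return [deepcopy(migration_term)]
    merged = []
    for term in vm_terms:
        new_term = deepcopy(term)
        new_term["matchExpressions"] = (
            new_term.get("matchExpressions", []) + deepcopy(extra_expressions)
        )
        new_term["matchFields"] = (
            new_term.get("matchFields", []) + deepcopy(extra_fields)
        )
        merged.append(new_term)
    return merged


# Hypothetical VM affinity: zone "a" OR zone "b".
vm_terms = [
    {"matchExpressions": [{"key": "zone", "operator": "In", "values": ["a"]}]},
    {"matchExpressions": [{"key": "zone", "operator": "In", "values": ["b"]}]},
]
# Hypothetical one-off constraint carried by the migration object.
migration_term = {
    "matchFields": [{"key": "metadata.name", "operator": "In", "values": ["node-1"]}]
}
merged = restrict_terms(vm_terms, migration_term)
```

Here `merged` reads as (zone a AND node-1) OR (zone b AND node-1), so the VM owner's rules still hold and the VM object itself is left untouched.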
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes https://issues.redhat.com/browse/CNV-7075
Special notes for your reviewer:
Something like this was directly proposed/implemented with kubevirt/kubevirt#10712 getting already discussed there.
Checklist
This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.
Release note: