Add Managing Control Plane machines proposal #292
Conversation
- The Cluster etcd Operator manages certificates, reports health, and adds etcd members as "Master" nodes join the cluster.
- The Kubelet is tied to the OS and is therefore managed by the Machine Config Operator.
- To integrate with any Control Plane components topology, e.g. a remote hosted pod-based Control Plane.
- Automated disaster recovery of a cluster that has lost quorum.
I think what we do intend to support should be moved to goals, which is "simplify disaster recovery of a cluster that has lost quorum" (and then later describe what we do offer).
I would like to see this in scope as well. This is the critical scenario: we have lost etcd and the cluster is broken. We hit this with an OSD cluster last weekend, 2 masters offline and quorum lost. Guarding against this with the other goals noted in this enhancement is an improvement, but without the ability to recover from lost quorum there is still a huge risk carried in supporting clusters.
This proposal only affects scaling and ensuring compute resources. Any operational aspect of etcd recovery should be handled by the etcd operator orthogonally.
Currently, during a regular IPI bootstrapping process, the installer uses Terraform to create a bootstrapping instance and 3 master instances. Then it creates Machine resources to "adopt" the existing master instances.

Additionally:
- It will create MachineSets to adopt those machines by looking up known labels (adopting behaviour already exists in the MachineSet logic), as sketched below.
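For illustration, a minimal sketch of such an adopting MachineSet, assuming the openshift/api machine v1beta1 types and the well-known machine-api labels (the name, namespace, and one-MachineSet-per-master layout are illustrative, not the proposal's final shape):

```go
package sketch

import (
	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// masterMachineSet sketches a MachineSet whose selector matches the labels
// the installer already stamps on master Machines, so the existing Machines
// are "adopted" rather than recreated. clusterID is a placeholder.
func masterMachineSet(clusterID string) *machinev1beta1.MachineSet {
	replicas := int32(1) // one MachineSet per existing master Machine
	labels := map[string]string{
		"machine.openshift.io/cluster-api-cluster":      clusterID,
		"machine.openshift.io/cluster-api-machine-role": "master",
		"machine.openshift.io/cluster-api-machine-type": "master",
	}
	return &machinev1beta1.MachineSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterID + "-master-0", // illustrative name
			Namespace: "openshift-machine-api",
		},
		Spec: machinev1beta1.MachineSetSpec{
			Replicas: &replicas,
			// The selector matches the existing Machines' labels, which is
			// what triggers the adoption path in the MachineSet controller.
			Selector: metav1.LabelSelector{MatchLabels: labels},
			Template: machinev1beta1.MachineTemplateSpec{
				ObjectMeta: machinev1beta1.ObjectMeta{Labels: labels},
			},
		},
	}
}
```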
I think this should be automatic by the platform instead of the installer - the installer creates instances, and machine sets are created by a more core operator that respects infrastructure.spec.managedMasters / masters. That supports customers upgrading.
I think it’d be very complex to materialise an opinionated expected state for something like infra.spec.managedMasters in existing clusters.
There are no strong guarantees about cluster topology, nor safe assumptions we can make about already existing clusters/environments, for a flag to instantiate and manage these objects. I’m concerned this would create more rough edges than it would solve.
I think we should make no difference with workers. I don’t think we should have a semantic for managedMasters, the same way we don’t have one for managedWorkers. MachineSets are the building blocks for compute resources.
Before, a limitation in etcd prevented master compute resources from being MachineSets. That’s now fixed at the right layer and shipped on upgrades (etcd operator).
So we just want all compute resources to be MachineSets out of the box. As for existing clusters: as an admin I now know that creating MachineSets for masters is safe, so it’s my recommended choice, as for any other machine.
The installer is just defaulting the initial HA topology for all the machines in a cluster; a MachineSet is nothing more than that. It could actually supersede the current master Machine objects and Terraform code in the installer, just like we do for workers (but we want to keep that step separate from this proposal to maintain scope).
To mitigate these concerns and alleviate the upgrade burden on the user, I introduced a controlPlane CRD and controller managed by the Machine API Operator. Opt-in for existing clusters requires no more than instantiating this object.
This also lets us have more control to limit the underlying MachineSet capabilities, preventing an undesirably high degree of flexibility while reserving the ability to relax it later.
See 5ef75fd
PTAL @smarterclayton
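As a rough sketch of that opt-in, assuming a hypothetical CRD shape extrapolated from this discussion and the fields quoted later in this review (type and field names are illustrative, not the committed API):

```go
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ControlPlaneSpec is a hypothetical shape for the singleton CRD's spec.
type ControlPlaneSpec struct {
	// Replicas is the desired number of Control Plane machines.
	Replicas int32 `json:"replicas"`
	// EnableNodeAutorepair mirrors the field quoted later in this review;
	// a pointer so an omitted value can default to true.
	EnableNodeAutorepair *bool `json:"enableNodeAutorepair,omitempty"`
}

// ControlPlane is the object an admin would create to opt an existing
// cluster in; the controller derives MachineSets and an MHC from it.
type ControlPlane struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              ControlPlaneSpec `json:"spec"`
}
```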
- Scale out.
- Scale in.
- Vertical scaling forcing machine recreation.
- Simulate unhealthy nodes and remediation.
I think it would be good to detail a reasonably complete test plan for each of these (for instance, for unhealthy node emulation alay already owns a test; describe the scenario here so we can review it).
Done
Structure is ok, not happy with the installer creating machine sets.
The Control Plane is the most critical and sensitive entity of a running cluster. Today OCP Control Plane instances are "pets" and therefore fragile. There are multiple scenarios where adjusting the compute capacity backing the Control Plane components might be desirable, either for resizing or for repairing.

Currently there is nothing that automates or eases this task. The steps to resize the Control Plane in any manner, or to recover from a tolerable failure, are completely manual. Different teams follow different "Standard Operating Procedure" documents scattered around with manual steps, resulting in loss of information, confusion, and extra effort for engineers and users.
For the purposes of this document, what is "a tolerable failure"?
etcd quorum is not lost. Clarified.
- Exhaustive unit testing.
- Exhaustive integration testing via [envTest](https://book.kubebuilder.io/reference/testing/envtest.html).
- E2e testing on the [machine API](https://github.com/openshift/cluster-api-actuator-pkg/tree/master/pkg) and origin e2e test suite:
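A minimal envtest skeleton for the integration-testing bullet might look like the following sketch (CRD paths and assertions are placeholders):

```go
package controllers

import (
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// TestControlPlaneReconcile spins up a local kube-apiserver and etcd via
// envtest, installs the CRDs, and exercises the controller against a real
// API surface without a full cluster.
func TestControlPlaneReconcile(t *testing.T) {
	testEnv := &envtest.Environment{
		// Placeholder path; point this at the repo's CRD manifests.
		CRDDirectoryPaths: []string{"config/crd/bases"},
	}
	cfg, err := testEnv.Start()
	if err != nil {
		t.Fatalf("starting envtest: %v", err)
	}
	defer testEnv.Stop()

	c, err := client.New(cfg, client.Options{})
	if err != nil {
		t.Fatalf("building client: %v", err)
	}
	// Create a ControlPlane object here and assert that the expected
	// MachineSets and MachineHealthCheck are created.
	_ = c
}
```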
Would like to see some tests for when bad configuration is supplied, e.g.:
- scale in to fewer than 3 masters
- instance type too small to handle the master workload
At best we can test that scale-in is blocked on the etcd guard PDB. Any further automation that requires some form of auto scaling of compute needs this proposal as a necessary first step.
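For reference, the mechanism being relied on here is that machine drain goes through the eviction API, and the API server rejects evictions that would violate the etcd-guard PDB. A minimal sketch (namespace and pod names are illustrative):

```go
package sketch

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryEvict issues an Eviction for an etcd guard pod. If the etcd-guard
// PodDisruptionBudget would be violated, the API server rejects the
// eviction, which is what blocks an unsafe scale-in during drain.
func tryEvict(ctx context.Context, cs kubernetes.Interface, ns, pod string) error {
	return cs.PolicyV1().Evictions(ns).Evict(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod, Namespace: ns},
	})
}
```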
@jeremyeder My preference is for some component to create the control plane resource for existing clusters by default; which component is TBD. Having the user create them definitely presents lots of problems and will be quite error prone. If the resource in question ends up being something like the cluster autoscaler (basically a singleton CRD to deploy something else that does the actual work) with minimal user-driven configuration, that might be suitable as well. I think this enhancement is in need of some additional revision.
I'm all good merging this as it is as provisional, moving forward and reviewing during the next release cycle. @michaelgugino is already hands-on working on a prototype either way.
That's literally what this proposal is describing atm. I think we can revisit the details during the next release cycle, sync about the state of the etcd operator, and get some tangible feedback from the prototype you are working on. @jeremyeder I'd like to get an ack from @smarterclayton to merge though, as he requested changes in the first place.
### Goals

- To have a declarative mechanism to ensure that existing Control Plane Machines are recreated on a deletion request at any time.
- To auto repair unhealthy Control Plane Nodes.
These are great goals
- As an SRE, I want to have consumable API primitives in place to resize Control Plane compute resources so I can develop upper-level automation tooling atop them, e.g. automate Story 3 to support a severe growth peak in the number of worker nodes.

#### Story 5
- As an operator, I want faulty nodes to be remediated automatically. This includes having self-healing Control Plane machines.
I assume this is covered, but want to call it out here. Faulty control plane machines must be remediated in a way that does not break quorum.
yeah, I clarified that here: 5e56429#diff-dcd7423e408e8b740b1b9b14e53eeda1R201
Any deletion operation must block on the PDB and so honour etcd-guard.
Additionally, for auto repairing nodes, the managed MHC will keep a non-disruptive maxUnhealthy value for etcd quorum, i.e. 1 out of 3, which will limit the number of deletion operations.
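Concretely, the managed MHC might look roughly like this, assuming the openshift/api machine v1beta1 types (the name, namespace, condition, and timeout below are illustrative):

```go
package sketch

import (
	"time"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// controlPlaneMHC sketches the managed MachineHealthCheck described above:
// with 3 masters, capping maxUnhealthy at 1 means remediation can never
// remove enough members to break etcd quorum.
func controlPlaneMHC() *machinev1beta1.MachineHealthCheck {
	maxUnhealthy := intstr.FromInt(1)
	return &machinev1beta1.MachineHealthCheck{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "control-plane-mhc", // illustrative name
			Namespace: "openshift-machine-api",
		},
		Spec: machinev1beta1.MachineHealthCheckSpec{
			Selector: metav1.LabelSelector{MatchLabels: map[string]string{
				"machine.openshift.io/cluster-api-machine-role": "master",
			}},
			MaxUnhealthy: &maxUnhealthy,
			UnhealthyConditions: []machinev1beta1.UnhealthyCondition{
				// Illustrative condition and timeout.
				{
					Type:    corev1.NodeReady,
					Status:  corev1.ConditionFalse,
					Timeout: metav1.Duration{Duration: 5 * time.Minute},
				},
			},
		},
	}
}
```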
One question remaining.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: enxebre. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
thanks @smarterclayton PTAL -> #292 (comment)
// Defaults to true.
// This can be disabled for complex scenarios where manual introspection is needed.
EnableNodeAutorepair bool `json:"enableautorepair,omitempty"`
Minor, but to be able to default this to true it should be a pointer to a bool; otherwise we won't know if the user explicitly set it to false or just omitted it. Alternatively, negate it so that the default is false by calling it something like DisableNodeAutoRepair?
Addressed by 2aa9977
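For reference, the pointer-to-bool pattern suggested above, as a minimal sketch:

```go
package sketch

// nodeAutorepairEnabled shows why a *bool is needed to default to true:
// a nil pointer means "user did not set it", which a plain bool cannot
// distinguish from an explicit false.
func nodeAutorepairEnabled(enable *bool) bool {
	if enable == nil {
		return true // default when the field is omitted
	}
	return *enable // explicit true or false from the user
}
```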
LGTM, I'll leave the label for @smarterclayton
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
We definitely want this. /lifecycle frozen
- The ControlPlane controller creates and manages a MachineSet to back each Control Plane Machine that is found at any time.
- The ControlPlane controller creates and manages a Machine Health Check resource to monitor the Control Plane Machines.

This proposal assumes that all etcd operational aspects are managed orthogonally by the cluster-etcd-operator, in a safe manner, while the compute resources are manipulated.
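In controller terms, those two bullets suggest a reconcile flow roughly like the following sketch using controller-runtime (not the actual implementation; the MachineSet and MHC construction are elided, see the earlier snippets):

```go
package controllers

import (
	"context"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileControlPlane outlines the loop described above.
func reconcileControlPlane(ctx context.Context, c client.Client) error {
	// 1. Find the current Control Plane Machines by their master labels.
	machines := &machinev1beta1.MachineList{}
	if err := c.List(ctx, machines, client.MatchingLabels{
		"machine.openshift.io/cluster-api-machine-role": "master",
	}); err != nil {
		return err
	}
	// 2. Ensure one MachineSet backs each Machine found.
	for range machines.Items {
		// create-or-update a MachineSet adopting this Machine (not shown)
	}
	// 3. Ensure a MachineHealthCheck monitors the Control Plane Machines,
	//    keeping maxUnhealthy quorum-safe (not shown).
	return nil
}
```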
In addition to the etcd member dance, we may need something to manipulate load balancer targets for api and api-int, at least in clusters where those were provisioned by the installer. Maybe that looks like "grow an in-cluster component to optionally manage those load balancers", because that could be more convenient than writing something specific to the control-plane-recovery case.
The machine-api adds and removes control plane hosts from the load balancers today, at least in AWS and GCP, and probably Azure. I think there's also something similar for SLBs, but I'm not sure how they work.
On platforms where this can't be automated, users will need to add/remove load balancer targets themselves. The nice thing about the MachineSet model is that we're not going to add/remove machines without you telling us to, so you'll be able to do this when you see fit.
The machine-api adds and removes control plane hosts from the load balancers today, at least in AWS and GCP, and probably Azure.
How does it discover which LBs to adjust?
That's part of the machine object spec.
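For example, the AWS provider spec embedded in each master Machine carries a list of load balancer references, which is how the controller knows which LBs to adjust. A sketch assuming the openshift/api AWS provider types (the names are placeholders):

```go
package sketch

import (
	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
)

// masterLoadBalancers sketches the LoadBalancers field of an
// AWSMachineProviderConfig: the machine-api attaches each instance to the
// load balancers named here when the Machine is created or removed.
func masterLoadBalancers() []machinev1beta1.LoadBalancerReference {
	return []machinev1beta1.LoadBalancerReference{
		{Name: "mycluster-int", Type: machinev1beta1.NetworkLoadBalancerType}, // api-int
		{Name: "mycluster-ext", Type: machinev1beta1.NetworkLoadBalancerType}, // api
	}
}
```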
This proposal is over a year old. As part of recent efforts to clean up old pull requests, I am removing the lifecycle/frozen label to allow it to age out and be closed. If the proposal is still active, please restore the label. /remove-lifecycle frozen
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle rotten. If this proposal is safe to close now please do so with /close. /lifecycle rotten
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting /reopen. Mark the proposal as fresh by commenting /remove-lifecycle rotten. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This proposal outlines a solution for managing the Control Plane compute resources with a controlPlane resource which manages machineSets and MHC.
cc @abhinavdahiya, @hexfusion, @JoelSpeed, @jeremyeder @smarterclayton @derekwaynecarr @deads2k @markmc @michaelgugino
Follow-up to supersede #278