single-node production deployment approach #560
This is extremely well written, makes total sense to me.
The only thing I found myself wanting here is a specific example of the capabilities API; it seems like that's going to be openshift/api#816? Let's either link to that or explicitly demo what the "user interface" for this is in the install config.
1. Telco workloads typically require special network setups
for a host to boot, including bonded interfaces, access to multiple
VLANs, and static IPs. How do we anticipate configuring those?
2. How do we hand off ownership of the `etcd-quorum-guard` Deployment
@romfreiman mentioned that he and @hexfusion have discussed it and have a plan.
is there a link?
@romfreiman @hexfusion can you summarize the plan in a comment or provide a link to a design doc so I can put either/both into this enhancement?
D'oh! There's a link to that enhancement in the metadata, but probably not in the body of the text. It's #555 and I'll make sure that is called out more clearly.
/cc @markmc
+1, the effort from many people on this is very well captured here. For me, the tl;dr is that this is the minimal list of changes we believe would be needed by operators to respond to a "this is a non-HA cluster" API. Unless there are major objections to that approach, I think we should be able to merge this enhancement quite quickly.
@dhellmann there was a bunch of thoughtful discussion in #504 about configuration changes. I'm not sure I follow the conclusions 100%, so I'm curious whether, in your mind, the result of that discussion is captured in this enhancement. Thanks. (See #504 (comment))
would be expected to run without issue during this interval. Workloads
that do depend on apiserver availability would need to be resilient to
these events. OpenShift core components are already resilient in this
way.
(Maybe this is closely related to my question on the configuration changes discussion.)
This section raises more questions than answers for me. For example, what do we mean by "rollouts"? I think I get reconfiguration for key rotations, but are there other examples of "periodically reconfigured"? "Inaccessible for up to 2 minutes" seems very specific: is this 2-minute timeframe somehow fundamental, or something that can be improved? Can we be more specific about how OpenShift core components are resilient to this, and how other workloads are going to need to be adapted?
Thanks,
@deads2k @cgwalters I think these details came from one of you, can you help answer the question?
Pinging @deads2k and @cgwalters for help here.
As I understand it, the new encryption keys cause a restart, or at least a temporary pause while the new keys are loaded. I guess restarting takes around 2 minutes? @deads2k is that right?
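To make "resilient to these events" a bit more concrete, here is a minimal sketch of one way a workload could ride out a short apiserver outage: retry with a capped backoff inside a time budget slightly longer than the roughly 2-minute window mentioned above. Everything here is an assumption for illustration: `ApiUnavailable`, `call_with_retry`, and the 150-second default are hypothetical and do not come from the enhancement or from any OpenShift client library.

```python
import time


class ApiUnavailable(Exception):
    """Stand-in for whatever error a client raises when the apiserver is down."""


def call_with_retry(call, budget_seconds=150.0, initial_delay=1.0, max_delay=15.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Retry `call` until it succeeds or the time budget is exhausted.

    Hypothetical helper: the 150 s default is an assumed margin over the
    roughly 2-minute outage window discussed in this thread.
    """
    deadline = clock() + budget_seconds
    delay = initial_delay
    while True:
        try:
            return call()
        except ApiUnavailable:
            remaining = deadline - clock()
            if remaining <= 0:
                raise  # the outage outlasted the budget; surface the error
            # Never sleep past the deadline.
            sleep(min(delay, remaining))
            # Exponential backoff, capped so retries stay reasonably frequent.
            delay = min(delay * 2, max_delay)
```

The `sleep` and `clock` parameters exist only so the behaviour can be exercised without real waiting; a caller would normally leave them at their defaults.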
In those earlier discussions we were still assuming we might end up with a version of this that cut some operators out of the cluster completely. The proposal has evolved significantly since then, and I've tried to capture that in an update to the goals/non-goals section. @deads2k, the comment @markmc linked to was yours. Could you take a look at the goals/non-goals list and confirm that I've captured the details to alleviate your earlier concerns?
Yes, this was definitely a team effort!
few notes, thanks for the details.
@dhellmann this is very well done, had a few questions/clarifications.
new capabilities API to change the replica count to 1 when the
high-availability mode is none.

#### cluster-machine-approver
During bootstrapping, we approve everything; post-bootstrapping, we need a corresponding machine record in order to auto-approve.
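The rule described in this comment can be sketched as a small decision function. This is a simplification for discussion only: `should_auto_approve` and its parameters are hypothetical names, and the real cluster-machine-approver performs additional checks that are not modelled here.

```python
def should_auto_approve(csr_node_name, bootstrapping, machine_node_names):
    """Decide whether a node's certificate signing request is auto-approved.

    Illustrative only; mirrors the two cases described in the review comment.
    """
    if bootstrapping:
        # During bootstrapping, every request is approved.
        return True
    # After bootstrapping, a matching machine record must exist.
    return csr_node_name in machine_node_names
```

Under this simplified model, a request from a node with no machine record would not be auto-approved once bootstrapping has finished, which is the case a single-node deployment would need to account for.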
(https://github.com/openshift/release/pull/14552) tests using the
bootstrap-in-place approach described in
https://github.com/openshift/enhancements/pull/565 on Packet and
e2e-aws-single-node (https://github.com/openshift/release/pull/14556)
Is it accurate to assume that the AWS cloud provider will be enabled when running on AWS infra?
@eranco74 does the installer use the right platform setting or does it use an empty platform like UPI?
The installer will use the platform specified in install-config.yaml; for bootstrap-in-place the platform should be None, the same as UPI.
You can still install single node with the regular openshift-installer flow (with a bootstrap node) on AWS, in which case the AWS cloud provider will be enabled.
Thanks everyone for your reviews! I have updated the text based on the actionable feedback, so please take another look if you have reviewed an earlier draft. There are still several threads with open questions or requests for help, but I think we're a lot closer to being able to merge this.
Agree, thanks Doug.
/approve
@derekwaynecarr please lgtm if/when you're happy with Doug's responses to your feedback.
This enhancement describes the approach to deploying single-node production OpenShift instances without using a cluster profile. Signed-off-by: Doug Hellmann <[email protected]>
to stop all workloads safely as part of the reboot. That feature is
alpha in kubernetes 1.20 and disabled by default, so we will need to
add a feature gate to enable it.
Any operator that generates a MachineConfig and templates in the machine-config-operator must be high-availability mode agnostic.
I missed this comment earlier. Maybe we want to fold it into #587?
Thanks @dhellmann, this looks good to merge and iterate. If we find more is necessary, we can update.
/lgtm
Console updates LGTM
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cgwalters, markmc, pweil-
+1, thanks @dhellmann
And the best part is now that this is merged, once we stand up CI and there are occasional failures...we can call those SNOflakes.
@cgwalters and we have a logo
Incorporate feedback from openshift#560 (comment) Signed-off-by: Doug Hellmann <[email protected]>
This enhancement describes the approach to deploying single-node
production OpenShift instances without using a cluster profile.