Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

single-node production deployment approach #560

Conversation

dhellmann
Copy link
Contributor

This enhancement describes the approach to deploying single-node
production OpenShift instances without using a cluster profile.

Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely well written, makes total sense to me.

The only thing I found myself wanting here is the specific example of the capabilities API; it seems like that's going to be openshift/api#816 ? Let's either link to that or explicitly demo what the "user interface" is for this in the install config?

@dhellmann dhellmann force-pushed the single-node-production-deployment-approach branch from 017205a to 52c5ea8 Compare December 10, 2020 22:51
1. Telco workloads typically require special network setups
for a host to boot, including bonded interfaces, access to multiple
VLANs, and static IPs. How do we anticipate configuring those?
2. How do we hand off ownership of the `etcd-quorum-guard` Deployment
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrunalp @staebler I've added an open question about the CVO-CEO hand-off that we discussed earlier. Did we have an answer for that, or is it still an open question?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romfreiman mentioned that he and @hexfusion have discussed it and have a plan.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a link?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romfreiman @hexfusion can you summarize the plan in a comment or provide a link to a design doc so I can put either/both into this enhancement?

@dhellmann
Copy link
Contributor Author

This is extremely well written, makes total sense to me.

The only thing I found myself wanting here is the specific example of the capabilities API; it seems like that's going to be openshift/api#816 ? Let's either link to that or explicitly demo what the "user interface" is for this in the install config?

D'oh! There's a link to that enhancement in the metadata, but probably not in the body of the text. It's #555 and I'll make sure that is called out more clearly.

@dhellmann dhellmann force-pushed the single-node-production-deployment-approach branch 2 times, most recently from 0d0c1e9 to 0003860 Compare December 10, 2020 23:28
@dhellmann
Copy link
Contributor Author

/cc @markmc

@dhellmann dhellmann force-pushed the single-node-production-deployment-approach branch 2 times, most recently from 88fbe8d to dbb028d Compare December 11, 2020 21:27
@markmc
Copy link
Contributor

markmc commented Dec 14, 2020

This is extremely well written, makes total sense to me.

+1 the effort from many people on this is very well captured here

For me, the tl;dr that this is the minimal list of changes we believe would be needed by operators to respond to a "this is a non-HA cluster" API. Unless there are major objections to that approach, I think we should be able to merge this enhancement quite quickly.

@markmc
Copy link
Contributor

markmc commented Dec 14, 2020

@dhellmann there was a bunch of thoughtful discussion in #504 about configuration changes. I'm not sure I follow the conclusions 100%, so I'm curious in your mind the result of that discussion is captured in this enhancement? Thanks.

(See #504 (comment))

would be expected to run without issue during this interval. Workloads
that do depend on apiserver availability would need to be resilient to
these events. OpenShift core components are already resilient in this
way.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Maybe this closely related to my question on the configuration changes discussion)

This section begs more questions for me than answers - e.g. what do we mean by "rollouts"? Reconfiguration for key rotations I think I get, but are there other examples of "periodically reconfigured"? Inaccessible for up to 2 minutes seems very specific - is this 2 minute timeframe somehow fundamental, or something that can be improved? Can we be more specific about how OpenShift core components are resilient to this, and how other workloads are going to need to be adapted?

Thanks,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deads2k @cgwalters I think these details came from one of you, can you help answer the question?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinging @deads2k and @cgwalters for help here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, the new encryption keys cause a restart, or at least a temporary pause while the new keys are loaded. I guess restarting takes around 2 minutes? @deads2k is that right?

@dhellmann
Copy link
Contributor Author

@dhellmann there was a bunch of thoughtful discussion in #504 about configuration changes. I'm not sure I follow the conclusions 100%, so I'm curious in your mind the result of that discussion is captured in this enhancement? Thanks.

(See #504 (comment))

In those earlier discussions we were still assuming we might end up with a version of this that cut some operators out of the cluster completely. The proposal has evolved significantly since then, and I've tried to capture that in an update to the goals/non-goals section.

@deads2k , the comment @markmc linked to was yours. Could you take a look at the goals/non-goals list and confirm that I've captured the details to alleviate your earlier concerns?

@dhellmann
Copy link
Contributor Author

This is extremely well written, makes total sense to me.

+1 the effort from many people on this is very well captured here

Yes, this was definitely a team effort!

@dhellmann dhellmann force-pushed the single-node-production-deployment-approach branch from ea99322 to ce35544 Compare January 6, 2021 21:34
Copy link
Contributor

@hexfusion hexfusion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few notes, thanks for the details.

Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhellmann this is very well done, had a few questions/clarifications.

enhancements/single-node-production-deployment-approach.md Outdated Show resolved Hide resolved
enhancements/single-node-production-deployment-approach.md Outdated Show resolved Hide resolved
new capabilities API to change the replica count to 1 when the
high-availability mode is none.

#### cluster-machine-approver
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

during bootstrapping, we approve everything, post bootstrapping, we need a corresponding machine record in order to auto-approve.

https://github.com/openshift/cluster-machine-approver/blob/master/README.md#openshift-and-csrs

1. Telco workloads typically require special network setups
for a host to boot, including bonded interfaces, access to multiple
VLANs, and static IPs. How do we anticipate configuring those?
2. How do we hand off ownership of the `etcd-quorum-guard` Deployment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a link?

(https://github.com/openshift/release/pull/14552) tests using the
bootstrap-in-place approach described in
https://github.com/openshift/enhancements/pull/565 on Packet and
e2e-aws-single-node (https://github.com/openshift/release/pull/14556)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is accurate to assume that the aws cloud provider will be enabled when running on aws infra?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eranco74 does the installer use the right platform setting or does it use an empty platform like UPI?

Copy link
Contributor

@eranco74 eranco74 Jan 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The installer will use the platform specified in the install-config.yaml, for bootstrap-in-place the platform should be None - same as UPI.
You can still install single node with the openshift-installer regular flow (with bootstrao node) on AWS, the aws cloud provider will be enabled.

@dhellmann dhellmann force-pushed the single-node-production-deployment-approach branch from ce35544 to e13ed3e Compare January 8, 2021 16:58
@dhellmann
Copy link
Contributor Author

Thanks everyone for your reviews! I have updated the text based on the actionable feedback, so please take another look if you have reviewed an earlier draft. There are still several threads with open questions or requests for help, but I think we're a lot closer to being able to merge this.

@markmc
Copy link
Contributor

markmc commented Jan 8, 2021

Thanks everyone for your reviews! I have updated the text based on the actionable feedback, so please take another look if you have reviewed an earlier draft. There are still several threads with open questions or requests for help, but I think we're a lot closer to being able to merge this.

Agree, thanks Doug.

/approve

@derekwaynecarr please lgtm if/when you're happy with Doug's responses to your feedback

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 8, 2021
This enhancement describes the approach to deploying single-node
production OpenShift instances without using a cluster profile.

Signed-off-by: Doug Hellmann <[email protected]>
@dhellmann dhellmann force-pushed the single-node-production-deployment-approach branch from e13ed3e to d5e748f Compare January 8, 2021 19:04
to stop all workloads safely as part of the reboot. That feature is
alpha in kubernetes 1.20 and disabled by default, so we will need to
add a feature gate to enable it.

Copy link
Contributor

@darkmuggle darkmuggle Jan 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Any operator that generates a MachineConfig and templates in the machine-config-operator must be high-availability mode agnostic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this comment earlier. Maybe we want to fold it into #587?

@derekwaynecarr
Copy link
Member

thanks @dhellmann , this looks good to merge and iterate. if we find more is necessary, we can update.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 13, 2021
Copy link

@pweil- pweil- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Console updates LGTM

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, markmc, pweil-

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@spadgett
Copy link
Member

Console updates LGTM

+1, thanks @dhellmann

@openshift-merge-robot openshift-merge-robot merged commit 48ae0cf into openshift:master Jan 13, 2021
@cgwalters
Copy link
Member

And the best part is now that this is merged, once we stand up CI and there are occasional failures...we can call those SNOflakes.

@romfreiman
Copy link

@cgwalters and we have a logo
image (7)

dhellmann added a commit to dhellmann/openshift-enhancements that referenced this pull request Jan 21, 2021
VaishnaviHire pushed a commit to VaishnaviHire/enhancements that referenced this pull request Feb 11, 2021
mansikulkarni96 pushed a commit to mansikulkarni96/enhancements that referenced this pull request Mar 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.