enhancements/update/automatic-updates: Propose a new enhancement #124
Conversation
LGTM, is it worth noting that autoupdates would be limited to the selected channel?
Administrators can configure the cluster to push those alerts out to the on-call administrator to recover the cluster.
* Stability testing.
We are continually refining our CI suite and processing Telemetry from live clusters in order to assess the stability of each upgrade.
We will not place upgrades in production channels unless they have proven themselves stable in earlier testing, and we will remove upgrades from production channels if a live cluster trips over a corner case that we do not yet cover in pretesting.
s/remove/stop pushing/
Expanded "we will remove upgrades from production channels" to "we will remove update recommendations from production channels". We're not really pushing anything, we're just serving graphs to clients like the CVO. Does the new wording look good to you, or do you have an alternative idea?
If that update proves unstable, many of those upgrades would already be in progress by the time the first Telemetry comes back with failure messages.
A phased rollout would limit the number of simultaneously updating clusters to give Telemetry time to come back so we could stop recommending upgrades that proved unstable on live clusters not yet covered in pretesting.
There is also a security risk where a compromised upstream Cincinnati could recommend cluster updates that were not in the cluster's best interest (e.g. 4.2.4 -> 4.1.0).
Not sure if relevant for OpenShift mitigations, but Zincati has client-side checks and knobs to prevent auto-downgrades: https://github.com/coreos/zincati/blob/dbb0b0a8884435f2b1186b2228199cd4adb6f705/docs/usage/auto-updates.md#updates-ordering-and-downgrades
Not sure if relevant for OpenShift mitigations...
We could grow this. But sometimes you might want to recommend a downgrade. E.g., 4.y.7 turns out to explode after 24h because of broken cert rotation, and a fixed 4.y.8 is 48h out, so you recommend 4.y.7->4.y.6 until then to get folks back to a safe place.
I think it makes sense to version updates separately from the cluster versions, e.g.
update-1.0 => (ocp-4.y.6 -> ocp-4.y.7)
update-1.1 => (ocp-4.y.7 -> ocp-4.y.6)
update-1.2 => (ocp-4.y.6 -> ocp-4.y.8, ocp-4.y.7 -> ocp-4.y.8)
You always want the latest update version, which may or may not upgrade you to the highest cluster version.
Another way to address rollback attacks, as well as freeze attacks: TUF uses timestamps to certify updates for short durations, requiring constant refreshing to ensure the latest update metadata. Timestamp ordering is simple enough but does require secure and accurate time sources (for TUF, see the timestamp.json section: tuf-spec.md#4-document-formats).
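To make the separately-versioned-updates idea above concrete, here is a minimal sketch in Go, using hypothetical types that are not part of any OpenShift API: each update set carries its own monotonically increasing version, and a client always applies the edge set from the highest update version it has seen, even when that recommends moving to a lower cluster version.

```go
package main

import (
	"fmt"
	"sort"
)

// Edge is one recommended transition between cluster versions.
// These types are hypothetical and for illustration only.
type Edge struct {
	From, To string
}

// UpdateSet is a separately versioned bundle of recommended edges.
type UpdateSet struct {
	Version int // update-1.0, update-1.1, ... collapsed to an integer here
	Edges   []Edge
}

// latest returns the UpdateSet with the highest update version, which may
// recommend a *lower* cluster version (e.g. a rollback of a broken release).
func latest(sets []UpdateSet) UpdateSet {
	sort.Slice(sets, func(i, j int) bool { return sets[i].Version < sets[j].Version })
	return sets[len(sets)-1]
}

func main() {
	sets := []UpdateSet{
		{Version: 0, Edges: []Edge{{"ocp-4.y.6", "ocp-4.y.7"}}},
		{Version: 1, Edges: []Edge{{"ocp-4.y.7", "ocp-4.y.6"}}}, // a recommended downgrade
		{Version: 2, Edges: []Edge{{"ocp-4.y.6", "ocp-4.y.8"}, {"ocp-4.y.7", "ocp-4.y.8"}}},
	}
	fmt.Println("apply edges from update version", latest(sets).Version)
}
```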
For libostree the timestamp in the commit is covered by the GPG signature; we don't do anything about comparing with the system wall clock, just require that the timestamp in the new commit increases.
The TUF threat model helps in a DoS attack where a MITM attacker just tells you there are no more updates. This is useful, but comes with a lot of overhead.
For OpenShift though the MCO ignores that bit, and obviously the libostree part isn't used for the rest of the container images anyway.
So I guess I'm just arguing that "signed timestamps" work pretty well and would likely be easy to add to the CVO if it doesn't do it today.
You always want the latest update version, which may or may not upgrade you to the highest cluster version.
We've talked about signing update recommendations, but you'd probably need to sign each of them separately. E.g.
from: 4.3.0
to: 4.3.1
expires: 2020-02-08T00:00Z
That's possible, but would be a fair bit of work to put in. Protections like "require admin overrides before applying downgrades" are coarser (and as above, sometimes you want downgrades), but would be easier to implement in the short term.
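As a rough illustration of what checking such a signed, expiring recommendation could look like, here is a sketch with a hypothetical `Recommendation` struct; signature verification is assumed to have already happened, and the only points shown are the wall-clock expiry test (a freeze-attack mitigation) and the coarser require-an-admin-override-for-downgrades guard.

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"golang.org/x/mod/semver"
)

// Recommendation is a hypothetical, individually signed update recommendation;
// signature verification is assumed to have happened before this check runs.
type Recommendation struct {
	From    string    // e.g. "4.3.0"
	To      string    // e.g. "4.3.1"
	Expires time.Time // signed expiry, bounding how long the metadata can be replayed
}

// allowed applies the two guards discussed above: reject stale recommendations,
// and require an explicit admin override before following a downgrade edge.
func allowed(r Recommendation, now time.Time, adminOverride bool) error {
	if now.After(r.Expires) {
		return errors.New("recommendation has expired; refusing possibly replayed metadata")
	}
	if semver.Compare("v"+r.To, "v"+r.From) < 0 && !adminOverride {
		return errors.New("downgrade recommended; admin override required")
	}
	return nil
}

func main() {
	r := Recommendation{From: "4.3.0", To: "4.3.1", Expires: time.Date(2020, 2, 8, 0, 0, 0, 0, time.UTC)}
	fmt.Println(allowed(r, time.Now(), false)) // expired for any wall clock after 2020-02-08
}
```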
So I guess I'm just arguing that "signed timestamps" work pretty well and would likely be easy to add to the CVO if it doesn't do it today.
For the releases themselves, we have a version number and creation timestamp, both of which are covered by the image signature. Sorting on version number seems more sane to me, because we're cutting 4.2.18 after 4.3.0, and 4.3.0 -> 4.2.18 is probably not what most clusters want ;).
Right...when you have multiple branches you do want something more sophisticated. We ended up implementing ref binding in libostree, which avoids parsing the version strings. But I guess for the fully general case of switching branches, you do need something that is parsing them.
There are also potential future mitigations:

* The cluster-version operator could be taught to limit automatic upgrades to those which appear in the [signed metadata][cluster-version-operator-release-metadata] of either the source or target release.
I don't think this is a mitigation that actually fits in the rest of design/release-engineering (or more likely, I didn't understand your suggestion).
In particular, it seems to conflict with the "rollouts based on Telemetry" point above.
I don't think this is a mitigation that actually fits in the rest of design/release-engineering...
This is a guard against malicious Cincinnati, whose risk exists independent of this enhancement. This enhancement just increases exposure to the preexisting risk.
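A sketch of that guard under simplifying assumptions: `ReleaseMetadata` below is a hypothetical view of the signed metadata in a release image, and an automatic update edge is accepted only if it appears in the metadata of the source or target release, so a compromised Cincinnati on its own cannot steer the cluster onto an arbitrary edge.

```go
package main

import "fmt"

// ReleaseMetadata is a hypothetical view of the signed metadata carried by a
// release image: its own version plus the versions it documents updates from.
// Verifying the signature on this metadata is assumed to have already happened.
type ReleaseMetadata struct {
	Version  string
	Previous []string // versions this release claims it can be updated from
}

// edgeAllowed reports whether current -> target appears in the target's signed
// metadata, which is the proposed guard against a malicious update service.
// A source release could also carry an explicit list of sanctioned targets;
// that direction is omitted for brevity.
func edgeAllowed(current, target ReleaseMetadata) bool {
	for _, v := range target.Previous {
		if v == current.Version {
			return true
		}
	}
	return false
}

func main() {
	current := ReleaseMetadata{Version: "4.2.4"}
	target := ReleaseMetadata{Version: "4.1.0", Previous: []string{"4.0.0"}}
	fmt.Println(edgeAllowed(current, target)) // false: 4.2.4 -> 4.1.0 is not a signed edge
}
```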
```
- Tues..Thurs *-11..1-* 11:00
```

The schedule structure was designed to support maintenance windows, allowing for updates on certain days or during certain times.
The entries shown above are punctual datetimes, not maintenance windows.
That is, they specify when a maintenance event can start (edge-trigger) but not how long it can be active (level-trigger). I fear this is a relevant semantic detail if there is any polling logic involved.
That is, they specify when a maintenance event can start (edge-trigger) but not how long it can be active (level-trigger).
There's no cap on upgrade duration. If you're concerned about it, you'd have small windows early in your day for initiating upgrades, to leave lots of time for monitoring/recovering before you went home. But this is an alternative proposal; I dunno how deeply we want to dig into it here.
Oh sorry, I realize my comment was not worded clearly enough. I'm not trying to cap upgrade duration.
I'm stating that the schedule format would need to define a timespan, even just for the starting event.
That is, "allow auto-update events between 02:00 and 03:00 on Sat" instead of "allow auto-update events exactly at 02:00:00.00 on Sat".
The current syntax is taken from systemd timestamps, which define a single specific point in time instead.
But I guess all of these are implementation details, so feel free to defer this discussion.
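To illustrate the edge-trigger versus level-trigger distinction, here is a small sketch assuming the schedule grows an explicit duration: a punctual systemd-style timestamp only names an instant at which an update may start, while a window (start plus length) lets a polling loop ask whether now falls inside an allowed span.

```go
package main

import (
	"fmt"
	"time"
)

// Window is a hypothetical maintenance window: a weekly start point plus an
// explicit duration, rather than the single instant a systemd calendar spec names.
type Window struct {
	Weekday time.Weekday
	Start   time.Duration // offset from midnight, e.g. 2 * time.Hour for 02:00
	Length  time.Duration // how long update starts may be initiated
}

// open reports whether t falls inside the window, which is what a polling
// auto-update loop needs (level-trigger), as opposed to matching an exact
// start instant (edge-trigger). Windows crossing midnight are ignored here.
func (w Window) open(t time.Time) bool {
	if t.Weekday() != w.Weekday {
		return false
	}
	midnight := time.Date(t.Year(), t.Month(), t.Day(), 0, 0, 0, 0, t.Location())
	offset := t.Sub(midnight)
	return offset >= w.Start && offset < w.Start+w.Length
}

func main() {
	w := Window{Weekday: time.Saturday, Start: 2 * time.Hour, Length: time.Hour} // Sat 02:00-03:00
	fmt.Println(w.open(time.Date(2020, 2, 8, 2, 30, 0, 0, time.UTC)))            // true: 2020-02-08 is a Saturday
}
```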
For example, a `schedule` property directly on [the ClusterVersion `spec`][api-spec] would protect administrators from accidentally using the web console or `oc adm upgrade ...` to trigger an upgrade outside of the configured window.
Instead, administrators would have to adjust the `schedule` configuration or set an override option to trigger out-of-window updates.
Customization like this can also be addressed by intermediate [policy-engine][cincinnati-policy-engine], without involving the ClusterVersion configuration at all.
Teaching the cluster-version operator about a local `schedule` filter is effectively like a local policy engine.
There is a difference though:
- if a schedule is enforced locally by CVO, the operator can grow logic for an escape hatch for manual forced-updates (or only apply to auto-updates at all)
- if a schedule is enforced globally by policy-engine, it cannot be force-bypassed locally
if a schedule is enforced globally by policy-engine, it cannot be force-bypassed locally
Sure it can, you just set the update by pullspec and force the "don't worry if this is not in your graph" exception.
Only skimmed this, looks reasonable so far. One thing I would like to say though is that Red Hat hasn't truly completed the acquisition of CoreOS until we enable automatic updates by default (on by default in Fedora CoreOS, but not in OpenShift).
Ok, I think I've addressed most of @lucab's excellent feedback. Outstanding threads are:
// automaticUpdates enables automatic updates.
// +optional
AutomaticUpdates bool `json:"automaticUpdates,omitempty"`
I don't think booleans are great API as they are not extensible.
Having a boolean makes it so that we are saying the CVO only knows "update always" or "don't update", and it is never going to do anything else; any condition, if required, will be handled somewhere else.
Personally I think something like the union discriminator kubernetes/enhancements#926 is much better suited to express the intent: no-updates, auto-updates-always, auto-updates-schedule, etc.
There is an overhead trade-off that needs to be taken into consideration before saying the CVO doesn't do local stuff:
will a user with a single or a handful of clusters want to run a local policy agent, and is it sane for them to do so?
Will users that want different policies for their clusters, like testing/staging/prod, want to run 3 different policy agents?
Does the fact that we have no policy agent at this point, are not sure how far out that is, or even whether the concept of supporting a self-signed upstream is in scope, deter anybody from even using this, hence diminishing the value prop of the boolean?
Personally I would like clusters to support at least one use case of automatic updates based on a condition.
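For illustration, a rough sketch of the union-discriminator shape being suggested, with hypothetical field names that are not the current ClusterVersion API: the discriminator names the policy, each policy can carry its own optional configuration, and new modes can be added later without overloading a lone boolean.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UpdatePolicyType is the discriminator: which update behaviour is in effect.
// These names are illustrative, not an agreed API.
type UpdatePolicyType string

const (
	NoAutomaticUpdates UpdatePolicyType = "None"
	AlwaysUpdate       UpdatePolicyType = "Always"
	ScheduledUpdates   UpdatePolicyType = "Scheduled"
)

// ScheduledUpdatesConfig only applies when the discriminator is "Scheduled".
type ScheduledUpdatesConfig struct {
	// Windows holds calendar expressions for when updates may be initiated.
	Windows []string `json:"windows"`
}

// UpdatePolicy is a discriminated union: exactly one member matching the
// Policy field is expected to be set, leaving room for future modes that a
// bare boolean could not express.
type UpdatePolicy struct {
	Policy    UpdatePolicyType        `json:"policy"`
	Scheduled *ScheduledUpdatesConfig `json:"scheduled,omitempty"`
}

func main() {
	p := UpdatePolicy{
		Policy:    ScheduledUpdates,
		Scheduled: &ScheduledUpdatesConfig{Windows: []string{"Sat 02:00 + 1h"}},
	}
	out, _ := json.Marshal(p)
	fmt.Println(string(out))
}
```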
I don't think booleans are great API as they are not extensible.
I'm still not clear on what sort of extensibility you'd want. I think there's a set of policy engines which can be distributed in external chains or run inside the CVO itself, that eventually spit out a set of recommended update targets. Then you need a boolean knob to decide if the CVO automatically initiates an update when a target becomes available. For example:
- Upstream Cincinnati spits out graph A.
- On-site policy engine ingests graph A and applies local stability filtering, removing Red-Hat-recommended update edges until they have also passed local update testing. This engine spits out graph B.
- CVO ingests graph B via the `upstream` URI, and applies `schedule` filtering, removing all update edges if and only if the current time lies outside the configured update windows. This engine spits out graph C.
- CVO ingests its own graph C and looks for available updates from the current version, storing those in `AvailableUpdates`.
- CVO looks at `AvailableUpdates` and the `AutomaticUpdates` boolean, and decides whether to automatically set `DesiredUpdate`.
So there's still lots of room for extension. The boolean option is just an in-cluster knob for the existing CLI flag.
...any condition, if required, will be handled somewhere else...
No, you could still apply policy filtering on the CVO side, independent of whether the CVO automatically updates based on the results of a local policy engine. This is what I was trying to convey here.
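A compressed sketch of the last two steps of that flow, with hypothetical helper names: whatever chain of policy engines produced the final graph, computing the candidate and honoring the boolean is mechanical, and the knob only controls whether the target is applied automatically.

```go
package main

import (
	"fmt"
	"sort"
)

// Update is a simplified stand-in for an entry in AvailableUpdates.
type Update struct{ Version string }

// decide picks the update to apply, if any. availableUpdates would normally be
// computed from the filtered graph (graph C above); here it is just handed in.
func decide(availableUpdates []Update, automaticUpdates bool) (Update, bool) {
	if !automaticUpdates || len(availableUpdates) == 0 {
		return Update{}, false
	}
	// Pick the highest version; real code would compare SemVer rather than
	// relying on a lexical sort.
	sort.Slice(availableUpdates, func(i, j int) bool {
		return availableUpdates[i].Version < availableUpdates[j].Version
	})
	return availableUpdates[len(availableUpdates)-1], true
}

func main() {
	target, ok := decide([]Update{{"4.3.0"}, {"4.3.1"}}, true)
	fmt.Println(target, ok) // the target that would become the new DesiredUpdate
}
```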
Administrators can configure the cluster to push those alerts out to the on-call administrator to recover the cluster.
* Stability testing.
We are continually refining our CI suite and processing Telemetry from live clusters in order to assess the stability of each update.
We will not place updates in production channels unless they have proven themselves stable in earlier testing, and we will remove update recommendations from production channels if a live cluster trips over a corner case that we do not yet cover in pretesting.
Without the ability to control rollout, this seems like a risky feature. What is the minimal phased rollout feature that mitigates risk?
What is the minimal phased rollout feature that mitigates risk?
The ability to spread the update out across a release-admin-specified time window, with clusters in the given channel randomly spread across that window. Things like automatically pulling edges on failure would be nice, but can be mitigated with manual monitoring and large-duration windows (i.e. slow rollouts). Things like intelligently sorting sensitive clusters toward the back of the queue would also be nice, but can be mitigated by ensuring sufficient populations in less-stable channels.
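A minimal sketch of the "randomly spread across the window" part, assuming each cluster derives a deterministic offset from a hash of its cluster ID, so a release-admin-specified window of a given length staggers updates without any central coordination.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// offsetInWindow maps a cluster ID to a stable offset within a rollout window,
// so clusters in the same channel begin updating at different, evenly spread times.
func offsetInWindow(clusterID string, window time.Duration) time.Duration {
	h := fnv.New64a()
	h.Write([]byte(clusterID))
	return time.Duration(h.Sum64() % uint64(window))
}

func main() {
	window := 48 * time.Hour // release-admin-specified rollout window
	start := time.Date(2020, 2, 8, 0, 0, 0, 0, time.UTC)
	for _, id := range []string{"cluster-a", "cluster-b", "cluster-c"} {
		fmt.Println(id, "may begin updating at", start.Add(offsetInWindow(id, window)))
	}
}
```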
There are already canary clusters with polling automatic updates, so we can use those for testing and will not need to provision more long-running clusters to exercise this enhancement.
We can provision additional short-lived clusters in existing CI accounts if we want to provide additional end-to-end testing.

[api-pull-request]: https://github.com/openshift/api/pull/326
This pull request is now closed. So maybe we can remove it from here. Not sure what is the right action here.
This pull request is now closed.
It was closed pending this enhancement discussion. If the enhancement lands, I'll re-open and reroll the PR.
Allowing automatic updates would make it more likely that update failures happen when there is no administrator actively watching to notice and recover from the failure.
This is mitigated by:

* Alerting.
We actually know that certain (non-critical) alerts will fire during upgrades. The observability group (of which the in-cluster monitoring team is a part) has been trialing silencing everything but critical alerts during upgrades. While this is not perfect yet, I think this is what we ultimately want to automate for automatic upgrades.
That said, any `severity: critical` alerts firing should prevent an automatic upgrade from happening. This level is used to page people, so in a production cluster this should already mean that someone has been notified. Our goal must be that no critical alerts are firing during upgrades. Essentially this means we need to set certain objectives per component, and if they are violated we cannot proceed with an automatic upgrade and/or people are notified. I know the in-cluster team is planning an initiative to introduce these objectives for components (at this point earliest in the 4.5 time frame I would say).
We actually know that certain (non critical) alerts will fire during upgrades.
Do we consider these overly sensitive alerts? Update bugs? I'd expect that our goal is to have updates be smooth enough that we can apply them without additional alerts firing. Silencing non-critical alerts sounds like a reasonable stopgap to avoid alert fatigue, but I'm less comfortable with it as a long-term plan.
That said, any `severity: critical` alerts firing should prevent an automatic upgrade from happening.
Hmm. We might be able to swing this with the monitoring operator setting `Upgradeable=False` when it sees a critical alert.
Silencing is made exactly for the case of "we know a maintenance window is coming, so don't bother us with non-critical things", so I think it's exactly the right thing. Warning alerts are more the type of "something might be up, but it's not urgent". They should typically just open a ticket for someone to eventually take a look at. Critical alerts are the type of alerts that tell someone that there is something that needs attention urgently and that SLOs are likely to be violated. The latter should never fire during an upgrade, I agree.
We might be able to swing this with the monitoring operator setting Upgradeable=False when it sees a critical alert.
I think that's a good idea.
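A sketch of such a gate, assuming a reachable Prometheus HTTP API and omitting authentication: the monitoring operator would query for firing critical alerts and report `Upgradeable=False` on its ClusterOperator when any are returned (the status update itself is elided here).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// criticalAlertsFiring queries Prometheus for firing critical alerts. The
// endpoint URL and the missing authentication are simplifying assumptions.
func criticalAlertsFiring(promURL string) (bool, error) {
	q := url.Values{"query": []string{`ALERTS{severity="critical",alertstate="firing"}`}}
	resp, err := http.Get(promURL + "/api/v1/query?" + q.Encode())
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var body struct {
		Data struct {
			Result []json.RawMessage `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return len(body.Data.Result) > 0, nil
}

func main() {
	firing, err := criticalAlertsFiring("http://prometheus-k8s.openshift-monitoring.svc:9090")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	if firing {
		fmt.Println("critical alerts firing: the operator would report Upgradeable=False")
	} else {
		fmt.Println("no critical alerts: automatic upgrades may proceed")
	}
}
```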
@wking A few more thoughts on this. Sorry if they have been covered elsewhere.
Yup. And the CVO has had autoupdate logic to select the latest SemVer from available updates since 2018. This PR is just asking for a knob to enable that code. And if there were any stability issues with the longest recommended hop, the upstream service would remove that recommendation.
I'm not against that in general, but I am against it in this PR. If you feel that it's impossible to punt without getting painted into a corner, please explain why in an existing thread.
If you are running 4.6 and want to stay there, subscribe to channel
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle stale`. If this issue is safe to close now please do so with `/close`.
/lifecycle stale
/lifecycle frozen
Just waiting for everyone to get on board ;)
I think we should do either of the following:
- close the enhancement
- update the enhancement to capture that we are deferring implementation.

Before pursuing this further, we need to rationalize it with the present shared-responsibility model we have in OpenShift Dedicated and other managed environments that do view calendar gating as part of the MVP solution.
- "@crawford" | ||
- "@smarterclayton" | ||
approvers: | ||
- TBD |
suggest @crawford?
* Make it easy to opt in to and out of automatic updates.

### Non-Goals
For reference, OpenShift Dedicated has layered logic over the cluster to handle calendar gating.
The update settings presented for Dedicated are the following:
- automatic, with a user-supplied preferred day and start time.
- manual, with a warning that critical and high CVE vulnerabilities will be patched independent of this setting.
- a node-draining budget for maximum grace time is also configurable (default 1 hr).
- TBD
creation-date: 2019-11-19
last-updated: 2019-11-21
status: implementable
Possibly update this status to `deferred` given the frozen state, if we want to merge for record keeping?
[APPROVALNOTIFIER] This PR is NOT APPROVED
/close
@sdodson: Closed this PR.
To back openshift/api#326.
CC @abhinavdahiya, @bradmwilliams, @crawford, @lucab, @smarterclayton, @steveej, @stevekuznetsov, @vrutkovs