enhancements/update/automatic-updates: Propose a new enhancement #124
Conversation
LGTM, is it worth noting that autoupdates would be limited to the selected channel?
Administrators can configure the cluster to push those alerts out to the on-call administrator to recover the cluster.
* Stability testing.
We are continually refining our CI suite and processing Telemetry from live clusters in order to assess the stability of each upgrade.
We will not place upgrades in production channels unless they have proven themselves stable in earlier testing, and we will remove upgrades from production channels if a live cluster trips over a corner case that we do not yet cover in pretesting.
s/remove/stop pushing/
Expanded "we will remove upgrades from production channels" to "we will remove update recommendations from production channels". We're not really pushing anything, we're just serving graphs to clients like the CVO. Does the new wording look good to you, or do you have an alternative idea?
If that update proves unstable, many of those upgrades would already be in progress by the time the first Telemetry comes back with failure messages.
A phased rollout would limit the number of simultaneously updating clusters to give Telemetry time to come back so we could stop recommending upgrades that proved unstable on live clusters not yet covered in pretesting.
There is also a security risk where a compromised upstream Cincinnati could recommend cluster updates that were not in the cluster's best interest (e.g. 4.2.4 -> 4.1.0).
Not sure if relevant for OpenShift mitigations, but Zincati has client-side checks and knobs to prevent auto-downgrades: https://github.com/coreos/zincati/blob/dbb0b0a8884435f2b1186b2228199cd4adb6f705/docs/usage/auto-updates.md#updates-ordering-and-downgrades
Not sure if relevant for OpenShift mitigations...
We could grow this. But sometimes you might want to recommend a downgrade. E.g., 4.y.7 turns out to explode after 24h because of broken cert rotation, and a fixed 4.y.8 is 48h out, so you recommend 4.y.7->4.y.6 until then to get folks back to a safe place.
I think it makes sense to version updates separately from the cluster versions, e.g.
update-1.0 => (ocp-4.y.6 -> ocp-4.y.7)
update-1.1 => (ocp-4.y.7 -> ocp-4.y.6)
update-1.2 => (ocp-4.y.6 -> ocp-4.y.8, ocp-4.y.7 -> ocp-4.y.8)
You always want the latest update version, which may or may not upgrade you to the highest cluster version.
Another way to address rollback attacks, as well as freeze attacks: TUF uses timestamps to certify updates for short durations, requiring constant refreshing to ensure the latest update metadata. Timestamp ordering is simple enough but does require secure and accurate time sources (for TUF, see the timestamp.json section: tuf-spec.md#4-document-formats).
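To make the separately-versioned-updates idea above concrete, here is a minimal sketch in Go, using hypothetical types that are not part of any OpenShift API: each update set carries its own monotonically increasing version, and a client always applies the edge set from the highest update version it has seen, even when that recommends moving to a lower cluster version.

```go
package main

import (
	"fmt"
	"sort"
)

// Edge is one recommended transition between cluster versions.
// These types are hypothetical and for illustration only.
type Edge struct {
	From, To string
}

// UpdateSet is a separately versioned bundle of recommended edges.
type UpdateSet struct {
	Version int // update-1.0, update-1.1, ... collapsed to an integer here
	Edges   []Edge
}

// latest returns the UpdateSet with the highest update version, which may
// recommend a *lower* cluster version (e.g. a rollback of a broken release).
func latest(sets []UpdateSet) UpdateSet {
	sort.Slice(sets, func(i, j int) bool { return sets[i].Version < sets[j].Version })
	return sets[len(sets)-1]
}

func main() {
	sets := []UpdateSet{
		{Version: 0, Edges: []Edge{{"ocp-4.y.6", "ocp-4.y.7"}}},
		{Version: 1, Edges: []Edge{{"ocp-4.y.7", "ocp-4.y.6"}}}, // a recommended downgrade
		{Version: 2, Edges: []Edge{{"ocp-4.y.6", "ocp-4.y.8"}, {"ocp-4.y.7", "ocp-4.y.8"}}},
	}
	fmt.Println("apply edges from update version", latest(sets).Version)
}
```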
For libostree the timestamp in the commit is covered by the GPG signature; we don't do anything about comparing with the system wall clock, just require that the timestamp in the new commit increases.
The TUF threat model helps in a DoS attack where a MITM attacker just tells you there are no more updates. This is useful, but comes with a lot of overhead.
For OpenShift though the MCO ignores that bit, and obviously the libostree part isn't used for the rest of the container images anyway.
So I guess I'm just arguing that "signed timestamps" work pretty well and would likely be easy to add to the CVO if it doesn't do it today.
You always want the latest update version, which may or may not upgrade you to the highest cluster version.
We've talked about signing update recommendations, but you'd probably need to sign each of them separately. E.g.
from: 4.3.0
to: 4.3.1
expires: 2020-02-08T00:00Z
That's possible, but would be a fair bit of work to put in. Protections like "require admin overrides before applying downgrades" are coarser (and as above, sometimes you want downgrades), but would be easier to implement in the short term.
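As a rough illustration of what checking such a signed, expiring recommendation could look like, here is a sketch with a hypothetical `Recommendation` struct; signature verification is assumed to have already happened, and the only points shown are the wall-clock expiry test (a freeze-attack mitigation) and the coarser require-an-admin-override-for-downgrades guard.

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"golang.org/x/mod/semver"
)

// Recommendation is a hypothetical, individually signed update recommendation;
// signature verification is assumed to have happened before this check runs.
type Recommendation struct {
	From    string    // e.g. "4.3.0"
	To      string    // e.g. "4.3.1"
	Expires time.Time // signed expiry, bounding how long the metadata can be replayed
}

// allowed applies the two guards discussed above: reject stale recommendations,
// and require an explicit admin override before following a downgrade edge.
func allowed(r Recommendation, now time.Time, adminOverride bool) error {
	if now.After(r.Expires) {
		return errors.New("recommendation has expired; refusing possibly replayed metadata")
	}
	if semver.Compare("v"+r.To, "v"+r.From) < 0 && !adminOverride {
		return errors.New("downgrade recommended; admin override required")
	}
	return nil
}

func main() {
	r := Recommendation{From: "4.3.0", To: "4.3.1", Expires: time.Date(2020, 2, 8, 0, 0, 0, 0, time.UTC)}
	fmt.Println(allowed(r, time.Now(), false)) // expired for any wall clock after 2020-02-08
}
```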
So I guess I'm just arguing that "signed timestamps" work pretty well and would likely be easy to add to the CVO if it doesn't do it today.
For the releases themselves, we have a version number and creation timestamp, both of which are covered by the image signature. Sorting on version number seems more sane to me, because we're cutting 4.2.18 after 4.3.0, and 4.3.0 -> 4.2.18 is probably not what most clusters want ;).
Right...when you have multiple branches you do want something more sophisticated. We ended up implementing ref binding in libostree, which avoids parsing the version strings. But I guess for the fully general case of switching branches, you do need something that is parsing them.
There are also potential future mitigations:

* The cluster-version operator could be taught to limit automatic upgrades to those which appear in the [signed metadata][cluster-version-operator-release-metadata] of either the source or target release.
I don't think this is a mitigation that actually fits in the rest of design/release-engineering (or more likely, I didn't understand your suggestion).
In particular, it seems to conflict with the "rollouts based on Telemetry" point above.
I don't think this is a mitigation that actually fits in the rest of design/release-engineering...
This is a guard against malicious Cincinnati, whose risk exists independent of this enhancement. This enhancement just increases exposure to the preexisting risk.
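A sketch of that guard under simplifying assumptions: `ReleaseMetadata` below is a hypothetical view of the signed metadata in a release image, and an automatic update edge is accepted only if it appears in the metadata of the source or target release, so a compromised Cincinnati on its own cannot steer the cluster onto an arbitrary edge.

```go
package main

import "fmt"

// ReleaseMetadata is a hypothetical view of the signed metadata carried by a
// release image: its own version plus the versions it documents updates from.
// Verifying the signature on this metadata is assumed to have already happened.
type ReleaseMetadata struct {
	Version  string
	Previous []string // versions this release claims it can be updated from
}

// edgeAllowed reports whether current -> target appears in the target's signed
// metadata, which is the proposed guard against a malicious update service.
// A source release could also carry an explicit list of sanctioned targets;
// that direction is omitted for brevity.
func edgeAllowed(current, target ReleaseMetadata) bool {
	for _, v := range target.Previous {
		if v == current.Version {
			return true
		}
	}
	return false
}

func main() {
	current := ReleaseMetadata{Version: "4.2.4"}
	target := ReleaseMetadata{Version: "4.1.0", Previous: []string{"4.0.0"}}
	fmt.Println(edgeAllowed(current, target)) // false: 4.2.4 -> 4.1.0 is not a signed edge
}
```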
```
- Tues..Thurs *-11..1-* 11:00
```

The schedule structure was designed to support maintenance windows, allowing for updates on certain days or during certain times.
The entries shown above are punctual datetimes, not maintenance windows.
That is, they specify when a maintenance event can start (edge-trigger) but not how long it can be active (level-trigger). I fear this is a relevant semantic detail if there is any polling logic involved.
That is, they specify when a maintenance event can start (edge-trigger) but not how long it can be active (level-trigger).
There's no cap on upgrade duration. If you're concerned about it, you'd have small windows early in your day for initiating upgrades, to leave lots of time for monitoring/recovering before you went home. But this is an alternative proposal; I dunno how deeply we want to dig into it here.
Oh sorry, I realize my comment was not worded clearly enough. I'm not trying to cap upgrade duration.
I'm stating that the schedule format would need to define a timespan, even just for the starting event.
That is, "allow auto-update events between 02:00 and 03:00 on Sat" instead of "allow auto-update events exactly at 02:00:00.00 on Sat".
The current syntax is taken from systemd timestamps, which define a single specific point in time instead.
But I guess all of these are implementation details, so feel free to defer this discussion.
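To illustrate the edge-trigger versus level-trigger distinction, here is a small sketch assuming the schedule grows an explicit duration: a punctual systemd-style timestamp only names an instant at which an update may start, while a window (start plus length) lets a polling loop ask whether now falls inside an allowed span.

```go
package main

import (
	"fmt"
	"time"
)

// Window is a hypothetical maintenance window: a weekly start point plus an
// explicit duration, rather than the single instant a systemd calendar spec names.
type Window struct {
	Weekday time.Weekday
	Start   time.Duration // offset from midnight, e.g. 2 * time.Hour for 02:00
	Length  time.Duration // how long update starts may be initiated
}

// open reports whether t falls inside the window, which is what a polling
// auto-update loop needs (level-trigger), as opposed to matching an exact
// start instant (edge-trigger). Windows crossing midnight are ignored here.
func (w Window) open(t time.Time) bool {
	if t.Weekday() != w.Weekday {
		return false
	}
	midnight := time.Date(t.Year(), t.Month(), t.Day(), 0, 0, 0, 0, t.Location())
	offset := t.Sub(midnight)
	return offset >= w.Start && offset < w.Start+w.Length
}

func main() {
	w := Window{Weekday: time.Saturday, Start: 2 * time.Hour, Length: time.Hour} // Sat 02:00-03:00
	fmt.Println(w.open(time.Date(2020, 2, 8, 2, 30, 0, 0, time.UTC)))            // true: 2020-02-08 is a Saturday
}
```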
For example, a `schedule` property directly on [the ClusterVersion `spec`][api-spec] would protect administrators from accidentally using the web console or `oc adm upgrade ...` to trigger an upgrade outside of the configured window.
Instead, administrators would have to adjust the `schedule` configuration or set an override option to trigger out-of-window updates.
Customization like this can also be addressed by intermediate [policy-engine][cincinnati-policy-engine], without involving the ClusterVersion configuration at all.
Teaching the cluster-version operator about a local `schedule` filter is effectively like a local policy engine.
There is a difference though:
- if a schedule is enforced locally by CVO, the operator can grow logic for an escape hatch for manual forced-updates (or only apply to auto-updates at all)
- if a schedule is enforced globally by policy-engine, it cannot be force-bypassed locally
if a schedule is enforced globally by policy-engine, it cannot be force-bypassed locally
Sure it can, you just set the update by pullspec and force the "don't worry if this is not in your graph" exception.
Only skimmed this, looks reasonable so far. One thing I would like to say though is that Red Hat hasn't truly completed the acquisition of CoreOS until we enable automatic updates by default (on by default in Fedora CoreOS, but not in OpenShift).
Ok, I think I've addressed most of @lucab's excellent feedback. Outstanding threads are:
// automaticUpdates enables automatic updates.
// +optional
AutomaticUpdates bool `json:"automaticUpdates,omitempty"`
I don't think booleans are great API as they are not extensible.
Having a boolean makes it so that we are saying the CVO only knows "update always" or "don't update", and it is never going to do anything else; any condition, if required, will be handled somewhere else.
Personally I think something like the union discriminator kubernetes/enhancements#926 is much better suited to express the intent: no-updates, auto-updates-always, auto-updates-schedule, etc.
There is an overhead trade-off that needs to be taken into consideration before saying the CVO doesn't do local stuff:
will a user with a single or a handful of clusters want to run a local policy agent, and is it sane for them to do so?
Will users that want different policies for their clusters, like testing/staging/prod, want to run 3 different policy agents?
Does the fact that we have no policy agent at this point, are not sure how far out that is, or even whether the concept of supporting a self-signed upstream is in scope, deter anybody from even using this, hence diminishing the value prop of the boolean?
Personally I would like clusters to support at least one use case of automatic updates based on a condition.
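For illustration, a rough sketch of the union-discriminator shape being suggested, with hypothetical field names that are not the current ClusterVersion API: the discriminator names the policy, each policy can carry its own optional configuration, and new modes can be added later without overloading a lone boolean.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UpdatePolicyType is the discriminator: which update behaviour is in effect.
// These names are illustrative, not an agreed API.
type UpdatePolicyType string

const (
	NoAutomaticUpdates UpdatePolicyType = "None"
	AlwaysUpdate       UpdatePolicyType = "Always"
	ScheduledUpdates   UpdatePolicyType = "Scheduled"
)

// ScheduledUpdatesConfig only applies when the discriminator is "Scheduled".
type ScheduledUpdatesConfig struct {
	// Windows holds calendar expressions for when updates may be initiated.
	Windows []string `json:"windows"`
}

// UpdatePolicy is a discriminated union: exactly one member matching the
// Policy field is expected to be set, leaving room for future modes that a
// bare boolean could not express.
type UpdatePolicy struct {
	Policy    UpdatePolicyType        `json:"policy"`
	Scheduled *ScheduledUpdatesConfig `json:"scheduled,omitempty"`
}

func main() {
	p := UpdatePolicy{
		Policy:    ScheduledUpdates,
		Scheduled: &ScheduledUpdatesConfig{Windows: []string{"Sat 02:00 + 1h"}},
	}
	out, _ := json.Marshal(p)
	fmt.Println(string(out))
}
```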
I don't think booleans are great API as they are not extensible.
I'm still not clear on what sort of extensibility you'd want. I think there's a set of policy engines which can be distributed in external chains or run inside the CVO itself, that eventually spit out a set of recommended update targets. Then you need a boolean knob to decide if the CVO automatically initiates an update when a target becomes available. For example:
- Upstream Cincinnati spits out graph A.
- On-site policy engine ingests graph A and applies local stability filtering, removing Red-Hat-recommended update edges until they have also passed local update testing. This engine spits out graph B.
- CVO ingests graph B via the `upstream` URI, and applies `schedule` filtering, removing all update edges if and only if the current time lies outside the configured update windows. This engine spits out graph C.
- CVO ingests its own graph C and looks for available updates from the current version, storing those in `AvailableUpdates`.
- CVO looks at `AvailableUpdates` and the `AutomaticUpdates` boolean, and decides whether to automatically set `DesiredUpdate`.
So there's still lots of room for extension. The boolean option is just an in-cluster knob for the existing CLI flag.
...any condition, if required, will be handled somewhere else...
No, you could still apply policy filtering on the CVO side, independent of whether the CVO automatically updates based on the results of a local policy engine. This is what I was trying to convey here.
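A compressed sketch of the last two steps of that flow, with hypothetical helper names: whatever chain of policy engines produced the final graph, computing the candidate and honoring the boolean is mechanical, and the knob only controls whether the target is applied automatically.

```go
package main

import (
	"fmt"
	"sort"
)

// Update is a simplified stand-in for an entry in AvailableUpdates.
type Update struct{ Version string }

// decide picks the update to apply, if any. availableUpdates would normally be
// computed from the filtered graph (graph C above); here it is just handed in.
func decide(availableUpdates []Update, automaticUpdates bool) (Update, bool) {
	if !automaticUpdates || len(availableUpdates) == 0 {
		return Update{}, false
	}
	// Pick the highest version; real code would compare SemVer rather than
	// relying on a lexical sort.
	sort.Slice(availableUpdates, func(i, j int) bool {
		return availableUpdates[i].Version < availableUpdates[j].Version
	})
	return availableUpdates[len(availableUpdates)-1], true
}

func main() {
	target, ok := decide([]Update{{"4.3.0"}, {"4.3.1"}}, true)
	fmt.Println(target, ok) // the target that would become the new DesiredUpdate
}
```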
Administrators can configure the cluster to push those alerts out to the on-call administrator to recover the cluster.
* Stability testing.
We are continually refining our CI suite and processing Telemetry from live clusters in order to assess the stability of each update.
We will not place updates in production channels unless they have proven themselves stable in earlier testing, and we will remove update recommendations from production channels if a live cluster trips over a corner case that we do not yet cover in pretesting.
Without the ability to control rollout, this seems like a risky feature. What is the minimal phased rollout feature that mitigates risk?
What is the minimal phased rollout feature that mitigates risk?
The ability to spread the update out across a release-admin-specified time window, with clusters in the given channel randomly spread across that window. Things like automatically pulling edges on failure would be nice, but can be mitigated with manual monitoring and large-duration windows (i.e. slow rollouts). Things like intelligently sorting sensitive clusters toward the back of the queue would also be nice, but can be mitigated by ensuring sufficient populations in less-stable channels.
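A minimal sketch of the "randomly spread across the window" part, assuming each cluster derives a deterministic offset from a hash of its cluster ID, so a release-admin-specified window of a given length staggers updates without any central coordination.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// offsetInWindow maps a cluster ID to a stable offset within a rollout window,
// so clusters in the same channel begin updating at different, evenly spread times.
func offsetInWindow(clusterID string, window time.Duration) time.Duration {
	h := fnv.New64a()
	h.Write([]byte(clusterID))
	return time.Duration(h.Sum64() % uint64(window))
}

func main() {
	window := 48 * time.Hour // release-admin-specified rollout window
	start := time.Date(2020, 2, 8, 0, 0, 0, 0, time.UTC)
	for _, id := range []string{"cluster-a", "cluster-b", "cluster-c"} {
		fmt.Println(id, "may begin updating at", start.Add(offsetInWindow(id, window)))
	}
}
```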
There are already canary clusters with polling automatic updates, so we can use those for testing and will not need to provision more long-running clusters to exercise this enhancement.
We can provision additional short-lived clusters in existing CI accounts if we want to provide additional end-to-end testing.

[api-pull-request]: https://github.com/openshift/api/pull/326
This pull request is now closed. So maybe we can remove it from here. Not sure what is the right action here.
This pull request is now closed.
It was closed pending this enhancement discussion. If the enhancement lands, I'll re-open and reroll the PR.
Allowing automatic updates would make it more likely that update failures happen when there is no administrator actively watching to notice and recover from the failure.
This is mitigated by:

* Alerting.
We actually know that certain (non-critical) alerts will fire during upgrades. The observability group (of which the in-cluster monitoring team is a part) has been trialing silencing everything but critical alerts during upgrades. While this is not perfect yet, I think this is what we ultimately want to automate for automatic upgrades.
That said, any `severity: critical` alerts firing should prevent an automatic upgrade from happening. This level is used to page people, so in a production cluster this should already mean that someone has been notified. Our goal must be that no critical alerts are firing during upgrades. Essentially this means we need to set certain objectives per component, and if they are violated we cannot proceed with an automatic upgrade and/or people are notified. I know the in-cluster team is planning an initiative to introduce these objectives for components (at this point earliest in the 4.5 time frame I would say).
We actually know that certain (non critical) alerts will fire during upgrades.
Do we consider these overly sensitive alerts? Update bugs? I'd expect that our goal is to have updates be smooth enough that we can apply them without additional alerts firing. Silencing non-critical alerts sounds like a reasonable stopgap to avoid alert fatigue, but I'm less comfortable with it as a long-term plan.
That said, any `severity: critical` alerts firing should prevent an automatic upgrade from happening.
Hmm. We might be able to swing this with the monitoring operator setting `Upgradeable=False` when it sees a critical alert.
Silencing is made exactly for the case of "we know a maintenance window is coming, so don't bother us with non-critical things", so I think it's exactly the right thing. Warning alerts are more the type of "something might be up, but it's not urgent". They should typically just open a ticket for someone to eventually take a look at. Critical alerts are the type of alerts that tell someone that there is something that needs attention urgently and that SLOs are likely to be violated. The latter should never fire during an upgrade, I agree.
We might be able to swing this with the monitoring operator setting Upgradeable=False when it sees a critical alert.
I think that's a good idea.
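A sketch of such a gate, assuming a reachable Prometheus HTTP API and omitting authentication: the monitoring operator would query for firing critical alerts and report `Upgradeable=False` on its ClusterOperator when any are returned (the status update itself is elided here).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// criticalAlertsFiring queries Prometheus for firing critical alerts. The
// endpoint URL and the missing authentication are simplifying assumptions.
func criticalAlertsFiring(promURL string) (bool, error) {
	q := url.Values{"query": []string{`ALERTS{severity="critical",alertstate="firing"}`}}
	resp, err := http.Get(promURL + "/api/v1/query?" + q.Encode())
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var body struct {
		Data struct {
			Result []json.RawMessage `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return len(body.Data.Result) > 0, nil
}

func main() {
	firing, err := criticalAlertsFiring("http://prometheus-k8s.openshift-monitoring.svc:9090")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	if firing {
		fmt.Println("critical alerts firing: the operator would report Upgradeable=False")
	} else {
		fmt.Println("no critical alerts: automatic upgrades may proceed")
	}
}
```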
@wking A few more thoughts on this. Sorry if they have been covered elsewhere.
Yup. And the CVO has had autoupdate logic to select the latest SemVer from available updates since 2018. This PR is just asking for a knob to enable that code. And if there were any stability issues with the longest recommended hop, the upstream service would remove that recommendation.
I'm not against that in general, but I am against it in this PR. If you feel that it's impossible to punt without getting painted into a corner, please explain why in an existing thread.
If you are running 4.6 and want to stay there, subscribe to channel
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle stale`. If this issue is safe to close now please do so with `/close`.
/lifecycle stale
/lifecycle frozen
Just waiting for everyone to get on board ;)
I think we should do either of the following:
- close the enhancement
- update the enhancement to capture that we are deferring implementation.

Before pursuing this further, we need to rationalize it with the present shared-responsibility model we have in OpenShift Dedicated and other managed environments that do view calendar gating as part of the MVP solution.
- "@crawford" | ||
- "@smarterclayton" | ||
approvers: | ||
- TBD |
suggest @crawford?
* Make it easy to opt in to and out of automatic updates.

### Non-Goals
For reference, OpenShift Dedicated has layered logic over the cluster to handle calendar gating.
The update settings presented for Dedicated are the following:
- automatic, with a user-supplied preferred day and start time.
- manual, with a warning that critical and high CVE vulnerabilities will be patched independent of this setting.
- a node-draining budget for maximum grace time is also configurable (default 1 hr).
- TBD
creation-date: 2019-11-19
last-updated: 2019-11-21
status: implementable
Possibly update this status to `deferred` given the frozen state, if we want to merge for record keeping?
[APPROVALNOTIFIER] This PR is NOT APPROVED
/close
@sdodson: Closed this PR.
To back openshift/api#326.
CC @abhinavdahiya, @bradmwilliams, @crawford, @lucab, @smarterclayton, @steveej, @stevekuznetsov, @vrutkovs