
enhancements/update/automatic-updates: Propose a new enhancement #124

Closed

Conversation

@wking (Member) commented Nov 19, 2019

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 19, 2019
@vrutkovs (Member) commented:

LGTM, is it worth noting that autoupdates would be limited to the selected channel?

Administrators can configure the cluster to push those alerts out to the on-call administrator to recover the cluster.
* Stability testing.
We are continually refining our CI suite and processing Telemetry from live clusters in order to assess the stability of each upgrade.
We will not place upgrades in production channels unless they have proven themselves stable in earlier testing, and we will remove upgrades from production channels if a live cluster trips over a corner case that we do not yet cover in pretesting.

s/remove/stop pushing/

Member Author

s/remove/stop pushing/

Expanded "we will remove upgrades from production channels" to "we will remove update recommendations from production channels". We're not really pushing anything, we're just serving graphs to clients like the CVO. Does the new wording look good to you, or do you have an alternative idea?

If that update proves unstable, many of those upgrades would already be in progress by the time the first Telemetry comes back with failure messages.
A phased rollout would limit the number of simultaneously updating clusters to give Telemetry time to come back so we could stop recommending upgrades that proved unstable on live clusters not yet covered in pretesting.

There is also a security risk where a compromised upstream Cincinnati could recommend cluster updates that were not in the cluster's best interest (e.g. 4.2.4 -> 4.1.0).

Not sure if relevant for OpenShift mitigations, but Zincati has client-side checks and knobs to prevent auto-downgrades: https://github.com/coreos/zincati/blob/dbb0b0a8884435f2b1186b2228199cd4adb6f705/docs/usage/auto-updates.md#updates-ordering-and-downgrades

Member Author

Not sure if relevant for OpenShift mitigations...

We could grow this. But sometimes you might want to recommend a downgrade. E.g., 4.y.7 turns out to explode after 24h because of broken cert rotation, and a fixed 4.y.8 is 48h out, so you recommend 4.y.7 -> 4.y.6 until then to get folks back to a safe place.

Contributor

I think it makes sense to version updates separately from the cluster versions, e.g.

update-1.0 => (ocp-4.y.6 -> ocp-4.y.7)
update-1.1 => (ocp-4.y.7 -> ocp-4.y.6)
update-1.2 => (ocp-4.y.6 -> ocp-4.y.8, ocp-4.y.7 -> ocp-4.y.8)

You always want the latest update version, which may or may not upgrade you to the highest cluster version.

Another way to address rollback attacks, as well as freeze attacks: TUF uses timestamps to certify updates for short durations, requiring constant refreshing to ensure clients have the latest update metadata. Timestamp ordering is simple enough, but does require secure and accurate time sources (for TUF, see the timestamp.json section: tuf-spec.md#4-document-formats).

Member

For libostree the timestamp in the commit is covered by the GPG signature; we don't do anything about comparing with the system wall clock, just require that the timestamp in the new commit increases.

The TUF threat model also helps against a DoS attack where a MITM attacker just tells you there are no more updates. This is useful, but comes with a lot of overhead.

Member

For OpenShift, though, the MCO ignores that bit, and obviously the libostree part isn't used for the rest of the container images anyway.

So I guess I'm just arguing that "signed timestamps" work pretty well and would likely be easy to add to the CVO if it doesn't do it today.

Member Author

You always want the latest update version, which may or may not upgrade you to the highest cluster version.

We've talked about signing update recommendations, but you'd probably need to sign each of them separately. E.g.

```yaml
from: 4.3.0
to: 4.3.1
expires: 2020-02-08T00:00Z
```

That's possible, but would be a fair bit of work to put in. Protections like "require admin overrides before applying downgrades" are coarser (and as above, sometimes you want downgrades), but would be easier to implement in the short term.
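
For illustration, a minimal Go sketch of the per-recommendation expiry check described above; the `Recommendation` type and its fields are hypothetical (mirroring the from/to/expires example), and the actual signature verification is elided:

```go
package recommend

import (
	"fmt"
	"time"
)

// Recommendation is a hypothetical signed update recommendation, mirroring the
// from/to/expires example above. Signature verification is assumed to have
// happened before this check and is not shown.
type Recommendation struct {
	From    string
	To      string
	Expires time.Time
}

// Validate rejects stale recommendations, so a frozen or replayed graph cannot
// keep steering clusters toward an update that is no longer recommended.
func Validate(r Recommendation, now time.Time) error {
	if now.After(r.Expires) {
		return fmt.Errorf("recommendation %s -> %s expired at %s", r.From, r.To, r.Expires.Format(time.RFC3339))
	}
	return nil
}
```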

So I guess I'm just arguing that "signed timestamps" work pretty well and would likely be easy to add to the CVO if it doesn't do it today.

For the releases themselves, we have a version number and creation timestamp, both of which are covered by the image signature. Sorting on version number seems more sane to me, because we're cutting 4.2.18 after 4.3.0, and 4.3.0 -> 4.2.18 is probably not what most clusters want ;).
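
As a tiny illustration of why sorting on version beats sorting on build time, here is a Go snippet using the blang/semver library (chosen for the sketch, not asserted to be what the CVO vendors):

```go
package main

import (
	"fmt"

	"github.com/blang/semver/v4"
)

func main() {
	older := semver.MustParse("4.2.18") // cut after 4.3.0, but still a lower version
	newer := semver.MustParse("4.3.0")
	// Sorting on SemVer keeps a 4.3.0 cluster from "updating" to 4.2.18,
	// even though 4.2.18 has the newer creation timestamp.
	fmt.Println(older.LT(newer)) // true
}
```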

Member

Right...when you have multiple branches you do want something more sophisticated. We ended up implementing ref binding in libostree, which avoids parsing the version strings. But I guess for the fully general case of switching branches, you do need something that is parsing them.


There are also potential future mitigations:

* The cluster-version operator could be taught to limit automatic upgrades to those which appear in the [signed metadata][cluster-version-operator-release-metadata] of either the source or target release.

I don't think this is a mitigation that actually fits in the rest of design/release-engineering (or more likely, I didn't understand your suggestion).
In particular, it seems to conflict with the "rollouts based on Telemetry" point above.

Member Author

I don't think this is a mitigation that actually fits in the rest of design/release-engineering...

This is a guard against malicious Cincinnati, whose risk exists independent of this enhancement. This enhancement just increases exposure to the preexisting risk.

- Tues..Thurs *-11..1-* 11:00
```

The schedule structure was designed to support maintenance windows, allowing for updates on certain days or during certain times.
@lucab commented Nov 20, 2019

The entries shown above are single points in time, not maintenance windows.
That is, they specify when a maintenance event can start (edge-trigger) but not how long it can be active (level-trigger). I fear this is a relevant semantic detail if there is any polling logic involved.

Member Author

That is, they specify when a maintenance event can start (edge-trigger) but not how long it can be active (level-trigger).

There's no cap on upgrade duration. If you're concerned about it, you'd have small windows early in your day for initiating upgrades, to leave lots of time for monitoring/recovering before you went home. But this is an alternative proposal; I dunno how deeply we want to dig into it here.


Oh sorry, I realize my comment was not worded clearly enough. I'm not trying to cap upgrade duration.

I'm stating that the schedule format would need to define a timespan, even just for the starting event.
That is, "allow auto-update events between 02:00 and 03:00 on Sat" instead of "allow auto-update events exactly at 02:00:00.00 on Sat".

The current syntax is taken from systemd timestamps, which defines a single specific point in time instead.

But I guess all of these are implementation details, so feel free to defer this discussion.
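
To make the edge-trigger vs. level-trigger distinction concrete, here is a rough Go sketch of the window (level-trigger) reading; the types are invented for illustration and are not a proposal for the actual schedule format:

```go
package schedule

import "time"

// UpdateWindow is an invented type: a weekly window during which an automatic
// update may start, e.g. Saturday from 02:00 for one hour.
type UpdateWindow struct {
	Weekday time.Weekday
	Start   time.Duration // offset from local midnight
	Length  time.Duration
}

// MayStart reports whether an automatic update is allowed to begin at t.
func MayStart(windows []UpdateWindow, t time.Time) bool {
	sinceMidnight := time.Duration(t.Hour())*time.Hour + time.Duration(t.Minute())*time.Minute
	for _, w := range windows {
		if t.Weekday() == w.Weekday && sinceMidnight >= w.Start && sinceMidnight < w.Start+w.Length {
			return true
		}
	}
	return false
}
```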

For example, a `schedule` property directly on [the ClusterVersion `spec`][api-spec] would protect administrators from accidentally using the web console or `oc adm upgrade ...` to trigger an upgrade outside of the configured window.
Instead, administrators would have to adjust the `schedule` configuration or set an override option to trigger out-of-window updates.
Customization like this can also be addressed by intermediate [policy-engine][cincinnati-policy-engine], without involving the ClusterVersion configuration at all.
Teaching the cluster-version operator about a local `schedule` filter is effectively running a local policy engine inside the CVO.

There is a difference though:

  • if a schedule is enforced locally by the CVO, the operator can grow logic for an escape hatch for manual forced updates (or apply the schedule only to auto-updates)
  • if a schedule is enforced globally by policy-engine, it cannot be force-bypassed locally

Member Author

if a schedule is enforced globally by policy-engine, it cannot be force-bypassed locally

Sure it can: you just set the update by pullspec and force the "don't worry if this is not in your graph" exception.

@cgwalters (Member) commented Nov 20, 2019

Only skimmed this, looks reasonable so far.

One thing I do like to say though is that Red Hat hasn't truly completed the acquisition of CoreOS until we enable automatic updates by default (on by default in Fedora CoreOS, but not in OpenShift).

@wking (Member Author) commented Nov 21, 2019

Ok, I think I've addressed most of @lucab's excellent feedback. Outstanding threads are:

  • Wording around edge removal. I've made some changes, but would like feedback on the new wording.
  • Malicious downgrade protection (also here). Do we consider compromised Cincinnati a high enough risk to need guards here before we add an auto-update knob?
  • Details of the calendar idea (also here). I think we both agree we can punt that as peripheral, but I haven't marked the thread resolved because I want the information to be discoverable by anyone who does pick up a calendar-guard enhancement.

Comment on lines +54 to +56
```go
// automaticUpdates enables automatic updates.
// +optional
AutomaticUpdates bool `json:"automaticUpdates,omitempty"`
```
Contributor

I don't think booleans are a great API, as they are not extensible.

Having a boolean says that the CVO only knows "always update" or "never update", that it is never going to do anything else, and that any conditions, if required, will be handled somewhere else.

Personally I think something like a union discriminator (kubernetes/enhancements#926) is much better suited to express the intents no-updates, auto-updates-always, auto-updates-schedule, etc.

https://github.com/openshift/enhancements/pull/124/files#diff-8354eb4f233832a1e5afa3ffc9bea8e8R154-R160

There is an overhead trade-off that needs to be taken into consideration before saying the CVO doesn't do local stuff:
will a user with a single cluster, or a handful of clusters, want to run a local policy agent, and is it sane for them to do so?
will users that want different policies for their testing/staging/prod clusters want to run three different policy agents?

Does the fact that we have no policy agent at this point (and it is not clear how far out that is, or even whether a self-signed upstream would be supported) deter anybody from even using this, hence diminishing the value proposition of the boolean?

Personally I would like clusters to support at least one use case of automatic updates based on conditions.
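
For concreteness, a hedged sketch of what a union-discriminator shape could look like; the type and field names are invented for illustration and are not from openshift/api:

```go
// AutomaticUpdatesMode is a hypothetical discriminator expressing the intent.
type AutomaticUpdatesMode string

const (
	AutomaticUpdatesDisabled  AutomaticUpdatesMode = "Disabled"
	AutomaticUpdatesAlways    AutomaticUpdatesMode = "Always"
	AutomaticUpdatesScheduled AutomaticUpdatesMode = "Scheduled"
)

// AutomaticUpdatesPolicy would replace the plain boolean knob.
type AutomaticUpdatesPolicy struct {
	// mode selects how the CVO initiates updates.
	// +unionDiscriminator
	Mode AutomaticUpdatesMode `json:"mode"`

	// schedule is honored only when mode is "Scheduled".
	// +optional
	Schedule *UpdateSchedule `json:"schedule,omitempty"`
}

// UpdateSchedule is a placeholder for whatever schedule format is eventually chosen.
type UpdateSchedule struct {
	// ...
}
```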

Member Author

I don't think booleans are a great API, as they are not extensible.

I'm still not clear on what sort of extensibility you'd want. I think there's a set of policy engines, which can be distributed in external chains or run inside the CVO itself, that eventually spits out a set of recommended update targets. Then you need a boolean knob to decide whether the CVO automatically initiates an update when a target becomes available. For example:

  1. Upstream Cincinnati spits out graph A.
  2. On-site policy engine ingests graph A and applies local stability filtering, removing Red-Hat-recommended update edges until they have also passed local update testing. This engine spits out graph B.
  3. CVO ingests graph B via the upstream URI, and applies schedule filtering, removing all update edges if and only if the current time lies outside the configured update windows. This engine spits out graph C.
  4. CVO ingests its own graph C and looks for available updates from the current version, storing those in AvailableUpdates.
  5. CVO looks at AvailableUpdates and the AutomaticUpdates boolean, and decides whether to automatically set DesiredUpdate.

So there's still lots of room for extension. The boolean option is just an in-cluster knob for the existing CLI flag.

...any condition if required will be done somewhere else...

No, you could still apply policy filtering on the CVO side, independent of whether the CVO automatically updates based on the results of a local policy engine. This is what I was trying to convey here.
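
A loose sketch of the kind of edge filtering each engine in that chain performs; the types are invented and much simpler than the real Cincinnati graph format:

```go
package policy

// Edge is a recommended update from one release to another.
type Edge struct{ From, To string }

// Graph is a trimmed-down stand-in for a Cincinnati update graph.
type Graph struct {
	Nodes []string
	Edges []Edge
}

// Filter returns a copy of the graph keeping only the edges the predicate
// accepts; a local-stability engine or a schedule filter is just a predicate.
func Filter(g Graph, keep func(Edge) bool) Graph {
	out := Graph{Nodes: g.Nodes}
	for _, e := range g.Edges {
		if keep(e) {
			out.Edges = append(out.Edges, e)
		}
	}
	return out
}
```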

Administrators can configure the cluster to push those alerts out to the on-call administrator to recover the cluster.
* Stability testing.
We are continually refining our CI suite and processing Telemetry from live clusters in order to assess the stability of each update.
We will not place updates in production channels unless they have proven themselves stable in earlier testing, and we will remove update recommendations from production channels if a live cluster trips over a corner case that we do not yet cover in pretesting.
Contributor

Without the ability to control rollout, this seems like a risky feature. What is the minimal phased rollout feature that mitigates risk?

Member Author

What is the minimal phased rollout feature that mitigates risk?

The ability to spread the update out across a release-admin-specified time window, with clusters in the given channel randomly spread across that window. Things like automatically pulling edges on failure would be nice, but can be mitigated with manual monitoring and large-duration windows (i.e. slow rollouts). Things like intelligently sorting sensitive clusters toward the back of the queue would also be nice, but can be mitigated by ensuring sufficient populations in less-stable channels.
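
One way such a spread could work, as a sketch under the assumption that the update service can hash a stable cluster identifier (none of this exists today):

```go
package rollout

import (
	"hash/fnv"
	"time"
)

// edgeVisible reports whether a new update edge should be served to clusterID,
// given the rollout start time and the release-admin-specified window. Each
// cluster gets a deterministic offset within the window, so the population is
// spread out and Telemetry has time to come back before everyone has updated.
func edgeVisible(clusterID string, start time.Time, window time.Duration, now time.Time) bool {
	h := fnv.New64a()
	h.Write([]byte(clusterID))
	fraction := float64(h.Sum64()%10000) / 10000.0
	offset := time.Duration(fraction * float64(window))
	return !now.Before(start.Add(offset))
}
```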

There are already canary clusters with polling automatic updates, so we can use those for testing and will not need to provision more long-running clusters to exercise this enhancement.
We can provision additional short-lived clusters in existing CI accounts if we want to provide additional end-to-end testing.

[api-pull-request]: https://github.com/openshift/api/pull/326
Member

This pull request is now closed, so maybe we can remove it from here. I'm not sure what the right action is.

Member Author

This pull request is now closed.

It was closed pending this enhancement discussion. If the enhancement lands, I'll re-open and reroll the PR.

Allowing automatic updates would make it more likely that update failures happen when there is no administrator actively watching to notice and recover from the failure.
This is mitigated by:

* Alerting.

We actually know that certain (non-critical) alerts will fire during upgrades. The observability group (of which the in-cluster monitoring team is a part) has been trialing silencing everything but critical alerts during upgrades. While this is not perfect yet, I think this is what we ultimately want to automate for automatic upgrades.

That said, any `severity: critical` alerts firing should prevent an automatic upgrade from happening. This level is used to page people, so in a production cluster this should already mean that someone has been notified. Our goal must be that no critical alerts are firing during upgrades. Essentially this means we need to set certain objectives per component, and if they are violated we cannot proceed with an automatic upgrade and/or people are notified. I know the in-cluster team is planning an initiative to introduce these objectives for components (at this point, earliest in the 4.5 time frame, I would say).

Member Author

We actually know that certain (non critical) alerts will fire during upgrades.

Do we consider these overly sensitive alerts? Update bugs? I'd expect that our goal is to have updates be smooth enough that we can apply them without additional alerts firing. Silencing non-critical alerts sounds like a reasonable stopgap to avoid alert fatigue, but I'm less comfortable with it as a long-term plan.

That said any severity: critical alerts firing should prevent an automatic upgrade from happening.

Hmm. We might be able to swing this with the monitoring operator setting Upgradeable=False when it sees a critical alert.
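
A rough sketch of how an operator could check for firing critical alerts before allowing an update, using the Prometheus Go client; this is not the monitoring operator's actual logic, and the `ALERTS` severity label is assumed to be present on the alerts in question:

```go
package upgradegate

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// criticalAlertsFiring queries Prometheus for firing critical alerts; a caller
// could set Upgradeable=False on its ClusterOperator whenever this returns true.
func criticalAlertsFiring(ctx context.Context, promURL string) (bool, error) {
	client, err := api.NewClient(api.Config{Address: promURL})
	if err != nil {
		return false, err
	}
	result, _, err := promv1.NewAPI(client).Query(ctx, `ALERTS{alertstate="firing",severity="critical"}`, time.Now())
	if err != nil {
		return false, err
	}
	vector, ok := result.(model.Vector)
	return ok && len(vector) > 0, nil
}
```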

@brancz commented Feb 18, 2020

Silencing is made exactly for the case of "we know a maintenance window is coming, so don't bother us with non-critical things", so I think it's exactly the right thing. Warning alerts are more the type of "something might be up, but it's not urgent". They should typically just open a ticket for someone to eventually take a look at. Critical alerts are the type of alerts that tell someone that there is something that needs attention urgently and that SLOs are likely to be violated. The latter should never fire during an upgrade, I agree.

We might be able to swing this with the monitoring operator setting Upgradeable=False when it sees a critical alert.

I think that's a good idea.

@dofinn (Contributor) commented Jun 4, 2020

@wking A few more thoughts on this. Sorry if they have been covered elsewhere.

  1. If I am 3 releases behind the latest z-stream and enable automatic upgrades, what is the expectation? Am I upgraded to latest via the shortest path?

  2. +1 for calendar configuration

  3. I also see a need for Y vs Z stream policy. I might be ok with z-stream auto upgrades, but not Y. What are your thoughts?

@wking (Member Author) commented Jun 4, 2020

If I am 3 releases behind the latest z-stream and enable automatic upgrades, what is the expectation? Am I upgraded to latest via the shortest path?

Yup. And the CVO has had autoupdate logic to select the latest SemVer from available updates since 2018. This PR is just asking for a knob to enable that code. And if there were any stability issues with the longest recommended hop, the upstream service would remove that recommendation.

+1 for calendar configuration

I'm not against that in general, but I am against it in this PR. If you feel that it's impossible to punt without getting painted into a corner, please explain why in an existing thread.

I also see a need for Y vs Z stream policy. I might be ok with z-stream auto upgrades, but not Y.

If you are running 4.6 and want to stay there, subscribe to channel stable-4.6. If you are open to moving to 4.7, subscribe to stable-4.7. So channels provide the control you seek there today, and are orthogonal to this auto-update proposal.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 17, 2020
@wking (Member Author) commented Oct 22, 2020

/lifecycle frozen

Just waiting for everyone to get onboard ;)

@openshift-ci-robot openshift-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 22, 2020
@derekwaynecarr (Member) left a comment

I think we should do either of the following:

  1. close the enhancement
  2. update the enhancement to capture that we are deferring implementation.

Before pursuing this further, we need to rationalize it with the present shared-responsibility model we have in OpenShift Dedicated and other managed environments, which do view calendar gating as part of the MVP solution.

- "@crawford"
- "@smarterclayton"
approvers:
- TBD
Member

suggest @crawford ?


* Make it easy to opt in to and out of automatic updates.

### Non-Goals
Member

For reference, OpenShift Dedicated has layered logic over the cluster to handle calendar gating.

The update settings presented for Dedicated are the following:

  • automatic, with a user-supplied preferred day and start time.
  • manual, with a warning that critical and high CVE vulnerabilities will be patched independent of this setting.

The node-draining budget for maximum grace time is also configurable (default 1 hour).

- TBD
creation-date: 2019-11-19
last-updated: 2019-11-21
status: implementable
Member

Possibly update this status to "deferred", given the frozen state, if we want to merge for record keeping?

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign ashcrow after the PR has been reviewed.
You can assign the PR to them by writing /assign @ashcrow in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sdodson (Member) commented Jul 8, 2021

/close
We'll either re-open this or create a new enhancement when this is actually on the roadmap.

@openshift-ci openshift-ci bot closed this Jul 8, 2021
@openshift-ci bot commented Jul 8, 2021

@sdodson: Closed this PR.

In response to this:

/close
We'll either re-open this or create a new enhancement when this is actually on the roadmap.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.