Bug 1850057: stage OS updates (nicely) while etcd is still running #1897

cgwalters · 2020-07-02T23:21:08Z

See https://bugzilla.redhat.com/show_bug.cgi?id=1850057

(This content is canonically stored at https://github.com/cgwalters/workboard/tree/master/openshift/bz1850057-etcd-osupdate )

OS upgrade I/O competes with etcd

Currently the MCD does:

drain
apply updates
reboot

Now "drain" keeps both daemonsets and static pods running. Of those two, etcd is a static pod today. When we're applying OS updates, that can be a lot of I/O and (reportedly) compete with etcd.

We have two options:

kill etcd after draining (or even more strongly, stop kubelet)
"stage" updates gracefully while everything is running

I like option 2) better because we've put a whole lot of work into making the ostree stack support this "stage updates while system is running" and it'd be cool if OpenShift used it 😄 Another way to say this is - I think we want to minimize the time window in which the etcd cluster is missing a member, so the more work we can do while etcd is still running the better!

Links

Workboard:

Pull request in progress: #1957

Example failing jobs:

Note that both of those jobs jumped from RHEL 8.1 to 8.2.

Prometheus queries

histogram_quantile(0.99, sum(irate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)) 
histogram_quantile(0.99, sum(irate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))

The text was updated successfully, but these errors were encountered:

cgwalters · 2020-07-02T23:24:17Z

We also had problems in the past with applying config changes to /etc (including the static pod manifest) while the static pod was still running; I tried looking at git log --grep=etcd but didn't see it. If we change the flow to ensure we're stopping etcd before we apply changes on the filesystem in /etc, that would also have fixed that earlier problem.

(Bigger picture we're being pulled in two different directions here - openshift/enhancements#159 is all about "apply changes without rebooting!" whereas this direction is more "make updates more transactional as part of the reboot!")

smarterclayton · 2020-07-02T23:28:49Z

Apply updates without rebooting is not a system design goal yet :)

The design goal is “upgrade is completely non disruptive to the system”. Right now it’s possible that static pods like etcd might actually have to be shut down prior to any disruption. Note that it’s not clear what is going on yet.

cgwalters · 2020-07-02T23:35:42Z

Yep, we're in agreement there! I am hopeful we can mimimize the upgrade disruption - we aren't making any attempt to do so today, and I would be not all surprised to discover the oscontainer/rpm-ostree upgrade bit is disruptive as far as block IO goes.

cgwalters · 2020-07-02T23:47:53Z

To flesh this out a bit today, we implemented a lot of sophisticated logic as part of zincati that the MCO can use too. The basic thing is - if we happen to be interrupted after "staging" an update, we only want to actually have it take effect on reboot at the very end. If we e.g. fail to drain, or the kernel happens to crash and we get rebooted, we don't want that pending update to take effect!

See coreos/rpm-ostree#1814
and
https://github.com/coreos/zincati/blob/50d51ce97c63fd35fac9bf5c702e4fe299c156c6/src/rpm_ostree/cli_finalize.rs#L21

IOW the flow is like this:

if diff.oscontainer != "" {
  rpm_ostree_upgrade(diff.oscontainer, "--lock-finalization")
}
drain()
write_config_changes(diff.filesystem)
stop_kubelet()
finalize_deployment(diff.oscontainer)

Where only at that very last step do we perform the swap of the bootloader config (and that's basically it).

hexfusion · 2020-07-07T15:08:28Z

Update after triage from etcd team.

It appears that we have a direct correlation to increased I/O latencies leading to possible control-plane disruption and the chain of events that happens during upgrade.

steps performed for data experiment

manually created GCP IPI cluster 4.4.11
deployed systemd unit for atop monitoring via MachineConfig https://gist.github.com/hexfusion/400d43d511a967814f5ccd8a7e09db8c
performed upgrade to 4.5-rc-6
to ensure atop time series datafile was preserved over upgrade I copied it to ~root/ right before node reboot.
observed metrics from cluster noting timestamp for I/O spike
stepped through atop timeseries and observed load during spike.

results as observed from sbatsche-atop10-2020-vrpt6-master-1

approx 8:49 p99 fpr etcd metrics etcd_disk_wal_fsync_duration_seconds_bucket and etcd_disk_backend_commit_duration_seconds_bucket spike.

peaking I/O activity during this time series, in general, I/O remains higher

podman pull (update payload pull?)

atop 8:48:55 podman pull -q ... starts disks report 61% busy r/0 w/1847 MBw/s 266.3 avio 0.55 ms

kworker/u8:4+fl

atop 8:48:27 kworker/u8:4+fl starts disks report 78% busy r/0 w/14900 MBw/s 242.1 avio 0.32 ms

rpm-ostree

atop 8:49:21 /usr/bin/rpm-ostree start-daemon starts and instantly disks report 77% busy r/981 w/57 avio 0.74 ms
atop 8:49:25 /usr/bin/rpm-ostree start-daemon running disks report 78% busy r/404 w/1234 MBw/s 257.0 avio 0.42 ms
atop 8:49:26 /usr/bin/rpm-ostree start-daemon running disks report 91% busy r/853 w/1268 MBw/s 229.7 avio 0.41 ms
atop 8:49:27 /usr/bin/rpm-ostree start-daemon running disks report 95% busy r/1162 w/75 MBw/s 229.7 avio 0.69 ms

crio

atop 8:49:27 /usr/bin/crio wipe starts disks report 87% busy r/896 w/281 MBw/s 8.9 avio 0.74 ms

kube scheduler

atop 8:52:21 kube-scheduler spikes disks report 73% busy r/1124 w/0 MBr/s 27.3 | MBw/s 3.1 avio 0.64 ms

conclusion/open question: can we perform these actions in a way that less disruption to fs.

Both `upgrade` and `deploy` already support this. There's no reason why all the remaining "deployment-creating" commands shouldn't. Prompted by openshift/machine-config-operator#1897 which will need this specifically for `rebase`.

jlebon · 2020-07-07T21:21:40Z

coreos/rpm-ostree#2158

Both `upgrade` and `deploy` already support this. There's no reason why all the remaining "deployment-creating" commands shouldn't. Prompted by openshift/machine-config-operator#1897 which will need this specifically for `rebase`.

cgwalters · 2020-07-08T19:13:47Z

OK so I'm playing with things like IOWeight= and IOSchedulingClass=idle etc. (see e.g. https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html http://0pointer.de/blog/projects/resources.html ) and...none of it is having any effect.

I'm playing with this etcd fio job: etcd-io/etcd#10577 (comment)

(Does etcd really use O_DIRECT? I need to verify that)

Also as far as I can tell, we're basically not exposing or using filesystem/block IO cgroups to pods right? (Basically just cpu/memory)
So this issue of balancing etcd with ostree is just a subset of the generic I/O control for workloads. I think this is related to cgroups v2: https://andrestc.com/post/cgroups-io/ - definitely for ostree we use buffered writes. We may have to just manually tune down our write speed.

mrunalp · 2020-07-09T00:39:01Z

We have discussed throttling image pulls as well in podman/crio till we can use cgroups v2 features.

We need to expose more control over how fast we write to disk in order to allow OS vendors to more finely tune latency vs throughput for OS updates. See openshift/machine-config-operator#1897 This is a crude hack I'm using for testing; need to also expose this via API, document it etc.

cgwalters · 2020-07-09T15:22:08Z

I am working on a writeup around this in https://hackmd.io/WeqiDWMAQP2sNtuPRul9QA

jlebon · 2020-07-10T16:25:14Z

Run a HTTP server (kube service) that mounts the oscontainer

This makes a lot of sense to me and clearly scales better with the size of the cluster. (We could even go fancier in the future and have the service generate static deltas to trade CPU cycles on one node for even more I/O efficiency across the cluster; and at least CPU cycles are more easily accountable/schedulable.)

jlebon · 2020-07-10T16:27:13Z

Another advantage of that approach is that given that network speed is normally slower than disk speed, you also automatically get less demanding disk I/O usage.

We need to expose more control over how fast we write to disk in order to allow OS vendors to more finely tune latency vs throughput for OS updates. See openshift/machine-config-operator#1897 This `OSTREE_PULL_MAX_OUTSTANDING_WRITES` is a crude hack I'm using for testing; need to also expose this via API, document it etc. Second, add a `per-object-fsync` option which goes back to invoking `fsync()` on each object just after writing it. Combined, these two ensure natural "backpressure" against trying to fsync a huge amount of data all at once, which I believe is what is leading to the huge etcd fsync latency.

cgwalters · 2020-07-15T17:11:29Z

More updates in the doc but to summarize: It seems that what ostree is doing today can introduce e.g. 2 second latency for etcd fdatasync. We can rework things so we fsync as we go, avoiding a single big flush. This slows down the update a lot, but also greatly reduces the max latency (from ~2s to .46s).

In other words, if we take this from:

  fsync/fdatasync/sync_file_range:
    sync (usec): min=732, max=1935.7k, avg=2594.14, stdev=19014.96
    sync percentiles (usec):
     |  1.00th=[   832],  5.00th=[   881], 10.00th=[   914], 20.00th=[   971],
     | 30.00th=[  1020], 40.00th=[  1090], 50.00th=[  2933], 60.00th=[  3032],
     | 70.00th=[  3064], 80.00th=[  3130], 90.00th=[  3228], 95.00th=[  3425],
     | 99.00th=[  6194], 99.50th=[  7963], 99.90th=[ 17433], 99.95th=[ 95945],
     | 99.99th=[843056]

to:

  fsync/fdatasync/sync_file_range:
    sync (usec): min=741, max=461949, avg=3970.41, stdev=6696.28
    sync percentiles (usec):
     |  1.00th=[   824],  5.00th=[   898], 10.00th=[   947], 20.00th=[  1029],
     | 30.00th=[  2073], 40.00th=[  2868], 50.00th=[  3032], 60.00th=[  3163],
     | 70.00th=[  4113], 80.00th=[  6194], 90.00th=[  8586], 95.00th=[ 10028],
     | 99.00th=[ 13042], 99.50th=[ 15139], 99.90th=[ 35390], 99.95th=[ 87557],
     | 99.99th=[459277]

is that sufficiently better to call this "fixed"?

cgwalters · 2020-07-15T17:14:05Z

Now that I look more closely we've mostly swapped having a huge stdev (and maximum) for a much worse average, and actually pushing the 99th percentile over 10ms. Argh.

hexfusion · 2020-07-15T17:28:00Z

TBH the 10ms is more or less arbitrary in this case. A comparison where we felt this threshold held true was with clusters without load. So don't let these results dampen hope.

cgwalters · 2020-07-15T17:31:51Z

Can you speak to how bad would be those potential huge 2s stalls for etcd be versus a longer period of increased latency? Does it sound likely that having huge spikes or two like that is the root cause for disruption?

hexfusion · 2020-07-15T17:56:44Z

when we are looking at "why are we doing this", "why does it matter" we need to consider the following. The latency on disk is directly related to election timeouts (leader elections) as the time in which it takes to persist raft transactions to the WAL file is very high. So if we go to 2s fsync with for example Azure, where the election timeout is 2500ms we are pushing the cluster towards timeout.

Where if instead, we have much lower yet elevated latency even over a longer time series we could see api with some slowness but chances are far less we introduce election as a result. So the net gain is less disruption. Also consider the operators who use leader election. They must update TTL every 10s or go through leader election process. So if etcd has huge spike in latency it could result in an operator perhaps not being able to update configmap in time.

In short its easier for etcd to tolerate elevated latency over a period of time vs very heavy spike.

cgwalters · 2020-07-15T18:16:11Z

Here's another sub-thread though: Since we know we're very shortly going to be rebooting this machine, would it help to tell etcd "if this node is a leader, please re-elect a different one"? Or to rephrase, are latency spikes much worse of a problem on the leader?

ironcladlou · 2020-07-15T18:18:59Z

Here's another sub-thread though: Since we know we're very shortly going to be rebooting this machine, would it help to tell etcd "if this node is a leader, please re-elect a different one"? Or to rephrase, are latency spikes much worse of a problem on the leader?

This is a good question, I think it has been discussed before elsewhere in more depth. I think @hexfusion has some input here re: actively designating a leader intelligently (if we can predict the reboot order)

cgwalters · 2020-07-15T18:23:26Z

Moved that to openshift/cluster-etcd-operator#392

hexfusion · 2020-07-15T18:24:05Z

Yeah the net result of rolling reboots will be leader change right it is going to happen, leader will go down. When etcd gets SIGTERM it will try to gracefully transfer leadership vs election. But the result is still "leader change" which results in a fail-fast to clients to retry (TTL resets etc). Because technically there is a time where we have no leader even if just a few ms. So while we expect 1 leader change what we see as a result here can be multiple.

cgwalters · 2020-07-15T18:35:43Z

So while we expect 1 leader change what we see as a result here can be multiple.

Hmm...actually there can be anywhere between 1 and 3 elections - if the MCO happens to choose the leader each time to upgrade. Right? So I think perhaps instead of openshift/cluster-etcd-operator#392 the MCO should prefer upgrading followers first...right?

hexfusion · 2020-07-15T18:48:03Z

Hmm...actually there can be anywhere between 1 and 3 elections - if the MCO happens to choose the leader each time to upgrade. Right? So I think perhaps instead of openshift/cluster-etcd-operator#392 the MCO should prefer upgrading followers first...right?

I don't disagree that chance is currently involved. Logically choosing non-leader to upgrade first would help hedge those bets.

cgwalters · 2020-07-15T19:35:59Z

I don't disagree that chance is currently involved. Logically choosing non-leader to upgrade first would help hedge those bets.

I can say for sure that the MCO code today makes no attempt to apply any kind of ordering/priority to the nodes it's going to update. It could in fact be effectively random.

Adding a little bit of logic here for the control plane to avoid the etcd leader should be pretty easy...is the current leader reflected anywhere in the API? I am not seeing it in oc describe clusteroperator/etcd or oc get etcd etc.

We have "real" OS update tests which look at a previous build; this is generally good, but I want to be able to reliably test e.g. a "large" upgrade in some CI scenarios, and it's OK if the upgrade isn't "real". This command takes an ostree commit and adds a note to a percentage of ELF binaries. This way one can generate a "large" update by specifying e.g. `--percentage=80` or so. I plan to use this for testing etcd performance during large updates; see openshift/machine-config-operator#1897

Part of solving openshift#1897 A lot more details in https://hackmd.io/WeqiDWMAQP2sNtuPRul9QA The TL;DR is that the `bfq` I/O scheduler better respects IO priorities, and also does a better job of handling latency sensitive processes like `etcd` versus bulk/background I/O .

We switched rpm-ostree to do this when applying updates, but it also makes sense to do when extracting the oscontainer. Part of: openshift#1897 Which is about staging OS updates more nicely when etcd is running.

The way we're talking to etcd is a bit hacky, I ended up cargo culting some code. This would be much cleaner if the etcd operator did it. But it's critical that we update the etcd followers first, because leader elections are disruptive events and we can easily minimize that. Closes: openshift#1897

ironcladlou · 2020-08-13T14:50:41Z

The tricky thing is: it seems useful to measure success (at least partially) by number of leader elections. And if we do that we need the MCO to be doing the right thing in the first place.

Anyways like I said, I think we can just hack this into the MCO code directly for now, so we're not blocked.

Would it help if we were able to see the p99 fsync and commit metrics aggregated across a meaningfully sized set of CI runs that included the nontrivial OS upgrade? Absolute election count is one important metric, but it seems to me that we're also interested in knowing how much headroom we're providing to assess the risk of inducing election cycles in the near future even if an election didn't actually happen during the test.

Can/should we come up with some clear SLOs against these metrics for some specific test suites?

We have "real" OS update tests which look at a previous build; this is generally good, but I want to be able to reliably test e.g. a "large" upgrade in some CI scenarios, and it's OK if the upgrade isn't "real". This command takes an ostree commit and adds a note to a percentage of ELF binaries. This way one can generate a "large" update by specifying e.g. `--percentage=80` or so. I plan to use this for testing etcd performance during large updates; see openshift/machine-config-operator#1897

Part of solving openshift#1897 A lot more details in https://hackmd.io/WeqiDWMAQP2sNtuPRul9QA The TL;DR is that the `bfq` I/O scheduler better respects IO priorities, and also does a better job of handling latency sensitive processes like `etcd` versus bulk/background I/O .

We switched rpm-ostree to do this when applying updates, but it also makes sense to do when extracting the oscontainer. Part of: openshift#1897 Which is about staging OS updates more nicely when etcd is running.

We have "real" OS update tests which look at a previous build; this is generally good, but I want to be able to reliably test e.g. a "large" upgrade in some CI scenarios, and it's OK if the upgrade isn't "real". This command takes an ostree commit and adds a note to a percentage of ELF binaries. This way one can generate a "large" update by specifying e.g. `--percentage=80` or so. Building on that, also add a wrapper which generates an update from an oscontainer. I plan to use this for testing etcd performance during large updates; see openshift/machine-config-operator#1897

openshift-bot · 2020-11-19T02:44:04Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2020-12-19T04:40:28Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2021-01-18T06:29:33Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot · 2021-01-18T06:29:48Z

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jlebon mentioned this issue Jul 7, 2020

app: Add --lock-finalization switch to other commands coreos/rpm-ostree#2158

Merged

cgwalters mentioned this issue Jul 9, 2020

pull: Add OSTREE_PULL_MAX_OUTSTANDING_WRITES env var ostreedev/ostree#2147

Closed

cgwalters mentioned this issue Jul 15, 2020

switch leader if MCO is going to reboot a node openshift/cluster-etcd-operator#392

Closed

openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 19, 2020

openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 19, 2020

openshift-ci-robot closed this as completed Jan 18, 2021

cgwalters mentioned this issue Jan 6, 2022

Cleanup from staged deployments ostreedev/ostree#2510

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug 1850057: stage OS updates (nicely) while etcd is still running #1897

Bug 1850057: stage OS updates (nicely) while etcd is still running #1897

cgwalters commented Jul 2, 2020 •

edited

Loading

cgwalters commented Jul 2, 2020

smarterclayton commented Jul 2, 2020

cgwalters commented Jul 2, 2020

cgwalters commented Jul 2, 2020

hexfusion commented Jul 7, 2020 •

edited

Loading

jlebon commented Jul 7, 2020

cgwalters commented Jul 8, 2020

mrunalp commented Jul 9, 2020

cgwalters commented Jul 9, 2020

jlebon commented Jul 10, 2020 •

edited

Loading

Run a HTTP server (kube service) that mounts the oscontainer

jlebon commented Jul 10, 2020

cgwalters commented Jul 15, 2020

cgwalters commented Jul 15, 2020 •

edited

Loading

hexfusion commented Jul 15, 2020

cgwalters commented Jul 15, 2020

hexfusion commented Jul 15, 2020

cgwalters commented Jul 15, 2020 •

edited

Loading

ironcladlou commented Jul 15, 2020

cgwalters commented Jul 15, 2020

hexfusion commented Jul 15, 2020

cgwalters commented Jul 15, 2020 •

edited

Loading

hexfusion commented Jul 15, 2020

cgwalters commented Jul 15, 2020

ironcladlou commented Aug 13, 2020

openshift-bot commented Nov 19, 2020

openshift-bot commented Dec 19, 2020

openshift-bot commented Jan 18, 2021

openshift-ci-robot commented Jan 18, 2021

Bug 1850057: stage OS updates (nicely) while etcd is still running #1897

Bug 1850057: stage OS updates (nicely) while etcd is still running #1897

Comments

cgwalters commented Jul 2, 2020 • edited Loading

OS upgrade I/O competes with etcd

Links

Example failing jobs:

Prometheus queries

cgwalters commented Jul 2, 2020

smarterclayton commented Jul 2, 2020

cgwalters commented Jul 2, 2020

cgwalters commented Jul 2, 2020

hexfusion commented Jul 7, 2020 • edited Loading

peaking I/O activity during this time series, in general, I/O remains higher

podman pull (update payload pull?)

kworker/u8:4+fl

rpm-ostree

crio

kube scheduler

jlebon commented Jul 7, 2020

cgwalters commented Jul 8, 2020

mrunalp commented Jul 9, 2020

cgwalters commented Jul 9, 2020

jlebon commented Jul 10, 2020 • edited Loading

Run a HTTP server (kube service) that mounts the oscontainer

jlebon commented Jul 10, 2020

cgwalters commented Jul 15, 2020

cgwalters commented Jul 15, 2020 • edited Loading

hexfusion commented Jul 15, 2020

cgwalters commented Jul 15, 2020

hexfusion commented Jul 15, 2020

cgwalters commented Jul 15, 2020 • edited Loading

ironcladlou commented Jul 15, 2020

cgwalters commented Jul 15, 2020

hexfusion commented Jul 15, 2020

cgwalters commented Jul 15, 2020 • edited Loading

hexfusion commented Jul 15, 2020

cgwalters commented Jul 15, 2020

ironcladlou commented Aug 13, 2020

openshift-bot commented Nov 19, 2020

openshift-bot commented Dec 19, 2020

openshift-bot commented Jan 18, 2021

openshift-ci-robot commented Jan 18, 2021

cgwalters commented Jul 2, 2020 •

edited

Loading

hexfusion commented Jul 7, 2020 •

edited

Loading

jlebon commented Jul 10, 2020 •

edited

Loading

cgwalters commented Jul 15, 2020 •

edited

Loading

cgwalters commented Jul 15, 2020 •

edited

Loading

cgwalters commented Jul 15, 2020 •

edited

Loading