-
Notifications
You must be signed in to change notification settings - Fork 410
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1850057: stage OS updates (nicely) while etcd is still running #1897
Comments
We also had problems in the past with applying config changes to (Bigger picture we're being pulled in two different directions here - openshift/enhancements#159 is all about "apply changes without rebooting!" whereas this direction is more "make updates more transactional as part of the reboot!") |
Apply updates without rebooting is not a system design goal yet :) The design goal is “upgrade is completely non disruptive to the system”. Right now it’s possible that static pods like etcd might actually have to be shut down prior to any disruption. Note that it’s not clear what is going on yet. |
Yep, we're in agreement there! I am hopeful we can mimimize the upgrade disruption - we aren't making any attempt to do so today, and I would be not all surprised to discover the oscontainer/rpm-ostree upgrade bit is disruptive as far as block IO goes. |
To flesh this out a bit today, we implemented a lot of sophisticated logic as part of zincati that the MCO can use too. The basic thing is - if we happen to be interrupted after "staging" an update, we only want to actually have it take effect on reboot at the very end. If we e.g. fail to drain, or the kernel happens to crash and we get rebooted, we don't want that pending update to take effect! See coreos/rpm-ostree#1814 IOW the flow is like this:
Where only at that very last step do we perform the swap of the bootloader config (and that's basically it). |
Update after triage from etcd team. It appears that we have a direct correlation to increased I/O latencies leading to possible control-plane disruption and the chain of events that happens during upgrade. steps performed for data experiment
results as observed from
peaking I/O activity during this time series, in general, I/O remains higherpodman pull (update payload pull?)
kworker/u8:4+fl
rpm-ostree
crio
kube scheduler
conclusion/open question: can we perform these actions in a way that less disruption to fs. |
Both `upgrade` and `deploy` already support this. There's no reason why all the remaining "deployment-creating" commands shouldn't. Prompted by openshift/machine-config-operator#1897 which will need this specifically for `rebase`.
Both `upgrade` and `deploy` already support this. There's no reason why all the remaining "deployment-creating" commands shouldn't. Prompted by openshift/machine-config-operator#1897 which will need this specifically for `rebase`.
OK so I'm playing with things like I'm playing with this etcd fio job: etcd-io/etcd#10577 (comment) (Does etcd really use Also as far as I can tell, we're basically not exposing or using filesystem/block IO cgroups to pods right? (Basically just cpu/memory) |
We have discussed throttling image pulls as well in podman/crio till we can use cgroups v2 features. |
We need to expose more control over how fast we write to disk in order to allow OS vendors to more finely tune latency vs throughput for OS updates. See openshift/machine-config-operator#1897 This is a crude hack I'm using for testing; need to also expose this via API, document it etc.
I am working on a writeup around this in https://hackmd.io/WeqiDWMAQP2sNtuPRul9QA |
This makes a lot of sense to me and clearly scales better with the size of the cluster. (We could even go fancier in the future and have the service generate static deltas to trade CPU cycles on one node for even more I/O efficiency across the cluster; and at least CPU cycles are more easily accountable/schedulable.) |
Another advantage of that approach is that given that network speed is normally slower than disk speed, you also automatically get less demanding disk I/O usage. |
We need to expose more control over how fast we write to disk in order to allow OS vendors to more finely tune latency vs throughput for OS updates. See openshift/machine-config-operator#1897 This `OSTREE_PULL_MAX_OUTSTANDING_WRITES` is a crude hack I'm using for testing; need to also expose this via API, document it etc. Second, add a `per-object-fsync` option which goes back to invoking `fsync()` on each object just after writing it. Combined, these two ensure natural "backpressure" against trying to fsync a huge amount of data all at once, which I believe is what is leading to the huge etcd fsync latency.
More updates in the doc but to summarize: It seems that what ostree is doing today can introduce e.g. 2 second latency for etcd fdatasync. We can rework things so we fsync as we go, avoiding a single big flush. This slows down the update a lot, but also greatly reduces the max latency (from ~2s to .46s). In other words, if we take this from:
to:
is that sufficiently better to call this "fixed"? |
Now that I look more closely we've mostly swapped having a huge stdev (and maximum) for a much worse average, and actually pushing the 99th percentile over 10ms. Argh. |
TBH the 10ms is more or less arbitrary in this case. A comparison where we felt this threshold held true was with clusters without load. So don't let these results dampen hope. |
Can you speak to how bad would be those potential huge 2s stalls for etcd be versus a longer period of increased latency? Does it sound likely that having huge spikes or two like that is the root cause for disruption? |
when we are looking at "why are we doing this", "why does it matter" we need to consider the following. The latency on disk is directly related to election timeouts (leader elections) as the time in which it takes to persist raft transactions to the WAL file is very high. So if we go to 2s fsync with for example Azure, where the election timeout is 2500ms we are pushing the cluster towards timeout. Where if instead, we have much lower yet elevated latency even over a longer time series we could see api with some slowness but chances are far less we introduce election as a result. So the net gain is less disruption. Also consider the operators who use leader election. They must update TTL every 10s or go through leader election process. So if etcd has huge spike in latency it could result in an operator perhaps not being able to update configmap in time. In short its easier for etcd to tolerate elevated latency over a period of time vs very heavy spike. |
Here's another sub-thread though: Since we know we're very shortly going to be rebooting this machine, would it help to tell etcd "if this node is a leader, please re-elect a different one"? Or to rephrase, are latency spikes much worse of a problem on the leader? |
This is a good question, I think it has been discussed before elsewhere in more depth. I think @hexfusion has some input here re: actively designating a leader intelligently (if we can predict the reboot order) |
Moved that to openshift/cluster-etcd-operator#392 |
Yeah the net result of rolling reboots will be leader change right it is going to happen, leader will go down. When etcd gets SIGTERM it will try to gracefully transfer leadership vs election. But the result is still "leader change" which results in a fail-fast to clients to retry (TTL resets etc). Because technically there is a time where we have no leader even if just a few ms. So while we expect 1 leader change what we see as a result here can be multiple. |
Hmm...actually there can be anywhere between 1 and 3 elections - if the MCO happens to choose the leader each time to upgrade. Right? So I think perhaps instead of openshift/cluster-etcd-operator#392 the MCO should prefer upgrading followers first...right? |
I don't disagree that chance is currently involved. Logically choosing non-leader to upgrade first would help hedge those bets. |
I can say for sure that the MCO code today makes no attempt to apply any kind of ordering/priority to the nodes it's going to update. It could in fact be effectively random. Adding a little bit of logic here for the control plane to avoid the etcd leader should be pretty easy...is the current leader reflected anywhere in the API? I am not seeing it in |
We have "real" OS update tests which look at a previous build; this is generally good, but I want to be able to reliably test e.g. a "large" upgrade in some CI scenarios, and it's OK if the upgrade isn't "real". This command takes an ostree commit and adds a note to a percentage of ELF binaries. This way one can generate a "large" update by specifying e.g. `--percentage=80` or so. I plan to use this for testing etcd performance during large updates; see openshift/machine-config-operator#1897
We have "real" OS update tests which look at a previous build; this is generally good, but I want to be able to reliably test e.g. a "large" upgrade in some CI scenarios, and it's OK if the upgrade isn't "real". This command takes an ostree commit and adds a note to a percentage of ELF binaries. This way one can generate a "large" update by specifying e.g. `--percentage=80` or so. I plan to use this for testing etcd performance during large updates; see openshift/machine-config-operator#1897
Part of solving openshift#1897 A lot more details in https://hackmd.io/WeqiDWMAQP2sNtuPRul9QA The TL;DR is that the `bfq` I/O scheduler better respects IO priorities, and also does a better job of handling latency sensitive processes like `etcd` versus bulk/background I/O .
We switched rpm-ostree to do this when applying updates, but it also makes sense to do when extracting the oscontainer. Part of: openshift#1897 Which is about staging OS updates more nicely when etcd is running.
The way we're talking to etcd is a bit hacky, I ended up cargo culting some code. This would be much cleaner if the etcd operator did it. But it's critical that we update the etcd followers first, because leader elections are disruptive events and we can easily minimize that. Closes: openshift#1897
Would it help if we were able to see the p99 fsync and commit metrics aggregated across a meaningfully sized set of CI runs that included the nontrivial OS upgrade? Absolute election count is one important metric, but it seems to me that we're also interested in knowing how much headroom we're providing to assess the risk of inducing election cycles in the near future even if an election didn't actually happen during the test. Can/should we come up with some clear SLOs against these metrics for some specific test suites? |
We have "real" OS update tests which look at a previous build; this is generally good, but I want to be able to reliably test e.g. a "large" upgrade in some CI scenarios, and it's OK if the upgrade isn't "real". This command takes an ostree commit and adds a note to a percentage of ELF binaries. This way one can generate a "large" update by specifying e.g. `--percentage=80` or so. I plan to use this for testing etcd performance during large updates; see openshift/machine-config-operator#1897
Part of solving openshift#1897 A lot more details in https://hackmd.io/WeqiDWMAQP2sNtuPRul9QA The TL;DR is that the `bfq` I/O scheduler better respects IO priorities, and also does a better job of handling latency sensitive processes like `etcd` versus bulk/background I/O .
We switched rpm-ostree to do this when applying updates, but it also makes sense to do when extracting the oscontainer. Part of: openshift#1897 Which is about staging OS updates more nicely when etcd is running.
We have "real" OS update tests which look at a previous build; this is generally good, but I want to be able to reliably test e.g. a "large" upgrade in some CI scenarios, and it's OK if the upgrade isn't "real". This command takes an ostree commit and adds a note to a percentage of ELF binaries. This way one can generate a "large" update by specifying e.g. `--percentage=80` or so. Building on that, also add a wrapper which generates an update from an oscontainer. I plan to use this for testing etcd performance during large updates; see openshift/machine-config-operator#1897
We have "real" OS update tests which look at a previous build; this is generally good, but I want to be able to reliably test e.g. a "large" upgrade in some CI scenarios, and it's OK if the upgrade isn't "real". This command takes an ostree commit and adds a note to a percentage of ELF binaries. This way one can generate a "large" update by specifying e.g. `--percentage=80` or so. Building on that, also add a wrapper which generates an update from an oscontainer. I plan to use this for testing etcd performance during large updates; see openshift/machine-config-operator#1897
We have "real" OS update tests which look at a previous build; this is generally good, but I want to be able to reliably test e.g. a "large" upgrade in some CI scenarios, and it's OK if the upgrade isn't "real". This command takes an ostree commit and adds a note to a percentage of ELF binaries. This way one can generate a "large" update by specifying e.g. `--percentage=80` or so. Building on that, also add a wrapper which generates an update from an oscontainer. I plan to use this for testing etcd performance during large updates; see openshift/machine-config-operator#1897
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /close |
@openshift-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
See https://bugzilla.redhat.com/show_bug.cgi?id=1850057
(This content is canonically stored at https://github.com/cgwalters/workboard/tree/master/openshift/bz1850057-etcd-osupdate )
OS upgrade I/O competes with etcd
Currently the MCD does:
Now "drain" keeps both daemonsets and static pods running. Of those two, etcd is a static pod today. When we're applying OS updates, that can be a lot of I/O and (reportedly) compete with etcd.
We have two options:
I like option 2) better because we've put a whole lot of work into making the ostree stack support this "stage updates while system is running" and it'd be cool if OpenShift used it 😄 Another way to say this is - I think we want to minimize the time window in which the etcd cluster is missing a member, so the more work we can do while etcd is still running the better!
Links
Workboard:
Pull request in progress: #1957
Example failing jobs:
Note that both of those jobs jumped from RHEL 8.1 to 8.2.
Prometheus queries
The text was updated successfully, but these errors were encountered: