Add coreos-installer install --fetch-retries
#565
Conversation
(Tested this interactively locally, but want to test the
OK yup, now also tested the Though... this does reveal a downside of the shotgun approach we're going with here. Basically, we're retrying everything (similar to curl's I'll tweak the logic here to only retry transient errors.
This is done now!
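To make the "only retry transient errors" idea above concrete, here is a minimal sketch. It is not the code from this PR: `is_transient`, `retry_transient`, and the particular set of `std::io::ErrorKind` values treated as transient are all assumptions chosen for illustration, and the real change may classify errors differently (for example, by HTTP status).

```rust
use std::io::{self, ErrorKind};
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical classification: which errors are worth retrying.
/// The set used by the actual PR is not shown in this conversation.
fn is_transient(err: &io::Error) -> bool {
    matches!(
        err.kind(),
        ErrorKind::ConnectionRefused
            | ErrorKind::ConnectionReset
            | ErrorKind::ConnectionAborted
            | ErrorKind::TimedOut
            | ErrorKind::Interrupted
    )
}

/// Run `op`, retrying on transient errors only.
/// `max_retries = None` means retry forever; `Some(n)` allows n retries
/// after the first attempt. Non-transient errors are returned immediately.
fn retry_transient<T>(
    max_retries: Option<u32>,
    delay: Duration,
    mut op: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    let mut attempt: u32 = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if is_transient(&e) && max_retries.map_or(true, |max| attempt < max) => {
                attempt = attempt.saturating_add(1);
                eprintln!("transient error: {}; retry {}", e, attempt);
                sleep(delay);
            }
            // Permanent error, or retries exhausted: give up.
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Simulated fetch: times out twice, then succeeds.
    let mut calls = 0;
    let result = retry_transient(Some(5), Duration::from_millis(10), || {
        calls += 1;
        if calls < 3 {
            Err(io::Error::new(ErrorKind::TimedOut, "simulated timeout"))
        } else {
            Ok("fetched")
        }
    });
    println!("{:?}", result);
}
```

The point of the split is that a permanent failure (say, a missing file) surfaces on the first attempt instead of being hidden behind a long or infinite retry loop.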
It looks like https://bugzilla.redhat.com/show_bug.cgi?id=1974411 at least is probably something else and not just a race condition. That said, I think this is still valuable to have regardless.
This was tested successfully by a customer in https://bugzilla.redhat.com/show_bug.cgi?id=1967483#c16, though in that specific case it might hint towards a deeper integration issue with NetworkManager.
We originally did this in coreos#326 because we wanted to support booting the live ISO without networking. This was solved on the initramfs side by the conditional networking work (coreos#426). But for the real root, this was still useful because if booting the ISO interactively on a system without any network, or a non-DHCP network, we didn't want the user to have to wait until the service timed out before getting a shell.

The core issue however is that we're requesting `network-online.target` at all. It's an "active unit" which means that it's only pulled in the transaction, possibly delaying boot, if another systemd unit needs it. And ideally, no service would need it as per:

https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

In our case, this unit was fedora-coreos-pinger. We drop that requirement here:

coreos/fedora-coreos-pinger#41

With that, we no longer pull in `network-online.target` and so no longer delay reaching the console even if NetworkManager isn't able to get an active connection for whatever reason. This matches how it works on traditional Fedora as well.

Having a short timeout actually also had a counterproductive effect in the automated install case. There, `coreos-installer.service` does pull in `network-online.target` (which with coreos/coreos-installer#565 we could consider dropping as advised by systemd, though we probably should bump the number of retries some more in that case), but because of the short timeout, we genuinely may not yet have the network fully up before we run (see https://bugzilla.redhat.com/show_bug.cgi?id=1967483).
As per coreos/fedora-coreos-config#1088, I think we could consider as a follow-up to this removing our requirement on
Updated for comments!
src/cmdline.rs
Outdated
    Arg::with_name("http-retries")
        .long("http-retries")
        .value_name("N")
        .help("HTTP retries, or string \"infinite\"")
Before someone suggests making `-1` mean infinite: not opposed, but I personally think this is much more explicit and self-documenting. There's also precedent with `sleep infinity`.
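As a rough illustration of the "N or `infinite`" value discussed above, the following sketch parses such a string into an explicit enum. The `FetchRetries` type and its `FromStr` implementation are hypothetical names made up for this example; they are not taken from the PR's `src/cmdline.rs`, and the sketch deliberately uses only the standard library rather than guessing at how the value is wired into the argument parser.

```rust
use std::str::FromStr;

/// Hypothetical type for illustration only: either a bounded retry count
/// or the explicit keyword "infinite".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FetchRetries {
    Infinite,
    Finite(u32),
}

impl FromStr for FetchRetries {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "infinite" => Ok(FetchRetries::Infinite),
            // Anything else must be a non-negative integer; "-1" is rejected
            // rather than silently treated as "forever".
            n => n
                .parse::<u32>()
                .map(FetchRetries::Finite)
                .map_err(|e| format!("invalid retry count {:?}: {}", n, e)),
        }
    }
}

fn main() {
    for arg in ["0", "3", "infinite", "-1"].iter() {
        match arg.parse::<FetchRetries>() {
            Ok(r) => println!("{} -> {:?}", arg, r),
            Err(e) => println!("{} -> error: {}", arg, e),
        }
    }
}
```

With an enum like this, the "retry forever" case is a distinct variant rather than a magic number, which is the self-documenting property the comment argues for.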
coreos-installer install --http-retries
coreos-installer install --fetch-retries
👍
We're hitting issues in RHCOS right now where it's possible that `coreos-installer.service` is racing with networking being fully up, even though we're ordered after `network-online.target` and `systemd-resolved.service` (though RHCOS doesn't use the latter).

The issue may not be a race in the end, but misconfigured networking. But anyway, we really should be retrying fetches.

I gated this behind a switch instead of doing it by default for all fetches, because e.g. interactively I think it makes more sense not to retry. (And similarly for other commands like `download` and `list-stream`).

Related: https://bugzilla.redhat.com/show_bug.cgi?id=1974411
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1967483
Closes: #283
The `Wants=network-online.target` predates the addition of fetch retries. Now that we retry HTTP requests indefinitely in the automated flow, let's follow best practices and stop pulling in `network-online.target`.

Still keep it as an `After` though; *if* something pulls in `network-online.target`, then we might as well be smarter and start our retries after we know we're online.

Also add `network.target` so that if `network-online.target` *isn't* pulled in, we still have a reasonable lower bound on when we start retrying.

Related: coreos/fedora-coreos-config#1088
Related: coreos#565 (comment)
Closes: https://github.com/coreos/coreos-installer/issues/1334
The `Wants=network-online.target` predates the osmet work which enabled the now default fully offline install flow. It also predates the addition of fetch retries. So even in an online flow, now that we retry HTTP requests indefinitely, we don't really need this. Let's follow best practices and stop pulling in `network-online.target`.

Still keep it as an `After` though; *if* something pulls in `network-online.target`, then we might as well be nicer and start our retries after we know we're online.

Also add `network.target` so that if `network-online.target` *isn't* pulled in, we still have a reasonable lower bound on when we start retrying.

Related: coreos/fedora-coreos-config#1088
Related: coreos#565 (comment)
Related: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
Closes: https://github.com/coreos/coreos-installer/issues/1334
The `Wants=network-online.target` predates the osmet work which enabled the now default fully offline install flow. It also predates the addition of fetch retries. So even in an online flow, now that we retry HTTP requests indefinitely, we don't really need this. Let's follow best practices and stop pulling in `network-online.target`.

We still need to keep pulling in `NetworkManager.service` though. It's enabled by default in `multi-user.target` but not `coreos-installer-post.target` (which is what we boot to in an automated install). NetworkManager is capable of handling offline environments just fine and won't block the install just because a connection isn't available (yet or ever).

Related: coreos/fedora-coreos-config#1088
Related: coreos#565 (comment)
Related: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
Closes: https://github.com/coreos/coreos-installer/issues/1334