Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add coreos-installer install --fetch-retries #565

Merged
merged 1 commit into from
Jul 28, 2021
Merged

Add coreos-installer install --fetch-retries #565

merged 1 commit into from
Jul 28, 2021

Conversation

jlebon
Copy link
Member

@jlebon jlebon commented Jun 24, 2021

We're hitting issues in RHCOS right now where it's possible that
coreos-installer.service is racing with networking being fully up,
even though we're ordered after network-online.target and
systemd-resolved.service (though RHCOS doesn't use the latter).

The issue may not be a race in the end, but misconfigured networking.
But anyway, we really should be retrying fetches.

I gated this behind a switch instead of doing it by default for all
fetches, because e.g. interactively I think it makes more sense not to
retry. (And similarly for other commands like download and
list-stream).

Related: https://bugzilla.redhat.com/show_bug.cgi?id=1974411
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1967483

Closes: #283

@jlebon
Copy link
Member Author

jlebon commented Jun 24, 2021

(Tested this interactively locally, but want to test the coreos.inst.* flow as well.)

@jlebon jlebon marked this pull request as ready for review June 24, 2021 16:01
@jlebon
Copy link
Member Author

jlebon commented Jun 24, 2021

OK yup, now also tested the coreos.inst.* case correctly retries.

Though... this does reveal a downside of the shotgun approach we're going with here. Basically, we're retrying everything (similar to curl's --retry-all-errors), even if the server returned e.g. 404. This is obvious when using the coreos.inst.insecure karg: we currently always download the signature, even in insecure mode, but the signature may not exist at all, so we'd be retrying for no reason really.

I'll tweak the logic here to only retry transient errors.

@jlebon
Copy link
Member Author

jlebon commented Jun 24, 2021

I'll tweak the logic here to only retry transient errors.

This is done now!

@jlebon
Copy link
Member Author

jlebon commented Jun 30, 2021

It looks like https://bugzilla.redhat.com/show_bug.cgi?id=1974411 at least is probably something else and not just a race condition. That said, I think this is still valuable to have regardless.

@jlebon
Copy link
Member Author

jlebon commented Jul 5, 2021

This was tested successfully by a customer in https://bugzilla.redhat.com/show_bug.cgi?id=1967483#c16, though in that specific case it might hint towards a deeper integration issue with NetworkManager.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this pull request Jul 7, 2021
We originally did this in coreos#326 because we wanted to support booting the
live ISO without networking. This was solved on the initramfs side by
the conditional networking work (coreos#426). But for the real root, this was
still useful because if booting the ISO interactively on a system
without any network, or a non-DHCP network, we didn't want the user to
have to wait until the service timed out before getting a shell.

The core issue however is that we're requesting `network-online.target`
at all. It's an "active unit" which means that it's only pulled in the
transaction, possibly delaying boot, if another systemd unit needs it.
And ideally, no service would need it as per:

https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

In our case, this unit was fedora-coreos-pinger. We drop that
requirement here:

coreos/fedora-coreos-pinger#41

With that, we no longer pull in `network-online.target` and so no longer
delay reaching the console even if NetworkManager isn't able to get an
active connection for whatever reason. This matches how it works on
traditional Fedora as well.

Having a short timeout actually also had a counterproductive effect in
the automated install case. There, `coreos-installer.service` does pull
in `network-online.target` (which with
coreos/coreos-installer#565 we could consider
dropping as advised by systemd, though we probably should bump the
number of retries some more in that case), but because of the short
timeout, we genuinely may not yet have the network fully up before we
run (see https://bugzilla.redhat.com/show_bug.cgi?id=1967483).
@jlebon
Copy link
Member Author

jlebon commented Jul 7, 2021

As per coreos/fedora-coreos-config#1088, I think we could consider as a follow-up to this removing our requirement on network-online.target. We don't necessarily need networking since by default we install offline, and now that we have retries we can deal with the network taking time to come up (though probably we should still use network.target at least as a lower bound from which to start the retries, even if it may not necessarily mean much in practice in terms of network state).

jlebon added a commit to jlebon/fedora-coreos-config that referenced this pull request Jul 8, 2021
We originally did this in coreos#326 because we wanted to support booting the
live ISO without networking. This was solved on the initramfs side by
the conditional networking work (coreos#426). But for the real root, this was
still useful because if booting the ISO interactively on a system
without any network, or a non-DHCP network, we didn't want the user to
have to wait until the service timed out before getting a shell.

The core issue however is that we're requesting `network-online.target`
at all. It's an "active unit" which means that it's only pulled in the
transaction, possibly delaying boot, if another systemd unit needs it.
And ideally, no service would need it as per:

https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

In our case, this unit was fedora-coreos-pinger. We drop that
requirement here:

coreos/fedora-coreos-pinger#41

With that, we no longer pull in `network-online.target` and so no longer
delay reaching the console even if NetworkManager isn't able to get an
active connection for whatever reason. This matches how it works on
traditional Fedora as well.

Having a short timeout actually also had a counterproductive effect in
the automated install case. There, `coreos-installer.service` does pull
in `network-online.target` (which with
coreos/coreos-installer#565 we could consider
dropping as advised by systemd, though we probably should bump the
number of retries some more in that case), but because of the short
timeout, we genuinely may not yet have the network fully up before we
run (see https://bugzilla.redhat.com/show_bug.cgi?id=1967483).
jlebon added a commit to coreos/fedora-coreos-config that referenced this pull request Jul 8, 2021
We originally did this in #326 because we wanted to support booting the
live ISO without networking. This was solved on the initramfs side by
the conditional networking work (#426). But for the real root, this was
still useful because if booting the ISO interactively on a system
without any network, or a non-DHCP network, we didn't want the user to
have to wait until the service timed out before getting a shell.

The core issue however is that we're requesting `network-online.target`
at all. It's an "active unit" which means that it's only pulled in the
transaction, possibly delaying boot, if another systemd unit needs it.
And ideally, no service would need it as per:

https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

In our case, this unit was fedora-coreos-pinger. We drop that
requirement here:

coreos/fedora-coreos-pinger#41

With that, we no longer pull in `network-online.target` and so no longer
delay reaching the console even if NetworkManager isn't able to get an
active connection for whatever reason. This matches how it works on
traditional Fedora as well.

Having a short timeout actually also had a counterproductive effect in
the automated install case. There, `coreos-installer.service` does pull
in `network-online.target` (which with
coreos/coreos-installer#565 we could consider
dropping as advised by systemd, though we probably should bump the
number of retries some more in that case), but because of the short
timeout, we genuinely may not yet have the network fully up before we
run (see https://bugzilla.redhat.com/show_bug.cgi?id=1967483).
scripts/coreos-installer-service Outdated Show resolved Hide resolved
src/source.rs Outdated Show resolved Hide resolved
src/cmdline.rs Outdated Show resolved Hide resolved
src/cmdline.rs Outdated Show resolved Hide resolved
src/source.rs Outdated Show resolved Hide resolved
src/source.rs Outdated Show resolved Hide resolved
src/source.rs Outdated Show resolved Hide resolved
src/source.rs Outdated Show resolved Hide resolved
@jlebon
Copy link
Member Author

jlebon commented Jul 15, 2021

Updated for comments!

src/cmdline.rs Outdated
Arg::with_name("http-retries")
.long("http-retries")
.value_name("N")
.help("HTTP retries, or string \"infinite\"")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before someone suggests making -1 mean infinite: not opposed, but I personally think this is much more explicit and self-documenting. There's also precedence with sleep infinity.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this pull request Jul 19, 2021
We originally did this in coreos#326 because we wanted to support booting the
live ISO without networking. This was solved on the initramfs side by
the conditional networking work (coreos#426). But for the real root, this was
still useful because if booting the ISO interactively on a system
without any network, or a non-DHCP network, we didn't want the user to
have to wait until the service timed out before getting a shell.

The core issue however is that we're requesting `network-online.target`
at all. It's an "active unit" which means that it's only pulled in the
transaction, possibly delaying boot, if another systemd unit needs it.
And ideally, no service would need it as per:

https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

In our case, this unit was fedora-coreos-pinger. We drop that
requirement here:

coreos/fedora-coreos-pinger#41

With that, we no longer pull in `network-online.target` and so no longer
delay reaching the console even if NetworkManager isn't able to get an
active connection for whatever reason. This matches how it works on
traditional Fedora as well.

Having a short timeout actually also had a counterproductive effect in
the automated install case. There, `coreos-installer.service` does pull
in `network-online.target` (which with
coreos/coreos-installer#565 we could consider
dropping as advised by systemd, though we probably should bump the
number of retries some more in that case), but because of the short
timeout, we genuinely may not yet have the network fully up before we
run (see https://bugzilla.redhat.com/show_bug.cgi?id=1967483).

(cherry picked from commit dd54e8c)
jlebon added a commit to jlebon/fedora-coreos-config that referenced this pull request Jul 19, 2021
We originally did this in coreos#326 because we wanted to support booting the
live ISO without networking. This was solved on the initramfs side by
the conditional networking work (coreos#426). But for the real root, this was
still useful because if booting the ISO interactively on a system
without any network, or a non-DHCP network, we didn't want the user to
have to wait until the service timed out before getting a shell.

The core issue however is that we're requesting `network-online.target`
at all. It's an "active unit" which means that it's only pulled in the
transaction, possibly delaying boot, if another systemd unit needs it.
And ideally, no service would need it as per:

https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

In our case, this unit was fedora-coreos-pinger. We drop that
requirement here:

coreos/fedora-coreos-pinger#41

With that, we no longer pull in `network-online.target` and so no longer
delay reaching the console even if NetworkManager isn't able to get an
active connection for whatever reason. This matches how it works on
traditional Fedora as well.

Having a short timeout actually also had a counterproductive effect in
the automated install case. There, `coreos-installer.service` does pull
in `network-online.target` (which with
coreos/coreos-installer#565 we could consider
dropping as advised by systemd, though we probably should bump the
number of retries some more in that case), but because of the short
timeout, we genuinely may not yet have the network fully up before we
run (see https://bugzilla.redhat.com/show_bug.cgi?id=1967483).

(cherry picked from commit dd54e8c)
jlebon added a commit to jlebon/fedora-coreos-config that referenced this pull request Jul 19, 2021
We originally did this in coreos#326 because we wanted to support booting the
live ISO without networking. This was solved on the initramfs side by
the conditional networking work (coreos#426). But for the real root, this was
still useful because if booting the ISO interactively on a system
without any network, or a non-DHCP network, we didn't want the user to
have to wait until the service timed out before getting a shell.

The core issue however is that we're requesting `network-online.target`
at all. It's an "active unit" which means that it's only pulled in the
transaction, possibly delaying boot, if another systemd unit needs it.
And ideally, no service would need it as per:

https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

In our case, this unit was fedora-coreos-pinger. We drop that
requirement here:

coreos/fedora-coreos-pinger#41

With that, we no longer pull in `network-online.target` and so no longer
delay reaching the console even if NetworkManager isn't able to get an
active connection for whatever reason. This matches how it works on
traditional Fedora as well.

Having a short timeout actually also had a counterproductive effect in
the automated install case. There, `coreos-installer.service` does pull
in `network-online.target` (which with
coreos/coreos-installer#565 we could consider
dropping as advised by systemd, though we probably should bump the
number of retries some more in that case), but because of the short
timeout, we genuinely may not yet have the network fully up before we
run (see https://bugzilla.redhat.com/show_bug.cgi?id=1967483).

(cherry picked from commit dd54e8c)
jlebon added a commit to coreos/fedora-coreos-config that referenced this pull request Jul 19, 2021
We originally did this in #326 because we wanted to support booting the
live ISO without networking. This was solved on the initramfs side by
the conditional networking work (#426). But for the real root, this was
still useful because if booting the ISO interactively on a system
without any network, or a non-DHCP network, we didn't want the user to
have to wait until the service timed out before getting a shell.

The core issue however is that we're requesting `network-online.target`
at all. It's an "active unit" which means that it's only pulled in the
transaction, possibly delaying boot, if another systemd unit needs it.
And ideally, no service would need it as per:

https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

In our case, this unit was fedora-coreos-pinger. We drop that
requirement here:

coreos/fedora-coreos-pinger#41

With that, we no longer pull in `network-online.target` and so no longer
delay reaching the console even if NetworkManager isn't able to get an
active connection for whatever reason. This matches how it works on
traditional Fedora as well.

Having a short timeout actually also had a counterproductive effect in
the automated install case. There, `coreos-installer.service` does pull
in `network-online.target` (which with
coreos/coreos-installer#565 we could consider
dropping as advised by systemd, though we probably should bump the
number of retries some more in that case), but because of the short
timeout, we genuinely may not yet have the network fully up before we
run (see https://bugzilla.redhat.com/show_bug.cgi?id=1967483).

(cherry picked from commit dd54e8c)
jlebon added a commit to coreos/fedora-coreos-config that referenced this pull request Jul 19, 2021
We originally did this in #326 because we wanted to support booting the
live ISO without networking. This was solved on the initramfs side by
the conditional networking work (#426). But for the real root, this was
still useful because if booting the ISO interactively on a system
without any network, or a non-DHCP network, we didn't want the user to
have to wait until the service timed out before getting a shell.

The core issue however is that we're requesting `network-online.target`
at all. It's an "active unit" which means that it's only pulled in the
transaction, possibly delaying boot, if another systemd unit needs it.
And ideally, no service would need it as per:

https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

In our case, this unit was fedora-coreos-pinger. We drop that
requirement here:

coreos/fedora-coreos-pinger#41

With that, we no longer pull in `network-online.target` and so no longer
delay reaching the console even if NetworkManager isn't able to get an
active connection for whatever reason. This matches how it works on
traditional Fedora as well.

Having a short timeout actually also had a counterproductive effect in
the automated install case. There, `coreos-installer.service` does pull
in `network-online.target` (which with
coreos/coreos-installer#565 we could consider
dropping as advised by systemd, though we probably should bump the
number of retries some more in that case), but because of the short
timeout, we genuinely may not yet have the network fully up before we
run (see https://bugzilla.redhat.com/show_bug.cgi?id=1967483).

(cherry picked from commit dd54e8c)
@jlebon jlebon mentioned this pull request Jul 27, 2021
30 tasks
@jlebon jlebon changed the title Add coreos-installer install --http-retries Add coreos-installer install --fetch-retries Jul 27, 2021
Copy link
Contributor

@bgilbert bgilbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

src/source.rs Outdated Show resolved Hide resolved
src/source.rs Outdated Show resolved Hide resolved
We're hitting issues in RHCOS right now where it's possible that
`coreos-installer.service` is racing with networking being fully up,
even though we're ordered after `network-online.target` and
`systemd-resolved.service` (though RHCOS doesn't use the latter).

The issue may not be a race in the end, but misconfigured networking.
But anyway, we really should be retrying fetches.

I gated this behind a switch instead of doing it by default for all
fetches, because e.g. interactively I think it makes more sense not to
retry. (And similarly for other commands like `download` and
`list-stream`).

Related: https://bugzilla.redhat.com/show_bug.cgi?id=1974411
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1967483

Closes: #283
@jlebon jlebon enabled auto-merge July 28, 2021 17:36
@jlebon jlebon merged commit a19befc into coreos:main Jul 28, 2021
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this pull request Oct 10, 2023
We originally did this in coreos#326 because we wanted to support booting the
live ISO without networking. This was solved on the initramfs side by
the conditional networking work (coreos#426). But for the real root, this was
still useful because if booting the ISO interactively on a system
without any network, or a non-DHCP network, we didn't want the user to
have to wait until the service timed out before getting a shell.

The core issue however is that we're requesting `network-online.target`
at all. It's an "active unit" which means that it's only pulled in the
transaction, possibly delaying boot, if another systemd unit needs it.
And ideally, no service would need it as per:

https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

In our case, this unit was fedora-coreos-pinger. We drop that
requirement here:

coreos/fedora-coreos-pinger#41

With that, we no longer pull in `network-online.target` and so no longer
delay reaching the console even if NetworkManager isn't able to get an
active connection for whatever reason. This matches how it works on
traditional Fedora as well.

Having a short timeout actually also had a counterproductive effect in
the automated install case. There, `coreos-installer.service` does pull
in `network-online.target` (which with
coreos/coreos-installer#565 we could consider
dropping as advised by systemd, though we probably should bump the
number of retries some more in that case), but because of the short
timeout, we genuinely may not yet have the network fully up before we
run (see https://bugzilla.redhat.com/show_bug.cgi?id=1967483).
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this pull request Oct 10, 2023
We originally did this in coreos#326 because we wanted to support booting the
live ISO without networking. This was solved on the initramfs side by
the conditional networking work (coreos#426). But for the real root, this was
still useful because if booting the ISO interactively on a system
without any network, or a non-DHCP network, we didn't want the user to
have to wait until the service timed out before getting a shell.

The core issue however is that we're requesting `network-online.target`
at all. It's an "active unit" which means that it's only pulled in the
transaction, possibly delaying boot, if another systemd unit needs it.
And ideally, no service would need it as per:

https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

In our case, this unit was fedora-coreos-pinger. We drop that
requirement here:

coreos/fedora-coreos-pinger#41

With that, we no longer pull in `network-online.target` and so no longer
delay reaching the console even if NetworkManager isn't able to get an
active connection for whatever reason. This matches how it works on
traditional Fedora as well.

Having a short timeout actually also had a counterproductive effect in
the automated install case. There, `coreos-installer.service` does pull
in `network-online.target` (which with
coreos/coreos-installer#565 we could consider
dropping as advised by systemd, though we probably should bump the
number of retries some more in that case), but because of the short
timeout, we genuinely may not yet have the network fully up before we
run (see https://bugzilla.redhat.com/show_bug.cgi?id=1967483).
jlebon added a commit to jlebon/coreos-installer that referenced this pull request Nov 6, 2023
The `Wants=network-online.target` predates the addition of fetch retries.
Now that we retry HTTP requests indefinitely in the automated flow, let's
follow best practices and stop pulling in `network-online.target`.

Still keep it as an `After` though; *if* something pulls in
`network-online.target`, then we might as well be smarter and start our
retries after we know we're online. Also add `network.target` so that
if `network-online.target` *isn't* pulled in, we still have a reasonable
lower bound on when we start retrying.

Related: coreos/fedora-coreos-config#1088
Related: coreos#565 (comment)
Closes: https://github.com/coreos/coreos-installer/issues/1334
jlebon added a commit to jlebon/coreos-installer that referenced this pull request Nov 6, 2023
The `Wants=network-online.target` predates the osmet work which enabled
the now default fully offline install flow.

It also predates the addition of fetch retries. So even in an online
flow, now that we retry HTTP requests indefinitely, we don't really
need this.

Let's follow best practices and stop pulling in `network-online.target`.

Still keep it as an `After` though; *if* something pulls in
`network-online.target`, then we might as well be nicer and start our
retries after we know we're online. Also add `network.target` so that
if `network-online.target` *isn't* pulled in, we still have a reasonable
lower bound on when we start retrying.

Related: coreos/fedora-coreos-config#1088
Related: coreos#565 (comment)
Related: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
Closes: https://github.com/coreos/coreos-installer/issues/1334
jlebon added a commit to jlebon/coreos-installer that referenced this pull request Nov 9, 2023
The `Wants=network-online.target` predates the osmet work which enabled
the now default fully offline install flow.

It also predates the addition of fetch retries. So even in an online
flow, now that we retry HTTP requests indefinitely, we don't really
need this.

Let's follow best practices and stop pulling in `network-online.target`.

We still need to keep pulling in `NetworkManager.service`
though. It's enabled by default in `multi-user.target` but not
`coreos-installer-post.target` (which is what we boot to in an automated
install).

NetworkManager is capable of handling offline environments just fine and
won't block the install just because a connection isn't available (yet
or ever).

Related: coreos/fedora-coreos-config#1088
Related: coreos#565 (comment)
Related: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
Closes: https://github.com/coreos/coreos-installer/issues/1334
jlebon added a commit to jlebon/coreos-installer that referenced this pull request Nov 9, 2023
The `Wants=network-online.target` predates the osmet work which enabled
the now default fully offline install flow.

It also predates the addition of fetch retries. So even in an online
flow, now that we retry HTTP requests indefinitely, we don't really
need this.

Let's follow best practices and stop pulling in `network-online.target`.

We still need to keep pulling in `NetworkManager.service`
though. It's enabled by default in `multi-user.target` but not
`coreos-installer-post.target` (which is what we boot to in an automated
install).

NetworkManager is capable of handling offline environments just fine and
won't block the install just because a connection isn't available (yet
or ever).

Related: coreos/fedora-coreos-config#1088
Related: coreos#565 (comment)
Related: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
Closes: https://github.com/coreos/coreos-installer/issues/1334
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retry HTTP requests
2 participants