Add a few retries around failure points in release scripting #1353

Closed
Verolop opened this issue Jun 10, 2020 · 15 comments
Assignees
Labels
area/release-eng: Issues or PRs related to the Release Engineering subproject
kind/bug: Categorizes issue or PR as related to a bug.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
sig/release: Categorizes an issue or PR as relevant to SIG Release.

Comments

@Verolop
Contributor

Verolop commented Jun 10, 2020

What would you like to be added:

Add some retries around failure points in the release scripting before failing completely. E.g.:

Why is this needed:

When a step fails, we often have to hack together a workaround and re-run the whole release process, which is very time- and resource-consuming. Adding retries would let these attempts happen within the same run.

justaugustus transferred this issue from kubernetes/sig-release Jun 10, 2020
k8s-ci-robot added the needs-kind and needs-priority labels Jun 10, 2020
justaugustus added the area/release-eng, kind/bug, priority/important-soon, and sig/release labels Jun 10, 2020
k8s-ci-robot removed the needs-kind and needs-priority labels Jun 10, 2020
@justaugustus
Member

/remove-priority important-soon
/priority critical-urgent

k8s-ci-robot added the priority/critical-urgent label and removed the priority/important-soon label Jun 10, 2020
@tpepper
Member

tpepper commented Jun 10, 2020

Simple retries might help us limp along.

But long term we likely must run our own container image registry and mirror external content. Requiring the internet to be consistent/coherent in order to build is always going to be problematic. Failing to build because we can't `docker pull docker.io/library/debian:stretch-slim` should not happen.

@Verolop
Contributor Author

Verolop commented Jun 11, 2020

sounds good, I agree!

tpepper changed the title from "Add a few retries on docker pull before failing on Mock Release" to "Add a few retries around failure points in release scripting" Jul 27, 2020
@tpepper
Member

tpepper commented Jul 27, 2020

@Verolop we've just discussed that there are a number of issues that are relatively similar to this one, so I've tweaked the description slightly to cover the more abstract case. There are at most half a dozen points where a minimal additional retry loop in the shell script could make us much more likely to survive these random transient failures and save tonnes of release time and effort.
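
A minimal sketch of what such a retry loop could look like in shell. The `retry` helper, its argument order, and the fixed delay are assumptions made for illustration; none of this is taken from anago.

```shell
#!/bin/sh
# Hypothetical helper (not from anago): run a command up to max_attempts
# times, sleeping a fixed number of seconds between failed attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1"
  local delay="$2"
  local attempt=1
  local rc=0
  shift 2
  while true; do
    # Succeed as soon as the wrapped command succeeds.
    "$@" && return 0
    rc=$?
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: '$*' failed after ${attempt} attempts (exit ${rc})" >&2
      # Propagate the exit status of the last failed attempt.
      return "$rc"
    fi
    echo "retry: attempt ${attempt}/${max_attempts} failed (exit ${rc}); retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

A call site could then wrap a flaky step, for example `retry 3 10 docker pull docker.io/library/debian:stretch-slim`. A fixed delay keeps the sketch short; a growing delay may suit longer outages better.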

@tpepper
Member

tpepper commented Jul 27, 2020

(i.e., rather than closing a bunch of issues and creating a new one, I'm just re-using and re-focusing this one for broad impact)

@saschagrunert
Member

An idea that came to mind: what if we add a krel subcommand for pushing the git objects? It seems fairly straightforward, and we could remove the bash bits from anago. Then I'd like to enhance the logging via logrus and maybe add some pre-checks, for example making the call fail only in certain cases and assuming that "everything is ok" if the tag is already present remotely. WDYT?
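
The pre-check idea can be sketched in shell as follows. This is only an illustration under assumptions: `push_tag_if_missing` is a hypothetical name, it is not the proposed krel subcommand, and it treats a tag of the same name on the remote as success without comparing which commit it points to.

```shell
#!/bin/sh
# Hypothetical helper (not the krel subcommand): push a tag unless a tag
# with the same name already exists on the remote.
# Usage: push_tag_if_missing <remote> <tag>
push_tag_if_missing() {
  local remote="$1"
  local tag="$2"
  local rc=0
  # With --exit-code, git ls-remote exits 0 when a matching ref is found
  # and 2 when no matching ref exists on the remote.
  git ls-remote --exit-code --tags "$remote" "refs/tags/${tag}" >/dev/null || rc=$?
  case "$rc" in
    0)
      echo "tag ${tag} already present on ${remote}; skipping push" >&2
      return 0
      ;;
    2)
      git push "$remote" "refs/tags/${tag}"
      ;;
    *)
      # Any other status means the remote could not be queried at all.
      echo "could not query ${remote} (exit ${rc})" >&2
      return "$rc"
      ;;
  esac
}
```

Keeping "tag missing" (exit 2) apart from "remote unreachable" (any other non-zero status) means only genuine query failures fail the call, which also makes a re-run after a partial push harmless.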

@tpepper
Member

tpepper commented Aug 13, 2020

> An idea that came to mind: what if we add a krel subcommand for pushing the git objects? It seems fairly straightforward, and we could remove the bash bits from anago. Then I'd like to enhance the logging via logrus and maybe add some pre-checks, for example making the call fail only in certain cases and assuming that "everything is ok" if the tag is already present remotely. WDYT?

This is the type of decomposition we need. The anago bash bits doing that push can be removed to instead have anago call a more robust pusher.

@tpepper
Member

tpepper commented Aug 13, 2020

Related on the topic of fail/retry/continue resilience:
kubernetes/test-infra#18808

@Verolop
Contributor Author

Verolop commented Aug 13, 2020

Does it still make sense to introduce the retries at this point, or should we just go ahead with @saschagrunert's idea?

@tpepper
Member

tpepper commented Aug 13, 2020

We definitely still need some retries in anago too.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Nov 11, 2020
@puerco
Member

puerco commented Nov 11, 2020

/remove-lifecycle stale

This was mostly addressed in #1595

Unless there are more suggestions or comments, I think we can close this one.

k8s-ci-robot removed the lifecycle/stale label Nov 11, 2020
@cpanato
Member

cpanato commented Nov 14, 2020

Agreed. If more things come up, we can create a new issue.

/close

@k8s-ci-robot
Contributor

@cpanato: Closing this issue.

In response to this:

Agreed. If more things come up, we can create a new issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


8 participants