Add a few retries around failure points in release scripting #1353

Verolop · 2020-06-10T16:05:30Z

What would you like to be added:

Add some retries around failure points in release scripting before failing completely. E.g.:

docker pull commands in the code for Mock Release (issue #1102, "Implement clone, commit and push of release notes draft to user's fork")
"Step #1: PUSH GIT OBJECTS (11/14)" (issue #933, "gcbmgr release frequently fails at 'push git objects'")
audit the past year's "HACK" and "revert" commits to the anago/gcbmgr/lib shell code to find other points that would benefit from a basic sleep/retry loop
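
The "basic sleep/retry loop" mentioned above could look roughly like the following. This is only a minimal sketch: the `retry` helper, its argument order, and the attempt and delay values are hypothetical and are not taken from anago, gcbmgr, or the lib scripts.

```shell
# Hypothetical helper (not part of anago/gcbmgr): run a command up to
# max_attempts times, sleeping delay seconds between failed attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1

  # "$@" is the command under test; loop until it succeeds.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: '$*' failed after ${attempt} attempts" >&2
      return 1
    fi
    echo "retry: attempt ${attempt}/${max_attempts} failed; sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

A call such as `retry 3 10 docker pull docker.io/library/debian:stretch-slim` would then make up to three attempts, ten seconds apart, before the script gives up. A fixed delay is the simplest choice; whether a growing delay would suit the release scripts better is an open question.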

Why is this needed:

When a step fails, we often have to hack in a workaround and re-run the whole release process, which costs a lot of time and resources. Adding retries would let those extra attempts happen within the same run.

justaugustus · 2020-06-10T19:15:28Z

/assign @Verolop @cpanato
ref: https://kubernetes.slack.com/archives/CJH2GBF7Y/p1591814834391800?thread_ts=1591794999.379400&cid=CJH2GBF7Y

justaugustus · 2020-06-10T19:17:02Z

/remove-priority important-soon
/priority critical-urgent

tpepper · 2020-06-10T20:41:34Z

Simple retries might help us limp along.

But long term we likely must run our own container image registry and mirror external content. Requiring the internet to be consistent/coherent in order to build is always going to be problematic. Failing to build because we can't docker pull docker.io/library/debian:stretch-slim should not happen.

Verolop · 2020-06-11T20:41:18Z

sounds good, I agree!

tpepper · 2020-07-27T16:02:41Z

@Verolop we've just discussed that there are a number of issues relatively similar to this one, so I've tweaked the description slightly to cover the more abstract case. There are at most half a dozen points where a minimal additional retry loop in the shell script could make us much more likely to survive these random transient failures and save tonnes of release time and effort.

tpepper · 2020-07-27T16:03:37Z

(i.e. rather than closing a bunch of issues and creating a new one, just re-using/re-focusing this one for broad impact)

saschagrunert · 2020-07-28T06:57:57Z

An idea that came to mind: what if we add a krel subcommand for pushing the git objects? It seems fairly straightforward, and we could remove the bash bits from anago. Then I'd like to enhance the logging via logrus and maybe add some pre-checks: for example, making the call fail only in certain cases and assuming that "everything is ok" if the tag is already present remotely. WDYT?
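
For illustration only, the "tag is already present remotely" pre-check could be sketched in shell as below. The function name `push_tag_if_missing` is hypothetical and this is not the proposed krel subcommand; it relies only on `git ls-remote --exit-code`, which exits with status 2 when no matching ref is found on the remote.

```shell
# Hypothetical sketch: treat an already-pushed tag as success, else push it.
# Usage: push_tag_if_missing <remote> <tag>
push_tag_if_missing() {
  local remote="$1"
  local tag="$2"

  # --exit-code makes ls-remote exit non-zero when the ref is absent.
  if git ls-remote --exit-code --tags "$remote" "refs/tags/${tag}" >/dev/null 2>&1; then
    echo "tag ${tag} already present on ${remote}; nothing to push"
    return 0
  fi

  # The tag was not found (or the lookup itself failed): attempt the push
  # and let its exit status decide success or failure.
  git push "$remote" "refs/tags/${tag}"
}
```

Note that this sketch only checks that a tag with the same name exists; it does not verify that the remote tag points at the same commit as the local one, which a real pre-check would probably want to do.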

tpepper · 2020-08-13T17:06:51Z

> An idea that came to mind: what if we add a krel subcommand for pushing the git objects? It seems fairly straightforward, and we could remove the bash bits from anago. Then I'd like to enhance the logging via logrus and maybe add some pre-checks: for example, making the call fail only in certain cases and assuming that "everything is ok" if the tag is already present remotely. WDYT?

This is the type of decomposition we need. The anago bash bits doing that push can be removed to instead have anago call a more robust pusher.

tpepper · 2020-08-13T17:07:33Z

Related on the topic of fail/retry/continue resilience:
kubernetes/test-infra#18808

Verolop · 2020-08-13T17:12:58Z

Does it still make sense to introduce the retries at this point, or should we just go ahead with @saschagrunert's idea?

tpepper · 2020-08-13T17:39:05Z

We definitely need some retries still in anago too.

fejta-bot · 2020-11-11T18:25:03Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

puerco · 2020-11-11T18:28:49Z

/remove-lifecycle stale

This was mostly addressed in #1595.

Unless there are more suggestions or comments, I think we can close this one.

cpanato · 2020-11-14T16:39:57Z

Agree, if more things come up we can create a new issue

/close

k8s-ci-robot · 2020-11-14T16:40:11Z

@cpanato: Closing this issue.

In response to this:

> Agree, if more things come up we can create a new issue
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

justaugustus transferred this issue from kubernetes/sig-release Jun 10, 2020

k8s-ci-robot added needs-kind, needs-priority labels Jun 10, 2020

k8s-ci-robot removed needs-kind, needs-priority labels Jun 10, 2020

k8s-ci-robot assigned cpanato and Verolop Jun 10, 2020

k8s-ci-robot added priority/critical-urgent and removed priority/important-soon labels Jun 10, 2020

tpepper changed the title ~~Add a few retries on docker pull before failing on Mock Release~~ Add a few retries around failure points in release scripting Jul 27, 2020

saschagrunert mentioned this issue Jul 29, 2020

[krel] Introduce krel subcommand for pushing git objects #1446

Closed

k8s-ci-robot added the lifecycle/stale label Nov 11, 2020

k8s-ci-robot removed the lifecycle/stale label Nov 11, 2020

k8s-ci-robot closed this as completed Nov 14, 2020
