
ci-kubernetes-build jobs may upload incomplete set of artifacts #18808

Closed
spiffxp opened this issue Aug 12, 2020 · 16 comments
Assignees
Labels
  • area/release-eng: Issues or PRs related to the Release Engineering subproject
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • sig/release: Categorizes an issue or PR as relevant to SIG Release.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.

Comments

@spiffxp
Member

spiffxp commented Aug 12, 2020

What happened:
ci-kubernetes-build jobs that time out may upload an incomplete set of artifacts, and subsequent runs against the same commit don't rebuild or push the rest of the artifacts.

The two timeouts on the left are examples of this

What you expected to happen:
I expect that if ci-kubernetes-build (or one of its release-branch variants) times out or uploads an incomplete set of artifacts, the next run of the job will rebuild and push the missing artifacts.

How to reproduce it (as minimally and precisely as possible):
Set the timeout for a ci-kubernetes-build job low enough that it times out during artifact upload

Please provide links to example occurrences, if any:

https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-build-stable1/1291744141339791362 (repo-commit 236b9e7fcda25d9b28afa81305ab23f98e622461)

I0807 17:06:38.191] +++ [0807 17:06:38] Waiting on test tarballs
I0807 17:06:38.196] +++ [0807 17:06:38] Starting tarball: test linux-arm64
I0807 17:06:38.201] +++ [0807 17:06:38] Starting tarball: test darwin-amd64
I0807 17:06:38.203] +++ [0807 17:06:38] Starting tarball: test windows-amd64
I0807 17:08:37.091] +++ [0807 17:08:37] Building tarball: test portable
W0807 17:08:37.720] Run: ('../release/push-build.sh', '--nomock', '--verbose', '--ci', '--release-kind=kubernetes', '--docker-registry=gcr.io/kubernetes-ci-images', '--extra-publish-file=k8s-stable1', '--allow-dup')
I0807 17:33:29.168] Terminate 485 on timeout

It's not clear what did or did not get pushed.

https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-build-stable1/1291191706741379076

W0806 01:59:55.439] Run: ('gsutil', 'ls', 'gs://kubernetes-release-dev/ci/v1.18.7-rc.0.36+2a3a36842f8ab9')
W0806 01:59:57.576] Run: ('gsutil', 'ls', 'gs://kubernetes-release-dev/ci/v1.18.7-rc.0.36+2a3a36842f8ab9/kubernetes.tar.gz')
W0806 01:59:59.586] Run: ('gsutil', 'ls', 'gs://kubernetes-release-dev/ci/v1.18.7-rc.0.36+2a3a36842f8ab9/bin')
W0806 02:00:01.822] build already exists, exit

These things did, at least. But what should be there?

Anything else we need to know?:

The simple "did it already build" check comes from scenarios/kubernetes_build.py.
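
For reference, the gsutil calls in the second log above suggest the check is roughly "do the version directory, kubernetes.tar.gz, and bin/ exist?". A minimal sketch of that logic (an approximation inferred from the log, not the actual scenarios/kubernetes_build.py code) might look like:

    # Rough approximation of the "build already exists" check, inferred from
    # the gsutil calls above; not the actual scenarios/kubernetes_build.py code.
    import subprocess

    def gcs_exists(url):
        # gsutil ls exits non-zero when the object or prefix does not exist.
        return subprocess.call(["gsutil", "ls", url]) == 0

    def build_already_exists(bucket, version):
        base = "gs://%s/ci/%s" % (bucket, version)
        # All three of these existed in the failing example above, even though
        # the build that produced them was cut short during upload.
        return all(gcs_exists(u) for u in
                   (base, base + "/kubernetes.tar.gz", base + "/bin"))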

I'm going to raise the timeout on build jobs for now so we hopefully hit this less often, but we should decide what a higher-fidelity "did it already build" check should look like, and where it should live.

/area release-eng
/sig release
/sig testing

@spiffxp added the kind/bug label Aug 12, 2020
@k8s-ci-robot added the area/release-eng, sig/release, and sig/testing labels Aug 12, 2020
@spiffxp
Member Author

spiffxp commented Aug 12, 2020

FYI @kubernetes/release-engineering

@BenTheElder
Member

So the reason we check for an existing build is also because we broke kops in the past due to hashes changing because the build was not fully reproducible ... so pushing over a partial build may not be the best option either, since we currently can't do it atomically.

@spiffxp
Member Author

spiffxp commented Aug 12, 2020

Using the example above, are you saying a follow-up build deleting/re-writing everything in gs://kubernetes-release-dev/ci/v1.18.7-rc.0.36+2a3a36842f8ab9 might break kops?

Yeah, the lack of atomicity here is annoying.

@BenTheElder
Member

  • Deleting everything could break anything currently attempting to use it.
  • Rewriting URLs with new contents is problematic because some of them contain expected hashes of the others (which kops, at least, verifies), so you would be racily changing both the hash files and the files being hashed; see the sketch after this list.
  • You can't just write to a new location currently, because the location is well known based on the commit.
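
To make the second point concrete, here is a hedged sketch of the kind of checksum verification a consumer might do. This is illustrative only, not kops' actual code, and the <artifact>.sha512 sibling naming is an assumption:

    # Illustrative only: not kops' actual verification code. Assumes each
    # artifact has a sibling checksum object named <artifact>.sha512.
    import hashlib
    import subprocess

    def gcs_cat(url):
        # Stream an object's contents from GCS.
        return subprocess.run(["gsutil", "cat", url], check=True,
                              capture_output=True).stdout

    def verify(artifact_url):
        data = gcs_cat(artifact_url)
        expected = gcs_cat(artifact_url + ".sha512").split()[0].decode()
        actual = hashlib.sha512(data).hexdigest()
        # If a later build rewrites the artifact after the checksum file was
        # fetched (or the other way around), this fails even though neither
        # object is corrupt in isolation.
        return expected == actual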

@BenTheElder
Member

I'm not sure how much depends on the commit value other than ci/latest.txt (or similar files) containing it, and that value being in the GCS path, so maybe we could use ${COMMIT}+${ATTEMPT} as the latest.txt contents and partial path, in order to write to a new location?
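
A minimal sketch of that idea, assuming the job can learn its own attempt number (the BUILD_ATTEMPT variable below is hypothetical, not an existing Prow field) and that consumers only resolve versions through latest.txt:

    # Hypothetical sketch of the "new location per attempt" idea; BUILD_ATTEMPT
    # is an assumed environment variable, and the "+" separator is illustrative.
    import os

    def attempt_version(commit_version):
        attempt = os.environ.get("BUILD_ATTEMPT", "1")
        # e.g. v1.18.7-rc.0.36+2a3a36842f8ab9 -> v1.18.7-rc.0.36+2a3a36842f8ab9+2
        return "%s+%s" % (commit_version, attempt)

    def gcs_path(bucket, commit_version):
        return "gs://%s/ci/%s" % (bucket, attempt_version(commit_version))

    # Only once every artifact under gcs_path() has been uploaded would the job
    # overwrite ci/latest.txt with the new version string, so consumers never
    # see a half-written location.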

@justaugustus
Member

/assign @hasheddan @saschagrunert

@saschagrunert
Member

I think we can start working on this issue after the replacement of push-build.sh with krel push.

Ref #19488

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 25, 2021
@spiffxp
Member Author

spiffxp commented Jan 25, 2021

/remove-lifecycle stale
I'm willing to say this is less of an issue given that recent builds consistently take < 80 min, well under the 240 min timeout that was raised as a workaround. But this still remains an issue.

https://testgrid.k8s.io/sig-release-master-blocking#build-master&width=5&graph-metrics=test-duration-minutes

@k8s-ci-robot removed the lifecycle/stale label Jan 25, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 25, 2021
@xmudrii
Member

xmudrii commented May 25, 2021

@spiffxp Is this issue still relevant?

@spiffxp
Member Author

spiffxp commented May 25, 2021

/remove-lifecycle stale
AFAIK yes. This would be a good candidate to move to kubernetes/release, since it falls under the purview of @kubernetes/release-engineering.

To start with, I would suggest creating a script or tool to verify the completeness of all CI builds. That would answer how much of a problem this still is.
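
A very rough sketch of what such a completeness checker could look like (the EXPECTED list below is an assumption for illustration; the authoritative artifact manifest would have to come from the release tooling):

    # Rough sketch of a CI-build completeness checker; EXPECTED is an assumed
    # artifact list for illustration, not the authoritative manifest.
    import subprocess

    EXPECTED = [
        "kubernetes.tar.gz",
        "kubernetes-src.tar.gz",
        "kubernetes-client-linux-amd64.tar.gz",
        "kubernetes-server-linux-amd64.tar.gz",
        "kubernetes-node-linux-amd64.tar.gz",
    ]

    def list_objects(prefix):
        out = subprocess.run(["gsutil", "ls", "-r", prefix], check=True,
                             capture_output=True, text=True).stdout
        return {line.rsplit("/", 1)[-1] for line in out.splitlines() if line}

    def missing_artifacts(bucket, version):
        present = list_objects("gs://%s/ci/%s/" % (bucket, version))
        return [name for name in EXPECTED if name not in present]

    # Running this over every version listed under gs://kubernetes-release-dev/ci/
    # would show how many historical builds are actually incomplete.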

@k8s-ci-robot removed the lifecycle/stale label May 25, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Aug 23, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Sep 22, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
