Fail restore if any warmup job failed for volume-snapshot restores (#5569) #5578

Merged

ti-chi-bot
Member

This is an automated cherry-pick of #5569

What problem does this PR solve?

Currently, restore only checks whether all warmup jobs have finished (succeeded or failed) before continuing with the later restore steps. Instead, if any warmup job fails, the restore itself should fail.

Observed behavior:

  • Failed warmup job conditions:
...
conditions:
  - lastProbeTime: "2024-03-10T21:05:24Z"
    lastTransitionTime: "2024-03-10T21:05:24Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
failed: 5
ready: 0
startTime: "2024-03-10T20:53:09Z"
uncountedTerminatedPods: {}
  • The restore nevertheless reaches WarmUpComplete:
commitTs: "448239596192667254"
conditions:
  - lastTransitionTime: "2024-03-10T20:51:49Z"
    status: "True"
    type: Scheduled
  - lastTransitionTime: "2024-03-10T20:51:57Z"
    status: "True"
    type: Running
  - lastTransitionTime: "2024-03-10T20:52:11Z"
    status: "True"
    type: VolumeComplete
  - lastTransitionTime: "2024-03-10T20:52:57Z"
    status: "True"
    type: WarmUpStarted
  - lastTransitionTime: "2024-03-10T22:02:49Z"
    status: "True"
    type: WarmUpComplete
phase: WarmUpComplete
progresses:
  - lastTransitionTime: "2024-03-10T20:52:11Z"
    progress: 100
    step: Volume Restore
timeCompleted: null
timeStarted: "2024-03-10T20:51:57Z"

What is changed and how does it work?

When any warmup job fails, fail the entire restore. Verified in a testing workload by artificially causing warmup to fail and observing that the Restore (and consequently the VolumeRestore) was put into the Failed state.
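
The intended behavior can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `JobCondition`, `WarmupJob`, `findTrueCondition`, and `CheckWarmupJobs` are hypothetical names, and the local types only mirror the Job condition fields shown in the status output above (the real types live in the Kubernetes batch API).

```go
package main

import "fmt"

// JobCondition is a hypothetical local stand-in for the Job condition
// fields that appear in the status output (type, status, reason).
type JobCondition struct {
	Type   string // "Complete" or "Failed"
	Status string // "True", "False", or "Unknown"
	Reason string // for example "BackoffLimitExceeded"
}

// WarmupJob is a hypothetical stand-in for one warmup Job and its conditions.
type WarmupJob struct {
	Name       string
	Conditions []JobCondition
}

// findTrueCondition returns the first condition of the given type whose
// status is "True", and whether such a condition exists.
func findTrueCondition(job WarmupJob, condType string) (JobCondition, bool) {
	for _, c := range job.Conditions {
		if c.Type == condType && c.Status == "True" {
			return c, true
		}
	}
	return JobCondition{}, false
}

// CheckWarmupJobs reports whether warmup is done.
//   - A non-nil error means at least one job failed, so the restore should fail.
//   - done == true with a nil error means every job completed successfully.
//   - done == false with a nil error means some job is still running.
func CheckWarmupJobs(jobs []WarmupJob) (done bool, err error) {
	allComplete := true
	for _, job := range jobs {
		// A failed job is terminal: stop immediately instead of treating
		// "finished" as equivalent to "succeeded".
		if cond, failed := findTrueCondition(job, "Failed"); failed {
			return false, fmt.Errorf("warmup job %s failed: %s", job.Name, cond.Reason)
		}
		if _, complete := findTrueCondition(job, "Complete"); !complete {
			allComplete = false
		}
	}
	return allComplete, nil
}
```

Given a job whose conditions match the failed job shown earlier (type `Failed`, status `"True"`, reason `BackoffLimitExceeded`), this sketch returns an error instead of reporting warmup as done, which is the point at which a caller would mark the restore as Failed.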

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code

Side effects

  • Breaking backward compatibility
  • Other side effects:

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.


Contributor

ti-chi-bot bot commented Mar 15, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hanlins for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@BornChanger
Contributor

/run-pull-e2e-kind-br

@csuzhangxc csuzhangxc merged commit a1425dc into pingcap:release-1.5 Mar 15, 2024
6 of 7 checks passed