-
Notifications
You must be signed in to change notification settings - Fork 153
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
added incident report for issue #3907
- Loading branch information
Showing
11 changed files
with
298 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,297 @@ | ||
# Release v0.29 published helm chart before container image was available causing upgrade failures in users | ||
|
||
Tracking issue: https://github.com/weaveworks/weave-gitops/issues/3907 | ||
|
||
## Summary | ||
|
||
Wednesday, August 2nd, we released Weave Gitops v0.29 as part of our regular release cadence. The release published | ||
artefacts out of order causing degradation in users environments. For example as stated in the [issue](https://github.com/weaveworks/weave-gitops/issues/3907), | ||
the helm chart was published before the container image was available: | ||
|
||
``` | ||
weave-gitops-648b7f9655-ll4zn 0/1 ImagePullBackOff 0 11m | ||
weave-gitops-648b7f9655-ll4zn 0/1 ErrImagePull 0 11m | ||
``` | ||
|
||
It was caused due to the release PR merged before the release workflows finished. It recovered when the release workflow finished | ||
and the container image was published. | ||
|
||
## Root Cause Analysis | ||
|
||
Releasing is stated in [release process](https://github.com/weaveworks/weave-gitops/blob/main/doc/release-process.md) that | ||
requires two steps and workflows: | ||
- [prepare-release](https://github.com/weaveworks/weave-gitops/actions/workflows/prepare-release.yaml) | ||
- [release](https://github.com/weaveworks/weave-gitops/actions/workflows/release.yaml) | ||
|
||
Looking at process and workflows, at least these two paths could cause the failures shown in the Background: | ||
|
||
1. Release PR merged before Release job finishes | ||
2. Release job steps [ordering](https://github.com/weaveworks/weave-gitops/blob/main/.github/workflows/release.yaml#L104-L107) | ||
|
||
### Release PR merged before Release job finishes | ||
|
||
Release PR has two expected interaction: | ||
- Releaser PR is approved by the releaser flagging the release job to start. | ||
- Release PR is merged by release bot after the release job has completed. | ||
|
||
What the release process does not avoid, is the releaser accidentally merging the PR after approving and before the | ||
release job has finished. | ||
|
||
This is the scenario that we found ourselves in the latest release [v0.29](https://github.com/weaveworks/weave-gitops/pull/3906). | ||
The merge triggered the [chart workflow](https://github.com/weaveworks/weave-gitops/blob/main/.github/workflows/chart.yaml) | ||
that [released v4.0.27 chart](https://github.com/weaveworks/weave-gitops/actions/runs/5738131386/job/15551116775) that | ||
expected [container image tag v0.29.0](https://github.com/weaveworks/weave-gitops/pull/3906/files#diff-67081178cf02ff87b1326e8b608a6ab4e49a85606346d64607b498fead04b048R13) | ||
at `2023-08-02T11:16` | ||
|
||
```bash | ||
2023-08-02T11:16:18.4241737Z ##[group]Run helm push helm-release/weave-gitops-4.0.27.tgz oci://ghcr.io/weaveworks/charts | ||
2023-08-02T11:16:18.4242301Z [36;1mhelm push helm-release/weave-gitops-4.0.27.tgz oci://ghcr.io/weaveworks/charts[0m | ||
2023-08-02T11:16:18.4298565Z shell: /usr/bin/bash -e {0} | ||
2023-08-02T11:16:18.4298839Z env: | ||
2023-08-02T11:16:18.4299087Z REGISTRY: ghcr.io | ||
2023-08-02T11:16:18.4299390Z CHART_LOCATION: weaveworks/charts | ||
2023-08-02T11:16:18.4299903Z CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE: /home/runner/work/weave-gitops/weave-gitops/gha-creds-4b2fc240e9fa929d.json | ||
2023-08-02T11:16:18.4300541Z GOOGLE_APPLICATION_CREDENTIALS: /home/runner/work/weave-gitops/weave-gitops/gha-creds-4b2fc240e9fa929d.json | ||
2023-08-02T11:16:18.4301139Z GOOGLE_GHA_CREDS_PATH: /home/runner/work/weave-gitops/weave-gitops/gha-creds-4b2fc240e9fa929d.json | ||
2023-08-02T11:16:18.4301620Z CLOUDSDK_CORE_PROJECT: docs-gitops-ww-prod | ||
2023-08-02T11:16:18.4301972Z CLOUDSDK_PROJECT: docs-gitops-ww-prod | ||
2023-08-02T11:16:18.4302382Z GCLOUD_PROJECT: docs-gitops-ww-prod | ||
2023-08-02T11:16:18.4302728Z GCP_PROJECT: docs-gitops-ww-prod | ||
2023-08-02T11:16:18.4303069Z GOOGLE_CLOUD_PROJECT: docs-gitops-ww-prod | ||
2023-08-02T11:16:18.4303393Z ##[endgroup] | ||
2023-08-02T11:16:19.8473627Z Pushed: ghcr.io/weaveworks/charts/weave-gitops:4.0.27 | ||
2023-08-02T11:16:19.8474536Z Digest: sha256:014965fae24719c62faf864c6748f6c24437c6fc437d0920131765a78ca91650 | ||
2023-08-02T11:16:19.8550603Z Post job cleanup. | ||
``` | ||
The container image tag v0.29.0 [started building](https://github.com/weaveworks/weave-gitops/actions/runs/5738127454) at `2023-08-02T11:29` | ||
``` | ||
2023-08-02T11:29:24.0894265Z ##[group]Run docker/build-push-action@v3 | ||
2023-08-02T11:29:24.0894501Z with: | ||
2023-08-02T11:29:24.0894681Z context: . | ||
2023-08-02T11:29:24.0894896Z platforms: linux/amd64,linux/arm64 | ||
2023-08-02T11:29:24.0895102Z push: true | ||
2023-08-02T11:29:24.0895449Z tags: ghcr.io/weaveworks/wego-app:0.29.0 | ||
ghcr.io/weaveworks/wego-app:v0.29.0 | ||
ghcr.io/weaveworks/wego-app:latest | ||
``` | ||
And ended up at around `2023-08-02T12:41` | ||
``` | ||
023-08-02T12:40:58.9623639Z #70 pushing layers | ||
2023-08-02T12:41:00.6152315Z #70 pushing layers 1.7s done | ||
2023-08-02T12:41:00.6153313Z #70 pushing manifest for ghcr.io/weaveworks/wego-app:0.29.0@sha256:a0d23c6429357d2bc037470acbe347e40bd0ee738cbe4b328656cead3330e394 | ||
2023-08-02T12:41:01.8178601Z #70 pushing manifest for ghcr.io/weaveworks/wego-app:0.29.0@sha256:a0d23c6429357d2bc037470acbe347e40bd0ee738cbe4b328656cead3330e394 1.2s done | ||
2023-08-02T12:41:01.9367337Z #70 pushing layers 0.3s done | ||
2023-08-02T12:41:01.9369527Z #70 pushing manifest for ghcr.io/weaveworks/wego-app:v0.29.0@sha256:a0d23c6429357d2bc037470acbe347e40bd0ee738cbe4b328656cead3330e394 | ||
2023-08-02T12:41:02.5391427Z #70 pushing manifest for ghcr.io/weaveworks/wego-app:v0.29.0@sha256:a0d23c6429357d2bc037470acbe347e40bd0ee738cbe4b328656cead3330e394 0.5s done | ||
2023-08-02T12:41:02.8397930Z #70 pushing layers 0.2s done | ||
2023-08-02T12:41:02.8398709Z #70 pushing manifest for ghcr.io/weaveworks/wego-app:latest@sha256:a0d23c6429357d2bc037470acbe347e40bd0ee738cbe4b328656cead3330e394 | ||
2023-08-02T12:41:03.2911496Z #70 pushing manifest for ghcr.io/weaveworks/wego-app:latest@sha256:a0d23c6429357d2bc037470acbe347e40bd0ee738cbe4b328656cead3330e394 0.5s done | ||
2023-08-02T12:41:03.2911964Z #70 DONE 9.0s | ||
20 | ||
``` | ||
Any upgrade to v0.29.0 from `2023-08-02T11:16` to `2023-08-02T12:41` ended up in failure. | ||
At this stage we could likely say that this was the cause of the incident. The rest of the document focuses on understanding | ||
remediation actions and next steps. Release job steps ordering will be evaluated as separate issue. | ||
## Resolution | ||
It was resolved after the release workflow ended and the container image was published. | ||
## Prevention | ||
Given that a releaser just needs to verify the release looks fine before triggered. We should design it the process | ||
with that into consideration. Two alternatives are identified: | ||
1. no human intervention has effect or not possible after release PR is approved | ||
2. no human intervention is required during the release process | ||
From the two approaches, given the scope of this work is to be in a better place next time we release with smallest cost possible, | ||
we will be focusing on the first one, so the second would require changing the process that we could anticipate would | ||
more costly (more changes to do) and riskier (adopt new process and changes could end up in different new failure modes). | ||
The second alternative would be discovered in case there is no feasible solution from 1) or there is no appetite to | ||
go ahead. | ||
### No human intervention has effect or not possible after release PR is approved | ||
The scenario we want to model would be something like | ||
```gherkin | ||
Feature: release with out of order guardrails | ||
Scenario: | ||
Given releaser executed prepare-release job with `v0.30` | ||
Then release PR is created | ||
And release PR cannot be merged cause not approved | ||
And release PR cannot be merged cause not released | ||
When releaser approves it | ||
Then release process starts | ||
And release PR cannot merge it cause release hasn't happened | ||
When release process ends | ||
Then release PR is merged by bot | ||
``` | ||
A solution to achieve the semantics that only `approved and released` release PRs could be merged, could be achieved | ||
by extending our current `branch merge protection`: | ||
Protection against merging unapproved PRs is already in place. | ||
![weave-gitops-protection-approval.png](imgs/weave-gitops-protection-approval.png) | ||
Protection against merging not-released PRs is not in place: | ||
![weave-gitops-protection-status-checks.png](imgs/weave-gitops-protection-status-checks.png) | ||
We could add that layer of protection by leveraging [github status checks](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/collaborating-on-repositories-with-code-quality-features/about-status-checks) | ||
and requiring a `release` status check to merge PRs: | ||
![main-protection.png](imgs/main-protection.png) | ||
A `release` status checks would have locking semantics as follows: | ||
- Do nothing if the PR is not from a release branch. | ||
- if release branch | ||
- `lock`: at prepare-release workflow. | ||
- `unlock`: after release has been done (and container image has been published). | ||
Therefore, an attempt to merge a release PR would be blocked until the release does not happen. The `release` status check could be managed via github api and gitihub actions steps like: | ||
```yaml | ||
release: | ||
if: ${{ !startsWith(github.event.pull_request.head.ref, 'releases/') }} | ||
runs-on: ubuntu-latest | ||
steps: | ||
- name: release | ||
run: | | ||
curl --request POST \ | ||
--url https://api.github.com/repos/${{ github.repository }}/statuses/${{ github.event.pull_request.head.sha }} \ | ||
--header 'authorization: Bearer ${{ secrets.WEAVE_GITOPS_BOT_ACCESS_TOKEN }}' \ | ||
--header 'content-type: application/json' \ | ||
--data '{ | ||
"state":"success", | ||
"description":"release not required", | ||
"context":"release" | ||
}' | ||
``` | ||
#### Proof of Concept Validation | ||
```gherkin | ||
Feature: can build non-release branches without changing the flow nor overhead | ||
Background: | ||
Given weave-gitops main merges are protected with `build` status check | ||
And weave-gitops main merges are protected with `release` status check | ||
Scenario: current flow | ||
Given a non release pr | ||
When build | ||
Then CI workflow passes | ||
And I could merge | ||
``` | ||
``` | ||
Background: | ||
Given weave-gitops main merges are protected with `build` status check | ||
And weave-gitops main merges are protected with `release` status check | ||
``` | ||
![main-protection.png](imgs/main-protection.png) | ||
``` | ||
Given a non release pr | ||
When build | ||
``` | ||
Created branch called `feature/validation-poc` and Bot raised PR https://github.com/enekofb/weave-gitops/pull/6 | ||
Where we could see the checks pending | ||
![validate-non-release-pr.png](imgs/validate-non-release-pr.png) | ||
Action building https://github.com/enekofb/weave-gitops/actions/runs/5806346911 | ||
After building, it gets updated | ||
![validate-non-release-pr-after-build.png](imgs/validate-non-release-pr-after-build.png) | ||
I could approve and merge as usual https://github.com/enekofb/weave-gitops/pull/6 | ||
//TODO: release job running in main https://github.com/enekofb/weave-gitops/actions/runs/5806364525/job/15739102657 | ||
// it does not fail but not great | ||
```gherkin | ||
Feature: can build release branches with guardrails | ||
Background: | ||
Given weave-gitops main merges are protected with `build` status check | ||
And weave-gitops main merges are protected with `release` status check | ||
Scenario: add a check to the PR that only gets passed when release goes on | ||
When human executed prepare-release job with `v0.30` | ||
Then release PR is created | ||
And release PR cannot be merged cause not approved | ||
And release PR cannot be merged cause not release build passes | ||
When human approves it | ||
Then release process starts | ||
And cannot merge it cause release hasnt happened | ||
When release process ends | ||
Then release check passes | ||
And release branch is merged by bot | ||
``` | ||
``` | ||
Given human executed prepare-release job with `v0.100` | ||
``` | ||
https://github.com/enekofb/weave-gitops/actions/runs/5806576714 | ||
``` | ||
Then release PR is created | ||
And release PR cannot be merged cause not approved | ||
And release PR cannot be merged cause not release build passes | ||
``` | ||
https://github.com/enekofb/weave-gitops/pull/11 | ||
We could see how `release Pending — execute release to pass` | ||
![validate-release-pr-prepared.png](imgs/validate-release-pr-prepared.png) | ||
``` | ||
When human approves it | ||
Then release process starts | ||
And cannot merge it cause release hasnt happened | ||
``` | ||
Release has started https://github.com/enekofb/weave-gitops/actions/runs/5806793749/job/15740396733 | ||
And cannot merge it | ||
![validate-release-pr-approved.png](imgs/validate-release-pr-approved.png) | ||
``` | ||
When release process ends | ||
Then release check passes | ||
And release branch is merged by bot | ||
``` | ||
After the release happened, the PR got merged by the bot | ||
![validate-release-pr-happened.png](imgs/validate-release-pr-happened.png) | ||
https://github.com/enekofb/weave-gitops/actions/runs/5806793749/job/15740396733 | ||
``` | ||
{ | ||
"sha": "03093a127262cad8f4c94caf8aa4262314a1cedd", | ||
"merged": true, | ||
"message": "Pull Request successfully merged" | ||
} | ||
``` |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+104 KB
doc/incidents/issues-3907/imgs/validate-non-release-pr-after-build.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+254 KB
doc/incidents/issues-3907/imgs/weave-gitops-protection-status-checks.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.