Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fix the rollout stuck issue when pod/replicas changed together for canary rollout. #3257

Conversation

andyliuliming
Copy link
Contributor

@andyliuliming andyliuliming commented Dec 17, 2023

when the rollout is in the "Pause" status,
and replica/poc spec changed together, then the rollout will stuck there forever.
this is because the newRS will be nil,
and there's one short circut return in reconcileNewReplicaSet().

issue reported in:
#3256

the change will follow the "BlueGreen" style to handle this.
pick the replica set which the canary svc targeted to. and reconcile it.

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@andyliuliming andyliuliming changed the title fix: fix the rollout stuck when pod/replicas changed together or cana… fix: fix the rollout stuck issue when pod/replicas changed together for canary rollout. Dec 17, 2023
@andyliuliming
Copy link
Contributor Author

andyliuliming commented Dec 17, 2023

@zachaller could you please help take a look when you have time? thanks.

@andyliuliming andyliuliming marked this pull request as ready for review December 17, 2023 18:09
Copy link
Contributor

github-actions bot commented Dec 17, 2023

Go Published Test Results

2 099 tests   2 098 ✔️  2m 50s ⏱️
   118 suites         0 💤
       1 files           1

For more details on these failures, see this check.

Results for commit 6cffabe.

♻️ This comment has been updated with latest results.

Copy link

codecov bot commented Dec 17, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (1ecd394) 81.81% compared to head (6cffabe) 81.83%.
Report is 1 commits behind head on master.

Files Patch % Lines
rollout/canary.go 50.00% 2 Missing and 1 partial ⚠️
rollout/service.go 93.75% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3257      +/-   ##
==========================================
+ Coverage   81.81%   81.83%   +0.01%     
==========================================
  Files         135      135              
  Lines       20629    20651      +22     
==========================================
+ Hits        16878    16899      +21     
  Misses       2880     2880              
- Partials      871      872       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

github-actions bot commented Dec 17, 2023

E2E Tests Published Test Results

    4 files      4 suites   3h 27m 55s ⏱️
106 tests   98 ✔️   6 💤   2
434 runs  400 ✔️ 24 💤 10

For more details on these failures, see this check.

Results for commit 6cffabe.

♻️ This comment has been updated with latest results.

@andyliuliming andyliuliming force-pushed the andliu/fixreplicapodchangetogether branch 4 times, most recently from d94a730 to c2142b5 Compare December 19, 2023 03:32
@andyliuliming andyliuliming force-pushed the andliu/fixreplicapodchangetogether branch from c2142b5 to 976a6cb Compare December 19, 2023 04:20
Copy link

sonarcloud bot commented Dec 20, 2023

Quality Gate Passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
2.4% Duplication on New Code

See analysis details on SonarCloud

@zachaller
Copy link
Collaborator

Linking to the BlueGreen issue for reference: #3060

@zachaller
Copy link
Collaborator

@andyliuliming I could still be doing something wrong so still digging but this PR does not seem to fix the issue for me.

@zachaller
Copy link
Collaborator

I also think the issue is around this code:

func (c *rolloutContext) rolloutCanary() error {
	var err error
	if replicasetutil.PodTemplateOrStepsChanged(c.rollout, c.newRS) {
		c.newRS, err = c.getAllReplicaSetsAndSyncRevision(false)
		if err != nil {
			return err
		}
		return c.syncRolloutStatusCanary()
	}

	c.newRS, err = c.getAllReplicaSetsAndSyncRevision(true)
	if err != nil {
		return err
	}

@zachaller
Copy link
Collaborator

This also happens in Rollouts 1.5

@andyliuliming
Copy link
Contributor Author

@zachaller

@andyliuliming I could still be doing something wrong so still digging but this PR does not seem to fix the issue for me.

this only works when there's canaryservice defined in rollout.
the fix works as expected in our rollout.

@zachaller
Copy link
Collaborator

zachaller commented Dec 21, 2023

You are right I don't know how I missed that in my test rollout. Still going to spend some time on it though because I feel there is a deeper underlying issue, I want to investigate that a bit more but this does seem to fix the issue as long as you have canary service defined.

@zachaller
Copy link
Collaborator

I think this would also fix the issue but also work for rollouts without a canary service: https://github.com/argoproj/argo-rollouts/pull/3272/files

@andyliuliming
Copy link
Contributor Author

andyliuliming commented Dec 23, 2023

I think this would also fix the issue but also work for rollouts without a canary service: https://github.com/argoproj/argo-rollouts/pull/3272/files

tested. it works! abandoned mine.

@eugenepaniot eugenepaniot mentioned this pull request Jan 18, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants