Flagger with larger number of canaries underperforms #638

Closed
Apollorion opened this issue Jun 30, 2020 · 5 comments · Fixed by #725

@Apollorion

When I run Flagger with ~20 canaries everything seems fine, but as soon as I scale up to around 50 canaries, Flagger starts to have a bad time.

What we see when we scale up the number of canaries:

  • Canary analyses run much slower, taking 30+ minutes when they should only take about 5, without a halt message in the Flagger logs.
  • Canaries sometimes never progress at all. The canary's events show New revision detected... but it never moves beyond 0%.
  • Flagger will get itself into a crash loop if leader election is enabled, with the message: {"level":"info","ts":"2020-06-29T16:05:51.926Z","caller":"flagger/main.go:302","msg":"Leadership lost"}. The other replica picks up for a few minutes before dying with the same message.

The above conditions happen even though Flagger has plenty of resources: we've never seen it go above 2% CPU or 50MB of memory while any of these problems were occurring.

What we've tried:

  • Increasing "threadiness" in Flagger to around 30; the default looks like 2: flag.IntVar(&threadiness, "threadiness", 2, "Worker concurrency.") (see the sketch after this list).
  • Turning on debug logging; nothing of value comes from it, and it even shows Flagger simply hanging for periods of time, doing nothing.
  • Adding and removing leader election.
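
For context on the threadiness knob, here is a rough sketch of the typical client-go workqueue pattern behind such a flag (illustrative only, not Flagger's actual code). Threadiness only controls how many workers drain the queue; if all workers share one throttled API client, more workers will not raise the overall request rate.

```go
// Rough sketch of the client-go workqueue pattern behind a "threadiness"
// flag (illustrative only, not Flagger's code).
package main

import (
	"flag"
	"fmt"
	"sync"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	threadiness := flag.Int("threadiness", 2, "Worker concurrency.")
	flag.Parse()

	queue := workqueue.New()
	var wg sync.WaitGroup

	// Start N workers; this is all that raising threadiness buys.
	for w := 0; w < *threadiness; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				key, shutdown := queue.Get()
				if shutdown {
					return
				}
				// A real worker would call the Kubernetes API here; if the
				// shared client is rate limited, every worker waits on it.
				fmt.Printf("worker %d reconciling %v\n", id, key)
				time.Sleep(100 * time.Millisecond) // stand-in for API calls
				queue.Done(key)
			}
		}(w)
	}

	// Enqueue some fake canary keys, let the workers drain them, then stop.
	for i := 0; i < 10; i++ {
		queue.Add(fmt.Sprintf("test/canary-%d", i))
	}
	queue.ShutDown()
	wg.Wait()
}
```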

Flagger v1.0.0
Kubernetes v1.14
Nginx Controller v0.26.1

@stefanprodan
Member

I suspect this is due to Kubernetes API rate limits. I've been benchmarking Flagger with 100 canaries in parallel on GKE and I haven't seen delays of more than a couple of seconds, but it really depends on which Kubernetes provider you are using.

@Apollorion
Author

Apollorion commented Jun 30, 2020

> I suspect this is due to Kubernetes API rate limits. I've been benchmarking Flagger with 100 canaries in parallel on GKE and I haven't seen delays of more than a couple of seconds, but it really depends on which Kubernetes provider you are using.

Is there somewhere that Flagger will log errors if it's being rate limited? Or will Flagger just hold onto the request until it finally succeeds? I'm asking because nothing is showing up in the debug logs for the Flagger operator 😔

I'm on Amazon EKS.
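
On the question of whether anything gets logged: as far as I can tell (a hedged aside, not something from the Flagger docs), client-go's default client-side rate limiter is a token bucket whose Accept call blocks until a token is free rather than returning an error, so requests are delayed instead of failed, which would explain slow progress with nothing useful in the operator's own logs. A minimal sketch of that blocking behaviour using client-go's flowcontrol package:

```go
// Minimal demo of client-go's token-bucket rate limiter blocking callers once
// the burst is spent (illustrative; 5 QPS / 10 burst mirror the low client-go
// defaults that apply when rest.Config.QPS is left at zero).
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	limiter := flowcontrol.NewTokenBucketRateLimiter(5, 10)

	for i := 0; i < 20; i++ {
		start := time.Now()
		limiter.Accept() // blocks until a token is available; never errors
		fmt.Printf("request %2d waited %v\n", i, time.Since(start).Round(time.Millisecond))
	}
}
```

Spread across many objects reconciling at once, that wait would show up as canaries sitting at 0% rather than as errors.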

@tr-fteixeira
Contributor

We're seeing very similar behaviour, also on AWS EKS, v1.16. We are at ~45 canaries and there is nothing in the logs, even in debug. After upgrading to 1.2.0 the leader election problem is gone, but the slowness persists, even when only one canary is being updated at a time.
We tried looking at it from the AWS/Prometheus metrics side but didn't find anything that looked directly related to this.

Increasing threadiness did not change the results for us.

If you have automation for that benchmark, let me know and I am up for porting it to EKS if needed, to try and replicate this for others.

Thanks for all the things =)

@tr-fteixeira
Contributor

Hey, just an update on my last comment: I believe I've been able to reproduce this in a test environment and identify the cause. It is indeed a rate limiter, but it doesn't seem to be on the AWS EKS side; it seems to be in the client config.

When the clients are created here and here, the QPS and Burst values are left at their defaults, and those defaults are throttling some of the requests. Adding something crazy like

cfg.QPS = 100
cfg.Burst = 1000

and

cfgHost.QPS = 100
cfgHost.Burst = 1000

will do the trick and allow Flagger to breeze through at least 75 canaries with no noticeable impact on progression performance.
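
For anyone landing here, a minimal sketch of where those knobs live on client-go's rest.Config (the helper name and values are illustrative, not Flagger's actual wiring):

```go
// Sketch: raising client-go's client-side rate limits before building clients.
// Values and the helper name are illustrative, not Flagger's actual code.
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// buildConfig loads a rest.Config and raises the client-side rate limits.
// If QPS is left at zero, client-go falls back to low defaults (about 5 QPS
// with a burst of 10), which dozens of canaries can exhaust quickly.
func buildConfig(kubeconfig string, qps float32, burst int) (*rest.Config, error) {
	// An empty kubeconfig path falls back to the in-cluster config.
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	cfg.QPS = qps
	cfg.Burst = burst
	return cfg, nil
}

func main() {
	cfg, err := buildConfig("", 100, 1000)
	if err != nil {
		log.Fatalf("error building kubeconfig: %v", err)
	}

	kubeClient, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("error building kubernetes clientset: %v", err)
	}
	_ = kubeClient // the same tweak would apply to the second ("host") config as well
}
```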

I'm not sure what the preferred approach would be: maybe individual clients per canary, configurable limits, or tweaking them based on the canary count, since that is what drives the usage. There are multiple options. If you do have a preferred solution, let me know and I might be able to help out.

Regards,

@stefanprodan
Member

stefanprodan commented Nov 5, 2020

I think those two options could be set with Flagger command args; we need to figure out a default that works well with 100 canaries. If you could open a PR for this, that would be great. Thank you!
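
A hedged sketch of what exposing the two limits as command args could look like (the flag names and defaults below are hypothetical, not necessarily what the eventual PR used):

```go
// Hypothetical command-line flags for the client-side rate limits; the flag
// names and defaults are illustrative only.
package main

import (
	"flag"
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

var (
	kubeAPIQPS   = flag.Float64("kube-api-qps", 100, "Client-side QPS limit for requests to the Kubernetes API.")
	kubeAPIBurst = flag.Int("kube-api-burst", 250, "Client-side burst limit for requests to the Kubernetes API.")
)

func main() {
	flag.Parse()

	// Empty arguments fall back to the in-cluster config.
	cfg, err := clientcmd.BuildConfigFromFlags("", "")
	if err != nil {
		log.Fatalf("error building kubeconfig: %v", err)
	}
	cfg.QPS = float32(*kubeAPIQPS)
	cfg.Burst = *kubeAPIBurst

	kubeClient, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("error building kubernetes clientset: %v", err)
	}
	_ = kubeClient
}
```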
