Flagger with larger number of canaries underperforms #638

Closed
Apollorion opened this issue Jun 30, 2020 · 5 comments · Fixed by #725

@Apollorion

When I run Flagger with ~20 canaries everything seems fine, but as soon as I scale up to around 50 canaries, Flagger starts to have a bad time.

What we see when we scale up the number of canaries:

  • Canary analyses run much slower, taking 30+ minutes when they should only take about 5, without a halt message in the Flagger logs.
  • Canaries sometimes never progress at all. The canary's events show New revision detected... but it never moves beyond 0%.
  • Flagger will get itself into a crash loop if leader election is enabled, with the message: {"level":"info","ts":"2020-06-29T16:05:51.926Z","caller":"flagger/main.go:302","msg":"Leadership lost"}. The other replica picks up for a few minutes before dying with the same message.

The above conditions happen even though Flagger has plenty of resources: we've never seen it go above 2% CPU or 50MB of memory while any of these problems were occurring.

What we've tried:

  • Increasing "threadiness" in Flagger to around 30; the default looks like 2: flag.IntVar(&threadiness, "threadiness", 2, "Worker concurrency.") (see the sketch after this list).
  • Turning on debug logging; nothing of value comes from it, and it even shows Flagger simply hanging for periods of time, doing nothing.
  • Adding and removing leader election.
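
For context on the threadiness knob, here is a rough sketch of the typical client-go workqueue pattern behind such a flag (illustrative only, not Flagger's actual code). Threadiness only controls how many workers drain the queue; if all workers share one throttled API client, more workers will not raise the overall request rate.

```go
// Rough sketch of the client-go workqueue pattern behind a "threadiness"
// flag (illustrative only, not Flagger's code).
package main

import (
	"flag"
	"fmt"
	"sync"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	threadiness := flag.Int("threadiness", 2, "Worker concurrency.")
	flag.Parse()

	queue := workqueue.New()
	var wg sync.WaitGroup

	// Start N workers; this is all that raising threadiness buys.
	for w := 0; w < *threadiness; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				key, shutdown := queue.Get()
				if shutdown {
					return
				}
				// A real worker would call the Kubernetes API here; if the
				// shared client is rate limited, every worker waits on it.
				fmt.Printf("worker %d reconciling %v\n", id, key)
				time.Sleep(100 * time.Millisecond) // stand-in for API calls
				queue.Done(key)
			}
		}(w)
	}

	// Enqueue some fake canary keys, let the workers drain them, then stop.
	for i := 0; i < 10; i++ {
		queue.Add(fmt.Sprintf("test/canary-%d", i))
	}
	queue.ShutDown()
	wg.Wait()
}
```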

Flagger v1.0.0
Kubernetes v1.14
Nginx Controller v0.26.1

@stefanprodan
Member

I suspect this is due to Kubernetes API rate limits. I've been benchmarking Flagger with 100 canaries in parallel on GKE and I haven't seen delays of more than a couple of seconds, but it really depends on which Kubernetes provider you are using.

@Apollorion
Author

Apollorion commented Jun 30, 2020

> I suspect this is due to Kubernetes API rate limits. I've been benchmarking Flagger with 100 canaries in parallel on GKE and I haven't seen delays of more than a couple of seconds, but it really depends on which Kubernetes provider you are using.

Is there somewhere that Flagger will log errors if it's being rate limited? Or will Flagger just hold onto the request until it finally succeeds? I'm asking because nothing is showing up in the debug logs for the Flagger operator 😔

I'm on Amazon EKS.
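
On the question of whether anything gets logged: as far as I can tell (a hedged aside, not something from the Flagger docs), client-go's default client-side rate limiter is a token bucket whose Accept call blocks until a token is free rather than returning an error, so requests are delayed instead of failed, which would explain slow progress with nothing useful in the operator's own logs. A minimal sketch of that blocking behaviour using client-go's flowcontrol package:

```go
// Minimal demo of client-go's token-bucket rate limiter blocking callers once
// the burst is spent (illustrative; 5 QPS / 10 burst mirror the low client-go
// defaults that apply when rest.Config.QPS is left at zero).
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	limiter := flowcontrol.NewTokenBucketRateLimiter(5, 10)

	for i := 0; i < 20; i++ {
		start := time.Now()
		limiter.Accept() // blocks until a token is available; never errors
		fmt.Printf("request %2d waited %v\n", i, time.Since(start).Round(time.Millisecond))
	}
}
```

Spread across many objects reconciling at once, that wait would show up as canaries sitting at 0% rather than as errors.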

@tr-fteixeira
Contributor

We're seeing very similar behaviour, also on AWS EKS, v1.16. We are at ~45 canaries and there is nothing in the logs, even in debug. After upgrading to 1.2.0 the leader election problem is gone, but the slowness persists, even when only one canary is being updated at a time.
We tried looking at it from the AWS/Prometheus metrics side but didn't find anything that looked directly related to this.

Increasing threadiness did not change the results for us.

If you have automation for that benchmark, let me know and I am up for porting it to EKS if needed, to try and replicate this for others.

Thanks for all the things =)

@tr-fteixeira
Contributor

Hey, just an update on my last comment: I believe I've been able to reproduce this in a test environment and identify the cause. It is indeed a rate limiter, but it doesn't seem to be on the AWS EKS side; it seems to be in the client config.

When the clients are created here and here, the QPS and Burst values are left at their defaults, and those defaults are throttling some of the requests. Adding something crazy like

cfg.QPS = 100
cfg.Burst = 1000

and

cfgHost.QPS = 100
cfgHost.Burst = 1000

will do the trick and allow Flagger to breeze through at least 75 canaries with no noticeable impact on progression performance.
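
For anyone landing here, a minimal sketch of where those knobs live on client-go's rest.Config (the helper name and values are illustrative, not Flagger's actual wiring):

```go
// Sketch: raising client-go's client-side rate limits before building clients.
// Values and the helper name are illustrative, not Flagger's actual code.
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// buildConfig loads a rest.Config and raises the client-side rate limits.
// If QPS is left at zero, client-go falls back to low defaults (about 5 QPS
// with a burst of 10), which dozens of canaries can exhaust quickly.
func buildConfig(kubeconfig string, qps float32, burst int) (*rest.Config, error) {
	// An empty kubeconfig path falls back to the in-cluster config.
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	cfg.QPS = qps
	cfg.Burst = burst
	return cfg, nil
}

func main() {
	cfg, err := buildConfig("", 100, 1000)
	if err != nil {
		log.Fatalf("error building kubeconfig: %v", err)
	}

	kubeClient, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("error building kubernetes clientset: %v", err)
	}
	_ = kubeClient // the same tweak would apply to the second ("host") config as well
}
```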

I'm not sure what the preferred approach would be: maybe individual clients per canary, configurable limits, or tweaking them based on the canary count, since that is what drives the usage. There are multiple options. If you do have a preferred solution, let me know and I might be able to help out.

Regards,

@stefanprodan
Member

stefanprodan commented Nov 5, 2020

I think those two options could be set with Flagger command args; we need to figure out a default that works well with 100 canaries. If you could open a PR for this, that would be great. Thank you!
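
A hedged sketch of what exposing the two limits as command args could look like (the flag names and defaults below are hypothetical, not necessarily what the eventual PR used):

```go
// Hypothetical command-line flags for the client-side rate limits; the flag
// names and defaults are illustrative only.
package main

import (
	"flag"
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

var (
	kubeAPIQPS   = flag.Float64("kube-api-qps", 100, "Client-side QPS limit for requests to the Kubernetes API.")
	kubeAPIBurst = flag.Int("kube-api-burst", 250, "Client-side burst limit for requests to the Kubernetes API.")
)

func main() {
	flag.Parse()

	// Empty arguments fall back to the in-cluster config.
	cfg, err := clientcmd.BuildConfigFromFlags("", "")
	if err != nil {
		log.Fatalf("error building kubeconfig: %v", err)
	}
	cfg.QPS = float32(*kubeAPIQPS)
	cfg.Burst = *kubeAPIBurst

	kubeClient, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("error building kubernetes clientset: %v", err)
	}
	_ = kubeClient
}
```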
