Scaling with spiky queues #116
I understand your concern. Thanks for raising this here. I see you have not tried out secondsToProcessOneJob. Using that will keep a minimum number of pods always running, based on the queue RPM. Please try it once; it may solve the issue, as the surge would not be that spiky after that. Also, we try not to have such spiky load on the system: your pods might take it, but a 3rd-party system might go down :) but that is a separate discussion and not a concern for WPA. We want to keep the metrics and flags to a minimum so that the whole system remains simple and easy to use, and we do not wish to complicate the system until very necessary. Please try secondsToProcessOneJob and give feedback.
You could also control the scale-down in this use case by using the maxDisruption spec. Using maxDisruption you have control over how quickly or slowly you scale down; this helps you scale down in steps. Please try this as well.
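For reference, both of these knobs are fields on the WorkerPodAutoscaler spec. A minimal fragment is sketched below; the values are illustrative only, not recommendations from this thread:

```yaml
spec:
  targetMessagesPerWorker: 40
  secondsToProcessOneJob: 5    # illustrative: roughly how long one job takes;
                               # gives WPA an RPM-based floor of workers
  maxDisruption: "10%"         # illustrative: at most 10% of running workers
                               # may be removed in a single scale-down step
```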
Hi, thanks for the response and the suggestions. I've run a few simple tests based on these; the results are below.

Test 1: targetMessagesPerWorker: 40, maxDisruption: 100%, secondsToProcessOneJob: null (same as in my original post)

These four tests are graphed below from left to right. The time to get from 1000 messages to 0 messages is as follows:

Test 1: 8m0s

The graph of replicas is below. As you can see, there is a difference when maxDisruption is set, but it is not huge. The behaviour when secondsToProcessOneJob is set is much the same time-wise, and the replica count is just plain weird: the replicas spike back up when the count is near zero.

I increased verbosity to 4 on the controller while running these tests; here's a snippet from the period when the workers spike back up. It seems the controller believes messages are arriving on the queue??
Hi @mf-lit, thank you for the comprehensive problem statement. One caveat with spiky traffic is that the calculations and the experiments that we discuss can result in weird outcomes because:
I really like your idea of scaling based on the ApproximateAgeOfOldestMessage metric and the maxDesiredMessageAge config parameter. We already use the ApproximateAgeOfOldestMessage metric for our CloudWatch alarms, but we haven't explored this possibility before.

The calculations and the config parameters in WPA are designed to complement each other and optimize primarily for two use cases:

Case-1, long-running workers with low throughput: for example, around 2 minutes to process a job and 0-10 jobs per minute.

Case-2, fast workers with high throughput: for example, under 1 second to process a job and more than 1000 jobs per minute.

I think your use case maps better to Case-1 than to Case-2. If so, setting maxDisruption to 0 is recommended to ensure that the backlog is consumed as fast as possible and scaling down happens only when all workers are idle. However, if the workers are never all idle, this can result in over-provisioned infrastructure.

If that doesn't work and we need to treat it as a new case (Case-3), then the solution you proposed can be used to ensure the desired workers don't fall below the calculated value. We should, however, explore whether this can cause unexpected issues. One minor issue I can think of: if a pod crashes and the job stays in messagesNotVisible for the rest of the visibility timeout, this can trigger unnecessary scale-up, because the ApproximateAgeOfOldestMessage metric will go up even though all visible messages are being consumed immediately. The AWS documentation for the ApproximateAgeOfOldestMessage metric also warns about poison-pill messages and jobs that are received more than three times.
Hi @justjkk, thanks for taking the time to respond so thoroughly. Sorry it's taken a while to get back to you; I've been away.
These are very good points. I guess some of them could be compensated for, but it would make the algorithm/logic rather more complicated, which may not be desirable.
Actually no, here is the complete command line I'm using:
Like you, I was also surprised that maxDisruption=10% wasn't more effective. I also tried 1%, and that too did not improve things. This can be seen on the left of the graph below.

Out of curiosity I tried maxDisruption=0%, despite it not being suitable for our production workloads (our queue rarely reaches zero). This is shown on the right of the graph. It certainly improved the time taken to clear the queue, but desiredReplicas never came down once the queue was down to zero??
In v1.6.0 we have released a fix for the issue where the resync period is not respected when updates to the status happen frequently due to changes in the status object. We have added a cooldown so that scale-down happens only at a 10-minute interval after the last scale activity, and not every 200ms as was happening before. I am hoping this solves the problem you mentioned above, as there would be more workers running and the queue would get cleared up sooner.
I'm looking at using worker-pod-autoscaler as a means to scale our application based on an SQS queue. Overall it's been working well; indeed, it behaves exactly as described in the docs.

However, its behaviour doesn't fit well with the nature of our workloads, so I'm opening this issue to describe the problem and hopefully start some discussion about a possible solution.
Our queue backlog is "spiky": large numbers of messages arrive on the queue both unpredictably and suddenly. To demonstrate this, I created the following WorkerPodAutoscaler spec:
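The spec itself isn't reproduced above. Judging from Test 1 in the follow-up comment (targetMessagesPerWorker: 40, maxDisruption: 100%, no secondsToProcessOneJob), it presumably looked roughly like the sketch below; the apiVersion, names, queue URI and replica bounds are placeholders, not the author's actual values:

```yaml
# Reconstruction for illustration only -- not the original manifest.
apiVersion: k8s.practo.dev/v1
kind: WorkerPodAutoscaler
metadata:
  name: my-worker                    # placeholder
spec:
  deploymentName: my-worker          # placeholder
  queueURI: https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue   # placeholder
  minReplicas: 0
  maxReplicas: 50                    # placeholder; must allow the 25 replicas seen below
  targetMessagesPerWorker: 40
  maxDisruption: "100%"
```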
Then I sent 1000 messages to the empty queue (this took about 10 seconds).
A graph of the queue length over the coming minutes looks like this:
And a graph of desired replicas looks like this:
We can see that WPA behaves as expected. Replicas scale immediately to (1000/40 =) 25, and queue length drops rapidly as a result. Then, as the queue continues to fall, replicas are removed and the message removal rate slows, until some time later the queue is finally back to zero and replicas go to zero too.
The problem for us is the way the number of workers reduces in proportion to the queue length, which means the removal rate is constantly falling and therefore items remain on the queue for longer than we would like. For us, the length of the queue backlog is irrelevant as an SLO; what matters is the amount of time items are sitting on the queue.

In the example above we can see that it took eight minutes for the queue to finally reach zero. For our use case we do not want messages on the queue for any more than five minutes. I could try to mitigate this by reducing targetMessagesPerWorker, and this may be reasonably effective, but it will result in a lot of initial replicas and still suffer from an ever-decreasing removal rate: a very inefficient solution. Also, the behaviour would be different for larger or smaller spikes.
My suggestion would be a third alternative metric (in addition to `targetMessagesPerWorker` and `secondsToProcessOneJob`) called something like `maxDesiredMessageAge`.

To create an algorithm based on that metric we also need to know some more information about the queue:

- the age of the oldest message on the queue
- the rate at which messages are currently being removed from the queue
For SQS both of those metrics can be found (or at least derived) from CloudWatch:

- `ApproximateAgeOfOldestMessage`
- `ApproximateNumberOfMessagesVisible` / `NumberOfMessagesDeleted`
The algorithm would then be a loop that looks something like this (pseudo-code):
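The pseudo-code itself doesn't appear above, so here is a rough sketch in Python of how the loop body might work, based purely on the description in this issue; the function and parameter names, the bootstrap step and the 300-second default are all illustrative, not real WPA code:

```python
import math

MAX_DESIRED_MESSAGE_AGE = 300.0  # seconds; the proposed maxDesiredMessageAge setting


def desired_replicas(current_replicas: int,
                     oldest_message_age: float,
                     queue_length: int,
                     removal_rate: float) -> int:
    """Aim to drain the backlog before the oldest message exceeds the target age.

    oldest_message_age and removal_rate would come from CloudWatch
    (ApproximateAgeOfOldestMessage and NumberOfMessagesDeleted respectively).
    """
    if queue_length == 0:
        return 0
    if current_replicas == 0 or removal_rate <= 0:
        return max(current_replicas, 1)  # bootstrap so a removal rate can be observed
    # Time left before the oldest message breaches the target age.
    time_remaining = max(MAX_DESIRED_MESSAGE_AGE - oldest_message_age, 1.0)
    # Throughput (messages/second) needed to clear the backlog in that time.
    required_rate = queue_length / time_remaining
    # Scale current replicas by the ratio of required to observed throughput --
    # the same shape as the HPA calculation mentioned below. In WPA this value
    # could act as a floor on the existing targetMessagesPerWorker calculation.
    return math.ceil(current_replicas * required_rate / removal_rate)
```

For example, with 1000 messages on the queue, an oldest-message age of 30s, 5 running replicas and an observed removal rate of 2 messages/second, the required rate is 1000 / 270 ≈ 3.7 msg/s, so the loop would ask for ceil(5 × 3.7 / 2) = 10 replicas.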
This should result in replicas ramping up more steadily as a result of the spike but remaining scaled-up in order to clear the queue within the target time.
You'll probably notice that the final calculation is pretty much what HorizontalPodAutoscaler does. However, using HPA doesn't work for this, as it doesn't synchronise metrics with scaling events, resulting in massive over-scaling: it scales up and then scales again and again because the queue metrics don't update quickly enough.
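For reference, the standard HPA calculation being alluded to here is (from the Kubernetes documentation): desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue).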
My example is very simplistic, but I'm just trying to figure out if it's feasible in theory at least. Would love to hear some thoughts...