Allow setting 'minimum headroom' for autoscaling #148

yuvipanda · 2017-06-29T22:46:10Z

I want to be able to say 'if the cluster is more than X% full, scale up until it is not'. This is important in highly dynamic clusters with spiky load: we run a Kubernetes cluster for a university, and a large spike of pods starts up when classes start. If we wait for them to fail scheduling before adding more nodes, users get a suboptimal experience (since it might take several minutes for a new node to spin up).

One problem would be defining what 'full' is, in a way that doesn't duplicate what's in the scheduler.
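To make the request concrete, here is a minimal sketch of the "scale up until the cluster is no more than X% full" arithmetic. It assumes "full" means the sum of CPU requests divided by total allocatable CPU, which is only one possible definition and not how the scheduler or Cluster Autoscaler defines it. The function name and parameters are hypothetical.

```python
def nodes_to_add(requested_millicpu, node_millicpu, current_nodes, max_full_percent):
    """Return how many identical nodes to add so that
    requested CPU / total CPU stays at or below max_full_percent.

    Assumption: "full" is measured by CPU requests only, and all
    nodes have the same allocatable CPU.
    """
    if not 0 < max_full_percent <= 100:
        raise ValueError("max_full_percent must be in (0, 100]")
    if node_millicpu <= 0:
        raise ValueError("node_millicpu must be positive")

    # CPU each node may carry before crossing the threshold (scaled by 100).
    usable_per_node = node_millicpu * max_full_percent
    # Ceiling division using integers only, to avoid float rounding.
    required_nodes = -((-requested_millicpu * 100) // usable_per_node)
    return max(0, required_nodes - current_nodes)


if __name__ == "__main__":
    # 10 nodes of 4 CPUs each, 34 CPUs requested, target at most 80% full.
    print(nodes_to_add(34_000, 4_000, current_nodes=10, max_full_percent=80))
```

A real implementation would also have to account for memory, per-pod fit, and node groups of different sizes, which is where the overlap with scheduler logic comes in.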

davidopp · 2017-06-29T22:48:49Z

We also got this request from a GKE customer recently. So there are at least two people who want it. :)

yuvipanda · 2017-06-29T23:18:50Z

I also want this on GKE :D

mwielgus · 2017-06-30T00:20:30Z

We are working on this in #77.

yuvipanda · 2017-06-30T01:51:41Z

oooo, awesome! Is it being planned to coincide with 1.8? Or later?


davidopp · 2017-07-11T05:46:50Z

A nice complement to this feature would be a way to pre-pull images to the headroom nodes so that a pending pod pays neither the node creation overhead (headroom feature) nor the image pull overhead (pre-pull feature) and can start running right away. We'd need to figure out a way for the cluster admin or user to specify which images should be pre-pulled where.

jonastl · 2017-08-07T15:37:21Z

This. Together with making scale-up a parallel operation, it would solve our problem.
Waiting for 50-100 new machines to boot and install the backplane software takes forever, since scaling up is a serial operation (it takes exactly 160 seconds per node on GKE).

MaciekPytel · 2017-08-07T16:22:26Z

@jonastl Which version of CA are you using? Scaling up shouldn't be serial - CA estimates how many nodes are required and adds them in a single request. And it only waits for the request to come back, not for the nodes to actually start.
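As a rough illustration of "estimate how many nodes are required, then add them in one request", the sketch below packs pending pods onto hypothetical empty nodes with a first-fit-decreasing heuristic. This is a simplified stand-in, not Cluster Autoscaler's actual estimator: it assumes identical nodes and looks at CPU requests only, and the function name is made up for this example.

```python
def estimate_new_nodes(pending_millicpu_requests, node_millicpu):
    """Estimate how many empty, identical nodes are needed to fit all
    pending pods, using first-fit-decreasing on CPU requests."""
    free = []  # remaining CPU on each hypothetical new node
    for request in sorted(pending_millicpu_requests, reverse=True):
        if request > node_millicpu:
            raise ValueError("a pod does not fit even on an empty node")
        for i, remaining in enumerate(free):
            if request <= remaining:
                # Place the pod on the first node with enough room.
                free[i] = remaining - request
                break
        else:
            # No existing hypothetical node fits; open a new one.
            free.append(node_millicpu - request)
    return len(free)


if __name__ == "__main__":
    pending = [2_000, 2_000, 1_000, 3_000]
    # The whole estimate would go into a single resize request.
    print(estimate_new_nodes(pending, node_millicpu=4_000))
```

The point of the sketch is that the node count is computed up front for all pending pods, so the resize can be issued once rather than one node at a time.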

MaciekPytel · 2017-08-07T16:23:44Z

Sorry, just realised you mentioned GKE - in this case I mean what cluster version are you using (as CA is bundled with cluster version on GKE).

jonastl · 2017-08-08T11:02:32Z

@MaciekPytel Version 1.6.7

MaciekPytel · 2017-08-08T12:45:16Z

@jonastl In that case it definitely shouldn't be serial. That being said, your comment #77 (review) suggests you're using a very unusual setup, so perhaps there is a bug somewhere that only manifests for your setup.

It may be worth creating a new issue for that with some information about your setup (cluster version, cluster size, number of pods and description of how they're scheduled). Alternatively we can have a chat on kubernetes slack and see if there is something we can figure out quickly.

jonastl · 2017-08-25T20:20:18Z

@MaciekPytel, it turned out that when we enabled resource constraints (CPU and memory) to a degree that filled a node group member, then scaling speed was much improved, so my remark above about serial scaling can be scratched with this new insight.

The solution was non-obvious to us, but now that we've found out about the scaler's behavior with the expected (undocumented) knobs turned, we're happy with the scaling speed.

davidopp · 2017-09-13T18:04:35Z

Is there an ETA on this?

mwielgus · 2017-09-13T18:08:02Z

Next K8S release (1.9). In 1.8 we were busy improving the performance of the current functionality and this feature makes all the computations much more complex.

yuvipanda · 2017-12-04T00:12:03Z

Is someone working on this for 1.9?

yuvipanda · 2017-12-04T02:16:43Z

After some thinking, I've come up with a scheme (for GKE) involving two nodepools that'll satisfy our use cases, and have written it up at berkeley-dsep-infra/data8xhub#7. If anyone with more knowledge of the autoscaler can take a look at that and lmk how terrible the idea is, I would highly appreciate it.

choldgraf · 2018-02-22T18:11:58Z

Any movement on this? It would be quite useful for ensuring we don't hit ceiling effects before new nodes are requested!

fejta-bot · 2018-05-23T18:14:07Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

yuvipanda · 2018-06-14T07:55:22Z

Any idea how we can help get this moving? :)

aleksandra-malinowska · 2018-06-14T08:09:56Z

This can be achieved using pod priority and preemption, see [How can I configure overprovisioning with Cluster Autoscaler?](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler)
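For readers who want a feel for the approach, here is a hedged sketch that generates the two objects such a setup typically needs: a low-priority PriorityClass and a Deployment of placeholder "pause" pods that reserve headroom and get preempted when real pods arrive. The object names, the priority value of -10, the API versions, and the image tag are assumptions for illustration; check the linked FAQ for the values that match your cluster and Cluster Autoscaler version.

```python
import json


def overprovisioning_manifests(replicas, cpu_per_pod, memory_per_pod, priority=-10):
    """Build a PriorityClass and a placeholder Deployment as plain dicts.

    All names and the default priority are illustrative assumptions,
    not values mandated by Cluster Autoscaler.
    """
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": "overprovisioning"},
        "value": priority,  # lower than regular pods so placeholders are preempted first
        "globalDefault": False,
        "description": "Priority class for placeholder pods that reserve headroom.",
    }
    labels = {"run": "overprovisioning"}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "overprovisioning", "namespace": "default"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": "overprovisioning",
                    "containers": [
                        {
                            "name": "reserve-resources",
                            # Assumed image; any container that idles would do.
                            "image": "registry.k8s.io/pause:3.9",
                            "resources": {
                                "requests": {
                                    "cpu": cpu_per_pod,
                                    "memory": memory_per_pod,
                                }
                            },
                        }
                    ],
                },
            },
        },
    }
    return [priority_class, deployment]


if __name__ == "__main__":
    # Reserve roughly 3 CPUs and 3Gi of headroom with three placeholder pods.
    for manifest in overprovisioning_manifests(3, "1", "1Gi"):
        print(json.dumps(manifest, indent=2))
```

The amount of headroom is then `replicas` times the per-pod request, and the JSON output can be reviewed before applying it with whatever tooling you normally use.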

MaciekPytel mentioned this issue Aug 8, 2017

Resource overprovisioning in Cluster Autoscaler kubernetes/enhancements#387

Closed

aleksandra-malinowska added area/cluster-autoscaler enhancement labels Nov 14, 2017

jacobtomlinson mentioned this issue Jan 4, 2018

Resource Slack for AutoScaler berkeley-dsep-infra/data8xhub#7

Open

yuvipanda mentioned this issue Jan 16, 2018

AWS Deployment pangeo-data/pangeo#71

Closed

yuvipanda mentioned this issue Feb 22, 2018

Work around launch rate failures whenever a new node is autoscaled up jupyterhub/mybinder.org-deploy#474

Closed

k8s-ci-robot added the lifecycle/stale label May 23, 2018

k8s-ci-robot added kind/feature and removed enhancement labels Jun 5, 2018

aleksandra-malinowska closed this as completed Jun 14, 2018

consideRatio mentioned this issue Jun 14, 2018

[WIP] Autoscaling - a living development documentation jupyterhub/zero-to-jupyterhub-k8s#503

Closed

minrk mentioned this issue Apr 27, 2021

user-scheduler's ranking isn't considered when evicting user placeholder pods - following eviction, that is where a user is scheduled jupyterhub/zero-to-jupyterhub-k8s#1851

Open
