Feature request: Auto scale support #4

Open
xychu opened this issue Dec 15, 2020 · 2 comments · May be fixed by #6

Comments

@xychu
Contributor

xychu commented Dec 15, 2020

The total available resources in the cluster may change over time; it would be nice to support auto-scaling if users want their training jobs to use as many resources as possible.

The common case would be:

  • create an et-job with --np, --min-np and --max-np; --np tends to be small since the launcher won't start unless at least np workers are ready
  • auto scale-out whenever extra workers can be created, up to --max-np
  • [optional] auto scale-in if some workers are preempted or fail unexpectedly

If this is not the default behavior, it could be controlled by something like scalePolicy: auto (a rough sketch of such a spec follows below).
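
A minimal sketch of what these knobs could look like on the worker spec, assuming hypothetical field names (Replicas, MinReplicas, MaxReplicas, ScalePolicy) that map to the horovod flags above; this is not the actual et-operator API:

```go
package main

import "fmt"

// ElasticWorkerSpec is an illustrative assumption, not the real CRD type.
type ElasticWorkerSpec struct {
	Replicas    int32  // maps to --np: the worker count the launcher starts with
	MinReplicas int32  // maps to --min-np: the launcher will not start below this
	MaxReplicas int32  // maps to --max-np: the upper bound for auto scale-out
	ScalePolicy string // e.g. "auto" to opt in to operator-driven scaling
}

func main() {
	spec := ElasticWorkerSpec{Replicas: 2, MinReplicas: 2, MaxReplicas: 8, ScalePolicy: "auto"}
	fmt.Printf("%+v\n", spec)
}
```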

@xiaozhouX
Collaborator

> The total available resources in the cluster may change over time; it would be nice to support auto-scaling if users want their training jobs to use as many resources as possible.
>
> The common case would be:
>
> • create an et-job with --np, --min-np and --max-np; --np tends to be small since the launcher won't start unless at least np workers are ready
> • auto scale-out whenever extra workers can be created, up to --max-np
> • [optional] auto scale-in if some workers are preempted or fail unexpectedly
>
> If this is not the default behavior, it could be controlled by something like scalePolicy: auto.

We can support scale-in of the TrainingJob when the job's workers are preempted or fail.
But for auto scale-up, the main problem is how et-operator knows when to scale up and which job can be scaled up.
In my mind, there are two approaches:

  1. et-operator watches the cluster resources, simulates scheduling, and then decides whether or not to scale up a job, like cluster-autoscaler does.
  2. et-operator just creates the target worker pods and lets the scheduler in the k8s cluster deal with the pending pods. et-operator needs to start the launcher pod once --min-np pods are running (see the sketch after this comment).

For now, I prefer the second way. Maybe in the future we can add a third-party component whose job is to monitor Spot instance prices / cluster resources / TrainingJob GPU usage and trigger TrainingJob scale-up or scale-in.
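
A rough sketch of the decision logic behind approach 2, using assumed names rather than actual et-operator code: always request the upper bound of worker pods, leave any Pending pods to the k8s scheduler, and start the launcher only once at least --min-np workers are Running.

```go
package main

import "fmt"

// desiredWorkers returns how many worker pods to create: simply the --max-np
// upper bound; the scheduler places what it can and the rest stay Pending.
func desiredWorkers(maxNP int) int {
	return maxNP
}

// shouldStartLauncher mirrors the elastic minimum: the launcher pod is started
// only once at least minNP workers are actually Running.
func shouldStartLauncher(runningWorkers, minNP int, launcherStarted bool) bool {
	return !launcherStarted && runningWorkers >= minNP
}

func main() {
	minNP, maxNP := 2, 8
	fmt.Println("worker pods to create:", desiredWorkers(maxNP))               // 8
	fmt.Println("start launcher yet?", shouldStartLauncher(1, minNP, false))   // false
	fmt.Println("start launcher yet?", shouldStartLauncher(3, minNP, false))   // true
}
```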

@xychu
Contributor Author

xychu commented Dec 17, 2020

>   2. et-operator just creates the target worker pods and lets the scheduler in the k8s cluster deal with the pending pods. et-operator needs to start the launcher pod once --min-np pods are running.
>
> For now, I prefer the second way. Maybe in the future we can add a third-party component whose job is to monitor Spot instance prices / cluster resources / TrainingJob GPU usage and trigger TrainingJob scale-up or scale-in.

Yeah, agreed. That should be enough for now, and if we want to give users more control over the auto-scale part, we can add a scale interval so that jobs are not restarted too frequently (a rough sketch of such a cooldown check follows below).
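
A small sketch of the scale-interval idea, with assumed names: a new scale-out/scale-in is allowed only if at least the configured interval has passed since the last scale event, so the job is not restarted too frequently.

```go
package main

import (
	"fmt"
	"time"
)

// canScale reports whether a new scale event is allowed: at least
// scaleInterval must have elapsed since the last one.
func canScale(lastScale time.Time, scaleInterval time.Duration, now time.Time) bool {
	return now.Sub(lastScale) >= scaleInterval
}

func main() {
	lastScale := time.Now().Add(-3 * time.Minute)
	fmt.Println(canScale(lastScale, 5*time.Minute, time.Now())) // false: still cooling down
	fmt.Println(canScale(lastScale, 2*time.Minute, time.Now())) // true: interval has passed
}
```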

xychu linked a pull request (#6) on Dec 31, 2020 that will close this issue.