Feature request: Auto scale support #4

Open
xychu opened this issue Dec 15, 2020 · 2 comments · May be fixed by #6

Comments

@xychu
Contributor

xychu commented Dec 15, 2020

The total available resources in the cluster may change over time; it would be nice to support auto-scaling if users want their training jobs to use as many resources as possible.

The common case would be:

  • create an et-job with --np, --min-np and --max-np; --np tends to be small since the launcher won't start unless at least np workers are ready
  • auto scale-out whenever extra workers can be created, up to --max-np
  • [optional] auto scale-in if some workers are preempted or fail unexpectedly

If this is not the default behavior, it could be controlled by something like scalePolicy: auto (a rough sketch of such a spec follows below).
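
A minimal sketch of what these knobs could look like on the worker spec, assuming hypothetical field names (Replicas, MinReplicas, MaxReplicas, ScalePolicy) that map to the horovod flags above; this is not the actual et-operator API:

```go
package main

import "fmt"

// ElasticWorkerSpec is an illustrative assumption, not the real CRD type.
type ElasticWorkerSpec struct {
	Replicas    int32  // maps to --np: the worker count the launcher starts with
	MinReplicas int32  // maps to --min-np: the launcher will not start below this
	MaxReplicas int32  // maps to --max-np: the upper bound for auto scale-out
	ScalePolicy string // e.g. "auto" to opt in to operator-driven scaling
}

func main() {
	spec := ElasticWorkerSpec{Replicas: 2, MinReplicas: 2, MaxReplicas: 8, ScalePolicy: "auto"}
	fmt.Printf("%+v\n", spec)
}
```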

@xiaozhouX
Collaborator

> The total available resources in the cluster may change over time; it would be nice to support auto-scaling if users want their training jobs to use as many resources as possible.
>
> The common case would be:
>
> • create an et-job with --np, --min-np and --max-np; --np tends to be small since the launcher won't start unless at least np workers are ready
> • auto scale-out whenever extra workers can be created, up to --max-np
> • [optional] auto scale-in if some workers are preempted or fail unexpectedly
>
> If this is not the default behavior, it could be controlled by something like scalePolicy: auto.

We can support scale-in of the TrainingJob when the job's workers are preempted or fail.
But for auto scale-up, the main problem is how et-operator knows when to scale up and which job can be scaled up.
In my mind, there are two approaches:

  1. et-operator watches the cluster resources, simulates scheduling, and then decides whether or not to scale up a job, like cluster-autoscaler does.
  2. et-operator just creates the target worker pods and lets the scheduler in the k8s cluster deal with the pending pods. et-operator needs to start the launcher pod once --min-np pods are running (see the sketch after this comment).

For now, I prefer the second way. Maybe in the future we can add a third-party component whose job is to monitor Spot instance prices / cluster resources / TrainingJob GPU usage and trigger TrainingJob scale-up or scale-in.
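
A rough sketch of the decision logic behind approach 2, using assumed names rather than actual et-operator code: always request the upper bound of worker pods, leave any Pending pods to the k8s scheduler, and start the launcher only once at least --min-np workers are Running.

```go
package main

import "fmt"

// desiredWorkers returns how many worker pods to create: simply the --max-np
// upper bound; the scheduler places what it can and the rest stay Pending.
func desiredWorkers(maxNP int) int {
	return maxNP
}

// shouldStartLauncher mirrors the elastic minimum: the launcher pod is started
// only once at least minNP workers are actually Running.
func shouldStartLauncher(runningWorkers, minNP int, launcherStarted bool) bool {
	return !launcherStarted && runningWorkers >= minNP
}

func main() {
	minNP, maxNP := 2, 8
	fmt.Println("worker pods to create:", desiredWorkers(maxNP))               // 8
	fmt.Println("start launcher yet?", shouldStartLauncher(1, minNP, false))   // false
	fmt.Println("start launcher yet?", shouldStartLauncher(3, minNP, false))   // true
}
```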

@xychu
Contributor Author

xychu commented Dec 17, 2020

>   2. et-operator just creates the target worker pods and lets the scheduler in the k8s cluster deal with the pending pods. et-operator needs to start the launcher pod once --min-np pods are running.
>
> For now, I prefer the second way. Maybe in the future we can add a third-party component whose job is to monitor Spot instance prices / cluster resources / TrainingJob GPU usage and trigger TrainingJob scale-up or scale-in.

Yeah, agreed. That should be enough for now, and if we want to give users more control over the auto-scale part, we can add a scale interval so that jobs are not restarted too frequently (a rough sketch of such a cooldown check follows below).
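
A small sketch of the scale-interval idea, with assumed names: a new scale-out/scale-in is allowed only if at least the configured interval has passed since the last scale event, so the job is not restarted too frequently.

```go
package main

import (
	"fmt"
	"time"
)

// canScale reports whether a new scale event is allowed: at least
// scaleInterval must have elapsed since the last one.
func canScale(lastScale time.Time, scaleInterval time.Duration, now time.Time) bool {
	return now.Sub(lastScale) >= scaleInterval
}

func main() {
	lastScale := time.Now().Add(-3 * time.Minute)
	fmt.Println(canScale(lastScale, 5*time.Minute, time.Now())) // false: still cooling down
	fmt.Println(canScale(lastScale, 2*time.Minute, time.Now())) // true: interval has passed
}
```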

xychu linked a pull request (#6) on Dec 31, 2020 that will close this issue.