Feature request: Auto scale support #4
Comments
We can support scale-in in the …
For now, I prefer the second way. Maybe in the future, we can add a third-party component whose job is monitoring Spot instance price / cluster resources / trainingJob GPU usage, and triggering trainingJob scale-up or scale-in.
Yeah, agreed. That should be enough for now, and if we want to give users more control over the auto-scale part, we can add a scale interval so that jobs will not be restarted too frequently.
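To make the scale-interval idea concrete, here is a minimal Go sketch of a rate-limited scaling decision. The TrainingJobSpec / TrainingJobStatus types and the MinNP, MaxNP and ScaleInterval fields are hypothetical placeholders, not names from this project.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical spec fields; names are illustrative only.
type TrainingJobSpec struct {
	MinNP         int           // lower bound on workers (mirrors --min-np)
	MaxNP         int           // upper bound on workers (mirrors --max-np)
	ScaleInterval time.Duration // minimum time between scale events
}

type TrainingJobStatus struct {
	Workers       int       // current number of workers
	LastScaleTime time.Time // when the job was last scaled
}

// clamp keeps target within [min, max].
func clamp(v, min, max int) int {
	if v < min {
		return min
	}
	if v > max {
		return max
	}
	return v
}

// reconcileScale decides whether to change the worker count.
// freeSlots is how many additional workers the cluster could
// currently schedule (negative if capacity shrank); how that is
// measured is out of scope for this sketch.
func reconcileScale(spec TrainingJobSpec, status *TrainingJobStatus, freeSlots int, now time.Time) {
	// Rate-limit: skip if the last scale event was too recent,
	// so the job is not restarted too frequently.
	if now.Sub(status.LastScaleTime) < spec.ScaleInterval {
		return
	}

	target := clamp(status.Workers+freeSlots, spec.MinNP, spec.MaxNP)
	if target == status.Workers {
		return
	}

	fmt.Printf("scaling workers %d -> %d\n", status.Workers, target)
	status.Workers = target
	status.LastScaleTime = now
}

func main() {
	spec := TrainingJobSpec{MinNP: 2, MaxNP: 16, ScaleInterval: 10 * time.Minute}
	status := &TrainingJobStatus{Workers: 4, LastScaleTime: time.Now().Add(-time.Hour)}

	reconcileScale(spec, status, 6, time.Now())  // scales up to 10
	reconcileScale(spec, status, -3, time.Now()) // skipped: within ScaleInterval
}
```

The guard on LastScaleTime is what keeps the job from being restarted on every small fluctuation in free resources.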
The total available resources in the cluster may change over time, so it would be nice to support auto-scale if users want their training jobs to use as many resources as possible.
Common case will be: the user sets --np, --min-np and --max-np. --np tends to be small, since the launcher won't start unless at least np workers are ready, while --max-np is the upper bound the job should scale up to when more resources become available.
If this is not the default behavior, it can be controlled by something like a scalePolicy: auto field (see the sketch below).
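As a rough illustration of the proposal above, here is a hedged Go sketch of what such a spec surface might look like, with a scalePolicy field next to bounds that echo the --np / --min-np / --max-np launcher flags. The field names (np, minNP, maxNP, scalePolicy) are placeholders and do not reflect any existing API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ScalePolicy selects how the worker count is managed.
// "auto" would let the operator grow the job toward maxNP when the
// cluster has spare resources; "manual" (the presumed default) would not.
type ScalePolicy string

const (
	ScalePolicyAuto   ScalePolicy = "auto"
	ScalePolicyManual ScalePolicy = "manual"
)

// ElasticSpec mirrors the launcher flags: np (initial workers),
// min-np and max-np (scaling bounds).
type ElasticSpec struct {
	NP          int         `json:"np"`
	MinNP       int         `json:"minNP"`
	MaxNP       int         `json:"maxNP"`
	ScalePolicy ScalePolicy `json:"scalePolicy,omitempty"`
}

func main() {
	// What a user-facing manifest fragment might look like
	// if it declared scalePolicy: auto.
	manifest := []byte(`{"np": 4, "minNP": 2, "maxNP": 16, "scalePolicy": "auto"}`)

	var spec ElasticSpec
	if err := json.Unmarshal(manifest, &spec); err != nil {
		panic(err)
	}

	autoScale := spec.ScalePolicy == ScalePolicyAuto
	fmt.Printf("start with %d workers, scale between %d and %d, auto=%v\n",
		spec.NP, spec.MinNP, spec.MaxNP, autoScale)
}
```

Treating "manual" as the presumed default matches the suggestion that automatic scaling would be opt-in via scalePolicy: auto.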