[autoscaler] docker run options #3921
Conversation
# List of shell commands to run to set up nodes.
setup_commands:
    # Consider uncommenting these if you also want to run apt-get commands during setup
TODO: Clean this
# List of shell commands to run to set up nodes.
setup_commands:
    # Note: if you're developing Ray, you probably want to create an AMI that
TODO: Check these
Force-pushed from 5ed9ba2 to 523abef
For fast reference:
Force-pushed from 19548a3 to 996954c
AWS works here:
How far is this from being pushed? 👍
This is pretty much done; I just need to get my latest changes merged here. If by the question about DGX-2 you mean whether it's possible to use the autoscaler on local machines (instead of gcp/aws), then yeah, it's possible using the local cluster setup. Does that answer your question?
In our environment (a DGX-2 cluster) we use docker to run all of our experiments. In my case, I've started using ray/rllib/tune, but for some reason it runs very slowly (not really getting any output after initialization); for PPO, for instance, it can be outperformed by a single 1080 Ti.
python/ray/autoscaler/autoscaler.py (Outdated)
@@ -103,6 +104,9 @@
    "file_mounts": (dict, OPTIONAL),

    # List of common shell commands to run to initialize nodes.
    "startup_commands": (list, OPTIONAL),
@richardliaw Does this seem reasonable to you?
The `startup_commands` are pretty much the same as `setup_commands`, except that they are run on the host instead of the docker container. One way to achieve the same functionality would be to always run the `setup_commands` on the host instead of docker, i.e. require the docker commands to be handled explicitly. I don't think that would be unreasonable either.
yeah I think this is fine; the main nit I have is that I'd find having both `startup_commands` and `setup_commands` confusing... is there some other naming that indicates that these commands will be run directly on the instance (and not wrapped in any way, unlike `setup_commands`)?
I agree. That would be solved by combining startup and setup commands and making them always run on the host. I'm not sure what to call these so they aren't confusing. Maybe `initialization_commands` and `setup_commands`?
yeah that works
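For illustration, here is a minimal, hedged sketch of a cluster config using the naming settled on above, assuming `initialization_commands` run directly on the host and `setup_commands` run inside the container; the exact schema is whatever this PR finally defines, not this sketch:

```yaml
# Sketch only: initialization_commands run unwrapped on the host instance,
# setup_commands run inside the docker container when a docker image is configured.
initialization_commands:
    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh

setup_commands:
    - pip install -U ray
```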
Revert "[autoscaler] Speedups (ray-project#3720)" (reverts commit 315edab)
docker check
add docker flag
Revert "Revert "[autoscaler] Speedups (ray-project#3720)"" (reverts commit 352d52e90e69566408c01375cc9bdb002a1e6d94)
* Update cluster_name for both aws and gcp
* Fix gcp accelerators and images such that the nvidia drivers get automatically installed
Force-pushed from bbdbb20 to b1ba569
@perara Yeah, previously if you ran things on autoscaler + docker, then you couldn't use GPUs for the tune runs. Sounds like this might solve the problem.
For reference, here's a command to test this on gcp:
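The command itself is not reproduced above; as a purely hypothetical sketch (the config filename below is assumed for illustration and is not taken from this PR), a test run on GCP could look like:

```sh
# Hypothetical sketch: bring up a GPU cluster from an example config that has a docker section.
# The filename is assumed, not the actual command referenced above.
ray up example-gpu-docker.yaml
```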
Co-Authored-By: hartikainen <[email protected]>
@richardliaw This runs as expected on GCP. Would you mind testing it on AWS, since I don't have quota for GPU machines there?
Oops, that was maybe a bit too hasty a conclusion. The
Ok, should be fixed now. Both
python/ray/autoscaler/commands.py (Outdated)
    auth_config=config["auth"],
    cluster_name=config["cluster_name"],
    file_mounts=config["file_mounts"],
    initialization_commands=config["initialization_commands"],
hm, do you need to run these here?
I think these are already run in `_get_head_node`
I see, you're right. Good catch.
Looks good! AWS works on my end. Just one comment.
This should be ready to go.
What do these changes do?
Adds support for docker options, allowing, for example, the use of nvidia-docker.
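A rough sketch of how the new options might appear in a cluster YAML to enable the nvidia runtime; the key names below are assumed from the discussion in this PR rather than quoted from the final schema:

```yaml
# Hedged example: extra flags passed through to `docker run`.
docker:
    image: tensorflow/tensorflow:1.12.0-gpu   # assumed example image
    container_name: ray_docker                # assumed name
    run_options:
        - --runtime=nvidia                    # nvidia-docker2 runtime flag
```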
TODO:
* ray exec --docker command

Related issue number
#2657