[tune/autoscaler/docker] GPU support in docker #8975
Comments
Hey @richardliaw - thanks for your reply, I'll give it a try and let you know what happened. It is also great to hear you are working on this - I believe a stable autoscaler + docker with GPU support combination would be really beneficial for many ML scenarios, because it cleanly separates concerns:
This way, an ML engineer could just change the docker image reference in the cluster YAML, leaving the remaining parts as is (as opposed to updating setup_commands for each different ML task). Then it would be really easy to execute different training tasks on the same cluster, provided ray correctly handles the sequence of (a) updating the docker image reference in the cluster YAML and (b) |
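A minimal sketch of that separation of concerns, using the `docker` section of a Ray cluster YAML (the image name is a placeholder, and the exact fields should be checked against the autoscaler config schema for your Ray version):

```yaml
# Cluster YAML fragment: switching ML tasks only means changing `image`.
docker:
  image: my-registry/ml-task:latest   # hypothetical per-task image
  container_name: ray_container
  run_options: []                     # extra `docker run` flags, e.g. for GPUs

# setup_commands stay generic and do not change per task
setup_commands: []
```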
@vtomenko Please do let us know what the result is for running it with GPUs! If you have any further questions or problems please let me know! |
I'm using ray version 0.8.5 and trying to test basic docker setup. From autoscaler doc:
This does not seem to be the case; here is the error when creating the cluster with, say, a busybox docker image: Installing docker in Here is what I have in
The commands above run fine. Next, ray tries to pull the image, and this is where it fails:
Note: when I manually attach to the cluster after this error and run Can you please point me to any currently working example of how to configure the autoscaler with docker? |
Hmm, let me try working on a solution to this--if you rerun after the first install, does it work? |
I use AWS; attaching the configuration I tried. Rerunning does not help:
|
I'll try using that AMI! Thanks for sharing this! |
The issue is that we reuse SSH sessions. I'm working on a PR now, but in the meantime there is a sub-par workaround:
|
Thanks @ijrsvt , I tried the workaround and it fails. The error seems to be similar to the one reported previously for rerun. Does the workaround work for you?
|
I think this error is because the docker image (busybox) in your YAML does not have bash installed.
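For context, the failing setup command chains `docker inspect … || docker run … bash`: the `docker run` only executes when the inspect fails, and it then tries to start `bash` as the container entrypoint, which busybox does not ship. A small local sketch of both mechanics (no docker required; the binary name below is deliberately fake to simulate a missing `bash`):

```shell
#!/bin/sh
# The autoscaler's pattern: the right-hand side runs only when the left fails.
false || echo "fallback command ran"

# busybox images ship /bin/sh (BusyBox) but no /bin/bash, so an entrypoint of
# `bash` fails with: exec: "bash": executable file not found in $PATH.
# An equivalent local check for a missing binary:
command -v definitely-not-installed-shell >/dev/null || echo "binary missing"
```

Any image that includes bash (or a YAML that starts the container with `sh` instead) avoids the "executable file not found" error.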
On Thu, Jun 18, 2020 at 3:49 PM, Volodymyr Tomenko quoted the failing run:

```
2020-06-18 22:45:57,181 INFO updater.py:264 -- NodeUpdater: i-0230e74237fc85cc4: Running docker inspect -f '{{.State.Running}}' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash on 54.187.139.187...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Template parsing error: template: :1:8: executing "" at <.State.Running>: map has no entry for key "State"
WARNING: Published ports are discarded when using host network mode
99e67b7ac1327a73ee5f81be226f96a3712d785d5475080c511c60bddbe40609
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: "bash": executable file not found in $PATH": unknown.
2020-06-18 22:45:57,860 INFO log_timer.py:17 -- NodeUpdater: i-0230e74237fc85cc4: Setup commands completed [LogTimer=679ms]
2020-06-18 22:45:57,861 INFO log_timer.py:17 -- NodeUpdater: i-0230e74237fc85cc4: Applied config 0318fe55a69a325f60716771d1a6f9f9a36457ec [LogTimer=3565ms]
2020-06-18 22:45:57,861 ERROR updater.py:359 -- NodeUpdater: i-0230e74237fc85cc4: Error updating (Exit Status 127) ssh -i /home/ubuntu/.ssh/ray-autoscaler.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C -o ControlPersist=10s -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 ***@***.*** bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f '"'"'{{.State.Running}}'"'"' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash'
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 362, in run
    raise e
  File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 352, in run
    self.do_update()
  File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 440, in do_update
    self.cmd_runner.run(cmd)
  File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 274, in run
    self.process_runner.check_call(final_cmd)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/ubuntu/.ssh/ray-autoscaler.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C', '-o', 'ControlPersist=10s', '-o', 'IdentitiesOnly=yes', '-o', 'ExitOnForwardFailure=yes', '-o', 'ServerAliveInterval=5', '-o', 'ServerAliveCountMax=3', ***@***.***', 'bash', '--login', '-c', '-i', ''true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f '"'"'{{.State.Running}}'"'"' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash'']' returned non-zero exit status 127.
```
|
Makes sense, thank you @ijrsvt ! I will update the YAML and use the docker image I compiled to train ML models and see what happens. |
Awesome--I have a PR out to automatically install docker if it is not preinstalled; this won't be a problem in the future! |
Hey @ijrsvt , thank you for the PR - any chance it is going to be merged into master soon so that I can avoid the workaround? |
@vtomenko I am closing it for the moment because it actually breaks some of the autoscaler. It should be merged in about a week. |
@vtomenko any updates on your end? How are things going? |
@ijrsvt , I got it working with the following configuration for AWS:
Note: I also added openssh-client to the docker image, because the autoscaler manages worker nodes via ssh from the docker container on the head node. |
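A sketch of that image tweak (base image, tag, and Ray version pin are assumptions; the relevant line is the openssh-client install, so that SSH is available inside the head-node container):

```dockerfile
# Hypothetical training image for the Ray cluster (Debian/Ubuntu base assumed)
FROM python:3.6-slim

# openssh-client is needed because the autoscaler running inside this
# container on the head node reaches worker nodes over SSH
RUN apt-get update \
    && apt-get install -y --no-install-recommends openssh-client \
    && rm -rf /var/lib/apt/lists/*

RUN pip install ray==0.8.5
```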
@vtomenko Great to hear!! The autoscaler should work without needing to directly SSH into the containers on the worker nodes. Was this not happening for you? |
Oh--that totally makes sense--my bad. I mis-read that as an |
Should be supported now. Try using |
GPU support in docker
Reading the docs on Automatic Cluster Setup, section "Docker":
However, aws/example-gpu-docker.yaml coupled with the merged PR suggests that GPU is supported.
Could you please clarify? And if GPU is indeed supported in docker, would docker be a good choice from a stability perspective?
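For reference, the rough shape of such a config (a hedged sketch modeled on the idea behind aws/example-gpu-docker.yaml; the image, instance type, and GPU flag are illustrative assumptions, so consult the actual file shipped with your Ray version):

```yaml
# Sketch of a GPU-enabled docker cluster config (all values illustrative)
cluster_name: gpu-docker-example
docker:
  image: tensorflow/tensorflow:latest-gpu   # any CUDA-enabled image with bash
  container_name: ray_nvidia_docker
  run_options:
    - --runtime=nvidia                      # or `--gpus all` on newer docker
head_node:
  InstanceType: p2.xlarge                   # a GPU instance type (assumption)
```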