
[tune/autoscaler/docker] GPU support in docker #8975

Closed
vtomenko opened this issue Jun 16, 2020 · 21 comments

@vtomenko

GPU support in docker

Reading docs on Automatic Cluster Setup, section "Docker":

This currently does not have GPU support

However, aws/example-gpu-docker.yaml, coupled with a merged PR, suggests GPU is supported.

Could you please clarify? If GPU is indeed supported in docker, would docker be a good choice from a stability perspective?

@vtomenko vtomenko added the question Just a question :) label Jun 16, 2020
@richardliaw
Contributor

Hey @vtomenko - you might run into some bugs with the current state of the code, but @ijrsvt is hard at work on improving this right now.

@ijrsvt feel free to chime in

@rkooo567 rkooo567 added the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Jun 17, 2020
@vtomenko
Author

Hey @richardliaw - thanks for your reply. I'll give it a try and let you guys know what happened.

It is also great to hear you are working on this - I believe the combination of a stable autoscaler and docker with GPU support would be really beneficial for many ML scenarios because it cleanly separates concerns:

  • the high-level cluster description, including instance details, is handled by the cluster YAML
  • the specific ML training task is fully contained (including dependencies) and described with docker

In this way an ML engineer could just change the docker image reference in the cluster YAML and leave the remaining parts as is (as opposed to updating setup_commands for each different ML task). It would then be really easy to execute different training tasks on the same cluster, provided ray correctly handles the sequence of (a) updating the docker image reference in the cluster YAML and (b) running ray up. A sketch of what I mean is below.
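
For illustration, the kind of docker section I have in mind in the cluster YAML (a rough sketch; field names follow the autoscaler example configs, and the image name is just a placeholder):

docker:
    image: my-registry.example.com/my-training-task:latest   # the only line an ML engineer would change per task
    container_name: ml_training
    pull_before_run: True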

@ijrsvt
Contributor

ijrsvt commented Jun 17, 2020

@vtomenko Please do let us know what the result is for running it with GPUs! If you have any further questions or problems please let me know!

@richardliaw
Contributor

@vtomenko absolutely agree! Will keep you updated from our side too.

cc @anabranch @pcmoritz

@vtomenko
Author

vtomenko commented Jun 18, 2020

I'm using ray version 0.8.5 and trying to test a basic docker setup.

From autoscaler doc:

Docker: Specify docker image. This executes all commands on all nodes in the docker container, and opens all the necessary ports to support the Ray cluster. It will also automatically install Docker if Docker is not installed.

This does not seem to be the case; here is the error when creating the cluster with, say, the busybox docker image:
Command 'docker' not found, but can be installed with:

Installing docker in the initialization_commands section also does not help, for the reasons described in #7519.

Here is what I have in initialization_commands:

[
"sudo apt update -y",
"sudo apt install docker.io -y",
"sudo usermod -aG docker $USER",
"sudo systemctl restart docker"
]

The commands above run fine. Next, ray tries to pull the image, and this is where it fails:

2020-06-18 06:48:14,835 INFO updater.py:264 -- NodeUpdater: i-025cd6bcda77a92aa: Running sudo usermod -aG docker $USER on 35.165.161.154...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-06-18 06:48:14,916 INFO updater.py:264 -- NodeUpdater: i-025cd6bcda77a92aa: Running sudo systemctl restart docker on 35.165.161.154...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-06-18 06:48:16,019 INFO updater.py:264 -- NodeUpdater: i-025cd6bcda77a92aa: Running docker pull busybox:latest on 35.165.161.154...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.40/images/create?fromImage=busybox&tag=latest: dial unix /var/run/docker.sock: connect: permission denied
2020-06-18 06:48:16,114 INFO log_timer.py:17 -- NodeUpdater: i-025cd6bcda77a92aa: Initialization commands completed [LogTimer=23330ms]
2020-06-18 06:48:16,114 INFO log_timer.py:17 -- NodeUpdater: i-025cd6bcda77a92aa: Applied config a8575940426af8a45f754a232f9474405ae13ee8 [LogTimer=51055ms]
2020-06-18 06:48:16,114 ERROR updater.py:359 -- NodeUpdater: i-025cd6bcda77a92aa: Error updating (Exit Status 1) ssh -i /home/ubuntu/.ssh/ray-autoscaler_1_us-west-2.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C -o ControlPersist=10s -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 ubuntu@35.165.161.154 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker pull busybox:latest'
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 362, in run
raise e
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 352, in run
self.do_update()
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 436, in do_update
self.cmd_runner.run(cmd)
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 274, in run
self.process_runner.check_call(final_cmd)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/ubuntu/.ssh/ray-autoscaler_1_us-west-2.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C', '-o', 'ControlPersist=10s', '-o', 'IdentitiesOnly=yes', '-o', 'ExitOnForwardFailure=yes', '-o', 'ServerAliveInterval=5', '-o', 'ServerAliveCountMax=3', 'ubuntu@35.165.161.154', 'bash', '--login', '-c', '-i', "'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker pull busybox:latest'"]' returned non-zero exit status 1.

Note: when I manually attach to the cluster after this error and run docker pull busybox (as a non-root user), it works as expected.

Can you please point me to any currently working example of how to configure the autoscaler with docker?

@ijrsvt
Contributor

ijrsvt commented Jun 18, 2020

Hmm, let me try working on a solution to this--if you rerun after the first install, does it work?
Also, what are you running on: a local cluster or a public cloud?

@vtomenko
Author

I use AWS; attaching the configuration I tried.

Rerunning does not help:

2020-06-18 08:01:00,874 INFO updater.py:264 -- NodeUpdater: i-0374e72fcc8b8ea2a: Running docker inspect -f '{{.State.Running}}' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash on 54.69.9.155...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Template parsing error: template: :1:8: executing "" at <.State.Running>: map has no entry for key "State"
WARNING: Published ports are discarded when using host network mode
e2735eef81dcd0dd304fd131ab7c092c68556b8414e908449bd925fb08810bd7
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: "bash": executable file not found in $PATH": unknown.
2020-06-18 08:01:01,527 INFO log_timer.py:17 -- NodeUpdater: i-0374e72fcc8b8ea2a: Setup commands completed [LogTimer=653ms]
2020-06-18 08:01:01,527 INFO log_timer.py:17 -- NodeUpdater: i-0374e72fcc8b8ea2a: Applied config 9784c6630decbe7b8ad6cb2a34f12c3c48314063 [LogTimer=2745ms]
2020-06-18 08:01:01,527 ERROR updater.py:359 -- NodeUpdater: i-0374e72fcc8b8ea2a: Error updating (Exit Status 127) ssh -i /home/ubuntu/.ssh/ray-autoscaler_1_us-west-2.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C -o ControlPersist=10s -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 ubuntu@54.69.9.155 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f '"'"'{{.State.Running}}'"'"' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash'
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 362, in run
raise e
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 352, in run
self.do_update()
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 440, in do_update
self.cmd_runner.run(cmd)
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 274, in run
self.process_runner.check_call(final_cmd)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/ubuntu/.ssh/ray-autoscaler_1_us-west-2.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C', '-o', 'ControlPersist=10s', '-o', 'IdentitiesOnly=yes', '-o', 'ExitOnForwardFailure=yes', '-o', 'ServerAliveInterval=5', '-o', 'ServerAliveCountMax=3', 'ubuntu@54.69.9.155', 'bash', '--login', '-c', '-i', "'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f '"'"'{{.State.Running}}'"'"' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash'"]' returned non-zero exit status 127.

@ijrsvt
Contributor

ijrsvt commented Jun 18, 2020

I'll try using that AMI! Thanks for sharing this!

@ijrsvt
Contributor

ijrsvt commented Jun 18, 2020

The issue is that we reuse SSH sessions. I'm working on a PR now, but in the meantime there is a subpar workaround (sketched below):

  1. Run ray up, with the initialization commands you have above
  2. Remove the initialization commands
  3. Rerun ray up in about 15 seconds or so.
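
Concretely, the sequence looks something like this (cluster.yaml is a placeholder for your config file; the short wait lets the cached SSH control connection, ControlPersist=10s in the logs above, expire so that the next session picks up the new docker group membership):

ray up cluster.yaml     # first pass: initialization_commands install docker, the image pull step still fails
# edit cluster.yaml and remove (or comment out) the initialization_commands
sleep 15                # let the reused SSH control connection expire
ray up cluster.yaml     # second pass: a fresh SSH session can talk to the docker daemon and pulls/runs the image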

@vtomenko
Author

Thanks @ijrsvt, I tried the workaround and it fails. The error seems to be similar to the one I reported previously for the rerun. Does the workaround work for you?

2020-06-18 22:45:57,181 INFO updater.py:264 -- NodeUpdater: i-0230e74237fc85cc4: Running docker inspect -f '{{.State.Running}}' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash on 54.187.139.187...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Template parsing error: template: :1:8: executing "" at <.State.Running>: map has no entry for key "State"
WARNING: Published ports are discarded when using host network mode
99e67b7ac1327a73ee5f81be226f96a3712d785d5475080c511c60bddbe40609
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: "bash": executable file not found in $PATH": unknown.
2020-06-18 22:45:57,860 INFO log_timer.py:17 -- NodeUpdater: i-0230e74237fc85cc4: Setup commands completed [LogTimer=679ms]
2020-06-18 22:45:57,861 INFO log_timer.py:17 -- NodeUpdater: i-0230e74237fc85cc4: Applied config 0318fe55a69a325f60716771d1a6f9f9a36457ec [LogTimer=3565ms]
2020-06-18 22:45:57,861 ERROR updater.py:359 -- NodeUpdater: i-0230e74237fc85cc4: Error updating (Exit Status 127) ssh -i /home/ubuntu/.ssh/ray-autoscaler.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C -o ControlPersist=10s -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 ubuntu@54.187.139.187 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f '"'"'{{.State.Running}}'"'"' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash'
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 362, in run
raise e
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 352, in run
self.do_update()
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 440, in do_update
self.cmd_runner.run(cmd)
File "/home/ubuntu/projects/ai-ml-model-search/env/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 274, in run
self.process_runner.check_call(final_cmd)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/ubuntu/.ssh/ray-autoscaler.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_1d41c853af/e55cd6ce94/%C', '-o', 'ControlPersist=10s', '-o', 'IdentitiesOnly=yes', '-o', 'ExitOnForwardFailure=yes', '-o', 'ServerAliveInterval=5', '-o', 'ServerAliveCountMax=3', 'ubuntu@54.187.139.187', 'bash', '--login', '-c', '-i', "'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f '"'"'{{.State.Running}}'"'"' busybox || docker run --rm --name busybox -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321 -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --net=host busybox:latest bash'"]' returned non-zero exit status 127.

@ijrsvt
Contributor

ijrsvt commented Jun 18, 2020 via email

@vtomenko
Author

Makes sense, thank you @ijrsvt! I will update the YAML, use the docker image I built to train ML models, and see what happens.

@ijrsvt
Contributor

ijrsvt commented Jun 19, 2020

Awesome--I have a PR out to automatically install docker if it is not preinstalled; this won't be a problem in the future!

@vtomenko
Author

Hey @ijrsvt, thank you for the PR - any chance it will be merged into master soon, so that I can avoid the workaround?

@edoakes edoakes added P2 Important issue, but not time-critical enhancement Request for new feature and/or capability and removed question Just a question :) triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 26, 2020
@ijrsvt
Contributor

ijrsvt commented Jun 26, 2020

@vtomenko I am closing it for the moment because it actually breaks parts of the autoscaler--it should be merged in about a week.

@richardliaw richardliaw added the tune Tune-related issues label Jul 8, 2020
@ijrsvt
Contributor

ijrsvt commented Jul 20, 2020

@vtomenko any updates on your end? How are things going?

@vtomenko
Author

vtomenko commented Jul 24, 2020

@ijrsvt, I got it working with the following configuration for AWS:

  • head/worker nodes: Deep Learning Base AMI (Ubuntu 18.04), so that all required GPU drivers etc. and docker are preinstalled
  • instance type: g4dn.xlarge
  • docker image: based on nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04 (so that the container can use GPUs), plus, in my case, mxnet packages with GPU support
  • for a custom docker registry: added "docker login ..." to the initialization_commands section in the cluster YAML

Note: I also added openssh-client to the docker image because the autoscaler manages worker nodes via ssh from the docker container on the head node. Roughly, the relevant pieces are sketched below.
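
The sketch (AMI id, registry, and image names are placeholders; the layout follows the AWS example configs, so adapt to your setup):

# cluster YAML (fragments)
docker:
    image: my-registry.example.com/mxnet-gpu-train:latest   # built on nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
    container_name: ray_gpu
    run_options:
        - --runtime=nvidia                                   # expose the host GPUs to the container

head_node:
    InstanceType: g4dn.xlarge
    ImageId: ami-xxxxxxxxxxxxxxxxx                           # Deep Learning Base AMI (Ubuntu 18.04)

worker_nodes:
    InstanceType: g4dn.xlarge
    ImageId: ami-xxxxxxxxxxxxxxxxx

initialization_commands:
    - docker login my-registry.example.com -u my-user -p "$MY_REGISTRY_TOKEN"   # only needed for a private registry

# Dockerfile (fragment)
FROM nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
RUN apt-get update && apt-get install -y openssh-client     # the autoscaler SSHes to workers from the head container
# ... plus python, ray, and mxnet with GPU support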

@ijrsvt
Contributor

ijrsvt commented Jul 24, 2020

@vtomenko Great to hear!! The autoscaler should work without needing to directly SSH into the containers on the worker nodes. Was this not happening for you?

@vtomenko
Author

@ijrsvt, my understanding is that the autoscaler SSHes into worker nodes from the docker container on the head node. So if the container on the head node does not have an ssh client installed, the worker nodes are not configured properly. Looks like the same issue as #5496.

@ijrsvt
Contributor

ijrsvt commented Jul 24, 2020

Oh--that totally makes sense--my bad. I misread that as openssh-server. The general requirements of the autoscaler are captured in this Dockerfile. (I'll make sure to add SSH to it.) Please let me know if you have any other problems/questions/feedback! I'm always happy to help!

@richardliaw
Contributor

Should be supported now. Try using rayproject/ray:latest-gpu.
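
For example, a minimal docker section along these lines (container_name and run_options are illustrative; whether you need --runtime=nvidia or --gpus all depends on your docker/NVIDIA runtime setup):

docker:
    image: rayproject/ray:latest-gpu
    container_name: ray_container
    pull_before_run: True
    run_options:
        - --runtime=nvidia   # or "--gpus all" on newer docker versions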
