
[autoscaler] docker run options #3921

Merged
merged 23 commits into ray-project:master from feature/nvidia-docker on Feb 13, 2019

Conversation

@hartikainen (Contributor) commented Feb 1, 2019

What do these changes do?

Adds support for docker run options, allowing, for example, the use of nvidia-docker.
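
For illustration, here is a sketch of what a docker section with run options could look like in a cluster YAML. The run_options key and its placement are assumptions based on the PR title and the --runtime=nvidia flag visible in the test logs further down; the pre-existing schema only accepted image and container_name, as the install_docker error later in this thread shows.

# Sketch only: key names such as run_options are assumptions, not necessarily the merged schema.
docker:
    image: tensorflow/tensorflow:1.12.0-gpu-py3   # image used in the test runs below
    container_name: ray-nvidia-docker-test
    run_options:
        - --runtime=nvidia                        # what nvidia-docker needs for GPU access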

TODO:

  • Test install_docker (broken on master but works when reverting [autoscaler] Speedups #3720)
  • Test nvidia docker on gce
  • Test nvidia docker on ec2
  • Add ray exec --docker command

Related issue number

#2657


# List of shell commands to run to set up nodes.
setup_commands:
# Consider uncommenting these if you also want to run apt-get commands during setup
Contributor Author:

TODO: Clean this


# List of shell commands to run to set up nodes.
setup_commands:
# Note: if you're developing Ray, you probably want to create an AMI that
Contributor Author:

TODO: Check these

@richardliaw self-assigned this Feb 1, 2019
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11367/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11428/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11469/

@richardliaw (Contributor)

For fast reference:

ray exec aws/example-gpu.yaml --docker 'python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"' --start

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11485/

@hartikainen force-pushed the feature/nvidia-docker branch 2 times, most recently from 19548a3 to 996954c on February 4, 2019 01:55
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11490/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11491/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11495/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11503/

@richardliaw (Contributor) commented Feb 4, 2019

AWS works here:

ray --logging-level=DEBUG exec aws/example-gpu-docker.yaml --docker 'ls; python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"' --start --stop

2019-02-03 23:08:24,412	INFO commands.py:173 -- get_or_create_head_node: Launching new head node...
2019-02-03 23:08:25,624	INFO commands.py:186 -- get_or_create_head_node: Updating files on head node...
2019-02-03 23:08:25,625	INFO updater.py:126 -- NodeUpdater: i-0bd3f18776afefed6: Updating to 451596001be28c222890ed76ad86454b4efe1827
2019-02-03 23:08:25,626	INFO updater.py:88 -- NodeUpdater: Waiting for IP of i-0bd3f18776afefed6...
2019-02-03 23:08:25,753	INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-node-status=waiting-for-ssh on ['i-0bd3f18776afefed6'] [LogTimer=127ms]
2019-02-03 23:08:26,239	INFO log_timer.py:21 -- NodeUpdater: i-0bd3f18776afefed6: Got IP [LogTimer=613ms]
2019-02-03 23:08:26,280	INFO updater.py:153 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:08:26,280	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:08:26,281	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:08:31,361	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:08:36,362	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:08:36,362	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:08:41,401	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:08:46,405	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:08:46,405	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:08:51,437	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:08:56,441	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:08:56,442	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:09:01,475	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:09:06,481	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:09:06,481	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:09:11,518	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:09:16,519	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:09:16,519	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:09:21,555	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:09:26,557	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:09:26,558	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:09:30,637	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:09:35,639	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:09:35,639	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:09:35,708	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:09:40,710	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:09:40,710	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:09:44,741	INFO log_timer.py:21 -- NodeUpdater: i-0bd3f18776afefed6: Got SSH [LogTimer=78461ms]
2019-02-03 23:09:44,741	INFO updater.py:196 -- NodeUpdater: i-0bd3f18776afefed6: Syncing /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem to ~/ray_bootstrap_key.pem...
2019-02-03 23:09:44,748	INFO updater.py:263 -- NodeUpdater: Running mkdir -p ~ on 54.149.108.99...
2019-02-03 23:09:45,068	INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-node-status=syncing-files on ['i-0bd3f18776afefed6'] [LogTimer=327ms]
2019-02-03 23:09:46,496	INFO log_timer.py:21 -- NodeUpdater i-0bd3f18776afefed6: Synced /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem to ~/ray_bootstrap_key.pem [LogTimer=1749ms]
2019-02-03 23:09:46,496	INFO updater.py:196 -- NodeUpdater: i-0bd3f18776afefed6: Syncing /var/folders/05/8j_ttn9s7cqdf1cdnr8mq6w40n25bb/T/ray-bootstrap-tlxxml3r to ~/ray_bootstrap_config.yaml...
2019-02-03 23:09:46,497	INFO updater.py:263 -- NodeUpdater: Running mkdir -p ~ on 54.149.108.99...
2019-02-03 23:09:46,920	INFO log_timer.py:21 -- NodeUpdater i-0bd3f18776afefed6: Synced /var/folders/05/8j_ttn9s7cqdf1cdnr8mq6w40n25bb/T/ray-bootstrap-tlxxml3r to ~/ray_bootstrap_config.yaml [LogTimer=423ms]
2019-02-03 23:09:46,921	INFO updater.py:263 -- NodeUpdater: Running docker inspect -f '{{.State.Running}}' ray-nvidia-docker-test || docker run --rm --name ray-nvidia-docker-test -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.12.0-gpu-py3 bash on 54.149.108.99...
2019-02-03 23:09:50,192	INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-node-status=setting-up on ['i-0bd3f18776afefed6'] [LogTimer=119ms]
2019-02-03 23:16:51,325	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'apt-get -y update'  on 54.149.108.99...
2019-02-03 23:16:58,444	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'apt-get -y upgrade'  on 54.149.108.99...
2019-02-03 23:20:12,752	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'apt-get install -y git wget cmake psmisc'  on 54.149.108.99...
2019-02-03 23:20:24,253	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.2-cp35-cp35m-manylinux1_x86_64.whl'  on 54.149.108.99...
2019-02-03 23:20:33,126	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'pip install boto3==1.4.8'  on 54.149.108.99...
2019-02-03 23:20:36,577	INFO updater.py:263 -- NodeUpdater: Running docker cp ~/ray_bootstrap_config.yaml ray-nvidia-docker-test:ray_bootstrap_config.yaml on 54.149.108.99...
2019-02-03 23:20:37,541	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'cp /ray_bootstrap_config.yaml ~/ray_bootstrap_config.yaml'  on 54.149.108.99...
2019-02-03 23:20:37,826	INFO updater.py:263 -- NodeUpdater: Running docker cp ~/ray_bootstrap_key.pem ray-nvidia-docker-test:ray_bootstrap_key.pem on 54.149.108.99...
2019-02-03 23:20:38,177	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'cp /ray_bootstrap_key.pem ~/ray_bootstrap_key.pem'  on 54.149.108.99...
2019-02-03 23:20:38,454	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'ray stop'  on 54.149.108.99...
2019-02-03 23:20:39,425	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'ulimit -n 65536; ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml'  on 54.149.108.99...
2019-02-03 23:20:40,923	INFO log_timer.py:21 -- NodeUpdater: i-0bd3f18776afefed6: Setup commands completed [LogTimer=654002ms]
2019-02-03 23:20:40,923	INFO log_timer.py:21 -- NodeUpdater: i-0bd3f18776afefed6: Applied config 451596001be28c222890ed76ad86454b4efe1827 [LogTimer=735298ms]
2019-02-03 23:20:41,712	INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-node-status=up-to-date on ['i-0bd3f18776afefed6'] [LogTimer=788ms]
2019-02-03 23:20:41,828	INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-runtime-config=451596001be28c222890ed76ad86454b4efe1827 on ['i-0bd3f18776afefed6'] [LogTimer=116ms]
2019-02-03 23:20:41,907	INFO commands.py:243 -- get_or_create_head_node: Head node up-to-date, IP address is: 54.149.108.99
To monitor auto-scaling activity, you can run:

  ray exec aws/example-gpu-docker.yaml --docker 'tail -n 100 -f /tmp/ray/session_*/logs/monitor*'

To open a console on the cluster:

  ray attach aws/example-gpu-docker.yaml

To ssh manually to the cluster, run:

  ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem [email protected]

2019-02-03 23:20:42,662	INFO updater.py:88 -- NodeUpdater: Waiting for IP of i-0bd3f18776afefed6...
2019-02-03 23:20:42,663	INFO log_timer.py:21 -- NodeUpdater: i-0bd3f18776afefed6: Got IP [LogTimer=225ms]
2019-02-03 23:20:42,690	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'ls; python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"' ; docker exec  ray-nvidia-docker-test /bin/sh -c 'ray stop; ray teardown ~/ray_bootstrap_config.yaml --yes --workers-only' ; sudo shutdown -h now on 54.149.108.99...
1_hello_tensorflow.ipynb
2_getting_started.ipynb
3_mnist_from_scratch.ipynb
BUILD
LICENSE
2019-02-04 07:20:44.252635: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-04 07:20:45.000924: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-04 07:20:45.001376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:00:1d.0
totalMemory: 7.44GiB freeMemory: 7.36GiB
2019-02-04 07:20:45.080395: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-04 07:20:45.080773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:00:1e.0
totalMemory: 7.44GiB freeMemory: 7.36GiB
2019-02-04 07:20:45.084265: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-02-04 07:20:45.732223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-04 07:20:45.732272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1
2019-02-04 07:20:45.732281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N N
2019-02-04 07:20:45.732287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   N N
2019-02-04 07:20:45.732639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7098 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0000:00:1d.0, compute capability: 5.2)
2019-02-04 07:20:45.733039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7098 MB memory) -> physical GPU (device: 1, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2)
tf.Tensor(-592.2466, shape=(), dtype=float32)
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
Traceback (most recent call last):
  File "/usr/local/bin/ray", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/ray/scripts/scripts.py", line 711, in main
    return cli()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/ray/scripts/scripts.py", line 486, in teardown
    teardown_cluster(cluster_config_file, yes, workers_only, cluster_name)
  File "/usr/local/lib/python3.5/dist-packages/ray/autoscaler/commands.py", line 78, in teardown_cluster
    validate_config(config)
  File "/usr/local/lib/python3.5/dist-packages/ray/autoscaler/autoscaler.py", line 670, in validate_config
    check_extraneous(config, schema)
  File "/usr/local/lib/python3.5/dist-packages/ray/autoscaler/autoscaler.py", line 661, in check_extraneous
    check_extraneous(config[k], v)
  File "/usr/local/lib/python3.5/dist-packages/ray/autoscaler/autoscaler.py", line 648, in check_extraneous
    k, list(schema.keys())))
ValueError: Unexpected config key `install_docker` not in ['image', 'container_name']
Shared connection to 54.149.108.99 closed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11528/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11543/

@perara commented Feb 11, 2019

How far is this from being pushed? 👍
Also, would this possibly support machines such as DGX-2?

@hartikainen (Contributor Author)

This is pretty much done; I just need to get my latest changes merged here. If by the question about DGX-2 you mean whether it's possible to use the autoscaler on local machines (instead of GCP/AWS), then yeah, it's possible using the local cluster setup. Does that answer your question?
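
For reference, the local cluster setup is driven by the same kind of YAML config, just with a local provider instead of AWS/GCP. Below is a rough sketch only; the type: local, head_ip, and worker_ips keys are assumptions about the local node provider and are not discussed in this thread, and the addresses are placeholders.

# Rough sketch of a local-cluster config (keys and addresses are assumptions/placeholders).
cluster_name: dgx-cluster
provider:
    type: local
    head_ip: 192.168.0.10
    worker_ips: [192.168.0.11, 192.168.0.12]
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id_rsa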

@perara commented Feb 11, 2019

@hartikainen

In our environment (a DGX-2 cluster) we use docker to run all of our experiments. In my case, I've started utilizing ray/rllib/tune, but for some reason it runs VERY SLOW (not really getting any output after initialization) and can be outperformed by a single 1080 Ti running PPO, for instance.
But I suppose the cause is the lack of GPU support in the docker containers as of now?

@@ -103,6 +104,9 @@
"file_mounts": (dict, OPTIONAL),

# List of common shell commands to run to initialize nodes.
"startup_commands": (list, OPTIONAL),
Contributor Author:

@richardliaw Does this seem reasonable to you?

Contributor Author:

The startup_commands are pretty much the same as setup_commands, except that they are run on the host instead of in the docker container. One way to achieve the same functionality would be to always run the setup_commands on the host instead of in docker, i.e. require the docker commands to be handled explicitly. I don't think that would be unreasonable either.

Contributor:

yeah I think this is fine; the main nit I have is that I'd find having both startup_commands and setup_commands confusing... is there some other naming that indicates that these commands will be run directly on the instance (and not wrapped in any way, unlike setup_commands)?

Contributor Author:

I agree. That would be solved by combining the startup and setup commands and making them always run on the host. I'm not sure what to call these to avoid confusion. Maybe initialization_commands and setup_commands?

Contributor:

yeah that works
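
To make the agreed split concrete, here is a sketch of how the two sections could look in a config. The semantics follow the discussion above (initialization_commands run directly on the host, setup_commands run inside the docker container when one is configured); the pip lines mirror the setup commands visible in the AWS log earlier, while the docker pull line is only an illustrative host-side step.

# initialization_commands run on the host instance, outside any docker wrapping.
initialization_commands:
    - docker pull tensorflow/tensorflow:1.12.0-gpu-py3   # illustrative example
# setup_commands run inside the container (or directly on the node if docker is not used).
setup_commands:
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.2-cp35-cp35m-manylinux1_x86_64.whl
    - pip install boto3==1.4.8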

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11794/

hartikainen and others added 8 commits February 11, 2019 13:11
Revert "[autoscaler] Speedups (ray-project#3720)"

This reverts commit 315edab.

docker check

add docker flag

Revert "Revert "[autoscaler] Speedups (ray-project#3720)""

This reverts commit 352d52e90e69566408c01375cc9bdb002a1e6d94.
* Update cluster_name for both aws and gcp
* Fix gcp accelerators and images such that the nvidia drivers get
  automatically installed
@hartikainen (Contributor Author)

@perara Yeah, previously, if you ran things on the autoscaler + docker, you couldn't use GPUs for the tune runs. Sounds like this might solve the problem.

@hartikainen (Contributor Author) commented Feb 11, 2019

For reference, here's a command to test this on gcp:

ray down -y ${RAY_PATH}/python/ray/autoscaler/gcp/example-gpu-docker.yaml \
    && ray exec ${RAY_PATH}/python/ray/autoscaler/gcp/example-gpu-docker.yaml \
    --docker 'python -c "import tensorflow as tf; tf.enable_eager_execution();
              print(tf.reduce_sum(tf.random_normal([1000, 1000])))"' \
    --start

@hartikainen (Contributor Author)

@richardliaw This runs as expected on GCP. Would you mind testing it on AWS, since I don't have quota for GPU machines there?

@hartikainen (Contributor Author)

Oops, that was maybe a bit too hasty a conclusion. The example-full.yaml fails due to the new initialization_commands.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11796/

@hartikainen (Contributor Author)

OK, this should be fixed now. Both gcp/example-full.yaml and gcp/example-gpu-docker.yaml run as expected.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11801/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11804/

auth_config=config["auth"],
cluster_name=config["cluster_name"],
file_mounts=config["file_mounts"],
initialization_commands=config["initialization_commands"],
Contributor:

hm, do you need to run these here?

Contributor:

I think these are already run in _get_head_node

Contributor Author:

I see, you're right. Good catch.

@richardliaw (Contributor) left a comment:

Looks good! AWS works on my end. Just one comment.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11829/

@hartikainen (Contributor Author)

This should be ready to go.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11833/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11872/

@richardliaw merged commit 729d0b2 into ray-project:master Feb 13, 2019
@hartikainen deleted the feature/nvidia-docker branch February 13, 2019 20:27