
[autoscaler] docker run options #3921

Merged
merged 23 commits into ray-project:master from feature/nvidia-docker on Feb 13, 2019

Conversation

@hartikainen (Contributor) commented Feb 1, 2019

What do these changes do?

Adds support for docker run options, allowing, for example, the use of nvidia-docker.
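
For illustration, here is a sketch of what a docker section with run options could look like in a cluster YAML. The run_options key and its placement are assumptions based on the PR title and the --runtime=nvidia flag visible in the test logs further down; the pre-existing schema only accepted image and container_name, as the install_docker error later in this thread shows.

# Sketch only: key names such as run_options are assumptions, not necessarily the merged schema.
docker:
    image: tensorflow/tensorflow:1.12.0-gpu-py3   # image used in the test runs below
    container_name: ray-nvidia-docker-test
    run_options:
        - --runtime=nvidia                        # what nvidia-docker needs for GPU access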

TODO:

  • Test install_docker (broken on master but works when reverting [autoscaler] Speedups #3720)
  • Test nvidia docker on gce
  • Test nvidia docker on ec2
  • Add ray exec --docker command

Related issue number

#2657


# List of shell commands to run to set up nodes.
setup_commands:
# Consider uncommenting these if you also want to run apt-get commands during setup
Contributor Author:

TODO: Clean this


# List of shell commands to run to set up nodes.
setup_commands:
# Note: if you're developing Ray, you probably want to create an AMI that
Contributor Author:

TODO: Check these

@richardliaw self-assigned this Feb 1, 2019
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11367/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11428/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11469/

@richardliaw (Contributor)

For fast reference:

ray exec aws/example-gpu.yaml --docker 'python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"' --start

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11485/

@hartikainen force-pushed the feature/nvidia-docker branch 2 times, most recently from 19548a3 to 996954c on February 4, 2019 01:55
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11490/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11491/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11495/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11503/

@richardliaw (Contributor) commented Feb 4, 2019

AWS works here:

ray --logging-level=DEBUG exec aws/example-gpu-docker.yaml --docker 'ls; python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"' --start --stop

2019-02-03 23:08:24,412	INFO commands.py:173 -- get_or_create_head_node: Launching new head node...
2019-02-03 23:08:25,624	INFO commands.py:186 -- get_or_create_head_node: Updating files on head node...
2019-02-03 23:08:25,625	INFO updater.py:126 -- NodeUpdater: i-0bd3f18776afefed6: Updating to 451596001be28c222890ed76ad86454b4efe1827
2019-02-03 23:08:25,626	INFO updater.py:88 -- NodeUpdater: Waiting for IP of i-0bd3f18776afefed6...
2019-02-03 23:08:25,753	INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-node-status=waiting-for-ssh on ['i-0bd3f18776afefed6'] [LogTimer=127ms]
2019-02-03 23:08:26,239	INFO log_timer.py:21 -- NodeUpdater: i-0bd3f18776afefed6: Got IP [LogTimer=613ms]
2019-02-03 23:08:26,280	INFO updater.py:153 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:08:26,280	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:08:26,281	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:08:31,361	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:08:36,362	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:08:36,362	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:08:41,401	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:08:46,405	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:08:46,405	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:08:51,437	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:08:56,441	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:08:56,442	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:09:01,475	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:09:06,481	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:09:06,481	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:09:11,518	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:09:16,519	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:09:16,519	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:09:21,555	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:09:26,557	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:09:26,558	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:09:30,637	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:09:35,639	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:09:35,639	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:09:35,708	DEBUG updater.py:173 -- NodeUpdater: i-0bd3f18776afefed6: SSH not up, retrying: (Exit Status 255): ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=5s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=yes [email protected] bash --login -c 'set -i || true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && uptime'
2019-02-03 23:09:40,710	DEBUG updater.py:159 -- NodeUpdater: i-0bd3f18776afefed6: Waiting for SSH...
2019-02-03 23:09:40,710	INFO updater.py:263 -- NodeUpdater: Running uptime on 54.149.108.99...
2019-02-03 23:09:44,741	INFO log_timer.py:21 -- NodeUpdater: i-0bd3f18776afefed6: Got SSH [LogTimer=78461ms]
2019-02-03 23:09:44,741	INFO updater.py:196 -- NodeUpdater: i-0bd3f18776afefed6: Syncing /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem to ~/ray_bootstrap_key.pem...
2019-02-03 23:09:44,748	INFO updater.py:263 -- NodeUpdater: Running mkdir -p ~ on 54.149.108.99...
2019-02-03 23:09:45,068	INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-node-status=syncing-files on ['i-0bd3f18776afefed6'] [LogTimer=327ms]
2019-02-03 23:09:46,496	INFO log_timer.py:21 -- NodeUpdater i-0bd3f18776afefed6: Synced /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem to ~/ray_bootstrap_key.pem [LogTimer=1749ms]
2019-02-03 23:09:46,496	INFO updater.py:196 -- NodeUpdater: i-0bd3f18776afefed6: Syncing /var/folders/05/8j_ttn9s7cqdf1cdnr8mq6w40n25bb/T/ray-bootstrap-tlxxml3r to ~/ray_bootstrap_config.yaml...
2019-02-03 23:09:46,497	INFO updater.py:263 -- NodeUpdater: Running mkdir -p ~ on 54.149.108.99...
2019-02-03 23:09:46,920	INFO log_timer.py:21 -- NodeUpdater i-0bd3f18776afefed6: Synced /var/folders/05/8j_ttn9s7cqdf1cdnr8mq6w40n25bb/T/ray-bootstrap-tlxxml3r to ~/ray_bootstrap_config.yaml [LogTimer=423ms]
2019-02-03 23:09:46,921	INFO updater.py:263 -- NodeUpdater: Running docker inspect -f '{{.State.Running}}' ray-nvidia-docker-test || docker run --rm --name ray-nvidia-docker-test -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.12.0-gpu-py3 bash on 54.149.108.99...
2019-02-03 23:09:50,192	INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-node-status=setting-up on ['i-0bd3f18776afefed6'] [LogTimer=119ms]
2019-02-03 23:16:51,325	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'apt-get -y update'  on 54.149.108.99...
2019-02-03 23:16:58,444	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'apt-get -y upgrade'  on 54.149.108.99...
2019-02-03 23:20:12,752	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'apt-get install -y git wget cmake psmisc'  on 54.149.108.99...
2019-02-03 23:20:24,253	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.2-cp35-cp35m-manylinux1_x86_64.whl'  on 54.149.108.99...
2019-02-03 23:20:33,126	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'pip install boto3==1.4.8'  on 54.149.108.99...
2019-02-03 23:20:36,577	INFO updater.py:263 -- NodeUpdater: Running docker cp ~/ray_bootstrap_config.yaml ray-nvidia-docker-test:ray_bootstrap_config.yaml on 54.149.108.99...
2019-02-03 23:20:37,541	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'cp /ray_bootstrap_config.yaml ~/ray_bootstrap_config.yaml'  on 54.149.108.99...
2019-02-03 23:20:37,826	INFO updater.py:263 -- NodeUpdater: Running docker cp ~/ray_bootstrap_key.pem ray-nvidia-docker-test:ray_bootstrap_key.pem on 54.149.108.99...
2019-02-03 23:20:38,177	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'cp /ray_bootstrap_key.pem ~/ray_bootstrap_key.pem'  on 54.149.108.99...
2019-02-03 23:20:38,454	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'ray stop'  on 54.149.108.99...
2019-02-03 23:20:39,425	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'ulimit -n 65536; ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml'  on 54.149.108.99...
2019-02-03 23:20:40,923	INFO log_timer.py:21 -- NodeUpdater: i-0bd3f18776afefed6: Setup commands completed [LogTimer=654002ms]
2019-02-03 23:20:40,923	INFO log_timer.py:21 -- NodeUpdater: i-0bd3f18776afefed6: Applied config 451596001be28c222890ed76ad86454b4efe1827 [LogTimer=735298ms]
2019-02-03 23:20:41,712	INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-node-status=up-to-date on ['i-0bd3f18776afefed6'] [LogTimer=788ms]
2019-02-03 23:20:41,828	INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-runtime-config=451596001be28c222890ed76ad86454b4efe1827 on ['i-0bd3f18776afefed6'] [LogTimer=116ms]
2019-02-03 23:20:41,907	INFO commands.py:243 -- get_or_create_head_node: Head node up-to-date, IP address is: 54.149.108.99
To monitor auto-scaling activity, you can run:

  ray exec aws/example-gpu-docker.yaml --docker 'tail -n 100 -f /tmp/ray/session_*/logs/monitor*'

To open a console on the cluster:

  ray attach aws/example-gpu-docker.yaml

To ssh manually to the cluster, run:

  ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem [email protected]

2019-02-03 23:20:42,662	INFO updater.py:88 -- NodeUpdater: Waiting for IP of i-0bd3f18776afefed6...
2019-02-03 23:20:42,663	INFO log_timer.py:21 -- NodeUpdater: i-0bd3f18776afefed6: Got IP [LogTimer=225ms]
2019-02-03 23:20:42,690	INFO updater.py:263 -- NodeUpdater: Running docker exec  ray-nvidia-docker-test /bin/sh -c 'ls; python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"' ; docker exec  ray-nvidia-docker-test /bin/sh -c 'ray stop; ray teardown ~/ray_bootstrap_config.yaml --yes --workers-only' ; sudo shutdown -h now on 54.149.108.99...
1_hello_tensorflow.ipynb
2_getting_started.ipynb
3_mnist_from_scratch.ipynb
BUILD
LICENSE
2019-02-04 07:20:44.252635: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-04 07:20:45.000924: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-04 07:20:45.001376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:00:1d.0
totalMemory: 7.44GiB freeMemory: 7.36GiB
2019-02-04 07:20:45.080395: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-04 07:20:45.080773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:00:1e.0
totalMemory: 7.44GiB freeMemory: 7.36GiB
2019-02-04 07:20:45.084265: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-02-04 07:20:45.732223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-04 07:20:45.732272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1
2019-02-04 07:20:45.732281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N N
2019-02-04 07:20:45.732287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   N N
2019-02-04 07:20:45.732639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7098 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0000:00:1d.0, compute capability: 5.2)
2019-02-04 07:20:45.733039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7098 MB memory) -> physical GPU (device: 1, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2)
tf.Tensor(-592.2466, shape=(), dtype=float32)
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
Traceback (most recent call last):
  File "/usr/local/bin/ray", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/ray/scripts/scripts.py", line 711, in main
    return cli()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/ray/scripts/scripts.py", line 486, in teardown
    teardown_cluster(cluster_config_file, yes, workers_only, cluster_name)
  File "/usr/local/lib/python3.5/dist-packages/ray/autoscaler/commands.py", line 78, in teardown_cluster
    validate_config(config)
  File "/usr/local/lib/python3.5/dist-packages/ray/autoscaler/autoscaler.py", line 670, in validate_config
    check_extraneous(config, schema)
  File "/usr/local/lib/python3.5/dist-packages/ray/autoscaler/autoscaler.py", line 661, in check_extraneous
    check_extraneous(config[k], v)
  File "/usr/local/lib/python3.5/dist-packages/ray/autoscaler/autoscaler.py", line 648, in check_extraneous
    k, list(schema.keys())))
ValueError: Unexpected config key `install_docker` not in ['image', 'container_name']
Shared connection to 54.149.108.99 closed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11528/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11543/

@perara commented Feb 11, 2019

How far is this from being pushed? 👍
Also, would this possibly support machines such as DGX-2?

@hartikainen (Contributor Author)

This is pretty much done; I just need to get my latest changes merged here. If by the question about DGX-2 you mean whether it's possible to use the autoscaler on local machines (instead of GCP/AWS), then yeah, it's possible using the local cluster setup. Does that answer your question?
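
For reference, the local cluster setup is driven by the same kind of YAML config, just with a local provider instead of AWS/GCP. Below is a rough sketch only; the type: local, head_ip, and worker_ips keys are assumptions about the local node provider and are not discussed in this thread, and the addresses are placeholders.

# Rough sketch of a local-cluster config (keys and addresses are assumptions/placeholders).
cluster_name: dgx-cluster
provider:
    type: local
    head_ip: 192.168.0.10
    worker_ips: [192.168.0.11, 192.168.0.12]
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id_rsa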

@perara commented Feb 11, 2019

@hartikainen

In our environment (a DGX-2 cluster) we use docker to run all of our experiments. In my case, I've started utilizing ray/rllib/tune, but for some reason it runs VERY SLOW (not really getting any output after initialization) and can be outperformed by a single 1080 Ti running PPO, for instance.
But I suppose the cause is the lack of GPU support in the docker containers as of now?

@@ -103,6 +104,9 @@
"file_mounts": (dict, OPTIONAL),

# List of common shell commands to run to initialize nodes.
"startup_commands": (list, OPTIONAL),
Contributor Author:

@richardliaw Does this seem reasonable to you?

Contributor Author:

The startup_commands are pretty much the same as setup_commands, except that they are run on the host instead of in the docker container. One way to achieve the same functionality would be to always run the setup_commands on the host instead of in docker, i.e. require the docker commands to be handled explicitly. I don't think that would be unreasonable either.

Contributor:

yeah I think this is fine; the main nit I have is that I'd find having both startup_commands and setup_commands confusing... is there some other naming that indicates that these commands will be run directly on the instance (and not wrapped in any way, unlike setup_commands)?

Contributor Author:

I agree. That would be solved by combining the startup and setup commands and making them always run on the host. I'm not sure what to call these to avoid confusion. Maybe initialization_commands and setup_commands?

Contributor:

yeah that works
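
To make the agreed split concrete, here is a sketch of how the two sections could look in a config. The semantics follow the discussion above (initialization_commands run directly on the host, setup_commands run inside the docker container when one is configured); the pip lines mirror the setup commands visible in the AWS log earlier, while the docker pull line is only an illustrative host-side step.

# initialization_commands run on the host instance, outside any docker wrapping.
initialization_commands:
    - docker pull tensorflow/tensorflow:1.12.0-gpu-py3   # illustrative example
# setup_commands run inside the container (or directly on the node if docker is not used).
setup_commands:
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.2-cp35-cp35m-manylinux1_x86_64.whl
    - pip install boto3==1.4.8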

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11794/

hartikainen and others added 8 commits February 11, 2019 13:11
Revert "[autoscaler] Speedups (ray-project#3720)"

This reverts commit 315edab.

docker check

add docker flag

Revert "Revert "[autoscaler] Speedups (ray-project#3720)""

This reverts commit 352d52e90e69566408c01375cc9bdb002a1e6d94.
* Update cluster_name for both aws and gcp
* Fix gcp accelerators and images such that the nvidia drivers get
  automatically installed
@hartikainen (Contributor Author)

@perara Yeah, previously, if you ran things on the autoscaler + docker, you couldn't use GPUs for the tune runs. Sounds like this might solve the problem.

@hartikainen (Contributor Author) commented Feb 11, 2019

For reference, here's a command to test this on gcp:

ray down -y ${RAY_PATH}/python/ray/autoscaler/gcp/example-gpu-docker.yaml \
    && ray exec ${RAY_PATH}/python/ray/autoscaler/gcp/example-gpu-docker.yaml \
    --docker 'python -c "import tensorflow as tf; tf.enable_eager_execution();
              print(tf.reduce_sum(tf.random_normal([1000, 1000])))"' \
    --start

@hartikainen (Contributor Author)

@richardliaw This runs as expected on GCP. Would you mind testing it on AWS, since I don't have quota for GPU machines there?

@hartikainen (Contributor Author)

Oops, that was maybe a bit too hasty a conclusion. The example-full.yaml fails due to the new initialization_commands.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11796/

@hartikainen (Contributor Author)

OK, this should be fixed now. Both gcp/example-full.yaml and gcp/example-gpu-docker.yaml run as expected.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11801/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11804/

auth_config=config["auth"],
cluster_name=config["cluster_name"],
file_mounts=config["file_mounts"],
initialization_commands=config["initialization_commands"],
Contributor:

hm, do you need to run these here?

Contributor:

I think these are already run in _get_head_node

Contributor Author:

I see, you're right. Good catch.

@richardliaw (Contributor) left a comment:

Looks good! AWS works on my end. Just one comment.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11829/

@hartikainen (Contributor Author)

This should be ready to go.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11833/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11872/

@richardliaw merged commit 729d0b2 into ray-project:master Feb 13, 2019
@hartikainen deleted the feature/nvidia-docker branch February 13, 2019 20:27