[autoscaler] docker run options #3921
Merged
richardliaw merged 23 commits into ray-project:master from hartikainen:feature/nvidia-docker on Feb 13, 2019
Changes from 20 commits
Commits (23)
9cf4a7d Implement docker run_options for autoscaler (hartikainen)
8be884e Add gpu autoscaler examples (hartikainen)
f52904e Add install-nvidia-driver metadata to gcp gpu example configuration (hartikainen)
0cde2d1 add_docker_support (richardliaw)
2923307 fix (richardliaw)
8653d6a modifications_and_docs (richardliaw)
aecbeb5 rename (richardliaw)
a4e257e Update example-gpu-docker configurations (hartikainen)
ed6d117 Update gcp gpu example (hartikainen)
3cc079a Fix tensorflow docker image in gcp gpu example (hartikainen)
b725bf1 tf works (richardliaw)
1212dbf fix docker exec (richardliaw)
fab4192 Remove Docker Installation (richardliaw)
7bf9d16 Add startup_commands for autoscaler config (hartikainen)
f91f790 Add startup_commands for example-gpu-docker.yaml (hartikainen)
c8591e8 Bump up ray version in autoscaler configs (hartikainen)
b1ba569 Remove duplicate psmisc installation command from gcp example (hartikainen)
cf7ea64 Apply suggestions from code review (richardliaw)
c010f00 Rename startup_commands -> initialization_commands (hartikainen)
0cc6638 Add initialization_commands to example-full.yaml configs (hartikainen)
a2fce3e fix wheels, tests (richardliaw)
e8cd337 Remove duplicate initialization_commands from exec_cluster command (hartikainen)
29e36fd fix_test (richardliaw)
@@ -0,0 +1,114 @@
# A unique identifier for the head node and workers of this cluster.
cluster_name: gpu-docker

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 0

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 2

# The initial number of worker nodes to launch in addition to the head
# node. When the cluster is first brought up (or when it is refreshed with a
# subsequent `ray up`) this number of nodes will be started.
initial_workers: 0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "tensorflow/tensorflow:1.12.0-gpu-py3"
    container_name: "ray-nvidia-docker-test" # e.g. ray_docker
    run_options:
        - --runtime=nvidia

# The autoscaler will scale up the cluster to this target fraction of resource
# usage. For example, if a cluster of 10 nodes is 100% busy and
# target_utilization is 0.8, it would resize the cluster to 13. This fraction
# can be decreased to increase the aggressiveness of upscaling.
# This value must be less than 1.0 for scaling to happen.
target_utilization_fraction: 0.8

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes are currently spread between zones by a round-robin approach,
    # however this implementation detail should not be relied upon.
    availability_zone: us-west-2a,us-west-2b

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
#    ssh_private_key: /path/to/your/key.pem

# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
head_node:
    InstanceType: p2.xlarge
    ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0

    # You can provision additional disk space with a conf as follows
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 100

    # Additional options in the boto docs.

# Provider-specific config for worker nodes, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
worker_nodes:
    InstanceType: m5.large
    ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0

    # Run workers on spot by default. Comment this out to use on-demand.
    InstanceMarketOptions:
        MarketType: spot
        # Additional options can be found in the boto docs, e.g.
        #   SpotOptions:
        #       MaxPrice: MAX_HOURLY_PRICE

    # Additional options in the boto docs.

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# List of shell commands to run to set up nodes.
setup_commands:
    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.3-cp27-cp27mu-manylinux1_x86_64.whl
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.3-cp35-cp35m-manylinux1_x86_64.whl
    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.3-cp36-cp36m-manylinux1_x86_64.whl

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install boto3==1.4.8 # 1.4.8 adds InstanceMarketOptions

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --redis-address=$RAY_HEAD_IP:6379 --object-manager-port=8076
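For context, a minimal way to exercise a config like the one above with the ray up / ray exec / ray down commands is sketched below. The local file name example-gpu-docker.yaml and the nvidia-smi check are illustrative assumptions rather than part of this diff; per the docker section above, node commands run inside the configured container, with --runtime=nvidia passed as a docker run option.

    # Launch (or update) the cluster described by the config file.
    ray up example-gpu-docker.yaml

    # Run a quick check on the head node; with the docker settings above it should
    # land inside the GPU-enabled container (nvidia-smi is assumed to be available
    # on the tensorflow GPU image).
    ray exec example-gpu-docker.yaml "nvidia-smi"

    # Tear the cluster down when finished.
    ray down example-gpu-docker.yaml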
Review comments

hm, do you need to run these here?

I think these are already run in _get_head_node.

I see, you're right. Good catch.