[tune] Add Real Multi-Node Tests (Stress Tests) #2877

Closed · 3 tasks done
richardliaw opened this issue Sep 15, 2018 · 3 comments
Labels: tune (Tune-related issues)

richardliaw commented Sep 15, 2018

Describe the problem

We should have tests that verify Tune keeps running, and recovers its trials, even after a node failure; a rough sketch of such a test is included below the linked PRs.

Source code / logs

#2840, #2851.
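
As a starting point, a node-failure test could look roughly like the sketch below. It assumes the ray.cluster_utils.Cluster test helper (its module path and exact API have moved between Ray versions), so the names are illustrative rather than the actual test code; the Tune experiment itself is elided, but any experiment with checkpoint_freq > 0 would do.

import ray
from ray.cluster_utils import Cluster


def test_trial_recovers_from_node_failure():
    # Simulate a two-node cluster inside a single process.
    cluster = Cluster(
        initialize_head=True,
        connect=True,
        head_node_args={"num_cpus": 2})
    worker = cluster.add_node(num_cpus=2)
    cluster.wait_for_nodes()

    # Launch a Tune experiment with checkpoint_freq > 0 here so that
    # trials can be restored (see the script later in this thread).

    # Kill the worker mid-run; any trial placed on it should be restored
    # from its last checkpoint on a surviving node instead of being lost.
    cluster.remove_node(worker)

    # Assert that all trials still reach TERMINATED (omitted).
    ray.shutdown()
    cluster.shutdown()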

@richardliaw richardliaw self-assigned this Sep 15, 2018
@richardliaw (Contributor, author)

Done with #3309.

@richardliaw (Contributor, author)

We also need to actually test on a real multi-node cluster; right now, I think any trial that resumes on a node other than the one that wrote its checkpoint will restart from scratch.

These tests should perhaps go into the application stress tests, although there's no easy way to kill and start new nodes through the autoscaler... (one rough workaround is sketched below).
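
One way to approximate killing a node on a real autoscaler-launched cluster while the stress test is running is to terminate a worker instance out from under Ray and let the autoscaler replace it. A rough boto3 sketch follows; the tag filter is an assumption about how the autoscaler tags worker instances (check the actual tags on your instances in the EC2 console), and the region matches the cluster YAML in the next comment.

import random

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find running worker instances belonging to the cluster. The exact tag
# key/value the autoscaler sets varies by Ray version, so verify it first.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:ray-node-type", "Values": ["worker"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])["Reservations"]
workers = [
    instance["InstanceId"]
    for reservation in reservations
    for instance in reservation["Instances"]
]

# Terminate one worker at random to simulate a node failure mid-run.
if workers:
    victim = random.choice(workers)
    ec2.terminate_instances(InstanceIds=[victim])
    print("Terminated worker instance:", victim)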

@richardliaw richardliaw changed the title [tune] Add Multi-Node Tests for Tune [tune] Add Real Multi-Node Tests (Stress Tests) Jan 7, 2019
@richardliaw (Contributor, author)

import json
import os
import random
import time

import numpy as np

import ray
from ray.tune import Trainable, run_experiments, sample_from
from ray.tune.schedulers import AsyncHyperBandScheduler


class MyTrainableClass(Trainable):
    def _setup(self, config):
        self.timestep = 0

    def _train(self):
        self.timestep += 1
        v = np.tanh(float(self.timestep) / self.config["width"])
        v *= self.config["height"]
        time.sleep(1)  # keep each training iteration running for at least a second
        return {"episode_reward_mean": v}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint")
        with open(path, "w") as f:
            f.write(json.dumps({"timestep": self.timestep}))
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path) as f:
            self.timestep = json.loads(f.read())["timestep"]


if __name__ == "__main__":
    ray.init(redis_address="localhost:6379")

    run_experiments(
        {
            "asynchyperband_test": {
                "run": MyTrainableClass,
                "stop": {
                    "training_iteration": 20,
                },
                "num_samples": 10,
                "resources_per_trial": {
                    "cpu": 2,
                    "gpu": 0
                },
                "checkpoint_freq": 3,
                "config": {
                    "width": sample_from(
                        lambda spec: 10 + int(90 * random.random())),
                    "height": sample_from(
                        lambda spec: int(100 * random.random())),
                },
            }
        }, verbose=False)

with the following cluster config:

cluster_name: install2

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 2

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 2

target_utilization_fraction: 0.8

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-6d720012
    InstanceMarketOptions:
        MarketType: spot
    # You can provision additional disk space with a conf as follows
    # BlockDeviceMappings:
    #     - DeviceName: /dev/sda1
    #       Ebs:
    #           VolumeSize: 100

    # Additional options in the boto docs.

worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: ami-e07e779a #ami-d884b1bd  # Amazon Deep Learning AMI (Ubuntu)
    InstanceMarketOptions:
        MarketType: spot
        # Additional options can be found in the boto docs, e.g.
        #   SpotOptions:
        #       MaxPrice: MAX_HOURLY_PRICE

    # Additional options in the boto docs.

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
}

# List of shell commands to run to set up nodes.
setup_commands:
    - source activate tensorflow_p36 && ray ||  pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.1-cp36-cp36m-manylinux1_x86_64.whl

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install boto3==1.4.8  # 1.4.8 adds InstanceMarketOptions

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - source activate tensorflow_p36 && ray stop
    - source activate tensorflow_p36 && ulimit -c unlimited && ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --num-cpus=0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - source activate tensorflow_p36 && ray stop
    - source activate tensorflow_p36 && ray start --redis-address=$RAY_HEAD_IP:6379 #--num-cpus=32
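
For reference, one way to reproduce this setup: launch the cluster with "ray up" on the YAML above, then run the script from the head node (for example via "ray submit", or after attaching with "ray attach"), so that its ray.init(redis_address="localhost:6379") call connects to the running cluster. Terminating a worker instance mid-run (as sketched in the previous comment) then exercises checkpoint-based trial recovery.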
