[tune] Add Real Multi-Node Tests (Stress Tests) #2877

Closed · 3 tasks done
richardliaw opened this issue Sep 15, 2018 · 3 comments
Labels: tune (Tune-related issues)

richardliaw commented Sep 15, 2018

Describe the problem

We should have tests that verify Tune keeps running, and recovers its trials, even after a node failure; a rough sketch of such a test is included below the linked PRs.

Source code / logs

#2840, #2851.
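
As a starting point, a node-failure test could look roughly like the sketch below. It assumes the ray.cluster_utils.Cluster test helper (its module path and exact API have moved between Ray versions), so the names are illustrative rather than the actual test code; the Tune experiment itself is elided, but any experiment with checkpoint_freq > 0 would do.

import ray
from ray.cluster_utils import Cluster


def test_trial_recovers_from_node_failure():
    # Simulate a two-node cluster inside a single process.
    cluster = Cluster(
        initialize_head=True,
        connect=True,
        head_node_args={"num_cpus": 2})
    worker = cluster.add_node(num_cpus=2)
    cluster.wait_for_nodes()

    # Launch a Tune experiment with checkpoint_freq > 0 here so that
    # trials can be restored (see the script later in this thread).

    # Kill the worker mid-run; any trial placed on it should be restored
    # from its last checkpoint on a surviving node instead of being lost.
    cluster.remove_node(worker)

    # Assert that all trials still reach TERMINATED (omitted).
    ray.shutdown()
    cluster.shutdown()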

@richardliaw richardliaw self-assigned this Sep 15, 2018
@richardliaw (Contributor, author)

Done with #3309.

@richardliaw (Contributor, author)

We also need to actually test on a real multi-node cluster; right now, I think any trial that resumes on a node other than the one that wrote its checkpoint will restart from scratch.

These tests should perhaps go into the application stress tests, although there's no easy way to kill and start new nodes through the autoscaler... (one rough workaround is sketched below).
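
One way to approximate killing a node on a real autoscaler-launched cluster while the stress test is running is to terminate a worker instance out from under Ray and let the autoscaler replace it. A rough boto3 sketch follows; the tag filter is an assumption about how the autoscaler tags worker instances (check the actual tags on your instances in the EC2 console), and the region matches the cluster YAML in the next comment.

import random

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find running worker instances belonging to the cluster. The exact tag
# key/value the autoscaler sets varies by Ray version, so verify it first.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:ray-node-type", "Values": ["worker"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])["Reservations"]
workers = [
    instance["InstanceId"]
    for reservation in reservations
    for instance in reservation["Instances"]
]

# Terminate one worker at random to simulate a node failure mid-run.
if workers:
    victim = random.choice(workers)
    ec2.terminate_instances(InstanceIds=[victim])
    print("Terminated worker instance:", victim)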

@richardliaw richardliaw changed the title [tune] Add Multi-Node Tests for Tune [tune] Add Real Multi-Node Tests (Stress Tests) Jan 7, 2019
@richardliaw (Contributor, author)

import json
import os
import random
import time

import numpy as np

import ray
from ray.tune import Trainable, run_experiments, sample_from
from ray.tune.schedulers import AsyncHyperBandScheduler


class MyTrainableClass(Trainable):
    def _setup(self, config):
        self.timestep = 0

    def _train(self):
        self.timestep += 1
        v = np.tanh(float(self.timestep) / self.config["width"])
        v *= self.config["height"]
        time.sleep(1)  # keep each training iteration running for at least a second
        return {"episode_reward_mean": v}

    def _save(self, checkpoint_dir):
        path = os.path.join(checkpoint_dir, "checkpoint")
        with open(path, "w") as f:
            f.write(json.dumps({"timestep": self.timestep}))
        return path

    def _restore(self, checkpoint_path):
        with open(checkpoint_path) as f:
            self.timestep = json.loads(f.read())["timestep"]


if __name__ == "__main__":
    ray.init(redis_address="localhost:6379")

    run_experiments(
        {
            "asynchyperband_test": {
                "run": MyTrainableClass,
                "stop": {
                    "training_iteration": 20,
                },
                "num_samples": 10,
                "resources_per_trial": {
                    "cpu": 2,
                    "gpu": 0
                },
                "checkpoint_freq": 3,
                "config": {
                    "width": sample_from(
                        lambda spec: 10 + int(90 * random.random())),
                    "height": sample_from(
                        lambda spec: int(100 * random.random())),
                },
            }
        }, verbose=False)

with the following cluster config:

cluster_name: install2

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 2

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 2

target_utilization_fraction: 0.8

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-6d720012
    InstanceMarketOptions:
        MarketType: spot
    # You can provision additional disk space with a conf as follows
    # BlockDeviceMappings:
    #     - DeviceName: /dev/sda1
    #       Ebs:
    #           VolumeSize: 100

    # Additional options in the boto docs.

worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: ami-e07e779a #ami-d884b1bd  # Amazon Deep Learning AMI (Ubuntu)
    InstanceMarketOptions:
        MarketType: spot
        # Additional options can be found in the boto docs, e.g.
        #   SpotOptions:
        #       MaxPrice: MAX_HOURLY_PRICE

    # Additional options in the boto docs.

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
}

# List of shell commands to run to set up nodes.
setup_commands:
    - source activate tensorflow_p36 && ray ||  pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.1-cp36-cp36m-manylinux1_x86_64.whl

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install boto3==1.4.8  # 1.4.8 adds InstanceMarketOptions

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - source activate tensorflow_p36 && ray stop
    - source activate tensorflow_p36 && ulimit -c unlimited && ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --num-cpus=0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - source activate tensorflow_p36 && ray stop
    - source activate tensorflow_p36 && ray start --redis-address=$RAY_HEAD_IP:6379 #--num-cpus=32
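
For reference, one way to reproduce this setup: launch the cluster with "ray up" on the YAML above, then run the script from the head node (for example via "ray submit", or after attaching with "ray attach"), so that its ray.init(redis_address="localhost:6379") call connects to the running cluster. Terminating a worker instance mid-run (as sketched in the previous comment) then exercises checkpoint-based trial recovery.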
