Syne Tune FAQ

What are the different installation options supported?

To install Syne Tune with minimal dependencies from pip, you can simply do:

pip install 'syne-tune[core]'

If you also want to install our own Gaussian process based optimizers, Ray Tune, or the Bore optimizer, you can run pip install 'syne-tune[X]' where X can be

  • gpsearchers: For built-in Gaussian process based optimizers
  • raytune: For Ray Tune optimizers
  • benchmarks: For installing dependencies required to run all benchmarks
  • extra: For installing all the above
  • bore: For Bore optimizer
  • kde: For KDE optimizer

For instance, pip install 'syne-tune[gpsearchers]' will install Syne Tune along with the built-in Gaussian process based optimizers.

To install the latest version from git, run the following:

pip install git+https://github.com/awslabs/syne-tune.git

For local development, we recommend the following setup, which makes it easy to test your changes:

pip install --upgrade pip
git clone https://github.com/awslabs/syne-tune.git
cd syne-tune
pip install -e '.[extra]'

How can I run on AWS and SageMaker?

If you want to launch experiments on SageMaker rather than on your local machine, you will need access to AWS and SageMaker on your machine. Make sure that:

  • awscli is installed (see this link)
  • AWS credentials have been set properly (see this link).
  • The necessary SageMaker role has been created (see this page for instructions; if you've created a SageMaker notebook in the past, this role should already have been created for you).

The following command should run without error if your credentials are available:

python -c "import boto3; print(boto3.client('sagemaker').list_training_jobs(MaxResults=1))"

You can also run the following example that evaluates trials on SageMaker to test your setup.

python examples/launch_height_sagemaker.py

What are the metrics reported by default when calling the Reporter?

Whenever you call the reporter to log a result, the worker time-stamp, the worker time since the creation of the reporter and the number of times the reporter was called are logged under the fields st_worker_timestamp, st_worker_time, and st_worker_iter. In addition, when running on SageMaker, a dollar-cost estimate is logged under the field st_worker_cost.

To see this behavior, simply call the reporter and inspect the logged metrics:

from syne_tune.report import Reporter
reporter = Reporter()
for step in range(3):
   reporter(step=step, metric=float(step) / 3)
   
# [tune-metric]: {"step": 0, "metric": 0.0, "st_worker_timestamp": 1644311849.6071281, "st_worker_time": 0.0001048670000045604, "st_worker_iter": 0}
# [tune-metric]: {"step": 1, "metric": 0.3333333333333333, "st_worker_timestamp": 1644311849.6071832, "st_worker_time": 0.00015910100000837701, "st_worker_iter": 1}
# [tune-metric]: {"step": 2, "metric": 0.6666666666666666, "st_worker_timestamp": 1644311849.60733, "st_worker_time": 0.00030723599996917983, "st_worker_iter": 2}

How can I utilize multiple GPUs?

To utilize multiple GPUs, you can use the LocalBackend, which runs trials on the GPUs available on the local machine. You can also run on a remote AWS machine with multiple GPUs using the local backend and the remote launcher (see I don’t want to wait, how can I launch the tuning on a remote machine?), or run with the SageMaker backend, which spins up one training job per trial.

When evaluating trials on a local machine with LocalBackend, each trial is by default allocated to the least occupied GPU by setting the CUDA_VISIBLE_DEVICES environment variable.
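
For illustration, here is a minimal sketch of a tuning loop using the local backend (the entry point train_script.py, the metric name, and the hyperparameter range are hypothetical, and it assumes a recent Syne Tune version where Tuner takes a trial_backend argument). With n_workers=4, up to four trials run in parallel, each pinned to a free GPU on a four-GPU machine:

from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend import LocalBackend
from syne_tune.config_space import loguniform
from syne_tune.optimizer.baselines import RandomSearch

# hypothetical search space and training script
config_space = {"learning_rate": loguniform(1e-6, 1e-2), "epochs": 10}

tuner = Tuner(
    trial_backend=LocalBackend(entry_point="train_script.py"),
    scheduler=RandomSearch(config_space, metric="loss", mode="min"),
    stop_criterion=StoppingCriterion(max_wallclock_time=3600),
    n_workers=4,  # e.g., one concurrent trial per GPU on a 4-GPU machine
)
tuner.run()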

What is the default mode when performing optimization?

The default mode is min, so the target metric is minimized. The mode can be configured when instantiating a scheduler.
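
For instance, to maximize a validation accuracy instead (the scheduler choice and metric name below are only illustrative), pass mode='max' to the scheduler:

from syne_tune.config_space import uniform
from syne_tune.optimizer.baselines import RandomSearch

config_space = {"learning_rate": uniform(1e-4, 1e-1)}
# maximize the reported metric instead of the default minimization
scheduler = RandomSearch(config_space, metric="val_accuracy", mode="max")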

How are trials evaluated on a local machine?

When trials are executed locally (e.g., when LocalBackend is used), each trial is evaluated in a separate sub-process. As such, the number of configurations evaluated concurrently (set by n_workers) should account for the capacity of the machine where the trials are executed.

Where can I find the output of the tuning?

When running locally, the output of the tuning is saved under ~/syne-tune/{tuner-name}/ by default. When running remotely on SageMaker, the output of the tuning is saved under /opt/ml/checkpoints/ by default and the tuning output is synced regularly to s3://{sagemaker-default-bucket}/syne-tune/{tuner-name}/.

If you run remote tuning via the CLI, the tuning output is synced to s3://{sagemaker-default-bucket}/syne-tune/{experiment-name}/{tuner-name}/, where experiment-name is the prefix of tuner-name without the datetime extension (in the example above, experiment-name = 'train-height').

How can I change the default output folder where tuning results are stored?

To change the path where tuning results are written, you can set the environment variable SYNETUNE_FOLDER to the folder that you want.

For instance, the following runs a tuning whose result files are written under ~/new-syne-tune-folder:

export SYNETUNE_FOLDER="~/new-syne-tune-folder"
python examples/launch_height_baselines.py

You can also do the following to permanently change the output folder of Syne Tune:

echo 'export SYNETUNE_FOLDER="~/new-syne-tune-folder"' >> ~/.bashrc && source ~/.bashrc

What does the output of the tuning contain?

Syne Tune stores the following files: metadata.json, results.csv.zip, and tuner.dill, which contain respectively the metadata of the tuning job, the results obtained at each time-step, and the state of the tuner.

How can I enable trial checkpointing?

Since trials may be paused and resumed (either by schedulers or when using spot instances), the user may checkpoint intermediate results to avoid restarting computation from scratch. Model outputs and checkpoints must be written into a specific local path given by the command line argument st_checkpoint_dir. Saving and loading model checkpoints from this directory lets a trial restore its state when the job is stopped and resumed (setting the folder correctly and uniquely per trial is the responsibility of the backend). See checkpoint_example.py for a working example of a tuning script with checkpointing enabled.
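
As a minimal sketch (the loss computation and the epochs argument are hypothetical; only st_checkpoint_dir is the argument supplied by Syne Tune), a checkpointing training script could look as follows:

import argparse
import json
import os

from syne_tune.report import Reporter

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--st_checkpoint_dir", type=str, default=None)
args = parser.parse_args()

report = Reporter()
start_epoch = 0

# resume from a previous checkpoint if the backend-provided directory contains one
if args.st_checkpoint_dir is not None:
    ckpt_file = os.path.join(args.st_checkpoint_dir, "state.json")
    if os.path.exists(ckpt_file):
        with open(ckpt_file) as f:
            start_epoch = json.load(f)["epoch"] + 1

for epoch in range(start_epoch, args.epochs):
    loss = 1.0 / (epoch + 1)  # placeholder for real training
    if args.st_checkpoint_dir is not None:
        os.makedirs(args.st_checkpoint_dir, exist_ok=True)
        with open(os.path.join(args.st_checkpoint_dir, "state.json"), "w") as f:
            json.dump({"epoch": epoch}, f)
    report(epoch=epoch + 1, loss=loss)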

When using the SageMaker backend or tuning remotely, we use the SageMaker checkpoint mechanism under the hood to sync local checkpoints to S3. Checkpoints are synced to s3://{sagemaker-default-bucket}/syne-tune/{tuner-name}/{trial-id}/, where sagemaker-default-bucket is the default bucket for SageMaker. This can also be configured; for instance, if you launch experiments with the benchmarking CLI, the files are written to s3://{sagemaker-default-bucket}/syne-tune/{experiment-name}/{tuner-name}/{trial-id}/.

Which schedulers make use of checkpointing?

Checkpointing means storing the state of a trial (i.e., model parameters, optimizer or learning rate scheduler parameters), so that it can be paused and potentially resumed at a later point in time without having to start training from scratch. Syne Tune checkpointing support is detailed in checkpointing and checkpoint_example.py. Once checkpointing to and from a local file is added to a training script, Syne Tune manages the checkpoints (e.g., copying them to and from cloud storage for distributed backends).

The following schedulers make use of checkpointing:

  • Promotion-based Hyperband: HyperbandScheduler(type='promotion', ...). The code runs without checkpointing, but in this case any trial which is resumed is started from scratch. For example, if a trial was paused after 9 epochs of training and is resumed later, training starts from scratch and the first 9 epochs are wasted effort. Moreover, extra variance is introduced by starting from scratch, since weights may be initialized differently. It is not recommended to run promotion-based Hyperband without checkpointing.
  • Population-based training: PopulationBasedTraining. PBT does not work without checkpointing.
  • Synchronous Hyperband: SynchronousGeometricHyperbandScheduler. This code runs without checkpointing, but wastes effort in the same sense as promotion-based asynchronous Hyperband.

Is the tuner checkpointed?

Yes. During tuning, the tuner state is regularly saved to the experiment path under tuner.dill (every 10 seconds by default, configurable with results_update_interval). This makes it possible to use spot instances when running a tuning remotely with the remote launcher. It also allows you to resume a past experiment or analyze the state of the scheduler at any point.
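
If you want to inspect this state offline, one possible sketch (assuming the file was written by the same Syne Tune version you have installed) is to deserialize it with dill:

import dill

# e.g., ~/syne-tune/{tuner-name}/tuner.dill
with open("tuner.dill", "rb") as f:
    tuner = dill.load(f)
print(tuner.scheduler)  # scheduler state at the time of the last save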

Where can I find the output of my trials?

When running LocalBackend locally, results of trials are saved under ~/syne-tune/{tuner-name}/{trial-id}/, which contains the following files:

  • config.json: configuration that is being evaluated in the trial
  • std.err: standard error
  • std.out: standard output

In addition, any checkpointing files used by a training script, such as intermediate model checkpoints, are also located there. This is exemplified in the following example:

tree ~/syne-tune/train-height-2022-01-12-11-08-40-971/
~/syne-tune/train-height-2022-01-12-11-08-40-971/
├── 0
│   ├── config.json
│   ├── std.err
│   ├── std.out
│   └── stop
├── 1
│   ├── config.json
│   ├── std.err
│   ├── std.out
│   └── stop
├── 2
│   ├── config.json
│   ├── std.err
│   ├── std.out
│   └── stop
├── 3
│   ├── config.json
│   ├── std.err
│   ├── std.out
│   └── stop
├── metadata.json
├── results.csv.zip
└── tuner.dill

When running tuning remotely with the remote launcher, only config.json, metadata.json, results.csv.zip and tuner.dill are synced with S3, unless store_logs_localbackend is set to True, in which case the trial logs and information are also persisted.

How can I plot the results of a tuning?

The easiest way to plot the result of a tuning experiment is to call the following:

from syne_tune import Tuner
from syne_tune.experiments import load_experiment

tuner = Tuner(
    ...
    tuner_name="plot-results-demo",
)
tuner.run()
tuning_experiment = load_experiment(tuner.name)
tuning_experiment.plot()

This generates a plot of the best value found over time. Note that you can also plot results while the experiment is running, as results are updated continuously.

How can I specify additional tuning metadata?

By default, Syne Tune stores the time, the names and modes of the metrics being tuned, the name of the entry point, the backend name, and the scheduler name. You can also add custom metadata to your tuning job by setting metadata in Tuner as follows:

tuner = Tuner(
    ...
    tuner_name="plot-results-demo",
    metadata={"tag": "special-tag", "user": "alice"},
)

All Syne Tune and user metadata are saved when the tuner starts under metadata.json.
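
To read the metadata back, a sketch (assuming the object returned by load_experiment exposes the stored metadata as an attribute) is:

from syne_tune.experiments import load_experiment

tuning_experiment = load_experiment("plot-results-demo")
print(tuning_experiment.metadata)  # Syne Tune metadata plus the custom "tag" and "user" entries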

How do I append additional information to the results which are stored?

Results are processed and stored by callbacks passed to Tuner, in particular see tuner_callback.py. In order to add more information to these results, you can inherit from StoreResultsCallback. A good example is given in searcher_callback.py.
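
As a sketch (the extra field is hypothetical, and the on_trial_result signature is assumed to match the base class), you could subclass it as follows:

from syne_tune.tuner_callback import StoreResultsCallback


class ExtendedResultsCallback(StoreResultsCallback):
    def on_trial_result(self, trial, status, result, decision):
        # enrich the result dict before it is stored
        result["custom_tag"] = "my-experiment"  # hypothetical extra column
        super().on_trial_result(trial, status, result, decision)


# pass it to the Tuner (note: this replaces the default results callback):
# tuner = Tuner(..., callbacks=[ExtendedResultsCallback()])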

If you run experiments with tabulated benchmarks using the SimulatorBackend, as demonstrated in launch_nasbench201_simulated.py, results are stored by SimulatorCallback instead, and you need to inherit from this class, as shown in searcher_callback.py.

I don’t want to wait, how can I launch the tuning on a remote machine?

You can use the remote launcher to launch a tuning on a remote machine. The remote launcher supports both LocalBackend and SageMakerBackend. In the former case, multiple trials are evaluated on the remote machine (one use case being to tune on a single beefy machine); in the latter case, trials are evaluated as separate SageMaker training jobs.

See launch_height_sagemaker_remotely.py for an example of how to run tuning with the remote launcher.
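
As a hedged sketch (assuming, as in that example, that RemoteLauncher wraps an already configured Tuner and that run(wait=False) returns once the remote job is submitted; the instance type is an arbitrary choice):

from syne_tune.remote.remote_launcher import RemoteLauncher

# tuner is a fully configured syne_tune.Tuner instance
remote_launcher = RemoteLauncher(
    tuner=tuner,
    instance_type="ml.m5.xlarge",
)
remote_launcher.run(wait=False)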

How can I run many experiments in parallel?

You can call the remote launcher multiple times to schedule a list of experiments. In some cases, you will want more flexibility and write your experiment loop directly; see benchmark_loop for an example.

How can I access results after tuning remotely?

You can call load_experiment("{tuner-name}"), which downloads files from S3 if the experiment is not found locally. You can also sync files from S3 into the ~/syne-tune/ folder in batch, for instance by running:

aws s3 sync s3://{sagemaker-default-bucket}/syne-tune/{tuner-name}/ ~/syne-tune/  --include "*"  --exclude "*tuner.dill"

This fetches all results without the tuner state (you can omit the include and exclude flags if you also want to include the tuner state).

How can I specify dependencies to remote launcher or when using the SageMaker backend?

When you run remote code, you often need to install packages (e.g. scipy) or have custom code available.

  • To install packages, you can add a file requirements.txt in the same folder as your entry point script. All those packages will be installed by SageMaker when the Docker container starts.
  • To include custom code (for instance a library that you are working on), you can set the parameter dependencies on the remote launcher or on a SageMaker framework to a list of folders. The folders indicated will be compressed, sent to S3, and added to the Python path when the container starts. You can see launch_remote.py for an example setting dependencies in a SageMaker estimator, and the sketch below.
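
As a sketch with a SageMaker framework estimator (the training script, library folder, role, and framework versions below are placeholders; dependencies and source_dir are standard SageMaker estimator arguments):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_script.py",        # your training script
    source_dir=".",                       # a requirements.txt located here is installed in the container
    dependencies=["path/to/my_library"],  # folders compressed, uploaded to S3 and put on the Python path
    role="arn:aws:iam::<account>:role/<sagemaker-role>",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    framework_version="1.12",
    py_version="py38",
)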

How can I benchmark different methods?

Syne Tune provides a command line launcher in benchmarking/cli/launch_hpo.py, which lets you launch a range of experiments with a single command. The CLI is documented here. In order to use the CLI with your own training script, you need to add some meta-information, as explained here.

Another possibility is to write a custom loop over different methods and benchmarks; an example is provided and documented in this directory.

What different schedulers do you support? What are the main differences between them?

We refer to HPO algorithms as schedulers. A scheduler decides which configurations to assign to new trials, but also when to stop a running trial or resume a paused one. Some schedulers delegate the first decision to a searcher. The most important differences between schedulers in the single-objective case are:

  • Does the scheduler stop trials early, or pause and resume trials (HyperbandScheduler), or not (FIFOScheduler)? The former requires a resource dimension (e.g., number of epochs; size of the training set) and slightly more elaborate reporting (e.g., evaluation after every epoch), but can outperform the latter by a large margin.
  • Does the searcher suggest new configurations by uniform random sampling (searcher='random') or by sequential model-based decision-making (searcher='bayesopt', searcher='kde')? The latter can be more expensive if many trials are run, but can also be more sample-efficient (see the sketch after this list).
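
A minimal sketch contrasting these choices (the metric name, resource attribute, and hyperparameter ranges are hypothetical):

from syne_tune.config_space import loguniform
from syne_tune.optimizer.schedulers import FIFOScheduler, HyperbandScheduler

config_space = {"learning_rate": loguniform(1e-6, 1e-2), "epochs": 100}

# no early stopping, configurations drawn at random
scheduler = FIFOScheduler(
    config_space, searcher="random", metric="val_loss", mode="min"
)

# pause-and-resume (promotion) asynchronous Hyperband with model-based search
scheduler = HyperbandScheduler(
    config_space,
    searcher="bayesopt",
    type="promotion",
    metric="val_loss",
    mode="min",
    resource_attr="epoch",
    max_t=100,
)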

An overview of this landscape is given here. Syne Tune provides a number of further schedulers beyond these; most of them can be accessed under short names from baselines.py.

How do I define the search space?

While the training script defines the function to be optimized, some care needs to be taken to define the search space for the hyperparameter optimization problem. This being a global optimization problem without gradients easily available, it is most important to reduce the number of parameters. Some advice is given here.
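
As a minimal sketch (the parameter names and ranges are illustrative), a search space is a dictionary mapping hyperparameter names to Syne Tune domains:

from syne_tune.config_space import choice, loguniform, randint

config_space = {
    "learning_rate": loguniform(1e-6, 1e-2),  # log scale for rates spanning orders of magnitude
    "num_layers": randint(1, 8),              # integer-valued
    "optimizer": choice(["adam", "sgd"]),     # categorical
    "epochs": 100,                            # constants are passed through to the training script
}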

A powerful approach is to run experiments in parallel. Namely, split your hyperparameters into groups A, B, such that HPO over B is tractable. Draw a set of N configurations from A at random, then start N HPO experiments in parallel, where in each of them the search space is over B only, while the parameters in A are fixed. Syne Tune supports massively parallel experimentation.

How can I visualize the progress of my tuning experiment with Tensorboard?

To visualize the progress of Syne Tune in Tensorboard, you can pass the TensorboardCallback to the Tuner object:

from syne_tune.callbacks.tensorboard_callback import TensorboardCallback

tuner = Tuner(
    ...
    callbacks=[TensorboardCallback()],
)

Note that you need to install TensorboardX to use this callback. You can install it with:

pip install tensorboardX

This will log all metrics that are reported in your training script via the report(...) function. Now, to open Tensorboard, run:

tensorboard --logdir ~/syne-tune/{tuner-name}/tensorboard_output

If you want to plot the cumulative optimum of the metric you want to optimize, you can pass the target_metric argument to TensorboardCallback. This will also report the best found hyperparameter configuration over time.
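
For example (the metric name is hypothetical, and the import path is the module where the callback is defined):

from syne_tune.callbacks.tensorboard_callback import TensorboardCallback

# additionally track the running best of the reported "mean_loss" metric
callback = TensorboardCallback(target_metric="mean_loss")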

How can I add a new tabular or surrogate benchmark?

To add a new dataset of tabular evaluations, you need to

  1. write a blackbox recipe able to regenerate it by extending BlackboxRecipe. In particular, you need to provide the name of the blackbox, the reference so that users are prompted to cite the appropriate paper, and code that can generate it from scratch; see lcbench.py for an example.
  2. add your new recipe class in recipes to make it available in Syne Tune.