[AIR/train] Use new Train API #25735

Merged
merged 89 commits into from
Jul 7, 2022
Changes from 87 commits
b39a864
Use new Train API for examples
Yard1 Jun 13, 2022
b31399e
Fix FailureConfig not being a dataclass
Yard1 Jun 14, 2022
5cc9229
Fix errors
Yard1 Jun 14, 2022
baaf8a5
Merge branch 'master' into use_new_train_api
Yard1 Jun 14, 2022
5230218
Fix
Yard1 Jun 14, 2022
ef4a3fc
Fix link
Yard1 Jun 14, 2022
f5cfe62
Fix simple example
Yard1 Jun 14, 2022
468f7e8
train loop utils
Yard1 Jun 14, 2022
4ef6302
Remove tensorboard example
Yard1 Jun 14, 2022
5db3c14
PBT test update
Yard1 Jun 14, 2022
cb805f2
WIP
Yard1 Jun 14, 2022
2f69e37
Do not use pipeline
Yard1 Jun 15, 2022
0d8eeb4
Remove callback test
Yard1 Jun 15, 2022
4a3103e
Examples tests
Yard1 Jun 15, 2022
f7f3ea8
Move tests
Yard1 Jun 15, 2022
50ca40b
Fixture fix
Yard1 Jun 15, 2022
1872f73
Merge branch 'master' into use_new_train_api
Yard1 Jun 16, 2022
10d88d3
Merge branch 'master' into use_new_train_api
Yard1 Jun 16, 2022
20b7075
CI fixes
Yard1 Jun 16, 2022
c3b7d42
Fix
Yard1 Jun 16, 2022
33f8fd1
Merge branch 'master' into use_new_train_api
Yard1 Jun 16, 2022
37b8182
Apply suggestions from code review
Yard1 Jun 16, 2022
6f8d7e0
Fix tracked checkpoint error
Yard1 Jun 16, 2022
85cb1a7
CI fixes
Yard1 Jun 16, 2022
86a71d6
Add checkpoint configuration to `RunConfig`
Yard1 Jun 20, 2022
41eb780
Add `best_checkpoint` and `dataframe` to `Result`
Yard1 Jun 20, 2022
eb2eb67
Tests, fixes
Yard1 Jun 20, 2022
024932e
Result grid tweaks
Yard1 Jun 20, 2022
abf2cdc
Extend
Yard1 Jun 20, 2022
1f1d28b
Merge branch 'ray-project:master' into more_checkpoint_configurability
Yard1 Jun 20, 2022
563bc33
Update result_grid.py
Yard1 Jun 21, 2022
d0261be
Fix
Yard1 Jun 21, 2022
56df493
Lint
Yard1 Jun 21, 2022
ef0c75a
Lint
Yard1 Jun 21, 2022
3464c93
WIP
Yard1 Jun 21, 2022
ee87c12
Renaming
Yard1 Jun 21, 2022
fe9d68e
Merge branch 'more_checkpoint_configurability' into use_new_train_api
Yard1 Jun 21, 2022
b10fe1e
Improve test coverage
Yard1 Jun 21, 2022
4dbccca
Simplify
Yard1 Jun 21, 2022
27e531c
Docstring tweak
Yard1 Jun 21, 2022
7d1abfe
Remove docstring
Yard1 Jun 21, 2022
b0dd3ba
Fix
Yard1 Jun 21, 2022
1c2e4b1
Merge branch 'more_checkpoint_configurability' into use_new_train_api
Yard1 Jun 21, 2022
5b226ab
Tweak docstring
Yard1 Jun 21, 2022
65ce1d3
Fix
Yard1 Jun 21, 2022
555f705
Merge branch 'more_checkpoint_configurability' into use_new_train_api
Yard1 Jun 21, 2022
1e1fbea
Use CheckpointStrategy
Yard1 Jun 22, 2022
3aa277d
Merge branch 'master' into more_checkpoint_configurability
Yard1 Jun 22, 2022
e19d40f
Fix
Yard1 Jun 22, 2022
5cbb15f
Merge branch 'master' into more_checkpoint_configurability
Yard1 Jun 24, 2022
fd96174
dataframe -> metrics_dataframe
Yard1 Jun 24, 2022
8d5f1b3
CheckpointStrategy -> CheckpointConfig
Yard1 Jun 24, 2022
0482bce
Missed this
Yard1 Jun 24, 2022
207d8d1
Merge branch 'more_checkpoint_configurability' into use_new_train_api
Yard1 Jun 24, 2022
0cb579d
Update test_result_grid.py
Yard1 Jun 24, 2022
7ade7e4
Fix
Yard1 Jun 24, 2022
0937dc8
Apply feedback from code review
Yard1 Jun 24, 2022
49ffb18
Merge branch 'more_checkpoint_configurability' into use_new_train_api
Yard1 Jun 24, 2022
b993627
Fix lint
Yard1 Jun 24, 2022
9244b8e
Merge branch 'more_checkpoint_configurability' into use_new_train_api
Yard1 Jun 24, 2022
ed870bd
Update python/ray/train/__init__.py
Yard1 Jun 24, 2022
ad90782
Merge branch 'master' into more_checkpoint_configurability
Yard1 Jun 27, 2022
c777bb5
Merge branch 'master' into use_new_train_api
Yard1 Jun 27, 2022
77305b2
Merge branch 'more_checkpoint_configurability' into use_new_train_api
Yard1 Jun 27, 2022
a4fd532
Fix CI
Yard1 Jun 27, 2022
d0ae2ba
Use warnings.warn
Yard1 Jun 28, 2022
d44f750
Make method private
Yard1 Jun 28, 2022
c9d3380
Update python/ray/util/ml_utils/checkpoint_manager.py
Yard1 Jun 28, 2022
5c0a753
Update checkpoint_manager.py
Yard1 Jun 28, 2022
19108f4
Merge branch 'more_checkpoint_configurability' into use_new_train_api
Yard1 Jun 29, 2022
44f62e0
Merge branch 'master' into use_new_train_api
Yard1 Jun 29, 2022
c7b783b
Fix test
Yard1 Jun 29, 2022
2e9ec66
Rename files
Yard1 Jun 30, 2022
2bf89d2
Use keras callback
Yard1 Jun 30, 2022
375790e
Revert docstring changes
Yard1 Jun 30, 2022
de5103e
Merge branch 'master' into use_new_train_api
Yard1 Jun 30, 2022
baaaf47
Rename example files in docs
Yard1 Jun 30, 2022
d931a50
Merge branch 'master' into use_new_train_api
Yard1 Jun 30, 2022
691ce99
Add legacy tests
Yard1 Jun 30, 2022
b407873
Merge branch 'master' into use_new_train_api
Yard1 Jul 5, 2022
2c7611c
Merge branch 'ray-project:master' into use_new_train_api
Yard1 Jul 6, 2022
587ad56
Add todo
Yard1 Jul 6, 2022
0b05727
Merge branch 'master' into use_new_train_api
Yard1 Jul 6, 2022
139f44d
Use `trial_logdir` instead
Yard1 Jul 6, 2022
3a4d3f3
Fix
Yard1 Jul 6, 2022
a064f96
Merge branch 'ray-project:master' into use_new_train_api
Yard1 Jul 7, 2022
302d336
Merge branch 'ray-project:master' into use_new_train_api
Yard1 Jul 7, 2022
2ea93d7
Only print metrics
Yard1 Jul 7, 2022
f0d3beb
Merge branch 'master' into use_new_train_api
Yard1 Jul 7, 2022
10 changes: 5 additions & 5 deletions doc/source/train/examples.rst
@@ -15,10 +15,10 @@ General Examples
 PyTorch
 ~~~~~~~

-* :doc:`/train/examples/train_linear_example`:
+* :doc:`/train/examples/torch_linear_example`:
Review comment (Contributor): nice

   Simple example for PyTorch.

-* :doc:`/train/examples/train_fashion_mnist_example`:
+* :doc:`/train/examples/torch_fashion_mnist_example`:
   End-to-end example for PyTorch.

 * :doc:`/train/examples/transformers/transformers_example`:
@@ -59,10 +59,10 @@ Ray Datasets Integration Examples
 * :doc:`/train/examples/tensorflow_linear_dataset_example`:
   Simple example for training a linear TensorFlow model.

-* :doc:`/train/examples/train_linear_dataset_example`:
+* :doc:`/train/examples/torch_linear_dataset_example`:
   Simple example for training a linear PyTorch model.

-* :doc:`/train/examples/tune_linear_dataset_example`:
+* :doc:`/train/examples/tune_torch_linear_dataset_example`:
   Simple example for tuning a linear PyTorch model.


@@ -75,7 +75,7 @@ Ray Tune Integration Examples
 * :doc:`/train/examples/tune_tensorflow_mnist_example`:
   End-to-end example for tuning a TensorFlow model.

-* :doc:`/train/examples/tune_cifar_pytorch_pbt_example`:
+* :doc:`/train/examples/tune_cifar_torch_pbt_example`:
   End-to-end example for tuning a PyTorch model with PBT.

 ..
6 changes: 6 additions & 0 deletions doc/source/train/examples/torch_fashion_mnist_example.rst
@@ -0,0 +1,6 @@
+:orphan:
+
+torch_fashion_mnist_example
+===========================
+
+.. literalinclude:: /../../python/ray/train/examples/torch_fashion_mnist_example.py
6 changes: 6 additions & 0 deletions doc/source/train/examples/torch_linear_dataset_example.rst
@@ -0,0 +1,6 @@
+:orphan:
+
+torch_linear_dataset_example
+============================
+
+.. literalinclude:: /../../python/ray/train/examples/torch_linear_dataset_example.py
6 changes: 6 additions & 0 deletions doc/source/train/examples/torch_linear_example.rst
@@ -0,0 +1,6 @@
+:orphan:
+
+torch_linear_example
+====================
+
+.. literalinclude:: /../../python/ray/train/examples/torch_linear_example.py
6 changes: 0 additions & 6 deletions doc/source/train/examples/train_fashion_mnist_example.rst

This file was deleted.

6 changes: 0 additions & 6 deletions doc/source/train/examples/train_linear_dataset_example.rst

This file was deleted.

6 changes: 0 additions & 6 deletions doc/source/train/examples/train_linear_example.rst

This file was deleted.

This file was deleted.

6 changes: 6 additions & 0 deletions doc/source/train/examples/tune_cifar_torch_pbt_example.rst
@@ -0,0 +1,6 @@
+:orphan:
+
+tune_cifar_torch_pbt_example
+============================
+
+.. literalinclude:: /../../python/ray/train/examples/tune_cifar_torch_pbt_example.py
6 changes: 0 additions & 6 deletions doc/source/train/examples/tune_linear_dataset_example.rst

This file was deleted.

@@ -0,0 +1,6 @@
+:orphan:
+
+tune_torch_linear_dataset_example
+=================================
+
+.. literalinclude:: /../../python/ray/air/examples/pytorch/tune_torch_linear_dataset_example.py
7 changes: 5 additions & 2 deletions python/ray/air/result.py
@@ -1,5 +1,6 @@
-from typing import Any, Dict, List, Optional, Tuple
 from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple

 from ray.air.checkpoint import Checkpoint
 from ray.util.annotations import PublicAPI
@@ -15,7 +16,7 @@ class Result:
     This is the class produced by Trainer.fit().
     It contains a checkpoint, which can be used for resuming training and for
     creating a Predictor object. It also contains a metrics object describing
-    training metrics. `error` is included so that non successful runs
+    training metrics. ``error`` is included so that non successful runs
     and trials can be represented as well.

     The constructor is a private API.
@@ -24,6 +25,7 @@ class Result:
         metrics: The final metrics as reported by an Trainable.
         checkpoint: The final checkpoint of the Trainable.
         error: The execution error of the Trainable run, if the trial finishes in error.
+        log_dir: Directory where the trial logs are saved.
         metrics_dataframe: The full result dataframe of the Trainable.
             The dataframe is indexed by iterations and contains reported
             metrics.
@@ -37,6 +39,7 @@ class Result:
     metrics: Optional[Dict[str, Any]]
     checkpoint: Optional[Checkpoint]
     error: Optional[Exception]
+    log_dir: Optional[Path]
     metrics_dataframe: Optional[pd.DataFrame]
     best_checkpoints: Optional[List[Tuple[Checkpoint, Dict[str, Any]]]]
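
Editor's note: as a rough illustration of how calling code might consume the new optional `log_dir` field, here is a minimal sketch built on a stand-in dataclass. `ResultSketch` and `describe` are hypothetical names that do not exist in Ray; only the `metrics`, `error`, and `log_dir` field names are taken from the diff above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, Optional


@dataclass
class ResultSketch:
    """Hypothetical stand-in mirroring a few fields of the Result class."""

    metrics: Optional[Dict[str, Any]]
    error: Optional[Exception]
    log_dir: Optional[Path]


def describe(result: ResultSketch) -> str:
    """Summarize a result, tolerating a failed run or a missing log_dir."""
    if result.error is not None:
        return f"failed: {result.error!r}"
    if result.log_dir is None:
        where = "<no log dir>"
    else:
        where = result.log_dir.as_posix()
    return f"metrics={result.metrics} logs={where}"


ok = ResultSketch(metrics={"loss": 0.1}, error=None, log_dir=Path("/tmp/run/trial_0"))
print(describe(ok))
```

Because the field is typed `Optional[Path]`, the sketch checks for `None` before using it; whether real results can ever carry a missing `log_dir` is not stated in this diff.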

25 changes: 3 additions & 22 deletions python/ray/train/BUILD
@@ -39,15 +39,6 @@ py_test(
     deps = [":train_lib"]
 )

-py_test(
-    name = "torch_tensorboard_profiler_example",
-    size = "small",
-    main = "examples/torch_tensorboard_profiler_example.py",
-    srcs = ["examples/torch_tensorboard_profiler_example.py"],
-    tags = ["team:ml", "exclusive"],
-    deps = [":train_lib"]
-)
-
 py_test(
     name = "transformers_example_gpu",
     size = "large",
@@ -73,25 +64,15 @@ py_test(
 )

 py_test(
-    name = "tune_cifar_pytorch_pbt_example",
+    name = "tune_cifar_torch_pbt_example",
     size = "medium",
-    main = "examples/tune_cifar_pytorch_pbt_example.py",
-    srcs = ["examples/tune_cifar_pytorch_pbt_example.py"],
+    main = "examples/tune_cifar_torch_pbt_example.py",
+    srcs = ["examples/tune_cifar_torch_pbt_example.py"],
     tags = ["team:ml", "exclusive", "pytorch", "tune"],
     deps = [":train_lib"],
     args = ["--smoke-test"]
 )

-py_test(
-    name = "tune_linear_dataset_example",
-    size = "medium",
-    main = "examples/tune_linear_dataset_example.py",
-    srcs = ["examples/tune_linear_dataset_example.py"],
-    tags = ["team:ml", "exclusive", "gpu_only", "tune"],
-    deps = [":train_lib"],
-    args = ["--smoke-test", "--use-gpu"]
-)
-
 py_test(
     name = "tune_linear_example",
     size = "medium",
22 changes: 12 additions & 10 deletions python/ray/train/examples/horovod/horovod_example.py
@@ -2,15 +2,17 @@
 import os

 import horovod.torch as hvd
-import ray
 import torch.nn as nn
 import torch.nn.functional as F
 import torch.optim as optim
 import torch.utils.data.distributed
 from filelock import FileLock
-from ray.train import Trainer
 from torchvision import datasets, transforms

+import ray
+from ray import train
+from ray.train.horovod import HorovodTrainer
+

 def metric_average(val, name):
     tensor = torch.tensor(val)
@@ -142,21 +144,21 @@ def train_func(config):

     model, optimizer, train_loader, train_sampler = setup(config)

-    results = []
     for epoch in range(num_epochs):
         loss = train_epoch(
             model, optimizer, train_sampler, train_loader, epoch, log_interval, use_cuda
         )
-        results.append(loss)
-    return results
+        train.report(loss=loss)


 def main(num_workers, use_gpu, kwargs):
-    trainer = Trainer("horovod", use_gpu=use_gpu, num_workers=num_workers)
-    trainer.start()
-    loss_per_epoch = trainer.run(train_func, config=kwargs)
-    trainer.shutdown()
-    print(loss_per_epoch)
+    trainer = HorovodTrainer(
+        train_func,
+        train_loop_config=kwargs,
+        scaling_config={"use_gpu": use_gpu, "num_workers": num_workers},
+    )
+    results = trainer.fit()
+    print(results)


# Horovod Class API.
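
Editor's note: the `train_func` change in this file swaps "collect losses and return them" for "report each epoch's loss as it is produced". The sketch below illustrates only that pattern, with no Ray dependency. The `report` parameter is a hypothetical stand-in for the role `train.report` plays in the diff, and the loss formula is a placeholder, not the real training step.

```python
from typing import Callable, Dict, List

# Hypothetical type for a metrics sink; Ray's own reporting function is
# NOT reimplemented here, this only imitates its keyword-argument call shape.
Report = Callable[..., None]


def train_func_old(num_epochs: int) -> List[float]:
    """Old style: accumulate per-epoch losses and return them at the end."""
    results = []
    for epoch in range(num_epochs):
        loss = 1.0 / (epoch + 1)  # placeholder for a real training step
        results.append(loss)
    return results


def train_func_new(num_epochs: int, report: Report) -> None:
    """New style: hand each epoch's loss to a reporting hook immediately."""
    for epoch in range(num_epochs):
        loss = 1.0 / (epoch + 1)  # placeholder for a real training step
        report(loss=loss)


# Collect reported metrics in a plain list to compare the two styles.
history: List[Dict[str, float]] = []
train_func_new(3, lambda **metrics: history.append(metrics))
print(history)
```

One consequence of the reporting style is that intermediate values become visible to the caller while the loop is still running, instead of only after the function returns.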
25 changes: 14 additions & 11 deletions python/ray/train/examples/mlflow_fashion_mnist_example.py
@@ -1,20 +1,23 @@
 import argparse

-from ray.train import Trainer
-from ray.train.examples.train_fashion_mnist_example import train_func
-from ray.train.callbacks.logging import MLflowLoggerCallback
+from ray.air import RunConfig
+from ray.train.examples.torch_fashion_mnist_example import train_func
+from ray.train.torch import TorchTrainer
+from ray.tune.integration.mlflow import MLflowLoggerCallback


 def main(num_workers=2, use_gpu=False):
-    trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=use_gpu)
-    trainer.start()
-    final_results = trainer.run(
-        train_func=train_func,
-        config={"lr": 1e-3, "batch_size": 64, "epochs": 4},
-        callbacks=[MLflowLoggerCallback(experiment_name="train_fashion_mnist")],
+    trainer = TorchTrainer(
+        train_func,
+        train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 4},
+        scaling_config={"num_workers": num_workers, "use_gpu": use_gpu},
+        run_config=RunConfig(
+            callbacks=[MLflowLoggerCallback(experiment_name="train_fashion_mnist")]
+        ),
     )
+    final_results = trainer.fit()

-    print("Full losses for rank 0 worker: ", final_results)
+    print("Full results for rank 0 worker: ", final_results)


 if __name__ == "__main__":
@@ -44,7 +47,7 @@ def main(num_workers=2, use_gpu=False):
     import ray

     if args.smoke_test:
-        ray.init(num_cpus=2)
+        ray.init(num_cpus=4)
         args.num_workers = 2
         args.use_gpu = False
     else:
51 changes: 32 additions & 19 deletions python/ray/train/examples/mlflow_simple_example.py
@@ -1,40 +1,53 @@
 from ray import train
-from ray.train import Trainer
-from ray.train.callbacks import MLflowLoggerCallback, TBXLoggerCallback
+from ray.air import RunConfig
+from ray.train.torch import TorchTrainer
+from ray.tune.integration.mlflow import MLflowLoggerCallback
+from ray.tune.logger import TBXLoggerCallback


 def train_func():
     for i in range(3):
         train.report(epoch=i)


-trainer = Trainer(backend="torch", num_workers=2)
-trainer.start()
+trainer = TorchTrainer(
+    train_func,
+    scaling_config={"num_workers": 2},
+    run_config=RunConfig(
+        callbacks=[
+            MLflowLoggerCallback(experiment_name="train_experiment"),
+            TBXLoggerCallback(),
+        ],
+    ),
+)

 # Run the training function, logging all the intermediate results
 # to MLflow and Tensorboard.
-result = trainer.run(
-    train_func,
-    callbacks=[
-        MLflowLoggerCallback(experiment_name="train_experiment"),
-        TBXLoggerCallback(),
-    ],
-)
+result = trainer.fit()

-# Print the latest run directory and keep note of it.
-# For example: /home/ray_results/train_2021-09-01_12-00-00/run_001
-print("Run directory:", trainer.latest_run_dir)
+# For MLFLow logs:
+
+# MLFlow logs will by default be saved in an `mlflow` directory
+# in the current working directory.

-trainer.shutdown()
+# $ cd mlflow
+# # View the MLflow UI.
+# $ mlflow ui
+
+# You can change the directory by setting the `tracking_uri` argument
+# in `MLflowLoggerCallback`.
+
+# For TensorBoard logs:
+
+# Print the latest run directory and keep note of it.
+# For example: /home/ubuntu/ray_results/TorchTrainer_2022-06-13_20-31-06
+print("Run directory:", result.log_dir.parent)  # TensorBoard is saved in parent dir

 # How to visualize the logs

 # Navigate to the run directory of the trainer.
-# For example `cd /home/ray_results/train_2021-09-01_12-00-00/run_001`
+# For example `cd /home/ubuntu/ray_results/TorchTrainer_2022-06-13_20-31-06`
 # $ cd <TRAINER_RUN_DIR>
-#
-# # View the MLflow UI.
-# $ mlflow ui
#
# # View the tensorboard UI.
# $ tensorboard --logdir .
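
Editor's note: the updated example prints `result.log_dir.parent` as the directory to pass to TensorBoard. As a hedged sketch of how one might check that such a directory actually holds event files before launching the UI, the snippet below searches a directory tree for files using TensorBoard's usual `events.out.tfevents.` name prefix. The helper `find_event_files` and the directory names are hypothetical, and the layout is built in a temporary directory rather than taken from a real run.

```python
import tempfile
from pathlib import Path
from typing import List


def find_event_files(run_dir: Path) -> List[Path]:
    """Return TensorBoard event files found anywhere under run_dir, sorted."""
    return sorted(run_dir.rglob("events.out.tfevents.*"))


# Build a throwaway tree that imitates a run directory containing one trial.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "TorchTrainer_example"  # hypothetical run directory
    trial_dir = run_dir / "trial_0"               # hypothetical trial log_dir
    trial_dir.mkdir(parents=True)
    (trial_dir / "events.out.tfevents.0.host").touch()

    # Same idea as `result.log_dir.parent` in the example above.
    found = find_event_files(trial_dir.parent)
    relative = [p.relative_to(run_dir).as_posix() for p in found]
    print(relative)
```

The search is recursive, so it does not depend on whether the event files sit directly in the run directory or inside a per-trial subdirectory.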