Commit 3d94498

[Doc][AIR] Improve visibility of Trainer restore and stateful callback restoration (#34350)

justinvyu authored Apr 20, 2023
1 parent 42dd417 commit 3d94498

Showing 10 changed files with 366 additions and 160 deletions.
3 changes: 2 additions & 1 deletion doc/source/ray-air/examples/gptj_deepspeed_fine_tuning.ipynb
@@ -559,6 +559,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -567,7 +568,7 @@
"We pass the preprocessors we have defined earlier as an argument, wrapped in a {class}`~ray.data.preprocessors.chain.Chain`. The preprocessor will be included with the returned {class}`~ray.air.checkpoint.Checkpoint`, meaning it will also be applied during inference.\n",
"\n",
"```{note}\n",
"If you want to upload checkpoints to cloud storage (eg. S3), use {class}`~ray.tune.syncer.SyncConfig` - see {ref}`train-config-sync` for an example. Using cloud storage is highly recommended, especially for production.\n",
"If you want to upload checkpoints to cloud storage (eg. S3), set {class}`air.RunConfig(storage_path) <ray.air.RunConfig>`. See {ref}`train-run-config` for an example. Using cloud storage is highly recommended, especially for production.\n",
"```"
]
},
2 changes: 1 addition & 1 deletion doc/source/train/api/api.rst
@@ -232,4 +232,4 @@ Restoration API for Built-in Trainers

.. seealso::

See :ref:`train-restore-faq` for more details on when and how trainer restore should be used.
See :ref:`train-restore-guide` for more details on when and how trainer restore should be used.
90 changes: 63 additions & 27 deletions doc/source/train/config_guide.rst
@@ -7,36 +7,51 @@ The following overviews how to configure scale-out, run options, and fault-tolerance
For more details on how to configure data ingest, also refer to :ref:`air-ingest`.

Scaling Configurations in Train (``ScalingConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
---------------------------------------------------

The scaling configuration specifies distributed training properties like the number of workers or the
resources per worker.

The properties of the scaling configuration are :ref:`tunable <air-tuner-search-space>`.

:class:`ScalingConfig API reference <ray.air.config.ScalingConfig>`

.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __scaling_config_start__
:end-before: __scaling_config_end__
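
For illustration, here is a minimal sketch of a scaling configuration; the worker
count and per-worker resources below are placeholder values, not recommendations:

.. code-block:: python

    from ray.air import ScalingConfig

    # Request 4 distributed training workers, each reserving 1 GPU and 2 CPUs.
    scaling_config = ScalingConfig(
        num_workers=4,
        use_gpu=True,
        resources_per_worker={"CPU": 2, "GPU": 1},
    )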

.. seealso::

See the :class:`~ray.air.ScalingConfig` API reference.

.. _train-run-config:

Run Configuration in Train (``RunConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------------------

The run configuration specifies distributed training properties like the number of workers or the
resources per worker.
``RunConfig`` is a configuration object used in Ray Train to define the experiment
spec that corresponds to a call to ``trainer.fit()``.

The properties of the run configuration are :ref:`not tunable <air-tuner-search-space>`.
It includes settings such as the experiment name, storage path for results,
stopping conditions, custom callbacks, checkpoint configuration, verbosity level,
and logging options.

Many of these settings are configured through other config objects and passed through
the ``RunConfig``. The following sub-sections contain descriptions of these configs.

:class:`RunConfig API reference <ray.air.config.RunConfig>`
The properties of the run configuration are :ref:`not tunable <air-tuner-search-space>`.

.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __run_config_start__
:end-before: __run_config_end__
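
For illustration, a minimal sketch of a run configuration is shown below; the
experiment name and storage bucket are hypothetical, and the nested configs are
described in the following sub-sections:

.. code-block:: python

    from ray.air import CheckpointConfig, FailureConfig, RunConfig

    run_config = RunConfig(
        # Name of the experiment directory.
        name="my_train_experiment",
        # Hypothetical cloud bucket where results and checkpoints are stored.
        storage_path="s3://my-bucket/train-results",
        checkpoint_config=CheckpointConfig(num_to_keep=2),
        failure_config=FailureConfig(max_failures=2),
    )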

.. seealso::

See the :class:`~ray.air.RunConfig` API reference.

See :ref:`tune-storage-options` for storage configuration examples (related to ``storage_path``).


Failure configurations in Train (``FailureConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -45,30 +60,15 @@ The failure configuration specifies how training failures should be dealt with.
As part of the RunConfig, the properties of the failure configuration
are :ref:`not tunable <air-tuner-search-space>`.

:class:`FailureConfig API reference <ray.air.config.FailureConfig>`

.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __failure_config_start__
:end-before: __failure_config_end__
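
For example, a minimal sketch with a placeholder retry count:

.. code-block:: python

    from ray.air import FailureConfig, RunConfig

    # Retry a run up to 3 times on worker failures; max_failures=-1 retries
    # indefinitely.
    run_config = RunConfig(failure_config=FailureConfig(max_failures=3))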

.. _train-config-sync:

Sync configurations in Train (``SyncConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. seealso::

The sync configuration specifies how to synchronize checkpoints between the
Ray cluster and remote storage.

As part of the RunConfig, the properties of the sync configuration
are :ref:`not tunable <air-tuner-search-space>`.

:class:`SyncConfig API reference <ray.tune.syncer.SyncConfig>`

.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __sync_config_start__
:end-before: __sync_config_end__
See the :class:`~ray.air.FailureConfig` API reference.


Checkpoint configurations in Train (``CheckpointConfig``)
@@ -80,10 +80,46 @@ and how many checkpoints to keep.
As part of the RunConfig, the properties of the checkpoint configuration
are :ref:`not tunable <air-tuner-search-space>`.

:class:`CheckpointConfig API reference <ray.air.config.CheckpointConfig>`

.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __checkpoint_config_start__
:end-before: __checkpoint_config_end__
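
For instance, a minimal sketch that keeps only the best checkpoints, assuming the
training code reports a ``loss`` metric:

.. code-block:: python

    from ray.air import CheckpointConfig

    # Keep the 2 checkpoints with the lowest reported "loss".
    checkpoint_config = CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="loss",
        checkpoint_score_order="min",
    )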

Trainers of certain frameworks including :class:`~ray.train.xgboost.XGBoostTrainer`,
:class:`~ray.train.lightgbm.LightGBMTrainer`, and :class:`~ray.train.huggingface.HuggingFaceTrainer`
implement checkpointing out of the box. For these trainers, checkpointing can be
enabled by setting the checkpoint frequency within the :class:`~ray.air.CheckpointConfig`.

.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __checkpoint_config_ckpt_freq_start__
:end-before: __checkpoint_config_ckpt_freq_end__
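
As a rough sketch, frequency-based checkpointing could be wired up as follows; the
dataset, XGBoost parameters, and boosting rounds are placeholders:

.. code-block:: python

    import ray
    from ray.air import CheckpointConfig, RunConfig, ScalingConfig
    from ray.train.xgboost import XGBoostTrainer

    # Placeholder dataset with a binary "target" label column.
    train_dataset = ray.data.from_items(
        [{"x": i, "target": i % 2} for i in range(100)]
    )

    trainer = XGBoostTrainer(
        label_column="target",
        params={"objective": "binary:logistic"},
        datasets={"train": train_dataset},
        num_boost_round=50,
        scaling_config=ScalingConfig(num_workers=2),
        run_config=RunConfig(
            # Save a checkpoint every 10 boosting rounds.
            checkpoint_config=CheckpointConfig(checkpoint_frequency=10),
        ),
    )
    result = trainer.fit()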

.. warning::

``checkpoint_frequency`` and other parameters do *not* work for trainers
that accept a custom training loop such as :class:`~ray.train.torch.TorchTrainer`,
since checkpointing is fully user-controlled.

.. seealso::

See the :class:`~ray.air.CheckpointConfig` API reference.


Synchronization configurations in Train (``tune.SyncConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``tune.SyncConfig`` specifies how synchronization of results
and checkpoints should happen in a distributed Ray cluster.

As part of the RunConfig, the properties of the sync configuration
are :ref:`not tunable <air-tuner-search-space>`.

.. note::

This configuration is mostly relevant to running multiple Train runs with
Ray Tune. See :ref:`tune-storage-options` for a guide on using the ``SyncConfig``.
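
As a rough sketch, assuming the default syncer and only adjusting how often
synchronization happens (``sync_period`` is in seconds):

.. code-block:: python

    from ray import tune
    from ray.air import RunConfig

    # Sync results and checkpoints to storage every 5 minutes.
    run_config = RunConfig(sync_config=tune.SyncConfig(sync_period=300))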

.. seealso::

See the :class:`~ray.tune.syncer.SyncConfig` API reference.
125 changes: 114 additions & 11 deletions doc/source/train/dl_guide.rst
@@ -504,6 +504,8 @@ The following figure shows how these two sessions look like in a Data Parallel t
..
https://docs.google.com/drawings/d/1g0pv8gqgG29aPEPTcd4BC0LaRNbW1sAkv3H6W1TCp0c/edit
.. _train-dl-saving-checkpoints:

Saving checkpoints
++++++++++++++++++

@@ -688,6 +690,8 @@ You may also config ``CheckpointConfig`` to keep the "N best" checkpoints persis
# ('local_path', '/home/ubuntu/ray_results/TorchTrainer_2022-06-24_21-34-49/TorchTrainer_7988b_00000_0_2022-06-24_21-34-49/checkpoint_000002')
.. _train-dl-loading-checkpoints:

Loading checkpoints
+++++++++++++++++++

@@ -945,25 +949,124 @@ metrics from multiple workers.
.. _train-fault-tolerance:

Fault Tolerance & Elastic Training
----------------------------------
Fault Tolerance
---------------

Automatically Recover from Train Worker Failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ray Train has built-in fault tolerance to recover from worker failures (i.e.
``RayActorError``\s). When a failure is detected, the workers will be shut
down and new workers will be added in. The training function will be
restarted, but progress from the previous execution can be resumed through
checkpointing.
down and new workers will be added in.

.. note:: Elastic Training is not yet supported.

The training function will be restarted, but progress from the previous execution can
be resumed through checkpointing.

.. warning:: In order to retain progress when recovery, your training function
**must** implement logic for both saving *and* loading :ref:`checkpoints
<train-checkpointing>`.
.. tip::
In order to retain progress upon recovery, your training function
**must** implement logic for both :ref:`saving <train-dl-saving-checkpoints>`
*and* :ref:`loading checkpoints <train-dl-loading-checkpoints>`.

Each instance of recovery from a worker failure is considered a retry. The
number of retries is configurable through the ``max_failures`` attribute of the
``failure_config`` argument set in the ``run_config`` argument passed to the
``Trainer``.
:class:`~ray.air.FailureConfig` argument set in the :class:`~ray.air.RunConfig`
passed to the ``Trainer``:

.. note:: Elastic Training is not yet supported.
.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __failure_config_start__
:end-before: __failure_config_end__
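
As a sketch, the failure config plugs into a trainer like this; the training
function body and worker count are placeholders:

.. code-block:: python

    from ray.air import FailureConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_loop_per_worker(config):
        # Training code that saves and loads checkpoints goes here.
        ...

    trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=2),
        # Recover from up to 3 worker failures before surfacing the error.
        run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),
    )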

.. _train-restore-guide:

Restore a Ray Train Experiment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At the experiment level, :ref:`Trainer restoration <trainer-restore>`
allows you to resume a previously interrupted experiment from where it left off.

A Train experiment may be interrupted due to one of the following reasons:

- The experiment was manually interrupted (e.g., Ctrl+C, or pre-empted head node instance).
- The head node crashed (e.g., OOM or some other runtime error).
- The entire cluster went down (e.g., network error affecting all nodes).

Trainer restoration is possible for all of Ray Train's built-in trainers,
but we use ``TorchTrainer`` in the examples for demonstration.
We also use ``<Framework>Trainer`` to refer to methods that are shared across all
built-in trainers.

Let's say your initial Train experiment is configured as follows.
The actual training loop is just for demonstration purposes: the important detail is that
:ref:`saving <train-dl-saving-checkpoints>` *and* :ref:`loading checkpoints <train-dl-loading-checkpoints>`
have been implemented.

.. literalinclude:: doc_code/dl_guide.py
:language: python
:start-after: __ft_initial_run_start__
:end-before: __ft_initial_run_end__

The results and checkpoints of the experiment are saved to the path configured by :class:`~ray.air.config.RunConfig`.
If the experiment has been interrupted due to one of the reasons listed above, use this path to resume:

.. literalinclude:: doc_code/dl_guide.py
:language: python
:start-after: __ft_restored_run_start__
:end-before: __ft_restored_run_end__

.. tip::

You can also restore from a remote path (e.g., from an experiment directory stored in an S3 bucket).

.. literalinclude:: doc_code/dl_guide.py
:language: python
:dedent:
:start-after: __ft_restore_from_cloud_initial_start__
:end-before: __ft_restore_from_cloud_initial_end__

.. literalinclude:: doc_code/dl_guide.py
:language: python
:dedent:
:start-after: __ft_restore_from_cloud_restored_start__
:end-before: __ft_restore_from_cloud_restored_end__

.. note::

Different trainers may allow more parameters to be optionally re-specified on restore.
Only **datasets** are required to be re-specified on restore, if they were supplied originally.

See :ref:`train-framework-specific-restore` for more details.
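
For instance, a minimal sketch of re-specifying a dataset on restore; the
experiment path and dataset below are placeholders:

.. code-block:: python

    import ray
    from ray.train.torch import TorchTrainer

    # Placeholder standing in for the dataset used in the original run.
    train_dataset = ray.data.from_items([{"x": i, "y": 2 * i} for i in range(100)])

    restored_trainer = TorchTrainer.restore(
        "~/ray_results/my_train_experiment",  # placeholder experiment path
        datasets={"train": train_dataset},
    )
    result = restored_trainer.fit()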


Auto-resume
+++++++++++

Adding the branching logic below will allow you to run the same script after the interrupt,
picking up training from where you left off in the previous run. Notice that we use the
:meth:`<Framework>Trainer.can_restore <ray.train.trainer.BaseTrainer.can_restore>` utility method
to determine the existence and validity of the given experiment directory.

.. literalinclude:: doc_code/dl_guide.py
:language: python
:start-after: __ft_autoresume_start__
:end-before: __ft_autoresume_end__
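
A rough sketch of that branching logic, using placeholder paths, datasets, and
worker counts:

.. code-block:: python

    import ray
    from ray.air import RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    experiment_path = "~/ray_results/my_train_experiment"  # placeholder
    train_dataset = ray.data.from_items([{"x": i, "y": 2 * i} for i in range(100)])

    def train_loop_per_worker(config):
        # Training code that saves and loads checkpoints goes here.
        ...

    if TorchTrainer.can_restore(experiment_path):
        # A previous (interrupted) run exists at this path: resume it.
        trainer = TorchTrainer.restore(
            experiment_path, datasets={"train": train_dataset}
        )
    else:
        # No valid experiment found at the path: start a fresh run.
        trainer = TorchTrainer(
            train_loop_per_worker,
            datasets={"train": train_dataset},
            scaling_config=ScalingConfig(num_workers=2),
            run_config=RunConfig(name="my_train_experiment"),
        )
    result = trainer.fit()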

.. seealso::

See the :meth:`BaseTrainer.restore <ray.train.trainer.BaseTrainer.restore>` docstring
for a full example.

.. note::

`<Framework>Trainer.restore` is different from
:class:`<Framework>Trainer(..., resume_from_checkpoint=...) <ray.train.trainer.BaseTrainer>`.
`resume_from_checkpoint` is meant to be used to start a *new* Train experiment,
which writes results to a new directory and starts over from iteration 0.

`<Framework>Trainer.restore` is used to continue an existing experiment, where
new results will continue to be appended to existing logs.
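
Schematically, with placeholder names for the training function and the previous
run's result:

.. code-block:: python

    from ray.air import ScalingConfig
    from ray.train.torch import TorchTrainer

    # Continue the interrupted experiment in place: new results are appended
    # to the existing experiment directory (placeholder path below).
    trainer = TorchTrainer.restore("~/ray_results/my_train_experiment")

    # Start a brand-new experiment, warm-started from a checkpoint of an
    # earlier run. `train_loop_per_worker` and `previous_result` stand in for
    # your training function and an earlier `trainer.fit()` result.
    new_trainer = TorchTrainer(
        train_loop_per_worker,
        resume_from_checkpoint=previous_result.checkpoint,
        scaling_config=ScalingConfig(num_workers=2),
    )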

.. Running on pre-emptible machines
.. --------------------------------