Commit 3d94498

[Doc][AIR] Improve visibility of Trainer restore and stateful callback restoration (#34350)

justinvyu authored Apr 20, 2023
1 parent 42dd417 commit 3d94498

Showing 10 changed files with 366 additions and 160 deletions.
3 changes: 2 additions & 1 deletion doc/source/ray-air/examples/gptj_deepspeed_fine_tuning.ipynb
@@ -559,6 +559,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -567,7 +568,7 @@
"We pass the preprocessors we have defined earlier as an argument, wrapped in a {class}`~ray.data.preprocessors.chain.Chain`. The preprocessor will be included with the returned {class}`~ray.air.checkpoint.Checkpoint`, meaning it will also be applied during inference.\n",
"\n",
"```{note}\n",
"If you want to upload checkpoints to cloud storage (eg. S3), use {class}`~ray.tune.syncer.SyncConfig` - see {ref}`train-config-sync` for an example. Using cloud storage is highly recommended, especially for production.\n",
"If you want to upload checkpoints to cloud storage (eg. S3), set {class}`air.RunConfig(storage_path) <ray.air.RunConfig>`. See {ref}`train-run-config` for an example. Using cloud storage is highly recommended, especially for production.\n",
"```"
]
},
2 changes: 1 addition & 1 deletion doc/source/train/api/api.rst
@@ -232,4 +232,4 @@ Restoration API for Built-in Trainers

.. seealso::

See :ref:`train-restore-faq` for more details on when and how trainer restore should be used.
See :ref:`train-restore-guide` for more details on when and how trainer restore should be used.
90 changes: 63 additions & 27 deletions doc/source/train/config_guide.rst
@@ -7,36 +7,51 @@ The following overviews how to configure scale-out, run options, and fault-tolerance
For more details on how to configure data ingest, also refer to :ref:`air-ingest`.

Scaling Configurations in Train (``ScalingConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
---------------------------------------------------

The scaling configuration specifies distributed training properties like the number of workers or the
resources per worker.

The properties of the scaling configuration are :ref:`tunable <air-tuner-search-space>`.

:class:`ScalingConfig API reference <ray.air.config.ScalingConfig>`

.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __scaling_config_start__
:end-before: __scaling_config_end__
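
For illustration, here is a minimal sketch of a scaling configuration; the worker
count and per-worker resources below are placeholder values, not recommendations:

.. code-block:: python

    from ray.air import ScalingConfig

    # Request 4 distributed training workers, each reserving 1 GPU and 2 CPUs.
    scaling_config = ScalingConfig(
        num_workers=4,
        use_gpu=True,
        resources_per_worker={"CPU": 2, "GPU": 1},
    )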

.. seealso::

See the :class:`~ray.air.ScalingConfig` API reference.

.. _train-run-config:

Run Configuration in Train (``RunConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------------------

The run configuration specifies distributed training properties like the number of workers or the
resources per worker.
``RunConfig`` is a configuration object used in Ray Train to define the experiment
spec that corresponds to a call to ``trainer.fit()``.

The properties of the run configuration are :ref:`not tunable <air-tuner-search-space>`.
It includes settings such as the experiment name, storage path for results,
stopping conditions, custom callbacks, checkpoint configuration, verbosity level,
and logging options.

Many of these settings are configured through other config objects and passed through
the ``RunConfig``. The following sub-sections contain descriptions of these configs.

:class:`RunConfig API reference <ray.air.config.RunConfig>`
The properties of the run configuration are :ref:`not tunable <air-tuner-search-space>`.

.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __run_config_start__
:end-before: __run_config_end__
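
For illustration, a minimal sketch of a run configuration is shown below; the
experiment name and storage bucket are hypothetical, and the nested configs are
described in the following sub-sections:

.. code-block:: python

    from ray.air import CheckpointConfig, FailureConfig, RunConfig

    run_config = RunConfig(
        # Name of the experiment directory.
        name="my_train_experiment",
        # Hypothetical cloud bucket where results and checkpoints are stored.
        storage_path="s3://my-bucket/train-results",
        checkpoint_config=CheckpointConfig(num_to_keep=2),
        failure_config=FailureConfig(max_failures=2),
    )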

.. seealso::

See the :class:`~ray.air.RunConfig` API reference.

See :ref:`tune-storage-options` for storage configuration examples (related to ``storage_path``).


Failure configurations in Train (``FailureConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -45,30 +60,15 @@ The failure configuration specifies how training failures should be dealt with.
As part of the RunConfig, the properties of the failure configuration
are :ref:`not tunable <air-tuner-search-space>`.

:class:`FailureConfig API reference <ray.air.config.FailureConfig>`

.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __failure_config_start__
:end-before: __failure_config_end__
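
For example, a minimal sketch with a placeholder retry count:

.. code-block:: python

    from ray.air import FailureConfig, RunConfig

    # Retry a run up to 3 times on worker failures; max_failures=-1 retries
    # indefinitely.
    run_config = RunConfig(failure_config=FailureConfig(max_failures=3))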

.. _train-config-sync:

Sync configurations in Train (``SyncConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. seealso::

The sync configuration specifies how to synchronize checkpoints between the
Ray cluster and remote storage.

As part of the RunConfig, the properties of the sync configuration
are :ref:`not tunable <air-tuner-search-space>`.

:class:`SyncConfig API reference <ray.tune.syncer.SyncConfig>`

.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __sync_config_start__
:end-before: __sync_config_end__
See the :class:`~ray.air.FailureConfig` API reference.


Checkpoint configurations in Train (``CheckpointConfig``)
@@ -80,10 +80,46 @@ and how many checkpoints to keep.
As part of the RunConfig, the properties of the checkpoint configuration
are :ref:`not tunable <air-tuner-search-space>`.

:class:`CheckpointConfig API reference <ray.air.config.CheckpointConfig>`

.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __checkpoint_config_start__
:end-before: __checkpoint_config_end__
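
For instance, a minimal sketch that keeps only the best checkpoints, assuming the
training code reports a ``loss`` metric:

.. code-block:: python

    from ray.air import CheckpointConfig

    # Keep the 2 checkpoints with the lowest reported "loss".
    checkpoint_config = CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="loss",
        checkpoint_score_order="min",
    )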

Trainers of certain frameworks including :class:`~ray.train.xgboost.XGBoostTrainer`,
:class:`~ray.train.lightgbm.LightGBMTrainer`, and :class:`~ray.train.huggingface.HuggingFaceTrainer`
implement checkpointing out of the box. For these trainers, checkpointing can be
enabled by setting the checkpoint frequency within the :class:`~ray.air.CheckpointConfig`.

.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __checkpoint_config_ckpt_freq_start__
:end-before: __checkpoint_config_ckpt_freq_end__
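
As a rough sketch, frequency-based checkpointing could be wired up as follows; the
dataset, XGBoost parameters, and boosting rounds are placeholders:

.. code-block:: python

    import ray
    from ray.air import CheckpointConfig, RunConfig, ScalingConfig
    from ray.train.xgboost import XGBoostTrainer

    # Placeholder dataset with a binary "target" label column.
    train_dataset = ray.data.from_items(
        [{"x": i, "target": i % 2} for i in range(100)]
    )

    trainer = XGBoostTrainer(
        label_column="target",
        params={"objective": "binary:logistic"},
        datasets={"train": train_dataset},
        num_boost_round=50,
        scaling_config=ScalingConfig(num_workers=2),
        run_config=RunConfig(
            # Save a checkpoint every 10 boosting rounds.
            checkpoint_config=CheckpointConfig(checkpoint_frequency=10),
        ),
    )
    result = trainer.fit()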

.. warning::

``checkpoint_frequency`` and other parameters do *not* work for trainers
that accept a custom training loop such as :class:`~ray.train.torch.TorchTrainer`,
since checkpointing is fully user-controlled.

.. seealso::

See the :class:`~ray.air.CheckpointConfig` API reference.


Synchronization configurations in Train (``tune.SyncConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``tune.SyncConfig`` specifies how synchronization of results
and checkpoints should happen in a distributed Ray cluster.

As part of the RunConfig, the properties of the sync configuration
are :ref:`not tunable <air-tuner-search-space>`.

.. note::

This configuration is mostly relevant to running multiple Train runs with
Ray Tune. See :ref:`tune-storage-options` for a guide on using the ``SyncConfig``.
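
As a rough sketch, assuming the default syncer and only adjusting how often
synchronization happens (``sync_period`` is in seconds):

.. code-block:: python

    from ray import tune
    from ray.air import RunConfig

    # Sync results and checkpoints to storage every 5 minutes.
    run_config = RunConfig(sync_config=tune.SyncConfig(sync_period=300))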

.. seealso::

See the :class:`~ray.tune.syncer.SyncConfig` API reference.
125 changes: 114 additions & 11 deletions doc/source/train/dl_guide.rst
@@ -504,6 +504,8 @@ The following figure shows how these two sessions look like in a Data Parallel t
..
https://docs.google.com/drawings/d/1g0pv8gqgG29aPEPTcd4BC0LaRNbW1sAkv3H6W1TCp0c/edit
.. _train-dl-saving-checkpoints:

Saving checkpoints
++++++++++++++++++

@@ -688,6 +690,8 @@ You may also config ``CheckpointConfig`` to keep the "N best" checkpoints persis
# ('local_path', '/home/ubuntu/ray_results/TorchTrainer_2022-06-24_21-34-49/TorchTrainer_7988b_00000_0_2022-06-24_21-34-49/checkpoint_000002')
.. _train-dl-loading-checkpoints:

Loading checkpoints
+++++++++++++++++++

@@ -945,25 +949,124 @@ metrics from multiple workers.
.. _train-fault-tolerance:

Fault Tolerance & Elastic Training
----------------------------------
Fault Tolerance
---------------

Automatically Recover from Train Worker Failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ray Train has built-in fault tolerance to recover from worker failures (i.e.
``RayActorError``\s). When a failure is detected, the workers will be shut
down and new workers will be added in. The training function will be
restarted, but progress from the previous execution can be resumed through
checkpointing.
down and new workers will be added in.

.. note:: Elastic Training is not yet supported.

The training function will be restarted, but progress from the previous execution can
be resumed through checkpointing.

.. warning:: In order to retain progress when recovery, your training function
**must** implement logic for both saving *and* loading :ref:`checkpoints
<train-checkpointing>`.
.. tip::
In order to retain progress upon recovery, your training function
**must** implement logic for both :ref:`saving <train-dl-saving-checkpoints>`
*and* :ref:`loading checkpoints <train-dl-loading-checkpoints>`.

Each instance of recovery from a worker failure is considered a retry. The
number of retries is configurable through the ``max_failures`` attribute of the
``failure_config`` argument set in the ``run_config`` argument passed to the
``Trainer``.
:class:`~ray.air.FailureConfig` argument set in the :class:`~ray.air.RunConfig`
passed to the ``Trainer``:

.. note:: Elastic Training is not yet supported.
.. literalinclude:: doc_code/key_concepts.py
:language: python
:start-after: __failure_config_start__
:end-before: __failure_config_end__
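
As a sketch, the failure config plugs into a trainer like this; the training
function body and worker count are placeholders:

.. code-block:: python

    from ray.air import FailureConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_loop_per_worker(config):
        # Training code that saves and loads checkpoints goes here.
        ...

    trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=2),
        # Recover from up to 3 worker failures before surfacing the error.
        run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),
    )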

.. _train-restore-guide:

Restore a Ray Train Experiment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At the experiment level, :ref:`Trainer restoration <trainer-restore>`
allows you to resume a previously interrupted experiment from where it left off.

A Train experiment may be interrupted due to one of the following reasons:

- The experiment was manually interrupted (e.g., Ctrl+C, or pre-empted head node instance).
- The head node crashed (e.g., OOM or some other runtime error).
- The entire cluster went down (e.g., network error affecting all nodes).

Trainer restoration is possible for all of Ray Train's built-in trainers,
but we use ``TorchTrainer`` in the examples for demonstration.
We also use ``<Framework>Trainer`` to refer to methods that are shared across all
built-in trainers.

Let's say your initial Train experiment is configured as follows.
The actual training loop is just for demonstration purposes: the important detail is that
:ref:`saving <train-dl-saving-checkpoints>` *and* :ref:`loading checkpoints <train-dl-loading-checkpoints>`
have been implemented.

.. literalinclude:: doc_code/dl_guide.py
:language: python
:start-after: __ft_initial_run_start__
:end-before: __ft_initial_run_end__

The results and checkpoints of the experiment are saved to the path configured by :class:`~ray.air.config.RunConfig`.
If the experiment has been interrupted due to one of the reasons listed above, use this path to resume:

.. literalinclude:: doc_code/dl_guide.py
:language: python
:start-after: __ft_restored_run_start__
:end-before: __ft_restored_run_end__

.. tip::

You can also restore from a remote path (e.g., from an experiment directory stored in an S3 bucket).

.. literalinclude:: doc_code/dl_guide.py
:language: python
:dedent:
:start-after: __ft_restore_from_cloud_initial_start__
:end-before: __ft_restore_from_cloud_initial_end__

.. literalinclude:: doc_code/dl_guide.py
:language: python
:dedent:
:start-after: __ft_restore_from_cloud_restored_start__
:end-before: __ft_restore_from_cloud_restored_end__

.. note::

Different trainers may allow more parameters to be optionally re-specified on restore.
Only **datasets** are required to be re-specified on restore, if they were supplied originally.

See :ref:`train-framework-specific-restore` for more details.
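
For instance, a minimal sketch of re-specifying a dataset on restore; the
experiment path and dataset below are placeholders:

.. code-block:: python

    import ray
    from ray.train.torch import TorchTrainer

    # Placeholder standing in for the dataset used in the original run.
    train_dataset = ray.data.from_items([{"x": i, "y": 2 * i} for i in range(100)])

    restored_trainer = TorchTrainer.restore(
        "~/ray_results/my_train_experiment",  # placeholder experiment path
        datasets={"train": train_dataset},
    )
    result = restored_trainer.fit()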


Auto-resume
+++++++++++

Adding the branching logic below will allow you to run the same script after the interrupt,
picking up training from where you left off in the previous run. Notice that we use the
:meth:`<Framework>Trainer.can_restore <ray.train.trainer.BaseTrainer.can_restore>` utility method
to determine the existence and validity of the given experiment directory.

.. literalinclude:: doc_code/dl_guide.py
:language: python
:start-after: __ft_autoresume_start__
:end-before: __ft_autoresume_end__
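
A rough sketch of that branching logic, using placeholder paths, datasets, and
worker counts:

.. code-block:: python

    import ray
    from ray.air import RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    experiment_path = "~/ray_results/my_train_experiment"  # placeholder
    train_dataset = ray.data.from_items([{"x": i, "y": 2 * i} for i in range(100)])

    def train_loop_per_worker(config):
        # Training code that saves and loads checkpoints goes here.
        ...

    if TorchTrainer.can_restore(experiment_path):
        # A previous (interrupted) run exists at this path: resume it.
        trainer = TorchTrainer.restore(
            experiment_path, datasets={"train": train_dataset}
        )
    else:
        # No valid experiment found at the path: start a fresh run.
        trainer = TorchTrainer(
            train_loop_per_worker,
            datasets={"train": train_dataset},
            scaling_config=ScalingConfig(num_workers=2),
            run_config=RunConfig(name="my_train_experiment"),
        )
    result = trainer.fit()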

.. seealso::

See the :meth:`BaseTrainer.restore <ray.train.trainer.BaseTrainer.restore>` docstring
for a full example.

.. note::

`<Framework>Trainer.restore` is different from
:class:`<Framework>Trainer(..., resume_from_checkpoint=...) <ray.train.trainer.BaseTrainer>`.
`resume_from_checkpoint` is meant to be used to start a *new* Train experiment,
which writes results to a new directory and starts over from iteration 0.

`<Framework>Trainer.restore` is used to continue an existing experiment, where
new results will continue to be appended to existing logs.
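
Schematically, with placeholder names for the training function and the previous
run's result:

.. code-block:: python

    from ray.air import ScalingConfig
    from ray.train.torch import TorchTrainer

    # Continue the interrupted experiment in place: new results are appended
    # to the existing experiment directory (placeholder path below).
    trainer = TorchTrainer.restore("~/ray_results/my_train_experiment")

    # Start a brand-new experiment, warm-started from a checkpoint of an
    # earlier run. `train_loop_per_worker` and `previous_result` stand in for
    # your training function and an earlier `trainer.fit()` result.
    new_trainer = TorchTrainer(
        train_loop_per_worker,
        resume_from_checkpoint=previous_result.checkpoint,
        scaling_config=ScalingConfig(num_workers=2),
    )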

.. Running on pre-emptible machines
.. --------------------------------