docs: searcher context removal docs
azhou-determined committed Oct 31, 2024
1 parent f9ac6bc commit bd53030
Showing 9 changed files with 344 additions and 60 deletions.
235 changes: 235 additions & 0 deletions docs/model-dev-guide/api-guides/apis-howto/deepspeed/deepspeed.rst

@@ -365,6 +365,241 @@ profiling batches 3 and 4.
rendering times for TensorBoard and memory issues. For long-running experiments, it is
recommended to configure a profiling schedule.

*******************
DeepSpeed Trainer
*******************

With the DeepSpeed Trainer API, you can implement and iterate on model training code locally before
running it on a cluster. When you are satisfied with your model code, you configure and submit the
code to the cluster.

The DeepSpeed Trainer API lets you do the following:

- Work locally, iterating on your model code.
- Debug models in your favorite debugging environment (e.g., directly on your machine, in an IDE, or
  in a Jupyter notebook).
- Run training scripts without needing to use an experiment configuration file.
- Load previously saved checkpoints directly into your model.

Initializing the Trainer
========================

After defining the DeepSpeed Trial, initialize the trial and the trainer.
:meth:`~determined.pytorch.deepspeed.init` returns a
:class:`~determined.pytorch.deepspeed.DeepSpeedTrialContext` for instantiating
:class:`~determined.pytorch.deepspeed.DeepSpeedTrial`. Initialize
:class:`~determined.pytorch.deepspeed.Trainer` with the trial and context.

.. code:: python

   import logging

   import determined as det
   from determined.pytorch import deepspeed as det_ds


   def main():
       with det_ds.init() as train_context:
           trial = MyTrial(train_context)
           trainer = det_ds.Trainer(trial, train_context)


   if __name__ == "__main__":
       # Configure logging
       logging.basicConfig(level=logging.INFO, format=det.LOG_FORMAT)
       main()

Training is configured with a call to :meth:`~determined.pytorch.deepspeed.Trainer.fit` with
training loop arguments, such as checkpointing periods, validation periods, and checkpointing
policy.

.. code:: diff

    from determined import pytorch
    from determined.pytorch import deepspeed as det_ds

    def main():
        with det_ds.init() as train_context:
            trial = MyTrial(train_context)
            trainer = det_ds.Trainer(trial, train_context)
   +        trainer.fit(
   +            max_length=pytorch.Epoch(10),
   +            checkpoint_period=pytorch.Batch(100),
   +            validation_period=pytorch.Batch(100),
   +            checkpoint_policy="all"
   +        )

    if __name__ == "__main__":
        # Configure logging
        logging.basicConfig(level=logging.INFO, format=det.LOG_FORMAT)
        main()

Run Your Training Script Locally
================================

Run training scripts locally without submitting to a cluster or defining an experiment configuration
file.

.. code:: python

   import logging

   import determined as det
   from determined import pytorch
   from determined.pytorch import deepspeed as det_ds


   def main():
       with det_ds.init() as train_context:
           trial = MyTrial(train_context)
           trainer = det_ds.Trainer(trial, train_context)
           trainer.fit(
               max_length=pytorch.Epoch(10),
               checkpoint_period=pytorch.Batch(100),
               validation_period=pytorch.Batch(100),
               checkpoint_policy="all",
           )


   if __name__ == "__main__":
       # Configure logging
       logging.basicConfig(level=logging.INFO, format=det.LOG_FORMAT)
       main()

You can run this Python script directly (``python3 train.py``) or in a Jupyter notebook. This code
will train for ten epochs, checkpointing and validating every 100 batches.

Local Distributed Training
==========================

Local training can utilize multiple GPUs on a single node with a few modifications to the above
code.

.. code:: diff

    import deepspeed

    def main():
   +    # Initialize distributed backend before det_ds.init()
   +    deepspeed.init_distributed()
   +    # Set flag used by internal PyTorch training loop
   +    os.environ["DET_MANUAL_INIT_DISTRIBUTED"] = "true"
   +    # Initialize DistributedContext
        with det_ds.init(
   +        distributed=core.DistributedContext.from_deepspeed()
        ) as train_context:
            trial = MyTrial(train_context)
            trainer = det_ds.Trainer(trial, train_context)
            trainer.fit(
                max_length=pytorch.Epoch(10),
                checkpoint_period=pytorch.Batch(100),
                validation_period=pytorch.Batch(100),
                checkpoint_policy="all"
            )

This code can be invoked directly with your distributed backend's launcher, for example:
``deepspeed --num_gpus=4 trainer.py --deepspeed --deepspeed_config ds_config.json``.

Test Mode
=========

The Trainer accepts a ``test_mode`` parameter which, if true, trains and validates your training
code for only one batch, checkpoints, and then exits. This is helpful for debugging code or writing
automated tests around your model code; a minimal test sketch follows the diff below.

.. code:: diff

    trainer.fit(
        max_length=pytorch.Epoch(10),
        checkpoint_period=pytorch.Batch(100),
        validation_period=pytorch.Batch(100),
   +    test_mode=True
    )

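For example, an automated "smoke test" built on ``test_mode`` might look like the following sketch
(the test function name and surrounding test harness are assumptions, not part of the API):

.. code:: python

   # Sketch: a fast smoke test that trains and validates a single batch.
   # Assumes MyTrial and the imports from the earlier examples are available.
   def test_one_batch_smoke():
       with det_ds.init() as train_context:
           trial = MyTrial(train_context)
           trainer = det_ds.Trainer(trial, train_context)
           # test_mode=True trains/validates one batch, checkpoints, then exits.
           trainer.fit(max_length=pytorch.Epoch(1), test_mode=True)
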
Prepare Your Training Code for Deploying to a Determined Cluster
================================================================

Once you are satisfied with the results of training the model locally, you can submit the code to a
cluster. This example allows for distributed training both locally and on a cluster without any code
changes.

Example workflow of frequent iterations between local debugging and cluster deployment:

.. code:: diff

    def main():
   +    local = det.get_cluster_info() is None
   +    if local:
   +        # Local: configure local distributed training.
   +        deepspeed.init_distributed()
   +        # Set flag used by internal PyTorch training loop
   +        os.environ["DET_MANUAL_INIT_DISTRIBUTED"] = "true"
   +        distributed_context = core.DistributedContext.from_deepspeed()
   +        latest_checkpoint = None
   +    else:
   +        # On-cluster: Determined will automatically detect distributed context.
   +        distributed_context = None
   +        # On-cluster: configure the latest checkpoint for pause/resume training functionality.
   +        latest_checkpoint = det.get_cluster_info().latest_checkpoint
   +    with det_ds.init(
   +        distributed=distributed_context
        ) as train_context:
            trial = DCGANTrial(train_context)
            trainer = det_ds.Trainer(trial, train_context)
            trainer.fit(
                max_length=pytorch.Epoch(11),
                checkpoint_period=pytorch.Batch(100),
                validation_period=pytorch.Batch(100),
   +            latest_checkpoint=latest_checkpoint,
            )

To run the Trainer API solely on-cluster, the code is much simpler:

.. code:: python

   def main():
       with det_ds.init() as train_context:
           trial_inst = gan_model.DCGANTrial(train_context)
           trainer = det_ds.Trainer(trial_inst, train_context)
           trainer.fit(
               max_length=pytorch.Epoch(11),
               checkpoint_period=pytorch.Batch(100),
               validation_period=pytorch.Batch(100),
               latest_checkpoint=det.get_cluster_info().latest_checkpoint,
           )

Submit Your Trial for Training on Cluster
=========================================

To run your experiment on a cluster, you'll need to create an experiment configuration (YAML) file.
Your experiment configuration file must contain a searcher configuration and an entrypoint.

.. code:: yaml

   name: dcgan_deepspeed_mnist
   searcher:
     name: single
     metric: validation_loss
   resources:
     slots_per_trial: 2
   entrypoint: python3 -m determined.launch.deepspeed python3 train.py

Submit the trial to the cluster:

.. code:: bash

   det e create det.yaml .

If your training code needs to read some values from the experiment configuration,
``pytorch.deepspeed.init()`` accepts an ``exp_conf`` argument which allows calling
``context.get_experiment_config()`` from ``DeepSpeedTrialContext``.
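
For example, when iterating locally you might pass a plain dictionary to ``init()`` (the
``my_exp_conf`` dictionary below is purely illustrative, not a required schema):

.. code:: python

   # Illustrative sketch: expose experiment config values to the trial when
   # running locally. ``my_exp_conf`` is a hypothetical dict mirroring the
   # relevant parts of the experiment YAML shown above.
   my_exp_conf = {"searcher": {"name": "single", "metric": "validation_loss"}}

   with det_ds.init(exp_conf=my_exp_conf) as train_context:
       trial = MyTrial(train_context)
       # Inside the trial, self.context.get_experiment_config() returns this dict.
       trainer = det_ds.Trainer(trial, train_context)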

Profiling
=========

When training on-cluster, you can enable the system metrics profiler by adding a parameter to your
``fit()`` call:

.. code:: diff

    trainer.fit(
        ...,
   +    profiling_enabled=True
    )

*****************************
Known DeepSpeed Constraints
*****************************

@@ -17,9 +17,14 @@ experiment configuration, specifying an appropriate DeepSpeed configuration.
Reference conversion example:

.. code:: diff

   +import deepspeed
   -class MyTrial(PyTorchTrial):
   +class MyTrial(DeepSpeedTrial):
   -from determined import pytorch
   +from determined.pytorch import deepspeed as det_ds
   -class MyTrial(pytorch.PyTorchTrial):
   +class MyTrial(det_ds.DeepSpeedTrial):
        def __init__(self, context):
            self.context = context
            self.args = AttrDict(self.context.get_hparams())

46 changes: 1 addition & 45 deletions docs/reference/experiment-config-reference.rst

@@ -319,25 +319,6 @@ While debugging, the logger will display lines highlighted in blue for easy iden
Validation Policy
*******************

.. _experiment-config-min-validation-period:

``min_validation_period``
=========================

Optional. Specifies the minimum frequency at which validation should be run for each trial.

- The frequency should be defined using a nested dictionary indicating the unit as records,
  batches, or epochs. For example:

  .. code:: yaml

     min_validation_period:
       epochs: 2

- :class:`~determined.pytorch.deepspeed.DeepSpeedTrial` and
  :class:`~determined.keras.TFKerasTrial`: If this is in the unit of epochs, ``records_per_epoch``
  must be specified.

.. _experiment-config-perform-initial-validation:

``perform_initial_validation``

@@ -360,25 +341,6 @@ Determined checkpoints in the following situations:
- Prior to the searcher making a decision based on the validation of trials, ensuring consistency
in case of a failure.

.. _experiment-config-min-checkpoint-period:

``min_checkpoint_period``
=========================

Optional. Specifies the minimum frequency for running checkpointing for each trial.

- This value should be set using a nested dictionary in the form of records, batches, or epochs.
  For example:

  .. code:: yaml

     min_checkpoint_period:
       epochs: 2

- :class:`~determined.pytorch.deepspeed.DeepSpeedTrial` and
  :class:`~determined.keras.TFKerasTrial`: If the unit is in epochs, you must also specify
  ``records_per_epoch``.

``checkpoint_policy``
=====================

@@ -394,8 +356,7 @@ Should be set to one of the following values:

- ``none``: A checkpoint will never be taken *due* to a validation. However, even with this policy
selected, checkpoints are still expected to be taken after the trial is finished training, due to
cluster scheduling decisions, before search method decisions, or due to
:ref:`min_checkpoint_period <experiment-config-min-checkpoint-period>`.
cluster scheduling decisions, or when specified in training code.

.. _checkpoint-storage:

@@ -835,18 +796,13 @@ Single
The ``single`` search method does not perform a hyperparameter search at all; rather, it trains a
single trial for a fixed length. When using this search method, all of the hyperparameters specified
in the :ref:`hyperparameters <experiment-configuration_hyperparameters>` section must be constants.
By default, validation metrics are only computed once, after the specified length of training has
been completed; :ref:`min_validation_period <experiment-config-min-validation-period>` can be used
to specify that validation metrics should be computed more frequently.

``metric``
----------

Required. The name of the validation metric used to evaluate the performance of a hyperparameter
configuration.

.. _experiment-configuration_single-searcher-max-length:

**Optional Fields**

``smaller_is_better``
13 changes: 13 additions & 0 deletions docs/reference/training/api-deepspeed-reference.rst

@@ -48,3 +48,16 @@ documentation):
- :ref:`determined.pytorch.samplers <pytorch-samplers>`
- :ref:`determined.pytorch.MetricReducer <pytorch-metric-reducer>`
- :ref:`determined.pytorch.PyTorchCallback <pytorch-callbacks>`

******************************************
``determined.pytorch.deepspeed.Trainer``
******************************************

.. autoclass:: determined.pytorch.deepspeed.Trainer
   :members:

*****************************************
``determined.pytorch.deepspeed.init()``
*****************************************

.. autofunction:: determined.pytorch.deepspeed.init
