docs: searcher context removal docs
azhou-determined committed Oct 31, 2024
1 parent f9ac6bc commit bd53030
Showing 9 changed files with 344 additions and 60 deletions.
235 changes: 235 additions & 0 deletions docs/model-dev-guide/api-guides/apis-howto/deepspeed/deepspeed.rst

@@ -365,6 +365,241 @@ profiling batches 3 and 4.
rendering times for TensorBoard and memory issues. For long-running experiments, it is
recommended to configure a profiling schedule.

*******************
DeepSpeed Trainer
*******************

With the DeepSpeed Trainer API, you can implement and iterate on model training code locally before
running it on a cluster. When you are satisfied with your model code, you configure and submit the
code to the cluster.

The DeepSpeed Trainer API lets you do the following:

- Work locally, iterating on your model code.
- Debug models in your favorite debugging environment (e.g., directly on your machine, in an IDE, or
  in a Jupyter notebook).
- Run training scripts without needing to use an experiment configuration file.
- Load previously saved checkpoints directly into your model.

Initializing the Trainer
========================

After defining the DeepSpeed Trial, initialize the trial and the trainer.
:meth:`~determined.pytorch.deepspeed.init` returns a
:class:`~determined.pytorch.deepspeed.DeepSpeedTrialContext` for instantiating
:class:`~determined.pytorch.deepspeed.DeepSpeedTrial`. Initialize
:class:`~determined.pytorch.deepspeed.Trainer` with the trial and context.

.. code:: python

   import logging

   import determined as det
   from determined.pytorch import deepspeed as det_ds


   def main():
       with det_ds.init() as train_context:
           trial = MyTrial(train_context)
           trainer = det_ds.Trainer(trial, train_context)


   if __name__ == "__main__":
       # Configure logging
       logging.basicConfig(level=logging.INFO, format=det.LOG_FORMAT)
       main()

Training is configured with a call to :meth:`~determined.pytorch.deepspeed.Trainer.fit` with
training loop arguments, such as checkpointing periods, validation periods, and checkpointing
policy.

.. code:: diff

    from determined import pytorch
    from determined.pytorch import deepspeed as det_ds

    def main():
        with det_ds.init() as train_context:
            trial = MyTrial(train_context)
            trainer = det_ds.Trainer(trial, train_context)
   +        trainer.fit(
   +            max_length=pytorch.Epoch(10),
   +            checkpoint_period=pytorch.Batch(100),
   +            validation_period=pytorch.Batch(100),
   +            checkpoint_policy="all"
   +        )

    if __name__ == "__main__":
        # Configure logging
        logging.basicConfig(level=logging.INFO, format=det.LOG_FORMAT)
        main()

Run Your Training Script Locally
================================

Run training scripts locally without submitting to a cluster or defining an experiment configuration
file.

.. code:: python

   import logging

   import determined as det
   from determined import pytorch
   from determined.pytorch import deepspeed as det_ds


   def main():
       with det_ds.init() as train_context:
           trial = MyTrial(train_context)
           trainer = det_ds.Trainer(trial, train_context)
           trainer.fit(
               max_length=pytorch.Epoch(10),
               checkpoint_period=pytorch.Batch(100),
               validation_period=pytorch.Batch(100),
               checkpoint_policy="all",
           )


   if __name__ == "__main__":
       # Configure logging
       logging.basicConfig(level=logging.INFO, format=det.LOG_FORMAT)
       main()

You can run this Python script directly (``python3 train.py``) or in a Jupyter notebook. This code
will train for ten epochs, checkpointing and validating every 100 batches.

Local Distributed Training
==========================

Local training can utilize multiple GPUs on a single node with a few modifications to the above
code.

.. code:: diff

    import deepspeed

    def main():
   +    # Initialize distributed backend before det_ds.init()
   +    deepspeed.init_distributed()
   +    # Set flag used by internal PyTorch training loop
   +    os.environ["DET_MANUAL_INIT_DISTRIBUTED"] = "true"
   +    # Initialize DistributedContext
        with det_ds.init(
   +        distributed=core.DistributedContext.from_deepspeed()
        ) as train_context:
            trial = MyTrial(train_context)
            trainer = det_ds.Trainer(trial, train_context)
            trainer.fit(
                max_length=pytorch.Epoch(10),
                checkpoint_period=pytorch.Batch(100),
                validation_period=pytorch.Batch(100),
                checkpoint_policy="all"
            )

This code can be invoked directly with your distributed backend's launcher, for example:
``deepspeed --num_gpus=4 trainer.py --deepspeed --deepspeed_config ds_config.json``.

Test Mode
=========

The Trainer accepts a ``test_mode`` parameter which, if true, trains and validates your training
code for only one batch, checkpoints, and then exits. This is helpful for debugging code or writing
automated tests around your model code; a minimal test sketch follows the diff below.

.. code:: diff

    trainer.fit(
        max_length=pytorch.Epoch(10),
        checkpoint_period=pytorch.Batch(100),
        validation_period=pytorch.Batch(100),
   +    test_mode=True
    )

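For example, an automated "smoke test" built on ``test_mode`` might look like the following sketch
(the test function name and surrounding test harness are assumptions, not part of the API):

.. code:: python

   # Sketch: a fast smoke test that trains and validates a single batch.
   # Assumes MyTrial and the imports from the earlier examples are available.
   def test_one_batch_smoke():
       with det_ds.init() as train_context:
           trial = MyTrial(train_context)
           trainer = det_ds.Trainer(trial, train_context)
           # test_mode=True trains/validates one batch, checkpoints, then exits.
           trainer.fit(max_length=pytorch.Epoch(1), test_mode=True)
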
Prepare Your Training Code for Deploying to a Determined Cluster
================================================================

Once you are satisfied with the results of training the model locally, you can submit the code to a
cluster. This example allows for distributed training both locally and on a cluster without any code
changes.

Example workflow of frequent iterations between local debugging and cluster deployment:

.. code:: diff

    def main():
   +    local = det.get_cluster_info() is None
   +    if local:
   +        # Local: configure local distributed training.
   +        deepspeed.init_distributed()
   +        # Set flag used by internal PyTorch training loop
   +        os.environ["DET_MANUAL_INIT_DISTRIBUTED"] = "true"
   +        distributed_context = core.DistributedContext.from_deepspeed()
   +        latest_checkpoint = None
   +    else:
   +        # On-cluster: Determined will automatically detect distributed context.
   +        distributed_context = None
   +        # On-cluster: configure the latest checkpoint for pause/resume training functionality.
   +        latest_checkpoint = det.get_cluster_info().latest_checkpoint
   +    with det_ds.init(
   +        distributed=distributed_context
        ) as train_context:
            trial = DCGANTrial(train_context)
            trainer = det_ds.Trainer(trial, train_context)
            trainer.fit(
                max_length=pytorch.Epoch(11),
                checkpoint_period=pytorch.Batch(100),
                validation_period=pytorch.Batch(100),
   +            latest_checkpoint=latest_checkpoint,
            )

To run the Trainer API solely on-cluster, the code is much simpler:

.. code:: python

   def main():
       with det_ds.init() as train_context:
           trial_inst = gan_model.DCGANTrial(train_context)
           trainer = det_ds.Trainer(trial_inst, train_context)
           trainer.fit(
               max_length=pytorch.Epoch(11),
               checkpoint_period=pytorch.Batch(100),
               validation_period=pytorch.Batch(100),
               latest_checkpoint=det.get_cluster_info().latest_checkpoint,
           )

Submit Your Trial for Training on Cluster
=========================================

To run your experiment on a cluster, you'll need to create an experiment configuration (YAML) file.
Your experiment configuration file must contain a searcher configuration and an entrypoint.

.. code:: yaml

   name: dcgan_deepspeed_mnist
   searcher:
     name: single
     metric: validation_loss
   resources:
     slots_per_trial: 2
   entrypoint: python3 -m determined.launch.deepspeed python3 train.py

Submit the trial to the cluster:

.. code:: bash

   det e create det.yaml .

If your training code needs to read some values from the experiment configuration,
``pytorch.deepspeed.init()`` accepts an ``exp_conf`` argument which allows calling
``context.get_experiment_config()`` from ``DeepSpeedTrialContext``.
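
For example, when iterating locally you might pass a plain dictionary to ``init()`` (the
``my_exp_conf`` dictionary below is purely illustrative, not a required schema):

.. code:: python

   # Illustrative sketch: expose experiment config values to the trial when
   # running locally. ``my_exp_conf`` is a hypothetical dict mirroring the
   # relevant parts of the experiment YAML shown above.
   my_exp_conf = {"searcher": {"name": "single", "metric": "validation_loss"}}

   with det_ds.init(exp_conf=my_exp_conf) as train_context:
       trial = MyTrial(train_context)
       # Inside the trial, self.context.get_experiment_config() returns this dict.
       trainer = det_ds.Trainer(trial, train_context)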

Profiling
=========

When training on-cluster, you can enable the system metrics profiler by adding a parameter to your
``fit()`` call:

.. code:: diff

    trainer.fit(
        ...,
   +    profiling_enabled=True
    )

*****************************
Known DeepSpeed Constraints
*****************************

@@ -17,9 +17,14 @@ experiment configuration, specifying an appropriate DeepSpeed configuration.
Reference conversion example:

.. code:: diff

   +import deepspeed
   -class MyTrial(PyTorchTrial):
   +class MyTrial(DeepSpeedTrial):
   -from determined import pytorch
   +from determined.pytorch import deepspeed as det_ds
   -class MyTrial(pytorch.PyTorchTrial):
   +class MyTrial(det_ds.DeepSpeedTrial):
        def __init__(self, context):
            self.context = context
            self.args = AttrDict(self.context.get_hparams())

46 changes: 1 addition & 45 deletions docs/reference/experiment-config-reference.rst

@@ -319,25 +319,6 @@ While debugging, the logger will display lines highlighted in blue for easy iden
Validation Policy
*******************

.. _experiment-config-min-validation-period:

``min_validation_period``
=========================

Optional. Specifies the minimum frequency at which validation should be run for each trial.

- The frequency should be defined using a nested dictionary indicating the unit as records,
  batches, or epochs. For example:

  .. code:: yaml

     min_validation_period:
       epochs: 2

- :class:`~determined.pytorch.deepspeed.DeepSpeedTrial` and
  :class:`~determined.keras.TFKerasTrial`: If this is in the unit of epochs, ``records_per_epoch``
  must be specified.

.. _experiment-config-perform-initial-validation:

``perform_initial_validation``

@@ -360,25 +341,6 @@ Determined checkpoints in the following situations:
- Prior to the searcher making a decision based on the validation of trials, ensuring consistency
in case of a failure.

.. _experiment-config-min-checkpoint-period:

``min_checkpoint_period``
=========================

Optional. Specifies the minimum frequency for running checkpointing for each trial.

- This value should be set using a nested dictionary in the form of records, batches, or epochs.
  For example:

  .. code:: yaml

     min_checkpoint_period:
       epochs: 2

- :class:`~determined.pytorch.deepspeed.DeepSpeedTrial` and
  :class:`~determined.keras.TFKerasTrial`: If the unit is in epochs, you must also specify
  ``records_per_epoch``.

``checkpoint_policy``
=====================

@@ -394,8 +356,7 @@ Should be set to one of the following values:

- ``none``: A checkpoint will never be taken *due* to a validation. However, even with this policy
selected, checkpoints are still expected to be taken after the trial is finished training, due to
cluster scheduling decisions, before search method decisions, or due to
:ref:`min_checkpoint_period <experiment-config-min-checkpoint-period>`.
cluster scheduling decisions, or when specified in training code.

.. _checkpoint-storage:

@@ -835,18 +796,13 @@ Single
The ``single`` search method does not perform a hyperparameter search at all; rather, it trains a
single trial for a fixed length. When using this search method, all of the hyperparameters specified
in the :ref:`hyperparameters <experiment-configuration_hyperparameters>` section must be constants.
By default, validation metrics are only computed once, after the specified length of training has
been completed; :ref:`min_validation_period <experiment-config-min-validation-period>` can be used
to specify that validation metrics should be computed more frequently.

``metric``
----------

Required. The name of the validation metric used to evaluate the performance of a hyperparameter
configuration.

.. _experiment-configuration_single-searcher-max-length:

**Optional Fields**

``smaller_is_better``
13 changes: 13 additions & 0 deletions docs/reference/training/api-deepspeed-reference.rst

@@ -48,3 +48,16 @@ documentation):
- :ref:`determined.pytorch.samplers <pytorch-samplers>`
- :ref:`determined.pytorch.MetricReducer <pytorch-metric-reducer>`
- :ref:`determined.pytorch.PyTorchCallback <pytorch-callbacks>`

******************************************
``determined.pytorch.deepspeed.Trainer``
******************************************

.. autoclass:: determined.pytorch.deepspeed.Trainer
   :members:

*****************************************
``determined.pytorch.deepspeed.init()``
*****************************************

.. autofunction:: determined.pytorch.deepspeed.init
