[Train][Doc] Update PyTorch Data Ingestion User Guide #45421

Merged · 4 commits · Jun 25, 2024 · Changes from all commits
doc/source/train/user-guides/data-loading-preprocessing.rst: 42 changes (28 additions, 14 deletions)
@@ -13,6 +13,10 @@ Key advantages include:

For more details about Ray Data, including comparisons to alternatives, see :ref:`Ray Data Overview <data_overview>`.

.. note::

In addition to Ray Data, you can continue to use framework-native data utilities with Ray Train, such as PyTorch Dataset, Hugging Face Dataset, and Lightning DataModule.

In this guide, we will cover how to incorporate Ray Data into your Ray Train script and different ways to customize your data ingestion pipeline.

.. TODO: Replace this image with a better one.
@@ -258,8 +262,7 @@ Some frameworks provide their own dataset and data loading utilities. For example:
- **Hugging Face:** `Dataset <https://huggingface.co/docs/datasets/index>`_
- **PyTorch Lightning:** `LightningDataModule <https://lightning.ai/docs/pytorch/stable/data/datamodule.html>`_

You can still use these framework data utilities directly with Ray Train.

At a high level, you can compare these concepts as follows:

@@ -276,34 +279,45 @@ At a high level, you can compare these concepts as follows:
- n/a
- :meth:`ray.data.Dataset.iter_torch_batches`


For more details, see the following sections for each framework:

.. tab-set::

.. tab-item:: PyTorch DataLoader

**Option 1 (with Ray Data):**

1. Convert your PyTorch Dataset to a Ray Dataset.
2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
3. Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
4. Create a dataset iterable via :meth:`ray.data.DataIterator.iter_torch_batches`.

For more details, see :ref:`Migrating from PyTorch Datasets and DataLoaders <migrate_pytorch>`.
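
The following is a minimal sketch of these four steps, assuming a small in-memory dataset and two workers; the dataset contents and the training step are placeholders.

.. code-block:: python

    import ray
    import ray.train
    from ray.train.torch import TorchTrainer

    # Step 1: a Ray Dataset standing in for a converted PyTorch Dataset.
    # (ray.data.from_torch can convert an existing Torch Dataset.)
    train_ds = ray.data.from_items(
        [{"x": float(i), "y": 2.0 * i} for i in range(128)]
    )

    def train_loop_per_worker(config):
        # Step 3: fetch this worker's shard of the "train" dataset.
        shard = ray.train.get_dataset_shard("train")
        # Step 4: iterate over Torch-tensor batches in place of a DataLoader.
        for batch in shard.iter_torch_batches(batch_size=32):
            x, y = batch["x"], batch["y"]
            ...  # forward and backward passes go here

    trainer = TorchTrainer(
        train_loop_per_worker,
        # Step 2: pass the Ray Dataset to the Trainer.
        datasets={"train": train_ds},
        scaling_config=ray.train.ScalingConfig(num_workers=2),
    )
    trainer.fit()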

**Option 2 (without Ray Data):**

1. Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
2. Use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training.
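
A minimal sketch of this option, assuming a toy in-memory ``TensorDataset``; :meth:`ray.train.torch.prepare_data_loader` adds a ``DistributedSampler`` and moves batches to the correct device.

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    import ray.train.torch

    def train_loop_per_worker(config):
        # Step 1: build the Torch Dataset and DataLoader on each worker.
        dataset = TensorDataset(torch.randn(128, 4), torch.randn(128, 1))
        data_loader = DataLoader(dataset, batch_size=32, shuffle=True)

        # Step 2: prepare the DataLoader for distributed training.
        data_loader = ray.train.torch.prepare_data_loader(data_loader)

        for x, y in data_loader:
            ...  # forward and backward passes go here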

.. tab-item:: LightningDataModule

The ``LightningDataModule`` is created with PyTorch ``Dataset``\s and ``DataLoader``\s, so the same two options described above apply.
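
For illustration, a hypothetical toy DataModule wrapping the same kind of DataLoader might look like this:

.. code-block:: python

    import pytorch_lightning as pl
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # A hypothetical DataModule for illustration only.
    class ToyDataModule(pl.LightningDataModule):
        def train_dataloader(self):
            dataset = TensorDataset(torch.randn(128, 4), torch.randn(128, 1))
            return DataLoader(dataset, batch_size=32)

You can either pass such a DataModule to a Lightning ``Trainer`` inside ``train_loop_per_worker``, or extract its DataLoaders and apply either option from the PyTorch DataLoader tab.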

.. tab-item:: Hugging Face Dataset

**Option 1 (with Ray Data):**

1. Convert your Hugging Face Dataset to a Ray Dataset. For instructions, see :ref:`Ray Data for Hugging Face <loading_datasets_from_ml_libraries>`.
2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
4. Create an iterable dataset via :meth:`ray.data.DataIterator.iter_torch_batches`.
5. Pass the iterable dataset to ``transformers.Trainer`` at initialization.
6. Wrap your transformers trainer with the :meth:`ray.train.huggingface.transformers.prepare_trainer` utility.
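
A skeleton of these six steps; the model and dataset are placeholders to fill in, and with an iterable dataset ``TrainingArguments`` needs step-based stopping (``max_steps``) rather than epoch-based stopping.

.. code-block:: python

    import ray
    import ray.train
    from ray.train.torch import TorchTrainer
    from ray.train.huggingface.transformers import prepare_trainer

    def train_loop_per_worker(config):
        from transformers import Trainer, TrainingArguments

        # Step 3: access this worker's shard of the dataset.
        shard = ray.train.get_dataset_shard("train")
        # Step 4: create an iterable of Torch batches.
        train_iterable = shard.iter_torch_batches(batch_size=16)

        model = ...  # placeholder: any transformers model
        args = TrainingArguments(
            output_dir="output",
            max_steps=100,  # iterable datasets have no length
        )
        # Step 5: pass the iterable dataset when initializing the Trainer.
        trainer = Trainer(model=model, args=args, train_dataset=train_iterable)
        # Step 6: wrap the trainer for Ray Train.
        trainer = prepare_trainer(trainer)
        trainer.train()

    # Step 1: convert the Hugging Face Dataset to a Ray Dataset.
    hf_dataset = ...  # placeholder: e.g. loaded with datasets.load_dataset
    ray_ds = ray.data.from_huggingface(hf_dataset)

    # Step 2: pass the Ray Dataset to the TorchTrainer.
    ray_trainer = TorchTrainer(
        train_loop_per_worker,
        datasets={"train": ray_ds},
        scaling_config=ray.train.ScalingConfig(num_workers=2),
    )
    ray_trainer.fit()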

**Option 2 (without Ray Data):**

1. Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
2. Pass the Hugging Face Dataset into ``transformers.Trainer`` during initialization.
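
A brief sketch, using a hypothetical dataset name and leaving the model and preprocessing as placeholders:

.. code-block:: python

    def train_loop_per_worker(config):
        import datasets
        from transformers import Trainer, TrainingArguments

        # Step 1: load the Hugging Face Dataset on each worker.
        hf_ds = datasets.load_dataset("yelp_review_full", split="train")

        model = ...  # placeholder: any transformers model
        args = TrainingArguments(output_dir="output", num_train_epochs=1)
        # Step 2: pass the Hugging Face Dataset when initializing the Trainer.
        trainer = Trainer(model=model, args=args, train_dataset=hf_ds)
        trainer.train()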

.. tip::
