[Train][Doc] Update PyTorch Data Ingestion User Guide #45421

Merged · 4 commits · Jun 25, 2024 · Changes from all commits
doc/source/train/user-guides/data-loading-preprocessing.rst: 42 changes (28 additions, 14 deletions)
@@ -13,6 +13,10 @@ Key advantages include:

For more details about Ray Data, including comparisons to alternatives, see :ref:`Ray Data Overview <data_overview>`.

.. note::

In addition to Ray Data, you can continue to use framework-native data utilities with Ray Train, such as PyTorch Dataset, Hugging Face Dataset, and Lightning DataModule.

In this guide, we will cover how to incorporate Ray Data into your Ray Train script and different ways to customize your data ingestion pipeline.

.. TODO: Replace this image with a better one.
@@ -258,8 +262,7 @@ Some frameworks provide their own dataset and data loading utilities. For example:
- **Hugging Face:** `Dataset <https://huggingface.co/docs/datasets/index>`_
- **PyTorch Lightning:** `LightningDataModule <https://lightning.ai/docs/pytorch/stable/data/datamodule.html>`_

You can still use these framework data utilities directly with Ray Train.

At a high level, you can compare these concepts as follows:

@@ -276,34 +279,45 @@ At a high level, you can compare these concepts as follows:
- n/a
- :meth:`ray.data.Dataset.iter_torch_batches`


For more details, see the following sections for each framework:

.. tab-set::

.. tab-item:: PyTorch DataLoader

**Option 1 (with Ray Data):**

1. Convert your PyTorch Dataset to a Ray Dataset.
2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
3. Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
4. Create a dataset iterable via :meth:`ray.data.DataIterator.iter_torch_batches`.

For more details, see :ref:`Migrating from PyTorch Datasets and DataLoaders <migrate_pytorch>`.
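
The following is a minimal sketch of these four steps, assuming a small in-memory dataset and two workers; the dataset contents and the training step are placeholders.

.. code-block:: python

    import ray
    import ray.train
    from ray.train.torch import TorchTrainer

    # Step 1: a Ray Dataset standing in for a converted PyTorch Dataset.
    # (ray.data.from_torch can convert an existing Torch Dataset.)
    train_ds = ray.data.from_items(
        [{"x": float(i), "y": 2.0 * i} for i in range(128)]
    )

    def train_loop_per_worker(config):
        # Step 3: fetch this worker's shard of the "train" dataset.
        shard = ray.train.get_dataset_shard("train")
        # Step 4: iterate over Torch-tensor batches in place of a DataLoader.
        for batch in shard.iter_torch_batches(batch_size=32):
            x, y = batch["x"], batch["y"]
            ...  # forward and backward passes go here

    trainer = TorchTrainer(
        train_loop_per_worker,
        # Step 2: pass the Ray Dataset to the Trainer.
        datasets={"train": train_ds},
        scaling_config=ray.train.ScalingConfig(num_workers=2),
    )
    trainer.fit()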

**Option 2 (without Ray Data):**

1. Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
2. Use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training.
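
A minimal sketch of this option, assuming a toy in-memory ``TensorDataset``; :meth:`ray.train.torch.prepare_data_loader` adds a ``DistributedSampler`` and moves batches to the correct device.

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    import ray.train.torch

    def train_loop_per_worker(config):
        # Step 1: build the Torch Dataset and DataLoader on each worker.
        dataset = TensorDataset(torch.randn(128, 4), torch.randn(128, 1))
        data_loader = DataLoader(dataset, batch_size=32, shuffle=True)

        # Step 2: prepare the DataLoader for distributed training.
        data_loader = ray.train.torch.prepare_data_loader(data_loader)

        for x, y in data_loader:
            ...  # forward and backward passes go here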

.. tab-item:: LightningDataModule

The ``LightningDataModule`` is created with PyTorch ``Dataset``\s and ``DataLoader``\s, so the same two options described above apply.
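
For illustration, a hypothetical toy DataModule wrapping the same kind of DataLoader might look like this:

.. code-block:: python

    import pytorch_lightning as pl
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # A hypothetical DataModule for illustration only.
    class ToyDataModule(pl.LightningDataModule):
        def train_dataloader(self):
            dataset = TensorDataset(torch.randn(128, 4), torch.randn(128, 1))
            return DataLoader(dataset, batch_size=32)

You can either pass such a DataModule to a Lightning ``Trainer`` inside ``train_loop_per_worker``, or extract its DataLoaders and apply either option from the PyTorch DataLoader tab.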

.. tab-item:: Hugging Face Dataset

**Option 1 (with Ray Data):**

1. Convert your Hugging Face Dataset to a Ray Dataset. For instructions, see :ref:`Ray Data for Hugging Face <loading_datasets_from_ml_libraries>`.
2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
4. Create an iterable dataset via :meth:`ray.data.DataIterator.iter_torch_batches`.
5. Pass the iterable dataset to ``transformers.Trainer`` at initialization.
6. Wrap your transformers trainer with the :meth:`ray.train.huggingface.transformers.prepare_trainer` utility.
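
A skeleton of these six steps; the model and dataset are placeholders to fill in, and with an iterable dataset ``TrainingArguments`` needs step-based stopping (``max_steps``) rather than epoch-based stopping.

.. code-block:: python

    import ray
    import ray.train
    from ray.train.torch import TorchTrainer
    from ray.train.huggingface.transformers import prepare_trainer

    def train_loop_per_worker(config):
        from transformers import Trainer, TrainingArguments

        # Step 3: access this worker's shard of the dataset.
        shard = ray.train.get_dataset_shard("train")
        # Step 4: create an iterable of Torch batches.
        train_iterable = shard.iter_torch_batches(batch_size=16)

        model = ...  # placeholder: any transformers model
        args = TrainingArguments(
            output_dir="output",
            max_steps=100,  # iterable datasets have no length
        )
        # Step 5: pass the iterable dataset when initializing the Trainer.
        trainer = Trainer(model=model, args=args, train_dataset=train_iterable)
        # Step 6: wrap the trainer for Ray Train.
        trainer = prepare_trainer(trainer)
        trainer.train()

    # Step 1: convert the Hugging Face Dataset to a Ray Dataset.
    hf_dataset = ...  # placeholder: e.g. loaded with datasets.load_dataset
    ray_ds = ray.data.from_huggingface(hf_dataset)

    # Step 2: pass the Ray Dataset to the TorchTrainer.
    ray_trainer = TorchTrainer(
        train_loop_per_worker,
        datasets={"train": ray_ds},
        scaling_config=ray.train.ScalingConfig(num_workers=2),
    )
    ray_trainer.fit()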

**Option 2 (without Ray Data):**

1. Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
2. Pass the Hugging Face Dataset into ``transformers.Trainer`` during initialization.
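
A brief sketch, using a hypothetical dataset name and leaving the model and preprocessing as placeholders:

.. code-block:: python

    def train_loop_per_worker(config):
        import datasets
        from transformers import Trainer, TrainingArguments

        # Step 1: load the Hugging Face Dataset on each worker.
        hf_ds = datasets.load_dataset("yelp_review_full", split="train")

        model = ...  # placeholder: any transformers model
        args = TrainingArguments(output_dir="output", num_train_epochs=1)
        # Step 2: pass the Hugging Face Dataset when initializing the Trainer.
        trainer = Trainer(model=model, args=args, train_dataset=hf_ds)
        trainer.train()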

.. tip::
