New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

[Train][Doc] Update PyTorch Data Ingestion User Guide #45421

Merged

justinvyu merged 4 commits into ray-project:master from woshiyyya:doc/update_pytorch_dataset

Jun 25, 2024

Member

woshiyyya commented May 17, 2024 •

edited by justinvyu

Loading

Why are these changes needed?

This PR improves framework migration steps to Ray Data for data ingest user guide.

Follow-up PRs should do a larger restructure to make this guide more readable:

Future restructure plans

Restructured the "Starting from PyTorch Data" section for better readability. The following PR will incorporate the following restructure:

Introduction (don’t need a title)
- Layout the structure of the doc
- Comparing framework-native utilities v.s. Ray Data, and guide user to choose one or the other.
  - Add quick link to the corresponding for "Use Framework-native .." and "Use Ray Data" sections
  - Show the benefits of ray data (can include contents in the blog post)
Use Ray Data
- Quickstart (add code snippets for each subsection)
- Loading data
- Inputting and splitting data
- …
- Remaining Ray Data Usage guide (Splitting Dataset, ..., Performance tips)
Use Framework-native Data Utilities
- Quickstart
- Code snippets for Torch,Lighting, and HF

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(


          update

83feb79

Signed-off-by: woshiyyya <[email protected]>

woshiyyya requested review from matthewdeng, justinvyu and a team as code owners

May 17, 2024 22:04

woshiyyya assigned justinvyu and matthewdeng

matthewdeng reviewed

View reviewed changes

Contributor

matthewdeng left a comment

Nice! Love the added benefits & cleaner steps!

doc/source/train/user-guides/data-loading-preprocessing.rst Show resolved Hide resolved

doc/source/train/user-guides/data-loading-preprocessing.rst Outdated

               .. tab-set::
-                  .. tab-item:: PyTorch Dataset and DataLoader
+                  .. tab-item:: PyTorch

Contributor

matthewdeng May 17, 2024

The original names were more explicit to make it clear that this is referring to the dataset framework, rather than the training framework.

doc/source/train/user-guides/data-loading-preprocessing.rst Outdated

                    - n/a
                    - :meth:`ray.data.Dataset.iter_torch_batches`
+              Why using Ray Data?

Contributor

matthewdeng May 18, 2024

Suggested change

      
            Why using Ray Data?
          
            Comparison with Ray Data

doc/source/train/user-guides/data-loading-preprocessing.rst Show resolved Hide resolved

doc/source/train/user-guides/data-loading-preprocessing.rst Outdated

-                      **Option 1 (with Ray Data):** Convert your PyTorch Dataset to a Ray Dataset and pass it into the Trainer via  ``datasets`` argument.
-                      Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
-                      You can convert this to replace the PyTorch DataLoader via :meth:`ray.data.DataIterator.iter_torch_batches`.
+. Convert your PyTorch Dataset to a Ray Dataset and

Contributor

matthewdeng May 18, 2024

nit:

Suggested change

      
                    1. Convert your PyTorch Dataset to a Ray Dataset and 
          
                    1. Convert your PyTorch Dataset to a Ray Dataset.

There are some other small typos/formatting errors that I'll review more thoroughly in a follow-up review.

doc/source/train/user-guides/data-loading-preprocessing.rst Outdated


		For instructions, see :ref:`Ray Data for Hugging Face <loading_datasets_from_ml_libraries>`.
		Option 2 (with HuggingFace Dataset):

Contributor

matthewdeng May 18, 2024

nit: I understand why you chose to do this but I'm also a little worried this might be confusing since Option 1 technically does also use Hugging Face Datasets.

Member Author

woshiyyya May 18, 2024 •

edited

Loading

Oh I realized the difference now. Previously, this section aims at teaching users how to convert their HF dataset to Ray Dataset, then do training. But this PR tries to directly categorize on what we eventually use in the training function.

# prev
HF Dataset -> Ray Data -> HF Transformers
            HF Dataset -> HF Transformers

# now
Ray Data -> HF Transformers
HF Dataset -> HF Transformers

My consideration here is that we'd better not force everyone to take the "HF Dataset -> Ray Data" conversion step.

For example, their original datasets format could be parquet, and before onboarding Ray, they already build a HF Dataset from parquet file, then feed it to HF Trainer.

In this case, they can either build ray dataset from parquet or from HF dataset.

# Before onboarding Ray
raw data -> HF dataset -> HF transformer

# After onboarding Ray
option 1: raw data -> HF dataset -> Ray Data -> HF transformer

v.s. 

option 2:  raw data -> Ray Data -> HF transformer

We can discuss more in person next week.

doc/source/train/user-guides/data-loading-preprocessing.rst Outdated

Comment on lines 287 to 298

+              **Streaming execution**:
+              - The preprocessing pipeline will be executed lazily and stream the data batches into training workers.
+              - Training can start immediately without significant up-front preprocessing time.
+              **Automatic data sharding**:
+              - The dataset will be automatically sharded across all training workers.
-              For more details, see the following sections for each framework.
+              **Leverage additional resources for preprocessing**
+              - Ray Data can utilize all resources in the Ray cluster for preprocessing, not just those on your training nodes.

Contributor

matthewdeng May 18, 2024

This is good content that I think everyone should read, regardless of whether or not they are starting with PyTorch data. Do you think we could bring this higher up in the guide (e.g. even in the introduction), and then reference it from here?

Member Author

woshiyyya May 18, 2024

OK. Sounds good to me.

justinvyu reviewed

View reviewed changes

Contributor

justinvyu left a comment •

edited

Loading

High level comments:

Can we copy some content from this blog post? This user guide should be the place to compare Ray Data against other data ingest solutions. Particularly, I'm thinking of copying over the diagrams as well as the table comparing against torch dataloader, HF dataset, tf data, etc.
Proposed restructure of this guide:

(Ray Data + Ray Train) Quickstart
    -> Code examples, with the "Option 1: Ray Data" moved over here under each framework.
Why use Ray Data?
    -> Comparison with Other Data Ingest Solutions
        -> Comparison table
Alternative to Ray Data Ingest (Framework-native Dataloaders)
    -> "Ray Data is the recommended data loading solution for scalable blah blah blah, but Ray Train still integrates well with existing dataloading solutions you may be using, such as X, Y, Z.
    -> Link to the framework user guides, since we already go over how to set up the framework native dataloaders.
Ray Data Configurations
    -> all the remaining sections become subsections.

I think this structure fixes the problem where I was getting lost in the middle of the user guide because it suddenly starts talking about pytorch dataloader -- it wasn't clear that there are 2 separate paths: Ray Data vs. Alternatives. Now, we first put Ray Data front and center and make the case for it. Then, we talk about alternatives that are still integrated nicely.

For a follow-up PR, it would be nice to have some more realistic examples. For example, show read_parquet("s3://...") instead of the from_items dummy dataset that we have right now in the torch ray data quickstart. Can borrow this from the blog post again.


          update

d58ce33

Signed-off-by: woshiyyya <[email protected]>

woshiyyya requested review from hongpeng-guo and raulchen as code owners

June 3, 2024 16:59

Member Author

woshiyyya commented Jun 5, 2024 •

edited

Loading

Discussed with Angelina and below is a proposal of the user guide:

Introduction (don’t need a title)
- Layout the structure of the doc
- Comparing framework-native utilities v.s. Ray Data, and guide user to choose one or the other.
  - Add quick link to the corresponding for "Use Framework-native .." and "Use Ray Data" sections
  - Show the benefits of ray data (can include contents in the blog post)
Use Framework-native Data Utilities
- Quickstart
  - Code snippets for Torch,Lighting, and HF
Use Ray Data
- Quickstart (add code snippets for each subsection)
  - Loading data
  - Inputting and splitting data
  - …
- Remaining Ray Data Usage guide (Splitting Dataset, ..., Performance tips)

Screenshot 2024-06-05 at 4 08 57 PM

Main considerations:

Previously there are too many first level sections. The new proposal aims to make the user guide layout super clear -> Choose "Framework-native utilities" or Choose "Ray Data".
Do not mention Framework native utilities in the Ray Data section, and do not mention Ray Data in the Framework native utilities section.

Any thoughts? @matthewdeng @justinvyu @hongpeng-guo

Contributor

hongpeng-guo commented Jun 5, 2024

Discussed with Angelina and below is a proposal of the user guide:

Introduction (don’t need a title)

Layout the structure of the doc

Comparing framework-native utilities v.s. Ray Data, and cuide user to choose one or the other.

Add quick link to the corresponding for "Use Framework-native .." and "Use Ray Data" sections

Show the benefits of ray data (can include contents in the blog post)

Use Framework-native Data Utilities

Quickstart

Code snippets for Torch,Lighting, and HF

Use Ray Data

Quickstart (add code snippets for each subsection)

Loading data

Inputting and splitting data

…

Remaining Ray Data Usage guide (Splitting Dataset, ..., Performance tips)

Main considerations:

Previously there are too many first level sections. The new proposal aims to make the user guide layout super clear -> Choose "Framework-native utilities" or Choose "Ray Data".

Do not mention Framework native utilities in the Ray Data section, and do not mention Ray Data in the Framework native utilities section.

Any thoughts? @matthewdeng @justinvyu @hongpeng-guo

I think the new proposal is very clear! Just one question whether we should put "Framework-native utilities` first, or "Ray data" first in this doc page.

Contributor

matthewdeng commented Jun 6, 2024

My preference would be to put Ray Data before the native ones. Though I do think it would be good either way to add a concise tip at the start of each section to point the user to the other section.

woshiyyya requested a review from justinvyu

June 7, 2024 22:46

Member Author

woshiyyya commented Jun 7, 2024

@matthewdeng @hongpeng-guo ok, let's put "Ray Data" at first before the native ones, and cross link each other at the beginning of their section. I'll do it in the following PR.

woshiyyya added 2 commits

June 11, 2024 18:11

fix

91f0474

Signed-off-by: yunxuanx <[email protected]>


          update

6d2b097

Signed-off-by: yunxuanx <[email protected]>

justinvyu approved these changes

View reviewed changes

Contributor

justinvyu left a comment

Let's merge this one first!

justinvyu enabled auto-merge (squash)

June 25, 2024 21:44

github-actions bot added the go label

justinvyu merged commit 9ba6656 into ray-project:master

8 checks passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

matthewdeng matthewdeng left review comments

justinvyu justinvyu approved these changes

hongpeng-guo Awaiting requested review from hongpeng-guo hongpeng-guo is a code owner

raulchen Awaiting requested review from raulchen raulchen is a code owner

Labels

go