Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] [Docs] Consolidate shuffling-related information into Shuffling Data page #44098

Merged
merged 6 commits into from
Mar 20, 2024

Conversation

scottjlee
Copy link
Contributor

@scottjlee scottjlee commented Mar 18, 2024

Why are these changes needed?

Consolidate shuffling-related information spread out across Ray Data docs into a new Shuffling Data page.

New docs page: https://anyscale-ray--44098.com.readthedocs.build/en/44098/data/shuffling-data.html

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
@@ -162,7 +161,7 @@ program might run out of memory. If you encounter an out-of-memory error, decrea
.. _stateful_transforms:

Stateful Transforms
==============================
===================
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated to rest of PR, but fix the title underline.

Copy link
Contributor

@omatthew98 omatthew98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments, I think mostly addressing existing docs you copied over. Lgtm otherwise.

Shuffle the ordering of files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To randomly shuffle the ordering of input files before reading, call a function like
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe change to something like "call a read function that supports shuffling e.g. call read images...". Seems a little unclear what "function like read_images` actually means?

Local shuffle when iterating over batches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To locally shuffle a subset of rows, call a function like :meth:`~ray.data.Dataset.iter_batches`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again maybe be more descriptive than function like. For this case, if there are few enough options (3ish?), maybe just list them.

To locally shuffle a subset of rows, call a function like :meth:`~ray.data.Dataset.iter_batches`
and specify `local_shuffle_buffer_size`. This shuffles the rows up to a provided buffer
size during iteration. See more details in
:ref:`Iterating over batches with shuffling <iterating-over-batches-with-shuffling>`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move that information to this page and then have a small reference on that page to this broader shuffle page?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i felt that it was important to keep this information in the iteration page as well, since it can be a pretty core part of iter_batch-like methods for ML training. and there's greater detail about each iter_batch method for torch/tf, which seems out of place to put in this shuffle page. but if others feel the same, we can move it here


.. _optimizing_shuffles:

Advanced: Optimizing shuffles
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might just be a formatting thing, but should this be a subheading or a top level heading? Actually seems like all of the subsections are subsections of "Types of shuffling", is that intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, made this a new subheading outside of Types of shuffling, and titles below are subtitled under the new Advanced: Optimizing shuffles section.

Advanced: Optimizing shuffles
-----------------------------

Shuffle operations are *all-to-all* operations where the entire Dataset must be materialized in memory before execution can proceed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section seems to not be super relevant to shuffling? Is the idea that these optimization might also apply to other all-to-all operations? The "these" in the below line is also unclear. I would have thought it was talking about shuffle operations but think it is talking about all to all operations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move some of this to the "Enabling push-based shuffle" below which seems related?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 seemed slightly out of place to me. Wonder if we should just remove this section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this section (but kept the note), since the content is also discussed under the Enabling push-based shuffle subsection.

Comment on lines 118 to 119
randomness of the training data. Based on a
`theoretical foundation <https://arxiv.org/abs/1709.10432>`__ all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
randomness of the training data. Based on a
`theoretical foundation <https://arxiv.org/abs/1709.10432>`__ all
randomness of the training data. Based on a
`theoretical foundation <https://arxiv.org/abs/1709.10432>`__ , all


Some Dataset operations require a *shuffle* operation, meaning that data is shuffled from all of the input partitions to all of the output partitions.
These operations include :meth:`Dataset.random_shuffle <ray.data.Dataset.random_shuffle>`,
:meth:`Dataset.sort <ray.data.Dataset.sort>` and :meth:`Dataset.groupby <ray.data.Dataset.groupby>`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not super intuitive why these operations require a shuffle, if possible maybe add a quick sentence explaining?

Some Dataset operations require a *shuffle* operation, meaning that data is shuffled from all of the input partitions to all of the output partitions.
These operations include :meth:`Dataset.random_shuffle <ray.data.Dataset.random_shuffle>`,
:meth:`Dataset.sort <ray.data.Dataset.sort>` and :meth:`Dataset.groupby <ray.data.Dataset.groupby>`.
Shuffle can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Shuffle can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory.
Shuffling can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory.

:meth:`Dataset.sort <ray.data.Dataset.sort>` and :meth:`Dataset.groupby <ray.data.Dataset.groupby>`.
Shuffle can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory.

Datasets provides an alternative shuffle implementation known as push-based shuffle for improving large-scale performance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this is outdated verbage?

Suggested change
Datasets provides an alternative shuffle implementation known as push-based shuffle for improving large-scale performance.
Ray Data provides an alternative shuffle implementation known as push-based shuffle for improving large-scale performance.

Comment on lines 71 to 74
If you observe reduced throughput when using ``local_shuffle_buffer_size``;
one way to diagnose this is to check the total time spent in batch creation by
examining the ``ds.stats()`` output (``In batch formatting``, under
``Batch iteration time breakdown``).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If you observe reduced throughput when using ``local_shuffle_buffer_size``;
one way to diagnose this is to check the total time spent in batch creation by
examining the ``ds.stats()`` output (``In batch formatting``, under
``Batch iteration time breakdown``).
If you observe reduced throughput when using ``local_shuffle_buffer_size``,
check the total time spent in batch creation by
examining the ``ds.stats()`` output (``In batch formatting``, under
``Batch iteration time breakdown``).

Comment on lines 77 to 78
time spent in other steps, one way to improve performance is to decrease
``local_shuffle_buffer_size`` or turn off the local shuffle buffer altogether and only :ref:`shuffle the ordering of files <shuffling_file_order>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
time spent in other steps, one way to improve performance is to decrease
``local_shuffle_buffer_size`` or turn off the local shuffle buffer altogether and only :ref:`shuffle the ordering of files <shuffling_file_order>`.
time spent in other steps, decrease
``local_shuffle_buffer_size`` or turn off the local shuffle buffer altogether and only :ref:`shuffle the ordering of files <shuffling_file_order>`.

Advanced: Optimizing shuffles
-----------------------------

Shuffle operations are *all-to-all* operations where the entire Dataset must be materialized in memory before execution can proceed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 seemed slightly out of place to me. Wonder if we should just remove this section?

shuffle="files",
)

.. tip::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't really feel like a tip to me. IMO, it might be better to make this regular text here and in the other sections. In general, I think we should try to use admonitions sparingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved out of tip and into regular text

Comment on lines 74 to 76
``Batch iteration time breakdown``).

If this time is significantly larger than the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like these sentences should be part of the same paragraph?

Suggested change
``Batch iteration time breakdown``).
If this time is significantly larger than the
``Batch iteration time breakdown``). If this time is significantly larger than the

@bveeramani bveeramani merged commit 02a235d into ray-project:master Mar 20, 2024
5 checks passed
scottjlee added a commit to scottjlee/ray that referenced this pull request Mar 20, 2024
…ng Data` page (ray-project#44098)

Consolidate shuffling-related information spread out across Ray Data docs into a new Shuffling Data page.

Signed-off-by: Scott Lee <[email protected]>
@scottjlee scottjlee mentioned this pull request Mar 20, 2024
8 tasks
khluu pushed a commit that referenced this pull request Mar 20, 2024
…ng Data` page (#44098) (#44171)

Cherry-pick #44098. Docs-only change.

Consolidate shuffling-related information spread out across Ray Data docs into a new Shuffling Data page.

Signed-off-by: Scott Lee <[email protected]>
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
…ng Data` page (ray-project#44098)

Consolidate shuffling-related information spread out across Ray Data docs into a new Shuffling Data page.

Signed-off-by: Scott Lee <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants