[Data] [Docs] Consolidate shuffling-related information into `Shuffling Data` page #44098
Signed-off-by: Scott Lee <[email protected]>
@@ -162,7 +161,7 @@ program might run out of memory. If you encounter an out-of-memory error, decrea
.. _stateful_transforms:

Stateful Transforms
-==============================
+===================
unrelated to rest of PR, but fix the title underline.
Few comments, I think mostly addressing existing docs you copied over. Lgtm otherwise.
doc/source/data/shuffling-data.rst
Outdated
Shuffle the ordering of files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To randomly shuffle the ordering of input files before reading, call a function like
Maybe change to something like "call a read function that supports shuffling, e.g. call read images...". Seems a little unclear what "function like `read_images`" actually means?
doc/source/data/shuffling-data.rst
Outdated
Local shuffle when iterating over batches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To locally shuffle a subset of rows, call a function like :meth:`~ray.data.Dataset.iter_batches`
Again, maybe be more descriptive than "function like". For this case, if there are few enough options (3ish?), maybe just list them.
To locally shuffle a subset of rows, call a function like :meth:`~ray.data.Dataset.iter_batches`
and specify `local_shuffle_buffer_size`. This shuffles the rows up to a provided buffer
size during iteration. See more details in
:ref:`Iterating over batches with shuffling <iterating-over-batches-with-shuffling>`.
Should we move that information to this page and then have a small reference on that page to this broader shuffle page?
I felt that it was important to keep this information in the iteration page as well, since it can be a pretty core part of iter_batch-like methods for ML training, and there's greater detail about each iter_batch method for torch/tf, which seems out of place to put in this shuffle page. But if others feel the same, we can move it here.
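For readers following the `local_shuffle_buffer_size` discussion, here is a minimal standard-library sketch of what a bounded local shuffle buffer does during batch iteration. It is an illustration only, not Ray Data's implementation: the function name `iter_batches_local_shuffle` and its exact buffering policy are assumptions made for this example; only the parameter name `local_shuffle_buffer_size` comes from the docs quoted above.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def iter_batches_local_shuffle(
    rows: Iterable[T],
    batch_size: int,
    local_shuffle_buffer_size: int,
    seed: Optional[int] = None,
) -> Iterator[List[T]]:
    """Yield batches whose rows are drawn at random from a bounded buffer.

    Hypothetical helper: a simplified model of local shuffling, not the
    Ray Data code path.
    """
    rng = random.Random(seed)
    buffer: List[T] = []
    batch: List[T] = []

    for row in rows:
        buffer.append(row)
        # Once the buffer is full, emit one randomly chosen row for each
        # incoming row, so the buffer never grows past its configured size.
        if len(buffer) >= local_shuffle_buffer_size:
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            batch.append(buffer.pop())
            if len(batch) == batch_size:
                yield batch
                batch = []

    # The input is exhausted: drain whatever is left, in random order.
    rng.shuffle(buffer)
    for row in buffer:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

In this model a larger buffer gives more randomness at the cost of more memory and per-row work, which is the trade-off the throughput discussion later in the thread refers to. A buffer size of 1 degenerates to no shuffling at all.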
.. _optimizing_shuffles:

Advanced: Optimizing shuffles
Might just be a formatting thing, but should this be a subheading or a top level heading? Actually seems like all of the subsections are subsections of "Types of shuffling", is that intentional?
fixed, made this a new subheading outside of `Types of shuffling`, and titles below are subtitled under the new `Advanced: Optimizing shuffles` section.
doc/source/data/shuffling-data.rst
Outdated
Advanced: Optimizing shuffles
-----------------------------

Shuffle operations are *all-to-all* operations where the entire Dataset must be materialized in memory before execution can proceed.
This section seems to not be super relevant to shuffling? Is the idea that these optimizations might also apply to other all-to-all operations? The "these" in the below line is also unclear. I would have thought it was talking about shuffle operations, but I think it is talking about all-to-all operations?
Maybe move some of this to the "Enabling push-based shuffle" below which seems related?
+1 seemed slightly out of place to me. Wonder if we should just remove this section?
removed this section (but kept the note), since the content is also discussed under the `Enabling push-based shuffle` subsection.
doc/source/data/shuffling-data.rst
Outdated
randomness of the training data. Based on a
`theoretical foundation <https://arxiv.org/abs/1709.10432>`__ all
-randomness of the training data. Based on a
-`theoretical foundation <https://arxiv.org/abs/1709.10432>`__ all
+randomness of the training data. Based on a
+`theoretical foundation <https://arxiv.org/abs/1709.10432>`__ , all
doc/source/data/shuffling-data.rst
Outdated
Some Dataset operations require a *shuffle* operation, meaning that data is shuffled from all of the input partitions to all of the output partitions.
These operations include :meth:`Dataset.random_shuffle <ray.data.Dataset.random_shuffle>`,
:meth:`Dataset.sort <ray.data.Dataset.sort>` and :meth:`Dataset.groupby <ray.data.Dataset.groupby>`.
It's not super intuitive why these operations require a shuffle, if possible maybe add a quick sentence explaining?
doc/source/data/shuffling-data.rst
Outdated
Some Dataset operations require a *shuffle* operation, meaning that data is shuffled from all of the input partitions to all of the output partitions.
These operations include :meth:`Dataset.random_shuffle <ray.data.Dataset.random_shuffle>`,
:meth:`Dataset.sort <ray.data.Dataset.sort>` and :meth:`Dataset.groupby <ray.data.Dataset.groupby>`.
Shuffle can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory.
-Shuffle can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory.
+Shuffling can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory.
doc/source/data/shuffling-data.rst
Outdated
:meth:`Dataset.sort <ray.data.Dataset.sort>` and :meth:`Dataset.groupby <ray.data.Dataset.groupby>`.
Shuffle can be challenging to scale to large data sizes and clusters, especially when the total dataset size can't fit into memory.

Datasets provides an alternative shuffle implementation known as push-based shuffle for improving large-scale performance.
Think this is outdated verbiage?
-Datasets provides an alternative shuffle implementation known as push-based shuffle for improving large-scale performance.
+Ray Data provides an alternative shuffle implementation known as push-based shuffle for improving large-scale performance.
doc/source/data/shuffling-data.rst
Outdated
If you observe reduced throughput when using ``local_shuffle_buffer_size``;
one way to diagnose this is to check the total time spent in batch creation by
examining the ``ds.stats()`` output (``In batch formatting``, under
``Batch iteration time breakdown``).
-If you observe reduced throughput when using ``local_shuffle_buffer_size``;
-one way to diagnose this is to check the total time spent in batch creation by
-examining the ``ds.stats()`` output (``In batch formatting``, under
-``Batch iteration time breakdown``).
+If you observe reduced throughput when using ``local_shuffle_buffer_size``,
+check the total time spent in batch creation by
+examining the ``ds.stats()`` output (``In batch formatting``, under
+``Batch iteration time breakdown``).
doc/source/data/shuffling-data.rst
Outdated
time spent in other steps, one way to improve performance is to decrease
``local_shuffle_buffer_size`` or turn off the local shuffle buffer altogether and only :ref:`shuffle the ordering of files <shuffling_file_order>`.
-time spent in other steps, one way to improve performance is to decrease
-``local_shuffle_buffer_size`` or turn off the local shuffle buffer altogether and only :ref:`shuffle the ordering of files <shuffling_file_order>`.
+time spent in other steps, decrease
+``local_shuffle_buffer_size`` or turn off the local shuffle buffer altogether and only :ref:`shuffle the ordering of files <shuffling_file_order>`.
doc/source/data/shuffling-data.rst
Outdated
shuffle="files",
)

.. tip::
This didn't really feel like a tip to me. IMO, it might be better to make this regular text here and in the other sections. In general, I think we should try to use admonitions sparingly.
moved out of tip and into regular text
doc/source/data/shuffling-data.rst
Outdated
``Batch iteration time breakdown``).

If this time is significantly larger than the
Seems like these sentences should be part of the same paragraph?
-``Batch iteration time breakdown``).
-If this time is significantly larger than the
+``Batch iteration time breakdown``). If this time is significantly larger than the
…ng Data` page (ray-project#44098) Consolidate shuffling-related information spread out across Ray Data docs into a new Shuffling Data page. Signed-off-by: Scott Lee <[email protected]>
…ng Data` page (#44098) (#44171) Cherry-pick #44098. Docs-only change. Consolidate shuffling-related information spread out across Ray Data docs into a new Shuffling Data page. Signed-off-by: Scott Lee <[email protected]>
Why are these changes needed?
Consolidate shuffling-related information spread out across Ray Data docs into a new `Shuffling Data` page.
New docs page: https://anyscale-ray--44098.com.readthedocs.build/en/44098/data/shuffling-data.html
Related issue number
Checks
git commit -s) in this PR.
scripts/format.sh to lint the changes in this PR.
method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.