
[Datasets] [Local Shuffle - 1/N] Add local shuffling option. #26094

Merged

Conversation

@clarkzinzow clarkzinzow commented Jun 24, 2022

This PR adds a local shuffling option to ds.iter_batches(), a lightweight alternative to the global ds.random_shuffle() that randomly shuffles data using a local in-memory shuffle buffer and yields shuffled batches.

Not all training datasets/models benefit from high-quality global or pseudo-global (windowed) shuffles; in those cases, users still want to cheaply decorrelate samples to a small degree. This local shuffle option (optionally coupled with block randomization via ds.randomize_block_order()) yields a high-throughput, in-iterator shuffling option.

API Usage

ds = ray.data.range(10000)

for batch in ds.iter_batches(batch_size=100, local_shuffle_buffer_size=10000):
    print(batch)
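Conceptually, the buffer accumulates rows until it reaches the configured size, then draws each batch at random from it. The following pure-Python sketch models that behavior; it is a simplified illustration, not Ray's implementation, and the `local_shuffle` function name is made up for this example:

```python
import random

def local_shuffle(rows, batch_size, buffer_size, seed=None):
    """Yield batches of up to `batch_size` rows via a bounded shuffle buffer.

    Rows are accumulated until the buffer holds at least `buffer_size`
    rows; each batch is then drawn at random from the buffer. Remaining
    rows are drained as (possibly short) final batches.
    """
    rng = random.Random(seed)
    buffer = []
    rows = iter(rows)
    exhausted = False
    while True:
        # Fill the buffer up to the shuffle threshold.
        while not exhausted and len(buffer) < buffer_size:
            try:
                buffer.append(next(rows))
            except StopIteration:
                exhausted = True
        if not buffer:
            break
        # Draw a random batch (truncated if the buffer is nearly drained).
        n = min(batch_size, len(buffer))
        yield [buffer.pop(rng.randrange(len(buffer))) for _ in range(n)]

batches = list(local_shuffle(range(10), batch_size=3, buffer_size=5, seed=0))
```

Note that this is a permutation, not a global shuffle: every row is yielded exactly once, but rows can only move within the reach of the buffer.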

TODOs

  • Move to local_shuffle_buffer_size API.
  • Feature guide updates.
  • Add to to_torch() and to_tf() APIs.
  • Preliminary benchmarks.
  • In follow-up PR(s), we will look at adding an option for pushing this local shuffling to a background worker (thread, actor, etc.).

Related issue number

Closes #24159, closes #18297

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

with stats.iter_get_s.timer():
    block = ray.get(block)
# NOTE: Since we add one block at a time and then immediately consume
# batches, we don't check batcher.can_add() before adding the block.
clarkzinzow (Contributor, Author):

batcher.can_add() will be used once we're adding multiple blocks before attempting to consume batches, e.g. if prefetching and shuffling were connected via a queue. We may go this route in a future PR, but with the current implementation, we're guaranteed that the block can be added.

We could add an assert batcher.can_add(block) here, in the calling code, in order to more strongly document this guarantee in addition to the comment, but that assertion is already done within batcher.add(block), so I left it out and opted for the comment.
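The guarantee being discussed can be illustrated with a toy stand-in for the batcher (the `MiniBatcher` class below is hypothetical, not Ray's `Batcher`): the assertion lives inside `add()`, so a caller that drains full batches after every add never trips it.

```python
class MiniBatcher:
    """Toy batcher: accumulates rows and emits fixed-size batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []

    def can_add(self):
        # Only safe to add when fewer than batch_size rows are pending.
        return len(self.pending) < self.batch_size

    def add(self, block):
        # The guarantee is asserted here, so callers need not re-check it.
        assert self.can_add()
        self.pending.extend(block)

    def has_batch(self):
        return len(self.pending) >= self.batch_size

    def next_batch(self):
        batch = self.pending[:self.batch_size]
        self.pending = self.pending[self.batch_size:]
        return batch

batcher = MiniBatcher(batch_size=2)
out = []
for block in [[1, 2], [3, 4], [5]]:
    batcher.add(block)          # one block at a time...
    while batcher.has_batch():  # ...then immediately drain full batches
        out.append(batcher.next_batch())
```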

# While the batcher has full batches, yield batches.
while batcher.has_batch():
    with stats.iter_next_batch_s.timer():
clarkzinzow (Contributor, Author):

Added a new iterator stage timer, since next_batch() can potentially be a bit expensive (building the shuffle buffer, generating random indices, etc.).
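The stage-timer pattern used by these stats can be modeled with a small context-manager helper. This is a sketch of the general technique; Ray's `DatasetStats` timers are more elaborate, and `StageTimer` is an illustrative name:

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Accumulates total wall-clock time spent in a named iterator stage."""

    def __init__(self, name):
        self.name = name
        self.total_s = 0.0
        self.calls = 0

    @contextmanager
    def timer(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.total_s += time.perf_counter() - start
            self.calls += 1

iter_next_batch_s = StageTimer("iter_next_batch_s")
for _ in range(3):
    with iter_next_batch_s.timer():
        sum(range(1000))  # stand-in for building the next batch
```

Because timing happens in `finally`, the stage's duration is recorded even if batch construction raises.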

@ericl ericl assigned krfricke and unassigned ericl Jun 29, 2022
@ericl (Contributor)
ericl commented Jun 29, 2022

@krfricke , could you also take a look at this with Matt and Jian?

@krfricke krfricke self-requested a review June 29, 2022 22:29
@krfricke (Contributor) left a comment:

Left a few comments

@clarkzinzow clarkzinzow force-pushed the datasets/feat/local-shuffle branch 2 times, most recently from 44faf67 to 3b764ef Compare July 2, 2022 03:02
@krfricke (Contributor) left a comment:

Thanks for the changes - this looks good to me!

@clarkzinzow (Contributor, Author)
Thanks for the review @krfricke!

@matthewdeng @jianoaix with one approving review, I'll go ahead with the feature guide updates and benchmarking.


from ray.data.block import Block, BlockAccessor
from ray.data._internal.delegating_block_builder import DelegatingBlockBuilder


class Batcher:
    """Chunks blocks into batches."""


class BatcherInterface:
Contributor:

Can you add a test_batcher.py covering code here? We now have added complexities.
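A test along these lines would check the core invariant of a shuffling batcher: every row is yielded exactly once and the batch size is respected. The sketch below exercises a stand-in shuffle-buffer loop rather than the real classes in `batcher.py`, whose interfaces may differ:

```python
import random

def test_shuffle_preserves_rows():
    """Shuffled batching must be a permutation: no rows lost or duplicated."""
    rows = list(range(100))
    rng = random.Random(42)
    buffer, batches = [], []
    for row in rows:
        buffer.append(row)
        while len(buffer) >= 10:  # stand-in for the shuffle buffer min size
            batches.append(
                [buffer.pop(rng.randrange(len(buffer))) for _ in range(4)])
    while buffer:  # drain the remainder as (possibly short) final batches
        n = min(4, len(buffer))
        batches.append(
            [buffer.pop(rng.randrange(len(buffer))) for _ in range(n)])
    flat = [x for b in batches for x in b]
    assert sorted(flat) == rows
    assert all(len(b) <= 4 for b in batches)
    return batches

batches = test_shuffle_preserves_rows()
```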

@clarkzinzow (Contributor, Author)
clarkzinzow commented Jul 15, 2022

@matthewdeng @ericl @jianoaix @krfricke The API has been updated to the local_shuffle_buffer_size API. If I can get a high-level 👍 on the API, I can start on the feature guide. In the meantime, I'm going to look at adding more test coverage for the batchers.

@jianoaix (Contributor)
Looks good to me, thanks for your patience and the update!

Comment on lines +2466 to +2468
random but will be faster and less resource-intensive. This buffer size
must be greater than or equal to ``batch_size``, and therefore
``batch_size`` must also be specified when using local shuffling.
Contributor:

Out of curiosity, is this a requirement for the implementation to work, or is it imposed because a buffer size smaller than the batch size results in close to zero randomness?

clarkzinzow (Contributor, Author):

We discussed this a bit here: #26094 (comment)

It's required for the current implementation to work, but we could have a much simpler shuffling algorithm for the unbatched case. To keep the PR small and given that we're not aware of any use cases for unbatched local shuffling, we've been treating that as a P1 for a follow-up PR.
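The documented constraint can be made concrete with a small validation sketch. The parameter names mirror the PR's API, but `validate_local_shuffle_args` is an illustrative function, not Ray's actual validation code:

```python
def validate_local_shuffle_args(batch_size, local_shuffle_buffer_size):
    """Check the documented constraints on local shuffling arguments."""
    if local_shuffle_buffer_size is not None:
        if batch_size is None:
            raise ValueError(
                "batch_size must be specified when using local shuffling.")
        if local_shuffle_buffer_size < batch_size:
            raise ValueError(
                "local_shuffle_buffer_size must be >= batch_size, got "
                f"{local_shuffle_buffer_size} < {batch_size}.")

# Matches the API usage above: buffer of 10000 rows, batches of 100.
validate_local_shuffle_args(batch_size=100, local_shuffle_buffer_size=10000)
```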

@clarkzinzow (Contributor, Author)
@matthewdeng @jianoaix @ericl PR is updated with a feature guide and some focused test coverage for the batcher, PTAL!

local_shuffle_buffer_size: If non-None, the data will be randomly shuffled
using a local in-memory shuffle buffer, and this value will serve as the
minimum number of rows that must be in the local in-memory shuffle
buffer in order to yield a batch. This is a light-weight alternative to
Contributor:

Can you mention how the last remainder rows are handled? When there are fewer than local_shuffle_buffer_size rows left, we should let users know whether they should expect batches to be yielded from them.

Contributor:

Added a sentence, but the wording feels a bit off... let me know what you think!

assert self._shuffle_buffer is not None
buffer_size = BlockAccessor.for_block(self._shuffle_buffer).num_rows()
# Truncate the batch to the buffer size, if necessary.
batch_size = min(self._batch_size, buffer_size)
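The truncation behavior in the snippet above can be demonstrated in isolation. This is a sketch of the `min(batch_size, buffer_size)` drain behavior; `drain_buffer` is an illustrative helper, not a function from the PR:

```python
import random

def drain_buffer(buffer, batch_size, seed=0):
    """Drain a shuffle buffer completely, truncating the last batch."""
    rng = random.Random(seed)
    while buffer:
        # Truncate the batch to the buffer size, if necessary.
        n = min(batch_size, len(buffer))
        yield [buffer.pop(rng.randrange(len(buffer))) for _ in range(n)]

# 7 rows with batch_size=3 drain as batches of sizes 3, 3, and then 1.
sizes = [len(b) for b in drain_buffer(list(range(7)), batch_size=3)]
```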
Contributor:

So it looks like we do continue to yield batches when the number of rows in the buffer drops below shuffle_buffer_min_size. Can you adjust the parameter documentation above to reflect this?

Contributor:

Same as above.

matthewdeng and others added 2 commits July 16, 2022 18:26
@ericl (Contributor)

ericl commented Jul 17, 2022

Test failures.

@richardliaw richardliaw merged commit 864af14 into ray-project:master Jul 17, 2022
jianoaix pushed a commit to jianoaix/ray that referenced this pull request Jul 18, 2022
…ject#26094)

Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: Matthew Deng <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Signed-off-by: Ubuntu <[email protected]>
xwjiang2010 pushed a commit to xwjiang2010/ray that referenced this pull request Jul 19, 2022
…ject#26094)

Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: Matthew Deng <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Signed-off-by: Xiaowei Jiang <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…ject#26094)

Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: Matthew Deng <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Signed-off-by: Stefan van der Kleij <[email protected]>
@amogkam amogkam mentioned this pull request Oct 28, 2022
Successfully merging this pull request may close these issues.

  • [Datasets] Support local shuffle via in-memory buffer
  • [datasets] Requirements for my use case