
[Datasets] [Local Shuffle - 1/N] Add local shuffling option. #26094

Merged

Conversation

@clarkzinzow clarkzinzow commented Jun 24, 2022

This PR adds a local shuffling option to ds.iter_batches(), a lightweight alternative to the global ds.random_shuffle() that randomly shuffles data using a local in-memory shuffle buffer and yields shuffled batches.

Not all training datasets/models benefit from high-quality global or pseudo-global (windowed) shuffles; in those cases, users still want to cheaply decorrelate samples to a small degree. This local shuffle option (optionally coupled with block randomization via ds.randomize_block_order()) yields a high-throughput, in-iterator shuffling option.

API Usage

ds = ray.data.range(10000)

for batch in ds.iter_batches(batch_size=100, local_shuffle_buffer_size=10000):
    print(batch)
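Conceptually, the buffer accumulates rows until it reaches the configured size, then draws each batch at random from it. The following pure-Python sketch models that behavior; it is a simplified illustration, not Ray's implementation, and the `local_shuffle` function name is made up for this example:

```python
import random

def local_shuffle(rows, batch_size, buffer_size, seed=None):
    """Yield batches of up to `batch_size` rows via a bounded shuffle buffer.

    Rows are accumulated until the buffer holds at least `buffer_size`
    rows; each batch is then drawn at random from the buffer. Remaining
    rows are drained as (possibly short) final batches.
    """
    rng = random.Random(seed)
    buffer = []
    rows = iter(rows)
    exhausted = False
    while True:
        # Fill the buffer up to the shuffle threshold.
        while not exhausted and len(buffer) < buffer_size:
            try:
                buffer.append(next(rows))
            except StopIteration:
                exhausted = True
        if not buffer:
            break
        # Draw a random batch (truncated if the buffer is nearly drained).
        n = min(batch_size, len(buffer))
        yield [buffer.pop(rng.randrange(len(buffer))) for _ in range(n)]

batches = list(local_shuffle(range(10), batch_size=3, buffer_size=5, seed=0))
```

Note that this is a permutation, not a global shuffle: every row is yielded exactly once, but rows can only move within the reach of the buffer.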

TODOs

  • Move to local_shuffle_buffer_size API.
  • Feature guide updates.
  • Add to to_torch() and to_tf() APIs.
  • Preliminary benchmarks.
  • In follow-up PR(s), we will look at adding an option for pushing this local shuffling to a background worker (thread, actor, etc.).

Related issue number

Closes #24159, closes #18297

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

with stats.iter_get_s.timer():
    block = ray.get(block)
# NOTE: Since we add one block at a time and then immediately consume
# batches, we don't check batcher.can_add() before adding the block.
clarkzinzow (Contributor, Author):

batcher.can_add() will be used once we're adding multiple blocks before attempting to consume batches, e.g. if prefetching and shuffling were connected via a queue. We may go this route in a future PR, but with the current implementation, we're guaranteed that the block can be added.

We could add an assert batcher.can_add(block) here, in the calling code, in order to more strongly document this guarantee in addition to the comment, but that assertion is already done within batcher.add(block), so I left it out and opted for the comment.
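The guarantee being discussed can be illustrated with a toy stand-in for the batcher (the `MiniBatcher` class below is hypothetical, not Ray's `Batcher`): the assertion lives inside `add()`, so a caller that drains full batches after every add never trips it.

```python
class MiniBatcher:
    """Toy batcher: accumulates rows and emits fixed-size batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []

    def can_add(self):
        # Only safe to add when fewer than batch_size rows are pending.
        return len(self.pending) < self.batch_size

    def add(self, block):
        # The guarantee is asserted here, so callers need not re-check it.
        assert self.can_add()
        self.pending.extend(block)

    def has_batch(self):
        return len(self.pending) >= self.batch_size

    def next_batch(self):
        batch = self.pending[:self.batch_size]
        self.pending = self.pending[self.batch_size:]
        return batch

batcher = MiniBatcher(batch_size=2)
out = []
for block in [[1, 2], [3, 4], [5]]:
    batcher.add(block)          # one block at a time...
    while batcher.has_batch():  # ...then immediately drain full batches
        out.append(batcher.next_batch())
```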

# While the batcher has full batches, yield batches.
while batcher.has_batch():
    with stats.iter_next_batch_s.timer():
clarkzinzow (Contributor, Author):

Added a new iterator stage timer, since next_batch() can potentially be a bit expensive (building the shuffle buffer, generating random indices, etc.).
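The stage-timer pattern used by these stats can be modeled with a small context-manager helper. This is a sketch of the general technique; Ray's `DatasetStats` timers are more elaborate, and `StageTimer` is an illustrative name:

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Accumulates total wall-clock time spent in a named iterator stage."""

    def __init__(self, name):
        self.name = name
        self.total_s = 0.0
        self.calls = 0

    @contextmanager
    def timer(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.total_s += time.perf_counter() - start
            self.calls += 1

iter_next_batch_s = StageTimer("iter_next_batch_s")
for _ in range(3):
    with iter_next_batch_s.timer():
        sum(range(1000))  # stand-in for building the next batch
```

Because timing happens in `finally`, the stage's duration is recorded even if batch construction raises.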

@ericl ericl assigned krfricke and unassigned ericl Jun 29, 2022
@ericl (Contributor)
ericl commented Jun 29, 2022

@krfricke , could you also take a look at this with Matt and Jian?

@krfricke krfricke self-requested a review June 29, 2022 22:29
@krfricke (Contributor) left a comment:

Left a few comments

@clarkzinzow clarkzinzow force-pushed the datasets/feat/local-shuffle branch 2 times, most recently from 44faf67 to 3b764ef Compare July 2, 2022 03:02
@krfricke (Contributor) left a comment:

Thanks for the changes - this looks good to me!

@clarkzinzow (Contributor, Author)
Thanks for the review @krfricke!

@matthewdeng @jianoaix with one approving review, I'll go ahead with the feature guide updates and benchmarking.


from ray.data.block import Block, BlockAccessor
from ray.data._internal.delegating_block_builder import DelegatingBlockBuilder


class Batcher:
    """Chunks blocks into batches."""


class BatcherInterface:
Contributor:

Can you add a test_batcher.py covering code here? We now have added complexities.
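A test along these lines would check the core invariant of a shuffling batcher: every row is yielded exactly once and the batch size is respected. The sketch below exercises a stand-in shuffle-buffer loop rather than the real classes in `batcher.py`, whose interfaces may differ:

```python
import random

def test_shuffle_preserves_rows():
    """Shuffled batching must be a permutation: no rows lost or duplicated."""
    rows = list(range(100))
    rng = random.Random(42)
    buffer, batches = [], []
    for row in rows:
        buffer.append(row)
        while len(buffer) >= 10:  # stand-in for the shuffle buffer min size
            batches.append(
                [buffer.pop(rng.randrange(len(buffer))) for _ in range(4)])
    while buffer:  # drain the remainder as (possibly short) final batches
        n = min(4, len(buffer))
        batches.append(
            [buffer.pop(rng.randrange(len(buffer))) for _ in range(n)])
    flat = [x for b in batches for x in b]
    assert sorted(flat) == rows
    assert all(len(b) <= 4 for b in batches)
    return batches

batches = test_shuffle_preserves_rows()
```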

@clarkzinzow (Contributor, Author)
clarkzinzow commented Jul 15, 2022

@matthewdeng @ericl @jianoaix @krfricke The API has been updated to the local_shuffle_buffer_size API. If I can get a high-level 👍 on the API, I can start on the feature guide. In the meantime, I'm going to look at adding more test coverage for the batchers.

@jianoaix (Contributor)
Looks good to me, thanks for your patience and the update!

Comment on lines +2466 to +2468
random but will be faster and less resource-intensive. This buffer size
must be greater than or equal to ``batch_size``, and therefore
``batch_size`` must also be specified when using local shuffling.
Contributor:

Out of curiosity, is this a requirement for the implementation to work, or is it imposed because a buffer size smaller than the batch size results in close to zero randomness?

clarkzinzow (Contributor, Author):

We discussed this a bit here: #26094 (comment)

It's required for the current implementation to work, but we could have a much simpler shuffling algorithm for the unbatched case. To keep the PR small and given that we're not aware of any use cases for unbatched local shuffling, we've been treating that as a P1 for a follow-up PR.
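The documented constraint can be made concrete with a small validation sketch. The parameter names mirror the PR's API, but `validate_local_shuffle_args` is an illustrative function, not Ray's actual validation code:

```python
def validate_local_shuffle_args(batch_size, local_shuffle_buffer_size):
    """Check the documented constraints on local shuffling arguments."""
    if local_shuffle_buffer_size is not None:
        if batch_size is None:
            raise ValueError(
                "batch_size must be specified when using local shuffling.")
        if local_shuffle_buffer_size < batch_size:
            raise ValueError(
                "local_shuffle_buffer_size must be >= batch_size, got "
                f"{local_shuffle_buffer_size} < {batch_size}.")

# Matches the API usage above: buffer of 10000 rows, batches of 100.
validate_local_shuffle_args(batch_size=100, local_shuffle_buffer_size=10000)
```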

@clarkzinzow (Contributor, Author)
@matthewdeng @jianoaix @ericl PR is updated with a feature guide and some focused test coverage for the batcher, PTAL!

local_shuffle_buffer_size: If non-None, the data will be randomly shuffled
using a local in-memory shuffle buffer, and this value will serve as the
minimum number of rows that must be in the local in-memory shuffle
buffer in order to yield a batch. This is a light-weight alternative to
Contributor:

Can you mention how the last remainder rows are handled? When there are fewer than local_shuffle_buffer_size rows left, we should let users know whether they should expect batches to be yielded from them.

Contributor:

Added a sentence, but the wording feels a bit off... let me know what you think!

assert self._shuffle_buffer is not None
buffer_size = BlockAccessor.for_block(self._shuffle_buffer).num_rows()
# Truncate the batch to the buffer size, if necessary.
batch_size = min(self._batch_size, buffer_size)
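The truncation behavior in the snippet above can be demonstrated in isolation. This is a sketch of the `min(batch_size, buffer_size)` drain behavior; `drain_buffer` is an illustrative helper, not a function from the PR:

```python
import random

def drain_buffer(buffer, batch_size, seed=0):
    """Drain a shuffle buffer completely, truncating the last batch."""
    rng = random.Random(seed)
    while buffer:
        # Truncate the batch to the buffer size, if necessary.
        n = min(batch_size, len(buffer))
        yield [buffer.pop(rng.randrange(len(buffer))) for _ in range(n)]

# 7 rows with batch_size=3 drain as batches of sizes 3, 3, and then 1.
sizes = [len(b) for b in drain_buffer(list(range(7)), batch_size=3)]
```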
Contributor:

So it looks like we do continue to yield batches when the number of rows in the buffer drops below shuffle_buffer_min_size. Can you adjust the parameter documentation above to reflect this?

Contributor:

Same as above.

matthewdeng and others added 2 commits July 16, 2022 18:26
@ericl (Contributor)

ericl commented Jul 17, 2022

Test failures.

@richardliaw richardliaw merged commit 864af14 into ray-project:master Jul 17, 2022
jianoaix pushed a commit to jianoaix/ray that referenced this pull request Jul 18, 2022
…ject#26094)

Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: Matthew Deng <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Signed-off-by: Ubuntu <[email protected]>
xwjiang2010 pushed a commit to xwjiang2010/ray that referenced this pull request Jul 19, 2022
…ject#26094)

Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: Matthew Deng <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Signed-off-by: Xiaowei Jiang <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…ject#26094)

Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: Matthew Deng <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Signed-off-by: Stefan van der Kleij <[email protected]>
@amogkam amogkam mentioned this pull request Oct 28, 2022
Successfully merging this pull request may close these issues.

  • [Datasets] Support local shuffle via in-memory buffer
  • [datasets] Requirements for my use case