
[Data] Async iter_batches #33510

Closed · wants to merge 18 commits

Conversation

@amogkam (Contributor) commented on Mar 21, 2023:

TODO:

  • clean up stats
  • release test

Why are these changes needed?

Related issue number

Closes #33508

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: amogkam <[email protected]>
@ericl (Contributor) left a comment:

Overall structure makes a lot of sense...

Some meta-review comments:

  • Should probably split this PR into a few pieces: common thread pool infra, pieces of the new batching code, and then the integrations
  • Should add more types


logger = logging.getLogger(__name__)

PREFETCHER_ACTOR_NAMESPACE = "ray.dataset"
@ericl:

We should really try to get rid of the actor prefetcher in 2.5...

yield enter_result


def iter_batches(
@ericl:

This function sounds very innocuous, but it's doing a lot under the hood (creating threads, etc.).

How about making it a class, such as ThreadPoolBatcher, to make it more clear when reading code that we're setting up this state and executing batching in parallel?
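A minimal sketch of that suggestion, assuming a `concurrent.futures.ThreadPoolExecutor` backing; the `ThreadPoolBatcher` name comes from the comment above, but the constructor signature and method names here are hypothetical, not the PR's actual API:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class ThreadPoolBatcher:
    """Owning this object makes the hidden state explicit: it holds a
    thread pool and computes batches in parallel."""

    def __init__(self, fn: Callable[[T], U], num_workers: int):
        self._fn = fn
        self._pool = ThreadPoolExecutor(max_workers=num_workers)

    def run(self, inputs: Iterator[T]) -> Iterator[U]:
        # Executor.map submits work eagerly but yields results in input
        # order, so callers see an ordinary ordered iterator.
        yield from self._pool.map(self._fn, inputs)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=False)
```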

# Step 5: Make sure to preserve order from threadpool results.
yield from _preserve_order(batch_iter)
else:
# If no batch prefetching is specified, then don't use a threadpool.
@ericl:

It may be preferable to always use a threadpool, but use size 1 if there isn't any prefetching. This avoids two different code paths being in play.
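A hedged sketch of that single code path; `prefetch_batches` mirrors the parameter visible elsewhere in this diff, while the function and its other names are stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def run_computations(
    inputs: Iterator[T], fn: Callable[[T], U], prefetch_batches: int
) -> Iterator[U]:
    # One path for both cases: with no prefetching requested, a single
    # worker degenerates to sequential behavior without a second branch.
    num_workers = max(1, prefetch_batches)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        yield from pool.map(fn, inputs)
```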

also be None, meaning the entirety of the last block is included in this
batch. If this value is None, this allows us to eagerly clear the last
block in this batch after reading, since the last block is not included in
any other batches.
@ericl:

Nice, was wondering how the GC would work.
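To make the mechanism concrete, a sketch of the eager-clear check implied by the docstring above; the attribute names and the `_delete_block` helper are hypothetical:

```python
def _maybe_clear_last_block(logical_batch) -> None:
    # Per the docstring: a None slice end means this batch consumes the
    # entirety of its last block, so no other batch references it and it
    # is safe to free eagerly after reading.
    if logical_batch.last_block_slice_end is None:
        _delete_block(logical_batch.block_refs[-1])  # hypothetical helper
```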


def _async_iter_batches(block_refs):
# Step 1: Construct logical batches based on the metadata.
batch_iter = _bundle_block_refs_to_logical_batches(
@ericl:

I think you need to add type annotations for each of these iterators.
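For example, annotations along these lines; the stage names follow the step comments in this diff, `LogicalBatch` appears in the code below, `ObjectRef` and `Block` are Ray's reference and block types, and `ResolvedLogicalBatch` is a hypothetical stand-in for the intermediate type:

```python
from typing import Iterator


def _bundle_block_refs_to_logical_batches(
    block_refs: Iterator["ObjectRef[Block]"],
) -> Iterator["LogicalBatch"]:
    ...


def _resolve_blocks(
    logical_batch_iter: Iterator["LogicalBatch"],
) -> Iterator["ResolvedLogicalBatch"]:
    ...
```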

yield from async_batch_iter


def legacy_iter_batches(
@ericl:

Should move legacy code to its own file.


def threadpool_computations(logical_batch_iter: Iterator[LogicalBatch]):
# Step 4.1: Resolve the blocks.
resolved_batch_iter = _resolve_blocks(
@ericl:

It seems you do block deletion here, but isn't there a risk of a race condition if a block gets deleted that another thread still needs? I would expect you would have to do deletion at the end in serial order / from a single thread.
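A sketch of the serial-deletion pattern being proposed, assuming worker threads only read blocks and the single consuming thread frees them after each ordered batch; all names here are hypothetical:

```python
import queue
from typing import Iterator


def consume_in_order(output_queue: queue.Queue, num_batches: int) -> Iterator:
    for _ in range(num_batches):
        batch = output_queue.get()
        yield batch.data
        # Deletion happens here, on one thread, strictly after the last
        # batch that touches a block has been produced, so no worker can
        # race against the free.
        for block_ref in batch.fully_consumed_block_refs:
            _delete_block(block_ref)  # hypothetical helper
```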

fetch_queue.put(e, block=True)

threads = [
threading.Thread(target=execute_computation, args=(i,))
@ericl:

Do you want to mark these as daemon threads?
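Applied to the quoted lines, that suggestion would look like the following; `execute_computation` comes from this diff, and the worker count is illustrative:

```python
import threading


def execute_computation(worker_index: int) -> None:
    ...  # stand-in for the worker loop in this diff


num_workers = 4  # illustrative

threads = [
    # daemon=True keeps these workers from blocking interpreter shutdown
    # if the caller abandons the iterator mid-stream.
    threading.Thread(target=execute_computation, args=(i,), daemon=True)
    for i in range(num_workers)
]
for t in threads:
    t.start()
```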

batch_iter, fn=threadpool_computations, num_workers=prefetch_batches
)
# Step 5: Make sure to preserve order from threadpool results.
yield from _preserve_order(batch_iter)
@ericl:

Suggested change:
- yield from _preserve_order(batch_iter)
+ yield from _restore_original_order(batch_iter)

@ericl added the @author-action-required label on Mar 22, 2023.
@ericl closed this on Mar 25, 2023.
Labels
@author-action-required — The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Data] collate_fn in iter_torch_batches could be a bottleneck
5 participants