[dataset] Support push-based shuffle in groupby operations #25910

stephanie-wang · 2022-06-18T01:38:26Z

Why are these changes needed?

Adds an option to use push-based shuffle in groupby operations.

stephanie-wang · 2022-06-30T22:44:57Z

It turns out that reduce fully aggregates its inputs, but the merge stage should only partially aggregate them, so I made some changes to the BlockAccessor interface to support this. @ericl or @clarkzinzow, can you review the new changes?
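
To illustrate the difference between partial aggregation and finalization described above, here is a minimal sketch in plain Python. It does not use Ray's `AggregateFn` or `BlockAccessor`; the functions `accumulate`, `merge`, and `finalize` and the `(sum, count)` accumulator are hypothetical stand-ins chosen for a mean aggregation.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical accumulator for a mean: (running sum, row count).
Acc = Tuple[float, int]


def accumulate(rows: Iterable[Tuple[str, float]]) -> Dict[str, Acc]:
    """Map stage: build one partial accumulator per key for a single block."""
    out: Dict[str, Acc] = {}
    for key, value in rows:
        total, count = out.get(key, (0.0, 0))
        out[key] = (total + value, count + 1)
    return out


def merge(partials: List[Dict[str, Acc]]) -> Dict[str, Acc]:
    """Merge stage: combine accumulators per key but leave them unfinalized."""
    out: Dict[str, Acc] = {}
    for partial in partials:
        for key, (total, count) in partial.items():
            prev_total, prev_count = out.get(key, (0.0, 0))
            out[key] = (prev_total + total, prev_count + count)
    return out


def finalize(merged: Dict[str, Acc]) -> Dict[str, float]:
    """Reduce stage: turn each accumulator into the final mean."""
    return {key: total / count for key, (total, count) in merged.items()}


blocks = [
    [("a", 1.0), ("b", 10.0)],
    [("a", 3.0)],
    [("b", 20.0), ("a", 5.0)],
]
partials = [accumulate(block) for block in blocks]

# Two merge tasks, each seeing only a subset of the map outputs.
merged_1 = merge(partials[:2])
merged_2 = merge(partials[2:])

# Only the last step finalizes.
result = finalize(merge([merged_1, merged_2]))
print(result)
```

In this toy data, finalizing inside each merge task would average the per-merge means for key `a` (2.0 and 5.0) and give 3.5 instead of the correct 3.0, which is why the merge stage has to pass accumulators through unfinalized.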

clarkzinzow

LGTM overall, I added something similar to this for a generic tree-reduce operation in a prototype, happy to see this land!

Just a few nits. I'd like to see whether a parameterized use_push_based_shuffle fixture would work, since that should reduce the code impact on the tests by a good bit.

clarkzinzow · 2022-06-30T22:51:35Z

python/ray/data/_internal/simple_block.py

@@ -356,6 +356,7 @@ def aggregate_combined_blocks(
        blocks: List[Block[Tuple[KeyType, AggType]]],
        key: KeyFn,
        aggs: Tuple[AggregateFn],
+        finalize: bool,
    ) -> Tuple[Block[Tuple[KeyType, U]], BlockMetadata]:


Note that this changes the return type to Tuple[Block[Tuple[KeyType, Union[U, AggType]]], BlockMetadata]
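
A small sketch of why a `finalize` flag widens the element type: the same function returns raw accumulators when `finalize=False` and finalized values when `finalize=True`. This is not Ray's `aggregate_combined_blocks`; `combine`, `add_acc`, and `mean_of` are hypothetical names, and blocks are modeled as plain lists of `(key, accumulator)` tuples.

```python
from typing import Callable, Dict, List, Tuple, TypeVar, Union

K = TypeVar("K")
A = TypeVar("A")  # accumulator type (AggType in the signature above)
U = TypeVar("U")  # finalized output type


def combine(
    blocks: List[List[Tuple[K, A]]],
    merge: Callable[[A, A], A],
    finalize_fn: Callable[[A], U],
    finalize: bool,
) -> List[Tuple[K, Union[U, A]]]:
    """Combine per-key accumulators; apply finalize_fn only when asked to."""
    merged: Dict[K, A] = {}
    for block in blocks:
        for key, acc in block:
            merged[key] = merge(merged[key], acc) if key in merged else acc
    items = sorted(merged.items(), key=lambda kv: kv[0])
    if finalize:
        # Elements are (key, U).
        return [(key, finalize_fn(acc)) for key, acc in items]
    # Elements are still (key, A).
    return items


def add_acc(x: Tuple[float, int], y: Tuple[float, int]) -> Tuple[float, int]:
    return (x[0] + y[0], x[1] + y[1])


def mean_of(acc: Tuple[float, int]) -> float:
    return acc[0] / acc[1]


blocks = [
    [("a", (4.0, 2)), ("b", (10.0, 1))],
    [("a", (5.0, 1)), ("b", (20.0, 1))],
]
partial = combine(blocks, add_acc, mean_of, finalize=False)
final = combine(blocks, add_acc, mean_of, finalize=True)
```

Here `partial` holds `(sum, count)` tuples and `final` holds floats, so a single annotation covering both calls needs the union of the two element types.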

clarkzinzow · 2022-07-01T20:26:30Z

python/ray/data/tests/test_dataset.py

-        .count()
-    )
-    assert agg_ds.count() == 0
+@pytest.mark.parametrize("use_push_based_shuffle", [False, True])


Nit: You should be able to reduce the context munging and try-finally boilerplate by parameterizing over a fixture that does this for you, e.g.

    import pytest
    import ray


    @pytest.fixture(params=[True, False])
    def use_push_based_shuffle(request):
        ctx = ray.data.context.DatasetContext.get_current()
        original = ctx.use_push_based_shuffle
        ctx.use_push_based_shuffle = request.param
        yield
        # Restore the original setting after the test finishes.
        ctx.use_push_based_shuffle = original


    def test_groupby_arrow(ray_start_regular_shared, use_push_based_shuffle):
        # Test empty dataset.
        agg_ds = (
            ray.data.range_table(10)
            .filter(lambda r: r["value"] > 10)
            .groupby("value")
            .count()
        )
        assert agg_ds.count() == 0

groupby

3eb472c

stephanie-wang requested review from ericl, scv119, clarkzinzow, jjyao and jianoaix as code owners June 18, 2022 01:38

stephanie-wang assigned clarkzinzow Jun 18, 2022

ericl approved these changes Jun 18, 2022

stephanie-wang added 3 commits June 30, 2022 09:45

Merge remote-tracking branch 'upstream/master' into push-based-groupby

daeabdf

groupby aggregate test

282ec4a

fix groupby

6c42148

lint

4a8b7f2

clarkzinzow approved these changes Jul 1, 2022

clean

a9e5233

stephanie-wang merged commit 68b8933 into ray-project:master Jul 2, 2022

stephanie-wang deleted the push-based-groupby branch July 2, 2022 00:37
