[data] Stage fusion optimizations, off by default #22373

ericl · 2022-02-15T03:14:46Z

Why are these changes needed?

This PR adds the following stage fusion optimizations (off by default). In a later PR, I plan to enable them by default for DatasetPipelines.

- Stage fusion: Whether to fuse compatible OneToOne stages.
- Read stage fusion: Whether to fuse read stages into downstream OneToOne stages. This is accomplished by rewriting the read stage (LazyBlockList) into a transformation over a collection of read tasks (BlockList -> MapBatches(do_read)).
- Shuffle stage fusion: Whether to fuse compatible OneToOne stages into shuffle stages that support specifying a map-side block UDF.
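
As a rough illustration of the read stage rewrite, the sketch below treats each read task as a zero-argument callable and turns "read" into an ordinary transformation over those tasks, so it can be composed with a downstream block function. Only the name `do_read` comes from the description above; `ReadTask`, `fuse_read_with`, and the sample data are hypothetical, and none of this is the actual Ray implementation.

```python
from typing import Any, Callable, List

Block = List[Any]
# Hypothetical: a read task is a zero-argument callable producing one block.
ReadTask = Callable[[], Block]


def do_read(read_task: ReadTask) -> Block:
    """Execute a read task. Expressed as a transformation, it can be fused."""
    return read_task()


def fuse_read_with(downstream: Callable[[Block], Block]) -> Callable[[ReadTask], Block]:
    """Compose the read with a downstream block function into one callable."""
    def read_and_transform(read_task: ReadTask) -> Block:
        # The block is read and transformed in a single step, so the raw
        # input block never has to be materialized as a separate stage output.
        return downstream(do_read(read_task))
    return read_and_transform


# Two toy read tasks, each yielding three consecutive integers.
read_tasks = [lambda i=i: list(range(i * 3, i * 3 + 3)) for i in range(2)]
double = lambda block: [x * 2 for x in block]

fused = fuse_read_with(double)
blocks = [fused(task) for task in read_tasks]
```

In a real cluster each fused callable would run as one remote task per read task; here they run inline to keep the sketch self-contained.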

Stages are considered compatible if their compute strategy is the same ("tasks" vs "actors") and they have the same Ray remote args. Currently, this PR ignores the remote args of read tasks, but this will be fixed in a follow-up (I didn't want to change the read tasks default here).
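
The compatibility rule can be sketched as a small check plus a greedy pass over adjacent stages. This is a minimal illustration under assumed names: `OneToOneStage` here is a stand-in dataclass, and `can_fuse`, `fuse`, and `fuse_stages` are hypothetical helpers, not the classes or functions in `python/ray/data/impl/plan.py`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Block = List[Any]


@dataclass
class OneToOneStage:
    """Hypothetical stand-in for a OneToOne stage (not Ray's actual class)."""
    name: str
    block_fn: Callable[[Block], Block]
    compute: str = "tasks"  # "tasks" or "actors"
    ray_remote_args: Dict[str, Any] = field(default_factory=dict)

    def can_fuse(self, prev: "OneToOneStage") -> bool:
        # Compatible means: same compute strategy and same remote args.
        return (
            self.compute == prev.compute
            and self.ray_remote_args == prev.ray_remote_args
        )

    def fuse(self, prev: "OneToOneStage") -> "OneToOneStage":
        prev_fn, self_fn = prev.block_fn, self.block_fn

        def fused_fn(block: Block) -> Block:
            # Run both transformations back to back on the same block.
            return self_fn(prev_fn(block))

        return OneToOneStage(
            f"{prev.name}->{self.name}",
            fused_fn,
            self.compute,
            dict(self.ray_remote_args),
        )


def fuse_stages(stages: List[OneToOneStage]) -> List[OneToOneStage]:
    """Greedily merge each stage into its predecessor when compatible."""
    fused: List[OneToOneStage] = []
    for stage in stages:
        if fused and stage.can_fuse(fused[-1]):
            fused[-1] = stage.fuse(fused[-1])
        else:
            fused.append(stage)
    return fused


stages = [
    OneToOneStage("map", lambda b: [x + 1 for x in b]),
    OneToOneStage("filter", lambda b: [x for x in b if x % 2 == 0]),
    OneToOneStage("gpu_map", lambda b: [x * 10 for x in b], compute="actors"),
]
optimized = fuse_stages(stages)
```

With these inputs the first two stages collapse into one ("map->filter"), while the actor-based stage stays separate because its compute strategy differs.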

I've experimented with this locally and the memory reduction is ~50%, mostly from clearing input blocks. In the distributed setting, I'd expect more savings, especially under memory pressure, since fusion suppresses excess spilling / transfer of blocks.

Related issue number

Towards #18791

clarkzinzow

The overall paradigm is looking good. Making sure that read-time load balancing is preserved when given _spread_resource_prefix="node:" is the only big blocker that I see.

python/ray/data/impl/plan.py

python/ray/data/dataset_pipeline.py

python/ray/data/impl/plan.py

ericl

Updated.

clarkzinzow

LGTM!

clarkzinzow · 2022-02-17T01:09:03Z

python/ray/data/tests/test_stats.py

@@ -167,6 +167,7 @@ def test_dataset_pipeline_stats_basic(ray_start_regular_shared):
    for batch in pipe.iter_batches():
        pass
    stats = canonicalize(pipe.stats())
+    print(stats)


Suggested change

print(stats)

ericl added 30 commits February 8, 2022 20:27

update

05b4399

opt

7d2e8a4

fix stats

f9d63b0

update

686f451

add schema

dca8ea2

update

46cfa02

wip

bf7e80c

finish

6ebc8e6

Merge remote-tracking branch 'upstream/master' into prototype-lazy

351a9e9

add moving

633a11d

fix pipeline test

e77a364

update

399b21d

fix tests

ee9d2dd

update

4b367f9

update

d88f687

fix unit tests

a4f657e

fix pipeline test

c53af61

fix

c7c47be

wip

ddffad2

wip

eada5df

fusion wip

c68e7ca

wip

ae55cb6

Merge remote-tracking branch 'upstream/master' into prototype-lazy

06a7631

fix

4cbdce8

Merge remote-tracking branch 'upstream/master' into prototype-lazy

75f7aa9

Merge branch 'prototype-lazy' into stage-fusion

b53d10d

order stages

ce83ff4

update

bb8d2a0

update

37a3828

Merge branch 'ordered-dict' into stage-fusion

ee9d971

ericl added 7 commits February 15, 2022 13:19

Merge remote-tracking branch 'upstream/master' into stage-fusion

609659a

Merge remote-tracking branch 'upstream/master' into stage-fusion

b29a7c0

update

de21abb

wip

3b65296

wip

cff3fee

wip

9e9a7a9

add optimiez

c8633c3

ericl requested a review from clarkzinzow as a code owner February 16, 2022 00:46

ericl added 5 commits February 15, 2022 16:50

update

a4a5954

update

6e5455d

update

f382590

docs

cee880e

add docs

70ebe02

ericl changed the title ~~[WIP] Stage fusion~~ [data] Stage fusion optimizations, off by default Feb 16, 2022

ericl assigned scv119, jjyao and clarkzinzow Feb 16, 2022

false by default

1dd3eaa

clarkzinzow requested changes Feb 16, 2022

ericl added 2 commits February 16, 2022 12:52

update

930b705

fix stats

e5e1c89

ericl commented Feb 16, 2022

ericl added 3 commits February 16, 2022 16:21

Merge remote-tracking branch 'upstream/master' into stage-fusion

e098438

update

e3dcf9a

fix clear

e7bde93

clarkzinzow approved these changes Feb 17, 2022

ericl added 2 commits February 16, 2022 18:23

Merge remote-tracking branch 'upstream/master' into stage-fusion

a1585de

fix

a5e82ae

ericl merged commit 786c575 into ray-project:master Feb 17, 2022
