Fuse operations with different numbers of tasks #368
Conversation
```diff
@@ -47,6 +47,7 @@ def mean(x, /, *, axis=None, keepdims=False, use_new_impl=False):
         dtype=dtype,
         keepdims=keepdims,
         use_new_impl=use_new_impl,
+        split_every=split_every,
```
Is this `split_every` argument a temporary implementation detail whilst we figure out fusing heuristics? It would be nice to relate the meaning of this argument back to the discussion in #284 (as I currently don't quite understand what it means).
The `split_every` argument is what we referred to as "fan-in" in #284 - the number of chunks read by one task doing the reduction step. It's the same in Dask and can be a dictionary indicating the number of chunks to read in each dimension.
I hope it is something that we can get better heuristics for (or at least good defaults) - possibly by measuring trade-offs like they did in the Primula paper, see #331.
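
To make the fan-in idea concrete, here is a small illustrative calculation (not from this PR; the chunk count of 500 is a hypothetical input, chosen so the result matches the 50, 5, 1 reduce rounds quoted later in this thread) showing how a scalar `split_every` determines the size of each reduce round:

```python
import math

def reduce_round_sizes(n_chunks, split_every):
    """Number of tasks in each successive reduce round when every task
    reads at most `split_every` input chunks (scalar fan-in)."""
    sizes = []
    while n_chunks > 1:
        n_chunks = math.ceil(n_chunks / split_every)
        sizes.append(n_chunks)
    return sizes

# 500 chunks along the reduced axis with split_every=10
# gives rounds of 50, 5, and 1 task(s).
print(reduce_round_sizes(500, 10))  # [50, 5, 1]
```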
This is very exciting!
I've generated a few plan visualizations to give an idea of what the code does.

Running this cut-down quadratic means example gives the following plan using the old optimization algorithm (for [...]):

*(plan image omitted; the original comment described the reduce rounds here)*

If we use the new optimization algorithm then it will fuse the first [...]:

```python
m = xp.mean(uv, axis=0, split_every=10, use_new_impl=True)
m.visualize(
    filename=tmp_path / "quad_means",
    optimize_function=multiple_inputs_optimize_dag,
)
```

*(plan image omitted)*

Notice that the number of tasks has gone down from 160 to 59, and the amount of intermediate data stored has gone down from 16.8GB to 7.7GB. The reduce rounds have 50, 5, and 1 task(s) - this is controlled by the `split_every` argument.

If we now fuse the first two operations (purple boxes with rounded corners) then we can make the plan even smaller. Note that in the code below we have to explicitly ask for this operation to be fused. This uses the logic from this PR, which is able to fuse operations with a different number of tasks. The first operation (marked `op-008`) [...]:

```python
m = xp.mean(uv, axis=0, split_every=10, use_new_impl=True)
m.visualize(
    filename=tmp_path / "quad_means",
    optimize_function=partial(multiple_inputs_optimize_dag, always_fuse=["op-008"]),
)
```

*(plan image omitted)*

There are now only two reduce rounds, with 5 and 1 task(s), and a total of only 8 tasks and 166.8MB of intermediate data - a significant saving.
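
For readers who want to try something similar, here is a minimal, hypothetical stand-in for the setup. The original example's array sizes, chunking, memory spec, and the import path for `multiple_inputs_optimize_dag` are all my assumptions, chosen so that the reduced axis has 500 chunks (matching the 50, 5, 1 reduce rounds above); the op id `op-008` is plan-specific and will differ for other plans.

```python
from functools import partial

import cubed
import cubed.array_api as xp
from cubed.core.optimization import multiple_inputs_optimize_dag  # assumed import path

# Assumed setup (not the PR's actual benchmark code): two 500x500x500 arrays
# chunked so there are 500 chunks along axis 0, the axis being reduced.
spec = cubed.Spec(allowed_mem="2GB")  # assumed memory budget
u = xp.ones((500, 500, 500), chunks=(1, 500, 500), spec=spec)
v = xp.ones((500, 500, 500), chunks=(1, 500, 500), spec=spec)
uv = u * v  # elementwise product

m = xp.mean(uv, axis=0, split_every=10, use_new_impl=True)
m.visualize(
    filename="quad_means",
    # always_fuse takes op ids from the plan; "op-008" is only an example here
    optimize_function=partial(multiple_inputs_optimize_dag, always_fuse=["op-008"]),
)
```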
Add tests to check num_input_blocks
The mypy failures suggested to me that [...]
Wow that's a big difference. Thanks for writing this out @tomwhite. I was about to try and estimate the projected speedup here (assuming IO dominates the execution time), but without being able to see the number of tasks in each stage it's not trivial to reason about. (Half the number of zarr stores to be written, but not necessarily twice as fast because [...])
It's definitely hard to reason about, even with the number of tasks in each stage. (BTW I have listed the number of tasks in each stage in the comment above.) The total number of stages is probably the biggest factor in how long the computation takes. I plan to run some benchmarks to measure the performance of these workloads. But I agree that it would be useful to put the number of tasks in the visualization. It does appear in the tooltip for nodes, but for some reason that doesn't appear when the SVG is embedded in another page.
Fixes #284
This adds the necessary logic to the fuse code to fuse two operations that have a different number of tasks, but fusion must be controlled manually using the `always_fuse` argument to the optimization functions. This is fine for experimenting with the optimizer, but we'll probably want to add better heuristics about when to fuse these types of operations automatically in the future.
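
As a thought experiment only (not part of this PR), an automatic heuristic might end up looking something like the sketch below, where fusion across differing task counts is allowed only when the implied fan-in stays small; the function name, signature, and threshold are all hypothetical.

```python
def should_fuse(predecessor_tasks: int, successor_tasks: int, max_fan_in: int = 10) -> bool:
    """Hypothetical rule of thumb: fuse an operation with its predecessor,
    even when their task counts differ, only if each fused task would read
    a modest number of predecessor outputs (akin to bounding split_every)."""
    if successor_tasks <= 0:
        return False
    fan_in = predecessor_tasks / successor_tasks
    return fan_in <= max_fan_in
```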