
Add need_data_exchange in the ExecutionPlan to indicate whether a physical operator needs data exchange #4586

Merged
7 commits merged into apache:master on Dec 16, 2022

Conversation

yahoNanJing
Contributor

Which issue does this PR close?

Closes #4585.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Dec 12, 2022
@yahoNanJing yahoNanJing marked this pull request as draft December 12, 2022 09:11
/// 2. CoalescePartitionsExec for collapsing all of the partitions into one without ordering guarantee
/// 3. SortPreservingMergeExec for collapsing all of the sorted partitions into one with ordering guarantee
pub trait DataExchangeExecutionPlan: ExecutionPlan {}
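For illustration only (an assumption, not code from the PR; the trait approach was later replaced by a need_data_exchange function, as the title change below shows), the marker trait could be implemented for exactly the three exchange operators listed in the doc comment:

use datafusion::physical_plan::{
    coalesce_partitions::CoalescePartitionsExec, repartition::RepartitionExec,
    sorts::sort_preserving_merge::SortPreservingMergeExec,
};

// Hypothetical impls: mark the three operators that move data between partitions.
impl DataExchangeExecutionPlan for RepartitionExec {}
impl DataExchangeExecutionPlan for CoalescePartitionsExec {}
impl DataExchangeExecutionPlan for SortPreservingMergeExec {}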

Contributor

👍

@yahoNanJing yahoNanJing changed the title Add a trait DataExchangeExecutionPlan to indicate whether an execution plan needs a data exchange Add need_data_exchange in the ExecutionPlan to indicate whether a physical operator needs data exchange Dec 13, 2022
@yahoNanJing yahoNanJing marked this pull request as ready for review December 13, 2022 03:30
@yahoNanJing
Contributor Author

With this PR merged into DataFusion, the Ballista side will be much cleaner. One example is as follows:
yahoNanJing/arrow-ballista@9dbcedc

Hi @andygrove, @mingmwang, @alamb, could you help review this PR?

By the way, regarding the last commit (a unit-test workaround): if we upgrade arrow-rs to 29.0.0, the issue can be avoided and that last commit in this PR can be reverted.

@alamb
Contributor

alamb commented Dec 13, 2022

By the way, regarding the last commit (a unit-test workaround): if we upgrade arrow-rs to 29.0.0, the issue can be avoided and that last commit in this PR can be reverted.

I think @tustvold has a PR close to ready to go for the upgrade: #4587

I'll try and get it ready later today if he hasn't had a chance

Contributor

@alamb alamb left a comment

Thank you @yahoNanJing and @mingmwang, these changes look good to me.

It seems like this PR contains both need_data_exchange and some improvements to the enforcement pass to run more sorts in parallel -- perhaps the PR title could be modified to include the optimizer improvements as well

By the way, regarding the last commit (a unit-test workaround): if we upgrade arrow-rs to 29.0.0, the issue can be avoided and that last commit in this PR can be reverted.

I merged the arrow 29 upgrade #4587 so if you pick up the latest master changes the tests should now pass

@@ -835,13 +836,42 @@ fn new_join_conditions(
new_join_on
}

/// This function checks whether we need to add additional plan operators
Contributor

Thank you for these comments

.and_then(|sort_exec| {
// If it's already preserving the partitioning, it can be regarded as a local sort
// and there's no need for this optimization
if !sort_exec.preserve_partitioning() {
Contributor

I wonder if this should also check whether the SortExec's input has more than one partition -- otherwise the SortPreservingMerge will be a no-op.

Contributor

Agree. I think this check is still required to avoid an unnecessary shuffle/data exchange.
The reason is that the SortExec's input might be another CoalescePartitionsExec, in which case it is unnecessary to change the SortExec to SortPreservingMerge + parallel SortExec.

Contributor Author

Agree, although the current implementation will not introduce additional shuffling, since SortPreservingMerge also checks the input partition count. I'll add the check.

@@ -243,6 +243,34 @@ pub trait ExecutionPlan: Debug + Send + Sync {
fn statistics(&self) -> Statistics;
}

/// Indicate whether a data exchange is needed, which will be very helpful
Contributor

Suggested change
/// Indicate whether a data exchange is needed, which will be very helpful
/// Indicate whether a data exchange is needed at the input of `plan`, which will be very helpful

/// 1. RepartitionExec for changing the partition number between two operators
/// 2. CoalescePartitionsExec for collapsing all of the partitions into one without ordering guarantee
/// 3. SortPreservingMergeExec for collapsing all of the sorted partitions into one with ordering guarantee
pub fn need_data_exchange(plan: Arc<dyn ExecutionPlan>) -> bool {
Contributor

I wonder if you could generalize this code so that it compares plan.child().output_partitioning() and plan.required_input_distribution()

Though it was not trivial when I was thinking about it

Contributor Author

I'm not sure we can decide whether a data exchange is needed based only on plan.child().output_partitioning() and plan.required_input_distribution(). For example, how would we make a decision for a Join, which has two children? The enforcement rule will decide and add the necessary physical operators, like RepartitionExec, CoalescePartitionsExec, and SortPreservingMergeExec. That should not be part of this function.
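To make the discussion concrete, here is a minimal sketch (not necessarily the exact body merged in this PR) of how a downcast-based need_data_exchange could be written for the three operators listed in the doc comment; the round-robin special case is an assumption for illustration:

use std::sync::Arc;

use datafusion::physical_plan::{
    coalesce_partitions::CoalescePartitionsExec, repartition::RepartitionExec,
    sorts::sort_preserving_merge::SortPreservingMergeExec, ExecutionPlan, Partitioning,
};

/// Sketch: returns true if `plan` itself exchanges data between partitions.
pub fn need_data_exchange(plan: Arc<dyn ExecutionPlan>) -> bool {
    if let Some(repartition) = plan.as_any().downcast_ref::<RepartitionExec>() {
        // Hash repartitioning moves rows between partitions; treat a pure
        // round-robin repartition as not requiring a shuffle (assumption).
        !matches!(
            repartition.output_partitioning(),
            Partitioning::RoundRobinBatch(_)
        )
    } else if let Some(coalesce) = plan.as_any().downcast_ref::<CoalescePartitionsExec>() {
        // Collapsing several input partitions into one is a data exchange.
        coalesce.input().output_partitioning().partition_count() > 1
    } else if let Some(merge) = plan.as_any().downcast_ref::<SortPreservingMergeExec>() {
        // Same as above, but the single output partition stays sorted.
        merge.input().output_partitioning().partition_count() > 1
    } else {
        false
    }
}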

} else {
Arc::new(SortExec::try_new(sort_expr, physical_input, *fetch)?)
})
Ok(Arc::new(SortExec::try_new(sort_expr, physical_input, *fetch)?))
Contributor

Is the rationale here that the enforcement phase now creates the pattern with sort and merge?

Contributor Author

Yes. It's better to avoid manually adding physical operators; that's the enforcement rule's job.

Comment on lines 2043 to 2044
"SortPreservingMergeExec: [t1_id@0 ASC NULLS LAST]",
" SortExec: [t1_id@0 ASC NULLS LAST]",
Contributor

This is a cool plan -- it is the first time I know of that DataFusion creates plans using SortPreservingMerge directly 👍

The rationale for this, as I understand it, is to support sorting in parallel. Internally the Sort operator does use a Merge if it has multiple input partitions. I wonder if we have any opinions about making the parallelism explicit in the plan (like this) or implicit (within the operator)?

I think explicit in the plan is consistent with other parts of DataFusion (e.g. CoalescePartitionsExec)

fyi @tustvold

Contributor

@Dandandan Dandandan Dec 13, 2022

In a previous PR for TopK queries I deliberately added the check that a LIMIT must be present.
The reason is that SortPreservingMergeExec + parallel SortExec was slower than CoalescePartitionsExec + SortExec in my tests, so it actually made queries run quite a bit slower. I think we have to use other means to parallelize sorts, such as range partitioning.

Contributor

Why is SortPreservingMergeExec + parallel SortExec slower than CoalescePartitionsExec + SortExec? Maybe it depends on the volume of input data that needs to be sorted.
Do you have some test data and SQL?

Contributor

I don't have a reproducible example ready unfortunately; the way I tested was running some SELECT x, y FROM test ORDER BY z type queries on the TPC-H order lines table (using 16 partitions).

This was before #4301 got merged, though, so maybe things are different now 🤔

What about adding some similar queries to the parquet SQL benchmarks to confirm?

Contributor Author

I think the core idea of divide and conquer for parallel sorting and merging is very general. I'm also curious why the performance is worse than the single-node global sort.

Contributor Author

What's more, the rationale for depending on the limit to decide whether to switch to a parallel sort is also not strong. Sometimes the limit is very large and nothing can be pruned.

Contributor

Both excellent points.

  • Limit looked like a relatively good heuristic, as one often provides a value smaller than the size of the input, like 10, 100, 1000 or 10K, but not 10M, 100M, etc. One could fine-tune the heuristic by only applying it for a "sufficiently" small limit. The limit causes the number of rows fed into SortPreservingMergeExec to be only partitions * limit (besides also speeding up the parallel sort); see the worked example below.

  • I am not sure whether we should expect SortPreservingMergeExec + parallel SortExec to be (much) faster than CoalescePartitionsExec + SortExec. SortPreservingMergeExec itself runs on one partition/thread, is a relatively heavy operation, and also needs to wait for all of the sorted input partitions to be ready.

Maybe with some fine-tuning (or maybe already after #4301) it is possible that it is faster than a single-threaded SortExec in certain cases, but other approaches like range partitioning will be better for query parallelism.
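To make the partitions * limit bound above concrete with assumed numbers: with 16 input partitions and LIMIT 1000, each parallel SortExec can truncate its output to at most 1,000 rows, so SortPreservingMergeExec merges at most 16 * 1000 = 16,000 rows regardless of table size; without a limit it must merge every row of the input.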

Contributor

@Dandandan Dandandan left a comment

I think we have to put back the condition that a limit must be present in order to parallelize sorts with SortPreservingMergeExec, for now, to restore the previous behavior.

@yahoNanJing
Contributor Author

Just rebased onto the latest master branch.

@alamb
Contributor

alamb commented Dec 14, 2022

Is this PR ready for another round of review?

@yahoNanJing
Contributor Author

Hi @alamb, yes. This PR is ready for another round of review, except for the concern about the removal of the limit check.

@yahoNanJing
Contributor Author

Maybe I can first add back the limit check.

Contributor

@alamb alamb left a comment

LGTM -- I also think @Dandandan 's comments have been addressed.

Thank you @yahoNanJing and @mingmwang

// - There's no limit pushed down to the local sort (It's still controversial)
if sort_exec.input().output_partitioning().partition_count() > 1
&& !sort_exec.preserve_partitioning()
&& sort_exec.fetch().is_some()
Contributor

FYI @Dandandan the check for local limit has been restored

Contributor

👍

// There are three situations where there's no need for this optimization
// - There's only one input partition;
// - It's already preserving the partitioning so that it can be regarded as a local sort
// - There's no limit pushed down to the local sort (It's still controversial)
Contributor

Might be worth a link to the ticket / PR for anyone who sees this comment and wants more context / backstory

@alamb alamb dismissed Dandandan’s stale review December 15, 2022 19:22

Changes addressed. Please re-review

@Dandandan Dandandan merged commit 920f11a into apache:master Dec 16, 2022
@Dandandan
Contributor

Nice!

BTW I am not against the change to use SortPreservingMergeExec in more cases - if we can show we can maintain or improve performance, then that would be perfect 👍

@ursabot

ursabot commented Dec 16, 2022

Benchmark runs are scheduled for baseline = ca8985e and contender = 920f11a. 920f11a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
