
More realistic sort benchmarks #5881

Merged: 3 commits into apache:main on Apr 6, 2023

Conversation

@tustvold (Contributor) commented on Apr 5, 2023:

Which issue does this PR close?

Relates to #5879

Rationale for this change

This combines the merge and sort benchmarks to avoid code duplication. It also makes a number of changes to make the benchmarks more realistic:

  • Each partition / stream now contains multiple batches, an average of 13
  • The sort_merge bench now performs a partitioned sort followed by a SortPreservingMerge, instead of sorting into a single partition and then running a SortPreservingMerge on that single partition (which is a no-op); see the sketch below
  • Tuples are sliced and then collected into their output arrays, which ensures that DictionaryArrays aren't all using identical dictionary values
  • The merge benchmark is run on pre-sorted data, while the other benchmarks are now run on unsorted data
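
For reference, the plan built by the sort_merge case now looks roughly like the following (a minimal sketch mirroring the code in this PR; imports are elided, `partitions: &[Vec<RecordBatch>]` is the generated input, `schema` its SchemaRef, and `sort` the Vec<PhysicalSortExpr> describing the sort keys):

// Feed the unsorted partitions in as-is
let exec = MemoryExec::try_new(partitions, schema, None).unwrap();

// Sort each input partition independently (preserve_partitioning = true),
// producing one sorted stream per input partition
let exec = SortExec::new_with_partitioning(sort.clone(), Arc::new(exec), true, None);

// Merge the per-partition sorted streams into a single, fully sorted output
let plan = Arc::new(SortPreservingMergeExec::new(sort, Arc::new(exec)));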

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

// parameters:
//
// Input schemas
lazy_static! {

@tustvold (author) commented:

Lazy static seemed overkill for what this was doing, so I opted to keep things simple.
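
As an illustration of the simpler alternative (a sketch only -- the field names here are hypothetical, not the ones used in this PR), a plain function can replace a lazy_static! schema:

use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema, SchemaRef};

// Build the benchmark input schema on demand rather than caching it in a
// lazy_static! global; constructing a small Schema is cheap.
fn test_schema() -> SchemaRef {
    Arc::new(Schema::new(vec![
        Field::new("f64", DataType::Float64, true), // hypothetical column
        Field::new("utf8", DataType::Utf8, true),   // hypothetical column
    ]))
}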

@alamb (Contributor) left a comment:

Thank you @tustvold -- I reviewed this and it makes sense to me. I also spot-checked the inputs (by printing them out to standard out) and it looked good.

cc @jaylmiller

plan: Arc<dyn ExecutionPlan>,
partition_count: usize,
}
fn sort(partitions: &[Vec<RecordBatch>]) -> Self {

Suggested change
-fn sort(partitions: &[Vec<RecordBatch>]) -> Self {
+/// Test SortExec in "non partitioned" mode which sorts the input streams
+/// into a single sorted output stream
+fn sort(partitions: &[Vec<RecordBatch>]) -> Self {

}
}

fn sort_partitioned(partitions: &[Vec<RecordBatch>]) -> Self {

Suggested change
-fn sort_partitioned(partitions: &[Vec<RecordBatch>]) -> Self {
+/// Test SortExec in "partitioned" mode which sorts the input streams
+/// individually into some number of output streams
+fn sort_partitioned(partitions: &[Vec<RecordBatch>]) -> Self {

fn run(&self) {
let plan = Arc::clone(&self.plan);
let task_ctx = Arc::clone(&self.task_ctx);

fn sort_merge(partitions: &[Vec<RecordBatch>]) -> Self {

Suggested change
-fn sort_merge(partitions: &[Vec<RecordBatch>]) -> Self {
+/// Test SortExec in "partitioned" mode followed by a SortPreservingMerge
+fn sort_merge(partitions: &[Vec<RecordBatch>]) -> Self {

let exec = MemoryExec::try_new(partitions, schema, None).unwrap();
let exec = SortExec::new_with_partitioning(sort.clone(), Arc::new(exec), true, None);
let plan = Arc::new(SortPreservingMergeExec::new(sort, Arc::new(exec)));

@alamb (Contributor) commented:

I would expect this to behave the same, performance wise, as sort -- is that your expectation too?

@tustvold (author) replied:

Eventually yes; currently, for single-column cases, it performs worse.

runtime: Runtime,
task_ctx: Arc<TaskContext>,

// The plan to run
plan: Arc<dyn ExecutionPlan>,
}

-impl SortBenchCase {
+impl BenchCase {
/// Prepare to run a benchmark that merges the specified
/// partitions (streams) together using all keyes

Suggested change
-/// partitions (streams) together using all keyes
+/// pre-sorted partitions (streams) together using all keys

@jaylmiller (Contributor) commented:

> Thank you @tustvold -- I reviewed this and it makes sense to me. I also spot-checked the inputs (by printing them out to standard out) and it looked good.
>
> cc @jaylmiller

Looks good to me as well!

@tustvold merged commit 5c0fe0d into apache:main on Apr 6, 2023
Labels: core (Core DataFusion crate)