Generify SortPreservingMerge (#5882) (#5879) #5886

Merged
merged 3 commits into from
Apr 7, 2023

Conversation

tustvold

@tustvold tustvold commented Apr 5, 2023

Which issue does this PR close?

Part of #5882
Relates to #5879

Rationale for this change

To special-case cursors (#5882), we first need to decouple SortPreservingMerge from SortKeyCursor.

What changes are included in this PR?

  • Splits various functionality into smaller, crate-private modules
  • Makes SortPreservingMergeStream generic over Cursor and accept a type-erased CursorStream
  • Splits batch construction logic into BatchBuilder
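As a rough illustration of what "generic over Cursor" buys, here is a minimal sketch of a k-way merge written only against a cursor trait. The `Cursor` trait, `VecCursor`, and `merge_order` are hypothetical names invented for this example and are not the types in this PR; the linear scan is chosen for brevity and is not claimed to be how the real stream selects the next row.

```rust
use std::cmp::Ordering;

/// Hypothetical cursor abstraction: a position within one sorted input.
trait Cursor {
    /// True once every row of this input has been yielded.
    fn is_finished(&self) -> bool;
    /// Step to the next row.
    fn advance(&mut self);
    /// Compare the current rows of two unfinished cursors.
    fn compare(&self, other: &Self) -> Ordering;
}

/// Illustrative cursor over an already-sorted vector of integers.
struct VecCursor {
    values: Vec<i64>,
    offset: usize,
}

impl VecCursor {
    fn new(values: Vec<i64>) -> Self {
        Self { values, offset: 0 }
    }
}

impl Cursor for VecCursor {
    fn is_finished(&self) -> bool {
        self.offset >= self.values.len()
    }

    fn advance(&mut self) {
        self.offset += 1;
    }

    fn compare(&self, other: &Self) -> Ordering {
        self.values[self.offset].cmp(&other.values[other.offset])
    }
}

/// K-way merge written only against the `Cursor` trait.
/// Returns the index of the input that supplies each output row, in order.
fn merge_order<C: Cursor>(mut cursors: Vec<C>) -> Vec<usize> {
    let mut order = Vec::new();
    loop {
        // Linear scan for the unfinished cursor with the smallest current row.
        // Ties keep the lower input index, so the merge is stable.
        let mut best: Option<usize> = None;
        for (i, cursor) in cursors.iter().enumerate() {
            if cursor.is_finished() {
                continue;
            }
            let replace = match best {
                None => true,
                Some(b) => cursor.compare(&cursors[b]) == Ordering::Less,
            };
            if replace {
                best = Some(i);
            }
        }
        match best {
            Some(i) => {
                order.push(i);
                cursors[i].advance();
            }
            None => return order,
        }
    }
}
```

Because `merge_order` names only the trait, a different cursor implementation can be swapped in without touching the merge loop.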

Are these changes tested?

Are there any user-facing changes?

No

/// Will then drop any batches for which all rows have been yielded to the output
///
/// Returns `None` if no pending rows
pub fn build_record_batch(&mut self) -> Result<Option<RecordBatch>> {

This method is moved pretty much verbatim from sort_preserving_merge.rs

type CursorStream<C> = Box<dyn PartitionedStream<Output = Result<(C, RecordBatch)>>>;

#[derive(Debug)]
struct SortPreservingMergeStream<C> {

This is ported from sort_preserving_merge.rs but tweaked heavily to make it slightly easier to follow (hopefully)
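The `CursorStream<C>` alias boxes a trait object, so the merge stream is generic over the cursor type alone and not over whatever produces the cursors. A minimal synchronous sketch of that type-erasure pattern follows. `PartitionedSource`, `BoxedSource`, `QueueSource`, `CountingSource`, and `Consumer` are hypothetical stand-ins invented for illustration; the real `PartitionedStream` is asynchronous and is not reproduced here.

```rust
use std::collections::VecDeque;

/// Hypothetical, synchronous stand-in for a partitioned stream:
/// several independent partitions, each polled by index.
trait PartitionedSource {
    type Output;
    fn partitions(&self) -> usize;
    fn next_item(&mut self, partition: usize) -> Option<Self::Output>;
}

/// Type-erased source: callers are generic over the item type only.
type BoxedSource<T> = Box<dyn PartitionedSource<Output = T>>;

/// One concrete source, backed by in-memory queues.
struct QueueSource {
    queues: Vec<VecDeque<i64>>,
}

impl PartitionedSource for QueueSource {
    type Output = i64;

    fn partitions(&self) -> usize {
        self.queues.len()
    }

    fn next_item(&mut self, partition: usize) -> Option<i64> {
        self.queues[partition].pop_front()
    }
}

/// A second concrete source that computes `partition * 10 + step`
/// for `steps` items per partition.
struct CountingSource {
    emitted: Vec<i64>,
    steps: i64,
}

impl PartitionedSource for CountingSource {
    type Output = i64;

    fn partitions(&self) -> usize {
        self.emitted.len()
    }

    fn next_item(&mut self, partition: usize) -> Option<i64> {
        let step = self.emitted[partition];
        if step >= self.steps {
            return None;
        }
        self.emitted[partition] += 1;
        Some(partition as i64 * 10 + step)
    }
}

/// Consumer that names only the item type `T`, never the concrete source.
struct Consumer<T> {
    source: BoxedSource<T>,
}

impl<T> Consumer<T> {
    fn new(source: BoxedSource<T>) -> Self {
        Self { source }
    }

    /// Drain every partition in index order.
    fn drain(&mut self) -> Vec<Vec<T>> {
        let mut out = Vec::new();
        for p in 0..self.source.partitions() {
            let mut items = Vec::new();
            while let Some(item) = self.source.next_item(p) {
                items.push(item);
            }
            out.push(items);
        }
        out
    }
}
```

The cost of this design is one dynamic dispatch per call into the source, in exchange for keeping the consumer's type parameters small.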


pub use cursor::SortKeyCursor;
pub use index::RowIndex;

pub(crate) struct SortedStream {

This abstraction was a tad pointless: it was only used by ExternalSorter to propagate the size of the in-memory sorted batches. I adjusted it to just call init_mem_used itself, and created #5885 to track the broader pre-existing issue of memory accounting within merge streams.

@tustvold

tustvold commented Apr 5, 2023

I have confirmed this has no discernible impact on the existing benchmarks, nor the benchmarks added in #5881

@alamb

alamb commented Apr 5, 2023

cc @yjshen and @jaylmiller


@alamb alamb left a comment


Thank you @tustvold -- this looks like an improvement to me. I plan to run the sort benchmarks against this branch as well, given how important sorting is to so many use cases.

I think we should leave this one open for a few days to allow others who might be interested to comment

cc @Dandandan in case you know of others

@tustvold

tustvold commented Apr 6, 2023

#5894 and #5895 both build off this and each improve performance by about 10%. I will leave this open for a bit longer, but unless anybody objects I intend to merge this tomorrow morning to keep things moving along.


@alamb alamb left a comment


I ran the performance benchmarks on this branch (2023-04-06-sorting.txt) and they confirm what @tustvold reported (no discernible change).

@alamb alamb mentioned this pull request Apr 6, 2023
6 tasks
@tustvold

tustvold commented Apr 6, 2023

5f7a3d6#diff-95c601f2909c85052924354d8d161a9f7bf539c47b59abb1d525a6d217b4e402R60-R61 shows how this can be used to specialize merge for a single primitive column
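To make the idea of specializing for a single primitive column concrete, here is a hedged sketch, not the code in the linked commit. It assumes the general path compares rows as order-preserving byte strings, while a specialized cursor compares native values directly; both plug into the same generic merge. All names (`Cursor`, `ByteRowCursor`, `PrimitiveCursor`, `merge_two`) are hypothetical.

```rust
use std::cmp::Ordering;

/// Hypothetical cursor abstraction shared by both implementations below.
trait Cursor {
    fn is_finished(&self) -> bool;
    fn advance(&mut self);
    fn compare(&self, other: &Self) -> Ordering;
}

/// General-purpose cursor: each row is an order-preserving byte encoding.
struct ByteRowCursor {
    rows: Vec<Vec<u8>>,
    offset: usize,
}

impl ByteRowCursor {
    /// Big-endian bytes of a `u32` sort lexicographically in numeric order.
    fn from_u32(values: &[u32]) -> Self {
        let rows = values.iter().map(|v| v.to_be_bytes().to_vec()).collect();
        Self { rows, offset: 0 }
    }
}

impl Cursor for ByteRowCursor {
    fn is_finished(&self) -> bool {
        self.offset >= self.rows.len()
    }

    fn advance(&mut self) {
        self.offset += 1;
    }

    fn compare(&self, other: &Self) -> Ordering {
        // Byte-wise comparison of the encoded rows.
        self.rows[self.offset].cmp(&other.rows[other.offset])
    }
}

/// Specialized cursor: a single primitive column compared natively.
struct PrimitiveCursor {
    values: Vec<u32>,
    offset: usize,
}

impl PrimitiveCursor {
    fn new(values: &[u32]) -> Self {
        Self { values: values.to_vec(), offset: 0 }
    }
}

impl Cursor for PrimitiveCursor {
    fn is_finished(&self) -> bool {
        self.offset >= self.values.len()
    }

    fn advance(&mut self) {
        self.offset += 1;
    }

    fn compare(&self, other: &Self) -> Ordering {
        // One integer comparison, no encoding step.
        self.values[self.offset].cmp(&other.values[other.offset])
    }
}

/// Two-way merge over any cursor type. Emits 0 when the next row comes
/// from `a` and 1 when it comes from `b`; ties prefer `a`.
fn merge_two<C: Cursor>(mut a: C, mut b: C) -> Vec<usize> {
    let mut order = Vec::new();
    loop {
        let take_a = match (a.is_finished(), b.is_finished()) {
            (true, true) => return order,
            (false, true) => true,
            (true, false) => false,
            (false, false) => a.compare(&b) != Ordering::Greater,
        };
        if take_a {
            order.push(0);
            a.advance();
        } else {
            order.push(1);
            b.advance();
        }
    }
}
```

Both cursor types yield the same merge order for the same inputs; the specialized one simply avoids building and comparing encoded rows.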

@tustvold tustvold merged commit e41711b into apache:main Apr 7, 2023
Labels
core Core DataFusion crate
2 participants