Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement special GroupColumn support for byte view #12809

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

Rachelint
Copy link
Contributor

@Rachelint Rachelint commented Oct 8, 2024

Which issue does this PR close?

Closes #12771

Rationale for this change

The new column based multi gourp by values impl is proved to be performant, but it is still not supported for byte view column now.
This pr will support this for getting better performance when we enable string view by default.

What changes are included in this PR?

Support new excellent column based multi group values for byte view column.

Are these changes tested?

Yes, test by new unit tests and e2e tests (most of them helped by @alamb )

Are there any user-facing changes?

No.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great -- I am going to try and hook it up and write a few tests

Thanks @Rachelint

@Rachelint
Copy link
Contributor Author

Rachelint commented Oct 11, 2024

This looks great -- I am going to try and hook it up and write a few tests

Thanks @Rachelint

Thanks @alamb , I am working on implementing the rest main function take_n and build now.
A bit busy recent few days... I will help continue to push this forward from today.

@Rachelint
Copy link
Contributor Author

The rest work is to add tests.
Will finish it today.

@alamb alamb mentioned this pull request Oct 11, 2024
@alamb
Copy link
Contributor

alamb commented Oct 11, 2024

Amazing @Rachelint -- thank you -- I actually hacked a bit on it too on a plane ride -- I pushed what I had here: #12883

Maybe you can use / repurpose the tests.

I'll try and find time to review this weekend, but I may not have as much time as normal

@Rachelint
Copy link
Contributor Author

Amazing @Rachelint -- thank you -- I actually hacked a bit on it too on a plane ride -- I pushed what I had here: #12883

Maybe you can use / repurpose the tests.

I'll try and find time to review this weekend, but I may not have as much time as normal

Thanks, it helps much!

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 12, 2024
@Rachelint
Copy link
Contributor Author

Rachelint commented Oct 12, 2024

take_n is actually complex, I fixed ByteViewGroupValueBuilder::take_n

It is close to be ready, let's add more unit testcases before.

@Rachelint Rachelint marked this pull request as ready for review October 13, 2024 10:53
@Rachelint
Copy link
Contributor Author

Rachelint commented Oct 13, 2024

Thanks help from @alamb , I think this pr is ready now.

@Rachelint Rachelint changed the title [WIP] Impl byte view column Impl byte view column Oct 13, 2024
@alamb alamb changed the title Impl byte view column Implement special GroupCollumn support for byte view Oct 13, 2024
@alamb alamb changed the title Implement special GroupCollumn support for byte view Implement special GroupColumn support for byte view Oct 13, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started reviewing / testing this. It is looking great so far. Leaving a partial review while I work on the rest of it


let views = ScalarBuffer::from(views);

Arc::new(GenericByteViewArray::<B>::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use new_unchecked here.

I ran a flamegraph flamegraph and 5% of the time was spent validating that the output was valid utf8

Screenshot 2024-10-13 at 7 45 52 AM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I tried this locally, the performance goes from

Elapsed 0.612 seconds.

to

Elapsed 0.528 seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposed PR here: Rachelint#1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice inspect, I am checking it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merged

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @Rachelint -- this looks so great. I found it well commented, well structured, and well tested.

cc @jayzhan211 your GroupColumn pattern is really working well

There are two test cases I think we need to cover (below), but otherwise I think this PR is good to go.

I am also testing this PR with some other things to see if we can get the string view code enabled by default (finally): #12092

I also ran test coverage of this PR like this:

cargo llvm-cov --html  -p datafusion-physical-plan --lib

Here is the report:
coverage.zip

In general very nice job with coverage. There are a few items that appear to be untested:

Screenshot 2024-10-13 at 8 46 38 AM Screenshot 2024-10-13 at 8 47 09 AM

I also think it would be great to add some additional testing in fhe form of aggregate fuzz testing (mostly for the take_n logic). I have some ideas (in #12847) that I hope to refine tomorrow


// The `n == len` case, we need to take all
if self.len() == n {
let new_builder = Self::new().with_max_block_size(self.max_block_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

// - Get the last related `buffer index`(let's name it `buffer index n`)
// from last non-inlined `view`
//
// - Take buffers, the key is that we need to know if we need to take
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these comments. Very nice 💯

// 6. Take non-inlined + while last buffer in ``in_progress`
// 7. Take all views at once

let mut builder =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍


if let Some(view) = last_non_inlined_view {
let view = ByteView::from(*view);
let last_related_buffer_index = view.buffer_index as usize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a name like last_remaining_buffer_index might be clearer about what this quantity represents


// Build array and return
let views = ScalarBuffer::from(first_n_views);
Arc::new(GenericByteViewArray::<B>::new(views, buffers, null_buffer))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, I think we should use new_unchecked here as all the data is valid by construction (maybe we could keep the check in debug builds)

.rev()
.find(|view| ((**view) as u32) > 12);

if let Some(view) = last_non_inlined_view {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistically, you could reduce the indenting in this function by using a let else, like

        let Some(view) = last_non_inlined_view else {
            let views = ScalarBuffer::from(first_n_views);
            return Arc::new(GenericByteViewArray::<B>::new(
                views,
                Vec::new(),
                null_buffer,
            ))
        }

fn take_buffers_with_partial_last(
&mut self,
last_related_buffer_index: usize,
take_len: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could call this last_take_len or something to note it is the number of bytes being taken from the last buffer

@Rachelint
Copy link
Contributor Author

Rachelint commented Oct 13, 2024

Thank you so much @Rachelint -- this looks so great. I found it well commented, well structured, and well tested.

cc @jayzhan211 your GroupColumn pattern is really working well

There are two test cases I think we need to cover (below), but otherwise I think this PR is good to go.

I am also testing this PR with some other things to see if we can get the string view code enabled by default (finally): #12092

I also ran test coverage of this PR like this:

cargo llvm-cov --html  -p datafusion-physical-plan --lib

Here is the report: coverage.zip
...
I also think it would be great to add some additional testing in fhe form of aggregate fuzz testing (mostly for the take_n logic). I have some ideas (in #12847) that I hope to refine tomorrow

Comments about readability improvement are fixed.

I am adding test for better test coverage.

@Rachelint
Copy link
Contributor Author

Rachelint commented Oct 13, 2024

@alamb 👍 Thanks for reminding about the test coverage.

After checking the codes again more carefully, I found some testcases indeed don't cover code paths as I expected.

I have refined the tests for equal_to and take_n, and all related paths are covered according to the report now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement special Groups for StringViews
2 participants