Implement special `GroupColumn` support for byte view #12809

Rachelint · 2024-10-08T10:59:13Z

Which issue does this PR close?

Rationale for this change

The new column-based multi group-by values implementation has proven to be performant, but it does not yet support byte view columns.
This PR adds that support, so that we get better performance once string view is enabled by default.

What changes are included in this PR?

Support the new column-based multi group values implementation for byte view columns.

Are these changes tested?

Yes, tested by new unit tests and e2e tests (most of them written with help from @alamb).

Are there any user-facing changes?

No.

alamb

This looks great -- I am going to try and hook it up and write a few tests

Thanks @Rachelint

Rachelint · 2024-10-11T17:36:31Z

> This looks great -- I am going to try and hook it up and write a few tests
>
> Thanks @Rachelint

Thanks @alamb, I am now working on implementing the remaining main functions, take_n and build.
I have been a bit busy the last few days... I will keep pushing this forward starting today.

Rachelint · 2024-10-11T20:22:38Z

The remaining work is to add tests.
I will finish it today.

alamb · 2024-10-11T23:32:24Z

Amazing @Rachelint -- thank you -- I actually hacked a bit on it too on a plane ride -- I pushed what I had here: #12883

Maybe you can use / repurpose the tests.

I'll try and find time to review this weekend, but I may not have as much time as normal

Rachelint · 2024-10-12T05:33:42Z

> Amazing @Rachelint -- thank you -- I actually hacked a bit on it too on a plane ride -- I pushed what I had here: #12883
>
> Maybe you can use / repurpose the tests.
>
> I'll try and find time to review this weekend, but I may not have as much time as normal

Thanks, that helps a lot!

Rachelint · 2024-10-12T17:31:45Z

take_n is actually complex; I fixed ByteViewGroupValueBuilder::take_n.

It is close to ready. Let's add more unit test cases before that.

Rachelint · 2024-10-13T10:54:21Z

Thanks to help from @alamb, I think this PR is ready now.

alamb

I started reviewing / testing this. It is looking great so far. Leaving a partial review while I work on the rest of it

alamb · 2024-10-13T11:51:21Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+
+        let views = ScalarBuffer::from(views);
+
+        Arc::new(GenericByteViewArray::<B>::new(


I think we should use new_unchecked here.

I ran a flamegraph and 5% of the time was spent validating that the output was valid utf8

When I tried this locally, the performance goes from

Elapsed 0.612 seconds.

to

Elapsed 0.528 seconds.

Proposed PR here: Rachelint#1

Nice catch, I am checking it.

alamb

Thank you so much @Rachelint -- this looks so great. I found it well commented, well structured, and well tested.

cc @jayzhan211 your GroupColumn pattern is really working well

There are two test cases I think we need to cover (below), but otherwise I think this PR is good to go.

I am also testing this PR with some other things to see if we can get the string view code enabled by default (finally): #12092

I also ran test coverage of this PR like this:

cargo llvm-cov --html  -p datafusion-physical-plan --lib

Here is the report:
coverage.zip

In general very nice job with coverage. There are a few items that appear to be untested:

I also think it would be great to add some additional testing in the form of aggregate fuzz testing (mostly for the take_n logic). I have some ideas (in #12847) that I hope to refine tomorrow

alamb · 2024-10-13T12:04:40Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+
+        // The `n == len` case, we need to take all
+        if self.len() == n {
+            let new_builder = Self::new().with_max_block_size(self.max_block_size);


alamb · 2024-10-13T12:05:22Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+        //   - Get the last related `buffer index`(let's name it `buffer index n`)
+        //     from last non-inlined `view`
+        //
+        //   - Take buffers, the key is that we need to know if we need to take


Thank you for these comments. Very nice 💯

alamb · 2024-10-13T12:10:52Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+        //   6. Take non-inlined + while last buffer in ``in_progress`
+        //   7. Take all views at once
+
+        let mut builder =


alamb · 2024-10-13T12:38:29Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+
+        if let Some(view) = last_non_inlined_view {
+            let view = ByteView::from(*view);
+            let last_related_buffer_index = view.buffer_index as usize;


I think a name like last_remaining_buffer_index might be clearer about what this quantity represents

alamb · 2024-10-13T12:39:42Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+
+            // Build array and return
+            let views = ScalarBuffer::from(first_n_views);
+            Arc::new(GenericByteViewArray::<B>::new(views, buffers, null_buffer))


as above, I think we should use new_unchecked here as all the data is valid by construction (maybe we could keep the check in debug builds)
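The idea in the comment above (skip re-validation when the data is valid by construction, but keep the check in debug builds) can be sketched with the standard library's analogous pair `String::from_utf8` / `String::from_utf8_unchecked`. This is only an analogy under that assumption: `concat_trusted` is a hypothetical helper, not code from this PR, and it does not use the arrow `GenericByteViewArray` API.

```rust
/// Hypothetical helper (not from the PR): concatenates string slices into a
/// `String` without re-validating the result as UTF-8 in release builds.
fn concat_trusted(parts: &[&str]) -> String {
    let total_len: usize = parts.iter().map(|part| part.len()).sum();
    let mut bytes = Vec::with_capacity(total_len);
    for part in parts {
        bytes.extend_from_slice(part.as_bytes());
    }

    // Keep the (comparatively expensive) validation in debug builds only;
    // `debug_assert!` is compiled out when debug assertions are disabled.
    debug_assert!(
        std::str::from_utf8(&bytes).is_ok(),
        "bytes must be valid UTF-8 by construction"
    );

    // SAFETY: `bytes` is a concatenation of `&str` values, each of which is
    // valid UTF-8, so the concatenation is valid UTF-8 as well.
    unsafe { String::from_utf8_unchecked(bytes) }
}
```

The trade-off is the usual one for unchecked constructors: the validation cost disappears from release builds, and correctness then rests on the invariant stated in the `SAFETY` comment.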

alamb · 2024-10-13T12:40:51Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+            .rev()
+            .find(|view| ((**view) as u32) > 12);
+
+        if let Some(view) = last_non_inlined_view {


Stylistically, you could reduce the indenting in this function by using a let else, like

    let Some(view) = last_non_inlined_view else {
        let views = ScalarBuffer::from(first_n_views);
        return Arc::new(GenericByteViewArray::<B>::new(
            views,
            Vec::new(),
            null_buffer,
        ));
    };
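For readers unfamiliar with the `> 12` check in the hunk above, here is a minimal standalone sketch of how a view packed into a `u128` can be decoded, combined with the `let else` style suggested in the comment. It assumes the Arrow view layout (length in the low 32 bits; for values longer than 12 bytes, then a 4-byte prefix, a buffer index, and an offset). `ViewParts`, `pack_view`, `decode_view`, and `last_referenced_buffer` are hypothetical names for illustration, not the PR's code.

```rust
/// Values of at most this many bytes are stored inline in the view itself.
const MAX_INLINE_LEN: u32 = 12;

/// Fields of a non-inlined view (hypothetical struct for this sketch).
#[derive(Debug, PartialEq)]
struct ViewParts {
    length: u32,
    prefix: u32,
    buffer_index: u32,
    offset: u32,
}

/// Packs the four 32-bit fields into a `u128`, lowest bits first.
fn pack_view(parts: &ViewParts) -> u128 {
    (parts.length as u128)
        | ((parts.prefix as u128) << 32)
        | ((parts.buffer_index as u128) << 64)
        | ((parts.offset as u128) << 96)
}

/// Unpacks a view; only meaningful when the length exceeds `MAX_INLINE_LEN`.
fn decode_view(view: u128) -> ViewParts {
    ViewParts {
        length: view as u32,
        prefix: (view >> 32) as u32,
        buffer_index: (view >> 64) as u32,
        offset: (view >> 96) as u32,
    }
}

/// Scans from the back for the last non-inlined view and returns the index
/// of the data buffer it points into, or `None` if every view is inlined.
fn last_referenced_buffer(first_n_views: &[u128]) -> Option<u32> {
    let Some(view) = first_n_views
        .iter()
        .rev()
        .find(|view| (**view as u32) > MAX_INLINE_LEN)
    else {
        // All views are inlined, so no data buffer is referenced at all.
        return None;
    };
    Some(decode_view(*view).buffer_index)
}
```

The early return keeps the "all inlined" case out of the way, so the main path is not nested inside an `if let`.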

alamb · 2024-10-13T12:42:11Z

datafusion/physical-plan/src/aggregates/group_values/group_column.rs

+    fn take_buffers_with_partial_last(
+        &mut self,
+        last_related_buffer_index: usize,
+        take_len: usize,


maybe we could call this last_take_len or something to note it is the number of bytes being taken from the last buffer
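To make the two parameters concrete, here is a simplified model of taking buffers when the last one is only partially consumed. It is a sketch under stated assumptions, not the PR's implementation: buffers are modelled as plain `Vec<u8>` values, the partially consumed buffer is assumed to stay with the builder in full (so remaining views keep their offsets), and `shift_buffer_index` is a hypothetical helper showing how remaining views might be re-pointed.

```rust
/// Simplified model: moves out every buffer before
/// `last_remaining_buffer_index` and copies the first `last_take_len` bytes
/// of the buffer at that index, which stays behind as the new first buffer.
///
/// Panics if the index or length is out of range.
fn take_buffers_with_partial_last(
    buffers: &mut Vec<Vec<u8>>,
    last_remaining_buffer_index: usize,
    last_take_len: usize,
) -> Vec<Vec<u8>> {
    // Buffers before the last one are referenced only by taken rows.
    let mut taken: Vec<Vec<u8>> =
        buffers.drain(0..last_remaining_buffer_index).collect();

    // The last buffer is shared with the remaining rows, so copy its prefix.
    let partial = buffers[0][..last_take_len].to_vec();
    taken.push(partial);
    taken
}

/// Hypothetical helper: lowers the buffer index of a non-inlined view by
/// `shift`; inlined views (length <= 12) are returned unchanged.
///
/// Assumes `shift` does not exceed the view's current buffer index.
fn shift_buffer_index(view: u128, shift: u32) -> u128 {
    if (view as u32) <= 12 {
        return view;
    }
    let buffer_index = (view >> 64) as u32;
    // Clear bits 64..96, then write the shifted index back.
    let cleared = view & !(0xFFFF_FFFF_u128 << 64);
    cleared | (((buffer_index - shift) as u128) << 64)
}
```

In this model `last_take_len` counts bytes taken from the last buffer only, which is what the proposed name is meant to convey.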


Rachelint · 2024-10-13T13:50:16Z

> Thank you so much @Rachelint -- this looks so great. I found it well commented, well structured, and well tested.
>
> cc @jayzhan211 your GroupColumn pattern is really working well
>
> There are two test cases I think we need to cover (below), but otherwise I think this PR is good to go.
>
> I am also testing this PR with some other things to see if we can get the string view code enabled by default (finally): #12092
>
> I also ran test coverage of this PR like this:
> cargo llvm-cov --html  -p datafusion-physical-plan --lib
> Here is the report: coverage.zip
> ...
> I also think it would be great to add some additional testing in the form of aggregate fuzz testing (mostly for the take_n logic). I have some ideas (in #12847) that I hope to refine tomorrow

The readability comments have been addressed.

I am adding tests to improve test coverage.

Rachelint · 2024-10-13T17:57:10Z

@alamb 👍 Thanks for the reminder about test coverage.

After checking the code again more carefully, I found that some test cases indeed did not cover the code paths I expected.

I have refined the tests for equal_to and take_n, and according to the report all related paths are now covered!

Rachelint added 2 commits October 8, 2024 14:35

define ByteGroupValueViewBuilder.

ca033e0

impl append.

ffcc1a2

github-actions bot added the physical-expr Physical Expressions label Oct 8, 2024

alamb mentioned this pull request Oct 9, 2024

Performance: Add "read strings as binary" option for parquet #12788

Open

impl equal to.

4842965

Rachelint force-pushed the impl-byte-view-column branch from ac96b5d to 4842965 Compare October 9, 2024 17:49

Rachelint added 2 commits October 10, 2024 01:52

fix compile.

66bb7be

fix comments.

ef1efce

alamb mentioned this pull request Oct 10, 2024

Enable datafusion.execution.parquet.schema_force_string_view by default #11682

Open

alamb reviewed Oct 11, 2024

Rachelint added 4 commits October 12, 2024 03:40

impl take_n.

152a8b1

impl build.

d61c3ec

impl rest functions in GroupColumn.

151377e

fix output when panic.

63e11cb

alamb mentioned this pull request Oct 11, 2024

Impl byte view column #12883

Draft

add e2e sql tests.

15d8349

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 12, 2024

Rachelint added 4 commits October 12, 2024 19:15

add unit tests.

d9ee724

switch to a really elegant style codes from alamb.

beffa35

fix take_n.

46822f9

improve comments.

3a93584

Rachelint added 2 commits October 13, 2024 01:40

fix compile.

f99f55c

fix clippy.

37b4816

Rachelint force-pushed the impl-byte-view-column branch from 5221185 to ab4c198 Compare October 13, 2024 09:08

define more testcases in test_byte_view_take_n.

d78c68d

Rachelint force-pushed the impl-byte-view-column branch from ab4c198 to d78c68d Compare October 13, 2024 09:09

connect up.

7cb7dfc

Rachelint marked this pull request as ready for review October 13, 2024 10:53

fix doc.

e6c7e7e

Rachelint changed the title ~~[WIP] Impl byte view column~~ Impl byte view column Oct 13, 2024

alamb changed the title ~~Impl byte view column~~ Implement special GroupCollumn support for byte view Oct 13, 2024

alamb changed the title ~~Implement special GroupCollumn support for byte view~~ Implement special GroupColumn support for byte view Oct 13, 2024

Do not re-validate output is utf8

36d556e

alamb reviewed Oct 13, 2024

This was referenced Oct 13, 2024

Do not re-validate output is utf8 Rachelint/arrow-datafusion#1

Merged

Enable reading StringViewArray by default from Parquet #12092

Draft

alamb reviewed Oct 13, 2024

Rachelint and others added 3 commits October 13, 2024 21:11

Merge pull request #1 from alamb/alamb/tweak-group

f76c376

Do not re-validate output is utf8

switch to unchecked when building array.

1fd926f

improve naming.

34918cb

use let else to make the codes clearer.

8348024

Rachelint force-pushed the impl-byte-view-column branch from 5fed4eb to 8348024 Compare October 13, 2024 13:54

Rachelint added 2 commits October 13, 2024 21:59

fix typo.

023ed64

improve unit test coverage for ByteViewGroupValueBuilder.

c4d45c7

Rachelint force-pushed the impl-byte-view-column branch from 68b0eba to c4d45c7 Compare October 13, 2024 17:49
