-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement special GroupColumn
support for byte view
#12809
base: main
Are you sure you want to change the base?
Conversation
ac96b5d
to
4842965
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great -- I am going to try and hook it up and write a few tests
Thanks @Rachelint
Thanks @alamb , I am working on implementing the rest main function |
The rest work is to add tests. |
Amazing @Rachelint -- thank you -- I actually hacked a bit on it too on a plane ride -- I pushed what I had here: #12883 Maybe you can use / repurpose the tests. I'll try and find time to review this weekend, but I may not have as much time as normal |
Thanks, it helps much! |
It is close to be ready, let's add more unit testcases before. |
5221185
to
ab4c198
Compare
ab4c198
to
d78c68d
Compare
Thanks help from @alamb , I think this pr is ready now. |
GroupCollumn
support for byte view
GroupCollumn
support for byte viewGroupColumn
support for byte view
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started reviewing / testing this. It is looking great so far. Leaving a partial review while I work on the rest of it
|
||
let views = ScalarBuffer::from(views); | ||
|
||
Arc::new(GenericByteViewArray::<B>::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I tried this locally, the performance goes from
Elapsed 0.612 seconds.
to
Elapsed 0.528 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposed PR here: Rachelint#1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice inspect, I am checking it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @Rachelint -- this looks so great. I found it well commented, well structured, and well tested.
cc @jayzhan211 your GroupColumn pattern is really working well
There are two test cases I think we need to cover (below), but otherwise I think this PR is good to go.
I am also testing this PR with some other things to see if we can get the string view code enabled by default (finally): #12092
I also ran test coverage of this PR like this:
cargo llvm-cov --html -p datafusion-physical-plan --lib
Here is the report:
coverage.zip
In general very nice job with coverage. There are a few items that appear to be untested:
I also think it would be great to add some additional testing in fhe form of aggregate fuzz testing (mostly for the take_n
logic). I have some ideas (in #12847) that I hope to refine tomorrow
|
||
// The `n == len` case, we need to take all | ||
if self.len() == n { | ||
let new_builder = Self::new().with_max_block_size(self.max_block_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
// - Get the last related `buffer index`(let's name it `buffer index n`) | ||
// from last non-inlined `view` | ||
// | ||
// - Take buffers, the key is that we need to know if we need to take |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for these comments. Very nice 💯
// 6. Take non-inlined + while last buffer in ``in_progress` | ||
// 7. Take all views at once | ||
|
||
let mut builder = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
|
||
if let Some(view) = last_non_inlined_view { | ||
let view = ByteView::from(*view); | ||
let last_related_buffer_index = view.buffer_index as usize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a name like last_remaining_buffer_index
might be clearer about what this quantity represents
|
||
// Build array and return | ||
let views = ScalarBuffer::from(first_n_views); | ||
Arc::new(GenericByteViewArray::<B>::new(views, buffers, null_buffer)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above, I think we should use new_unchecked
here as all the data is valid by construction (maybe we could keep the check in debug builds)
.rev() | ||
.find(|view| ((**view) as u32) > 12); | ||
|
||
if let Some(view) = last_non_inlined_view { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stylistically, you could reduce the indenting in this function by using a let else
, like
let Some(view) = last_non_inlined_view else {
let views = ScalarBuffer::from(first_n_views);
return Arc::new(GenericByteViewArray::<B>::new(
views,
Vec::new(),
null_buffer,
))
}
fn take_buffers_with_partial_last( | ||
&mut self, | ||
last_related_buffer_index: usize, | ||
take_len: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we could call this last_take_len
or something to note it is the number of bytes being taken from the last buffer
Do not re-validate output is utf8
Comments about readability improvement are fixed. I am adding test for better test coverage. |
5fed4eb
to
8348024
Compare
68b0eba
to
c4d45c7
Compare
@alamb 👍 Thanks for reminding about the test coverage. After checking the codes again more carefully, I found some testcases indeed don't cover code paths as I expected. I have refined the tests for |
Which issue does this PR close?
Closes #12771
Rationale for this change
The new column based multi gourp by values impl is proved to be performant, but it is still not supported for byte view column now.
This pr will support this for getting better performance when we enable string view by default.
What changes are included in this PR?
Support new excellent column based multi group values for byte view column.
Are these changes tested?
Yes, test by new unit tests and e2e tests (most of them helped by @alamb )
Are there any user-facing changes?
No.