Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid concat in array_append #8137

Merged
merged 2 commits into from
Nov 13, 2023

Conversation

jayzhan211
Copy link
Contributor

Which issue does this PR close?

Follow up on #8108 (review)

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added the physical-expr Physical Expressions label Nov 11, 2023
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211 jayzhan211 marked this pull request as ready for review November 11, 2023 10:13
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211, looks good
Wondering if you can provide some numbers on how performance changed with the new method?

You can create a small local benchmark on zillions of values to check before and after.

@jayzhan211

This comment was marked as outdated.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Nov 12, 2023

unit: millis

Single Row

Number each row This Change Previous
1e8 116 178
2e8 391 252
5e8 1340 1533
8e8 2220 2478
1e9 16536 25460

Two Row

Number each row This Change Previous
1e8 251 1003
2e8 1043 2422
5e8 13155 41468

@comphead
Copy link
Contributor

Thanks @jayzhan211
Numbers for two rows is definitely promising.
However I'm not sure I'm following what is a single row or two row.
perhaps you can attach a benchmark code to make it more clear?

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Nov 12, 2023

Thanks @jayzhan211 Numbers for two rows is definitely promising. However I'm not sure I'm following what is a single row or two row. perhaps you can attach a benchmark code to make it more clear?

Single row is like [[1,2,3]] or ListArray( Int64Array(1,2,3) ).
Two row is like [[1,2,3], [4,5,6]] or ListArray (Int64Array(1,2,3), (4,5,6)) which is the target we are trying to optimized. We dont have to compute::concat each rows at the end instead we extend each row we need while iterating. This is why we can see Two row in this change outperform the original code.

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @jayzhan211

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 and @comphead -- this is so nice. It will be amazing if you could apply the same transformation to the rest of the functions in this module

values,
None,
)?)
return general_append_and_prepend(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

giphy

@alamb alamb merged commit d21a40c into apache:main Nov 13, 2023
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants