Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Utf8View for lpad scalar function #11941

Merged
merged 1 commit into from
Aug 13, 2024

Conversation

Omega359
Copy link
Contributor

Which issue does this PR close?

Closes #11857

Rationale for this change

Support utf8view for lpad udf.

What changes are included in this PR?

Code, tests.

Are these changes tested?

Yes.

Are there any user-facing changes?

No

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 12, 2024
@Omega359 Omega359 changed the title Update LPAD scalar function to support Utf8View Implement Utf8View for lpad scalar function Aug 12, 2024
@Omega359 Omega359 marked this pull request as ready for review August 12, 2024 02:13
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very impressive @Omega359 👏 I especially like the testing improvements of this PR

I mentioned some places this function could be made faster, but that can totally be done as a follow on (maybe waiting for someone for whom the performance of lpad is important). I can file a follow on ticket

s.push_str(string);
Ok(Some(s))
}
let mut s: String = " ".repeat(length - graphemes.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR just follows the existing pattern so I don't think any changes are needed, but I think using a GenericStringBuilder would be much faster here (as it wouldn't allocate a string for each output array element)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a test locally with using GenericStringBuilder vs what is in this PR and the differences are within +/- 10% for 1024 and 2048 element arrays for each of the 3 string types. Worth noting though is that timing between runs on my laptop is not really stable at all (+/- 10% or so between runs of the exact same code).

Interestingly the utf8view seems to be consistently slightly slower than just plain utf8 at least in this function.

lpad function/utf8 type/2048
                        time:   [715.78 µs 718.43 µs 721.89 µs]
                        
lpad function/largeutf8 type/2048
                        time:   [745.44 µs 759.62 µs 779.84 µs]
                       
lpad function/stringview type/2048
                        time:   [734.25 µs 751.13 µs 773.67 µs]
                       

let length = if length < 0 { 0 } else { length as usize };
if length == 0 {
Ok(Some("".to_string()))
if length < graphemes.len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise here I think GenericStringBuilder would likely be faster

@Omega359
Copy link
Contributor Author

I was looking at @Lordworms implementation in #11942 and I think it would make sense to align the two implementations. In some aspects I like his approach much more than the one I took in this PR.

@alamb
Copy link
Contributor

alamb commented Aug 12, 2024

I was looking at @Lordworms implementation in #11942 and I think it would make sense to align the two implementations. In some aspects I like his approach much more than the one I took in this PR.

Do you think we should merge this PR and work on improvements in a follow on PR?

@Omega359
Copy link
Contributor Author

Do you think we should merge this PR and work on improvements in a follow on PR?

I have no objections to that.

@alamb alamb merged commit 508da80 into apache:main Aug 13, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented Aug 13, 2024

Thanks @Omega359 -- let's merge this to get the initial implementation and tests in and then we can work on improvements as a follow on PR.

Let me know if you would like me to file a ticket for

I was looking at @Lordworms implementation in #11942 and I think it would make sense to align the two implementations. In some aspects I like his approach much more than the one I took in this PR.

@Omega359
Copy link
Contributor Author

Let me know if you would like me to file a ticket for

I was looking at @Lordworms implementation in #11942 and I think it would make sense to align the two implementations. In some aspects I like his approach much more than the one I took in this PR.

I've got the work done locally in a branch along with the GenericStringBuilder work and the benchmark to go with it. I'll create a ticket tomorrow for it thanks @alamb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update LPAD scalar function to support Utf8View
2 participants