Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dictionary array support for substring function #1665

Merged
merged 4 commits into from
May 9, 2022

Conversation

sunchao
Copy link
Member

@sunchao sunchao commented May 6, 2022

Which issue does this PR close?

Closes #1656.

Rationale for this change

Currently the substring kernel only support "plain" arrays but not dictionary encoded ones. With dictionary array, the compute could be much more efficient since it only needs to be done on the dictionary values.

What changes are included in this PR?

This PR adds the support of dictionary array for substring kernel.

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 6, 2022
/// let error = substring(&array, 0, Some(5)).unwrap_err().to_string();
/// assert!(error.contains("invalid utf-8 boundary"));
/// ```
pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this func to the beginning of the file, before all other non-public ones, for better readability.

DataType::Dictionary(kt, _) => {
substring_dict!(
kt,
Int8: Int8Type,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may make this shorter via concat_idents (e.g., concat_idents($t, Type)) but it's only available in nightly.

@sunchao sunchao added the enhancement Any new improvement worthy of a entry in the changelog label May 6, 2022
@@ -954,6 +992,56 @@ mod tests {
without_nulls_generic_string::<i64>()
}

#[test]
fn dictionary() -> Result<()> {
Copy link
Member

@viirya viirya May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn dictionary() -> Result<()> {
fn test_substring_dictionary() -> Result<()> {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not necessary to add test_ prefix for Rust tests since they are already under the tests module. The substring here also seem redundant since the full test name compute::kernels::substring::tests::dictionary already contain it.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. A few minor comments.

@codecov-commenter
Copy link

codecov-commenter commented May 6, 2022

Codecov Report

Merging #1665 (6203019) into master (a38e460) will increase coverage by 0.05%.
The diff coverage is 93.61%.

@@            Coverage Diff             @@
##           master    #1665      +/-   ##
==========================================
+ Coverage   83.10%   83.16%   +0.05%     
==========================================
  Files         193      193              
  Lines       55864    56039     +175     
==========================================
+ Hits        46425    46603     +178     
+ Misses       9439     9436       -3     
Impacted Files Coverage Δ
arrow/src/compute/kernels/substring.rs 99.36% <93.61%> (-0.64%) ⬇️
arrow/src/array/array.rs 89.67% <0.00%> (-3.04%) ⬇️
parquet/src/arrow/array_reader/test_util.rs 81.08% <0.00%> (-2.26%) ⬇️
arrow/src/ipc/reader.rs 88.80% <0.00%> (-0.23%) ⬇️
arrow/src/compute/kernels/take.rs 95.27% <0.00%> (-0.14%) ⬇️
parquet/src/arrow/array_reader.rs 89.76% <0.00%> (-0.07%) ⬇️
arrow-flight/src/utils.rs 0.00% <0.00%> (ø)
integration-testing/src/lib.rs 0.00% <0.00%> (ø)
arrow/src/array/array_binary.rs 93.30% <0.00%> (ø)
arrow/src/array/array_string.rs 97.73% <0.00%> (ø)
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a38e460...6203019. Read the comment docs.

/// let error = substring(&array, 0, Some(5)).unwrap_err().to_string();
/// assert!(error.contains("invalid utf-8 boundary"));
/// ```
pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {
Copy link
Contributor

@HaoYang670 HaoYang670 May 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit: Maybe we could let length be Option<u32>. Because the longest length will not exceed 1<<31 - 1 (for LargeBinaryArray and LargeStringArray)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think this is not quite related to this PR. I can open another one for the change.

/// ```
///
/// # Error
/// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may also update that Dictionary arrays with [large]string/[large]binary values are also accepted here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Updated.

@alamb
Copy link
Contributor

alamb commented May 9, 2022

Thank you @sunchao ❤️

@viirya viirya merged commit daed6ab into apache:master May 9, 2022
@viirya
Copy link
Member

viirya commented May 9, 2022

Merged, thanks @sunchao @HaoYang670 @alamb

@sunchao sunchao deleted the dict-substring branch May 9, 2022 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add dictionary array support for substring function
5 participants