Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support dictionary data type in array_to_string #10908

Merged
merged 10 commits into from
Jun 23, 2024

Conversation

EduardoVega
Copy link
Contributor

@EduardoVega EduardoVega commented Jun 14, 2024

Which issue does this PR close?

Closes #10862

Rationale for this change

Go to issue.

What changes are included in this PR?

Dictionary is added as a supported data type in compute_array_to_string, so that it can properly handle a Dictionary Array.

Are these changes tested?

Yes

Are there any user-facing changes?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you su much for this contribution @EduardoVega

Can you please add some tests, ideally in sqllogicteset format?

Here are the instructions: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest

Ideally you should be able to extend one of the existing test files in https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files with the @appletreeisyellow 's reproduer from #10862

For example, this might be a good spit

## array_to_string (aliases: `list_to_string`, `array_join`, `list_join`)

datafusion/functions-array/src/string.rs Show resolved Hide resolved
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 16, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again for this PR @EduardoVega -- very much appreciated

I still think this PR is not quite correct as it is operating on the dictionary rather than the logical values of the dictionary array.

I pushed a new commit to your branch with some additional tests that I think show wrong results.

I can help fix the code too, but figured I would leave the feedback in case you wanted to do so.

format!("could not cast {} to AnyDictionaryArray", arr.data_type())
)
})?;
compute_array_to_string(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed some tests showing how this results in incorrect values.

I think you may be able to call to_string! directly to iterate over the values of the dictionary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the feedback, let me figure it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb thanks for the tests, I think I have a better understanding of DictionaryArray.

Based on my understanding, I converted dictionary keys and values to something iterable. Then using the keys to access the values I got the logical array. I think tests are now showing correct results.

Finally, I am happy to work on any new feedback I receive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing @EduardoVega -- thank you so much -- this is quite an impressive first contribution. I didn't realize it was going to be quite as tricky.

@alamb alamb marked this pull request as draft June 17, 2024 16:24
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for working on this @EduardoVega -- it is quite a bit more complicated than I realized.

@@ -281,6 +283,49 @@ pub(super) fn array_to_string_inner(args: &[ArrayRef]) -> Result<ArrayRef> {

Ok(arg)
}
Dictionary(_, _) => {
let any_dict_array = arr.as_any_dictionary_opt().ok_or_else(|| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @EduardoVega -- it is very impressive you got this to work

I am somewhat concerned that this is quite a bit of code (it actually handles dictionaries of any type, not just string dictionaries, and that it also supports arbitrary key types -- not just the integer types specified)

I am going to try and simplify this a bit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 5a5397b that I think simplifies the implementation a bit -- let me know what you think

($ARRAY_TYPE:ident) => {{
for x in downcast_arg!(any_dict_array.keys(), $ARRAY_TYPE) {
if let Some(x) = x {
keys.push(x.to_string().parse::<usize>().unwrap());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this works, but converts integers to strings just to parse them again -- I think we can just support string dictionaries and avoid this

@alamb alamb marked this pull request as ready for review June 21, 2024 21:45
@alamb alamb merged commit 8d8dd90 into apache:main Jun 23, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 23, 2024

Thanks again @EduardoVega and @Weijun-H

appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jun 24, 2024
* Support dictionary data type in array_to_string

* Fix import

* Some tests

* Update datafusion/functions-array/src/string.rs

Co-authored-by: Alex Huang <[email protected]>

* Add some tests showing incorrect results

* Get logical array

* apply rust fmt

* Simplify implementation, avoid panics

---------

Co-authored-by: Alex Huang <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jun 25, 2024
* Support dictionary data type in array_to_string

* Fix import

* Some tests

* Update datafusion/functions-array/src/string.rs

Co-authored-by: Alex Huang <[email protected]>

* Add some tests showing incorrect results

* Get logical array

* apply rust fmt

* Simplify implementation, avoid panics

---------

Co-authored-by: Alex Huang <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jun 26, 2024
* Support dictionary data type in array_to_string

* Fix import

* Some tests

* Update datafusion/functions-array/src/string.rs

Co-authored-by: Alex Huang <[email protected]>

* Add some tests showing incorrect results

* Get logical array

* apply rust fmt

* Simplify implementation, avoid panics

---------

Co-authored-by: Alex Huang <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* Support dictionary data type in array_to_string

* Fix import

* Some tests

* Update datafusion/functions-array/src/string.rs

Co-authored-by: Alex Huang <[email protected]>

* Add some tests showing incorrect results

* Get logical array

* apply rust fmt

* Simplify implementation, avoid panics

---------

Co-authored-by: Alex Huang <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

array_to_string panic on dictionary values
3 participants