Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow using dictionary arrays as filters #12382

Merged
merged 3 commits into from
Sep 10, 2024
Merged

Conversation

adriangb
Copy link
Contributor

@adriangb adriangb commented Sep 8, 2024

Fixes #12380

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Sep 8, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that a DictionaryArray with Boolean values is likely to only ever perform worse than a native BooleanArray

The rationale is that each element of a DictionaryArray takes at least 1 byte (for the index into the dictionary when indexed by Int8/UInt8) where each element in a BooleanArray consumes a single bit

So while I think this code is fine and we could merge it, the feature seems suspicious to me 🤔

@adriangb
Copy link
Contributor Author

adriangb commented Sep 8, 2024

I agree, I was thinking the same thing. The context is a kernel that operates on arrays, the output of which may or may not be used in a filter. I guess the kernel could do the conversion but I figured if we can support it here Filter could do the conversion internally (if it makes sense, maybe it already is?)

@alamb
Copy link
Contributor

alamb commented Sep 8, 2024

I agree, I was thinking the same thing. The context is a kernel that operates on arrays, the output of which may or may not be used in a filter. I guess the kernel could do the conversion but I figured if we can support it here Filter could do the conversion internally (if it makes sense, maybe it already is?)

I think this conversion is typically done in coercion -- like in https://docs.rs/datafusion/latest/datafusion/logical_expr/type_coercion/binary/fn.comparison_coercion.html

(which basically means DataFusion would apply a cast internally to cast Dictionary(*, Boolean) --> Boolean

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to allow Dictionary as a filter (even though I think subsequent processing will effectively cast it to a BooleanArray)

@Dandandan
Copy link
Contributor

I think that functions like json_contains shouldn't create a DictionaryArray but rather a BooleanArray directly (so no conversion is needed).

@adriangb
Copy link
Contributor Author

adriangb commented Sep 8, 2024

Yeah I agree, that's probably the better way to do things but does it hurt to also add this support? I guess it could be a performance foot gun to allow this, although I don't know how bad it would be. The alternative of not implementing this is the foot gun of: you implement a kernel, write some tests for it but not this specific case and then discovering via a user reporting a bug that DataFusion has this limitation and have to go patch your kernel to protect against this case.

@alamb
Copy link
Contributor

alamb commented Sep 9, 2024

Yeah I agree, that's probably the better way to do things but does it hurt to also add this support? I guess it could be a performance foot gun to allow this, although I don't know how bad it would be.

Yeah, I think the potential issue would be that your code creates these DictionaryArrays of Boolean that is quite a bit slower if you had created BooleanArray directly.

If DataFusion errors, then you know to go try and fix it, but if it doesn't error, then you may never know

That being said, I think this particular change is fine and that we could merge it for the reasons discussed above.

@alamb alamb merged commit 8d2b240 into apache:main Sep 10, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Sep 10, 2024

Thanks again @adriangb and @Dandandan

@samuelcolvin
Copy link
Contributor

See #12511 which is closely related

@adriangb adriangb deleted the dict-null branch September 17, 2024 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dictionary columns (with a Boolean values) can't be used directly as filters
4 participants