Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix strpos invocation with dictionary and null #12712

Merged
merged 1 commit into from
Oct 3, 2024

Conversation

findepi
Copy link
Member

@findepi findepi commented Oct 2, 2024

In 1b3608d strpos signature was modified to indicate it supports dictionary as input argument, but the invoke method doesn't support them.

In 1b3608d `strpos` signature was
modified to indicate it supports dictionary as input argument, but the
invoke method doesn't support them.
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions labels Oct 2, 2024
Comment on lines +47 to +56
vec![
Exact(vec![Utf8, Utf8]),
Exact(vec![Utf8, LargeUtf8]),
Exact(vec![LargeUtf8, Utf8]),
Exact(vec![LargeUtf8, LargeUtf8]),
Exact(vec![Utf8View, Utf8View]),
Exact(vec![Utf8View, Utf8]),
Exact(vec![Utf8View, LargeUtf8]),
],
Volatility::Immutable,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This restores the code before 1b3608d . These are the types actually supported by invoke.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we invert the order? From another function we have this comment:

// Planner attempts coercion to the target type starting with the most preferred candidate.
// For example, given input (Utf8View, Int64), it first tries coercing to (Utf8View, Int64).
// If that fails, it proceeds to (Utf8, Int64).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good comment. I just restored the code that used to be here.
How function implementor can know what is the preferred order? Shouldn't this rather be engine's responsibility, if its given a choice? Unless of course the preferred candidate is function-specific.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question and I don't have the answer. I just have followed that advice since I first saw it a month or two back. I agree that I would hope the coercing logic could select the most optimal candidate.

select position(1 in 1)
----
1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query used to fail also before 1b3608d , i don't know why.

According to

// Any type can be coerced into strings
(Utf8 | LargeUtf8, _) => Some(type_into.clone()),
, every type is implicitly coercible into a Utf8

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this function only accept string? Why do we return 1 now

@jayzhan211
Copy link
Contributor

I think it is better to introduce Signature::String for this issue instead of the old approach that iterating all the possible types

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @findepi

I think it is better to introduce Signature::String for this issue instead of the old approach that iterating all the possible types

I agree @jayzhan211 -- However, I also think we could do it as a follow on PR.

One of the challenges will be to clearly document which of the many potential combinations Signature::String can produce (as the function will have to handle all combinations)

@goldmedal
Copy link
Contributor

Thanks @findepi for working this 👍
There are some TODO items related to this issue. Could you help to address them?

@alamb
Copy link
Contributor

alamb commented Oct 3, 2024

There are some TODO items related to this issue. Could you help to address them?

I will do this as a follow on PR

Update in #12739

@alamb alamb merged commit 1f2f02f into apache:main Oct 3, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Oct 3, 2024

Thank you @findepi and @goldmedal

@findepi findepi deleted the findepi/strpos-null branch October 3, 2024 15:13
@findepi
Copy link
Member Author

findepi commented Oct 3, 2024

thanks @alamb for the merge and sorry for not following on @goldmedal 's #12712 (comment) yet.

@alamb
Copy link
Contributor

alamb commented Oct 3, 2024

thanks @alamb for the merge and sorry for not following on @goldmedal 's #12712 (comment) yet.

No worries -- thanks for the PR -- I am just trying to keep the code flowing and figured I would just make a small PR to get this one in

@jayzhan211
Copy link
Contributor

jayzhan211 commented Oct 4, 2024

@alamb @findepi Does anyone know why select position(1 in 1) is not failing now? Isn't this a regression? The change of the signature is actually incorrect IMO.

Upd: Can check out #12751 for the fix

@findepi
Copy link
Member Author

findepi commented Oct 4, 2024

Does anyone know why select position(1 in 1) is not failing now?

i called this out in #12712 (comment)

in fact, i don't understand why it was failing before -- #12308 (comment)

In DataFusion everything is coercible into Utf8 so if position(a in b) works for strings, it should work for numbers too (unfortunately).

Upd: Can check out #12751 for the fix

Thanks, will take a look!

@jayzhan211
Copy link
Contributor

jayzhan211 commented Oct 4, 2024

In DataFusion everything is coercible into Utf8 so if position(a in b) works for strings, it should work for numbers

I don't believe this is accurate. Ideally, the behavior should be customizable, with default behavior similar to Postgres. In this case, the query should raise an error rather than automatically coercing the value to a string.

@findepi
Copy link
Member Author

findepi commented Oct 7, 2024

i based my comment on this piece of code

// Any type can be coerced into strings
(Utf8 | LargeUtf8, _) => Some(type_into.clone()),

@jayzhan211
Copy link
Contributor

jayzhan211 commented Oct 7, 2024

i based my comment on this piece of code

// Any type can be coerced into strings
(Utf8 | LargeUtf8, _) => Some(type_into.clone()),

Great, that is the code path not used for signature like user-defined, numeric and so on. It is historic reason, I hope we can deprecate coerce_from in the future.

fn is_well_supported_signature(type_signature: &TypeSignature) -> bool {
if let TypeSignature::OneOf(signatures) = type_signature {
return signatures.iter().all(is_well_supported_signature);
}
matches!(
type_signature,
TypeSignature::UserDefined
| TypeSignature::Numeric(_)
| TypeSignature::Coercible(_)
| TypeSignature::Any(_)
)
}

@findepi
Copy link
Member Author

findepi commented Oct 7, 2024

coerced_from is used for "most signatures" starting from here:

let coerced_types = datafusion_expr::type_coercion::functions::data_types(
arg_types,
&self.signature,

including Exact, which seems the most typical signature definiiotn, so i understood it as the default coercion rules.

But you're right, this function is not used for UserDefined, Numeric and Coercible signatures. Why so?

fn is_well_supported_signature(type_signature: &TypeSignature) -> bool {
if let TypeSignature::OneOf(signatures) = type_signature {
return signatures.iter().all(is_well_supported_signature);
}
matches!(
type_signature,
TypeSignature::UserDefined
| TypeSignature::Numeric(_)
| TypeSignature::Coercible(_)
| TypeSignature::Any(_)
)
}

thanks for the link.
What is the definition of is_well_supported_signature? What is the difference between "well supported signature" and "not so well supported one"?

@jayzhan211
Copy link
Contributor

jayzhan211 commented Oct 7, 2024

The main difference is that that don't call coerce_from.

The reason I introduce those new signature that avoid coerce_from is that, coerce_from is a single large coercion rule function, any change of it has unknown effect for all the downstream project as well, and it is also not designed for extensibility. I think we can use UserDefined by default and introduce more signature like Numeric, String, Coercible as a general signature, if we found the coercion rule fit more than one function.

After replacing all the function with well_supported_signature, I think the coercion rule will be simplified and easy to reason, any change of it is also more in control.

BUT, the whole plan seems better if we have logical type supported #12622, therefore I didn't push the plan for now.

alamb added a commit that referenced this pull request Oct 8, 2024
* Add support for external tables with qualified names (#12645)

* Make  support schemas

* Set default name to table

* Remove print statements and stale comment

* Add tests for create table

* Fix typo

* Update datafusion/sql/src/statement.rs

Co-authored-by: Jonah Gao <[email protected]>

* convert create_external_table to objectname

* Add sqllogic tests

* Fix failing tests

---------

Co-authored-by: Jonah Gao <[email protected]>

* Fix Regex signature types (#12690)

* Fix Regex signature types

* Uncomment the shared tests in string_query.slt.part and removed tests copies everywhere else

* Test `LIKE` and `MATCH` with flags; Remove new tests from regexp.slt

* Refactor `ByteGroupValueBuilder` to use `MaybeNullBufferBuilder` (#12681)

* Fix malformed hex string literal in docs (#12708)

* Simplify match patterns in coercion rules (#12711)

Remove conditions where unnecessary.
Refactor to improve readability.

* Remove aggregate functions dependency on frontend (#12715)

* Remove aggregate functions dependency on frontend

DataFusion is a SQL query engine and also a reusable library for
building query engines. The core functionality should not depend on
frontend related functionalities like `sqlparser` or `datafusion-sql`.

* Remove duplicate license header

* Minor: Remove clone in `transform_to_states` (#12707)

* rm clone

Signed-off-by: jayzhan211 <[email protected]>

* fmt

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>

* Refactor tests for union sorting properties, add tests for unions and constants (#12702)

* Refactor tests for union sorting properties

* update doc test

* Undo import reordering

* remove unecessary static lifetimes

* Fix: support Qualified Wildcard in count aggregate function (#12673)

* Reduce code duplication in `PrimitiveGroupValueBuilder` with const generics (#12703)

* Reduce code duplication in `PrimitiveGroupValueBuilder` with const generics

* Fix docs

* Disallow duplicated qualified field names (#12608)

* Disallow duplicated qualified field names

* Fix tests

* Optimize base64/hex decoding by pre-allocating output buffers (~2x faster) (#12675)

* add bench

* replace macro with generic function

* remove duplicated code

* optimize base64/hex decode

* Allow DynamicFileCatalog support to query partitioned file (#12683)

* support to query partitioned table for dynamic file catalog

* cargo clippy

* split partitions inferring to another function

* Support `LIMIT` Push-down logical plan optimization for `Extension` nodes (#12685)

* Update trait `UserDefinedLogicalNodeCore`

Signed-off-by: Austin Liu <[email protected]>

* Update corresponding interface

Signed-off-by: Austin Liu <[email protected]>

Add rewrite rule for `push-down-limit` for `Extension`

Signed-off-by: Austin Liu <[email protected]>

* Add rewrite rule for `push-down-limit` for `Extension` and tests

Signed-off-by: Austin Liu <[email protected]>

* Update corresponding interface

Signed-off-by: Austin Liu <[email protected]>

* Reorganize to match guard

Signed-off-by: Austin Liu <[email protected]>

* Clena up

Signed-off-by: Austin Liu <[email protected]>

Clean up

Signed-off-by: Austin Liu <[email protected]>

---------

Signed-off-by: Austin Liu <[email protected]>

* Fix AvroReader: Add union resolving for nested struct arrays (#12686)

* Add union resolving for nested struct arrays

* Add test

* Change test

* Reproduce index error

* fmt

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Adds macros for creating `WindowUDF` and `WindowFunction` expression (#12693)

* Adds macro for udwf singleton

* Adds a doc comment parameter to macro

* Add doc comment for `create_udwf` macro

* Uses default constructor

* Update `Cargo.lock` in `datafusion-cli`

* Fixes: expand `$FN_NAME` in doc strings

* Adds example for macro usage

* Renames macro

* Improve doc comments

* Rename udwf macro

* Minor: doc copy edits

* Adds macro for creating fluent-style expression API

* Adds support for 1 or more parameters in expression function

* Rewrite doc comments

* Rename parameters

* Minor: formatting

* Adds doc comment for `create_udwf_expr` macro

* Improve example docs

* Hides extraneous code in doc comments

* Add a one-line readme

* Adds doc test assertions + minor formatting fixes

* Adds common macro for defining user-defined window functions

* Adds doc comment for `define_udwf_and_expr`

* Defines `RowNumber` using common macro

* Add usage example for common macro

* Adds usage for custom constructor

* Add examples for remaining patterns

* Improve doc comments for usage examples

* Rewrite inner line docs

* Rewrite `create_udwf_expr!` doc comments

* Minor doc improvements

* Fix doc test and usage example

* Add inline comments for macro patterns

* Minor: change doc comment in example

* Support unparsing plans with both Aggregation and Window functions (#12705)

* Support unparsing plans with both Aggregation and Window functions (#35)

* Fix unparsing for aggregation grouping sets

* Add test for grouping set unparsing

* Update datafusion/sql/src/unparser/utils.rs

Co-authored-by: Jax Liu <[email protected]>

* Update datafusion/sql/src/unparser/utils.rs

Co-authored-by: Jax Liu <[email protected]>

* Update

* More tests

---------

Co-authored-by: Jax Liu <[email protected]>

* Fix strpos invocation with dictionary and null (#12712)

In 1b3608d `strpos` signature was
modified to indicate it supports dictionary as input argument, but the
invoke method doesn't support them.

* docs: Update DataFusion introduction to clarify that DataFusion does provide an "out of the box" query engine (#12666)

* Update DataFusion introduction to show that DataFusion offers packaged versions for end users

* change order

* Update README.md

Co-authored-by: Andrew Lamb <[email protected]>

* refine wording and update user guide for consistency

* prettier

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Framework for generating function docs from embedded code documentation (#12668)

* Initial work on #12432 to allow for generation of udf docs from embedded documentation in the code

* Add missing license header.

* Fixed examples.

* Fixing a really weird RustRover/wsl ... something. No clue what happened there.

* permission change

* Cargo fmt update.

* Refactored Documentation to allow it to be used in a const.

* Add documentation for syntax_example

* Refactoring Documentation based on PR feedback.

* Cargo fmt update.

* Doc update

* Fixed copy/paste error.

* Minor text updates.

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Add IMDB(JOB) Benchmark [2/N] (imdb queries) (#12529)

* imdb dataset

* cargo fmt

* Add 113 queries for IMDB(JOB)

Signed-off-by: Austin Liu <[email protected]>

* Add `get_query_sql` from `query_id` string

Signed-off-by: Austin Liu <[email protected]>

* Fix CSV reader & Remove Parquet partition

Signed-off-by: Austin Liu <[email protected]>

* Add benchmark IMDB runner

Signed-off-by: Austin Liu <[email protected]>

* Add `run_imdb` script

Signed-off-by: Austin Liu <[email protected]>

* Add checker for imdb option

Signed-off-by: Austin Liu <[email protected]>

* Add SLT for IMDB

Signed-off-by: Austin Liu <[email protected]>

* Fix `get_query_sql()` for CI roundtrip test

Signed-off-by: Austin Liu <[email protected]>

Fix `get_query_sql()` for CI roundtrip test

Signed-off-by: Austin Liu <[email protected]>

Fix `get_query_sql()` for CI roundtrip test

Signed-off-by: Austin Liu <[email protected]>

* Clean up

Signed-off-by: Austin Liu <[email protected]>

* Add missing license

Signed-off-by: Austin Liu <[email protected]>

* Add IMDB(JOB) queries `2b` to `5c`

Signed-off-by: Austin Liu <[email protected]>

* Add `INCLUDE_IMDB` in CI verify-benchmark-results

Signed-off-by: Austin Liu <[email protected]>

* Prepare IMDB dataset

Signed-off-by: Austin Liu <[email protected]>

Prepare IMDB dataset

Signed-off-by: Austin Liu <[email protected]>

* use uint as id type

* format

* Seperate `tpch` and `imdb` benchmarking CI jobs

Signed-off-by: Austin Liu <[email protected]>

Fix path

Signed-off-by: Austin Liu <[email protected]>

Fix path

Signed-off-by: Austin Liu <[email protected]>

Remove `tpch` in `imdb` benchmark

Signed-off-by: Austin Liu <[email protected]>

* Remove IMDB(JOB) slt in CI

Signed-off-by: Austin Liu <[email protected]>

Remove IMDB(JOB) slt in CI

Signed-off-by: Austin Liu <[email protected]>

---------

Signed-off-by: Austin Liu <[email protected]>
Co-authored-by: DouPache <[email protected]>

* Minor: avoid clone while calculating union equivalence properties (#12722)

* Minor: avoid clone while calculating union equivalence properties

* Update datafusion/physical-expr/src/equivalence/properties.rs

* fmt

* Simplify streaming_merge function parameters (#12719)

* simplify streaming_merge function parameters

* revert test change

* change StreamingMergeConfig into builder pattern

* Fix links on docs index page (#12750)

* Provide field and schema metadata missing on cross joins, and union with null fields. (#12729)

* test: reproducer for missing schema metadata on cross join

* fix: pass thru schema metadata on cross join

* fix: preserve metadata when transforming to view types

* test: reproducer for missing field metadata in left hand NULL field of union

* fix: preserve field metadata from right side of union

* chore: safe indexing

* Minor: Update string tests for strpos (#12739)

* Apply `type_union_resolution` to array and values (#12753)

* cleanup make array coercion rule

Signed-off-by: jayzhan211 <[email protected]>

* change to type union resolution

Signed-off-by: jayzhan211 <[email protected]>

* change value too

Signed-off-by: jayzhan211 <[email protected]>

* fix tpyo

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>

* Add `DocumentationBuilder::with_standard_argument` to reduce copy/paste (#12747)

* Add `DocumentationBuilder::with_standard_expression` to reduce copy/paste

* fix doc

* fix standard argument

* Update docs

* Improve documentation to explain what is different

* fix `equal_to` in `PrimitiveGroupValueBuilder` (#12758)

* fix `equal_to` in `PrimitiveGroupValueBuilder`.

* fix typo.

* add uts.

* reduce calling of `is_null`.

* Minor: doc how field name is to be set (#12757)

* Fix `equal_to` in `ByteGroupValueBuilder` (#12770)

* Fix `equal_to` in `ByteGroupValueBuilder`

* refactor null_equal_to

* Update datafusion/physical-plan/src/aggregates/group_values/group_column.rs

* Allow simplification even when nullable (#12746)

The nullable requirement seem to have been added in #1401 but as far as
I can tell they are not needed for these 2 cases.

I think this can be shown using this truth table: (generated using
datafusion-cli without this patch)
```
> CREATE TABLE t (v BOOLEAN) as values (true), (false), (NULL);
> select t.v, t2.v, t.v AND (t.v OR t2.v), t.v OR (t.v AND t2.v) from t cross join t as t2;
+-------+-------+---------------------+---------------------+
| v     | v     | t.v AND t.v OR t2.v | t.v OR t.v AND t2.v |
+-------+-------+---------------------+---------------------+
| true  | true  | true                | true                |
| true  | false | true                | true                |
| true  |       | true                | true                |
| false | true  | false               | false               |
| false | false | false               | false               |
| false |       | false               | false               |
|       | true  |                     |                     |
|       | false |                     |                     |
|       |       |                     |                     |
+-------+-------+---------------------+---------------------+
```

And it seems Spark applies both of these and DuckDB applies only the
first one.

* Fix unnest conjunction with selecting wildcard expression (#12760)

* fix unnest statement with wildcard expression

* add commnets

* Improve `round` scalar function unparsing for Postgres (#12744)

* Postgres: enforce required `NUMERIC` type for `round` scalar function (#34)

Includes initial support for dialects to override scalar functions unparsing

* Document scalar_function_to_sql_overrides fn

* Fix stack overflow calculating projected orderings (#12759)

* Fix stack overflow calculating projected orderings

* fix docs

* Port / Add Documentation for `VarianceSample` and `VariancePopulation` (#12742)

* Upgrade arrow/parquet to `53.1.0` / fix clippy (#12724)

* Update to arrow/parquet 53.1.0

* Update some API

* update for changed file sizes

* Use non deprecated APIs

* Use ParquetMetadataReader from @etseidl

* remove upstreamed implementation

* Update CSV schema

* Use upstream is_null and is_not_null kernels

* feat: add support for Substrait ExtendedExpression (#12728)

* Add support for serializing and deserializing Substrait ExtendedExpr message

* Address clippy reviews

* Reuse existing rename method

* Transformed::new_transformed: Fix documentation formatting (#12787)

Co-authored-by: Andrew Lamb <[email protected]>

* fix: Correct results for grouping sets when columns contain nulls (#12571)

* Fix grouping sets behavior when data contains nulls

* PR suggestion comment

* Update new test case

* Add grouping_id to the logical plan

* Add doc comment next to INTERNAL_GROUPING_ID

* Fix unparsing of Aggregate with grouping sets

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Migrate documentation for all string functions from scalar_functions.md to code  (#12775)

* Added documentation for string and unicode functions.

* Fixed issues with aliases.

* Cargo fmt.

* Minor doc fixes.

* Update docs for var_pop/samp

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Account for constant equivalence properties in union, tests (#12562)

* Minor: clarify comment about empty dependencies (#12786)

* Introduce Signature::String and return error if  input of `strpos` is integer (#12751)

* fix sig

Signed-off-by: jayzhan211 <[email protected]>

* fix

Signed-off-by: jayzhan211 <[email protected]>

* fix error

Signed-off-by: jayzhan211 <[email protected]>

* fix all signature

Signed-off-by: jayzhan211 <[email protected]>

* fix all signature

Signed-off-by: jayzhan211 <[email protected]>

* change default type

Signed-off-by: jayzhan211 <[email protected]>

* clippy

Signed-off-by: jayzhan211 <[email protected]>

* fix docs

Signed-off-by: jayzhan211 <[email protected]>

* rm deadcode

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* rm test

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>

* Minor: improve docs on MovingMin/MovingMax (#12790)

* Add slt tests (#12721)

---------

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
Co-authored-by: OussamaSaoudi <[email protected]>
Co-authored-by: Jonah Gao <[email protected]>
Co-authored-by: Dmitrii Blaginin <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Tomoaki Kawada <[email protected]>
Co-authored-by: Piotr Findeisen <[email protected]>
Co-authored-by: Jay Zhan <[email protected]>
Co-authored-by: HuSen <[email protected]>
Co-authored-by: Emil Ejbyfeldt <[email protected]>
Co-authored-by: Simon Vandel Sillesen <[email protected]>
Co-authored-by: Jax Liu <[email protected]>
Co-authored-by: Austin Liu <[email protected]>
Co-authored-by: JonasDev1 <[email protected]>
Co-authored-by: jcsherin <[email protected]>
Co-authored-by: Sergei Grebnov <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Bruce Ritchie <[email protected]>
Co-authored-by: DouPache <[email protected]>
Co-authored-by: mertak-synnada <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: wiedld <[email protected]>
Co-authored-by: kamille <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Val Lorentz <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

STRPOS doesn't support (Dictionary(Int32, Utf8) as input type
5 participants