feat: add null input handling options for `any_value` #652

Blizzara · 2024-06-26T14:49:58Z

This adds an `ignore_nulls` option for `any_value` that can be used when converting e.g. Spark's first()/first_value()/any_value().
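As a rough illustration of what the option changes, here is a minimal Python sketch. It assumes that `any_value` may return any element of the group and simply picks the first one encountered (which mirrors how Spark's first() behaves); the function name, the boolean parameter, and the empty-group result are illustrative assumptions, not the Substrait definition.

```python
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def any_value(values: Iterable[Optional[T]], ignore_nulls: bool = False) -> Optional[T]:
    """Return some value from the group (assumption: the first one encountered).

    With ignore_nulls=True, null (None) inputs are skipped, so the result is
    only None when the group is empty or contains nothing but nulls.
    """
    for value in values:
        if ignore_nulls and value is None:
            continue  # skip null inputs when asked to ignore them
        return value
    return None  # assumption: an empty or all-null group yields null


group = [None, 3, None, 7]
print(any_value(group))                     # nulls are eligible results
print(any_value(group, ignore_nulls=True))  # nulls are skipped
```

Without the option, a consumer has no way of knowing which of the two behaviours the producer intended.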

Blizzara · 2024-06-26T14:58:45Z

extensions/functions_arithmetic.yaml

+          - name: x
+            value: any
+        options:
+          null_handling:


I copied this from concat. Does it make sense here, or is it better to add e.g. a boolean arg?

I think options make sense for this. The names ACCEPT_NULLS and IGNORE_NULLS sound a little weird to me, but I can't think of better ones currently.

I added that original option to concat, but I agree it does sound a little weird. How about changing the option name to ignore_nulls and having the values be ["True", "False"]? I think True/False may need to be quoted. If I recall correctly, I didn't do this originally because no other options were quoted, but that's changed since then.

renamed in b18cecd - and quoted after, as that was needed to keep them as strings. Though it hurts my soul a bit to have a string "TRUE". Maybe should make them YES/NO instead to be less confusing 😅

Blizzara · 2024-06-26T14:59:41Z

extensions/functions_arithmetic.yaml

@@ -1563,6 +1563,43 @@ aggregate_functions:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR ]
        nullability: DECLARED_OUTPUT
        return: fp64?
+  - name: "first"


the window functions are called first/last_value, should we stick to that naming?

Spark seems to have both first and first_value, though first_value is only supported in SQL while first is supported as a method.
DataFusion has first_value

Looking at other engines, Postgres and Trino only have support for first_value and last_value as window functions, and don't have first and last aggregate functions.

I think it makes sense to keep first/last and first_value/last_value separate.

Yeah, looks like this is a mess overall - postgres has only first_value as window, Spark has first and first_value being the same, duckdb has first for aggregate and both for window, DataFusion has first_value as aggregate...

The purpose of the functions is the same, though. I actually think it'd make sense for Substrait to only have one set of these (as aggregate), and maybe it should be the _value option to match already existing any_value. But dunno if we can remove the window versions, I guess that'd be a breaking change?

I renamed them to have the "_value" postfix now - small annoyance there is that if someone includes now both these and window functions they'll get duplicate signatures...

Currently the base name has to be unique across all of the files. So having two first_value functions with the same signature will likely mess things up. (And yes, we need a check.)

> Currently the base name has to be unique across all of the files

I just saw this comment. I believe this is inconsistent with both the spirit and spec of extensions. Extensions should allow people to create new extensions that completely conflict with other extensions. That's why the spec specifies that functions are identified by a combination of their name and URI. Two different URIs can define the same names as entirely different things since they are in different namespaces.

Recent discussion on the topic:

#631 (comment)
#634

IIRC, the general consensus is "yes, different filenames should be able to have duplicate function names, but we're not there yet, there isn't much motivation to tackle the issue, and keeping the core Substrait functions unique makes life easier in the short term".
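To make the namespace point concrete, here is a small Python sketch of a lookup keyed by the (URI, name) pair instead of the bare name. The `FunctionRegistry` class and the example.com URIs are hypothetical, and argument signatures are ignored for brevity; this is not how any Substrait library is actually implemented.

```python
from typing import Dict, Tuple


class FunctionRegistry:
    """Toy registry that identifies a function by (extension URI, name)."""

    def __init__(self) -> None:
        self._functions: Dict[Tuple[str, str], str] = {}

    def register(self, uri: str, name: str, description: str) -> None:
        key = (uri, name)
        if key in self._functions:
            # Only a clash within the same URI is a real conflict.
            raise ValueError(f"{name!r} is already defined in {uri!r}")
        self._functions[key] = description

    def lookup(self, uri: str, name: str) -> str:
        return self._functions[(uri, name)]


registry = FunctionRegistry()
# Hypothetical URIs: two extension files that both define first_value.
registry.register("https://example.com/window.yaml", "first_value", "window version")
registry.register("https://example.com/aggregate.yaml", "first_value", "aggregate version")
print(registry.lookup("https://example.com/aggregate.yaml", "first_value"))
```

A registry keyed only by the base name would have rejected the second registration, which is the limitation discussed above.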

Blizzara · 2024-06-26T15:00:11Z

One thing I'm not sure about: is it better to have these as both window and aggregate functions, or could we remove the window function version and just replace it with this?

vbarua

Overall this seems reasonable to me, but I'm curious to see what others think.

vbarua · 2024-06-26T16:21:10Z

extensions/functions_arithmetic.yaml

+        nullability: DECLARED_OUTPUT
+        decomposable: MANY
+        intermediate: any?
+        return: any?


I think both of these functions should be in functions_aggregate_generic.yaml instead of functions_arithmetic.yaml

vbarua · 2024-06-26T16:41:05Z

extensions/functions_arithmetic.yaml

+          - name: x
+            value: any
+        options:
+          null_handling:


I think options make sense for this. The names ACCEPT_NULLS and IGNORE_NULLS sound a little weird to me, but I can't think of better ones currently.

vbarua · 2024-06-26T16:44:46Z

extensions/functions_arithmetic.yaml

@@ -1563,6 +1563,43 @@ aggregate_functions:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR ]
        nullability: DECLARED_OUTPUT
        return: fp64?
+  - name: "first"


Looking at other engines, Postgres and Trino only have support for first_value and last_value as window functions, and don't have first and last aggregate functions.

I think it makes sense to keep first/last and first_value/last_value separate.

EpsilonPrime · 2024-06-26T20:33:39Z

extensions/functions_aggregate_generic.yaml

@@ -35,3 +35,41 @@ aggregate_functions:
            value: any
        nullability: DECLARED_OUTPUT
        return: any?
+  - name: "first"
+    description: >-
+      First value from a group of values.


It's probably worth noting that order matters for these two functions. Acero will reject plans that don't have a defined ordering on the input which might be a reasonable practice.

Added a note in c93c326!

EpsilonPrime · 2024-06-26T20:37:53Z

extensions/functions_aggregate_generic.yaml

@@ -35,3 +35,41 @@ aggregate_functions:
            value: any
        nullability: DECLARED_OUTPUT
        return: any?
+  - name: "first"


Should we call out to first_value with the difference to make it easier for folks to choose one or the other?

Just to make sure I understand, what do you see as the difference? My hope would be to replace the window versions with these completely, given that aggregate functions are also valid window functions, and the engines I looked at don't seem to make a difference between these - but maybe I missed something

The only difference I see is what context they can be called in.

Blizzara · 2024-06-26T21:00:25Z

Okays, I revamped the PR a bit - now it moves the first_value and last_value from window funcs into aggregate funcs, and adds the null handling options.

This makes most sense to me, but lmk what you think!

vbarua · 2024-06-26T21:07:56Z

extensions/functions_arithmetic.yaml

-        nullability: DECLARED_OUTPUT
-        decomposable: NONE
-        return: any1
-        window_type: PARTITION


Heads up, moving existing functions is a breaking change, and a relatively painful one to work around. I would prefer if we could avoid breakage here.

See: #634 (comment)

I guess the options are:

1. move existing functions (breaking change)
2. duplicate existing functions (bad)
3. add new functions with a new name (leads to a confusing end state, as there are now two functions with the same functionality but different names, unless we can deprecate the existing functions somehow)

For my need, I think both 1 and 3 work fine, so I don't have strong opinions - I guess it's a question of breaking now vs keeping a worse state for ever/until breaking later?

We could also keep the functions in the same file, but move them into the aggregate functions block.

> I guess it's a question of breaking now vs keeping a worse state for ever/until breaking later?

I think it would be good to move these eventually, but it would be nice to do so after substrait-java can handle duplicate functions in different files, because then we could make the change by duplicating the functions into the new file in one release and removing the old functions in the next release.

I've filed substrait-io/substrait-java#275 to track this work in substrait-java.

Keeping the same file for now, fixing substrait-java, and then moving files works for me.

Ah, I thought "moving" also meant moving from window -> aggregation. Keeping them in the old file is fine for me, like fc3fa78?

westonpace · 2024-06-27T16:25:07Z

I don't love it but I won't necessarily vote against it. It isn't supported by some significant engines (e.g. Postgres, SQL Server, Snowflake). However, it does appear to be supported by DataFusion and DuckDB, so there is some representation. Databricks has first/last but it marks them as "order is non-deterministic" and it's not clear that Databricks supports "ORDER BY" in an aggregate expression.

My main problem is that first(x order by x) is the same as min(x) and first(x order by y) can be obtained by arg_min(x, y). I think min / arg_min are better since they don't require an order by statement and so they are more easily implemented by engines (yes, any engine can optimize first into min / arg_min but that's just introducing extra steps for no real gain). I think the more common result would be an engine falling back to its order by implementation and introducing an expensive sort into the query (arg_min and arg_max can be calculated without a sort). For example, this appears to be what DataFusion does today:

❯ EXPLAIN SELECT first_value(val ORDER BY val) FROM foo;
+---------------+------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                     |
+---------------+------------------------------------------------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[]], aggr=[[FIRST_VALUE(foo.val) ORDER BY [foo.val ASC NULLS LAST]]] |
|               |   TableScan: foo projection=[val]                                                        |
| physical_plan | AggregateExec: mode=Single, gby=[], aggr=[FIRST_VALUE(foo.val)]                          |
|               |   SortExec: expr=[val@0 ASC NULLS LAST]                                                  |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                        |
|               |                                                                                          |
+---------------+------------------------------------------------------------------------------------------+

That plan is more expensive than SELECT min(val) FROM foo even though the two queries return the same result. That being said, I do think we should finish up the arg_min / arg_max PR (#326)
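The equivalence argued above can be sketched in Python. This is an illustration only: rows are modelled as hypothetical (x, y) tuples, nulls are left out, and the helper names are made up. The sort-based version stands in for the SortExec plan, while the single-pass version is what an arg_min implementation can do.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, int]  # hypothetical layout: (x, y)


def first_order_by_y(rows: List[Row]) -> Optional[int]:
    """first(x ORDER BY y): sort the whole input, then take the first row."""
    ordered = sorted(rows, key=lambda r: r[1])  # O(n log n)
    return ordered[0][0] if ordered else None


def arg_min(rows: List[Row]) -> Optional[int]:
    """arg_min(x, y): one pass, remembering the row with the smallest y."""
    best: Optional[Row] = None
    for row in rows:  # O(n), no sort required
        if best is None or row[1] < best[1]:
            best = row
    return best[0] if best is not None else None


rows = [(10, 3), (20, 1), (30, 2)]
# Both pick the x value of the row with the smallest y.
assert first_order_by_y(rows) == arg_min(rows) == 20
# first(x ORDER BY x) collapses to min(x).
assert sorted(x for x, _ in rows)[0] == min(x for x, _ in rows) == 10
```

Both helpers return the same answer here, including on ties (the stable sort and the strict comparison both keep the earliest row), but only the first needs to order the full input.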

Blizzara · 2024-06-27T19:36:06Z

> I don't love it but I won't necessarily vote against it. It isn't supported by some significant engines (e.g. Postgres, SQL Server, Snowflake). However, it does appear to be supported by DataFusion and DuckDB, so there is some representation. Databricks has first/last but it marks them as "order is non-deterministic" and it's not clear that Databricks supports "ORDER BY" in an aggregate expression.

By Databricks, do you mean Spark? It turns out Spark also has any_value, and the two are "interchangeable", i.e. the implementation of any_value is just first. Substrait already has any_value, so one option is to just use that.

> My main problem is that first(x order by x) is the same as min(x) and first(x order by y) can be obtained by arg_min(x, y).

FWIW, I don't think those are the main uses for first/last, rather I think the need is more for the "any_value" concept. Now that I think of it, I'm not sure why Spark even has a last, but maybe there's a reason.

The main reason I started this PR was to support Spark's distinct/dropDuplicates. Those get rewritten by the optimizer into a first aggregate: https://github.com/apache/spark/blob/df13ca05c475e98bf5c218a4503513065611a47f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2262
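For readers unfamiliar with that rewrite, the idea is that deduplicating on a subset of columns is equivalent to grouping by those columns and taking the first value of every remaining column. A simplified Python model follows; the `drop_duplicates` helper and the dict-based rows are hypothetical and are not Spark code.

```python
from typing import Any, Dict, List, Tuple


def drop_duplicates(rows: List[Dict[str, Any]], keys: List[str]) -> List[Dict[str, Any]]:
    """Model dropDuplicates(keys) as GROUP BY keys with first() on each column."""
    first_seen: Dict[Tuple[Any, ...], Dict[str, Any]] = {}
    for row in rows:
        group_key = tuple(row[k] for k in keys)
        # first(): keep the first row encountered for each group.
        first_seen.setdefault(group_key, row)
    return list(first_seen.values())


rows = [
    {"id": 1, "v": "a"},
    {"id": 1, "v": "b"},
    {"id": 2, "v": "c"},
]
print(drop_duplicates(rows, ["id"]))
```

Which row counts as "first" depends on input order, which is why any_value is an adequate stand-in for this use.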

We do have some uses of last as well, but it's possible those would be within windows, I don't have an easy way to check.

In the end, if it's preferable, I think I can actually turn Spark's first() aggregate into Substrait's any_value() and then that again into DataFusion's first_value(). Then if I do end up needing the last() as aggregate, I can add it as a Spark specific mapping or something.

Would that be better? Then I could change this PR to instead add the null-handling options into any_value, since those would be nice to have still.

EpsilonPrime · 2024-06-27T20:23:19Z

An alternative to first is to use a fetch relation, but it becomes a lot more complicated to modify a complicated subquery to introduce it. last does weird me out, and it is available in fewer places. It was probably implemented for completeness rather than actual use.

westonpace · 2024-06-28T15:04:32Z

> Would that be better? Then I could change this PR to instead add the null-handling options into any_value, since those would be nice to have still.

Yes, I'd prefer that. Sorry for the churn. Agree the null handling is good.

Blizzara · 2024-07-01T19:55:27Z

> Yes, I'd prefer that. Sorry for the churn. Agree the null handling is good.

Done! All good, makes sense to be careful when adding stuff into the standard (or standard extensions)!

westonpace

+1, but I will point out that "ignore nulls" / "skip nulls" is a property that can apply to many (potentially all) aggregate functions.

Blizzara requested review from jacques-n, cpcloud, westonpace, EpsilonPrime and vbarua as code owners June 26, 2024 14:49

Blizzara force-pushed the avo/first-last-as-aggregate-functions branch from 858438b to 0d924e9 June 26, 2024 15:01

vbarua reviewed Jun 26, 2024

EpsilonPrime reviewed Jun 26, 2024

EpsilonPrime previously approved these changes Jun 26, 2024

vbarua reviewed Jun 26, 2024

Blizzara dismissed EpsilonPrime’s stale review via fc3fa78 June 27, 2024 07:36

EpsilonPrime previously approved these changes Jun 27, 2024

Blizzara dismissed EpsilonPrime’s stale review via ff68ec9 July 1, 2024 19:51

Blizzara force-pushed the avo/first-last-as-aggregate-functions branch from fc3fa78 to ff68ec9 July 1, 2024 19:51

Blizzara changed the title ~~feat: add first and last aggregate functions~~ feat: add null input handling options for nth_value Jul 1, 2024

feat: add null input handling options for any_value

9563135

any_value can be used in place of e.g. first() in Spark but it's missing an option for whether to ignore nulls in input or not

Blizzara force-pushed the avo/first-last-as-aggregate-functions branch from ff68ec9 to 9563135 July 1, 2024 19:54

Blizzara changed the title ~~feat: add null input handling options for nth_value~~ feat: add null input handling options for any_value Jul 1, 2024

westonpace approved these changes Jul 2, 2024

EpsilonPrime approved these changes Jul 3, 2024

EpsilonPrime merged commit 1890e6a into substrait-io:main Jul 3, 2024
17 checks passed

Blizzara deleted the avo/first-last-as-aggregate-functions branch July 8, 2024 09:14

Blizzara mentioned this pull request Aug 28, 2024

feat: add 'first' function #697

Closed
