
[SPARK-49022] Use Column Node API in Column #47688

Closed
wants to merge 21 commits

Conversation

@hvanhovell (Contributor) commented Aug 9, 2024

What changes were proposed in this pull request?

This PR makes the org.apache.spark.sql.Column and friends use the recently introduced ColumnNode API. This is a stepping stone towards making the Column API implementation agnostic.

Most of the changes are fairly mechanical, and they are mostly caused by the removal of the Column(Expression) constructor.

Why are the changes needed?

We want to create a unified Scala interface for Classic and Connect. A language-agnostic Column API implementation is part of this.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No

return Column(cast(JVMView, sc._jvm).Column(expr(*jcols + jfuns)))

jcols = [_to_java_column(c) for c in cols]
return Column(sc._jvm.Column.pysparkFn(name, _to_seq(sc, jcols + jfuns)))
@hvanhovell (Contributor, author) commented:

@HyukjinKwon I cannot invoke Column.fn(...) here. Do you know why? I added Column.pysparkFn(...) as a workaround.

Member replied:

Ah, Column.fn has to be decorated with scala.annotation.varargs (which makes the compiler emit the equivalent varargs signature for Java callers).
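
For illustration, a minimal standalone sketch of what the annotation changes (this is not Spark's actual Column.fn; the names here are illustrative):

```scala
import scala.annotation.varargs

object VarargsExample {
  // Without @varargs, scalac compiles the repeated parameter as a single Seq[String]
  // argument, which Java callers (and therefore Py4J) cannot supply as plain
  // positional arguments. With @varargs, the compiler also emits a Java-style
  // bridge method taking String..., so fn("concat", "a", "b") works from Java too.
  @varargs
  def fn(name: String, args: String*): String = s"$name(${args.mkString(", ")})"
}
```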

@github-actions github-actions bot removed the CORE label Aug 14, 2024
@hvanhovell (Contributor, author):

This is waiting for #47746.

@github-actions github-actions bot added the R label Aug 15, 2024
@@ -26,8 +26,7 @@ import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.UnboundRowEncoder
import org.apache.spark.sql.catalyst.encoders.encoderFor
import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUDF}
import org.apache.spark.sql.execution.aggregate.ScalaAggregator
import org.apache.spark.sql.internal.UserDefinedFunctionLike
import org.apache.spark.sql.internal.{InvokeInlineUserDefinedFunction, UserDefinedFunctionLike}
Member commented:

Not really related to this PR, but I wonder if we should name the package a little differently, e.g. org.apache.spark.sql.internal.api. Right now I keep having to check which package each individual class belongs to.

@hvanhovell (Contributor, author) replied:

You mean move all ColumnNode classes elsewhere?


def unapply(col: Column): Option[Expression] = Some(col.expr)
def apply(node: => ColumnNode): Column = withOrigin(new Column(node))
Member commented:

I think the error report here could get a bit weird via Origin, because the top level of the function call changed. We get Thread.currentThread().getStackTrace and set it on the Origin, so whichever function fails, it will always point here.

It would be great if we could double-check that code like the below still works fine:

val df = spark.range(10)

val df1 = df.withColumn("div_ten", df.col("id") / 10)
val df2 = df1.withColumn("plus_four", df.col("id") + 4)

// This is the problematic divide operation that triggers DIVIDE_BY_ZERO.
val df3 = df2.withColumn("div_zero", df.col("id") / 0) // Error here
val df4 = df3.withColumn("minus_five", df.col("id") / 5)
df4.collect()

should report something like:

org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"div" was called from
<init>(<console>:7)

@hvanhovell (Contributor, author) replied:

This is what I get:

org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"div" was called from
<init>(<console>:1)

So that checks out. TBH this part is almost the same as before.

@HyukjinKwon (Member) left a comment:

Looks good otherwise; mostly minor comments. The only real comment from me is https://github.com/apache/spark/pull/47688/files#r1719271705, to make sure error reporting still works fine.

@hvanhovell hvanhovell changed the title [WIP][SPARK-49022] Use Column Node API in Column [SPARK-49022] Use Column Node API in Column Aug 17, 2024
@hvanhovell (Contributor, author):

Tests have passed. I am merging this to master.

@HyukjinKwon (Member):

Merged to master.

@EnricoMi (Contributor):

@hvanhovell what is the recommended way to migrate user code like

new Column(MyExpression())

to this Column Node API?

The following are all private[sql] / private[spark]:

Column(MyExpression())
new Column(ExpressionColumnNode(MyExpression()))

@hvanhovell (Contributor, author):

@EnricoMi what exactly are you doing? Could you share an example? We can definitely open up the interface to allow for user extensions.

If it is basically adding functions, then I'd probably go with the Column.fn(...) approach in combination with a SparkSessionExtension that registers your custom expressions. This way we can easily make what you have created work with Spark Connect (and with SQL). I understand some folks are not interested in this; in that case we can still open up an expression-based way of creating a Column.
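
To make that concrete, here is a hedged sketch of the suggested direction (not code from this PR; MyEquivExpression, MyExtensions, and the function name my_equiv are illustrative, and the exact Column.fn signature may differ between Spark versions): a custom Catalyst expression is registered through SparkSessionExtensions and then referenced by name.

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.{FunctionIdentifier, InternalRow}
import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionInfo}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{BooleanType, DataType}

// Illustrative user-defined expression: true when all children evaluate to the same
// value. A real implementation would also provide proper codegen instead of the fallback.
case class MyEquivExpression(children: Seq[Expression])
    extends Expression with CodegenFallback {
  override def dataType: DataType = BooleanType
  override def nullable: Boolean = false
  override def eval(input: InternalRow): Any =
    children.map(_.eval(input)).distinct.size <= 1
  override protected def withNewChildrenInternal(
      newChildren: IndexedSeq[Expression]): Expression = copy(children = newChildren)
}

// Enabled via spark.sql.extensions=com.example.MyExtensions (package name illustrative).
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectFunction((
      FunctionIdentifier("my_equiv"),
      new ExpressionInfo(classOf[MyEquivExpression].getName, "my_equiv"),
      (children: Seq[Expression]) => MyEquivExpression(children)))
  }
}
```

Once registered, the function is resolvable from SQL (SELECT my_equiv(a, b) FROM t) and, assuming Column.fn is available in your build, from the DataFrame API via Column.fn("my_equiv", df.col("a"), df.col("b")).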

@EnricoMi (Contributor):

... in that case we can still open up an expression-based way of creating a Column.

That would shortcut a lot of extra complexity and simplify backward compatibility.

@HyukjinKwon (Member) commented Aug 18, 2024

backward compatibility

Which backward compatibility? Showing an example would help a lot here. All those expressions are supposed to be internal and private.

@EnricoMi (Contributor):

The spark-extension package provides some Dataset diff tooling. There, a user-defined comparison can be defined simply by implementing the scala.math.Equiv interface: https://github.com/G-Research/spark-extension/blob/master/src/main/scala/uk/co/gresearch/spark/diff/DiffComparators.scala#L41

That Equiv implementation is wrapped into an Expression (including codegen) and turned into a Comparator that is then used by the package to diff columns: given two columns left and right, it returns a Column that evaluates to Boolean (comparing the two columns).

This obviously won't work for Spark Connect, but with the Column Node API it no longer works for the classic Spark client either.

That package supports Spark 3.0 - 3.5. Being able to create a Column from an Expression would allow minimal changes to keep this working for Spark 4.0 with a non-Connect client. This is what I meant by backward compatibility.

In order to support Spark Connect, there is no way around using the Spark Connect plugin / extensions.

github-merge-queue bot pushed a commit to G-Research/spark-extension that referenced this pull request Aug 20, 2024
@EnricoMi (Contributor):

Currently working around this by gaining access to Column(Expression) from inside the private[spark] package: https://github.com/G-Research/spark-extension/pull/256/files
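
For reference, a minimal sketch of that kind of workaround (illustrative names; it leans on Spark-internal API that may change between versions): a small shim compiled into the org.apache.spark.sql package so the private[sql] factories mentioned above become visible.

```scala
// Must live in the org.apache.spark.sql package so the private[sql] members are accessible.
package org.apache.spark.sql

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.internal.ExpressionColumnNode

object ColumnShim {
  // Recreates the old `new Column(expr)` behaviour on top of the ColumnNode API.
  def column(expr: Expression): Column = Column(ExpressionColumnNode(expr))
}
```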

IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
HyukjinKwon added a commit that referenced this pull request Oct 9, 2024
… keep the behavior same

### What changes were proposed in this pull request?

This PR is a follow-up of #47688 that keeps `Column.toString` the same as before.

### Why are the changes needed?

To keep the behaviour the same between Spark Classic and Connect.

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

Will be added separately. I manually tested:

```scala
import org.apache.spark.sql.functions.col
val name = "with`!#$%dot".replace("`", "``")
col(s"`${name}`").toString.equals("with`!#$%dot")
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48376 from HyukjinKwon/SPARK-49022-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>