[SPARK-49022] Use Column Node API in Column #47688
Conversation
```python
return Column(cast(JVMView, sc._jvm).Column(expr(*jcols + jfuns)))
```

```python
jcols = [_to_java_column(c) for c in cols]
return Column(sc._jvm.Column.pysparkFn(name, _to_seq(sc, jcols + jfuns)))
```
@HyukjinKwon I cannot invoke `Column.fn(...)` here. Do you know why? I added `Column.pysparkFn(...)` as a workaround.
Ah, `Column.fn` has to be decorated with `scala.annotation.varargs` (which makes the compiler emit a matching varargs signature for Java callers). This is waiting for #47746.
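For context, a minimal sketch (toy types, not Spark's actual code) of what `scala.annotation.varargs` changes on the Java side:

```scala
import scala.annotation.varargs

// Hypothetical toy Column, just to make the sketch self-contained.
class Column(val name: String)

object Column {
  // Without @varargs, Scala compiles the repeated parameter `cols` into a
  // single Seq[Column] argument in bytecode, so Java callers (and Py4J)
  // cannot invoke fn("name", col1, col2) directly. The annotation makes the
  // compiler also emit a Java-style `Column...` forwarder with the same name.
  @varargs
  def fn(name: String, cols: Column*): Column =
    new Column(cols.map(_.name).mkString(name + "(", ", ", ")"))
}
```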
```diff
@@ -26,8 +26,7 @@ import org.apache.spark.sql.catalyst.ScalaReflection
 import org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.UnboundRowEncoder
 import org.apache.spark.sql.catalyst.encoders.encoderFor
 import org.apache.spark.sql.catalyst.expressions.{Expression, ScalaUDF}
-import org.apache.spark.sql.execution.aggregate.ScalaAggregator
-import org.apache.spark.sql.internal.UserDefinedFunctionLike
+import org.apache.spark.sql.internal.{InvokeInlineUserDefinedFunction, UserDefinedFunctionLike}
```
Not so related to this PR, but I wonder if we should name the package a little differently, e.g., `org.apache.spark.sql.internal.api`. At the least, I keep having to check which package an individual class belongs to.
You mean move all ColumnNode classes elsewhere?
```scala
def unapply(col: Column): Option[Expression] = Some(col.expr)
def apply(node: => ColumnNode): Column = withOrigin(new Column(node))
```
I think the error report here could be a bit weird via `Origin` because the top level of the function call has changed. We get `Thread.currentThread().getStackTrace` and set it on the `Origin`, so whichever function fails, it will always point here.
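(For readers following along: a rough, hypothetical sketch of the call-site capture pattern described above. Spark's real mechanism lives in its `Origin`/`CurrentOrigin` machinery; the names below are illustrative only.)

```scala
object OriginSketch {
  // Thread-local slot mimicking a CurrentOrigin-style mechanism.
  private val current = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }

  def withOrigin[T](f: => T): T = {
    // Capture the caller's frame: drop getStackTrace and withOrigin itself,
    // so the recorded origin points at user code rather than this wrapper.
    val callSite = Thread.currentThread().getStackTrace
      .drop(2)
      .headOption
      .map(e => s"${e.getMethodName}(${e.getFileName}:${e.getLineNumber})")
    current.set(callSite)
    try f finally current.set(None)
  }
}
```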
It would be great if we could double-check that code like the below still works fine:

```scala
val df = spark.range(10)
val df1 = df.withColumn("div_ten", df.col("id") / 10)
val df2 = df1.withColumn("plus_four", df.col("id") + 4)
// This is the problematic divide operation that triggers DIVIDE_BY_ZERO.
val df3 = df2.withColumn("div_zero", df.col("id") / 0) // Error here
val df4 = df3.withColumn("minus_five", df.col("id") / 5)
df4.collect()
```
It should report something like:

```
org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"div" was called from
<init>(<console>:7)
```
This is what I get:

```
org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"div" was called from
<init>(<console>:1)
```

So that checks out. TBH this part is almost the same as before.
Looks good otherwise. Mostly minor comments. The only real comment from me is https://github.com/apache/spark/pull/47688/files#r1719271705, to make sure error reporting is still working fine.

Tests have passed. I am merging this to master.
Merged to master.
@hvanhovell what is the recommended way to migrate user code like

to this Column Node API? The following are all
@EnricoMi what exactly are you doing? Could you share an example? We can definitely open up the interface to allow for user extensions. If it is basically adding functions, then I'd probably go with the
That would shortcut a lot of extra complexity and simplify backward compatibility.
Which backward compatibility? Showing an example would help a lot here. All those expressions are meant to be internal and private.
The spark-extension package provides some Dataset diff tooling. There, a user-defined comparison can simply be defined by implementing the

That

This obviously won't work for Spark Connect, but with the Column Node API this does not work for the classic Spark client either. That package supports Spark 3.0 - 3.5.

Creating a

In order to support Spark Connect, there is no way around using the Spark Connect plugin / extensions.
Fixing compilation broken by apache/spark#47688.
Currently working around this by gaining access to
### What changes were proposed in this pull request?
This PR makes the org.apache.spark.sql.Column and friends use the recently introduced ColumnNode API. This is a stepping stone towards making the Column API implementation agnostic.
Most of the changes are fairly mechanical, and they are mostly caused by the removal of the Column(Expression) constructor.

### Why are the changes needed?
We want to create a unified Scala interface for Classic and Connect. A language-agnostic Column API implementation is part of this.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47688 from hvanhovell/SPARK-49022.

Authored-by: Herman van Hovell <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… keep the behavior same

### What changes were proposed in this pull request?
This PR is a followup of #47688 that keeps `Column.toString` the same as before.

### Why are the changes needed?
To keep the same behaviour with Spark Classic and Connect.

### Does this PR introduce _any_ user-facing change?
No, the main change has not been released out yet.

### How was this patch tested?
Will be added separately. I manually tested:

```scala
import org.apache.spark.sql.functions.col

val name = "with`!#$%dot".replace("`", "``")
col(s"`${name}`").toString.equals("with`!#$%dot")
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48376 from HyukjinKwon/SPARK-49022-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
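As a side note, a plain-Scala illustration of the backtick-escaping round trip that manual test exercises (self-contained; object and value names are hypothetical): a backtick inside a name is doubled when the name is embedded in a quoted identifier, and `Column.toString` is expected to print the original, unescaped name.

```scala
object BacktickRoundTrip extends App {
  val raw = "with`!#$%dot"
  // Escape: double any backtick, then wrap in backticks to form a quoted identifier.
  val quoted = s"`${raw.replace("`", "``")}`" // `with``!#$%dot`
  // Unescape: strip the outer backticks and collapse doubled backticks.
  val unescaped = quoted.stripPrefix("`").stripSuffix("`").replace("``", "`")
  assert(unescaped == raw) // toString should render the raw name, not the quoted form
  println(unescaped)
}
```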
What changes were proposed in this pull request?
This PR makes the org.apache.spark.sql.Column and friends use the recently introduced ColumnNode API. This is a stepping stone towards making the Column API implementation agnostic.
Most of the changes are fairly mechanical, and they are mostly caused by the removal of the Column(Expression) constructor.
Why are the changes needed?
We want to create a unified Scala interface for Classic and Connect. A language-agnostic Column API implementation is part of this.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
No