[SPARK-18777][PYTHON][SQL] Return UDF from udf.register #17831
Conversation
Test build #76380 has finished for PR 17831 at commit
Test build #76381 has finished for PR 17831 at commit
Test build #76382 has finished for PR 17831 at commit
Test build #76383 has finished for PR 17831 at commit
Thanks for doing this, it will let me clean up some of my PySpark notebooks :)
LGTM. Since this is just exposing existing functionality I think it makes sense to merge into 2.2 as well, but I'll leave this open for a day in case anyone has any comments.
Python interface
I don't think the Scala API is particularly consistent, and I'd argue that this does actually match the current API as much as is reasonable. Looking at the udf.register interface defined in UDFRegistration.scala, this does return either a
In the Scala API, we have five different ways to register UDFs. We have to be more careful when changing external APIs; making them more diverse confuses users. I do not suggest making any further changes to Python UDFs until we figure out how to make it easy to add more UDF types.
I agree we need to be careful when making external API changes, but this is a pretty clear win for Python users and it matches the Scala API, so I think improving this for Python users shouldn't necessarily block on any unification of the Scala API (is there a plan to improve this, or a JIRA associated with the planned Scala API changes I could subscribe to notifications on)?
So far, one of the biggest issues is that we always assume a UDF is deterministic. That assumption can cause incorrect results after query optimization. I am working on a UDF API change to add more types for UDFs and will submit a PR soon. We can revisit this PR later.
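The determinism concern can be illustrated with a tiny, hypothetical sketch (plain Python, not Spark internals): if a "UDF" depends on hidden state, an optimizer that evaluates it more than once for the same input, assuming equal results, would produce inconsistent answers.

```python
import itertools

# Hypothetical nondeterministic "UDF": its result depends on hidden
# mutable state, so the same input does not yield the same output.
_calls = itertools.count()

def tagged(x):
    # Returns a different value on every call, even for identical input.
    return (x, next(_calls))

# An optimizer assuming determinism might evaluate the function twice
# (e.g. once in a filter and once in a projection) and expect equality:
first = tagged(1)
second = tagged(1)
assert first != second  # the determinism assumption does not hold
```

This is why marking UDFs with extra type information (deterministic or not) matters before optimizers are free to duplicate or reorder their invocations.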
I feel like that's an unrelated challenge. I'm happy to see other improvements, but I'm worried that we will hold up changes for things which aren't happening soon. Is there a JIRA for these changes?
@gatorsmile This sounds reasonable, but I am not sure I fully understand your concerns. If anything, this brings PySpark closer to the Scala API. At this moment we have:
[...]
and we would move to:
[...]
and similarly with the other registration methods. Regarding "We can revisit this PR later": this, as pointed out by @holdenk, matches the existing behavior. If you're planning breaking changes in the Scala API, it may render this PR obsolete, but we don't commit here to any particular implementation. The only promise here is that registering a UDF for SQL applications returns an object which can be used with
Looks good to me.
python/pyspark/sql/catalog.py
>>> from pyspark.sql.types import IntegerType
>>> spark.catalog.registerFunction("stringLengthInt", lambda x: len(x), IntegerType())
>>> strlen = spark.catalog.registerFunction("stringLengthInt", len, IntegerType())
It looks like strlen is not being used. Should we remove this or add an example for it?
We have to either capture the output (we could use _, I guess) or use it in the doctests:
>>> spark.catalog.registerFunction("stringLengthInt", len, IntegerType())
<function len>
I am not sure it makes sense to use it unless we provide more diverse examples.
Hm, I think capturing is rather better because it is what this API returns (and as it does not return None anymore), if we don't have a better idea. A variable that indicates discarded output, _, also looks fine to me. No strong opinion; I am fine either way if we don't have a simpler and cleaner idea.
Let's use _. It is a common way to indicate discarded output and, unlike matching on the repr, it is portable.
(Yea, discarded output is a better term ...)
I want to know the UDF API change in more detail and see if it blocks this PR. I would appreciate it if there is a JIRA or details I could check.
Test build #76416 has finished for PR 17831 at commit
@zero323 Will ping you when my PR is submitted. You can then add the UDF changes in Python.
@HyukjinKwon @holdenk The Spark 2.2 RC is already out, and this PR is not a bug fix, so it will be merged to the master branch instead of 2.2. I will submit a PR to show the changes soon.
JIRA: https://issues.apache.org/jira/browse/SPARK-20586 has been opened.
Thanks @gatorsmile
Yea, thanks for chiming in. It helped me a lot to understand the context.
I think there is no conflict between #17848 and this. As of 2.2 we no longer return
It sounds orthogonal to me as well. LGTM.
cc @viirya too, who I believe is also well placed to review this.
Thanks for making PR #17848 with the details, @gatorsmile; it appears to be orthogonal to this change. Historically we've treated Python API parity fixes as closer to bug fixes than to new features, which is why I was thinking this made sense for 2.2, but I'll defer to your expertise in SQL and we can just target this for master.
@gatorsmile Want to know if you're OK with this going into master, or if you still have concerns if it's targeted to 2.3?
This change LGTM. I went to check #17848. It seems to me that the PR simply adds two flags to ScalaUDF, and there is no API change regarding existing UDF registration. I agree with @holdenk and @HyukjinKwon that it is orthogonal to this change for now.
Thanks! Merging to master. |
Thanks everyone. |
What changes were proposed in this pull request?

- Move wrapping logic from functions.udf to functions.UserDefinedFunction.
- Return the wrapped function from catalog.registerFunction and dependent methods.
- Update the docstrings of catalog.registerFunction and SQLContext.registerFunction.

How was this patch tested?