Spark 3.4: Support pushing down system functions by V2 filters #7886
Conversation
Hi, @rdblue @aokolnychyi @szehon-ho @Fokko @nastra, could you help to review this when you are free?
Looks promising. Wondering if we can split out the unrelated changes into another PR.
Thanks @rdblue, this has been rebased.
private static boolean isSystemFunc(org.apache.spark.sql.connector.expressions.Expression expr) {
  if (expr instanceof UserDefinedScalarFunc) {
    UserDefinedScalarFunc udf = (UserDefinedScalarFunc) expr;
    return udf.canonicalName().startsWith("iceberg")
Do we need this string check? I feel like all of the functions should be listed in the list on line 369?
There could possibly be some other UDF functions, like 'catalog.other.years'.
I think the problem is that this is using the UDF's canonicalName but the other check uses the function's name. Those can differ: BucketFunction.name() returns bucket, which is what the user would call, but the canonical function name identifies the exact bound function no matter how it is loaded, so BucketInt.canonicalName() returns iceberg.bucket(int).
If we want to limit this to just the Iceberg-defined functions, then this is necessary. It may be better to have a set of supported functions that uses just the canonicalName.
It may be better to have a set of supported functions that uses just the canonicalName.

There are some canonicalName values that are not constants, for example:

  @Override
  public String canonicalName() {
    return String.format("iceberg.bucket(%s)", sqlType.catalogString());
  }
While that's not a constant, there are a limited number of values that sqlType.catalogString() might return. I think it would be better to enumerate them, rather than just looking for iceberg and mapping the name.
I have tried to do that; however, TruncateDecimal needs precision and scale, which are runtime values:

  @Override
  public String canonicalName() {
    return String.format("iceberg.truncate(decimal(%d,%d))", precision, scale);
  }
You can similarly enumerate them, but this isn't a big deal.
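A minimal sketch of the enumeration approach being suggested, assuming a hand-maintained allow-list keyed by canonicalName; the class name, helper name, and most of the entries are hypothetical (only iceberg.bucket(int) is confirmed above) and this is not the code that ended up in the PR:

  import java.util.Set;
  import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
  import org.apache.spark.sql.connector.expressions.UserDefinedScalarFunc;

  class SystemFunctionAllowList {
    // Hypothetical allow-list of canonical names that are safe to push down.
    private static final Set<String> SUPPORTED_CANONICAL_NAMES =
        ImmutableSet.of(
            "iceberg.bucket(int)",      // confirmed in the discussion above
            "iceberg.years(date)",      // illustrative entry
            "iceberg.days(timestamp)"); // illustrative entry

    static boolean isSupportedSystemFunc(UserDefinedScalarFunc udf) {
      // Exact match against the allow-list instead of startsWith("iceberg").
      return SUPPORTED_CANONICAL_NAMES.contains(udf.canonicalName());
    }
  }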
      return false;
    } else {
      return Arrays.stream(predicate.children()).skip(1).allMatch(SparkV2Filters::isLiteral);
    }
  }

  /** Should be called after {@link #couldConvert} passed */
  private static <T> UnboundTerm<Object> toTerm(T input) {
It would be really nice if Spark passed the BoundFunction in through the expression!
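As a rough idea of what such a term conversion involves, here is a hypothetical fragment (not the PR's actual toTerm) that maps a pushed-down system function call to an Iceberg term via org.apache.iceberg.expressions.Expressions; the class name, the assumption that the column reference is the last child, and the simple-column handling are all illustrative:

  import org.apache.iceberg.expressions.Expressions;
  import org.apache.iceberg.expressions.UnboundTerm;
  import org.apache.spark.sql.connector.expressions.NamedReference;
  import org.apache.spark.sql.connector.expressions.UserDefinedScalarFunc;

  class TermSketch {
    // Hypothetical sketch: map a partition-function call to an Iceberg term,
    // assuming a simple (non-nested) column reference as the last child.
    static UnboundTerm<?> toIcebergTerm(UserDefinedScalarFunc udf) {
      NamedReference ref = (NamedReference) udf.children()[udf.children().length - 1];
      String column = ref.fieldNames()[0];
      switch (udf.name()) {
        case "years":
          return Expressions.year(column);
        case "months":
          return Expressions.month(column);
        case "days":
          return Expressions.day(column);
        case "hours":
          return Expressions.hour(column);
        default:
          throw new IllegalArgumentException("Unsupported system function: " + udf.name());
      }
    }
  }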
    pushFilters(builder, predicate);
    scan = builder.build().toBatch();

    Assertions.assertThat(scan.planInputPartitions().length).isEqualTo(10);
Are there any tests that exercise the entire code path using SQL? I'd like to see at least one end-to-end test with SQL.
import org.junit.Before;
import org.junit.Test;

public class TestSystemFunctionPushDownDQL extends SparkExtensionsTestBase {
@rdblue here is the SQL test. We need to add tests for update/delete as well. Will add it later.
  @Test
  public void testYearsFunction() {
    Assume.assumeTrue(!catalogName.equals("spark_catalog"));
System functions cannot be resolved with spark_catalog. Got the following error. Will address it in a follow-up.
scala> spark.sql("select spark_catalog.system.years(cast('2017-11-22' as timestamp))")
org.apache.spark.sql.AnalysisException: [SCHEMA_NOT_FOUND] The schema `system` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a catalog, verify the current_schema() output, or qualify the name with the correct catalog.
To tolerate the error on drop use DROP SCHEMA IF EXISTS.; line 1 pos 7
at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
at org.apache.spark.sql.hive.HiveExternalCatalog.requireDbExists(HiveExternalCatalog.scala:56)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$functionExists$1(HiveExternalCatalog.scala:1352)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
Rather than setting up the catalog just to skip all the tests, can you just run this with one catalog for now?
OK, updated to run with the Hive catalog only.
    extensions.injectCheckRule { _ => MergeIntoIcebergTableResolutionCheck }
    extensions.injectCheckRule { _ => AlignedRowLevelIcebergCommandCheck }

    // optimizer extensions
    extensions.injectOptimizerRule { _ => ExtendedSimplifyConditionalsInPredicate }
    extensions.injectOptimizerRule { _ => ExtendedReplaceNullWithFalseInPredicate }
    extensions.injectOptimizerRule { spark => RewriteStaticInvoke(spark) }
This is put in the optimizer so that constant expressions have already been optimized, such as system.days(ts) = system.days(date('2022-09-08')). The system.days(date('2022-09-08')) call will be evaluated into a constant value.
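To sketch the intended effect (illustrative only; the expression shapes and the literal are assumptions using org.apache.iceberg.expressions.Expressions, not output captured from the rule):

  // Before the optimizer runs, both sides of the comparison are function calls:
  //   system.days(ts) = system.days(date('2022-09-08'))
  // After the rewrite plus constant folding, the right-hand side is a plain literal, so the
  // predicate that reaches SparkV2Filters compares days(ts) to a constant and converts to
  // an Iceberg expression roughly like:
  Expression expected = Expressions.equal(Expressions.day("ts"), 19243); // 19243 = day ordinal of 2022-09-08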
Can we do this in a separate PR? I don't think the basic support for functions should include this.
I intend to do that; however, the SQL UTs depend on it. I will submit a separate PR to do this.
Submitted #8088 and will add the SQL push-down UTs after this is in.
  // Controls whether to push down Iceberg system function
  public static final String SYSTEM_FUNC_PUSH_DOWN_ENABLED =
      "spark.sql.iceberg.system-function-push-down.enabled";
  public static final boolean SYSTEM_FUNC_PUSH_DOWN_ENABLED_DEFAULT = false;
Why is there a setting for this and why does it default to false?
Iceberg always has the option of rejecting functions by returning them rather than consuming them. I don't see a reason to have a flag for this.
Compared with StaticInvoke, ApplyFunctionExpression may have a slight decrease in performance because it cannot leverage codegen.
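For reference, a flag like this would presumably be toggled per session, e.g. (illustrative usage; only the property name comes from the snippet earlier in this thread, and spark is an existing SparkSession):

  // Enable pushing down Iceberg system functions for this session.
  spark.conf().set("spark.sql.iceberg.system-function-push-down.enabled", "true");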
I am excited about this functionality.
Personally, I have some questions about how the filters are working for the delete use case, and I think it might be good to have some tests for the filter pushdown on deletes since (from my first pass reading of the code) it might throw an exception.
It looks great, and I'm excited to have more filter pushdowns working with Iceberg :)
    for (Predicate predicate : predicates) {
      Expression converted = convert(predicate);
      Preconditions.checkArgument(
          converted != null, "Cannot convert Spark predicate to Iceberg expression: %s", predicate);
Even if it's following what was done elsewhere, doesn't this mean that if a non-Iceberg UDF predicate got pushed down, it would fail to push down the Iceberg expressions? Looking at the code, this only seems to be used (currently) in deleteWhere, but I think for the deleteWhere code path we should not throw.
-  public void deleteWhere(Filter[] filters) {
-    Expression deleteExpr = SparkFilters.convert(filters);
+  public void deleteWhere(Predicate[] predicates) {
+    Expression deleteExpr = SparkV2Filters.convert(predicates);
So this is the one case where we use convert on a list of predicates instead of one at a time; with canDeleteWhere using a for loop on the individual expressions, we can get different results. I think, given the use of convert on a list of predicates, we should change that to not throw, as originally suggested by @rdblue.
Please see the answers above. I think this is deliberate here.
Gotcha, I think I missed that this method will be invoked only if {@link #canDeleteWhere(Predicate[])} returns true (I was looking at the Iceberg trait this extends instead of the upstream source).
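To make that contract concrete, here is a rough sketch of the guard being referenced (approximate shape, not the PR's actual canDeleteWhere; Predicate is org.apache.spark.sql.connector.expressions.filter.Predicate, and SparkV2Filters.convert(Predicate) is assumed to return null for predicates it cannot convert, as the loop above suggests):

  @Override
  public boolean canDeleteWhere(Predicate[] predicates) {
    for (Predicate predicate : predicates) {
      if (SparkV2Filters.convert(predicate) == null) {
        // Reject the metadata delete: Spark will then not call deleteWhere with these
        // predicates, so deleteWhere is free to assume every predicate converts.
        return false;
      }
    }
    return true;
  }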
Thanks @holdenk for your interest and comments. I mentioned a little in the description.
Actually, I have submitted some of the code for the rule and the SQL UTs; however, they are split out to keep this PR small for review. I have a follow-up, #8088, which will be moved forward when this PR is merged.
Awesome :) Sorry for missing the previous discussion around that.
Merged! Thanks for working on this @ConeyLiu. Fantastic work.
Thanks @rdblue for merging this. And thanks all for your patient reviews.
Although I am only an Iceberg user, I think this issue should be discussed further. This solution solves the problem at hand, but it inherently hurts the fairness of Iceberg's support for different engines, since not all engines can support Iceberg's system functions. The ideal would be to implement transformational pushdown for complex conditions within each engine. While there may not be a conflict between these two options, I thought I'd speak up and say what I'm thinking.
@BsoBird this is not an issue, it's a pull request. If you would like to see support for pushing down transforms in other query engines, I would suggest starting a new issue for those engines.
Will this work in case of
I am not sure what the syntax for it will be.
@aokolnychyi @flyrain @RussellSpitzer Looking forward.
This PR adds support for pushing down system function filters via the Spark V2 filters. For example:
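The query below is illustrative (hypothetical catalog, table, and column names; the system.days call mirrors the discussion earlier in this conversation) and shows the kind of predicate that can now be converted to an Iceberg expression and pushed down:

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class SystemFunctionPushDownExample {
    public static void main(String[] args) {
      SparkSession spark = SparkSession.builder().getOrCreate();
      // The constant side of the system.days comparison is folded, and the remaining
      // days(ts) = <literal> predicate can be converted to an Iceberg expression.
      Dataset<Row> rows =
          spark.sql(
              "SELECT * FROM my_catalog.db.tbl "
                  + "WHERE my_catalog.system.days(ts) = my_catalog.system.days(date('2022-09-08'))");
      rows.show();
    }
  }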
This is the first part of the code. It changes filters to V2 filters and adds support for converting system functions wrapped in Spark UserDefinedScalarFunc to Iceberg expressions. A following PR will add a rule to convert the system function call to ApplyFunctionExpression, which can be pushed down to the data source.