Spark 3.4: Support pushing down system functions by V2 filters #7886

Merged · 17 commits · Aug 1, 2023

Conversation

@ConeyLiu (Contributor) commented Jun 23, 2023

This PR adds support for pushing down system function filters via the Spark V2 filters. For example:

-- table(id long, name string) partitioned by bucket(10, id)

SELECT * FROM iceberg.db.table WHERE iceberg.system.bucket(10, id) = 2;

DELETE FROM iceberg.db.table WHERE iceberg.system.bucket(10, id) = 2;

This is the first part of the work. It changes filters to V2 filters and adds support for converting system functions wrapped in Spark UserDefinedScalarFunc to Iceberg expressions. A following PR will add a rule to convert the system function call to ApplyFunctionExpression, which can be pushed down to the data source.
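
For readers, a minimal sketch of what such a conversion is expected to produce, assuming the transform term helpers in org.apache.iceberg.expressions.Expressions (an illustration, not the PR's code):

    import org.apache.iceberg.expressions.Expression;
    import org.apache.iceberg.expressions.Expressions;

    public class BucketPushDownSketch {
      public static void main(String[] args) {
        // WHERE iceberg.system.bucket(10, id) = 2 should end up as an Iceberg predicate
        // on the bucket(10) transform of the "id" column, which the scan can use to
        // prune matching files/partitions:
        Expression converted = Expressions.equal(Expressions.bucket("id", 10), 2);
        System.out.println(converted);
      }
    }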

@github-actions github-actions bot added the spark label Jun 23, 2023
@ConeyLiu (Contributor Author) commented:

Hi @rdblue, @aokolnychyi, @szehon-ho, @Fokko, @nastra, could you help review this when you are free?

@szehon-ho (Collaborator) left a comment:

Looks promising; wondering if we can split out the unrelated changes into another PR.

@rdblue (Contributor) commented Jul 4, 2023

@ConeyLiu, I'll take another look when this is rebased on #7898.

@ConeyLiu (Contributor Author) commented Jul 5, 2023

Thanks @rdblue, this has been rebased.

private static boolean isSystemFunc(org.apache.spark.sql.connector.expressions.Expression expr) {
  if (expr instanceof UserDefinedScalarFunc) {
    UserDefinedScalarFunc udf = (UserDefinedScalarFunc) expr;
    return udf.canonicalName().startsWith("iceberg")
(Member) commented:

Do we need this string check? I feel like all of the functions should be listed in the list on line 369.

@ConeyLiu (Contributor Author) commented:

There could possibly be some other UDF functions, like 'catalog.other.years'.

(Contributor) commented:

I think the problem is that this is using the UDF's canonicalName, but the other check uses the function's name. Those can differ: BucketFunction.name() returns bucket, which is what the user would call, while the canonical function name identifies the exact bound function no matter how it is loaded, so BucketInt.canonicalName() returns iceberg.bucket(int).

If we want to limit to just the Iceberg-defined functions, then this is necessary. It may be better to have a set of supported functions that uses just the canonicalName.

@ConeyLiu (Contributor Author) commented:

> It may be better to have a set of supported functions that uses just the canonicalName.

There are some canonicalName values that are not constants, for example:

    @Override
    public String canonicalName() {
      return String.format("iceberg.bucket(%s)", sqlType.catalogString());
    }

@rdblue (Contributor) commented Jul 14, 2023:

While that's not a constant, there are a limited number of values that sqlType.catalogString() might return. I think it would be better to enumerate them, rather than just looking for iceberg and mapping the name.
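
For illustration, a minimal sketch of that approach, checking against an enumerated set of canonical names (the set contents are hypothetical except iceberg.bucket(int), which is the only value confirmed above; this is not the PR's final code):

    import java.util.Set;
    import org.apache.spark.sql.connector.expressions.UserDefinedScalarFunc;

    class SupportedSystemFuncs {
      // Hypothetical enumeration; the real set would list every canonical name the
      // Iceberg function catalog can produce (one per supported SQL type).
      private static final Set<String> SUPPORTED_CANONICAL_NAMES =
          Set.of("iceberg.bucket(int)", "iceberg.bucket(bigint)");

      static boolean isSupportedSystemFunc(UserDefinedScalarFunc udf) {
        return SUPPORTED_CANONICAL_NAMES.contains(udf.canonicalName());
      }
    }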

@ConeyLiu (Contributor Author) commented:

I have tried to do that; however, TruncateDecimal needs precision and scale, which are runtime values.

    @Override
    public String canonicalName() {
      return String.format("iceberg.truncate(decimal(%d,%d))", precision, scale);
    }

(Contributor) commented:

You can similarly enumerate them, but this isn't a big deal.

      return false;
    } else {
      return Arrays.stream(predicate.children()).skip(1).allMatch(SparkV2Filters::isLiteral);
    }
  }

  /** Should be called after {@link #couldConvert} passed */
  private static <T> UnboundTerm<Object> toTerm(T input) {
(Contributor) commented:

It would be really nice if Spark passed the BoundFunction in through the expression!

pushFilters(builder, predicate);
scan = builder.build().toBatch();

Assertions.assertThat(scan.planInputPartitions().length).isEqualTo(10);
(Contributor) commented:

Are there any tests that exercise the entire code path using SQL? I'd like to see at least one end-to-end test with SQL.

import org.junit.Before;
import org.junit.Test;

public class TestSystemFunctionPushDownDQL extends SparkExtensionsTestBase {
@ConeyLiu (Contributor Author) commented:

@rdblue here is the SQL test. We need to add tests for update/delete as well; will add them later.


  @Test
  public void testYearsFunction() {
    Assume.assumeTrue(!catalogName.equals("spark_catalog"));
@ConeyLiu (Contributor Author) commented Jul 12, 2023:

System functions cannot be resolved with spark_catalog; got the following error. Will address it in a follow-up.

scala> spark.sql("select spark_catalog.system.years(cast('2017-11-22' as timestamp))")
org.apache.spark.sql.AnalysisException: [SCHEMA_NOT_FOUND] The schema `system` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a catalog, verify the current_schema() output, or qualify the name with the correct catalog.
To tolerate the error on drop use DROP SCHEMA IF EXISTS.; line 1 pos 7
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
  at org.apache.spark.sql.hive.HiveExternalCatalog.requireDbExists(HiveExternalCatalog.scala:56)
  at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$functionExists$1(HiveExternalCatalog.scala:1352)
  at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)

(Contributor) commented:

Rather than setting up the catalog just to skip all the tests, can you just run this with one catalog for now?

@ConeyLiu (Contributor Author) commented:

OK, updated to run with the Hive catalog only.

extensions.injectCheckRule { _ => MergeIntoIcebergTableResolutionCheck }
extensions.injectCheckRule { _ => AlignedRowLevelIcebergCommandCheck }

// optimizer extensions
extensions.injectOptimizerRule { _ => ExtendedSimplifyConditionalsInPredicate }
extensions.injectOptimizerRule { _ => ExtendedReplaceNullWithFalseInPredicate }
extensions.injectOptimizerRule { spark => RewriteStaticInvoke(spark) }
@ConeyLiu (Contributor Author) commented:

This is put in the optimizer so that constant expressions have already been folded. For example, in system.days(ts) = system.days(date('2022-09-08')), the system.days(date('2022-09-08')) call will have been evaluated into a constant value.

(Contributor) commented:

Can we do this in a separate PR? I don't think the basic support for functions should include this.

@ConeyLiu (Contributor Author) commented:

I intend to do that; however, the SQL UTs depend on it. I will submit a separate PR to do this.

@ConeyLiu (Contributor Author) commented:

Submitted #8088 and will add the SQL push-down UTs after this is in.

// Controls whether to push down Iceberg system function
public static final String SYSTEM_FUNC_PUSH_DOWN_ENABLED =
    "spark.sql.iceberg.system-function-push-down.enabled";
public static final boolean SYSTEM_FUNC_PUSH_DOWN_ENABLED_DEFAULT = false;
(Contributor) commented:

Why is there a setting for this and why does it default to false?

Iceberg always has the option of rejecting functions by returning them rather than consuming them. I don't see a reason to have a flag for this.
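
For context, this refers to Spark's V2 filter push-down contract, where the scan builder returns whatever it cannot handle and Spark evaluates those predicates itself. A simplified sketch of that contract (not Iceberg's actual implementation):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.sql.connector.expressions.filter.Predicate;
    import org.apache.spark.sql.connector.read.SupportsPushDownV2Filters;

    abstract class PushDownSketch implements SupportsPushDownV2Filters {
      private Predicate[] pushed = new Predicate[0];

      @Override
      public Predicate[] pushPredicates(Predicate[] predicates) {
        List<Predicate> accepted = new ArrayList<>();
        List<Predicate> rejected = new ArrayList<>();
        for (Predicate predicate : predicates) {
          if (canConvert(predicate)) {
            accepted.add(predicate);
          } else {
            rejected.add(predicate); // returned to Spark, which evaluates it post-scan
          }
        }
        pushed = accepted.toArray(new Predicate[0]);
        return rejected.toArray(new Predicate[0]);
      }

      @Override
      public Predicate[] pushedPredicates() {
        return pushed;
      }

      // Hypothetical hook standing in for the actual conversion logic.
      abstract boolean canConvert(Predicate predicate);
    }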

@ConeyLiu (Contributor Author) commented:

Compared with StaticInvoke, ApplyFunctionExpression may have a slight performance decrease because it cannot leverage codegen.
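
For anyone wanting to try the feature, the flag can be set like any other session conf (a usage sketch; assumes the property defined above is read as a runtime session conf, with the catalog/table names taken from the example in the description):

    import org.apache.spark.sql.SparkSession;

    public class EnablePushDownExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();
        // Opt in to system function push-down; it defaults to false per the constant above.
        spark.conf().set("spark.sql.iceberg.system-function-push-down.enabled", "true");
        spark.sql("SELECT * FROM iceberg.db.table WHERE iceberg.system.bucket(10, id) = 2").show();
      }
    }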

@holdenk (Contributor) left a comment:

I am excited about this functionality.

Personally, I have some questions about how the filters are working for the delete use case, and I think it might be good to have some tests for the filter pushdown on deletes since (from my first pass reading of the code) it might throw an exception.

It looks great, and I'm excited to have more filter pushdowns working with Iceberg :)

for (Predicate predicate : predicates) {
  Expression converted = convert(predicate);
  Preconditions.checkArgument(
      converted != null, "Cannot convert Spark predicate to Iceberg expression: %s", predicate);
(Contributor) commented:

Even if it's following what was done elsewhere, doesn't this mean that if a non-Iceberg UDF predicate got pushed down, it would fail to push down the Iceberg expressions? Looking at the code, this only seems to be used (currently) in deleteWhere, but I think for the deleteWhere code path we should not throw.

- public void deleteWhere(Filter[] filters) {
-   Expression deleteExpr = SparkFilters.convert(filters);
+ public void deleteWhere(Predicate[] predicates) {
+   Expression deleteExpr = SparkV2Filters.convert(predicates);
(Contributor) commented:

So this is the one case where we use convert on a list of predicates instead of one at a time; with canDeleteWhere using a for loop on the individual expressions, we can get different results.

I think given the use case of convert + a list of predicates, we should change that to not throw, as originally suggested by @rdblue.

@ConeyLiu (Contributor Author) commented:

Please see the answers above. I think this is deliberate here.

(Contributor) commented:

Gotcha, I think I missed the "this method will be invoked only if {@link #canDeleteWhere(Predicate[])} returns true" note. (I was looking at the Iceberg trait this extends instead of the upstream source.)
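
To make the contract concrete, a simplified sketch of the two-phase canDeleteWhere/deleteWhere pattern being discussed (hypothetical class and method bodies, not the PR's code):

    import org.apache.iceberg.expressions.Expression;
    import org.apache.spark.sql.connector.expressions.filter.Predicate;

    abstract class DeleteContractSketch {
      // Phase 1: Spark asks whether a metadata-only delete is possible. If any predicate
      // cannot be converted, the whole delete is rejected and Spark falls back to a
      // row-level rewrite plan instead.
      boolean canDeleteWhere(Predicate[] predicates) {
        for (Predicate predicate : predicates) {
          if (convert(predicate) == null) {
            return false;
          }
        }
        return true;
      }

      // Phase 2: only invoked after canDeleteWhere returned true, so a failed conversion
      // here indicates a bug; throwing (as the PR's convert does) is deliberate.
      void deleteWhere(Predicate[] predicates) {
        for (Predicate predicate : predicates) {
          Expression expr = convert(predicate);
          if (expr == null) {
            throw new IllegalArgumentException("Cannot convert Spark predicate: " + predicate);
          }
          // ... apply expr to drop matching data files ...
        }
      }

      // Hypothetical hook standing in for SparkV2Filters.convert.
      abstract Expression convert(Predicate predicate);
    }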

@ConeyLiu (Contributor Author) commented Aug 1, 2023

Thanks @holdenk for your interest and comments.

> Personally, I have some questions about how the filters are working for the delete use case, and I think it might be good to have some tests for the filter pushdown on deletes since (from my first pass reading of the code) it might throw an exception.

I mentioned this a little in the description:

> This is the first part of the work. It changes filters to V2 filters and adds support for converting system functions wrapped in Spark UserDefinedScalarFunc to Iceberg expressions. A following PR will add a rule to convert the system function call to ApplyFunctionExpression, which can be pushed down to the data source.

Actually, I have submitted some of the code for the rule and the SQL UTs. However, they are split out to keep this PR small for review. I have a follow-up, #8088, which will be moved forward once this PR is merged.

@holdenk (Contributor) commented Aug 1, 2023

> Actually, I have submitted some of the code for the rule and the SQL UTs. However, they are split out to keep this PR small for review. I have a follow-up, #8088, which will be moved forward once this PR is merged.

Awesome :) Sorry for missing the previous discussion around canDeleteWhere, and happy to learn there is a follow-on PR with more UTs :)

@rdblue rdblue merged commit 6875577 into apache:master Aug 1, 2023
31 checks passed
@rdblue (Contributor) commented Aug 1, 2023

Merged! Thanks for working on this @ConeyLiu. Fantastic work.

@ConeyLiu (Contributor Author) commented Aug 2, 2023

Thanks @rdblue for merging this, and thanks to everyone for the patient reviews.

@BsoBird commented Oct 31, 2023

Although I am only an Iceberg user, I think this should be discussed further. This solution solves the problem at hand, but it inherently hurts the fairness of Iceberg's support across different engines, since not all engines can support Iceberg's system functions. The ideal would be to implement transform pushdown for complex conditions within each engine. While there may not be a conflict between these two options, I thought I'd speak up and say what I'm thinking.

@RussellSpitzer (Member) commented:
@BsoBird this is not an issue, it's a pull request. If you would like to see support for pushing down transforms in other query engines, I would suggest starting a new issue for those engines.

@anigos commented Feb 29, 2024

Will this work in the case of MERGE INTO?

MERGE INTO TAB1 trg USING TAB2 src ON (
  trg.serial_id = src.serial_id 
)
AND (
  (
    iceberg.system.bucket(10, id) = 2 
  )
)
WHEN MATCHED THEN
UPDATE
SET
  trg.data = src.data,
  trg.category = src.category
WHEN NOT MATCHED THEN
INSERT
  (serial_id, data, category)
VALUES
  (
    src.serial_id, 
    src.data,
    src.category
  );

@anigos commented Feb 29, 2024

I am not sure what the syntax for it should be. iceberg.system.bucket(10, id) is not working, and neither is trg._partition.id_bucket = 1.


  spark.sql(""" MERGE INTO iceberg.sample_trg trg USING iceberg.sample_src src ON (
              |  trg.id = src.id
              |)
              |AND (trg._partition.id_bucket = 1 and trg._partition.id_bucket = 7 )
              |WHEN MATCHED THEN
              |UPDATE
              |SET
              |  trg.data = src.data,
              |  trg.category = src.category
              |  WHEN NOT MATCHED THEN
              |INSERT
              |  (id, data, category, ts)
              |VALUES
              |  (
              |    src.id,
              |    src.data,
              |    src.category,
              |    src.ts
              |  );
              |  """.stripMargin)

@aokolnychyi @flyrain @RussellSpitzer Looking forward.
