
Pushdown SUBSTRING filter when equivalent to STARTSWITH #8911

Closed
tombarti opened this issue Oct 23, 2023 · 6 comments

@tombarti

Feature Request / Improvement

Summary

When filtering an Iceberg table in Spark, would it be possible to push down SUBSTRING filters when the substring starts at the beginning of the string (position 1)?

For example, would it be possible to push down to the BatchScan this filter:

WHERE SUBSTRING(mycol, 1, 3) IN ('foo', 'bar', 'baz')

Since it is equivalent to:

WHERE STARTSWITH(mycol, 'foo') OR STARTSWITH(mycol, 'bar') OR STARTSWITH(mycol, 'baz')

Which does indeed get pushed down, as can be seen from the physical plan where it is included in the BatchScan:

== Physical Plan ==
*(1) ColumnarToRow
+- BatchScan catalog.db.mytable[mycol#9, name#10] catalog.db.mytable (branch=null) [filters=((mycol LIKE 'foo%' OR mycol LIKE 'bar%') OR mycol LIKE 'baz%'), groupedBy=] RuntimeFilters: []

Use Case

Suppose I have a table containing location-related data, with a geohash column that is used to partition the data as follows:

CREATE TABLE IF NOT EXISTS dev.db.locations (
    geohash string,
    location_name string,
    identifier integer
)
USING ICEBERG
PARTITIONED BY (truncate(2, geohash))

Now let's insert some data:

INSERT INTO dev.db.locations VALUES
   ('fec229', 'name_1', 1),
   ('fez228', 'name_2', 2),
   ('f2c229', 'name_1', 3),
   ('c2c23c', 'name_2', 4),
   ('c2c22c', 'name_3', 5),
   ('h3c236', 'name_3', 6)

I would like the filter to be pushed down when performing the following sort of query:

SELECT *
FROM dev.db.locations
WHERE SUBSTRING(geohash, 1, n) IN (...)

Where n could vary from one query to another depending on the precision (the length) of the geohashes we want to filter on. For example, if we are interested in geohashes of precision 2, this would be:

SELECT *
FROM dev.db.locations
WHERE SUBSTRING(geohash, 1, 2) IN ('fe', 'c2')

This is currently not the case, as can be seen from the physical plan generated by the above query:

== Physical Plan ==
*(1) Filter substring(geohash#9, 1, 2) IN (fe,c2)
+- *(1) ColumnarToRow
   +- BatchScan dev.db.locations[geohash#9, location_name#10, identifier#11] dev.db.locations (branch=null) [filters=, groupedBy=] RuntimeFilters: []

Important

Note that in this use case, the IN (...) set could contain hundreds of thousands of elements. Would this be viable?
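
For illustration, a minimal workaround sketch under today's behaviour, assuming the Spark Scala DataFrame API (the two prefixes below are only an example, and whether a disjunction over hundreds of thousands of prefixes stays practical is exactly the question above), is to build the OR of StartsWith filters by hand:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Illustrative prefix set; in the real use case this could be very large.
val prefixes = Seq("fe", "c2")

// Build STARTSWITH(geohash, 'fe') OR STARTSWITH(geohash, 'c2') OR ...
val prefixFilter: Column = prefixes
  .map(p => col("geohash").startsWith(p))
  .reduce(_ || _)

val filtered = spark.table("dev.db.locations").where(prefixFilter)
// The BatchScan should now list filters such as (geohash LIKE 'fe%' OR geohash LIKE 'c2%').
filtered.explain()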

Query engine

Spark

@RussellSpitzer
Member

I think this could be done, but I believe the fix would have to be in Spark. Spark needs to be able to convert Substring => StartsWith on their end.

@tombarti
Author

Thanks for the quick reply @RussellSpitzer. So what you are saying is that this really should be implemented in Spark, and once it is, there is not much to do on the Iceberg side?

@RussellSpitzer
Member

RussellSpitzer commented Oct 23, 2023

Iceberg uses the Datasource API from Spark, so we only see filters and expressions that Spark decides to pass through to us. In this case "substring" is just not an expression it can push through. What it can push through is "StartsWith" so in Spark we would want an analysis rule that converted Substring(1, X) => StartsWith.
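
For concreteness, a minimal sketch of what such a rule might look like, assuming Spark 3.x Catalyst internals (the rule name, the length guard, and the wiring below are all illustrative, not an existing Spark or Iceberg API):

import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression, In, Literal, Or, StartsWith, Substring}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.{IntegerType, StringType}
import org.apache.spark.unsafe.types.UTF8String

object RewriteSubstringPrefixFilter extends Rule[LogicalPlan] {

  // The rewrite is only safe when the literal is a non-null string of exactly `len` characters;
  // a shorter literal can never equal an n-character substring but could still be a prefix.
  private def isPrefixLiteral(e: Expression, len: Int): Boolean = e match {
    case Literal(s: UTF8String, StringType) => s.numChars == len
    case _ => false
  }

  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // SUBSTRING(col, 1, n) = 'abc'        ==>  STARTSWITH(col, 'abc')
    case EqualTo(Substring(child, Literal(1, IntegerType), Literal(len: Int, IntegerType)), lit)
        if isPrefixLiteral(lit, len) =>
      StartsWith(child, lit)

    // SUBSTRING(col, 1, n) IN ('a', ...)  ==>  STARTSWITH(col, 'a') OR ...
    case In(Substring(child, Literal(1, IntegerType), Literal(len: Int, IntegerType)), values)
        if values.nonEmpty && values.forall(isPrefixLiteral(_, len)) =>
      values.map(v => StartsWith(child, v): Expression).reduce(Or(_, _))
  }
}

// Hypothetical wiring for experimentation only: inject the rule at analysis time via
// SparkSessionExtensions (spark.sql.extensions=<this class>) so it runs before DataSource V2
// filter pushdown. A proper fix would live in Spark itself.
class SubstringPrefixExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit =
    extensions.injectResolutionRule(_ => RewriteSubstringPrefixFilter)
}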

Another possible avenue to support this sort of thing would be to use the Iceberg truncate expression with an IN clause. That may be possible in just Iceberg.
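
For that second avenue, a hedged sketch of what the query could look like using Iceberg's truncate system function, so the predicate lines up with the truncate(2, geohash) partition transform (this assumes dev is an Iceberg catalog on a Spark/Iceberg combination that exposes system functions; the exact invocation and whether it gets pushed down depend on the versions in use):

// Sketch only: filter on the same transform that partitions the table.
val byPrefix = spark.sql("""
  SELECT *
  FROM dev.db.locations
  WHERE dev.system.truncate(2, geohash) IN ('fe', 'c2')
""")
// Inspect whether the predicate now appears in the BatchScan's filters.
byPrefix.explain()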

@tombarti
Author

Thanks for taking the time to explain, that all makes sense now!

I can see that #7886 in Iceberg 1.4.0 could be helpful for the other avenue you are suggesting!


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the 'not-stale' label, but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Sep 25, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 10, 2024