
Pushdown SUBSTRING filter when equivalent to STARTSWITH #8911

Closed
tombarti opened this issue Oct 23, 2023 · 6 comments

@tombarti

Feature Request / Improvement

Summary

When filtering an Iceberg table in Spark, would it be possible to push down SUBSTRING filters when the substring starts at the beginning of the string (position 1)?

For example, would it be possible to push down to the BatchScan this filter:

WHERE SUBSTRING(mycol, 1, 3) IN ('foo', 'bar', 'baz')

Since it is equivalent to:

WHERE STARTSWITH(mycol, 'foo') OR STARTSWITH(mycol, 'bar') OR STARTSWITH(mycol, 'baz')

Which does indeed get pushed down, as can be seen from the physical plan where it is included in the BatchScan:

== Physical Plan ==
*(1) ColumnarToRow
+- BatchScan catalog.db.mytable[mycol#9, name#10] catalog.db.mytable (branch=null) [filters=((mycol LIKE 'foo%' OR mycol LIKE 'bar%') OR mycol LIKE 'baz%'), groupedBy=] RuntimeFilters: []

Use Case

Suppose I have a table containing location-related data, with a geohash column that is used to partition the data as follows:

CREATE TABLE IF NOT EXISTS dev.db.locations (
    geohash string,
    location_name string,
    identifier integer
)
USING ICEBERG
PARTITIONED BY (truncate(2, geohash))

Now let's insert some data:

INSERT INTO dev.db.locations VALUES
   ('fec229', 'name_1', 1),
   ('fez228', 'name_2', 2),
   ('f2c229', 'name_1', 3),
   ('c2c23c', 'name_2', 4),
   ('c2c22c', 'name_3', 5),
   ('h3c236', 'name_3', 6)

I would like the filter to be pushed down when performing the following sort of query:

SELECT *
FROM dev.db.locations
WHERE SUBSTRING(geohash, 1, n) IN (...)

Where n could vary from one query to another depending on the precision (the length) of the geohashes we want to filter on. For example, if we are interested in geohashes of precision 2, this would be:

SELECT *
FROM dev.db.locations
WHERE SUBSTRING(geohash, 1, 2) IN ('fe', 'c2')

This is currently not the case, as can be seen from the physical plan generated by the above query:

== Physical Plan ==
*(1) Filter substring(geohash#9, 1, 2) IN (fe,c2)
+- *(1) ColumnarToRow
   +- BatchScan dev.db.locations[geohash#9, location_name#10, identifier#11] dev.db.locations (branch=null) [filters=, groupedBy=] RuntimeFilters: []

Important

Note that in this use case, the IN (...) set could contain hundreds of thousands of elements. Would this be viable?
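
For illustration, a minimal workaround sketch under today's behaviour, assuming the Spark Scala DataFrame API (the two prefixes below are only an example, and whether a disjunction over hundreds of thousands of prefixes stays practical is exactly the question above), is to build the OR of StartsWith filters by hand:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Illustrative prefix set; in the real use case this could be very large.
val prefixes = Seq("fe", "c2")

// Build STARTSWITH(geohash, 'fe') OR STARTSWITH(geohash, 'c2') OR ...
val prefixFilter: Column = prefixes
  .map(p => col("geohash").startsWith(p))
  .reduce(_ || _)

val filtered = spark.table("dev.db.locations").where(prefixFilter)
// The BatchScan should now list filters such as (geohash LIKE 'fe%' OR geohash LIKE 'c2%').
filtered.explain()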

Query engine

Spark

@RussellSpitzer
Member

I think this could be done, but I believe the fix would have to be in Spark. Spark needs to be able to convert Substring => StartsWith on their end.

@tombarti
Author

Thanks for the quick reply @RussellSpitzer. So what you are saying is that this really should be implemented in Spark, and once it is, there is not much to do on the Iceberg side?

@RussellSpitzer
Member

RussellSpitzer commented Oct 23, 2023

Iceberg uses the Datasource API from Spark, so we only see filters and expressions that Spark decides to pass through to us. In this case "substring" is just not an expression it can push through. What it can push through is "StartsWith" so in Spark we would want an analysis rule that converted Substring(1, X) => StartsWith.
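
For concreteness, a minimal sketch of what such a rule might look like, assuming Spark 3.x Catalyst internals (the rule name, the length guard, and the wiring below are all illustrative, not an existing Spark or Iceberg API):

import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression, In, Literal, Or, StartsWith, Substring}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.{IntegerType, StringType}
import org.apache.spark.unsafe.types.UTF8String

object RewriteSubstringPrefixFilter extends Rule[LogicalPlan] {

  // The rewrite is only safe when the literal is a non-null string of exactly `len` characters;
  // a shorter literal can never equal an n-character substring but could still be a prefix.
  private def isPrefixLiteral(e: Expression, len: Int): Boolean = e match {
    case Literal(s: UTF8String, StringType) => s.numChars == len
    case _ => false
  }

  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // SUBSTRING(col, 1, n) = 'abc'        ==>  STARTSWITH(col, 'abc')
    case EqualTo(Substring(child, Literal(1, IntegerType), Literal(len: Int, IntegerType)), lit)
        if isPrefixLiteral(lit, len) =>
      StartsWith(child, lit)

    // SUBSTRING(col, 1, n) IN ('a', ...)  ==>  STARTSWITH(col, 'a') OR ...
    case In(Substring(child, Literal(1, IntegerType), Literal(len: Int, IntegerType)), values)
        if values.nonEmpty && values.forall(isPrefixLiteral(_, len)) =>
      values.map(v => StartsWith(child, v): Expression).reduce(Or(_, _))
  }
}

// Hypothetical wiring for experimentation only: inject the rule at analysis time via
// SparkSessionExtensions (spark.sql.extensions=<this class>) so it runs before DataSource V2
// filter pushdown. A proper fix would live in Spark itself.
class SubstringPrefixExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit =
    extensions.injectResolutionRule(_ => RewriteSubstringPrefixFilter)
}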

Another possible avenue to support this sort of thing would be to use the Iceberg truncate expression with an IN clause. That may be possible in just Iceberg.
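
For that second avenue, a hedged sketch of what the query could look like using Iceberg's truncate system function, so the predicate lines up with the truncate(2, geohash) partition transform (this assumes dev is an Iceberg catalog on a Spark/Iceberg combination that exposes system functions; the exact invocation and whether it gets pushed down depend on the versions in use):

// Sketch only: filter on the same transform that partitions the table.
val byPrefix = spark.sql("""
  SELECT *
  FROM dev.db.locations
  WHERE dev.system.truncate(2, geohash) IN ('fe', 'c2')
""")
// Inspect whether the predicate now appears in the BatchScan's filters.
byPrefix.explain()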

@tombarti
Author

Thanks for taking the time to explain, that all makes sense now!

I can see that #7886 in Iceberg 1.4.0 could be helpful for the other avenue you are suggesting!


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the 'not-stale' label, but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Sep 25, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 10, 2024