Pushdown SUBSTRING filter when equivalent to STARTSWITH #8911
Comments
I think this could be done, but I believe the fix would have to be in Spark. Spark needs to be able to convert Substring => StartsWith on its end.
Thanks for the quick reply @RussellSpitzer. So what you are saying is that this really should be implemented in Spark, and once it is, there is not much to do on the Iceberg side?
Iceberg uses the DataSource API from Spark, so we only see the filters and expressions that Spark decides to pass through to us. In this case, "substring" is just not an expression it can push through. What it can push through is "StartsWith", so in Spark we would want an analysis rule that converts Substring(1, X) => StartsWith. Another possible avenue to support this sort of thing would be to use the Iceberg truncate expression and an IN clause. That may be possible in Iceberg alone.
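To make the Substring(1, X) => StartsWith idea concrete, here is a minimal sketch in plain Python of the case analysis such a rewrite would need. It is not Spark or Iceberg code: `Rewritten`, `rewrite_prefix_substring`, and `evaluate` are hypothetical names invented for illustration, NULL handling is ignored, and only equality against a string literal is covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rewritten:
    """Hypothetical result of rewriting SUBSTRING(col, 1, n) = literal."""
    kind: str                    # "starts_with", "equals", or "always_false"
    value: Optional[str] = None  # literal to compare against, if any


def rewrite_prefix_substring(pos: int, length: int, literal: str) -> Optional[Rewritten]:
    """Rewrite SUBSTRING(col, pos, length) = literal into a simpler predicate.

    Returns None when the predicate is not a prefix check, so the caller
    leaves it untouched.
    """
    if pos != 1 or length <= 0:
        return None
    if len(literal) == length:
        # The first `length` characters must equal the literal exactly.
        return Rewritten("starts_with", literal)
    if len(literal) < length:
        # The substring can only be shorter than `length` when the whole
        # column value is that short, so the column must equal the literal.
        return Rewritten("equals", literal)
    # A substring of at most `length` characters can never equal a longer literal.
    return Rewritten("always_false")


def evaluate(rewritten: Rewritten, value: str) -> bool:
    """Evaluate a rewritten predicate against one non-null column value."""
    if rewritten.kind == "starts_with":
        return value.startswith(rewritten.value)
    if rewritten.kind == "equals":
        return value == rewritten.value
    return False
```

The point of the sketch is that the rewrite is only a plain StartsWith when the literal is exactly as long as the requested substring. A shorter literal turns into an equality check, and a longer one can never match.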
Thanks for taking the time to explain, that all makes sense now! I can see that #7886 in Iceberg 1.4.0 could be handy for the other avenue you are suggesting!
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Feature Request / Improvement
Summary
When filtering an Iceberg table in Spark, would it be possible to push down `SUBSTRING` filters when the substring begins at the start of the string (position `1`)?
For example, would it be possible to push down to the `BatchScan` this filter:
Since it is equivalent to:
Which does indeed get pushed down, as I can see from the physical plan that it is included in the `BatchScan`:
Use Case
Suppose I have a table which contains location related data with a geohash column which is used to partition the data as follows:
Now let's insert some data:
I would like the filter to be pushed down when performing the following sort of query:
Where `n` could vary in size from one query to another depending on the precision (the length) of the geohashes we want to filter on. For example, if we are interested in geohashes of precision 2, this would be:
This is currently not the case, as can be seen from the physical plan generated by the above query:
Important
Note that in this use case, the `IN (...)` set could contain hundreds of thousands of elements. Would this be viable?
Query engine
Spark
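As a rough illustration of the use case above, the sketch below shows in plain Python why a `SUBSTRING(geohash, 1, n) IN (...)` filter is the same check as "truncate to `n` characters, then test set membership", and how that check could prune partitions. It is not Iceberg or Spark code. All function names are hypothetical, and it assumes a table partitioned by a fixed-width prefix of the geohash, which may differ from the table in this issue. It also assumes every prefix in the set has exactly `precision` characters.

```python
from typing import FrozenSet, Iterable, List


def truncate(width: int, value: str) -> str:
    """First `width` characters of a string (a prefix truncation)."""
    return value[:width]


def row_matches(geohash: str, precision: int, prefixes: FrozenSet[str]) -> bool:
    """Same result as SUBSTRING(geohash, 1, precision) IN (prefixes)."""
    return truncate(precision, geohash) in prefixes


def prune_partitions(
    partition_values: Iterable[str],
    partition_width: int,
    precision: int,
    prefixes: FrozenSet[str],
) -> List[str]:
    """Keep only partitions that could hold a matching row.

    Assumes (hypothetically) that each partition value is
    truncate(partition_width, geohash) for the rows it contains.
    """
    if partition_width >= precision:
        # The partition value already carries the first `precision` characters.
        return [p for p in partition_values if truncate(precision, p) in prefixes]
    # The partition is coarser than the filter: keep it if any requested
    # prefix falls inside it.
    coarse = frozenset(truncate(partition_width, x) for x in prefixes)
    return [p for p in partition_values if p in coarse]
```

Because the prefixes sit in a hash set, each lookup in this sketch takes constant time on average, so a set with hundreds of thousands of elements is cheap here. That says nothing about how Spark or Iceberg would plan or serialize an `IN` list of that size.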