Parquet: Support Page Skipping in Iceberg Parquet Reader #1566
Conversation
Thanks for opening this PR, @shangxinli! I'll take a look at this soon. It's great to see you have it working.
FYI @aokolnychyi, @prodeezy: this PR adds support for page-level skipping in Parquet.
Force-pushed from 0f330e0 to a951e38
Force-pushed from a951e38 to cd70cac
@@ -582,6 +582,13 @@ public ReadBuilder withNameMapping(NameMapping newNameMapping) {
      optionsBuilder = ParquetReadOptions.builder();
    }

    if (filter != null &&
        schema.getAliases() != null &&
What is the rationale behind the `schema.getAliases() != null` check?
Without the check, an NPE is thrown in https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/ParquetFilters.java#L120. There was some discussion about this in #193 with Ryan.
The Iceberg table schema will rarely, if ever, have aliases. I think after my recommendation at #1566 (comment) you should no longer see the NPE, and this check can be removed.
I think this is also in line with what @rdblue says at #193 (comment):
> When converting to the Parquet FilterApi, the aliases must come from a Schema instance that was created from the MessageType of the file you're pushing filters to.
Makes sense.
        schema.getAliases() != null &&
        ParquetFilters.isSupportedFilter(filter, schema, caseSensitive)) {
      optionsBuilder.useRecordFilter(filterRecords);
      optionsBuilder.withRecordFilter(ParquetFilters.convert(schema, filter, caseSensitive));
I think the first argument to `ParquetFilters.convert` should be the file schema and not the table schema. Since Iceberg allows column renaming, a column in the table schema may be named differently in the file. If you look at:
`String path = schema.idToAlias(ref.fieldId());`
`type = schemaReader.getFileMetaData().getSchema();`
Do you recommend moving `type = schemaReader.getFileMetaData().getSchema()` to the beginning of this method so that we can use the file schema instead of the table schema? I worry it could result in more storage API calls, because today, when the condition `if (readerFunc != null || batchedReaderFunc != null)` is true, we don't make this API call at all.
We don't have to move it to the very top; we can fetch the file schema only if the filter is actually going to be used. It will lead to an extra open call, but I am not sure I see a good way around it.
Yes, we call it when we need to convert.
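For concreteness, here is a minimal sketch of that lazy fetch plus file-schema conversion (an illustration only, not the PR's final code; it assumes this sits inside the read builder's `build()` method and that Iceberg's `ParquetSchemaUtil.convert(MessageType)` and `ParquetIO.file(...)` helpers are available):

```java
if (filter != null && ParquetFilters.isSupportedFilter(filter, schema, caseSensitive)) {
  // open the footer only when a supported filter will actually be pushed down
  try (ParquetFileReader schemaReader = ParquetFileReader.open(ParquetIO.file(file))) {
    MessageType fileType = schemaReader.getFileMetaData().getSchema();
    // convert the file's own MessageType so aliases resolve to the names written in this file
    Schema fileSchema = ParquetSchemaUtil.convert(fileType);
    optionsBuilder.useRecordFilter(filterRecords);
    optionsBuilder.withRecordFilter(ParquetFilters.convert(fileSchema, filter, caseSensitive));
  } catch (IOException e) {
    throw new RuntimeIOException(e, "Failed to read footer of %s", file.location());
  }
}
```

The extra footer read only happens on the filtered path, which matches the concern above about avoiding additional storage calls when no filter is pushed down.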
      Operation.LT,
      Operation.LT_EQ);

  private static final Set<Type.TypeID> SUPPORTED_TYPES = ImmutableSet.of(
Any reason why other types like `TIMESTAMP`, `STRING` and other binary columns are not supported? It seems like `ParquetFilters` is able to convert expressions for these types too.
Sorry, I was going off the UNSUPPORTED_TYPES defined in ExpressionToSearchArgument.java, which is for ORC rather than Parquet. I will remove these.
  private ParquetFilters() {
  }

  public static boolean isSupportedFilter(Expression expr, Schema schema, boolean caseSensitive) {
Rather than introducing more code to check whether the filter is supported, can we just eagerly try to convert the filter and use the exception as a signal that the filter is unsupported?
That is what I originally used, but I changed it because throwing and catching exceptions is pretty expensive. In the non-error case we should avoid that, although I understand that maintaining separate support-checking code is its own pain.
Gotcha! Since this is not in the critical loop of reading rows, I think the cost of catching an exception should not be significant in the grand scheme of things. I don't have a strong opinion on this, though.
We would throw it at most once per file, in the case of an unsupported filter. I don't have strong opinions either.
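For reference, a minimal sketch of the eager-conversion alternative being discussed (illustrative only; the `convert` signature follows the one used in this PR, and the exception type caught here is an assumption):

```java
// Try the conversion once per file; a failure simply means "no pushdown for this file".
FilterCompat.Filter recordFilter = null;
if (filter != null) {
  try {
    recordFilter = ParquetFilters.convert(schema, filter, caseSensitive);
  } catch (UnsupportedOperationException e) {
    // unsupported expression or type: fall back to reading without a record filter
    recordFilter = null;
  }
}

if (recordFilter != null) {
  optionsBuilder.useRecordFilter(filterRecords);
  optionsBuilder.withRecordFilter(recordFilter);
}
```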
    long blockRowCount = blocks.get(nextRowGroup).getRowCount();
    Preconditions.checkState(blockRowCount >= pages.getRowCount(),
        "Number of values in the block, %s, must be greater than or equal to the number of values after filtering, %s",
        blockRowCount, pages.getRowCount());
    long rowPosition = rowGroupsStartRowPos[nextRowGroup];
I am not sure if the `nextRowGroup` counter, and consequently `rowPosition`, will be accurate in the case of skipped pages. Having the correct `rowPosition` relative to the start of the file is critical for applying row-level deletes. Can you add a test to `TestSparkParquetReadMetadataColumns` which would invoke page skipping?
After reading more about Parquet and page skipping, I don't think my comment above holds true in the general case. It seems like the `nextRowGroup` counter should still accurately track the current row group being processed. The only edge case I see is in https://github.com/apache/parquet-mr/blob/fe9ab5d22c602776f44b027ad3cc05609eddba54/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L967, where a whole row group may be skipped if the column index allows all the pages in that row group to be skipped. In that case, `readNextFilteredRowGroup` will automatically advance to the next block without our knowledge, and hence without the counter being updated. This would be an issue for all further row group skipping and row position calculations in Iceberg's ParquetReader. However, I am not sure whether there can be a case where Iceberg's row group skipping is not able to skip the row group, but Parquet's page skipping ends up skipping all pages and consequently the whole row group.
This edge case is probably possible with a carefully crafted file. For example, consider the min/max values of a single column:
Row group: [1, 1000]
Pages: [[1, 10], [900, 1000]]
The expression `column = 500` will match the row group but will not match any of the pages, resulting in the next row group being read automatically without the `nextRowGroup` counter being updated.
This is such an important comment! That is absolutely possible and we need to handle that case. Looking at the code again, I found that even the existing code is missing a `pages == null` check: it seems valid for readNextRowGroup() to return null (https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L899), and the caller needs to handle that, which it currently doesn't. It is more complicated for readNextFilteredRowGroup(), because it can advance internally without us knowing. I will add some handling code there. If you have suggestions, let me know.
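One possible shape for that handling, as a sketch (the placement inside the reader's advance logic and the `hasMoreRowGroups` flag are assumptions, not code from this PR):

```java
PageReadStore pages;
try {
  // readNextFilteredRowGroup() returns null when there is nothing left to read, e.g. at end
  // of file or when page filtering eliminated every remaining page
  pages = reader.readNextFilteredRowGroup();
} catch (IOException e) {
  throw new RuntimeIOException(e);
}

if (pages == null) {
  hasMoreRowGroups = false;  // hypothetical flag consumed by hasNext(); avoids the NPE below
  return;
}

nextRowGroup += 1;           // only count row groups that were actually returned
```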
@shardulm94, after looking into the code more deeply, I think there are two possible solutions here:
1) Change Parquet so that readNextFilteredRowGroup() returns not only the filtered pages but also the record count it skipped.
2) In the ReadConf constructor, when deciding `shouldRead`, also apply the ColumnIndex filter. This would ensure that any row group reaching readNextFilteredRowGroup() always has at least one page returned, so the reader never advances internally without Iceberg knowing. Here is where the `shouldRead` boolean is determined: https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L106
For option 1, we would need to wait for a new Parquet release; I am not sure releasing 1.10.2 is an option, and if not, we have to wait for Parquet 1.12.0. Option 2 requires a fairly significant change, and we would have to make sure the implementation here stays consistent with Parquet's ColumnIndex filter implementation.
Please share your thoughts and we can discuss.
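A rough sketch of what option 2 could look like in ReadConf (hypothetical; the stats and dictionary checks are paraphrased from the existing constructor, and `columnIndexShouldRead` is an assumed helper that would have to mirror Parquet's own ColumnIndexFilter evaluation, e.g. by reading the column and offset indexes via ParquetFileReader.readColumnIndex):

```java
// Only mark a row group as readable if column-index filtering would leave at least one page,
// so readNextFilteredRowGroup() can never silently advance past it.
boolean shouldRead = filter == null ||
    (statsFilter.shouldRead(typeWithIds, rowGroup) &&
        dictFilter.shouldRead(typeWithIds, rowGroup, reader.getDictionaryReader(rowGroup)) &&
        columnIndexShouldRead(reader, rowGroup, filter));  // assumed helper, not in this PR
```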
@shardulm94, I talked to the Parquet community (PARQUET-1927) and it seems we don't need a Parquet change after all. What we can do is use the ParquetFileReader constructor's filter (page-level stats, row-group stats, dictionary, and the future bloom filter). All of the subsequent calls like reader.getRowGroups() and reader.getFilteredRecordCount() in ReadConf then return filtered values. It also simplifies the ReadConf constructor and ParquetReader/VectorizedParquetReader.
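For illustration, a sketch of that approach assuming the parquet-mr 1.11+ APIs (the filter construction and `fileSchema` are simplified placeholders, not the PR's actual code):

```java
ParquetReadOptions options = ParquetReadOptions.builder()
    .withRecordFilter(ParquetFilters.convert(fileSchema, filter, caseSensitive))
    .useStatsFilter(true)          // row-group stats
    .useDictionaryFilter(true)     // dictionary pages
    .useColumnIndexFilter(true)    // page skipping via column/offset indexes
    .build();

ParquetFileReader reader = new ParquetFileReader(ParquetIO.file(file), options);
List<BlockMetaData> rowGroups = reader.getRowGroups();        // after row-group level filtering
long filteredRecordCount = reader.getFilteredRecordCount();   // after page-level filtering
```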
@shangxinli After going through the code, I don't think this solves the problem. `ParquetFileReader.getRowGroups` does not apply column index filters (only stats and dictionary). So there is still the case where `getRowGroups` returns a block that is then skipped entirely by `readNextFilteredRowGroup()`. I think the current test case does not accurately capture this.
Overall, I don't think page skipping plays nicely with Iceberg's requirement to accurately fetch row numbers for row-level deletes. For example, Iceberg currently stores the row number of the start of each row group, but with page skipping, some pages in the middle of the row group may be skipped and we would have no way to figure out the accurate row numbers. So we may have to disable page skipping when we want to fetch row numbers.
cc: @rdblue
Having thought about this issue deeply in the Parquet context, I now have doubts about whether our ORC implementation for row numbers suffers from the same issue. I will have to revisit this for ORC.
I guess `valuesRead + skippedValues`, just like `hasNext()`.
    int totalCount = getPageRecordCount(parquetFile, null);
    int filterCount = getPageRecordCount(parquetFile,
        Expressions.and(Expressions.notNull("intCol"), Expressions.equal("intCol", 1)));
    Assert.assertTrue(filterCount < totalCount);
Shouldn't this end up reading one and only one page? Can we make the assertion stricter?
Yes. I changed the test code to compare against a single page's record count.
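A sketch of the stricter assertion being discussed (`RECORDS_PER_PAGE` is a hypothetical constant describing how the test file was laid out, not something from this PR):

```java
int totalCount = getPageRecordCount(parquetFile, null);
int filterCount = getPageRecordCount(parquetFile,
    Expressions.and(Expressions.notNull("intCol"), Expressions.equal("intCol", 1)));

// with the data written so that only one page can contain intCol == 1, exactly one page survives
Assert.assertEquals("Filtered read should scan exactly one page", RECORDS_PER_PAGE, filterCount);
Assert.assertTrue("Page skipping should read fewer rows than a full scan", filterCount < totalCount);
```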
Nice work @shangxinli! Do you plan to add any JMH benchmark tests to illustrate the filtering benefits?
I am new to Iceberg. Can you add more details about JMH? My plan is to include unit tests with this PR; I can open another PR later if benchmarking and other tests are needed.
JMH is a toolkit for microbenchmarks in Java. There are some tests in Iceberg that use it.
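For orientation, here is a minimal JMH sketch of the kind of benchmark being suggested (entirely hypothetical: the fixture fields, the `intCol` predicate, and the file setup are illustrative and not part of this PR):

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.parquet.GenericParquetReaders;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.parquet.Parquet;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class PageSkippingBenchmark {
  private InputFile inputFile;   // populated in a @Setup method that writes a multi-page file
  private Schema schema;         // Iceberg schema of that file

  @Benchmark
  public long filteredScan() throws Exception {
    // selective predicate that should allow most pages to be skipped
    return rowCount(Expressions.equal("intCol", 1));
  }

  @Benchmark
  public long fullScan() throws Exception {
    return rowCount(null);
  }

  private long rowCount(Expression filter) throws Exception {
    Parquet.ReadBuilder builder = Parquet.read(inputFile)
        .project(schema)
        .createReaderFunc(fileType -> GenericParquetReaders.buildReader(schema, fileType));
    if (filter != null) {
      builder = builder.filter(filter);
    }
    long rows = 0;
    try (CloseableIterable<Record> reader = builder.build()) {
      for (Record record : reader) {
        rows += 1;
      }
    }
    return rows;
  }
}
```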
Force-pushed from 437c1aa to e7b5a80
Force-pushed from c623300 to e004fe2
@shardulm94 Do you have time to have another look?
@shangxinli Yes, I started reviewing this yesterday and plan to complete it today. Just relying on Parquet to do filtering (stats/dictionary/page skipping/bloom filters) would greatly simplify the Iceberg code, but I am not sure why the initial implementation chose to apply the filters itself. @rdblue, do you have any insight here?
@shardulm94 I am also curious why Iceberg reimplements those filters. @rdblue, can you shed some light on this? I can see the pros and cons of reimplementing. The concern is that it creates duplication and fragmentation. As a short-term workaround that is fine, but in the long run I think converging on one implementation makes more sense. Otherwise, when new filters are added in Parquet, we need to reimplement them here: the column index is one, the bloom filter is another on the way, and so on. If some filters exist in Iceberg but not in Parquet, we can bring them to the Parquet community to add. What do you think?
There are a few reasons why Iceberg reimplemented the filters.
To fix some of the issues with the Parquet API, my hope was that eventually Parquet would use Iceberg's expression API and filters in place of its own. We'd need to refactor a bit to make this happen, but I think it would still be a good option. There are several things that I think would be great to standardize across some of the storage projects, like the …
One more thing: we also wanted to use Iceberg's comparators. I think we had those done before Parquet introduced its comparators, and Iceberg also has strong support for …
@@ -623,17 +630,10 @@ public ReadBuilder withNameMapping(NameMapping newNameMapping) {
    if (filter != null) {
      // TODO: should not need to get the schema to push down before opening the file.
      // Parquet should allow setting a filter inside its read support
      MessageType type;
The comment above sounds like another possible reason why we wanted to reimplement filters: tailoring the filter to each file is difficult, compared to evaluating the same filter for every file.
If I remember correctly, the need to tailor the filter for the file comes from using id-based column resolution. The file might contain `1: a int` while Iceberg has a filter for `1: x long`. Unless the filter is translated to use `a` instead of `x`, Parquet will skip the file because it doesn't think the column exists (and assumes it is all nulls).
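To make the renaming issue concrete, a small hypothetical illustration (the column names, field id, and wrapper class are made up; `idToAlias` is the Schema method referenced earlier in this thread, and `fileSchema` is assumed to have been built from the file's MessageType so its alias map holds the written names):

```java
import org.apache.iceberg.Schema;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;

public class FilterTranslationExample {
  // The file wrote the column as "a" (field id 1); the table schema now calls it "x".
  // A predicate on "x" must be rewritten against the file's column name before handing it to
  // Parquet's FilterApi, otherwise Parquet treats the column as missing (all nulls).
  static FilterPredicate translateEq(Schema fileSchema, int fieldId, long value) {
    String fileColumnName = fileSchema.idToAlias(fieldId);   // e.g. "a" rather than "x"
    return FilterApi.eq(FilterApi.longColumn(fileColumnName), value);
  }
}
```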
    if (filter != null) {
      statsFilter = new ParquetMetricsRowGroupFilter(expectedSchema, filter, caseSensitive);
      dictFilter = new ParquetDictionaryRowGroupFilter(expectedSchema, filter, caseSensitive);
    }
This PR removes support for pushing down filters like `startsWith` and `in`?
@rdblue, makes sense. Now the question is: for the ColumnIndex filter, do we also want to reimplement it? And the same question applies to the future bloom filter. Since we already reimplemented the other kinds of filters, it makes more sense to do the same thing here. It is just more effort, but I can do it if Iceberg decides to reimplement all of the Parquet filters. If we don't reimplement, we might need Parquet to release 1.11.2 (we discussed this in the last Parquet meeting and it is possible if needed) with a fix to expose the skipped record count, but we would still have the same issue you mention above (in, startsWith not supported, etc.). @shardulm94, I haven't addressed your comments yet. If we decide to reimplement the ColumnIndex filter, then we cannot call readNextFilteredRowGroup(); instead, we would filter pages ourselves (maybe in BaseColumnIterator? I need to look into it more). So I will address them later once we have agreement here.
@shangxinli, I think we probably want to build the filtering into Iceberg. I could be convinced otherwise, but it would take a lot of work to use Parquet filters. After my comment yesterday, I was looking at the implementation here and noticed another problem: Parquet expects a filter that references columns using the name that was written into each data file. Iceberg allows renaming, so Iceberg filters are written to use field IDs instead of names; that is needed for correctness, because if I rename a field, a filter written against the new name won't match the column name in existing data files. Parquet also doesn't support customizing the filter per file inside its read support.
I'm remembering more of the issues with the original filter path now, and I think it makes more sense to do the work in Iceberg for now, with the intent to get the work upstream in a 2.0 Parquet API that fixes some of these issues. What do you think?
@rdblue I think it makes sense.
@rdblue I looked at the implementation of ColumnIndex. Here is what we would need to do to rewrite this filter and use it in Iceberg.
The work seems to be more than just rewriting the filter: it also means rewriting ColumnChunkPageReadStore, SynchronizingColumnReader, and parts of ParquetFileReader. This makes me wonder whether we should continue rewriting or reuse Parquet's implementation. If we choose to reuse, we can either limit the filter to not support startsWith, in, etc., or add them to Parquet. Let me know your thoughts.
@shangxinli Hi, any updates on it?
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Issue link: #193