Tustvold/extract parquet statistics #16
* fix: wrong result of range function * fix test * add ut * add ut * nit * nit --------- Co-authored-by: zhongjingxiong <[email protected]>
* refactor: output-ordering * chore: test * chore: cr comment Co-authored-by: Alex Huang <[email protected]> --------- Co-authored-by: Alex Huang <[email protected]>
pub(crate) fn prune_row_groups_by_statistics(
    parquet_schema: &SchemaDescriptor,
To be honest, we need the full FileMetadata here in order to inspect the ColumnOrder. I decided against this as it would result in a lot of test churn.
}

// This could be made more efficient (#TBD)
let parquet_idx = (0..parquet_schema.columns().len())
This is the fix for apache#8335
    return Ok(new_empty_array(self.field.data_type()));
}
/// Extracts the min statistics from an iterator of [`ParquetStatistics`] to an [`ArrayRef`]
pub fn min_statistics<'a, I: Iterator<Item = Option<&'a ParquetStatistics>>>(
Originally I had this take ColumnChunkMetadata; however, we don't actually need anything beyond the ParquetStatistics, so it seemed peculiar to require the full ColumnChunkMetadata. Additionally, one way to support the column index using the same array logic would be to coerce both kinds of statistics to the same representation.
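To illustrate the shape of that signature, here is a minimal sketch using only the standard library. `ColumnStats` and `ColumnChunk` are hypothetical stand-ins for the parquet types, and a `Vec<Option<i64>>` stands in for the `ArrayRef`; none of this is the PR's actual implementation. It shows that a caller holding richer metadata can adapt to the iterator with a one-line `map`, and that a row group without statistics simply becomes a null slot.

```rust
/// Hypothetical stand-in for parquet statistics; only two physical types are modeled.
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnStats {
    Int32 { min: i32, max: i32 },
    Int64 { min: i64, max: i64 },
}

/// Hypothetical stand-in for column chunk metadata that may or may not carry statistics.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnChunk {
    pub statistics: Option<ColumnStats>,
}

/// Collects one minimum per input item, widening every value to `i64`.
/// An item of `None` (no statistics available) yields a `None` slot in the output.
pub fn min_statistics<'a, I>(iter: I) -> Vec<Option<i64>>
where
    I: Iterator<Item = Option<&'a ColumnStats>>,
{
    iter.map(|stats| {
        stats.map(|s| match s {
            ColumnStats::Int32 { min, .. } => i64::from(*min),
            ColumnStats::Int64 { min, .. } => *min,
        })
    })
    .collect()
}
```

With this shape, row-group metadata is passed as `chunks.iter().map(|c| c.statistics.as_ref())`, and any other source of statistics could feed the same function once coerced to the same representation.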
pub(crate) struct RowGroupStatisticsConverter<'a> {
    field: &'a Field,
/// Returns the parquet column index and the corresponding arrow field
pub fn parquet_column<'a>(
This method is basically a hack. If/when we upstream this, we can use the parquet-private ParquetField, which handles this properly.
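As a rough sketch of what such a lookup might do, the following uses hypothetical stand-in types (`Field`, `LeafColumn`) rather than the real arrow and parquet schemas. It assumes that each parquet leaf column records the index of the top-level field it belongs to, and that nested fields are skipped rather than resolved; both are assumptions made for illustration, not details confirmed by the diff.

```rust
/// Hypothetical stand-in for an arrow field.
#[derive(Debug, PartialEq)]
pub struct Field {
    pub name: String,
    /// True for struct/list/map-like fields that expand to several parquet leaves.
    pub nested: bool,
}

/// Hypothetical stand-in for a parquet leaf column.
#[derive(Debug, PartialEq)]
pub struct LeafColumn {
    /// Index of the top-level (root) field this leaf belongs to.
    pub root_idx: usize,
}

/// Returns the parquet leaf index and the matching arrow field for `name`,
/// or `None` if the field is missing or nested.
pub fn parquet_column<'a>(
    leaves: &[LeafColumn],
    fields: &'a [Field],
    name: &str,
) -> Option<(usize, &'a Field)> {
    // Locate the top-level arrow field by name.
    let (root_idx, field) = fields
        .iter()
        .enumerate()
        .find(|(_, f)| f.name == name)?;

    // Assumption: nested fields are not supported by this simple lookup.
    if field.nested {
        return None;
    }

    // Linear scan over the leaves; this is the part that could be made more efficient.
    let parquet_idx = (0..leaves.len()).find(|i| leaves[*i].root_idx == root_idx)?;
    Some((parquet_idx, field))
}
```

The point of the sketch is that the arrow field index and the parquet leaf index diverge as soon as any earlier field is nested, so the two cannot be used interchangeably.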
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?