Prune columns / pages that are all null in ParquetExec by connecting up row_counts in pruning statistics #9961

Closed
alamb opened this issue Apr 5, 2024 · 3 comments · Fixed by #10051
Assignees
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@alamb
Contributor

alamb commented Apr 5, 2024

Is your feature request related to a problem or challenge?

@appletreeisyellow added PruningStatistics::row_counts() in #9223 which allows better pruning of columns which are all null.

However, I believe we have not hooked that API up into the ParquetExec, so it won't prune row groups based on this information.

For example, if column a is all NULL, the predicate `a > 5` can never be true, but the ParquetExec won't be able to prune row groups or pages in this case.
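
To make the reasoning concrete, here is a minimal sketch of the check that row counts enable. It is not DataFusion's actual pruning code: `is_all_null` is a hypothetical helper, and the counts are hard-coded stand-ins for what the statistics would report. A container is treated as all NULL only when both counts are known and equal; unknown statistics always mean "keep".

```rust
/// Returns true when a container (row group or page) is known to hold only
/// NULLs for a column, so a comparison such as `a > 5` cannot match any row.
/// `None` means the statistic is unknown, in which case we must not prune.
/// An empty container (0 nulls, 0 rows) also qualifies, which is harmless.
fn is_all_null(null_count: Option<u64>, row_count: Option<u64>) -> bool {
    match (null_count, row_count) {
        (Some(nulls), Some(rows)) => nulls == rows,
        _ => false,
    }
}

fn main() {
    // (null_count, row_count) for column `a` in three hypothetical row groups.
    let row_groups: [(Option<u64>, Option<u64>); 3] = [
        (Some(100), Some(100)), // every value is NULL
        (Some(3), Some(100)),   // some non-NULL values
        (Some(100), None),      // row count unknown
    ];
    for (i, (nulls, rows)) in row_groups.iter().enumerate() {
        let action = if is_all_null(*nulls, *rows) { "prune" } else { "keep" };
        println!("row group {i}: {action}");
    }
}
```

The third case is the situation described above: without a row count, the null count alone cannot prove the container is all NULL.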

Describe the solution you'd like

Implement RowGroupPruningStatistics::row_counts

https://github.com/apache/arrow-datafusion/blob/2dad90425bacb98a3c2a4214faad53850c93104e/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L345-L347
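
As a rough illustration of the shape of this change, the sketch below uses a simplified, hypothetical trait (`SimpleStats`) and plain structs (`RowGroupInfo`, `RowGroupStats`) in place of the real `PruningStatistics` trait and Parquet metadata types, which work with Arrow arrays. The only point it tries to show is that `row_counts` returns one row count per row group, so it can be compared against `null_counts`.

```rust
/// Hypothetical stand-in for the metadata of one column in one row group.
struct RowGroupInfo {
    num_rows: u64,
    null_count: Option<u64>,
}

/// Simplified, hypothetical pruning-statistics interface (not the real trait).
trait SimpleStats {
    fn num_containers(&self) -> usize;
    fn null_counts(&self) -> Vec<Option<u64>>;
    fn row_counts(&self) -> Vec<Option<u64>>;
}

struct RowGroupStats<'a> {
    row_groups: &'a [RowGroupInfo],
}

impl SimpleStats for RowGroupStats<'_> {
    fn num_containers(&self) -> usize {
        self.row_groups.len()
    }
    fn null_counts(&self) -> Vec<Option<u64>> {
        self.row_groups.iter().map(|rg| rg.null_count).collect()
    }
    // The piece this issue asks for: copy each row group's row count.
    fn row_counts(&self) -> Vec<Option<u64>> {
        self.row_groups.iter().map(|rg| Some(rg.num_rows)).collect()
    }
}

/// One bool per container: false means "all NULL, safe to skip".
fn keep_mask(stats: &dyn SimpleStats) -> Vec<bool> {
    stats
        .null_counts()
        .into_iter()
        .zip(stats.row_counts())
        .map(|(nulls, rows)| !matches!((nulls, rows), (Some(n), Some(r)) if n == r))
        .collect()
}

fn main() {
    let row_groups = [
        RowGroupInfo { num_rows: 50, null_count: Some(50) },
        RowGroupInfo { num_rows: 50, null_count: Some(0) },
        RowGroupInfo { num_rows: 50, null_count: None },
    ];
    let stats = RowGroupStats { row_groups: &row_groups };
    println!("{} containers, keep = {:?}", stats.num_containers(), keep_mask(&stats));
}
```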

And PagesPruningStatistics::row_counts

https://github.com/apache/arrow-datafusion/blob/2dad90425bacb98a3c2a4214faad53850c93104e/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L550-L552
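
For pages, one possible way to obtain row counts is sketched below. It assumes the first row index of each page is available (as recorded in the Parquet offset index) together with the row group's total row count. The function name `page_row_counts` and its inputs are hypothetical and are not taken from `page_filter.rs`.

```rust
/// Derives per-page row counts from the first row index of each page and the
/// total number of rows in the row group. Returns `None` if the indexes are
/// inconsistent (not ascending, or beyond `total_rows`).
fn page_row_counts(first_row_indexes: &[u64], total_rows: u64) -> Option<Vec<u64>> {
    let mut counts = Vec::with_capacity(first_row_indexes.len());
    for (i, &start) in first_row_indexes.iter().enumerate() {
        // A page ends where the next page starts; the last page ends at total_rows.
        let end = first_row_indexes.get(i + 1).copied().unwrap_or(total_rows);
        counts.push(end.checked_sub(start)?);
    }
    Some(counts)
}

fn main() {
    // Three pages starting at rows 0, 100 and 250 in a 300-row row group.
    println!("{:?}", page_row_counts(&[0, 100, 250], 300));
}
```

Returning `None` on inconsistent input keeps the sketch conservative: when a row count cannot be trusted, the page should simply not be pruned.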

Describe alternatives you've considered

I think the row counts can be found on https://docs.rs/parquet/latest/parquet/format/struct.ColumnMetaData.html

So this ticket should be a matter of copying the row counts correctly and writing some tests in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/parquet/row_group_pruning.rs / https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/parquet/page_pruning.rs

Additional context

No response

@alamb alamb added enhancement New feature or request help wanted Extra attention is needed labels Apr 5, 2024
@alamb alamb changed the title from "Connect up row_counts statistics in ParquetExec" to "Prune columns / pages that are all null in ParquetExec by connecting up row_counts in pruning statistics" Apr 5, 2024
@alamb
Contributor Author

alamb commented Apr 5, 2024

cc @Ted-Jiang and @progval as you have been working in this area recently

@Ted-Jiang
Member

@alamb Thanks for pinging me, I will check this today

@Ted-Jiang Ted-Jiang self-assigned this Apr 7, 2024
@Ted-Jiang
Member

related: #9223 (comment)

2 participants