parquet: Add option to cache file metadata #12548

Closed
wants to merge 1 commit into from

@progval progval commented Sep 20, 2024

Inspired by datafusion-examples/examples/advanced_parquet_index.rs

Which issue does this PR close?

This was an attempt to solve #12547, but did not achieve it, and I am not sure it is the right approach.

Rationale for this change

On every query against Parquet tables, DataFusion re-opens every file and parses its metadata. This takes a significant amount of time for short queries (in my use case, there is usually a single hit in the Page Index).

My goal was to make these queries near-instant. Unfortunately, I realized after writing this code that the Page Index still needs to be parsed every time, because file metadata is lost through the listing layer (as mentioned in #9964).

So this does spare some (negligible?) time parsing metadata. I'm not sure it's worth the extra complexity, especially in ParquetFormat. What do you think?

What changes are included in this PR?

  • Made ParquetFormat carry state (it probably deserves a renaming then...)
  • Added CachedParquetFileReaderFactory as an alternative to DefaultParquetFileReaderFactory, and made it usable through a config option
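
For readers unfamiliar with the idea, here is a minimal sketch of the caching pattern behind a reader factory like this one. It is not the code from this PR: `FileMetadata`, `MetadataCache`, and `get_or_load` are hypothetical names, and it uses only the standard library instead of DataFusion's or the parquet crate's real types. It only illustrates "parse once per file path, then share the parsed result".

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parsed Parquet file metadata.
#[derive(Debug, PartialEq)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// Hypothetical cache of parsed metadata, keyed by file path.
#[derive(Default)]
pub struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
}

impl MetadataCache {
    /// Returns the cached metadata for `path`, calling `load` only on a miss.
    /// The lock is held while loading, which keeps the sketch simple but
    /// serializes concurrent loads.
    pub fn get_or_load<F>(&self, path: &str, load: F) -> Arc<FileMetadata>
    where
        F: FnOnce() -> FileMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(path) {
            return Arc::clone(hit);
        }
        let parsed = Arc::new(load());
        entries.insert(path.to_string(), Arc::clone(&parsed));
        parsed
    }
}
```

A real implementation would also need invalidation (for example, keying on file size or modification time as well as the path) and some bound on memory use; both are left out here.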

Are these changes tested?

no

Are there any user-facing changes?

Added datafusion.execution.parquet.cache_metadata

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate labels Sep 20, 2024

progval commented Sep 23, 2024

Closing in favor of #12593, which actually works and doesn't require extensive changes to DataFusion.

@progval progval closed this Sep 23, 2024