Python: add support for returning empty files in scans if requested. #8204

Merged

rustyconover
Contributor

@rustyconover rustyconover commented Aug 1, 2023

There are times when scanning a table that you want the empty files to be returned as well. This change allows that via an option to the scan.
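
As a rough illustration of the behavior described above, the sketch below models a scan planner that drops zero-row files by default and keeps them when asked. It is a minimal stand-in, not PyIceberg's code: the `DataFile` class, the `plan_files` function, and the `include_empty_files` parameter name are all assumptions made for this example, since the thread does not spell out the option's name.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DataFile:
    # Hypothetical stand-in for a data file entry; not PyIceberg's DataFile class.
    file_path: str
    record_count: int


def plan_files(files: Iterable[DataFile], include_empty_files: bool = False) -> List[DataFile]:
    """Return the files a scan would report.

    Zero-row files are skipped by default (nothing to read, so no IO is spent),
    and are kept only when the caller explicitly asks for them.
    """
    return [f for f in files if include_empty_files or f.record_count > 0]


files = [
    DataFile("s3://bucket/a.parquet", 120),
    DataFile("s3://bucket/b.parquet", 0),  # loaded, but contains no rows
]

default_plan = [f.file_path for f in plan_files(files)]
full_plan = [f.file_path for f in plan_files(files, include_empty_files=True)]
```

Keeping the default at "skip" preserves the existing optimization, so only callers that opt in see the empty files.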

@github-actions github-actions bot added the python label Aug 1, 2023
@Fokko
Contributor

Fokko commented Aug 2, 2023

@rustyconover Thanks for raising this. Do you have more context on it? What's the use case of this? Just to make sure that we're not masking a 🐛

@rustyconover
Contributor Author

Hi @Fokko,

I hope you're doing well! 😊

The main motivation behind it is to improve the comparison of Iceberg's inventory of files in a table to a collection of source files. We've noticed that users often check if all source files have been properly loaded by examining the parquet files of the Iceberg table. However, in some cases, the table may have source files with zero rows, and these files are currently not included in the table scan.

While excluding these empty files is a good optimization, we believe it's important to provide the option to include even empty files in the scans. This way, we can offer more flexibility and ensure thorough checks.

Let me know if you have any questions or suggestions about this. Your feedback is highly appreciated!

Best regards,
Rusty
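
The completeness check described in this comment can be sketched as a set difference between the expected source files and the files a scan reports. The file names and the two scan results below are made-up data for illustration, and `missing_from_table` is a hypothetical helper, not part of any library.

```python
from typing import Iterable, Set


def missing_from_table(source_files: Iterable[str], planned_files: Iterable[str]) -> Set[str]:
    """Source files that the scan did not report as part of the table."""
    return set(source_files) - set(planned_files)


source_files = ["day1.parquet", "day2.parquet", "day3.parquet"]

# Hypothetical scan results: day2.parquet was loaded but contains zero rows.
planned_default = ["day1.parquet", "day3.parquet"]
planned_with_empty = ["day1.parquet", "day2.parquet", "day3.parquet"]

# With empty files skipped, day2.parquet looks like it was never loaded.
false_alarm = missing_from_table(source_files, planned_default)

# With empty files included, the inventory matches the source collection.
no_alarm = missing_from_table(source_files, planned_with_empty)
```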

@Fokko
Contributor

Fokko commented Aug 6, 2023

Interesting, thanks for the explanation. How are you writing these files? Iceberg's focus is on minimizing IO, so a file that doesn't contain any data should be skipped for that reason. I don't recall anyone looking for such a feature on the Java side. @rdblue any thoughts on this?

@Fokko
Contributor

Fokko commented Aug 6, 2023

@rustyconover Wouldn't it make more sense to go through the manifest instead of the actual scan planning?

@rustyconover
Contributor Author

Hello @Fokko,

Sometimes, applications need to verify the presence of files based on specific filter expressions. For instance, if you're interested in checking for file membership in Iceberg, particularly for files with a date > 2023-01-01, utilizing the scan planner can be quite beneficial. This approach allows you to avoid loading unnecessary manifest files, enhancing efficiency.

Although working with manifest data provides valuable insights, it lacks the capability to filter using expressions. This is where the scan planner shines, enabling you to efficiently narrow down your results.

In my experience, I frequently encounter Iceberg tables containing an extensive number of data files, often exceeding 80,000. Therefore, any functionality that aids in filtering is immensely valuable.

Rusty
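
To make the efficiency argument in this comment concrete, here is a simplified sketch of filter-driven planning: manifests whose date range cannot satisfy `date > 2023-01-01` are skipped without being opened, and only the remaining ones are read to collect matching files. The `Manifest` class, the per-file dates, and `files_after` are assumptions for illustration; real Iceberg planning evaluates expressions against partition summaries and column metrics, which this toy model only approximates.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Manifest:
    # Hypothetical stand-in for a manifest with a min/max date summary.
    path: str
    min_date: date
    max_date: date


def files_after(
    manifests: List[Manifest],
    entries: Dict[str, List[Tuple[str, date]]],
    cutoff: date,
) -> Tuple[Set[str], List[str]]:
    """Return (matching file paths, manifests opened) for the filter date > cutoff."""
    opened: List[str] = []
    matches: Set[str] = set()
    for manifest in manifests:
        if manifest.max_date <= cutoff:
            continue  # no file in this manifest can match, so it is never opened
        opened.append(manifest.path)
        for file_path, file_date in entries[manifest.path]:
            if file_date > cutoff:
                matches.add(file_path)
    return matches, opened


manifests = [
    Manifest("m-2022.avro", date(2022, 1, 1), date(2022, 12, 31)),
    Manifest("m-2023.avro", date(2022, 12, 30), date(2023, 3, 1)),
]
entries = {
    "m-2022.avro": [("old.parquet", date(2022, 6, 1))],
    "m-2023.avro": [("edge.parquet", date(2022, 12, 30)), ("new.parquet", date(2023, 2, 1))],
}

matches, opened = files_after(manifests, entries, date(2023, 1, 1))
is_member = "new.parquet" in matches  # membership check on the filtered result
```

With tens of thousands of data files spread over many manifests, skipping whole manifests this way is where most of the savings would come from.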

@rdblue rdblue merged commit 680241b into apache:master Aug 6, 2023
7 checks passed
@rdblue
Contributor

rdblue commented Aug 6, 2023

Seems fine to me. Thanks, @rustyconover!
