[Spark] Support subDirs parameter for VACUUM #1928

Closed
wants to merge 3 commits into from

Contributor

@sezruby sezruby commented Jul 22, 2023

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Support a subDirs parameter for the VACUUM command. Currently, VACUUM builds a list of all files under the root table location and excludes valid files from that list. This listing can incur a huge storage cost when the table holds more than 10M valid files and VACUUM has to run frequently to remove only a few files. For now, there is no way to avoid listing all files for a VACUUM operation.

This PR adds a subDirs parameter to the VACUUM Scala/Python API to limit the candidate files. If subDirs is given, only file paths under those directories are considered for VACUUM.

It can be beneficial in the following scenarios:

  • A table maintains millions of files, keeps getting new data and needs to VACUUM frequently.
  • If a table is partitioned by date type, it's clear which partitions need to be vacuumed.
  • If there are many invalid files and a user wants to run vacuum partially due to a lack of resources, etc.
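
The effect of restricting the listing can be sketched outside of Delta with plain Python. This is a minimal illustration on a local directory tree, not the PR's implementation: `vacuum_candidates`, `list_files`, `build_demo_table`, and the `sub_dirs` argument are hypothetical names, and the sketch ignores the retention period and the `_delta_log` directory that a real VACUUM has to respect.

```python
import os
import tempfile


def list_files(directory):
    """Recursively list files under `directory` and count directories visited."""
    files, dirs_listed = [], 0
    for current, _subdirs, names in os.walk(directory):
        dirs_listed += 1
        files.extend(os.path.join(current, n) for n in names)
    return files, dirs_listed


def vacuum_candidates(table_root, valid_files, sub_dirs=None):
    """Return (relative paths to delete, number of directories listed).

    valid_files: relative paths still referenced by the table.
    sub_dirs: optional relative directories to restrict the listing to
              (hypothetical stand-in for the subDirs parameter).
    """
    if sub_dirs:
        roots = [os.path.join(table_root, d) for d in sub_dirs]
    else:
        roots = [table_root]
    candidates, dirs_listed = set(), 0
    for root in roots:
        files, n = list_files(root)
        dirs_listed += n
        candidates.update(os.path.relpath(f, table_root) for f in files)
    valid = {os.path.normpath(p) for p in valid_files}
    # Everything that was listed but is not valid becomes a deletion candidate.
    return sorted(candidates - valid), dirs_listed


def build_demo_table(root):
    """Create a tiny date-partitioned layout; return the list of valid files."""
    layout = {
        "date=2023-07-01/a.parquet": True,
        "date=2023-07-01/stale.parquet": False,  # no longer referenced
        "date=2023-07-02/b.parquet": True,
        "date=2023-07-03/c.parquet": True,
    }
    for rel in layout:
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    return [rel for rel, is_valid in layout.items() if is_valid]


with tempfile.TemporaryDirectory() as root:
    valid = build_demo_table(root)
    full, full_dirs = vacuum_candidates(root, valid)
    scoped, scoped_dirs = vacuum_candidates(root, valid, sub_dirs=["date=2023-07-01"])
    print("full listing:  ", full, "directories listed:", full_dirs)
    print("scoped listing:", scoped, "directories listed:", scoped_dirs)
```

Both calls find the same stale file, but the scoped call lists only the one partition directory instead of the whole table, which is where the saving comes from when the caller already knows which partitions changed.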

For the WHERE clause support in #1691, the clause is normally used to "filter" files, so the whole list has to be built first before it can be filtered. Otherwise, the predicate would have to be checked before stepping into each subdirectory recursively, which requires a lot of code changes. This PR could be a workaround for #1691. The same request was made in #220.
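
To make the distinction concrete, here is a small sketch of the two strategies on a local directory tree. It is only an illustration and does not reflect the code in #1691 or in this PR: the function names are hypothetical, a single partition level is assumed, and files directly under the table root are ignored.

```python
import os
import tempfile


def list_then_filter(root, keep_partition):
    """List the whole table first, then filter: every directory is visited."""
    visited, kept = 0, []
    for current, _dirs, names in os.walk(root):
        visited += 1
        if current == root:
            continue  # files directly under the root are ignored in this sketch
        partition = os.path.relpath(current, root).split(os.sep, 1)[0]
        if keep_partition(partition):
            kept.extend(os.path.relpath(os.path.join(current, n), root) for n in names)
    return sorted(kept), visited


def prune_while_walking(root, keep_partition):
    """Check the predicate before descending: rejected directories are never listed."""
    visited, kept = 0, []
    for current, dirs, names in os.walk(root, topdown=True):
        visited += 1
        if current == root:
            # Editing `dirs` in place tells os.walk which subdirectories to skip.
            dirs[:] = [d for d in dirs if keep_partition(d)]
            continue
        kept.extend(os.path.relpath(os.path.join(current, n), root) for n in names)
    return sorted(kept), visited


def build_partitions(root, partitions):
    """Create one empty data file per partition directory."""
    for name in partitions:
        os.makedirs(os.path.join(root, name))
        open(os.path.join(root, name, "part-0.parquet"), "w").close()


def wanted(partition):
    """Example predicate on the partition directory name."""
    return partition == "date=2023-07-01"


with tempfile.TemporaryDirectory() as root:
    build_partitions(root, ["date=2023-07-01", "date=2023-07-02", "date=2023-07-03"])
    filtered, filtered_visits = list_then_filter(root, wanted)
    pruned, pruned_visits = prune_while_walking(root, wanted)
    print(filtered, "directories visited:", filtered_visits)
    print(pruned, "directories visited:", pruned_visits)
```

Both functions return the same files, but only the pruning variant avoids listing the rejected partitions. Passing explicit subdirectories gets the same saving without having to evaluate a predicate during traversal.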

How was this patch tested?

Unit tests; the change has also run in production for months.

Does this PR introduce any user-facing changes?

Yes, this adds support for a new parameter for VACUUM.

@sezruby sezruby changed the title [Spark] Support subDirs parameter for VACUUM [Spark] Support targetPrefixes parameter for VACUUM Jul 23, 2023
@sezruby sezruby changed the title [Spark] Support targetPrefixes parameter for VACUUM [Spark] Support subDirs parameter for VACUUM Jul 23, 2023

tnyz commented Feb 22, 2024

Can we release this feature?

Contributor Author

sezruby commented Feb 23, 2024

@tnyz Closed because a better approach, #1932, was going to be merged (the latest design seems a bit complex now, though).

This PR could be a quick workaround for the expensive directory listing problem during vacuum.
