[Spark] Support subDirs parameter for VACUUM #1928
Closed
Which Delta project/connector is this regarding?
Description
Support a subDirs parameter for the VACUUM command. Currently, VACUUM builds a list of all files under the table's root location and excludes the valid files from that list. This listing can incur a large storage cost when a table holds more than 10M valid files and VACUUM must run frequently to remove only a few files. At present there is no way to avoid listing all files during a VACUUM operation.
This PR adds a subDirs parameter to the VACUUM Scala/Python API to limit the candidate files. If subDirs is given, only file paths under those directories are considered for VACUUM. It can be beneficial for the following scenario:
For the WHERE clause support proposed in #1691, the clause would normally be used to filter files, so the full file list has to be built before filtering. The alternative is to check the predicate before descending into each subdirectory recursively, which requires a lot of code changes. This PR could serve as a workaround for #1691. The same feature was also requested in #220.
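The idea of restricting the listing to given subdirectories can be illustrated with a minimal sketch. This is not the PR's implementation and does not use the Delta API. The function names list_files and vacuum_candidates, the directory layout, and the use of a local file system are all assumptions made for illustration, and the retention period check is left out.

```python
import os


def list_files(root, sub_dirs=None):
    """List files under root, or only under root/<sub_dir> when sub_dirs is given.

    Hypothetical helper: paths are returned relative to root with "/" separators.
    """
    starts = [root] if not sub_dirs else [os.path.join(root, d) for d in sub_dirs]
    found = []
    for start in starts:
        # Only the chosen start directories are walked, so the rest of the
        # table is never listed.
        for dirpath, _dirnames, filenames in os.walk(start):
            for name in filenames:
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                found.append(rel.replace(os.sep, "/"))
    return sorted(found)


def vacuum_candidates(root, valid_files, sub_dirs=None):
    """Return listed files that are not valid table files (deletion candidates)."""
    valid = set(valid_files)
    return [f for f in list_files(root, sub_dirs) if f not in valid]
```

With sub_dirs left as None the sketch walks the whole root, which mirrors the current behavior described above. Passing a list of directories limits both the listing and the candidate set to those directories.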
How was this patch tested?
Unit tests. The change has also run in production for months.
Does this PR introduce any user-facing changes?
Yes. It adds a new subDirs parameter to VACUUM.