fix: add handling for unmanaged files to vacuum command #1817
base: main
Conversation
ACTION NEEDED: delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
The new version of this crate properly sets a lease duration such that the locks can actually expire
# Description

Introduce a `LogStore` abstraction to channel all log store reads and writes through a single place. This is supposed to allow implementations with more sophisticated locking mechanisms that do not rely on atomic rename semantics for the underlying object store. This does not change any functionality - it reorganizes read operations and commits on the delta commit log to be funneled through the respective methods of `LogStore`.

## Rationale

The goal is to align the implementation of multi-cluster writes for Delta Lake on S3 with the one provided by the original `delta` library, enabling multi-cluster writes with some writers using Spark / the Delta library and other writers using `delta-rs`. For an overview of how it's done in delta, please see:

1. Delta [blog post](https://delta.io/blog/2022-05-18-multi-cluster-writes-to-delta-lake-storage-in-s3/) (high-level concept)
2. Associated Databricks [design doc](https://docs.google.com/document/d/1Gs4ZsTH19lMxth4BSdwlWjUNR-XhKHicDvBjd2RqNd8/edit#heading=h.mjjuxw9mcz9h) (detailed read)
3. [S3DynamoDbLogStore.java](https://github.com/delta-io/delta/blob/master/storage-s3-dynamodb/src/main/java/io/delta/storage/S3DynamoDBLogStore.java) (content warning: Java code behind this link)

This approach requires readers of a delta table to "recover" unfinished commits from writers - as a result, reading and writing are combined in a single interface, which in this PR is modeled after [LogStore.java](https://github.com/delta-io/delta/blob/master/storage/src/main/java/io/delta/storage/LogStore.java). Currently in `delta-rs`, the read path for commits is implemented directly in `DeltaTable`, and there is no mechanism to implement storage-specific behavior like interacting with DynamoDb.

---------

Co-authored-by: Robert Pack <[email protected]>
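For readers who don't want to dig through the Java counterpart, a minimal sketch of the shape such a trait could take is below. The names and signatures (`read_commit_entry`, `write_commit_entry`, `CommitBytes`, `LogStoreError`) are illustrative assumptions modeled on LogStore.java, not the actual delta-rs API.

```rust
// Illustrative sketch only: the names and signatures below are assumptions
// for explanation, not the actual delta-rs `LogStore` trait.

/// Raw bytes of a single commit entry under `_delta_log/`.
type CommitBytes = Vec<u8>;

/// Hypothetical error type standing in for the crate's real error enum.
#[derive(Debug)]
pub struct LogStoreError(pub String);

/// All reads and writes of the commit log are funneled through one trait,
/// so a backend (e.g. S3 + DynamoDB) can add its own locking and recovery
/// of unfinished commits without relying on atomic rename semantics.
pub trait LogStore {
    /// Read the commit entry for `version`, if it exists.
    fn read_commit_entry(&self, version: i64) -> Result<Option<CommitBytes>, LogStoreError>;

    /// Write the commit entry for `version`; must fail if that version
    /// already exists, so concurrent writers cannot clobber each other.
    fn write_commit_entry(&self, version: i64, body: CommitBytes) -> Result<(), LogStoreError>;

    /// Latest known version of the table, used when recovering commits.
    fn get_latest_version(&self) -> Result<i64, LogStoreError>;
}
```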
Looks good. Thanks for finishing this.
It looks like there is a test assertion that just needs to be updated, since `_change_data` and `_delta_index` are exceptions in `is_hidden_directory()`.
We may have to take some additional aspects into consideration when selecting the files to be vacuumed. As it stands, I think we may accidentally delete deletion vector files for files that are still part of the log. We should, however, be able to get the corresponding file names from the add action and add them to the managed files.
```rust
        None
    };
}))
.collect::<HashSet<String>>();
```
Since the deletion vector file path is being constructed on the fly by the `absolut_path` function, it now must be a `HashSet<String>` instead of a `HashSet<&str>`. This also means that we now have to clone the String of the add and remove path. Is this fine, or is there a better way to handle this?
Since the add/remove path potentially needs to be converted to an absolute path, I need to construct a String anyway.
Also, this nested handling of the deletion vector paths doesn't look nice - can you show me a better/more idiomatic way to achieve this?
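One possible flatter shape, as a sketch only: the action and deletion-vector types and the `dv_absolute_path` helper below are placeholders rather than the actual kernel/action types, and `root_uri` is assumed to already end with a path separator. The `flat_map` + `chain` pattern avoids the nested `Option` handling.

```rust
// Sketch of a flatter way to build the managed-file set; the types and the
// `dv_absolute_path` helper are placeholders for illustration only.
use std::collections::HashSet;

struct DeletionVector {
    path_or_inline_dv: String,
}

struct AddAction {
    path: String,
    deletion_vector: Option<DeletionVector>,
}

/// Placeholder for resolving a deletion vector descriptor to an absolute URI.
fn dv_absolute_path(root_uri: &str, dv: &DeletionVector) -> Option<String> {
    Some(format!("{root_uri}{}", dv.path_or_inline_dv))
}

fn managed_files(root_uri: &str, adds: &[AddAction]) -> HashSet<String> {
    adds.iter()
        .flat_map(|add| {
            // One entry for the data file itself ...
            let data = std::iter::once(format!("{root_uri}{}", add.path));
            // ... plus, if present, one for its deletion vector file, so
            // vacuum never deletes a DV that is still referenced by the log.
            let dv = add
                .deletion_vector
                .as_ref()
                .and_then(|dv| dv_absolute_path(root_uri, dv));
            data.chain(dv)
        })
        .collect()
}
```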
```rust
Ok((path_name.starts_with('.') || path_name.starts_with('_'))
    && !path_name.starts_with("_delta_index")
    && !path_name.starts_with("_change_data")

fn is_hidden_file(partition_columns: &[String], path: &Path) -> Result<bool, DeltaTableError> {
```
I renamed this to `is_hidden_file` since we are not only ignoring directories but also single files.
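A simplified, self-contained sketch of the kind of check being discussed, using plain `&str` instead of the crate's `Path` and `DeltaTableError` types. The partition-column exception is included because a partition directory such as `_year=2023` looks hidden but belongs to the table; checking every path segment also covers single files nested inside hidden directories, which is the point of the rename.

```rust
// Simplified sketch only; the real function operates on object-store paths
// and returns Result<bool, DeltaTableError>.
fn is_hidden_file(partition_columns: &[String], path: &str) -> bool {
    path.split('/').any(|segment| {
        let looks_hidden = segment.starts_with('.') || segment.starts_with('_');
        let is_exception = segment.starts_with("_delta_index")
            || segment.starts_with("_change_data")
            // Partition directories like `_year=2023` are part of the table.
            || partition_columns
                .iter()
                .any(|col| segment.starts_with(&format!("{col}=")));
        looks_hidden && !is_exception
    })
}
```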
```rust
/// Whether a file is contained within the set of managed files.
fn is_file_managed(&self, managed_files: &HashSet<String>, file: &Path) -> bool {
    return if is_absolute_path(file.as_ref()) {
        managed_files.contains(file.as_ref())
    } else {
        let path = format!("{}{}", self.log_store.root_uri(), file.as_ref());
        managed_files.contains(&path)
    };
}
```
Can the `location` of an object in the log_store ever be absolute? Or is it always relative to the table root? If so, then we wouldn't need the `is_absolute_path` check.
@Jan-Schweizer the location can be an absolute URL, and that's the case with shallow cloned tables. The add action describes the `path` key. So for better or worse, we need to check for an absolute path 😄
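As a hedged sketch (using the `url` crate purely for illustration; the real helper may be implemented differently), the absolute-path check and the normalization into a lookup key could look roughly like this:

```rust
// Illustrative check for whether a path from an add action is already an
// absolute URL (as with shallow-cloned tables) or relative to the table root.
use url::Url;

fn is_absolute_path(path: &str) -> bool {
    // "s3://bucket/other-table/part-0001.parquet" parses as a URL with a
    // scheme; "year=2023/part-0001.parquet" does not and is treated as
    // relative to the table root.
    Url::parse(path).is_ok()
}

fn to_managed_key(root_uri: &str, path: &str) -> String {
    if is_absolute_path(path) {
        path.to_string()
    } else {
        // Relative paths are resolved against the table root before the
        // lookup in the managed-files set.
        format!("{root_uri}{path}")
    }
}
```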
# Description

I have made a bunch of improvements to fix the overall structure, since the example sections were not consistent. I've also enabled some extra features, and fixed the issue of some classes/functions not being shown properly.
# Description

This PR prunes each merge bin with only 1 file, even when there are multiple merge bins.

# Related Issue(s)

- closes delta-io#1901

# Documentation

This PR adds `test_idempotent_with_multiple_bins()` for testing.
# Description

Prepare for next release
# Description

This is a continuation of the discussion in delta-io#1917: getting rid of panics in the library crate in favor of returning an error, so lib users can handle it in the way they see fit. Tests were changed accordingly.

Co-authored-by: Robert Pack <[email protected]>
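A generic illustration of the pattern described (not the actual diff from that PR): the panicking variant hard-codes the failure policy inside the library, while the error-returning variant lets the caller decide. `DeltaTableError::Generic` is defined locally here as a stand-in.

```rust
// Generic illustration: return an error instead of panicking.
#[derive(Debug)]
enum DeltaTableError {
    Generic(String),
}

// Before: the library decides the process should die on bad input.
fn parse_version_panicking(name: &str) -> i64 {
    name.parse()
        .unwrap_or_else(|_| panic!("invalid commit file name: {name}"))
}

// After: the caller decides what an invalid file name means for them.
fn parse_version(name: &str) -> Result<i64, DeltaTableError> {
    name.parse()
        .map_err(|_| DeltaTableError::Generic(format!("invalid commit file name: {name}")))
}
```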
I'll pick this up next week, since we often get issues about vacuum not cleaning up all required files.
Description

Adds handling for unmanaged files to the VACUUM command. The VACUUM command should correctly handle files that are not managed by a Delta table, e.g. files and directories that start with `_` or `.`. Additionally fixes the test `test_non_managed_files`.
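For context on how this surfaces to users, here is a hedged usage sketch of running VACUUM as a dry run to inspect which files would be removed. The builder method names reflect the delta-rs Rust API around the time of this PR and should be checked against the crate docs for your version; the table URI is a placeholder.

```rust
use deltalake::{open_table, DeltaOps, DeltaTableError};

#[tokio::main]
async fn main() -> Result<(), DeltaTableError> {
    // Placeholder table URI for illustration.
    let table = open_table("s3://bucket/my-table").await?;

    let (_table, metrics) = DeltaOps(table)
        .vacuum()
        .with_dry_run(true) // report only, do not delete anything
        .await?;

    // With this fix, unmanaged files and directories (names starting with
    // `_` or `.`) are left alone, while stale data files still show up here.
    println!("files that would be deleted: {:?}", metrics.files_deleted);
    Ok(())
}
```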
Related Issue(s)