Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add handling for unmanaged files to vacuum command #1817

Open
wants to merge 25 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
1dc65b3
Add handling for unmanaged files to vacuum command
Jan-Schweizer Nov 7, 2023
e59bb34
collapse nested if block
Jan-Schweizer Nov 7, 2023
90b7741
chore: upgrade to the latest dynamodb-lock crate
rtyler Nov 7, 2023
809f645
feat: default logstore implementation (#1742)
dispanser Nov 9, 2023
8e64a0d
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
Jan-Schweizer Nov 9, 2023
7559c24
fix: use correct folder for auto assigned labels
roeap Nov 8, 2023
da6e438
fix: run integration tests in CI
roeap Nov 9, 2023
140f949
Update README.md
dennyglee Nov 9, 2023
a327fa8
Correctly handle hidden files in _change_data and _delta_index & dele…
Jan-Schweizer Nov 11, 2023
b40f276
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
Jan-Schweizer Nov 11, 2023
a358f06
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
Jan-Schweizer Nov 15, 2023
752773a
Fix paths for managed files
Jan-Schweizer Nov 15, 2023
dc74dcc
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
Jan-Schweizer Nov 29, 2023
96a5e0a
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
rtyler Nov 30, 2023
5ea77fd
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
ion-elgreco Nov 30, 2023
48b4e3c
docs: fix all examples and change overall structure (#1931)
ion-elgreco Dec 1, 2023
f90b48c
fix: prune each merge bin with only 1 file (#1902)
haruband Dec 2, 2023
d518f40
chore: update python version (#1934)
wjones127 Dec 2, 2023
2733f3d
Support os.PathLike for table references
bolkedebruin Nov 6, 2023
f4b9e91
add type param
wjones127 Dec 2, 2023
83f2f99
fix: get rid of panic in during table (#1928)
Dec 2, 2023
b946d07
Happify linter
Jan-Schweizer Dec 2, 2023
d4642bf
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
Jan-Schweizer Dec 2, 2023
04e3811
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
Jan-Schweizer Dec 14, 2023
27c2b53
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
ion-elgreco Dec 20, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
98 changes: 89 additions & 9 deletions crates/deltalake-core/src/operations/vacuum.rs
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ use object_store::Error;
use object_store::{path::Path, ObjectStore};
use serde::Serialize;
use serde_json::Value;
use url::Url;

use super::transaction::commit;
use crate::crate_version;
Expand Down Expand Up @@ -185,7 +186,6 @@ impl VacuumBuilder {
};

let expired_tombstones = get_stale_files(&self.snapshot, retention_period, now_millis);
let valid_files = self.snapshot.file_paths_iter().collect::<HashSet<Path>>();

let mut files_to_delete = vec![];
let mut file_sizes = vec![];
Expand All @@ -200,12 +200,58 @@ impl VacuumBuilder {
.ok_or(DeltaTableError::NoMetadata)?
.partition_columns;

// A set containing the absolute paths to managed files
let managed_files = self
.snapshot
.files()
.iter()
.map(|a| {
if is_absolute_path(&a.path) {
a.path.clone()
} else {
format!("{}{}", self.log_store.root_uri(), a.path)
}
})
.chain(self.snapshot.all_tombstones().iter().map(|r| {
if is_absolute_path(&r.path) {
r.path.clone()
} else {
format!("{}{}", self.log_store.root_uri(), r.path)
}
}))
.chain(self.snapshot.files().iter().filter_map(|a| {
return if let Some(deletion_vector) = &a.deletion_vector {
if let Ok(parent) =
&Url::parse(&format!("file://{}", self.log_store.root_uri().as_str()))
{
if let Ok(dv_absolute_path) = deletion_vector.absolute_path(parent) {
Some(dv_absolute_path?.path().to_string())
} else {
None
}
} else {
None
}
} else {
None
};
}))
.collect::<HashSet<String>>();
Copy link
Author

@Jan-Schweizer Jan-Schweizer Nov 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the deletion vector file path is being constructed on the fly by the absolut_path function, it now must be a HashSet<String> instead of HashSet<&str>. This also means that we now have to clone the String of the add and remove path. Is this fine or is there a better way to handle this?
Since the add/remove path potentially needs to be converted to an absolute path, I need to construct a String any way

Also, this nested handling of the deletion vector paths doesn't look nice, can you show me a better/more idiomatic way to achieve this?


while let Some(obj_meta) = all_files.next().await {
// TODO should we allow NotFound here in case we have a temporary commit file in the list
let obj_meta = obj_meta.map_err(DeltaTableError::from)?;
if valid_files.contains(&obj_meta.location) // file is still being tracked in table
|| !expired_tombstones.contains(obj_meta.location.as_ref()) // file is not an expired tombstone
|| is_hidden_directory(partition_columns, &obj_meta.location)?

if is_hidden_file(partition_columns, &obj_meta.location)? {
continue;
}

if self.is_file_managed(&managed_files, &obj_meta.location) {
if !expired_tombstones.contains(obj_meta.location.as_ref()) {
continue;
}
} else if now_millis - retention_period.num_milliseconds()
< obj_meta.last_modified.timestamp_millis()
{
continue;
}
Expand All @@ -222,6 +268,16 @@ impl VacuumBuilder {
specified_retention_millis: Some(retention_period.num_milliseconds()),
})
}

/// Whether a file is contained within the set of managed files.
fn is_file_managed(&self, managed_files: &HashSet<String>, file: &Path) -> bool {
return if is_absolute_path(file.as_ref()) {
managed_files.contains(file.as_ref())
} else {
let path = format!("{}{}", self.log_store.root_uri(), file.as_ref());
managed_files.contains(&path)
};
}
Comment on lines +272 to +280
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the location of an object in the log_store ever be absolute? Or is it always relative to the table root? If so, then we wouldn't need the is_absolute_path check

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jan-Schweizer the location can be an absolute URL, and that's the case with shallow cloned tables. The add action describes the path key.

So for better or worse, we need to check for an absolute path 😄

}

impl std::future::IntoFuture for VacuumBuilder {
Expand Down Expand Up @@ -254,6 +310,11 @@ impl std::future::IntoFuture for VacuumBuilder {
}
}

fn is_absolute_path(path: &str) -> bool {
let path = std::path::Path::new(path);
path.is_absolute()
}

/// Encapsulate which files are to be deleted and the parameters used to make that decision
struct VacuumPlan {
/// What files are to be deleted
Expand Down Expand Up @@ -367,11 +428,15 @@ impl VacuumPlan {
/// Names of the form partitionCol=[value] are partition directories, and should be
/// deleted even if they'd normally be hidden. The _db_index directory contains (bloom filter)
/// indexes and these must be deleted when the data they are tied to is deleted.
fn is_hidden_directory(partition_columns: &[String], path: &Path) -> Result<bool, DeltaTableError> {
let path_name = path.to_string();
Ok((path_name.starts_with('.') || path_name.starts_with('_'))
&& !path_name.starts_with("_delta_index")
&& !path_name.starts_with("_change_data")
fn is_hidden_file(partition_columns: &[String], path: &Path) -> Result<bool, DeltaTableError> {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed this to is_hidden_file since we are not only ignoring directories but also single files

let path_name = path.as_ref();
let skip = path_name.starts_with("_delta_index") || path_name.starts_with("_change_data");
let is_hidden = path
.parts()
.skip(skip as usize)
.any(|p| p.as_ref().starts_with('.') || p.as_ref().starts_with('_'));

Ok(is_hidden
&& !partition_columns
.iter()
.any(|partition_column| path_name.starts_with(partition_column)))
Expand Down Expand Up @@ -451,4 +516,19 @@ mod tests {

assert_eq!(result.files_deleted, empty);
}

#[tokio::test]
async fn vacuum_table_with_dv_small() {
let table = open_table("./tests/data/table-with-dv-small")
.await
.unwrap();

let (_table, result) = VacuumBuilder::new(table.log_store, table.state)
.with_dry_run(true)
.await
.unwrap();

let empty: Vec<String> = Vec::new();
assert_eq!(result.files_deleted, empty);
}
}
3 changes: 1 addition & 2 deletions crates/deltalake-core/tests/command_vacuum.rs
Original file line number Diff line number Diff line change
Expand Up @@ -216,7 +216,6 @@ async fn test_partitions_included() {
);
}

#[ignore]
#[tokio::test]
// files that are not managed by the delta log and have a last_modified greater
// than the retention period should be deleted. Unmanaged files and directories
Expand Down Expand Up @@ -276,7 +275,7 @@ async fn test_non_managed_files() {

// Validate unmanaged files are deleted after the retention period
let res = {
clock.tick(Duration::hours(1));
clock.tick(Duration::days(7));
let (_, metrics) = DeltaOps(table)
.vacuum()
.with_clock(Arc::new(clock.clone()))
Expand Down
Loading