Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Support commits with DVs #11495

Merged
merged 7 commits into from
Nov 11, 2024
Merged

Conversation

aokolnychyi
Copy link
Contributor

This PR adds supports for committing operations with DVs.

This work is part of #11122.

@@ -769,6 +794,82 @@ protected void validateDataFilesExist(
}
}

// validates there are no concurrently added DVs for referenced data files
protected void validateAddedDVs(
Copy link
Contributor Author

@aokolnychyi aokolnychyi Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This simply validates there is at most one DV per file. There would be a separate PR to automatically rewrite conflicting DVs during retries. The goal is to make DVs fully usable before optimizing.

@github-actions github-actions bot added the data label Nov 8, 2024
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to review tests but the code changes themselves overall look good, just some minor comments. Thanks @aokolnychyi!

}

// builds a set of data file locations referenced by new DVs
private Set<String> dvRefs() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call this referencedDataFiles?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's a private method so name is not a big deal

I think that dvRefs() (or dvReferences()) makes more sense here as it checks if the deleteFile is a DV. So, having DV in the method name makes sense to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to have dv in the name. If so, I'd prefer a shorter var name to avoid splitting statements on multiple lines.

.stopOnFailure()
.throwFailureWhenFinished()
.executeWith(workerPool())
.run(m -> validateAddedDVs(m, conflictDetectionFilter, newSnapshotIds, dvRefs));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Should we use the full word manifest in the lambda?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity I agree with @amogh-jahagirdar (manifest instead of m). I'm fine to use DV instead of DeleteVector but manifest here would help 😃

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was me trying to work-around Spotless formatting for closures. I updated.

Copy link
Member

@jbonofre jbonofre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a first quick pass. I will to another round asap. Thanks @aokolnychyi great work as usual !

@@ -139,6 +139,10 @@ protected void validate(TableMetadata base, Snapshot parent) {
if (validateNewDeleteFiles) {
validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, parent);
}

if (base.formatVersion() >= 3 && addsDeleteFiles()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need the format version here ? Why not just using addsDeleteFiles flag ? (just wondering).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've restructured this part a bit to compute dv refs as deletes are being added. No longer applies.

"Must not use DVs for position deletes in V2: %s",
ContentFileUtil.dvDesc(file));
Preconditions.checkArgument(
formatVersion() == 2 || ContentFileUtil.isDV(file),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here, why not just checking if it's DV: we know that if it's DV it's format version 3+ right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method validates input from the user and ensures it is not possible to add a DV to a V2 table.

.stopOnFailure()
.throwFailureWhenFinished()
.executeWith(workerPool())
.run(m -> validateAddedDVs(m, conflictDetectionFilter, newSnapshotIds, dvRefs));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity I agree with @amogh-jahagirdar (manifest instead of m). I'm fine to use DV instead of DeleteVector but manifest here would help 😃

}

// builds a set of data file locations referenced by new DVs
private Set<String> dvRefs() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's a private method so name is not a big deal

I think that dvRefs() (or dvReferences()) makes more sense here as it checks if the deleteFile is a DV. So, having DV in the method name makes sense to me.

return formatVersion >= 3 ? FILE_B_DV : FILE_B_DELETES;
}

protected DeleteFile newDeletes(DataFile dataFile) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe newDeleteFile?

Copy link
Contributor Author

@aokolnychyi aokolnychyi Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other newDeleteFile methods below return V2 position deletes, so I wanted to differentiate.
I am OK renaming too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe callers won't care as long as you get a correct DeleteFile for the correct formatVersion, so I was thinking that we most likely don't need to differentiate here with the naming

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no preference, to be honest. I'll update in the next round.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went to rename and I remembered why I did it this way. I wanted a way to construct a V2 delete file independently of the format version and to not need to update the other methods that create delete files now.

@@ -78,7 +78,7 @@ public void testFileSizeSummary() {

@TestTemplate
public void testFileSizeSummaryWithDeletes() {
assumeThat(formatVersion).isGreaterThan(1);
assumeThat(formatVersion).isEqualTo(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you planning to update these tests eventually to run against format version 3?

Copy link
Contributor Author

@aokolnychyi aokolnychyi Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. There are new tests for DVs in this class. I think this one needs V2 position deletes, we simply didn't have the validation before. V2 position deletes are not allowed in V3 tables. What do you think, @nastra?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah sorry I forgot that you already added testFileSizeSummaryWithDVs, so this LGTM

@@ -719,7 +719,7 @@ public void testRewriteLargeManifestsEvolvedUnpartitionedV1Table() throws IOExce

@TestTemplate
public void testRewriteSmallDeleteManifestsNonPartitionedTable() throws IOException {
assumeThat(formatVersion).isGreaterThan(1);
assumeThat(formatVersion).isEqualTo(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I'm updating this test class in #11485

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, will take a look!

@@ -956,6 +956,62 @@ public void testRewriteLargeDeleteManifestsPartitionedTable() throws IOException
assertThat(deleteManifests).hasSizeGreaterThanOrEqualTo(2);
}

@TestTemplate
public void testUpgradeToV3() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel a little awkward about the scope of this test. It's testing upgrade and manifest rewrite, but there are specific aspects that are included.

I feel like there are more scenarios to cover here (e.g. writing a DV to an existing file that has positional delete after the upgrade.). I would say either refine the name to be more explicit (e.g. testUpgradeToV3WithExistingV2Deletes) so that we can add more test cases for the upgrade path.

Copy link
Contributor Author

@aokolnychyi aokolnychyi Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the name is too generic, I'll update. I did cover adding a DV to an existing file with positional delete in a previous PR here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also an extra test in TestRowDelta. This one was for the action that rewrites manifests, but I renamed both.

@aokolnychyi aokolnychyi merged commit af3fbfe into apache:main Nov 11, 2024
49 checks passed
@aokolnychyi
Copy link
Contributor Author

Thanks for reviewing, @nastra @danielcweeks @jbonofre @amogh-jahagirdar!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants