Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Optimize MergingSnapshotProducer to use referenced manifests to determine if manifest needs to be rewritten #11131

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

amogh-jahagirdar
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Sep 13, 2024

This change optimizes merging snapshot producer logic to determine if a file has been deleted in a manifest in the case deleteFile(ContentFile file) is called; optimizing this logic means that determining if a manifest needs to be rewritten is now optimized.

Previously determining a manifest file needs to be rewritten has 2 high level steps:

  1. Open up the manifest, iterate through entries until an entry matches a criteria for deletion (either by partition expression, or a path based deletion). Stop iterating through the manifest if one of these criteria is hit.
  2. If 1 ends up yielding manifests that need to be rewritten, a new manifest is rewritten with the same contents as the old manifests minus any deleted files; for delete manifests if there are delete files older than the min sequence number those are also dropped as part of the writing.

The optimization in this PR will optimize step 1 in the case deleteFile(ContentFile file) is called by keeping track of the data/delete file's manifests and the position in the manifest that is being deleted. Using this the logic for doing the first pass over the manifest is no longer necessary since the presence of a referenced manifest means that manifest must be rewritten.
Before


Benchmark                                       (numFiles)  (percentDeleteFilesReplaced)  Mode  Cnt  Score   Error  Units
ReplaceDeleteFilesBenchmark.replaceDeleteFiles       50000                             5    ss    5  0.234 ± 0.054   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles       50000                            25    ss    5  0.599 ± 1.330   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles       50000                            50    ss    5  0.814 ± 1.823   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles       50000                           100    ss    5  0.877 ± 1.837   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      100000                             5    ss    5  0.426 ± 0.050   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      100000                            25    ss    5  0.838 ± 1.920   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      100000                            50    ss    5  1.051 ± 1.847   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      100000                           100    ss    5  1.525 ± 0.108   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      500000                             5    ss    5  0.666 ± 1.511   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      500000                            25    ss    5  1.596 ± 0.289   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      500000                            50    ss    5  1.991 ± 0.438   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      500000                           100    ss    5  2.496 ± 0.566   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     1000000                             5    ss    5  1.110 ± 1.858   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     1000000                            25    ss    5  1.955 ± 0.317   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     1000000                            50    ss    5  2.499 ± 0.556   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     1000000                           100    ss    5  2.947 ± 1.181   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     2000000                             5    ss    5  1.709 ± 0.159   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     2000000                            25    ss    5  3.153 ± 1.251   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     2000000                            50    ss    5  3.093 ± 1.475   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     2000000                           100    ss    5  5.808 ± 3.145   s/op

After
ReplaceDeleteFilesBenchmark.replaceDeleteFiles       50000                             5    ss    5  0.196 ± 0.038   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles       50000                            25    ss    5  0.468 ± 1.496   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles       50000                            50    ss    5  0.652 ± 1.914   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles       50000                           100    ss    5  0.741 ± 1.925   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      100000                             5    ss    5  0.223 ± 0.019   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      100000                            25    ss    5  0.641 ± 1.916   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      100000                            50    ss    5  0.879 ± 1.883   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      100000                           100    ss    5  1.326 ± 0.122   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      500000                             5    ss    5  0.470 ± 1.631   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      500000                            25    ss    5  1.349 ± 0.140   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      500000                            50    ss    5  1.655 ± 0.244   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles      500000                           100    ss    5  2.077 ± 0.258   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     1000000                             5    ss    5  0.866 ± 1.878   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     1000000                            25    ss    5  1.652 ± 0.113   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     1000000                            50    ss    5  2.038 ± 0.123   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     1000000                           100    ss    5  1.871 ± 0.278   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     2000000                             5    ss    5  1.319 ± 0.097   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     2000000                            25    ss    5  1.993 ± 0.113   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     2000000                            50    ss    5  1.820 ± 0.317   s/op
ReplaceDeleteFilesBenchmark.replaceDeleteFiles     2000000                           100    ss    5  3.363 ± 0.336   s/op

@github-actions github-actions bot added the core label Sep 13, 2024
@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Sep 13, 2024

Publishing a draft so I can test against entire CI. I need to think more about a good way to benchmark this and if there's even more reasonable optimizations that I can do here

Well another possible optimization to think through:

If we have both the manifestLocation and the ordinal position of the content file in the manifest AND there are no delete expressions or deletions by pure paths, we can possibly just write the new manifest with entries that are not part of the referenced positions without having to evaluate file paths or predicates against manifest entries.

We could just evaluate against the positions (every entry would be compared against the "deleted" pos set) as opposed to file paths/partition values, which should be a bit more performant.

@amogh-jahagirdar amogh-jahagirdar force-pushed the optimize-merging-snapshot-producer branch 7 times, most recently from ad1dd92 to b230f1e Compare September 16, 2024 14:54
@amogh-jahagirdar amogh-jahagirdar force-pushed the optimize-merging-snapshot-producer branch 2 times, most recently from 47cda8a to 4099608 Compare September 16, 2024 15:48
Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On testing, even though SparkRewriteDataFilesAction exercises this new path I think it's worth adding a separate test in TestRewriteFiles, or if https://github.com/apache/iceberg/pull/11166/files gets in first can also add a test to RowDelta. The test should explicitly read entries from manifests to delete

@github-actions github-actions bot added the build label Sep 20, 2024
@amogh-jahagirdar amogh-jahagirdar force-pushed the optimize-merging-snapshot-producer branch 3 times, most recently from be432d6 to 5eece22 Compare September 20, 2024 02:53
@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review September 20, 2024 03:00
Comment on lines 121 to 128
if (filesToDelete > 0) {
DeleteFile deleteFile = FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
rowDelta.addDeletes(deleteFile);
generatedDeleteFiles.add(deleteFile);
DeleteFile pendingDeleteFile =
FileGenerationUtil.generatePositionDeleteFile(table, dataFile);
generatedPendingDeleteFiles.add(pendingDeleteFile);
filesToDelete--;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opting for keeping the benchmarking pretty simple for now. We just use the percentage and this will correlate linearly with the amount of manifests which need to get rewritten. We could test cases where we vary the distribution of deletes across partitions, but don't think it's strictly required now since the current benchmark does indicate this is a good improvement.

Copy link
Contributor

@aokolnychyi aokolnychyi Sep 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, the new and the old logic replace all delete files. Your PR would benefit use cases when a small number of manifests has to be rewritten, so I'd probably add numFiles data and delete files but replace only a percentage of those.

Also, I would consider running the same benchmark for an unpartitioned table as it will not have the partition filters and Iceberg would be forced to scan through all of the metadata even if a single delete file is replaced (without your PR, of course).

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Sep 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the benchmark to only replace a percentage of the delete files rather than before where the percentage was more a ratio of how many data files had delete files. That's probably a more useful bench for evaluating synchronous maintenance since we get a sense of how expensive the operation is at varying levels of replacement + varying levels of files.

After updating to have the set be a source of truth for delete manifests, a more noticeable difference is observed, as expected! I updated the benchmark results in the PR description.

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Sep 20, 2024

Seems like after my latest updates to not add to the delete paths if a manifest location is defined, some cherry pick test cases are failing. Taking a look.

@amogh-jahagirdar amogh-jahagirdar force-pushed the optimize-merging-snapshot-producer branch 4 times, most recently from b37854d to 07a3b1b Compare September 20, 2024 18:05
@amogh-jahagirdar amogh-jahagirdar force-pushed the optimize-merging-snapshot-producer branch 4 times, most recently from 2e67a18 to 726fcfb Compare October 17, 2024 15:44
…o determine if a given manifest needs to be rewritten or not
@amogh-jahagirdar amogh-jahagirdar force-pushed the optimize-merging-snapshot-producer branch from 726fcfb to ec6c919 Compare October 17, 2024 16:06
// The current set of referenced manifests can be trusted if it is a subset of the manifests
// being filtered. If a single referenced manifest is not in the set of manifests being filtered
// this indicates that the referenced manifests are stale and cannot be trusted.
Set<String> manifestLocations =
Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These manifests are the set of manifests in the parent snapshot. I think the check that the referenced manifests is a subset of the manifests in the parent is sufficient. The reason is in the case of a concurrent update which removes 1 or more of the currently referenced manifests, the parent would reflect this removal (since every retry due to a concurrent update would naturally have an updated parent and therefore an updated view of what are the manifests currently). Added some tests in TestTransactions that verify this cc @rdblue @aokolnychyi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm rerunning benchmarks to make sure after all the recent refactoring I did, that somehow we're not skipping the optimization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would help to state in this comment what is meant by "trusted". I think it means that we can filter the set of referenced manifests rather than the set of manifests that was passed into this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it means that we can filter the set of referenced manifests rather than the set of manifests that was passed into this method.

Actually, I think it's more specific than this. It's possible that the set can be trusted, there's a manifest which is not referenced set, and the manifest still needs to be filtered because it either has live entries OR it's a delete manifest with aged out entries (the min sequence number ends up being less than the min data sequence number during filtering).

This also ties in to trying to make this logic a bit less complicated since that last condition I mentioned is what's leading to the current implementation.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated to remove the check to always rewrite a manifest just because it had aged off deletes and no other reason to rewrite. This simplified a bit since now a lot of the referenced sets checks could be done upfront a bit more explicitly. See #11131 (comment)

@@ -323,11 +345,15 @@ private ManifestFile filterManifest(Schema tableSchema, ManifestFile manifest) {
PartitionSpec spec = reader.spec();
PartitionAndMetricsEvaluator evaluator =
new PartitionAndMetricsEvaluator(tableSchema, spec, deleteExpression);
boolean hasDeletedFiles = manifestsReferencedForDeletes.contains(manifest.path());
if (hasDeletedFiles) {
Copy link
Contributor

@rdblue rdblue Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this structure makes the code harder to reason about and misses an optimization.

This currently checks whether the manifest is known to have a file to remove. If it is unknown, then the logic below is used, which scans the manifest to determine whether it needs to be rewritten. But if the set of referenced manifests is trusted, then there is no need to scan these manifests because the manifest to rewrite for every removal is known.

This should check allDeletesReferenceManifests to determine if the next step can be avoided.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see that this may actually be caught by canContainDeletedFiles.

I think part of the problem is that this is trying to use instance state rather than passing data through methods. I think the logic would be more clear if state were passed through args rather than relying on instance fields several calls down the chain.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, canContainDeletedFiles should skip entirely if we can trust the referenced set.

I think part of the problem is that this is trying to use instance state rather than passing data through methods. I think the logic would be more clear if state were passed through args rather than relying on instance fields several calls down the chain.

Sounds good, I think what we could do is filter manager surfaces the referenced manifests if all the removals are performed with a referenced manifest location. Then in merging snapshot producer we leverage that set to filter down the set that gets passed to the filter manager to begin with. I still think this particular case would be a good optimization (the case where a file is known to be in a manfiest referenced in a delete) so that we avoid opening up the manifest and evaluating stats etc, and just skip to writing the new manifest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the check which led to always rewriting a delete manifest in case it had aged off deletes simplified this, for context see: #11131 (comment)

I moved the referenced check a bit more explicitly up front during the filtering and also avoided some additional referenced set checks (previously we were doing that a few times) but now it should only happen once.

@amogh-jahagirdar amogh-jahagirdar force-pushed the optimize-merging-snapshot-producer branch from 661bc57 to e137da1 Compare October 24, 2024 16:39
// The current set of referenced manifests can be trusted if it is a subset of the manifests
// being filtered. If a single referenced manifest is not in the set of manifests being filtered
// this indicates that the referenced manifests are stale and cannot be trusted.
Set<String> manifestLocations =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it means that we can filter the set of referenced manifests rather than the set of manifests that was passed into this method.

Actually, I think it's more specific than this. It's possible that the set can be trusted, there's a manifest which is not referenced set, and the manifest still needs to be filtered because it either has live entries OR it's a delete manifest with aged out entries (the min sequence number ends up being less than the min data sequence number during filtering).

This also ties in to trying to make this logic a bit less complicated since that last condition I mentioned is what's leading to the current implementation.

@amogh-jahagirdar amogh-jahagirdar force-pushed the optimize-merging-snapshot-producer branch 4 times, most recently from 862bba6 to 219ed14 Compare October 24, 2024 21:31
Comment on lines -373 to -375
boolean canContainDropBySeq =
manifest.content() == ManifestContent.DELETES
&& manifest.minSequenceNumber() < minSequenceNumber;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to rewrite a given manifest anytime it had aged deletes regardless of if the manifest needed to get rewritten for other purposes. Discussed with @rdblue and the way to look at this is that we should opportunistically remove the aged deletes when the manifest needs to get rewritten for other purposes.

Otherwise we don't really need to go out of our way since it can create unnecessary failures which I think I agree with the rationale on that. I removed canContainDropSeq as an individual condition to drive a rewrite, which is technically a behavior change but arguably a better choice based on the argument that it avoids unnecessary failures that can happen by doing that work.

I also updated TestRowDelta#testDeleteDataFileWithDeleteFile so that we still exercise the case where aged deletes do get rewritten, but the conditions for triggering that had to change in the tests.

Comment on lines +604 to 609
dataSeqs(1L, 1L),
fileSeqs(1L, 1L),
ids(deltaSnapshotId, deltaSnapshotId),
files(FILE_A_DELETES, FILE_B_DELETES),
statuses(Status.ADDED, Status.ADDED));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #11131 (comment) for why I changed this test. TLDR, the behavior around rewriting manifests with aged off deletes changed from "Always rewrite in case the manifest does have aged off deletes" to "Remove the aged off deletes if the manifest actually needs to be rewritten for other purposes". This test used to just work on the assumption that this condition existed, but now we need to perform an action (deleting FILE_B_DELETES) which forces the manifest to be rewritten along with the deleted, aged off entries.

@amogh-jahagirdar amogh-jahagirdar force-pushed the optimize-merging-snapshot-producer branch 2 times, most recently from 2e583a2 to e8a87bc Compare October 24, 2024 23:30
@github-actions github-actions bot added the spark label Oct 24, 2024
Comment on lines +297 to +299
DeleteFile deleteFile = TestHelpers.deleteFiles(table).iterator().next();
Path deleteFilePath = new Path(String.valueOf(deleteFile.path()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests changed for the same reason mentioned #11131 (comment)

@amogh-jahagirdar amogh-jahagirdar force-pushed the optimize-merging-snapshot-producer branch from e8a87bc to eb3e848 Compare October 25, 2024 02:34
…o determine if a given manifest needs to be rewritten or not
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants