Deduplicate Index Metadata in BlobStore #50278

original-brownbear · 2019-12-17T15:10:05Z

This PR introduces two new fields in RepositoryData (index-N) to track the blob names of IndexMetaData blobs and their content hashes. These fields are used to deduplicate the IndexMetaData blobs (meta-{uuid}.dat in the indices folders under /indices), so that new metadata for an index is only written to the repository during a snapshot if that same metadata can't be found in another snapshot.
This saves one write per index in the common case of unchanged metadata, which reduces cost and makes snapshot finalization drastically faster when many indices are snapshotted at the same time.

The implementation is mostly analogous to that for shard generations in #46250 and piggybacks on the BwC mechanism introduced in that PR (which means this PR needs adjustments if it doesn't go into 7.6).

Relates to #45736, as it improves the efficiency of snapshotting unchanged indices.
Relates to #49800, as it has the potential to make loading the index metadata for multiple snapshots of the same index concurrently much more efficient, speeding up future concurrent snapshot deletes.
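To illustrate the deduplication idea described above, here is a minimal sketch. It is not the code from this PR: the class `IndexMetaDedupSketch`, its methods, and the use of a plain string as the metadata identifier are all hypothetical, and the identifier is assumed to be unique per distinct metadata content.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical, simplified model of index metadata deduplication (not the Elasticsearch classes). */
final class IndexMetaDedupSketch {
    // Metadata identifier (for example a content hash) -> UUID of the blob already in the repository.
    private final Map<String, String> identifierToBlobUuid = new HashMap<>();
    // "snapshot/index" -> blob name referenced by that snapshot for that index.
    private final Map<String, String> snapshotIndexToBlob = new HashMap<>();
    private int blobWrites = 0;

    /** Registers the metadata of an index for a snapshot; a blob is only "written" for unseen identifiers. */
    String snapshotIndex(String snapshot, String index, String metadataIdentifier) {
        String blobUuid = identifierToBlobUuid.get(metadataIdentifier);
        if (blobUuid == null) {
            blobUuid = UUID.randomUUID().toString();
            identifierToBlobUuid.put(metadataIdentifier, blobUuid);
            blobWrites++; // stands in for writing meta-{uuid}.dat to the repository
        }
        String blobName = "meta-" + blobUuid + ".dat";
        snapshotIndexToBlob.put(snapshot + "/" + index, blobName);
        return blobName;
    }

    /** Returns the blob name a snapshot references for an index, or null if unknown. */
    String blobFor(String snapshot, String index) {
        return snapshotIndexToBlob.get(snapshot + "/" + index);
    }

    int blobWrites() {
        return blobWrites;
    }
}
```

In this sketch, two snapshots of an index with unchanged metadata share one blob, so only the first snapshot pays for the write.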


elasticmachine · 2019-12-17T15:10:08Z

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)


original-brownbear · 2019-12-18T04:11:46Z

test/framework/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreTestUtil.java

+            }
+            // TODO: assertEquals(indexMetaGenerationsExpected, indexMetaGenerationsFound); requires cleanup functionality for
+            //       index meta generations blobs
+            assertTrue(indexMetaGenerationsFound.containsAll(indexMetaGenerationsExpected));


Not checking equality here admittedly makes this test somewhat redundant, but I figured it was nice to leave the logic and the TODO here to make it clear why we need index metadata blob cleanup.
Note that leaking index metadata blobs is not new and can happen without this change as well. As a matter of fact, you could argue that leaking index metadata blobs was more likely before this change, since there were simply more of them and the delete timing hasn't changed.

ywelsch: Do we have a test that checks that index metadata is cleaned up if it is no longer referenced? (i.e. a succeeding snapshot delete that was holding the last reference to that index metadata)

original-brownbear: Enhanced the test for metadata deduplication in c9ab2fc to cover this explicitly.
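The cleanup behaviour asked about in this thread can be sketched as follows. This is a hypothetical illustration, not the test added in c9ab2fc (which is not shown here): the class `MetaBlobCleanupSketch` and its methods are invented, and each snapshot is assumed to reference a single metadata blob for brevity.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model: a shared metadata blob may only be deleted with its last referencing snapshot. */
final class MetaBlobCleanupSketch {
    // Snapshot name -> metadata blob name it references (one index per snapshot, for brevity).
    private final Map<String, String> snapshotToBlob = new HashMap<>();

    void addSnapshot(String snapshot, String blobName) {
        snapshotToBlob.put(snapshot, blobName);
    }

    /** Removes the snapshot and returns the blobs that no remaining snapshot references. */
    Set<String> deleteSnapshot(String snapshot) {
        String blob = snapshotToBlob.remove(snapshot);
        Set<String> unreferenced = new HashSet<>();
        if (blob != null && !snapshotToBlob.containsValue(blob)) {
            unreferenced.add(blob); // last reference is gone, so the blob is safe to delete
        }
        return unreferenced;
    }
}
```

With deduplication, deleting one of two snapshots that share a blob must leave the blob in place; only the delete that drops the last reference may remove it.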

original-brownbear · 2019-12-18T04:15:27Z

server/src/main/java/org/elasticsearch/repositories/Repository.java

     * @param snapshotId the snapshot id to load the index metadata from
     * @param index      the {@link IndexId} to load the metadata from
     * @return the index metadata about the given index for the given snapshot
     */
-    IndexMetaData getSnapshotIndexMetaData(SnapshotId snapshotId, IndexId index) throws IOException;
+    IndexMetaData getSnapshotIndexMetaData(RepositoryData repositoryData, SnapshotId snapshotId, IndexId index) throws IOException;


Passing the RepositoryData here again admittedly makes for a somewhat awkward API. But as the diff shows, production code always has the RepositoryData available when calling this method, so I didn't see a reason to make the call more expensive, slower, and less stable by forcing another round trip to load the RepositoryData just to keep the simpler API.
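The trade-off can be sketched with a toy model. Everything here is hypothetical (`RepoDataPassingSketch`, `RepoData`, `loadRepoData`, `metaBlobName`); it only illustrates that a caller holding the already-loaded data can resolve several metadata blob names without further repository round trips.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of passing already-loaded repository data into a lookup method. */
final class RepoDataPassingSketch {
    /** Stand-in for RepositoryData: maps "snapshot/index" to a metadata blob name. */
    static final class RepoData {
        final Map<String, String> metaBlobBySnapshotAndIndex;

        RepoData(Map<String, String> metaBlobBySnapshotAndIndex) {
            this.metaBlobBySnapshotAndIndex = new HashMap<>(metaBlobBySnapshotAndIndex);
        }
    }

    private final RepoData stored;
    private int loads = 0;

    RepoDataPassingSketch(RepoData stored) {
        this.stored = stored;
    }

    /** Simulates the round trip to the repository; counts how often it happens. */
    RepoData loadRepoData() {
        loads++;
        return stored;
    }

    /** Resolves the blob name from data the caller already holds, so no extra load is needed. */
    String metaBlobName(RepoData repoData, String snapshot, String index) {
        String blob = repoData.metaBlobBySnapshotAndIndex.get(snapshot + "/" + index);
        if (blob == null) {
            throw new IllegalArgumentException("no metadata for " + snapshot + "/" + index);
        }
        return blob;
    }

    int loads() {
        return loads;
    }
}
```

A variant without the `RepoData` parameter would have to call `loadRepoData()` inside every lookup, which is the extra cost the comment above argues against.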

original-brownbear · 2019-12-18T04:36:49Z

server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java

+
+        final Executor executor = threadPool.executor(ThreadPool.Names.SNAPSHOT);
+
+        getRepositoryData(ActionListener.wrap(existingRepositoryData -> {


Note to reviewers: there's no big change here. I just moved the loading of the repository data to be the first step, so that it is available to the index metadata writer threads for the lookup, and removed its lazy loading from allMetaListener below. All other changes are just indentation changes.
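The "load first, then fan out" ordering described in this note can be sketched with standard `java.util.concurrent` types. This is an illustration under assumptions, not the `BlobStoreRepository` code: `LoadFirstSketch`, `finalizeSnapshot`, and the map standing in for the repository data are hypothetical, and `CompletableFuture` replaces the `ActionListener` style used in the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;

/** Hypothetical sketch: load repository data once, then run one metadata writer task per index. */
final class LoadFirstSketch {
    static Map<String, String> finalizeSnapshot(List<String> indices,
                                                Map<String, String> existingBlobs,
                                                ExecutorService executor) {
        Map<String, String> result = new ConcurrentHashMap<>();
        // Step 1: load the repository data up front (simulated by returning the given map).
        CompletableFuture<Map<String, String>> repoData =
            CompletableFuture.supplyAsync(() -> existingBlobs, executor);
        // Step 2: once it is available, fan out one writer task per index that consults it.
        CompletableFuture<Void> allWritten = repoData.thenCompose(data -> {
            List<CompletableFuture<Void>> writers = new ArrayList<>();
            for (String index : indices) {
                writers.add(CompletableFuture.runAsync(
                    // Reuse a known blob if present, otherwise pretend to write a new one.
                    () -> result.put(index, data.getOrDefault(index, "new-blob-for-" + index)),
                    executor));
            }
            return CompletableFuture.allOf(writers.toArray(new CompletableFuture[0]));
        });
        allWritten.join(); // wait for all writers before returning
        return result;
    }
}
```

Because the data is loaded before any writer starts, each writer can do its lookup without triggering or waiting on a lazy load of its own.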


original-brownbear · 2020-05-29T11:35:46Z

@ywelsch @tlrx I made use of #56930 to better track the uniqueness of index metadata => this PR should be good for review again at last :) (not urgent though)

ywelsch

I've left some small comments; looking good otherwise.

server/src/main/java/org/elasticsearch/repositories/IndexMetaDataGenerations.java

server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java


original-brownbear · 2020-06-05T14:17:09Z

Thanks Yannick, sorry for the merge error, fixed now + test added:)

original-brownbear · 2020-06-05T14:42:11Z

Jenkins run elasticsearch-ci/2 (unrelated xpack)

ywelsch

LGTM

original-brownbear · 2020-06-05T17:15:47Z

Thanks Yannick!

This PR introduces two new fields in `RepositoryData` (index-N) to track the blob names of `IndexMetaData` blobs and their content via setting generations and UUIDs. These fields are used to deduplicate the `IndexMetaData` blobs (`meta-{uuid}.dat` in the indices folders under `/indices`), so that new metadata for an index is only written to the repository during a snapshot if that same metadata can't be found in another snapshot. This saves one write per index in the common case of unchanged metadata, which reduces cost and makes snapshot finalization drastically faster when many indices are snapshotted at the same time.

The implementation is mostly analogous to that for shard generations in elastic#46250 and piggybacks on the BwC mechanism introduced in that PR (which means this PR needs adjustments if it doesn't go into `7.6`).

Relates to elastic#45736, as it improves the efficiency of snapshotting unchanged indices.
Relates to elastic#49800, as it has the potential to make loading the index metadata for multiple snapshots of the same index concurrently much more efficient, speeding up future concurrent snapshot deletes.

original-brownbear added 8 commits December 14, 2019 14:35

bck

1afb69e

Merge remote-tracking branch 'elastic/master' into deduplicate-index-…

24a9e1e

…metadata

Merge remote-tracking branch 'elastic/master' into deduplicate-index-…

4d5685c

…metadata

bck

94d4c3a

Merge remote-tracking branch 'elastic/master' into deduplicate-index-…

15aa01c

…metadata

works some more

e994881

bck

e939e55

Merge remote-tracking branch 'elastic/master' into deduplicate-index-…

083874d

…metadata

original-brownbear added the >enhancement, WIP, and :Distributed/Snapshot/Restore labels Dec 17, 2019

original-brownbear added 6 commits December 17, 2019 17:57

bck

aa2f463

TODO

4b898c3

better

69bf0fe

fix test

dae25c4

Merge remote-tracking branch 'elastic/master' into deduplicate-index-…

96dad34

…metadata

javadoc

8fbdd9f

original-brownbear commented Dec 18, 2019


add comment

b1af8ee

original-brownbear commented Dec 18, 2019


original-brownbear added 4 commits December 18, 2019 05:39

drop obsolete TODO

fecb413

add test to show better incrementality

f747e90

new tests + equals + hashcode

2f46ec3

toString

8db076d

original-brownbear added v7.6.0 v8.0.0 and removed WIP labels Dec 18, 2019

original-brownbear marked this pull request as ready for review December 18, 2019 08:53

original-brownbear mentioned this pull request May 25, 2020

Add History UUID Index Setting (#56930) #57104

Merged

original-brownbear added 2 commits May 25, 2020 10:50

Merge remote-tracking branch 'elastic/master' into deduplicate-index-…

de37e85

…metadata

add use of history uuid

85d7d13

original-brownbear added a commit that referenced this pull request May 25, 2020

Add History UUID Index Setting (#56930) (#57104)

9fa60f7

Prerequisite for #50278 to be able to uniquely identify index metadata by its version fields and UUIDs when restoring into closed indices.

original-brownbear added 2 commits May 25, 2020 11:33

Merge remote-tracking branch 'elastic/master' into deduplicate-index-…

d154bde

…metadata

Merge remote-tracking branch 'elastic/master' into deduplicate-index-…

179654f

…metadata

ywelsch reviewed Jun 5, 2020


original-brownbear added 2 commits June 5, 2020 16:01

Merge remote-tracking branch 'elastic/master' into deduplicate-index-…

bf095d9

…metadata

CR: comments

c9ab2fc

original-brownbear requested a review from ywelsch June 5, 2020 14:16

ywelsch approved these changes Jun 5, 2020


original-brownbear merged commit 37ab351 into elastic:master Jun 5, 2020

original-brownbear deleted the deduplicate-index-metadata branch June 5, 2020 17:16

original-brownbear added the backport pending label Jun 5, 2020

original-brownbear removed the backport pending label Jul 14, 2020

original-brownbear mentioned this pull request Jul 14, 2020

Deduplicate Index Metadata in BlobStore (#50278) #59514

Merged

original-brownbear restored the deduplicate-index-metadata branch August 6, 2020 18:25

original-brownbear deleted the deduplicate-index-metadata branch December 1, 2020 12:51

original-brownbear restored the deduplicate-index-metadata branch December 6, 2020 18:59

original-brownbear deleted the deduplicate-index-metadata branch January 20, 2021 09:04

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

original-brownbear restored the deduplicate-index-metadata branch April 18, 2023 20:42
