Paginate persisted cluster state #78875
Conversation
Force-pushed d26e53f to bb7b5c1
Force-pushed b7eb459 to 572fffd
Today we allocate a contiguous chunk of memory for the global metadata each time we write it to disk. The size of this chunk is unbounded and in practice it can be pretty large. This commit splits the metadata document up into pages (1MB by default) that are streamed to disk at write time, bounding the memory usage of cluster state persistence. Since the memory usage is now bounded we can allocate a single buffer up front and re-use it for every write.
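The paging scheme described above can be sketched roughly as follows. This is an illustrative sketch, not the actual Elasticsearch implementation: the class and method names (`PageWriter`, `writePages`, `fillBuffer`) are invented for this example. The key idea it demonstrates is that one fixed-size buffer is allocated up front and refilled for every page, so memory use stays bounded regardless of how large the metadata stream grows.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of bounded-memory page writing.
public class PageWriter {
    private final byte[] buffer; // allocated once, reused for every page

    public PageWriter(int pageSize) {
        this.buffer = new byte[pageSize];
    }

    /** Splits the stream into pages; the last page may be shorter than the page size. */
    public List<byte[]> writePages(InputStream in) {
        List<byte[]> pages = new ArrayList<>();
        try {
            int filled;
            while ((filled = fillBuffer(in)) > 0) {
                byte[] page = new byte[filled];
                System.arraycopy(buffer, 0, page, 0, filled);
                pages.add(page); // the real code streams each page to disk instead of collecting it
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pages;
    }

    // Reads as much as possible into the shared buffer; returns the number of bytes read.
    private int fillBuffer(InputStream in) throws IOException {
        int offset = 0;
        while (offset < buffer.length) {
            int read = in.read(buffer, offset, buffer.length - offset);
            if (read == -1) {
                break;
            }
            offset += read;
        }
        return offset;
    }
}
```

With a 1MB page size, a 10MB metadata document would produce ten pages while only ever holding one page's worth of scratch memory.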
Force-pushed 572fffd to 6dc6a56
Pinging @elastic/es-distributed (Team:Distributed)
Now that #78525 is resolved, I think this is ready to go.
…08-paginate-persisted-cluster-state
Sorry for the delay here @DaveCTurner, this is on my list to benchmark today.
npnp I think I've successfully merged the reformatted master in now 🤞
LGTM.
One question: I think we should never be able to see a downgraded node exercise this code, since reading this happens after the NodeEnvironment reads node metadata. Just wanted to confirm that with you.
final Query query = new TermQuery(new Term(TYPE_FIELD_NAME, type));
final Weight weight = indexSearcher.createWeight(query, ScoreMode.COMPLETE_NO_SCORES, 0.0f);
logger.trace("running query [{}]", query);
final Map<String, PaginatedDocumentReader> documentReaders = new HashMap<>();
I wonder if we risk increased memory use here in case we are very unfortunate and see one page of every (or at least many) index's metadata before seeing the last page of any of them? It seems unlikely to matter for a few reasons: index metadata often fits in a single page, doc-id order is likely to result in index-by-index reading, and the total cluster state needs to fit in memory anyway. I wonder if we could or should extract documents in order? Or just add a comment to explain why this is not important.
This runs fairly early within Node#start, so there won't be much else happening in the node yet, and later on we will need to completely serialize the cluster state in memory anyway, so I think it should be fine. Also, I think we'll be in trouble for other reasons sooner if index metadata exceeds 1MB when compressed and serialized. I added a comment in 16d6471.
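For context, the concern above is about how a per-document page reader buffers partial documents. The class name `PaginatedDocumentReader` comes from the diff, but the body below is an assumed sketch, not the real implementation: it shows why pages arriving interleaved across many documents would hold memory for every partially-read document at once, until each one's last page arrives.

```java
import java.util.ArrayList;
import java.util.List;

// Assumed sketch of a per-document page accumulator (real internals may differ).
public class PaginatedDocumentReader {
    private final List<byte[]> pages = new ArrayList<>();
    private int totalBytes = 0;

    /**
     * Buffers one page; returns the assembled document once the page marked
     * as last has been added, or null while pages are still outstanding.
     */
    public byte[] addPage(byte[] page, boolean isLastPage) {
        pages.add(page);
        totalBytes += page.length;
        if (isLastPage == false) {
            return null; // pages still pending: interleaved reads keep this buffered
        }
        byte[] document = new byte[totalBytes];
        int offset = 0;
        for (byte[] p : pages) {
            System.arraycopy(p, 0, document, offset, p.length);
            offset += p.length;
        }
        return document;
    }
}
```

One reader per document key is held in the `documentReaders` map from the diff above, so worst-case memory is one partial document per key, which the reply argues is acceptable since the whole cluster state must fit in memory later anyway.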
if (random.nextInt(10) == 0) {
    builder.put(
        PersistedClusterStateService.DOCUMENT_PAGE_SIZE.getKey(),
        new ByteSizeValue(RandomNumbers.randomIntBetween(random, rarely() ? 10 : 100, 1000))
    );
}
I wonder if we should apply this more often but use between(10, 1000000) to allow for more variety of cases?
Suggested change, from:

if (random.nextInt(10) == 0) {
    builder.put(
        PersistedClusterStateService.DOCUMENT_PAGE_SIZE.getKey(),
        new ByteSizeValue(RandomNumbers.randomIntBetween(random, rarely() ? 10 : 100, 1000))
    );
}

to:

if (randomBoolean()) {
    builder.put(
        PersistedClusterStateService.DOCUMENT_PAGE_SIZE.getKey(),
        new ByteSizeValue(RandomNumbers.randomIntBetween(random, rarely() ? 10 : 100, 1000000))
    );
}
I think the more interesting corner cases mostly arise when paginating harder; simply increasing the limit to 1000000 would make those interesting cases much less likely, but I introduced a variety of limits in ade4343.
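The kind of varied-limit randomization described in this reply could be sketched as below. The tier structure and values here are assumptions for illustration, not the exact limits introduced in ade4343; the point is that small upper bounds (which force heavy pagination) remain well represented rather than being drowned out by one large bound.

```java
import java.util.Random;

// Illustrative sketch of picking a page size from a variety of limit tiers.
public class PageSizeRandomizer {
    public static int randomPageSize(Random random) {
        final int tier = random.nextInt(3);
        final int upperBound;
        if (tier == 0) {
            upperBound = 100;        // tiny pages: pagination exercised heavily
        } else if (tier == 1) {
            upperBound = 1000;       // small pages
        } else {
            upperBound = 1_000_000;  // near the 1MB default, rarely paginates
        }
        return 10 + random.nextInt(upperBound - 9); // uniform in [10, upperBound]
    }
}
```

A flat `between(10, 1000000)` would pick a value under 1000 only about 0.1% of the time, whereas tiering keeps the hard-pagination cases at roughly two runs in three.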