
Paginate persisted cluster state #78875

Conversation

@DaveCTurner (Contributor)

Today we allocate a contiguous chunk of memory for the global metadata
each time we write it to disk. The size of this chunk is unbounded and
in practice it can be pretty large. This commit splits the metadata
document up into pages (1MB by default) that are streamed to disk at
write time, bounding the memory usage of cluster state persistence.
Since the memory usage is now bounded we can allocate a single buffer up
front and re-use it for every write.
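The writing scheme described above can be sketched in stdlib-only Java. This is a hypothetical illustration, not the actual Elasticsearch implementation: a single buffer of the page size is allocated once, and each write streams the source through it, so memory use is bounded by the page size rather than the document size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Sketch only (names are hypothetical): stream a metadata document to
// storage in fixed-size pages, reusing one pre-allocated buffer so per-write
// memory is bounded by the page size, not by the document size.
class PagedWriter {
    private final byte[] buffer; // allocated once up front, reused for every write

    PagedWriter(int pageSize) {
        this.buffer = new byte[pageSize];
    }

    /** Reads the source and emits one page per buffer-full of bytes. */
    List<byte[]> write(InputStream source) throws IOException {
        List<byte[]> pages = new ArrayList<>();
        int read;
        while ((read = fillBuffer(source)) > 0) {
            // copy out only the bytes actually read; the buffer itself is reused
            byte[] page = new byte[read];
            System.arraycopy(buffer, 0, page, 0, read);
            pages.add(page);
        }
        return pages;
    }

    /** Fills the shared buffer as far as possible; returns the byte count. */
    private int fillBuffer(InputStream source) throws IOException {
        int offset = 0;
        while (offset < buffer.length) {
            int n = source.read(buffer, offset, buffer.length - offset);
            if (n < 0) {
                break; // end of stream
            }
            offset += n;
        }
        return offset;
    }
}
```

With a 1MB page size (the default mentioned above), a document of any size is written in 1MB chunks; only the final page is shorter.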

@DaveCTurner DaveCTurner added >enhancement :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.16.0 labels Oct 8, 2021
@DaveCTurner DaveCTurner force-pushed the 2021-10-08-paginate-persisted-cluster-state branch 3 times, most recently from d26e53f to bb7b5c1 Compare October 10, 2021 20:42
@DaveCTurner DaveCTurner force-pushed the 2021-10-08-paginate-persisted-cluster-state branch from b7eb459 to 572fffd Compare October 20, 2021 19:18
@DaveCTurner DaveCTurner force-pushed the 2021-10-08-paginate-persisted-cluster-state branch from 572fffd to 6dc6a56 Compare October 20, 2021 19:25
@DaveCTurner DaveCTurner marked this pull request as ready for review October 20, 2021 20:35
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Oct 20, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner (Contributor, Author)

Now that #78525 is resolved, I think this is ready to go.

@DaveCTurner DaveCTurner added v8.1.0 and removed v8.0.0 labels Oct 28, 2021
@original-brownbear (Member)

Sorry for the delay here @DaveCTurner this is on my list to benchmark today.

@DaveCTurner (Contributor, Author)

npnp I think I've successfully merged the reformatted master in now 🤞

@henningandersen (Contributor) left a comment:

LGTM.

One question: I think we should never be able to see a downgraded node exercise this code since reading this happens after the NodeEnvironment reads node metadata. Just wanted to confirm that with you.


final Query query = new TermQuery(new Term(TYPE_FIELD_NAME, type));
final Weight weight = indexSearcher.createWeight(query, ScoreMode.COMPLETE_NO_SCORES, 0.0f);
logger.trace("running query [{}]", query);

final Map<String, PaginatedDocumentReader> documentReaders = new HashMap<>();
Contributor:

I wonder if we risk increased memory use here if we are very unfortunate and read one page of every (or at least many) index's metadata before seeing the last page of any of them? It seems unlikely for a few reasons: index metadata often fits in a single page, doc-id order is likely to result in index-by-index reading, and the total cluster state needs to fit in memory anyway. I wonder if we could or should extract documents in order? Or just add a comment to explain why this is not important.

@DaveCTurner (Contributor, Author):

This runs fairly early within Node#start, so there won't be much else happening in the node yet, and later on we will need to serialize the complete cluster state in memory anyway, so I think it should be fine. Also, I think we'll be in trouble for other reasons sooner if index metadata exceeds 1MB when compressed and serialized. I added a comment in 16d6471.
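The reassembly side under discussion can be sketched as follows. This is a stdlib-only illustration with hypothetical names, not the actual PaginatedDocumentReader API: one partial buffer is held per document key, which is why reading many documents' pages before completing any of them would increase memory use.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Sketch only (hypothetical names): pages for different documents may be
// interleaved, so we keep one partial buffer per document key and complete a
// document only once its final page has been appended. Assumes pages for a
// given key arrive in page order, as a doc-id-ordered scan would produce.
class PageAssembler {
    private final Map<String, ByteArrayOutputStream> partial = new HashMap<>();

    /**
     * Appends one page; returns the fully reassembled document when the last
     * page arrives, or null while more pages are still expected.
     */
    byte[] append(String docKey, byte[] page, boolean isLastPage) {
        ByteArrayOutputStream buffer = partial.computeIfAbsent(docKey, k -> new ByteArrayOutputStream());
        buffer.write(page, 0, page.length);
        if (isLastPage == false) {
            return null; // memory held grows with the number of incomplete documents
        }
        partial.remove(docKey);
        return buffer.toByteArray();
    }

    /** Number of documents currently held in partially-read form. */
    int pendingDocuments() {
        return partial.size();
    }
}
```

In the worst interleaving, `pendingDocuments()` approaches the number of index metadata documents, which is the memory concern raised above; doc-id-ordered reading keeps it near one.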

Comment on lines 584 to 589
if (random.nextInt(10) == 0) {
builder.put(
PersistedClusterStateService.DOCUMENT_PAGE_SIZE.getKey(),
new ByteSizeValue(RandomNumbers.randomIntBetween(random, rarely() ? 10 : 100, 1000))
);
}
Contributor:

I wonder if we should apply this more often but use between(10, 1000000) to allow for more variety of cases?

Suggested change
if (random.nextInt(10) == 0) {
builder.put(
PersistedClusterStateService.DOCUMENT_PAGE_SIZE.getKey(),
new ByteSizeValue(RandomNumbers.randomIntBetween(random, rarely() ? 10 : 100, 1000))
);
}
if (randomBoolean()) {
builder.put(
PersistedClusterStateService.DOCUMENT_PAGE_SIZE.getKey(),
new ByteSizeValue(RandomNumbers.randomIntBetween(random, rarely() ? 10 : 100, 1000000))
);
}

@DaveCTurner (Contributor, Author):

I think the more interesting corner cases mostly arise when paginating harder; simply increasing the limit to 1000000 makes those interesting cases much less likely, but I introduced a variety of limits in ade4343.

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jan 4, 2022
@elasticsearchmachine elasticsearchmachine merged commit 641a01b into elastic:master Jan 4, 2022
astefan pushed a commit to astefan/elasticsearch that referenced this pull request Jan 7, 2022
astefan pushed a commit to astefan/elasticsearch that referenced this pull request Jan 7, 2022
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed Meta label for distributed team v8.1.0
5 participants