Paginate persisted cluster state #78875
Conversation
Force-pushed d26e53f to bb7b5c1
Force-pushed b7eb459 to 572fffd
Today we allocate a contiguous chunk of memory for the global metadata each time we write it to disk. The size of this chunk is unbounded and in practice it can be pretty large. This commit splits the metadata document up into pages (1MB by default) that are streamed to disk at write time, bounding the memory usage of cluster state persistence. Since the memory usage is now bounded we can allocate a single buffer up front and re-use it for every write.
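The paging scheme described above can be sketched roughly as follows. This is an illustrative sketch, not the actual Elasticsearch implementation: the class and method names (`PageWriter`, `writePages`, `fillBuffer`) are invented for this example. The key idea it demonstrates is that one fixed-size buffer is allocated up front and refilled for every page, so memory use stays bounded regardless of how large the metadata stream grows.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of bounded-memory page writing.
public class PageWriter {
    private final byte[] buffer; // allocated once, reused for every page

    public PageWriter(int pageSize) {
        this.buffer = new byte[pageSize];
    }

    /** Splits the stream into pages; the last page may be shorter than the page size. */
    public List<byte[]> writePages(InputStream in) {
        List<byte[]> pages = new ArrayList<>();
        try {
            int filled;
            while ((filled = fillBuffer(in)) > 0) {
                byte[] page = new byte[filled];
                System.arraycopy(buffer, 0, page, 0, filled);
                pages.add(page); // the real code streams each page to disk instead of collecting it
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pages;
    }

    // Reads as much as possible into the shared buffer; returns the number of bytes read.
    private int fillBuffer(InputStream in) throws IOException {
        int offset = 0;
        while (offset < buffer.length) {
            int read = in.read(buffer, offset, buffer.length - offset);
            if (read == -1) {
                break;
            }
            offset += read;
        }
        return offset;
    }
}
```

With a 1MB page size, a 10MB metadata document would produce ten pages while only ever holding one page's worth of scratch memory.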
Force-pushed 572fffd to 6dc6a56
Pinging @elastic/es-distributed (Team:Distributed)
Now that #78525 is resolved, I think this is ready to go.
…08-paginate-persisted-cluster-state
Sorry for the delay here @DaveCTurner, this is on my list to benchmark today.
npnp I think I've successfully merged the reformatted master in now 🤞
LGTM.
One question: I think we should never be able to see a downgraded node exercise this code, since reading this happens after the NodeEnvironment reads node metadata. Just wanted to confirm that with you.
final Query query = new TermQuery(new Term(TYPE_FIELD_NAME, type));
final Weight weight = indexSearcher.createWeight(query, ScoreMode.COMPLETE_NO_SCORES, 0.0f);
logger.trace("running query [{}]", query);
final Map<String, PaginatedDocumentReader> documentReaders = new HashMap<>();
I wonder if we risk increased memory use here in case we are very unfortunate and see one page of every (or at least many) index's metadata before seeing the last page of any of them? It seems unlikely to matter for a few reasons: index metadata often fits in a single page, doc-id order is likely to result in index-by-index reading, and the total cluster state needs to fit in memory anyway. I wonder if we could or should extract documents in order? Or just add a comment to explain why this is not important.
This runs fairly early within Node#start, so there won't be much else happening in the node yet, and later on we will need to completely serialize the cluster state in memory anyway, so I think it should be fine. Also, I think we'll be in trouble for other reasons sooner if index metadata exceeds 1MB when compressed and serialized. I added a comment in 16d6471.
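For context, the concern above is about how a per-document page reader buffers partial documents. The class name `PaginatedDocumentReader` comes from the diff, but the body below is an assumed sketch, not the real implementation: it shows why pages arriving interleaved across many documents would hold memory for every partially-read document at once, until each one's last page arrives.

```java
import java.util.ArrayList;
import java.util.List;

// Assumed sketch of a per-document page accumulator (real internals may differ).
public class PaginatedDocumentReader {
    private final List<byte[]> pages = new ArrayList<>();
    private int totalBytes = 0;

    /**
     * Buffers one page; returns the assembled document once the page marked
     * as last has been added, or null while pages are still outstanding.
     */
    public byte[] addPage(byte[] page, boolean isLastPage) {
        pages.add(page);
        totalBytes += page.length;
        if (isLastPage == false) {
            return null; // pages still pending: interleaved reads keep this buffered
        }
        byte[] document = new byte[totalBytes];
        int offset = 0;
        for (byte[] p : pages) {
            System.arraycopy(p, 0, document, offset, p.length);
            offset += p.length;
        }
        return document;
    }
}
```

One reader per document key is held in the `documentReaders` map from the diff above, so worst-case memory is one partial document per key, which the reply argues is acceptable since the whole cluster state must fit in memory later anyway.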
if (random.nextInt(10) == 0) {
    builder.put(
        PersistedClusterStateService.DOCUMENT_PAGE_SIZE.getKey(),
        new ByteSizeValue(RandomNumbers.randomIntBetween(random, rarely() ? 10 : 100, 1000))
    );
}
I wonder if we should apply this more often but use between(10, 1000000) to allow for more variety of cases?
Suggested change, from:

if (random.nextInt(10) == 0) {
    builder.put(
        PersistedClusterStateService.DOCUMENT_PAGE_SIZE.getKey(),
        new ByteSizeValue(RandomNumbers.randomIntBetween(random, rarely() ? 10 : 100, 1000))
    );
}

to:

if (randomBoolean()) {
    builder.put(
        PersistedClusterStateService.DOCUMENT_PAGE_SIZE.getKey(),
        new ByteSizeValue(RandomNumbers.randomIntBetween(random, rarely() ? 10 : 100, 1000000))
    );
}
I think the more interesting corner cases mostly arise when paginating harder; simply increasing the limit to 1000000 would make those interesting cases much less likely, but I introduced a variety of limits in ade4343.
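The kind of varied-limit randomization described in this reply could be sketched as below. The tier structure and values here are assumptions for illustration, not the exact limits introduced in ade4343; the point is that small upper bounds (which force heavy pagination) remain well represented rather than being drowned out by one large bound.

```java
import java.util.Random;

// Illustrative sketch of picking a page size from a variety of limit tiers.
public class PageSizeRandomizer {
    public static int randomPageSize(Random random) {
        final int tier = random.nextInt(3);
        final int upperBound;
        if (tier == 0) {
            upperBound = 100;        // tiny pages: pagination exercised heavily
        } else if (tier == 1) {
            upperBound = 1000;       // small pages
        } else {
            upperBound = 1_000_000;  // near the 1MB default, rarely paginates
        }
        return 10 + random.nextInt(upperBound - 9); // uniform in [10, upperBound]
    }
}
```

A flat `between(10, 1000000)` would pick a value under 1000 only about 0.1% of the time, whereas tiering keeps the hard-pagination cases at roughly two runs in three.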