Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TrieLogPruner preload with 30 second timeout #7365

Merged
merged 12 commits into from
Jul 26, 2024

Conversation

siladu
Copy link
Contributor

@siladu siladu commented Jul 23, 2024

PR description

The TrieLogPruner preload operation executes during startup. It streams a batch of trie logs from rocksdb, looks up their block headers to get a block number and adds them to the prune queue. Finally it triggers a prune operation on the prune queue, which may prune some of these trie logs if they are eligible - note pruning from the queue is fast.

The preload is executed synchronously on the main thread. Under certain conditions, this can take a long time, particularly when the trie log column family contains a large number of keys.

This PR wraps the whole operation in a 30 second timeout to avoid startup hangs.

It is still synchronous and blocks the main startup thread because doing this async (as explored in #7337) affects the performance (especially disk I/O) of the normal operation of the node: block import and potentially block proposal.

In addition, the default pruning window size is reduced from 30_000 to 5_000 which should be sufficient to cover any pruning gaps due to maintenance downtime, while reducing the impact of the performance issue.

The expectation is that if the user has a backlog of trie logs, they will follow the guide to use the prune subcommand. As such, a log has been added which will be the last log line printed before the 30 second timeout window expires.

Here's an example from before this PR of a 2.5minute load...

{"@timestamp":"2024-07-23T05:56:27,308","level":"INFO","thread":"main","class":"TrieLogPruner","message":"Loading first 5000 trie logs from database...","throwable":""}
{"@timestamp":"2024-07-23T05:59:01,297","level":"INFO","thread":"main","class":"TrieLogPruner","message":"Loaded 5000 trie logs from database","throwable":""}

And now with this PR...

{"@timestamp":"2024-07-23T05:56:29,465","level":"INFO","thread":"main","class":"TrieLogPruner","message":"Attempting to load first 5000 trie logs from database...","throwable":""}
{"@timestamp":"2024-07-23T05:56:29,466","level":"INFO","thread":"main","class":"TrieLogPruner","message":"Trie log pruning will timeout after 30 seconds. If this is timing out, consider using `besu storage trie-log prune` subcommand, see https://besu.hyperledger.org/public-networks/how-to/bonsai-limit-trie-logs","throwable":""}
{"@timestamp":"2024-07-23T05:56:59,466","level":"WARN","thread":"pool-8-thread-1","class":"TrieLogPruner","message":"Timeout occurred while loading and processing 5000 trie logs from database","throwable":""}
{"@timestamp":"2024-07-23T05:56:59,536","level":"INFO","thread":"main","class":"TrieLogPruner","message":"Operation timed out, but still pruned 513 trie logs.","throwable":""}

Fixed Issue(s)

#7322

Thanks for sending a pull request! Have you done the following?

  • Checked out our contribution guidelines?
  • Considered documentation and added the doc-change-required label to this PR if updates are required.
  • Considered the changelog and included an update if required.
  • For database changes (e.g. KeyValueSegmentIdentifier) considered compatibility and performed forwards and backwards compatibility tests

Locally, you can run these tests to catch failures early:

  • unit tests: ./gradlew build
  • acceptance tests: ./gradlew acceptanceTest
  • integration tests: ./gradlew integrationTest
  • reference tests: ./gradlew ethereum:referenceTests:referenceTests

@siladu siladu self-assigned this Jul 23, 2024
Copy link
Contributor

@pinges pinges left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a question for you, but I think it looks good and adding the timeout is a good thing!


private void preloadQueue(
final AtomicBoolean timeoutOccurred, final ScheduledFuture<?> timeoutFuture) {

try (final Stream<byte[]> trieLogKeys = rootWorldStateStorage.streamTrieLogKeys(loadingLimit)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to start streaming at a random key? There is a method streamFromKey() in class KeyValueStorage that allows you to pass in the starting key.

@siladu siladu marked this pull request as ready for review July 25, 2024 22:31
@siladu siladu enabled auto-merge (squash) July 26, 2024 01:54
@siladu siladu merged commit 51bb6c7 into hyperledger:main Jul 26, 2024
40 checks passed
gconnect pushed a commit to gconnect/besu that referenced this pull request Aug 26, 2024
Also reduce pruning window from 30_000 to 5_000

---------

Signed-off-by: Simon Dudley <[email protected]>
Co-authored-by: Sally MacFarlane <[email protected]>
Signed-off-by: gconnect <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants