
TrieLogPruner.preloadQueue Performance Issue #7322

Closed · siladu opened this issue Jul 15, 2024 · 5 comments
Labels: bug, performance
siladu commented Jul 15, 2024

We were first alerted when Discord user ktmkancha mentioned they had upgraded to Besu 24.6.0 (which enabled bonsai-limit-trie-log-enabled by default) and TrieLogPruner was taking 45+ minutes to load. It is suspected this user hadn't pruned their trie log backlog, and it is not known how much trie log data they have.

A second instance of this was reported to me by @gfukushima when upgrading his node, dev-clc-bu-tk-mainnet-mainnet-archive-bkp, which contained 112 GiB of trie logs (47 GiB of SST files and 64 GiB of BlobDB files):

| Column Family                  | Keys            | Total Size  | SST Files Size  | Blob Files Size  |
|--------------------------------|-----------------|-------------|-----------------|------------------|
| TRIE_LOG_STORAGE               | 1564331         | 112 GiB     | 47 GiB          | 64 GiB           |

The time taken was ~45 mins...

{"@timestamp":"2024-07-11T23:45:46,224","level":"INFO","thread":"main","class":"TrieLogPruner","message":"Loading first 30000 trie logs from database...","throwable":""}
{"@timestamp":"2024-07-12T00:32:30,543","level":"INFO","thread":"main","class":"TrieLogPruner","message":"Loaded 30000 trie logs from database","throwable":""}

Subsequent restarts of this node took ~2 mins to load...

{"@timestamp":"2024-07-13T03:16:19,846","level":"INFO","thread":"main","class":"TrieLogPruner","message":"Loading first 30000 trie logs from database...","throwable":""}
{"@timestamp":"2024-07-13T03:18:20,768","level":"INFO","thread":"main","class":"TrieLogPruner","message":"Loaded 30000 trie logs from database","throwable":""}

Note that the first restart pruned ~30,000 trie logs, so the total to load was reduced by only ~2% (from 1,564,331 keys to ~1,534,331).

I did some further testing of another canary, prd-elc-besu-lighthouse-mainnet-nightly-bonsai-snap, which contained 70 GiB of trie logs (the majority of which is BlobDB files)...

| Column Family                  | Keys            | Total Size  | SST Files Size  | Blob Files Size  |
|--------------------------------|-----------------|-------------|-----------------|------------------|
| TRIE_LOG_STORAGE               | 954612          | 70 GiB      | 42 MiB          | 70 GiB           |

This yielded loading times of only 20 seconds...

"@timestamp":"2024-07-12T03:36:29,189","level":"INFO","thread":"main","class":"TrieLogPruner","message":"Loading first 30000 trie logs from database...","throwable":""}
{"@timestamp":"2024-07-12T03:36:47,824","level":"INFO","thread":"main","class":"TrieLogPruner","message":"Loaded 30000 trie logs from database","throwable":""}

The default loading window size is 30_000. This was based on testing which showed the load taking between 4 and 7 seconds for a node with 37 GiB of TRIE_LOG_STORAGE (which predates the storage of trie logs as BlobDB, so it would have been all SST files): #6026 (comment)


Since the tests were only done on 37 GiB of data, it seems plausible that performance degrades in a non-linear fashion, potentially quite rapidly once the database grows large.

The difference between the canaries could be explained by the ratio of SST to BlobDB files. A large number of SST files would be present on older nodes, since BlobDB was only enabled in 24.3.0 and applies from the time of upgrade onwards; it is not retrospectively applied to old SST files.

An interesting observation is that for gfukushima's node, dev-clc-bu-tk-mainnet-mainnet-archive-bkp, the loading time dropped from 45 mins to 2 mins with a difference of only 30_000 trie logs. This perhaps suggests there is something else going on beyond simply the size of the TRIE_LOG_STORAGE column family.

Another potentially significant factor is that both the Discord user's node and gfukushima's node were in the middle of a Besu version upgrade when the 45-minute loading event occurred. In gfukushima's case, the upgrade was from 24.3.0 to 24.6.0.

siladu added the bug and performance labels on Jul 15, 2024
siladu commented Jul 15, 2024

I took some JFR recordings of both canary nodes, running JFR only during the loading window (for < 10 seconds)...
7322-jfr.tar.gz

dev-clc-bu-tk-mainnet-mainnet-archive-bkp: trie log pruner load time was ~2 mins
prd-elc-besu-lighthouse-mainnet-nightly-bonsai-snap: trie log pruner load time was ~20 seconds

siladu commented Jul 15, 2024

I also did some wall clock profiling with e.g.:
asprof -e wall -t -i 5ms -f dev-clc-bu-tk-mainnet-mainnet-archive-bkp_10secs.html -d 10 237462

I captured a 10-second profile during the loading window of both nodes, and a 60-second profile covering the 2-minute window of dev-clc-bu-tk-mainnet-mainnet-archive-bkp...
7322-async-wallclock.tar.gz


I suspect the problem lies in the forEach that processes the results, rather than the database loading itself: https://github.com/hyperledger/besu/blob/main/ethereum/core/src/main/java/org/hyperledger/besu/ethereum/trie/diffbased/common/trielog/TrieLogPruner.java#L77-L92
Perhaps parallelising this processing, or utilising the RocksDB stream somehow, may help here.
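
As a rough, untested illustration of the parallelisation idea, here is a minimal sketch in plain Java. TrieLogEntry and blockNumberOf are hypothetical stand-ins for the real TrieLogPruner internals, not Besu APIs; the real key/value decoding will differ.

```java
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

public class PreloadSketch {

  // Hypothetical stand-in for one loaded trie log row (key/value bytes from RocksDB).
  record TrieLogEntry(byte[] key, byte[] value) {}

  // Hypothetical: derive the block number associated with a trie log entry.
  static long blockNumberOf(final TrieLogEntry entry) {
    return java.nio.ByteBuffer.wrap(entry.key(), 0, Long.BYTES).getLong();
  }

  static ConcurrentSkipListMap<Long, byte[]> buildQueue(final List<TrieLogEntry> loaded) {
    final ConcurrentSkipListMap<Long, byte[]> queue = new ConcurrentSkipListMap<>();
    // parallelStream() spreads the per-entry decoding across the common ForkJoinPool;
    // the concurrent sorted map lets workers insert without extra locking while still
    // yielding a block-number-ordered queue.
    loaded.parallelStream().forEach(e -> queue.put(blockNumberOf(e), e.value()));
    return queue;
  }
}
```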

siladu commented Jul 15, 2024

I have an untested stop-gap solution here that reduces the amount we attempt to prune from 30_000 to 5,000 and warns the user to prune.

#7321
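
For illustration only, a stop-gap along those lines might look roughly like the sketch below; the window size constant and the warning wording are hypothetical and not taken from #7321.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PreloadStopGap {

  private static final Logger LOG = LoggerFactory.getLogger(PreloadStopGap.class);

  // Illustrative value only; the actual window chosen in the PR may differ.
  private static final int LOADING_WINDOW = 5_000;

  static long windowToLoad(final long totalTrieLogKeys) {
    if (totalTrieLogKeys > LOADING_WINDOW) {
      // Surface the backlog to the operator instead of silently spending minutes at startup.
      LOG.warn(
          "Trie log backlog of {} entries exceeds the loading window of {}; "
              + "consider pruning the trie log backlog offline",
          totalTrieLogKeys,
          LOADING_WINDOW);
    }
    return Math.min(totalTrieLogKeys, LOADING_WINDOW);
  }
}
```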

siladu commented Jul 16, 2024

Possible improvements after discussing the wall clock profiling with @ahamlat:

  • Disable RocksDB checksum verification, as it is not needed for reading (see the sketch after this list)
  • Try a parallel stream on the forEach, or use an executor
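
For the first suggestion, RocksJava exposes checksum verification via ReadOptions. A minimal standalone sketch, assuming a direct read-only open of the trie log database; Besu actually goes through its own RocksDB storage abstraction, so the path and open call here are placeholders.

```java
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

public class ChecksumFreeScan {

  public static void main(final String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    // "/path/to/trie-log-db" is a placeholder, not a real Besu path.
    try (final ReadOptions readOptions =
            new ReadOptions()
                .setVerifyChecksums(false) // skip block checksum verification for this scan
                .setFillCache(false); // don't pollute the block cache with a one-off startup scan
        final RocksDB db = RocksDB.openReadOnly("/path/to/trie-log-db");
        final RocksIterator it = db.newIterator(readOptions)) {
      int loaded = 0;
      for (it.seekToFirst(); it.isValid() && loaded < 30_000; it.next()) {
        // hand it.key() / it.value() off to the pruner's preload queue here
        loaded++;
      }
    }
  }
}
```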

siladu commented Jul 16, 2024

Another possibility is to find a way to preload the queue asynchronously.
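
A minimal sketch of that idea, assuming the rest of startup can proceed with an initially empty queue; the Runnable here wraps TrieLogPruner's existing loading logic rather than matching any existing method signature.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncPreload {

  private final ExecutorService preloadExecutor =
      Executors.newSingleThreadExecutor(
          runnable -> {
            final Thread thread = new Thread(runnable, "trie-log-preload");
            thread.setDaemon(true); // don't hold up shutdown if loading is still in progress
            return thread;
          });

  // preloadQueue wraps the existing loading logic (e.g. TrieLogPruner.preloadQueue).
  public CompletableFuture<Void> startPreload(final Runnable preloadQueue) {
    // Kick the load off in the background so startup can continue; callers may
    // join() the future later if they need the queue fully populated before pruning.
    return CompletableFuture.runAsync(preloadQueue, preloadExecutor);
  }
}
```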
