TrieLogPruner.preloadQueue Performance Issue #7322
I took some JFR recordings of both canary nodes, running JFR only during the loading window (< 10 seconds)... The dev-clc-bu-tk-mainnet-mainnet-archive-bkp trie log pruner load time was ~2 mins.
I also did some wall clock profiling: a 10-second profile during the loading window of both nodes, and a 60-second profile during the ~2 min window on dev-clc-bu-tk-mainnet-mainnet-archive-bkp... I suspect the problem lies in the forEach that processes the results, rather than in the database loading itself: https://github.com/hyperledger/besu/blob/main/ethereum/core/src/main/java/org/hyperledger/besu/ethereum/trie/diffbased/common/trielog/TrieLogPruner.java#L77-L92
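For anyone wanting to reproduce this, a minimal sketch of capturing the same kind of short, targeted recording programmatically with the standard `jdk.jfr` API (the class name and the `simulatePreload` stand-in below are hypothetical; the recordings above were taken with the usual external tooling):

```java
import java.nio.file.Path;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

public class PreloadWindowRecording {
  public static void main(String[] args) throws Exception {
    // Start a JFR recording just before the section of interest and dump it right
    // after, mirroring the ~10 second recordings taken during the preload window.
    Configuration config = Configuration.getConfiguration("profile");
    try (Recording recording = new Recording(config)) {
      recording.start();

      simulatePreload(); // hypothetical stand-in for TrieLogPruner.preloadQueue()

      recording.stop();
      recording.dump(Path.of("trielog-preload.jfr")); // inspect in JDK Mission Control
    }
  }

  private static void simulatePreload() throws InterruptedException {
    Thread.sleep(2_000); // placeholder for the work being profiled
  }
}
```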
I have an untested stop-gap solution here that reduces the amount we attempt to prune at startup from 30_000 to 5_000 and warns the user to prune.
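A rough sketch of the shape of that stop-gap idea (the class, method, and constant names below are illustrative only, not the actual patch): cap how much is preloaded at startup and warn the operator when a larger backlog is detected.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only -- names are hypothetical and do not reflect the real patch.
public class PreloadCapSketch {
  private static final Logger LOG = LoggerFactory.getLogger(PreloadCapSketch.class);

  // The previous behaviour attempted to load/prune up to 30_000 trie logs at startup;
  // the stop-gap idea caps this at 5_000.
  private static final int REDUCED_PRELOAD_LIMIT = 5_000;

  static int effectivePreloadLimit(final long backlogSize) {
    if (backlogSize > REDUCED_PRELOAD_LIMIT) {
      LOG.warn(
          "Trie log backlog has {} entries but only {} will be preloaded at startup; "
              + "consider pruning the backlog to avoid slow restarts.",
          backlogSize,
          REDUCED_PRELOAD_LIMIT);
    }
    return REDUCED_PRELOAD_LIMIT;
  }
}
```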
Possible improvements after discussing the wall clock profiling with @ahamlat:
Another possibility is to find a way to preload the queue asynchronously (see the sketch below).
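A minimal sketch of what that might look like, assuming the existing synchronous preload work can simply be handed to a background thread (all names here are hypothetical; a real change would also need the pruning queue to be safe for concurrent access while the node starts importing new blocks):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: run the existing preload work off the startup path.
public class AsyncPreloadSketch {

  private final ExecutorService preloadExecutor =
      Executors.newSingleThreadExecutor(
          runnable -> {
            Thread thread = new Thread(runnable, "trie-log-preload");
            thread.setDaemon(true); // don't block JVM shutdown
            return thread;
          });

  /** Kicks off the preload in the background and returns immediately. */
  public CompletableFuture<Void> preloadQueueAsync(final Runnable preloadQueue) {
    // preloadQueue stands in for the current synchronous TrieLogPruner.preloadQueue() work.
    return CompletableFuture.runAsync(preloadQueue, preloadExecutor);
  }

  public static void main(String[] args) {
    AsyncPreloadSketch sketch = new AsyncPreloadSketch();
    CompletableFuture<Void> preload =
        sketch.preloadQueueAsync(() -> System.out.println("preloading trie log queue..."));
    preload.join(); // in the real change, startup would continue without joining here
  }
}
```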
We were first alerted to this when Discord user ktmkancha mentioned they had upgraded to Besu 24.6.0 (which enabled bonsai-limit-trie-log-enabled by default) and TrieLogPruner was taking 45+ mins to load. It is suspected this user hadn't pruned their trie log backlog, and it is not known how much trie log data they have.
A second instance of this was reported to me by @gfukushima when upgrading his node, dev-clc-bu-tk-mainnet-mainnet-archive-bkp, which contained 112 GiB (47 GiB of which was SST files, 64 GiB BlobDB files). The time taken was ~45 mins...
Subsequent restarts of this node took ~2 mins to load...
Note that the first restart pruned ~30_000 trie logs, so the total to load was reduced by ~2% (from 1,564,331 keys to 1,534,331).
I did some further testing of another canary, prd-elc-besu-lighthouse-mainnet-nightly-bonsai-snap, which contained 70 GiB of trie logs (the majority of which are BlobDB files)... This yielded loading times of only 20 seconds...
The default loading window size is 30_000. This was based on some testing which showed the load taking 4-7 seconds for a node with 37 GiB of TRIE_LOG_STORAGE (which predates the storage of trie logs as BlobDB, so it would have been all SST files): #6026 (comment)
Since the tests were only done on 37 GiB of data, it seems plausible that performance degrades non-linearly, potentially quite rapidly once the database gets large: going from ~4-7 seconds on 37 GiB to ~45 mins on 112 GiB is roughly a 400-700x slowdown for about 3x the data, with the same 30_000-entry window.
The difference between the canaries could be explained by the ratio of SST to BlobDB files. A large number of SST files would be present on older nodes, since BlobDB was only enabled in 24.3.0 and applies from the time of upgrade onwards; it is not retrospectively applied to old SST files.
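For context on why BlobDB only applies going forward, a minimal RocksJava sketch (illustrative only; this is not Besu's actual storage configuration): blob separation happens when values are written, or rewritten during compaction, so turning the option on does not convert data already sitting in SST files.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Illustrative sketch of enabling BlobDB via RocksJava -- not Besu's actual configuration.
public class BlobDbOptionsSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options options =
            new Options()
                .setCreateIfMissing(true)
                .setEnableBlobFiles(true) // large values are written to blob files...
                .setMinBlobSize(100)) { // ...once they exceed this size in bytes
      try (RocksDB db = RocksDB.open(options, "/tmp/blobdb-sketch")) {
        // Only values written (or rewritten by compaction) after the option is enabled
        // end up in blob files; pre-existing SST data stays where it is.
        db.put("trieLogKey".getBytes(), new byte[4096]);
      }
    }
  }
}
```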
An interesting observation is that for gfukushima's node, dev-clc-bu-tk-mainnet-mainnet-archive-bkp, loading time dropped from ~45 mins to ~2 mins with a difference of only ~30_000 trie logs. This perhaps suggests there is something else going on beyond simply the size of the TRIE_LOG_STORAGE column family. Another potentially significant factor is that both the Discord user's node and gfukushima's node were in the middle of a Besu version upgrade when the 45-min loading event occurred; in gfukushima's case, from 24.3.0 -> 24.6.0.