State tree / app state garbage collection #154
Without "stopping the world", I'm not sure this avoids errors. Say a key is not in use when the sweep happens, so it's not recorded in the final Bloom filter. Then, while unused keys are being deleted, a new request arrives and writes to one of those unused keys. But the Bloom filter does not have that key recorded (unfortunate), so the key will be deleted anyway.
In my mind there is no "final" Bloom filter: the database wrapper enters a state where it uses a Bloom filter to record writes, which are not to be deleted. At some point it receives a historical Bloom filter, the result of traversing all the data reachable from the state roots we want to keep; it merges the two, but keeps adding new writes to the now-combined Bloom filter. Meanwhile it also receives batches of old keys to delete iff they are not in the current Bloom filter. At some point the process ends and we can discard the Bloom filter and go back to normal mode. I'm not sure what you mean exactly by a "key is not used when the sweep happens"; I understand it as: it's not being written to, and it's also not reachable in the history we want to keep. Then we start the deletion and this key is sent to the wrapper to be deleted. Two things can happen: 1) it gets deleted, but then the write puts it back, or 2) it's written and recorded in the current Bloom filter, and subsequently it's not deleted. Either way we should end up with the same state.
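The wrapper behaviour described above can be sketched as follows. This is a minimal illustration with hypothetical names (`Bloom`, `GcStore`) and a dict standing in for the real database, not the actual implementation; the filter parameters are arbitrary.

```python
import hashlib

class Bloom:
    """Fixed-size Bloom filter; the k hash positions are derived from SHA-256."""
    def __init__(self, m_bits=1 << 16, k=4):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits)

    def _positions(self, key: bytes):
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key: bytes):
        return all(self.bits[p] for p in self._positions(key))

    def merge(self, other: "Bloom"):
        assert (self.m, self.k) == (other.m, other.k)
        self.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))

class GcStore:
    """Wrapper that records writes into a Bloom filter during a GC round
    and deletes candidate keys only if they are absent from the filter."""
    def __init__(self):
        self.db = {}
        self.live = None  # Bloom of writes (and, after merge, of history)

    def start_gc(self):
        self.live = Bloom()

    def put(self, key: bytes, value):
        self.db[key] = value
        if self.live is not None:
            self.live.add(key)  # writes during GC are never deleted

    def merge_history(self, reachable: Bloom):
        # Filter built by traversing data reachable from the kept state roots.
        self.live.merge(reachable)

    def delete_batch(self, candidates):
        for key in candidates:
            if key not in self.live and key in self.db:
                del self.db[key]

    def end_gc(self):
        self.live = None  # back to normal mode
```

In case (1) above, a delete followed by a write simply re-inserts the key; in case (2) the write lands in the filter before the delete batch is tested, so the delete is skipped. Either order converges to the same state.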
Indeed, stop the world is not necessary. We can implement concurrent mark-and-sweep algos, although there may be some critical sections where we do need to stop writes. We can always serve reads, as long as the APIs are aware of the garbage collection runs, e.g. of the height at which pruning starts.

There is some precedent for this in Lotus' splitstore (hot/cold store), but here's your gentle warning that it's complex, as its design goal was to keep an eden and a tenured generation: eden/hot for the critical blockchain validation path, and tenured/cold for user queries.

I like the idea of using Bloom filters: tracking live writes into mergeable Bloom filters, and deleting in batches where we retest against the live filters. We can tune the Bloom filter size and the number of hashes depending on the entry count at the start of the round (which RocksDB can estimate for us).
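Tuning the filter from the entry count can use the standard Bloom sizing formulas; a sketch, assuming `n` comes from a RocksDB estimate such as the `rocksdb.estimate-num-keys` property:

```python
import math

def bloom_params(n: int, p: float) -> tuple[int, int]:
    """Optimal Bloom filter size m (bits) and hash count k for an
    expected n entries at target false-positive rate p:
        m = -n * ln(p) / (ln 2)^2
        k = (m / n) * ln 2
    """
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

# 1M keys at a 1% false-positive target:
m, k = bloom_params(1_000_000, 0.01)
# → roughly 9.6 Mbit (~1.2 MB) and 7 hash functions
```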
I was thinking about implementing garbage collection for the ledger.
See the Forest and Lotus implementations.
Forest uses a "semi-space" strategy of copying the reachable blocks in the store to a new database and deleting the old one, always having the last two databases, writing into the 'current' and reading from both 'current' and 'old'. They use ParityDB with hashed keys, so they cannot enumerate keys in their original format, which would be required for filtering if they wanted to implement "mark and sweep" GC. Their strategy works because everything in their store is IPLD data, so they can analyse reachability from the tipset roots, and everything relevant gets copied.
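The "semi-space" scheme can be sketched in a few lines; here dicts stand in for the two databases and values carry explicit child links, a stand-in for following IPLD links from the tipset roots (names hypothetical):

```python
def reachable(store: dict, roots: list) -> set:
    """Walk IPLD-style links; each value is (payload, [child_keys])."""
    seen, stack = set(), list(roots)
    while stack:
        key = stack.pop()
        if key in seen or key not in store:
            continue
        seen.add(key)
        stack.extend(store[key][1])
    return seen

def semispace_gc(old: dict, roots: list) -> dict:
    """Copy everything reachable from the roots into a fresh 'current'
    store; the 'old' store is dropped afterwards. While the copy is in
    flight, writes go to 'current' and reads consult both stores."""
    return {k: old[k] for k in reachable(old, roots)}

old = {
    "root": ("tipset", ["a", "b"]),
    "a":    ("state", []),
    "b":    ("msgs", ["c"]),
    "c":    ("receipt", []),
    "junk": ("orphan", []),
}
new = semispace_gc(old, ["root"])
# → "junk" is not copied; everything reachable from "root" is
```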
That's not how we use RocksDB: we maintain separate column families for different types of data, e.g. one for the actor state and one for data arriving over Bitswap.
The latter separation was introduced so that data arriving over Bitswap has no chance of affecting actor calculations before the FVM has completed its reachability analysis. This shouldn't be an issue with EVM actors, but if we had builtin actors, it could be. NB now that we have snapshots, and potentially garbage collection itself, the actor storage can also differ between nodes, so random CID lookups cannot be considered deterministic.
I was thinking about implementing a sort of mark-and-sweep with Bloom filters on the actor state.
This way we can collect e.g. 95% of the unused keys (depends on how big the Bloom filter is) using limited memory and no blocking.
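The fraction of unused keys we can actually collect follows from the filter's false-positive rate: a false positive is an unused key that looks live and therefore survives the round. A small sketch of that relationship, with illustrative numbers:

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    """Expected false-positive rate after n insertions into an m-bit
    Bloom filter with k hash functions: p ≈ (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

# 1M live keys, 5 bits per key, 3 hashes:
p = false_positive_rate(1_000_000, 5_000_000, 3)
collected = 1 - p
# → roughly 9% of the garbage survives this round; ~91% is collected
```

Keys that survive one round due to a false positive can still be collected in a later round, since each round uses fresh hashes over a fresh filter.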
The downside is that the database has to be wrapped into something that receives read and write requests through a channel (reads should go through the same channel if we want them to be consistent with writes) and processes them in a thread. It can potentially perform deletions in batches.
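A minimal sketch of such a wrapper, with a dict standing in for RocksDB and a `queue.Queue` standing in for the channel (all names hypothetical); because every operation flows through one queue, reads observe all prior writes, and deletes are flushed in batches:

```python
import queue
import threading

class ChannelStore:
    """Serializes reads, writes, and batched deletions through a single
    channel so they stay mutually consistent."""
    def __init__(self, delete_batch: int = 64):
        self.db, self.q = {}, queue.Queue()
        self.pending_deletes, self.batch = [], delete_batch
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            op, key, payload = self.q.get()
            if op == "put":
                self.db[key] = payload
            elif op == "get":
                payload.put(self.db.get(key))  # payload is a reply queue
            elif op == "delete":
                self.pending_deletes.append(key)
                if len(self.pending_deletes) >= self.batch:
                    self._flush()
            elif op == "flush":
                self._flush()
                payload.put(None)
            elif op == "stop":
                return

    def _flush(self):
        for k in self.pending_deletes:
            self.db.pop(k, None)
        self.pending_deletes.clear()

    def put(self, key, value):
        self.q.put(("put", key, value))

    def get(self, key):
        reply = queue.Queue()
        self.q.put(("get", key, reply))
        return reply.get()

    def delete(self, key):
        self.q.put(("delete", key, None))

    def flush(self):
        reply = queue.Queue()
        self.q.put(("flush", None, reply))
        reply.get()
```

A real wrapper would forward to RocksDB and use a `WriteBatch` for the flush; the point here is only the single-consumer ordering.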
RocksDB should eventually compact the data to reclaim the free space and heal the fragmentation. This is a problem the "semi-space" strategy doesn't have, but in return this approach doesn't require up to 100% extra disk space.
A different question is how to deal with the Bitswap store, where there are no obvious reachability guidelines. For example, we wanted to use Bitswap for resolving cross-messages, and there is no obvious way to tie these to any root, because unlike Forest, which has the tipset, in our case the only custodian of blocks and transactions is CometBFT. If we want roots for cross-messages (checkpoints), we have to add them to the application state explicitly. Then we could run the same garbage collector on the Bitswap store as well. For now the codepaths using Bitswap haven't been enabled yet.
Maybe worth noting that it is also possible to drop an entire column family, to implement "semi-space" on a per-namespace level, but it's probably not a good idea because we don't know when RocksDB will reclaim the space, so it would potentially duplicate the storage requirement for an unknown amount of time, whereas the multi-database approach simply deletes the files no longer in use.
We could potentially also change the way the application state history is maintained to be an AMT, to have a single blockstore. The KVStore abstraction is different in that it offers transactionality.