Splitstore: Online Garbage Collection for the Coldstore #6577

Open · vyzo opened this issue Jun 23, 2021 · 10 comments

vyzo (Contributor) commented Jun 23, 2021

Need for Space

Once the splitstore has been deployed (see also #6474) we have the ability to perform online garbage collection for the coldstore, as we control writes.
Specifically, we only write to the coldstore during compaction, when we move newly cold objects, protected by the compaction lock.
That means we can perform gc on the coldstore without disrupting regular node operations or requiring downtime.

Garbage collecting the coldstore is an essential operation for keeping space usage bounded in non-archival nodes -- see also #4701.

Design Considerations

Garbage collection must effectively reclaim space; hence we can't use native badger gc, which is poor at reclaiming space and requires hacks to convince it to reclaim as much space as possible.
Furthermore, even if we do manage to reclaim all the space possible, the gc'ed blockstore tends to quickly balloon back up in size.

Instead, we propose a moving garbage collector for the coldstore, which also allows us to tune the gc process to the user's needs.

Fundamentally, the gc operation will instantiate a new (empty) coldstore, walk the chain for live objects in the coldstore according to user retention policies, and then move live objects to the new coldstore.

Once the move is complete, the new coldstore becomes the actual coldstore and the old coldstore is deleted.
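At a high level the flow would look roughly like the sketch below. This is illustrative only: the Blockstore interface is the standard go-ipfs-blockstore one, and walkLive is a hypothetical stand-in for the chain walk driven by the user's retention policy.

import (
    "context"

    cid "github.com/ipfs/go-cid"
    blockstore "github.com/ipfs/go-ipfs-blockstore"
)

// movingGC copies every live object from the old coldstore into a freshly
// created one; walkLive visits all objects retained under the user's policy.
func movingGC(ctx context.Context, oldCold, newCold blockstore.Blockstore,
    walkLive func(func(cid.Cid) error) error) error {
    err := walkLive(func(c cid.Cid) error {
        blk, err := oldCold.Get(c)
        if err != nil {
            return err
        }
        return newCold.Put(blk)
    })
    if err != nil {
        return err
    }
    // the caller then swaps newCold in as the coldstore and deletes the
    // old coldstore directory to reclaim its space
    return nil
}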

Retention Policies:

At a minimum, we must retain chain headers all the way to genesis: it is not currently safe to discard them due to unlimited randomness lookback (this may change in the future, but we still want to keep them in order to be able to navigate the chain).

Apart from that it is up to the user:

  • whether to retain messages, and for how many finalities (or all of them)
  • whether to retain message receipts
  • how deep to retain state roots and associated objects.

So the garbage collection interface must allow the user to specify preferences/policies that match their own demand.

A possible sane default:

  • retain all messages and receipts to allow other nodes to sync using our node.
  • retain state roots for up to a few finalities (this could perhaps be none) to allow resets.
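For illustration, the user-facing knobs might look roughly like the following; the type and field names are hypothetical, not an actual lotus API:

// ColdstoreGCOptions is a sketch of the per-invocation GC preferences.
type ColdstoreGCOptions struct {
    // NewColdstorePath is where the new coldstore is created; live objects
    // are moved there and the path is then symlinked into the datastore.
    NewColdstorePath string

    // MessageFinalities is how many finalities of messages to retain;
    // 0 means retain all messages.
    MessageFinalities int

    // RetainReceipts controls whether message receipts are kept.
    RetainReceipts bool

    // StateRootFinalities is how many finalities of state roots (and
    // associated objects) to retain; chain headers are always kept.
    StateRootFinalities int
}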

Additional considerations:

  • the coldstore is by design likely to be housed on a separate disk and symlinked into the .lotus/datastore/chain path.
  • we must not run out of space during the move; the user must ensure enough space is available. Generally this should be less than the size of the current coldstore, but a more precise estimate can be made from the size of the hotstore and the number of finalities we want to retain for state objects.

In order to address these issues, we propose that the user supplies the new coldstore path at the time of move, which is then symlinked by the system itself into ~/.lotus/datastore/chain.
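A minimal sketch of the symlink swap, assuming the chain path is already a symlink (the function name and the temporary-link scheme are illustrative):

import "os"

// repointColdstore atomically re-points the chain symlink at the new
// coldstore directory by creating a new link and renaming it over the old one.
func repointColdstore(chainLink, newColdPath string) error {
    tmp := chainLink + ".new"
    if err := os.Symlink(newColdPath, tmp); err != nil {
        return err
    }
    // rename(2) is atomic, so readers never observe a missing link
    return os.Rename(tmp, chainLink)
}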

Interface

We propose to introduce a new (v1) API which can be invoked to trigger gc on demand, perhaps through a cron job.
The API handler will try to cast the blockstore to the splitstore, and if successful invoke the relevant interface with the options supplied by the user.
The cli frontend can be either a lotus command or a lotus-shed command; it doesn't really matter.
We might want to use a lotus-shed command while the splitstore remains experimental and later migrate to the lotus binary when it becomes the default.
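As a sketch of the shape of that handler, reusing the hypothetical ColdstoreGCOptions from above (the method name, the GCColdstore call, and the blockstore accessor are all assumptions; splitstore.SplitStore is the splitstore's concrete type):

// ChainGCColdstore is a hypothetical v1 API handler that triggers coldstore GC.
func (a *ChainAPI) ChainGCColdstore(ctx context.Context, opts ColdstoreGCOptions) error {
    ss, ok := a.Chain.Blockstore().(*splitstore.SplitStore)
    if !ok {
        return xerrors.New("coldstore GC requires the splitstore to be enabled")
    }
    // delegate to the splitstore with the user-supplied retention options
    return ss.GCColdstore(ctx, opts)
}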

What About the Hotstore?

The hotstore is gc'ed online with badger's gc after every compaction.
This doesn't reclaim all the space that it can, but over time it does reclaim enough space to not balloon out of control.
If uncontrolled growth of the hotstore is observed, we can add a lotus-shed command to implement moving gc for the hotstore itself.
This will require node downtime however.

vyzo added the team/ignite (Issues and PRs being tracked by Team Ignite at Protocol Labs) label Jun 23, 2021
vyzo self-assigned this Jun 23, 2021
jennijuju added the area/chain (Area: Chain) label Jun 23, 2021
Stebalien (Member) commented:

Problem: manually copying blocks from one badger datastore to another is slow (need to re-build indexes, etc).
Solution: use badger's streaming read/write interfaces.

However, to do this "generically", we'll likely need to extend the blockstore interface (or implement optional extensions).

Solution 1: Filter(callback)

Remove everything not matching the given filter function:

type BlockstoreFilter interface {
    // Filter deletes every block for which cb returns false (true means keep).
    Filter(ctx context.Context, cb func(ctx context.Context, mh multihash.Multihash, data []byte) (bool, error)) error
}

In practice, the underlying blockstore would either:

  1. Delete all non-matching entries.
  2. Create a new blockstore, copy relevant entries, delete the old blockstore.

For badger-backed blockstores, we'd likely do the latter. If we want to get fancy, we could do a bit of random sampling and pick copy/in-place depending on the amount of data we expect to delete.

Solution 2: CopyTo(target, callback)

More generally, we could implement a CopyTo function to copy all matching blocks from one datastore to another.

type BlockstoreCopy interface {
    // CopyTo copies every block for which cb returns true into target.
    CopyTo(ctx context.Context, target Blockstore, cb func(ctx context.Context, mh multihash.Multihash, data []byte) (bool, error)) error
}

This is significantly more general purpose (would work for estuary as well) but:

  1. Requires some brittle type-assertions to optimize correctly. That is, we'd need to type-assert the target blockstore to, e.g., a badger-backed blockstore. Unfortunately, if it's wrapped, we'd fail and silently fall back to a slow copy operation.
  2. Doesn't allow the random sampling trick suggested in solution 1.

These solutions aren't exclusive so we could just implement the solution best for this case (likely solution 1) and implement CopyTo at some later date.
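For the splitstore's use case, driving Solution 1 might look roughly like this (assuming the predicate-style callback above; liveSet is a hypothetical mark set produced by the chain walk):

// gcColdstoreWithFilter keeps only the blocks the chain walk marked as live.
func gcColdstoreWithFilter(ctx context.Context, cold BlockstoreFilter,
    liveSet func(multihash.Multihash) bool) error {
    return cold.Filter(ctx, func(ctx context.Context, mh multihash.Multihash, _ []byte) (bool, error) {
        // return true to keep the block, false to let it be deleted
        return liveSet(mh), nil
    })
}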

iand (Contributor) commented Jun 30, 2021

Has any consideration been given to time bounds of the GC? What is the target timeframe for a GC to complete? A moving collector is essentially unbounded in time if the collection rate is lower than the rate of growth of the chain and state tree. This may not be a problem at present but exponential growth in deals and capacity is going to rapidly increase the amount of garbage.

An alternate approach is to keep two cold stores, green and blue. Initially all compaction writes go to the green store. Reads go to both. When its size exceeds a threshold all writes are switched to the blue store. Records are kept of which blocks are dead in the green store as they become unreachable due to new blocks being written to blue. When the number of live blocks in the green store falls below a fixed threshold they are all copied to the blue store, the green store is replaced with an empty one and the roles are switched. The fixed threshold at which the copy takes place gives a time bound to the operation.
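A rough sketch of the collection step only; the size/liveness bookkeeping, the read fan-out over both stores, and the write-target switch are elided, and all names are illustrative:

// greenBlue holds the two coldstores: compaction writes go to `filling`
// until it exceeds its size threshold, while `filled` decays as its blocks
// become unreachable; reads consult both.
type greenBlue struct {
    filled, filling blockstore.Blockstore
    filledLive      int64 // live bytes remaining in the previously filled store
}

// maybeCollect copies the remaining live blocks out of the filled store once
// they drop below a fixed threshold; that threshold bounds the copying work
// (and hence the time) spent per collection cycle.
func (gb *greenBlue) maybeCollect(liveThreshold int64,
    copyLive func(from, to blockstore.Blockstore) error) error {
    if gb.filledLive > liveThreshold {
        return nil // too much is still live; copying now would not be time-bounded
    }
    if err := copyLive(gb.filled, gb.filling); err != nil {
        return err
    }
    // the filled store is then emptied and takes over as the write target
    // once the current one exceeds its size threshold
    gb.filledLive = 0
    return nil
}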

Stebalien (Member) commented Jun 30, 2021 via email

iand (Contributor) commented Jun 30, 2021

> 1. In the short-term, we can set a threshold of something like 20% live and be fine. But over time, "permanent state" will grow and "20%" could be hundreds of gigabytes (meaning we wouldn't GC until we hit (statesize/0.2)).

My intention was that the threshold would be a fixed amount of work, not a fraction of the total.

Stebalien (Member) commented:

> My intention was that the threshold would be a fixed amount of work, not a fraction of the total.

Unfortunately, that would end up with lots of copying once the base state-tree starts approaching the threshold.

iand (Contributor) commented Jun 30, 2021

> My intention was that the threshold would be a fixed amount of work, not a fraction of the total.
>
> Unfortunately, that would end up with lots of copying once the base state-tree starts approaching the threshold.

I don't follow. It's a fixed amount of work (a target number of bytes, chosen as a function of I/O capacity), so it would be set to an acceptable level of copying.

Stebalien (Member) commented:

At some point, we'll stop hitting that threshold because the state-tree will grow to 100s of GiB.

Stebalien (Member) commented:

I.e., the "green" store will have all the sector infos, and those sector infos will remain live for 6-18 months.

vyzo (Contributor, Author) commented Jul 14, 2021

Implementation in #6728

Kubuxu (Contributor) commented Jul 14, 2021

> For badger-backed blockstores, we'd likely do the latter. If we want to get fancy, we could do a bit of random sampling and pick copy/in-place depending on the amount of data we expect to delete.

Another way would be to force badger to completely rewrite the value log by setting a very low value for discardRatio in RunValueLogGC. This way badger will be forced to rewrite, for example, all value log files that contain even 1% or 5% garbage.
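RunValueLogGC and ErrNoRewrite are badger's actual API; the loop and the ratio below are just an illustration of forcing a near-complete value log rewrite:

import badger "github.com/dgraph-io/badger/v2"

// forceValueLogRewrite keeps invoking badger's value log GC with a very low
// discard ratio until no value log file is left worth rewriting.
func forceValueLogRewrite(db *badger.DB) error {
    for {
        err := db.RunValueLogGC(0.01) // rewrite files with as little as ~1% garbage
        if err == badger.ErrNoRewrite {
            return nil
        }
        if err != nil {
            return err
        }
    }
}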
