Splitstore: Online Garbage Collection for the Coldstore #6577

Open · vyzo opened this issue Jun 23, 2021 · 10 comments

vyzo (Contributor) commented Jun 23, 2021

Need for Space

Once the splitstore has been deployed (see also #6474) we have the ability to perform online garbage collection for the coldstore, as we control writes.
Specifically, we only write to the coldstore during compaction, when we move newly cold objects, protected by the compaction lock.
That means we can perform gc on the coldstore without disrupting regular node operations or requiring downtime.

Garbage collecting the coldstore is an essential operation for keeping space usage bounded in non-archival nodes -- see also #4701.

Design Considerations

Garbage collection must effectively reclaim space; hence we can't use native badger gc, which is poor at reclaiming space and requires hacks to convince it to reclaim as much space as possible.
Furthermore, even if we do manage to reclaim all the space possible, the gc'ed blockstore tends to quickly balloon back up in size.

Instead, we propose a moving garbage collector for the coldstore, which also allows us to tune the gc process to the user's needs.

Fundamentally, the gc operation will instantiate a new (empty) coldstore, walk the chain for live objects in the coldstore according to user retention policies, and then move live objects to the new coldstore.

Once the move is complete, the new coldstore becomes the actual coldstore and the old coldstore is deleted.
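At a high level the flow would look roughly like the sketch below. This is illustrative only: the Blockstore interface is the standard go-ipfs-blockstore one, and walkLive is a hypothetical stand-in for the chain walk driven by the user's retention policy.

import (
    "context"

    cid "github.com/ipfs/go-cid"
    blockstore "github.com/ipfs/go-ipfs-blockstore"
)

// movingGC copies every live object from the old coldstore into a freshly
// created one; walkLive visits all objects retained under the user's policy.
func movingGC(ctx context.Context, oldCold, newCold blockstore.Blockstore,
    walkLive func(func(cid.Cid) error) error) error {
    err := walkLive(func(c cid.Cid) error {
        blk, err := oldCold.Get(c)
        if err != nil {
            return err
        }
        return newCold.Put(blk)
    })
    if err != nil {
        return err
    }
    // the caller then swaps newCold in as the coldstore and deletes the
    // old coldstore directory to reclaim its space
    return nil
}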

Retention Policies:

At a minimum, we must retain chain headers all the way to genesis: it is not currently safe to discard them due to unlimited randomness lookback (this may change in the future, but we still want to keep them in order to be able to navigate the chain).

Apart from that it is up to the user:

  • whether to retain messages, and for how many finalities (or all of them)
  • whether to retain message receipts
  • how deep to retain state roots and associated objects.

So the garbage collection interface must allow the user to specify preferences/policies that match their own demand.

A possible sane default:

  • retain all messages and receipts to allow other nodes to sync using our node.
  • retain state roots for up to a few finalities (this could perhaps be none) to allow resets.
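For illustration, the user-facing knobs might look roughly like the following; the type and field names are hypothetical, not an actual lotus API:

// ColdstoreGCOptions is a sketch of the per-invocation GC preferences.
type ColdstoreGCOptions struct {
    // NewColdstorePath is where the new coldstore is created; live objects
    // are moved there and the path is then symlinked into the datastore.
    NewColdstorePath string

    // MessageFinalities is how many finalities of messages to retain;
    // 0 means retain all messages.
    MessageFinalities int

    // RetainReceipts controls whether message receipts are kept.
    RetainReceipts bool

    // StateRootFinalities is how many finalities of state roots (and
    // associated objects) to retain; chain headers are always kept.
    StateRootFinalities int
}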

Additional considerations:

  • the coldstore is by design likely to be housed on a separate disk and symlinked into the .lotus/datastore/chain path.
  • we must not run out of space during the move; the user must ensure enough space is available. Generally this should be less than the size of the current coldstore, but a more precise estimate can be made from the size of the hotstore and the number of finalities we want to retain for state objects.

In order to address these issues, we propose that the user supplies the new coldstore path at the time of move, which is then symlinked by the system itself into ~/.lotus/datastore/chain.
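A minimal sketch of the symlink swap, assuming the chain path is already a symlink (the function name and the temporary-link scheme are illustrative):

import "os"

// repointColdstore atomically re-points the chain symlink at the new
// coldstore directory by creating a new link and renaming it over the old one.
func repointColdstore(chainLink, newColdPath string) error {
    tmp := chainLink + ".new"
    if err := os.Symlink(newColdPath, tmp); err != nil {
        return err
    }
    // rename(2) is atomic, so readers never observe a missing link
    return os.Rename(tmp, chainLink)
}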

Interface

We propose to introduce a new (v1) API which can be invoked to trigger gc on demand, perhaps through a cron job.
The API handler will try to cast the blockstore to the splitstore, and if successful invoke the relevant interface with the options supplied by the user.
The cli frontend can be either a lotus command or a lotus-shed command; it doesn't really matter.
We might want to use a lotus-shed command while the splitstore remains experimental and later migrate to the lotus binary when it becomes the default.
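As a sketch of the shape of that handler, reusing the hypothetical ColdstoreGCOptions from above (the method name, the GCColdstore call, and the blockstore accessor are all assumptions; splitstore.SplitStore is the splitstore's concrete type):

// ChainGCColdstore is a hypothetical v1 API handler that triggers coldstore GC.
func (a *ChainAPI) ChainGCColdstore(ctx context.Context, opts ColdstoreGCOptions) error {
    ss, ok := a.Chain.Blockstore().(*splitstore.SplitStore)
    if !ok {
        return xerrors.New("coldstore GC requires the splitstore to be enabled")
    }
    // delegate to the splitstore with the user-supplied retention options
    return ss.GCColdstore(ctx, opts)
}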

What About the Hotstore?

The hotstore is gc'ed online with badger's gc after every compaction.
This doesn't reclaim all the space that it can, but over time it does reclaim enough space to not balloon out of control.
If uncontrolled growth of the hotstore is observed, we can add a lotus-shed command to implement moving gc for the hotstore itself.
This will require node downtime however.

vyzo added the team/ignite (Issues and PRs being tracked by Team Ignite at Protocol Labs) label Jun 23, 2021
vyzo self-assigned this Jun 23, 2021
jennijuju added the area/chain (Area: Chain) label Jun 23, 2021
Stebalien (Member) commented:

Problem: manually copying blocks from one badger datastore to another is slow (need to re-build indexes, etc).
Solution: use badger's streaming read/write interfaces.

However, to do this "generically", we'll likely need to extend the blockstore interface (or implement optional extensions).

Solution 1: Filter(callback)

Remove everything not matching the given filter function:

type BlockstoreFilter interface {
    // Filter deletes every block for which cb returns false (true means keep).
    Filter(ctx context.Context, cb func(ctx context.Context, mh multihash.Multihash, data []byte) (bool, error)) error
}

In practice, the underlying blockstore would either:

  1. Delete all non-matching entries.
  2. Create a new blockstore, copy relevant entries, delete the old blockstore.

For badger-backed blockstores, we'd likely do the latter. If we want to get fancy, we could do a bit of random sampling and pick copy/in-place depending on the amount of data we expect to delete.

Solution 2: CopyTo(target, callback)

More generally, we could implement a CopyTo function to copy all matching blocks from one datastore to another.

type BlockstoreCopy interface {
    // CopyTo copies every block for which cb returns true into target.
    CopyTo(ctx context.Context, target Blockstore, cb func(ctx context.Context, mh multihash.Multihash, data []byte) (bool, error)) error
}

This is significantly more general purpose (would work for estuary as well) but:

  1. Requires some brittle type-assertions to optimize correctly. That is, we'd need to type-assert the target blockstore to, e.g., a badger-backed blockstore. Unfortunately, if it's wrapped, we'd fail and silently fall back to a slow copy operation.
  2. Doesn't allow the random sampling trick suggested in solution 1.

These solutions aren't exclusive so we could just implement the solution best for this case (likely solution 1) and implement CopyTo at some later date.
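For the splitstore's use case, driving Solution 1 might look roughly like this (assuming the predicate-style callback above; liveSet is a hypothetical mark set produced by the chain walk):

// gcColdstoreWithFilter keeps only the blocks the chain walk marked as live.
func gcColdstoreWithFilter(ctx context.Context, cold BlockstoreFilter,
    liveSet func(multihash.Multihash) bool) error {
    return cold.Filter(ctx, func(ctx context.Context, mh multihash.Multihash, _ []byte) (bool, error) {
        // return true to keep the block, false to let it be deleted
        return liveSet(mh), nil
    })
}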

iand (Contributor) commented Jun 30, 2021

Has any consideration been given to time bounds of the GC? What is the target timeframe for a GC to complete? A moving collector is essentially unbounded in time if the collection rate is lower than the rate of growth of the chain and state tree. This may not be a problem at present but exponential growth in deals and capacity is going to rapidly increase the amount of garbage.

An alternate approach is to keep two cold stores, green and blue. Initially all compaction writes go to the green store. Reads go to both. When its size exceeds a threshold all writes are switched to the blue store. Records are kept of which blocks are dead in the green store as they become unreachable due to new blocks being written to blue. When the number of live blocks in the green store falls below a fixed threshold they are all copied to the blue store, the green store is replaced with an empty one and the roles are switched. The fixed threshold at which the copy takes place gives a time bound to the operation.
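A rough sketch of the collection step only; the size/liveness bookkeeping, the read fan-out over both stores, and the write-target switch are elided, and all names are illustrative:

// greenBlue holds the two coldstores: compaction writes go to `filling`
// until it exceeds its size threshold, while `filled` decays as its blocks
// become unreachable; reads consult both.
type greenBlue struct {
    filled, filling blockstore.Blockstore
    filledLive      int64 // live bytes remaining in the previously filled store
}

// maybeCollect copies the remaining live blocks out of the filled store once
// they drop below a fixed threshold; that threshold bounds the copying work
// (and hence the time) spent per collection cycle.
func (gb *greenBlue) maybeCollect(liveThreshold int64,
    copyLive func(from, to blockstore.Blockstore) error) error {
    if gb.filledLive > liveThreshold {
        return nil // too much is still live; copying now would not be time-bounded
    }
    if err := copyLive(gb.filled, gb.filling); err != nil {
        return err
    }
    // the filled store is then emptied and takes over as the write target
    // once the current one exceeds its size threshold
    gb.filledLive = 0
    return nil
}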

Stebalien (Member) commented Jun 30, 2021 via email

iand (Contributor) commented Jun 30, 2021

> 1. In the short-term, we can set a threshold of something like 20% live and be fine. But over time, "permanent state" will grow and "20%" could be hundreds of gigabytes (meaning we wouldn't GC until we hit (statesize/0.2)).

My intention was that the threshold would be a fixed amount of work, not a fraction of the total.

Stebalien (Member) commented:

> My intention was that the threshold would be a fixed amount of work, not a fraction of the total.

Unfortunately, that would end up with lots of copying once the base state-tree starts approaching the threshold.

iand (Contributor) commented Jun 30, 2021

> My intention was that the threshold would be a fixed amount of work, not a fraction of the total.
>
> Unfortunately, that would end up with lots of copying once the base state-tree starts approaching the threshold.

I don't follow. It's a fixed amount of work (a target number of bytes, chosen as a function of I/O capacity), so it would be set to an acceptable level of copying.

Stebalien (Member) commented:

At some point, we'll stop hitting that threshold because the state-tree will grow to 100s of GiB.

Stebalien (Member) commented:

I.e., the "green" store will have all the sector infos, and those sector infos will remain live for 6-18 months.

vyzo (Contributor, Author) commented Jul 14, 2021

Implementation in #6728

Kubuxu (Contributor) commented Jul 14, 2021

> For badger-backed blockstores, we'd likely do the latter. If we want to get fancy, we could do a bit of random sampling and pick copy/in-place depending on the amount of data we expect to delete.

Another way would be to force badger to completely rewrite the value log by setting a very low value for discardRatio in RunValueLogGC. This way badger will be forced to rewrite, for example, all value log files that contain even 1% or 5% garbage.
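RunValueLogGC and ErrNoRewrite are badger's actual API; the loop and the ratio below are just an illustration of forcing a near-complete value log rewrite:

import badger "github.com/dgraph-io/badger/v2"

// forceValueLogRewrite keeps invoking badger's value log GC with a very low
// discard ratio until no value log file is left worth rewriting.
func forceValueLogRewrite(db *badger.DB) error {
    for {
        err := db.RunValueLogGC(0.01) // rewrite files with as little as ~1% garbage
        if err == badger.ErrNoRewrite {
            return nil
        }
        if err != nil {
            return err
        }
    }
}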
