Issues with syncing from scratch and long resyncs in the splitstore #6769

Open
vyzo opened this issue Jul 16, 2021 · 5 comments
Assignees: vyzo
Labels: kind/bug

@vyzo (Contributor) commented Jul 16, 2021

From discussion in #5788

It has become apparent that there are two scenarios that cause difficulties for splitstore syncing and compaction, both of which blow up the hotstore:

  • In a sync from scratch (@whyrusleeping wants to do that), everything goes into the hotstore, which will then try to compact once synced and will very likely fail because of memory requirements.
  • In a resync after a long downtime, the same problem arises: the hotstore might be blown up and compaction might have difficulty running.

To fix this, the splitstore will need to detect when it is far out of sync (in Start), say by more than CompactionThreshold.
When that is the case, it should switch into a mode where writes are redirected to the coldstore until the node is fully synced, at which point a warmup is run to fetch state object references; the splitstore can then run as normal (compact, etc).
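
A minimal sketch of that detect-and-redirect flow, in Go, might look like the following. It is illustrative only: apart from Start and CompactionThreshold, which are mentioned above, the names, fields, and the threshold value are assumptions, not the actual lotus splitstore API.

package splitstore

// CompactionThreshold is an assumed epoch distance, for illustration only.
const CompactionThreshold = 5 * 2880

type SplitStore struct {
	baseEpoch int64 // epoch of the last compaction boundary
	coldOnly  bool  // when true, writes bypass the hotstore and go to the coldstore
}

// Start checks how far behind the chain head we are; if we are more than
// CompactionThreshold epochs out of sync, redirect writes to the coldstore.
func (s *SplitStore) Start(headEpoch int64) {
	if headEpoch-s.baseEpoch > CompactionThreshold {
		s.coldOnly = true
	}
}

// OnSyncComplete would be called once the node has fully caught up: resume
// normal hotstore writes and run a warmup to pull state references back in.
func (s *SplitStore) OnSyncComplete(headEpoch int64) {
	if s.coldOnly {
		s.coldOnly = false
		s.baseEpoch = headEpoch
		s.warmup() // walk reachable state and copy references into the hotstore
	}
}

func (s *SplitStore) warmup() {
	// placeholder: the real warmup walks the chain state and populates the hotstore
}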


Original Post:
I've run into an issue. I have limited disk space for my hot store, but lots of storage on my cold store. Due to an unrelated error, my lotus node lost sync for a while. Now that lotus is trying to catch up, it never compacts; it fills my hot store disk and then stops syncing again.

I tried starting lotus with --no-bootstrap in the hope that I could trigger a compaction manually, but there is no way to do that either.

Originally posted by @clinta in #5788 (comment)

@vyzo changed the title from "I've run into an issue. I have limited disk space on my hot store, but lots of storage on my cold store. But due to a different error my lotus lost sync for a while. Now that lotus is trying to catch up, it never compacts and fills my hot store disk, then stops syncing again." to "Support syncing from scratch and long resyncs in the splitstore" on Jul 16, 2021
@vyzo changed the title from "Support syncing from scratch and long resyncs in the splitstore" to "Issues with syncing from scratch and long resyncs in the splitstore" on Jul 16, 2021
@vyzo added the kind/bug label on Jul 16, 2021
@vyzo self-assigned this on Jul 16, 2021
@clinta (Contributor) commented Jul 16, 2021

I'm concerned about directing writes straight into the coldstore. My coldstore is not nearly as performant as my hotstore. My original motivation for switching to the splitstore was that the performance of my coldstore SATA SSDs was not good enough to stay in sync. Switching to the splitstore resolved that, and lotus ran fine for months, until another unrelated issue caused it to stop syncing; since then I've been stuck.

I think forcing a compaction when either (a) the compaction is so large that it will likely fail due to memory, or (b) the hotstore is out of disk space, would be a preferable solution.

@vyzo (Contributor, Author) commented Jul 28, 2021

We have now merged support for on-disk marksets into master; this alleviates memory pressure during compaction and may well be enough to bring your node back to life.

Can you give it a try?
You can enable it by adding this to the config:

[Chainstore.Splitstore]
  MarkSetType = "badger"

@vyzo (Contributor, Author) commented Jul 28, 2021

I would also recommend doing a full (moving) GC on your hotstore to bring it back down to about 55 GB.
You can force a full GC in the next compaction with this setting:

[Chainstore.Splitstore]
  HotStoreFullGCFrequency = 1

The default is 20, which does a full GC every 20 compactions (about once a week); you can restore the default after compacting or leave it at 1 if you wish. Moving GC is not all that slow (it takes about 9 minutes on my nodes).
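
As a rough illustration of how that setting behaves (a sketch only, not the lotus implementation; the function and parameter names here are hypothetical), the frequency acts as a modulus on a running compaction counter:

package splitstore

// shouldDoFullGC sketches the described behavior: a full (moving) GC runs on
// every compaction whose index is a multiple of HotStoreFullGCFrequency.
// A frequency of 1 therefore forces a full GC on every compaction.
func shouldDoFullGC(compactionIndex int64, hotStoreFullGCFrequency uint64) bool {
	if hotStoreFullGCFrequency == 0 {
		return false // assumption: a value of 0 disables full GC entirely
	}
	return compactionIndex%int64(hotStoreFullGCFrequency) == 0
}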

@vyzo (Contributor, Author) commented Jul 30, 2021

Further work on reducing memory usage is in #6949.

@vyzo (Contributor, Author) commented Feb 7, 2022

Long resyncs should be much better with #8008, which uses an on-disk coldset and eliminates sorting.
