
FATAL: Caught exception: It appears that Fulcrum was forcefully killed in the middle of committing a block... #41

Open
ghost opened this issue Jul 25, 2020 · 61 comments
Labels
question Further information is requested

Comments

@ghost

ghost commented Jul 25, 2020

Great SPV server! It is fast and just what I needed.

I've encountered a problem when restarting Fulcrum:

FATAL: Caught exception: It appears that Fulcrum was forcefully killed in the middle of committing a block to the db. We cannot figure out where exactly in the update process Fulcrum was killed, so we cannot undo the inconsistent state caused by the unexpected shutdown. Sorry!

I restarted the server (via systemd) and also created an image snapshot for the server and then this error appears.

Having used Electrum Cash for a while, I never had any problems with this before (for years) even though I would restart/terminate instances abruptly.

Is there any way to prevent this problem from happening, or to gracefully recover? Thank you!

@cculianu
Owner

cculianu commented Jul 25, 2020

Yeah it's a known issue with the way I did the data layout. I will have to redesign the data layout to avoid this in a future version. The recommended way to stop Fulcrum is to send it SIGINT and wait a good 60 seconds. (Usually it's done in 5-10s). See if you can configure systemd to send SIGINT or SIGTERM and have it wait for completion and not kill the process right away. I believe on most systems by default it does wait 30s or more...
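
For example (purely illustrative -- the paths, unit settings, and 300-second timeout below are assumptions, not Fulcrum defaults), something along these lines in the service unit gives Fulcrum time to finish its writes before systemd escalates to SIGKILL:

# example unit settings; adjust the paths to your own install
[Service]
ExecStart=/usr/local/bin/Fulcrum /etc/fulcrum/fulcrum.conf
KillSignal=SIGINT
TimeoutStopSec=300

systemd only falls back to SIGKILL after TimeoutStopSec expires, so a graceful flush normally completes well within that window.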

You will have to resynch, unfortunately. :/ Sorry about that.

A future version will try to be ACID -- but for now I took speed shortcuts -- so a hard shutdown runs the risk of this issue if it happens while a block has just arrived and the DB is being updated.

I understand that ElectrumX did not suffer from this. It was also slower. :)

I will see if I can do ACID without too much of a perf. hit in a future version. For now you will have to resynch from scratch though. Sorry...

If this makes you worried you can always also backup the synched DB (with Fulcrum stopped). That way you can always restore from backup. FWIW I have been running my server for months now and never had to restore from backup.

Sorry about that.

@cculianu cculianu added the question Further information is requested label Jul 25, 2020
@ghost
Author

ghost commented Jul 25, 2020

Awesome! Thanks for the quick reply.

Yeah it's a known issue with the way I did the data layout. I will have to redesign the data layout to avoid this in a future version.

Would you be able to give me a hint on where to look so that I can implement it?

The problem is not that I force-kill the process; the problem is that the server could be under high load and the OS terminates the process abruptly.

I'm curious about the locking here:

https://github.com/cculianu/Fulcrum/blob/master/src/Storage.cpp#L1232

Does this mean that while we are processing a block, no one can query mempool/UTXOs (i.e., blocked threads) until the block is committed?

Thanks so much!

@georgengelmann

I restarted the server (via systemd) and also created an image snapshot for the server and then this error appears.

What's in your systemd file? I don't have RestartSec=60s in it, but it never crashed.

@cculianu
Owner

cculianu commented Jul 26, 2020

@atteeela --

Would you be able to give me a hint on where to look so that I can implement it?

It's not a trivial fix. It would require redesigning the database to use a single table with different "column families" for each of the logical pieces: utxo_set, scripthash_history, headers, etc. If all of the data lives in a single table it's possible to do "begin transaction", "end transaction" pairs when updating the data as new blocks arrive, and it would be ACID -- in that case even yanking the power cord or whatever will not lead to any corruption (just a rollback). So, given that, it's more than a quick hackjob -- a person with significant experience in rocksdb and C++ would be needed to do this. I can do this myself -- the reason I didn't do it initially is that it was slower than the data layout I have now. I wanted to design this server to be as fast as possible.

Potentially the new data layout would be optional and only for users that prefer ACID over speed. So -- the database layer would need to be abstracted a bit to handle both data layouts.

It's not a small job; I can do it myself -- it just would take me a lot of time.
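
To make the idea concrete, one way to get that atomicity in rocksdb is to put all of a block's writes into a single WriteBatch spanning the column families. Below is a bare-bones sketch of that approach only -- the column family names, keys, and values are placeholders, not Fulcrum's actual schema:

// A single atomic WriteBatch across column families -- placeholder names and
// keys only, not Fulcrum's actual schema.
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/write_batch.h>
#include <cassert>
#include <vector>

int main() {
    rocksdb::Options opts;
    opts.create_if_missing = true;
    opts.create_missing_column_families = true;

    // One column family per logical table (illustrative names).
    std::vector<rocksdb::ColumnFamilyDescriptor> cfs{
        {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
        {"utxo_set", rocksdb::ColumnFamilyOptions()},
        {"scripthash_history", rocksdb::ColumnFamilyOptions()},
        {"headers", rocksdb::ColumnFamilyOptions()},
    };
    std::vector<rocksdb::ColumnFamilyHandle *> handles;
    rocksdb::DB *db = nullptr;
    rocksdb::Status st = rocksdb::DB::Open(opts, "/tmp/acid_sketch_db", cfs, &handles, &db);
    assert(st.ok());

    // Everything a block touches goes into one batch; rocksdb commits it atomically,
    // so after a crash the DB is either entirely before or entirely after the block.
    rocksdb::WriteBatch batch;
    batch.Put(handles[3], "header:700000", "<serialized header>");
    batch.Put(handles[1], "txid:vout", "<new utxo>");
    batch.Delete(handles[1], "spent_txid:vout");
    batch.Put(handles[2], "scripthash", "<updated history>");

    rocksdb::WriteOptions wopts;
    wopts.sync = true;  // fsync the write-ahead log before returning
    st = db->Write(wopts, &batch);
    assert(st.ok());

    for (auto *h : handles) db->DestroyColumnFamilyHandle(h);
    delete db;
}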

Does this mean that while we are processing a block, no one can query mempool/UTXOs (i.e., blocked threads) until the block is committed?

Yes, that's correct. Typically the locks are held for less than 5ms (sometimes less than 1ms), and blocks arrive once every 10 minutes, so it's not exactly a huge deal. This amounts to locks being held exclusively about 0.00083% of the time, on average. You may get more of a burp and slowdown from your OS kernel than from this. The rest of the time everything is very much parallel.

@vul-ture

vul-ture commented Oct 3, 2020

I'm running into this issue. It's annoying that I have to start sync over from the beginning. I tried tweaking the db_max_open_files, max_pending_connections, and bitcoind_throttle values down but it's still happening around block 450,000. Can anyone recommend a workaround? Thanks

@cculianu
Owner

cculianu commented Oct 3, 2020

@vul-ture Is it happening on initial synch? Or later?

Don't kill Fulcrum with kill -9 -- make sure you wait for it to gracefully shut down...

@vul-ture

vul-ture commented Oct 3, 2020

It happens on initial sync; I'm not even sending a signal to Fulcrum. It could be that my RPC connections are getting saturated(?).
I'm enabling debugging and stats and will update.

@cculianu
Owner

cculianu commented Oct 3, 2020

Huh? If you haven't synched yet RPC is not even up yet.

There are two possibilities here:

  1. You are out of disk space
  2. You are out of disk space

Please ensure that the directory you use is on a filesystem that has ~40GB free for mainnet.

@cculianu
Owner

cculianu commented Oct 3, 2020

Also please ensure you are using a filesystem that supports >2GB files... (e.g. no FAT32 or other ancient filesystem).
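
A quick way to check both at once (the datadir path here is just a placeholder -- substitute your own):

# show free space and filesystem type for the Fulcrum datadir
$ df -hT /path/to/fulcrum_datadir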

@vul-ture

vul-ture commented Oct 3, 2020

I meant the RPC connections to the bitcoin daemon.
I'm using an ext4 filesystem with 120GB free.

@cculianu
Owner

cculianu commented Oct 3, 2020

RPC to bitcoind being too slow shouldn't lead to this error message -- you would instead see some warnings about connections dropped / reconnect to bitcoind -- but it would recover from that.

@vul-ture

vul-ture commented Oct 3, 2020

Right, the daemon connection is fast as well. It must be a DB write issue. Performance of this app is excellent, ~10k transactions/sec. If I can work around this error and sync, it should work fine. My specs are pretty good, so I don't think it's a hardware issue.

@cculianu
Owner

cculianu commented Oct 3, 2020

Ok well without more info I can't help. I still think somehow your data dir is not on a filesystem with enough space ...

Maybe some verbose logging will elucidate things, one might hope.

@vul-ture

vul-ture commented Oct 4, 2020

OK I'm able to reproduce the problem, this might be a different issue than the original bug.

[2020-10-04 12:19:27.431] Verifying headers ...
[2020-10-04 12:19:27.431] (Debug) Verifying 481460 headers ...
[2020-10-04 12:19:28.518] (Debug) Read & verified 481460 headers from db in 1086.771 msec
[2020-10-04 12:19:28.518] Initializing header merkle cache ...
[2020-10-04 12:19:29.066] (Debug) Merkle cache initialized to length 481460
[2020-10-04 12:19:29.079] (Debug) Read TxNumNext from file: 248286376
[2020-10-04 12:19:29.079] Checking tx counts ...
[2020-10-04 12:19:30.933] 248286376 total transactions
[2020-10-04 12:19:30.933] UTXO set: 50838980 utxos, 4270.474 MB
[2020-10-04 12:19:30.990] (Debug) Storage starting thread
[2020-10-04 12:19:30.991] BitcoinDMgr: starting 3 bitcoin rpc clients ...
[2020-10-04 12:19:30.991] (Debug) Changed pingtime_ms: 10000
[2020-10-04 12:19:30.991] (Debug) BitcoinD.1 starting thread
[2020-10-04 12:19:30.991] (Debug) Changed pingtime_ms: 10000
[2020-10-04 12:19:30.991] (Debug) BitcoinD.2 starting thread
[2020-10-04 12:19:30.991] (Debug) Changed pingtime_ms: 10000
[2020-10-04 12:19:30.991] (Debug) BitcoinD.3 starting thread
[2020-10-04 12:19:30.991] (Debug) BitcoinDMgr starting thread
[2020-10-04 12:19:30.992] BitcoinDMgr: started ok
[2020-10-04 12:19:30.992] (Debug) Controller starting thread
[2020-10-04 12:19:30.991] <BitcoinD.1> (Debug) TCP BitcoinD.1 (id: 2) socket state: 1
[2020-10-04 12:19:30.991] <BitcoinD.2> (Debug) TCP BitcoinD.2 (id: 3) socket state: 1
[2020-10-04 12:19:30.991] <BitcoinD.2> (Debug) TCP BitcoinD.2 (id: 3) socket state: 2
[2020-10-04 12:19:30.991] <BitcoinD.1> (Debug) TCP BitcoinD.1 (id: 2) socket state: 2
[2020-10-04 12:19:30.992] <BitcoinD.3> (Debug) TCP BitcoinD.3 (id: 4) socket state: 1
[2020-10-04 12:19:30.992] <BitcoinD.3> (Debug) TCP BitcoinD.3 (id: 4) socket state: 2
[2020-10-04 12:19:31.002] <BitcoinD.2> (Debug) TCP BitcoinD.2 (id: 3) 10.10.1.2:8332 socket state: 3
[2020-10-04 12:19:31.002] <BitcoinD.2> (Debug) on_connected 3
[2020-10-04 12:19:31.002] <BitcoinD.1> (Debug) TCP BitcoinD.1 (id: 2) 10.10.1.2:8332 socket state: 3
[2020-10-04 12:19:31.002] <BitcoinD.3> (Debug) TCP BitcoinD.3 (id: 4) 10.10.1.2:8332 socket state: 3
[2020-10-04 12:19:31.002] <BitcoinD.1> (Debug) on_connected 2
[2020-10-04 12:19:31.002] <BitcoinD.3> (Debug) on_connected 4
[2020-10-04 12:19:31.004] (Debug) Auth recvd from bicoind with id: 3, proceeding with processing ...
[2020-10-04 12:19:31.006] (Debug) Refreshed version info from bitcoind, version: 0.16.3, subversion: /Satoshi:0.16.3/
[2020-10-04 12:19:31.007] (Debug) Refreshed genesis hash from bitcoind: 000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
[2020-10-04 12:19:31.101] Block height 651269, downloading new blocks ...
[2020-10-04 12:19:31.101] (Debug) Task.DL 481460 -> 651269 starting thread
[2020-10-04 12:19:31.101] (Debug) Task.DL 481461 -> 651269 starting thread
[2020-10-04 12:19:31.101] (Debug) Task.DL 481462 -> 651269 starting thread
[2020-10-04 12:19:31.101] (Debug) Task.DL 481463 -> 651269 starting thread
[2020-10-04 12:19:31.102] (Debug) Task.DL 481464 -> 651269 starting thread
[2020-10-04 12:19:31.102] (Debug) Task.DL 481465 -> 651269 starting thread
[2020-10-04 12:19:31.102] (Debug) Task.DL 481466 -> 651269 starting thread
[2020-10-04 12:20:32.059] <Task.DL 481460 -> 651269> [Qt Warning] Qt has caught an exception thrown from an event handler. Throwing
exceptions from an event handler is not supported in Qt.
You must not let any exception whatsoever propagate through Qt code.
If that is not possible, in Qt 5 you must at least reimplement
QCoreApplication::notify() and catch all exceptions there.
(:0, )

console reports
(:0, )
what(): ReadCompactSize(): size too large: iostream error

Looks like my bitcoind might have a corrupted block? investigating

@cculianu
Owner

cculianu commented Oct 4, 2020

Oh man! You’re on bitcoin core (btc).

Fulcrum is for Bitcoin Cash.

However you piqued my curiosity. I will try and get it working with bitcoin btc (maybe) in the coming week or two.

Yeah fulcrum is for bch...

That explains it!

@vul-ture

vul-ture commented Oct 4, 2020

Lol oops, sorry about that! I will use it for BCH then
and thanks for taking a look at Core.
Looks like I synched a little past the BCH fork block (478559) and then it died.

@cculianu
Owner

cculianu commented Nov 4, 2020

Hey @vul-ture if you want you can try using Fulcrum with BTC if you start bitcoind on BTC with -rpcserialversion=0. I think in that case the bitcoind will "speak the same language" as Fulcrum and it may succeed in syncing.

I am going to test this myself but from my reading of the bitcoin core sourcecode it should work.
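
For reference, the flag can go on the bitcoind command line or in bitcoin.conf (the two forms below are equivalent; this is only to show where it goes):

# command line
$ bitcoind -rpcserialversion=0
# or in bitcoin.conf
rpcserialversion=0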

@vul-ture

testing now with the rpcserialversion flag

@cculianu
Owner

cculianu commented Nov 14, 2020

Hey! @vul-ture So actually latest Fulcrum release fully supports BTC. Don't use that flag -- it will give you problems later when you connect Electrum to it.

Just get rid of that flag, blow away your existing Fulcrum db, and resynch again from scratch. When it finishes synching you can serve up BTC. I have been doing so for the past week without problems. I have 222 users connected right now.

@vul-ture

Working, thanks! I think this is the fastest electrum server I've seen.

@cculianu
Owner

Wow man thanks for the compliment. :) Yeah I tried to make it fast FAST. That was my #1 design goal. (And of course also correctness was too).

Thanks!

@cculianu
Owner

@vul-ture PS: Be sure to be on latest Fulcrum 1.4.0 since in the 1.3.x series BTC support was still beta and the mempool code wasn't as fast as it is in 1.4.x series... If you aren't already on Fulcrum 1.4.x, upgrading is simple and doesn't require a db resynch.

@apemithrandir

Just tagging a comment to say I ran into this issue too. I had to forcibly kill the VM that Fulcrum was running on. Then when I brought the VM back up I got this error.
Will have to re-sync now.
Aside from this very happy with the server.

@cculianu
Owner

@apemithrandir I see. Sorry to hear that. I do plan on making Fulcrum more resilient to unfortunately-timed crashes in the future. It is a partially solvable problem, meaning that we can get it to a state where for 99% of crashes it should be able to auto-recover. It just requires more logic in the code to diagnose what went wrong and backtrack a bit. Thanks for the feedback. I have never had to re-sync Fulcrum after a hard system reset, and I've had my server randomly lose power or be forced to randomly reset about 12 times in the last 3 years (I was running my server out of my apartment at one point, without battery backup, ha ha).

I sort of optimistically thought that such corruption issues would be rare. But this just proves that anything that can go wrong will, eventually, given a large enough install base.

@apemithrandir

I think it was a combination of my VM being off for a few days and then being brought back online. Bitcoind was still grabbing blocks and Fulcrum was also still updating. My VM was acting laggy/unresponsive and then I went to restart. When restarting the VM, it just hung on a black screen and I had to force a power-off.
It is the first time that my Fulcrum has required a re-sync like this. My machine has crashed before, but normally when the chain was up to date. With an up-to-date chain, the chance that a block was being processed at the time of the crash is very low.

@craigraw

Although it pains me to say it, I would give up a little of Fulcrum's stunning performance (if necessary) to have this implemented. Although it tends to happen more during initial indexing (often due to an overly optimistic fast-sync configuration), as you note it's inevitable that with a large enough user base there will be abnormal shutdowns during indexing, particularly with many RPi nodes running without UPS backup. This has been the main technical consideration I have come across from implementors looking to switch from Electrs.

@cculianu
Owner

Yeah, it only happens if Fulcrum was in the process of writing out data for a new block -- so during a catch-up phase or a sync it can happen after a non-graceful exit. Under normal operation it's unlikely since a block arrives once every 10 mins and only takes 20-100msec to process, depending on CPU and HD speed..

But yes, this is solvable and I will focus on that in the future. 100% agreed.

@apemithrandir

apemithrandir commented Mar 19, 2022

I am struggling to get through the re-sync without hitting this error. I'm on my 3rd attempt now. This is the most recent forceful kill log:

  1. Mar 19 XX:23:47 XXXX-ubuntu Fulcrum[3429]: [2022-03-19 XX:23:47.832] Processed height: 442000, 60.7%, 4.09 blocks/sec, 7561.0 txs/sec, 27767.0 addrs/sec
  2. Mar 19 XX:24:25 XXXX-ubuntu Fulcrum[3429]: [2022-03-19 XX:24:25.277] Storage UTXO Cache: Flushing to DB ...
  3. Mar 19 XX:25:23 XXXX-ubuntu kernel: [14761.013748] [ 3429] 1000 3429 3940306 2735552 28819456 63613 0 Fulcrum
  4. Mar 19 XX:25:23 XXXX-ubuntu kernel: [14761.013981] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/fulcrum.service,task=Fulcrum,pid=3429,uid=1000
  5. Mar 19 XX:25:23 XXXX-ubuntu kernel: [14761.014012] Out of memory: Killed process 3429 (Fulcrum) total-vm:15761224kB, anon-rss:10942208kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:28144kB oom_score_adj:0
  6. Mar 19 XX:25:24 XXXX-ubuntu kernel: [14762.247928] oom_reaper: reaped process 3429 (Fulcrum), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  7. Mar 19 XX:25:24 XXXX-ubuntu systemd[1]: Stopped Fulcrum.
  8. Mar 19 XX:25:24 XXXX-ubuntu systemd[1]: Started Fulcrum.
  9. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.198] Loaded SSL certificate: Internet Widgits Pty Ltd expires: Sun February 8 2032 XX:29:08
  10. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.200] Loaded key type: private algorithm: RSA
  11. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.202] Enabled JSON parser: simdjson
  12. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.202] simdjson implementations:
  13. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.202] haswell: Intel/AMD AVX2 [supported]
  14. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.202] westmere: Intel/AMD SSE4.2 [supported]
  15. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.202] fallback: Generic fallback implementation [supported]
  16. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.202] active implementation: haswell
  17. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] jemalloc: version 5.2.1-0-gea6b3e9
  18. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] Qt: version 5.15.2
  19. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] rocksdb: version 6.14.6-ed43161
  20. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] simdjson: version 0.6.0
  21. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] ssl: OpenSSL 1.1.1f 31 Mar 2020
  22. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] zmq: libzmq version: 4.3.3, cppzmq version: 4.7.1
  23. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] Fulcrum 1.6.0 (Release 5e95c0f) - Sat Mar 19, 2022 XX:25:25.205 XXXX - starting up ...
  24. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] Max open files: 8192
  25. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.207] Loading database ...
  26. Mar 19 XX:26:09 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:26:09.186] DB memory: 1024.00 MiB
  27. Mar 19 XX:26:09 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:26:09.187] Coin: BTC
  28. Mar 19 XX:26:09 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:26:09.188] Chain: main
  29. Mar 19 XX:26:09 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:26:09.195] FATAL: Caught exception: It appears that Fulcrum was forcefully killed in the middle of committing a block to the db. We cannot figure out where exactly in the update process Fulcrum was killed, so we cannot undo the inconsistent state caused by the unexpected shutdown. Sorry!
  30. Mar 19 XX:26:09 XXXX-ubuntu Fulcrum[3940]: The database has been corrupted. Please delete the datadir and resynch to bitcoind.

I was using this in fulcrum.conf:

bitcoind_timeout = 300
bitcoind_clients = 1
worker_threads = 1
db_mem=1024
db_max_open_files=200
fast-sync = 1024

Any suggestions?

@apemithrandir

Maybe it is a problem with my VM, but twice now when I rebooted my machine after Fulcrum failed, I have been booted into an initramfs/busybox prompt and had to manually run fsck to recover my filesystem.

@cculianu
Owner

I am really sorry this is happening. So there must be a bug in either the jemalloc allocator Fulcrum uses or, alternatively, in the robin_hood::unordered_map (the internal data structure used to store the UTXOs while synching). I actually suspect robin_hood, since it has had issues in the past. It's worrying that it takes memory and never gives it up... in some situations.

The strange thing is that I have synched recently on BTC just to test things out and I never observed this behavior. So it may be something specific that triggers it.

Yes, the more I think about it.. it could be that robin_hood is to blame. I will investigate this further. Thank you for the info.

@cculianu
Owner

Follow-up: I predict without fast-sync it won't fail. This would be evidence that robin_hood is to blame since it's only used for synching.

@apemithrandir

apemithrandir commented Mar 20, 2022

Ok. I assume commenting out the fast-sync line in my fulcrum.conf is how I run without fast-sync. I did run into my CPU maxing out during the last run, so I've set worker_threads back to 1 from 2.
I'll give it one more go.
Also let me know if you want me to DM you more logs or anything else that you might need to bug hunt.
Edit: I got this when I started it this time:
"<Controller> fast-sync: Not enabled"
So I will see how I get on.

@apemithrandir

After over 2 days of syncing (with fast-sync disabled), at block height 653,000 my CPU locked up again. Since I had worker_threads=1, the CPU locked at < 100%, but the VM was still unresponsive.

@apemithrandir

Sorry to say, I was unable to get Fulcrum up and fully sync'd after a week of trying. My VM kept crashing or freezing during the re-sync, forcing me to do another re-sync.
I'm not willing to re-build my VM from scratch at the moment so I will have to settle for the less performant ElectrumX server for now.

@cculianu
Owner

I'm sorry to hear that, @apemithrandir . I have had little trouble synching it even on old Windows 7 boxes (yes, there is a windows .exe available) with like 4GB of RAM and HDD. It's perplexing that it would fail on what sounds like more generous hardware. I'm just curious -- can you provide more details about your setup? Like host os, guest os, VM software, VM configuration, host machine specs, and relevant parts of config file, etc. Anything helps. I want to see if I can reproduce the issues you experienced.

Sorry to hear you are going :(.

What about running on bare metal outside a VM? Or using Docker?

@apemithrandir

apemithrandir commented Mar 24, 2022

I'm sorry to hear that, @apemithrandir . I have had little trouble synching it even on old Windows 7 boxes (yes, there is a windows .exe available) with like 4GB of RAM and HDD. It's perplexing that it would fail on what sounds like more generous hardware. I'm just curious -- can you provide more details about your setup? Like host os, guest os, VM software, VM configuration, host machine specs, and relevant parts of config file, etc. Anything helps. I want to see if I can reproduce the issues you experienced.

Sorry to hear you are going :(.

What about running on bare metal outside a VM? Or using Docker?

Happy to share any and all details with you one on one over private message/email, if it might help you with development.

@caheredia

Yeah it's a known issue with the way I did the data layout. I will have to redesign the data layout to avoid this in a future version. The recommended way to stop Fulcrum is to send it SIGINT and wait a good 60 seconds. (Usually it's done in 5-10s). See if you can configure systemd to send SIGINT or SIGTERM and have it wait for completion and not kill the process right away. I believe on most systems by default it does wait 30s or more...

You will have to resynch, unfortunately. :/ Sorry about that.

A future version will try to be ACID -- but for now I took speed shortcuts -- so a hard shutdown runs the risk of this issue if it happens while a block has just arrived and the DB is being updated.

I understand that ElectrumX did not suffer from this. It was also slower. :)

I will see if I can do ACID without too much of a perf. hit in a future version. For now you will have to resynch from scratch though. Sorry...

If this makes you worried you can always also backup the synched DB (with Fulcrum stopped). That way you can always restore from backup. FWIW I have been running my server for months now and never had to restore from backup.

Sorry about that.

I just experienced the same thing. My VM rebooted for updates.

[2022-05-21 20:06:24.266] Coin: BTC

[2022-05-21 20:06:24.266] Chain: main

[2022-05-21 20:06:24.267] FATAL: Caught exception: It appears that Fulcrum was forcefully killed in the middle of committing a block to the db. We cannot figure out where exactly in the update process Fulcrum was killed, so we cannot undo the inconsistent state caused by the unexpected shutdown. Sorry!

The database has been corrupted. Please delete the datadir and resynch to bitcoind.

[2022-05-21 20:06:24.268] Stopping Controller ... 

[2022-05-21 20:06:24.268] Closing storage ...

[2022-05-21 20:06:24.341] Shutdown complete

@cculianu
Owner

Yes, I'm sorry. This happens if it's killed while it's busy processing a block. Perhaps you should set up Fulcrum as a systemd service; that way any reboot of the node will send it a SIGTERM (or similar) so it can shut down gracefully.

I'm sorry. You must resynch now...

@caheredia

caheredia commented May 21, 2022

Yes, I'm sorry. This happens if it's killed while it's busy processing a block. Perhaps you should set up Fulcrum as a systemd service; that way any reboot of the node will send it a SIGTERM (or similar) so it can shut down gracefully.

I'm sorry. You must resynch now...

I'm running it inside a docker container, so I'll have to figure out a graceful exit strategy. I appreciate the reply.
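
Something like the following looks like it should do it -- the container name, image placeholder, and 300-second timeout are just illustrative, not values from the Fulcrum docs:

# stop a running container, waiting up to 300s before Docker escalates to SIGKILL
$ docker stop --time 300 fulcrum

# or bake the behaviour in when the container is created
$ docker run --stop-signal SIGINT --stop-timeout 300 ... <fulcrum-image>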

@cculianu
Owner

Yeah, I'm sorry. I will have to redo the data model to be fully ACID and then this can never happen. This is on the to-do list for Fulcrum 2.0. Sorry about that.

@caheredia

Yeah, I'm sorry. I will have to redo the data model to be fully ACID and then this can never happen. This is on the to-do list for Fulcrum 2.0. Sorry about that.

Looking forward to it. Thanks for prioritizing it in the project. I'd offer to help, but I mostly code in python.

@jonscoresby

Just wanted to say that I downloaded and synced Fulcrum about a month ago and had syncing problems. I tried syncing Fulcrum on Bitcoin 3 times and it would crash somewhere after block 400,000 after running out of memory. I tried disabling autosuspend as suggested here, but the sync still failed. After disabling fast-sync, however, the sync was successful.

@cculianu
Owner

Thanks for the feedback @jonscoresby ... Perhaps I need to go back and see if I can make --fast-sync more resilient to such conditions. Just curious: were you using swap at all or did you have swap disabled?

@RequestPrivacy

Just wanted to swing by and report that I seem to have the same problem: I tried to index on a Raspberry Pi with 4GB and it flooded my RAM to the point that it exited with the above error message (first try with fast-sync = 1024, second try with fast-sync = 512; when I noticed it filled my RAM again I set it to 200MB). But it crashed once more.

So I disabled fast-sync and now it's humming away slowly but steadily at 2.7GB (the baseline was something like 1.3GB of RAM).

@cculianu
Owner

Yeah, I really need to set aside some time and have the fast-sync option auto-detect this situation on OSes such as Linux that overcommit, and prune it down if that happens. I definitely will work on this soon!

@jonscoresby

Thanks for the feedback @jonscoresby ... Perhaps I need to go back and see if I can make --fast-sync more resilient to such conditions. Just curious: were you using swap at all or did you have swap disabled?

Sorry I didn't see this. I do not have swap enabled.

@cculianu
Owner

cculianu commented Aug 20, 2022

Ah I see. Thanks for getting back to me.

Yeah I have a hypothesis that this is more likely to happen in the "no swapfile" situation. I am not sure why it became fashionable to ship Linux installs these days with no swap. I remember a time when every Linux install had a default swapfile setup. At some point that changed. Anyway -- I think that in the no-swapfile case, memory usage can get out-of-hand temporarily with --fast-sync and rocksdb both gobbling up RAM. And, of course, if there's no swap.. when you are out of RAM .. something must die. And that thing is Fulcrum.

I can't fully control memory usage (because rocksdb lib does its own thing and sometimes overallocates memory temporarily even when you tell it not to). I can, however, mitigate this by detecting the situation and controlling the --fast-sync memory usage .. if it looks like we are reaching the system limit, I can just prune the cache temporarily to be smaller than what the user specified.. or something like that.
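
For anyone on a no-swap install who wants a stopgap in the meantime, adding a swapfile is straightforward -- the 8G size and /swapfile path below are arbitrary examples:

# create and enable a swapfile (example size and path)
$ sudo fallocate -l 8G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
# make it persistent across reboots
$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab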

@RequestPrivacy

Also no swapfile on my linux.

Let me know if I should test something once you might have figured out a solution.

@chrisguida

Please, please, please fix this. We are trying to package Fulcrum for embassyOS and this makes the otherwise amazing experience very painful. It can take several days to build the index on a low-resource device in docker, and to be told that you have to do it all over again is enough to make the user want to simply delete it and switch back to electrs.

@cculianu
Owner

cculianu commented Jan 18, 2023

I will fix it in a future release; that's the plan.

Please don't use --fast-sync; it eats memory and is experimental. It's not really suited for systems with low memory and no swap. It shouldn't ever crash on initial synch as often as it does -- and I noticed everybody is using that option -- which probably is leading to OOM? I should have named it differently...

@craigraw

I will fix it in a future release; that's the plan.

That's great to hear. I've also noticed that --fast-sync is often configured with values that are far too high for the system. Perhaps Fulcrum should warn if it's set to say > 20% of system memory?
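
Roughly something like this -- a hypothetical sketch of the suggested check, not Fulcrum code; the configured value and the 20% threshold are placeholders:

// Hypothetical startup warning: compare the configured fast-sync budget to physical RAM.
#include <unistd.h>
#include <cstdint>
#include <iostream>

int main() {
    const std::uint64_t fastSyncMB = 8192;  // placeholder for the value read from fulcrum.conf
    const std::uint64_t totalMB = std::uint64_t(sysconf(_SC_PHYS_PAGES))
                                  * std::uint64_t(sysconf(_SC_PAGESIZE)) / (1024 * 1024);
    if (totalMB > 0 && fastSyncMB > totalMB / 5)  // the ">20% of system memory" rule of thumb
        std::cerr << "Warning: fast-sync (" << fastSyncMB << " MiB) exceeds 20% of system RAM ("
                  << totalMB << " MiB); consider lowering or disabling it.\n";
}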

That said, I do see this issue mentioned more frequently not for the initial sync, but for accidental power loss or other ungraceful shutdown conditions.

@chrisguida

I will fix it in a future release; that's the plan.

Excellent, great to hear!

Please don't use --fast-sync; it eats memory and is experimental.

This problem does not only present during initial sync. We have already experienced corrupted databases on a couple of devices that were already synced.

@MattDHill

Any update on this issue and #155? Start9 is still very excited to get Fulcrum onto StartOS, but not as long as ungraceful shutdowns necessitate resyncs.

Is there any update on that issue as well as the issues related to "fast-sync" discussed above?

@greenm01

greenm01 commented Sep 9, 2023

I lost power this morning and my Fulcrum database is now corrupted. It took several days to sync on my SSD. For the time being I will switch back to electrs until this issue is resolved.

@fabiolameira

Hello 👋

I ran into this problem when trying to synchronize my Fulcrum Server. The process was consuming too much RAM until it was killed by the OOM Killer (Out of Memory killer), causing the program to be closed forcefully, and corrupting my fulcrum_db.

I tried with different settings in fulcrum.conf:

fast-sync = 8192 | 4096 | 2048 | 1024 | 512
db_max_open_files = 400 | 200 | 100 | 50 | 40

And it always ended up failing and corrupting the db.

For context, this is my setup:
OS: Ubuntu Server 22.04.3 LTS
Processor: i5-6500
RAM: 16GB
Disk: 2TB SSD

I compiled Fulcrum myself following the instructions detailed in the project's README.md and I didn't understand why this was happening, as it's not the first time I've synchronized a Fulcrum Server and it's never happened to me before.

Since on other occasions I had used pre-compiled images from the project and this had never happened, I thought it must be related to the way I compiled the project.

It was then that I noticed this:

$ Fulcrum -v
Fulcrum 1.9.8 (Release d4b3fa1)
Protocol: version min: 1.4, version max: 1.5.2
compiled: gcc 11.4.0
jemalloc: unavailable
Qt: version 5.15.3
rocksdb: version 6.14.6-ed43161
simdjson: version 0.6.0
ssl: OpenSSL 3.0.2 Mar 15, 2021
zmq: libzmq version: 4.3.4, cppzmq version: 4.7.1

jemalloc is unavailable when I run the $ Fulcrum -v command.
Since there was no jemalloc installed on the system, the project was using the system memory allocator and not jemalloc. I immediately thought that the problem might be related to this, as the system allocator might not be able to manage RAM usage as expected.

To solve the problem, I installed jemalloc with the following command:

$ sudo apt update
$ sudo apt install libjemalloc-dev

I verified the installation by running:

$ pkg-config --modversion jemalloc

Then I verified that the linker flag for jemalloc exists by running:

$ pkg-config --cflags --libs jemalloc

The output should include -ljemalloc.

Then I recompiled the project. To do this, I ran the following commands:

# This will generate the Makefile linking our jemalloc
$ qmake LIBS+=-ljemalloc

This should print something like this:

Project MESSAGE: CLI overrides: LIBS=-ljemalloc
Project MESSAGE: ZMQ version: 4.3.4
Project MESSAGE: rocksdb: using static lib
Project MESSAGE: jemalloc: using CLI override
Project MESSAGE: Including embedded secp256k1
Project MESSAGE: Installation dir prefix is /usr/local

Then I ran the following command to build using the Makefile:

# Build using the number of cores available on your machine
$ make -j $(nproc)

Then just run:

# This will install Fulcrum into /usr/local/bin
$ make install

Finally, to check if jemalloc is being used by Fulcrum, run this command again:

$ Fulcrum -v

And you should see something like:

Fulcrum 1.9.8 (Release d4b3fa1)
Protocol: version min: 1.4, version max: 1.5.2
compiled: gcc 11.4.0
jemalloc: version 5.2.1-0-gea6b3e9
Qt: version 5.15.3
rocksdb: version 6.14.6-ed43161
simdjson: version 0.6.0
ssl: OpenSSL 3.0.2 Mar 15, 2021
zmq: libzmq version: 4.3.4, cppzmq version: 4.7.1

Since my Fulcrum installation started using jemalloc as its memory allocator, I have never had any problems with the OOM killer again, neither during synchronization nor during normal use after it was synchronized.

Hope this helps 🙏
