
wallet/db_sqlite3.c: Support direct replication of SQLITE3 backends. #4890

Merged: 1 commit merged into ElementsProject:master on Nov 17, 2021

Conversation

ZmnSCPxj
Collaborator

ChangeLog-Added: With the sqlite3:// scheme for the --wallet option, you can now specify a second file path for real-time database backup by separating it from the main file path with a : character.
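For example, `--wallet=sqlite3:///dba:/dbb` (the paths used in the discussion below) keeps the main database at `/dba` and a real-time replica at `/dbb`.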

Created as a draft since it is not yet sufficiently tested; lemme check a bit. Milestoned for the next version.

@ZmnSCPxj ZmnSCPxj added this to the next milestone Oct 27, 2021
@laanwj
Contributor

laanwj commented Oct 27, 2021

Having this built in would be really nice.

@cdecker
Member

cdecker commented Oct 27, 2021

This brings up an issue that we had with the backup plugin itself: how to enforce that lightningd is always started with the backup once it has been initialized. The problem is that if you start with --wallet=sqlite3:///dba and later switch to --wallet=sqlite3:///dba:/dbb, then the two will be out of sync; in the best case it will crash, but it can also result in hidden corruption, with some entries being misnumbered (autoincrement fields) or missing altogether.

The backup plugin verifies that the data_version matches and aborts if that's not the case. Is there something similar in this PR, and how could we address this better?

@cdecker cdecker self-requested a review October 27, 2021 14:36
@cdecker cdecker self-assigned this Oct 27, 2021
@ZmnSCPxj
Collaborator Author

The Backup API of SQLITE3 will always overwrite the backup copy --- in the case of this PR, we always start a full backup at db_sqlite3_setup, completely overwriting the backup copy: https://github.com/ZmnSCPxj/lightning/blob/b4198cc538c371a0cb13867f2b55e013cd8dca46/wallet/db_sqlite3.c#L102-L107

The upshot is that in your example, /dbb will always be forced into sync with /dba, i.e. it will get overwritten. See also https://sqlite.org/backup.html, and I quote:

The online backup API allows the contents of one database to be copied into another database, overwriting the original contents of the target database.

The API allows incremental backups, but the implementation here does not use them (incremental backups would be dangerous in our application, since some updates may not be picked up in an increment). Incremental backups are intended for cases where multiple updaters may exist on the source db, but that is not our use-case --- only the lightningd process is the updater, so we always ask for a full backup each time. However, the API also has an optimization: if the same source connection is used to update, the update is automatically applied to both the source and the destination backup and the backup process is marked as completed, which reduces overhead as well (or at least should do so; I have not actually read the SQLITE3 source code, just the docs).
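For reference, here is a minimal sketch of what such a full copy via the Online Backup API looks like in C (illustrative code only, not the PR's actual implementation; the function name is mine):

```c
#include <sqlite3.h>
#include <stdio.h>

/* Copy the whole "main" database of `src` over the "main" database of
 * `dst`, completely overwriting whatever `dst` contained before. */
static int full_backup(sqlite3 *dst, sqlite3 *src)
{
	sqlite3_backup *b = sqlite3_backup_init(dst, "main", src, "main");
	if (!b) {
		fprintf(stderr, "backup init: %s\n", sqlite3_errmsg(dst));
		return SQLITE_ERROR;
	}
	/* Passing -1 copies all remaining pages in one step, i.e. a full,
	 * non-incremental backup. */
	sqlite3_backup_step(b, -1);
	/* finish() releases the backup object and reports any error from step(). */
	return sqlite3_backup_finish(b);
}
```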

So in terms of correctness the code is correct; the issue is whether it is efficient enough not to excessively delay operation (it will probably delay startup, due to the initial copy).

@m-schmoock
Collaborator

I really like the idea!

@jb55
Collaborator

jb55 commented Oct 28, 2021 via email

@cdecker
Member

cdecker commented Oct 29, 2021

I was totally unaware that sqlite3 had a backup facility built in, hence my ignorance about its inner workings. I'll test this ASAP, and if it works, this is an excellent way of creating same-machine backups.

@ZmnSCPxj
Collaborator Author

The current code seems not to work (^^;;;). A backup file is generated, but it seems to be a snapshot of the file at the start, and is not updated regularly. Seems I misunderstood the SQLITE3 Backup API docs.

On the other hand this seems a fairly popular pull request. Maybe at the start we can check if both databases have a data_version and both have the same data_version, and if not, replicate using the Backup API. Then in normal operation we just do the same writes to both the main and backup databases.

@ZmnSCPxj
Collaborator Author

The current code does the replication "manually", i.e. it just forwards writes to both databases. We only use the Backup API at the start to force-sync the two databases. As an optimization, we can check data_version and skip the initial force-sync if it is equal (not yet implemented here). Running this now on a throwaway moneyless node; let me stew it for a few days and see.
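To make the approach concrete, a rough sketch of this "manual" replication (hypothetical struct and helper names, not the actual db_sqlite3.c code): every write statement is simply executed against both handles, and the full copy from the earlier Backup API sketch is only needed at setup to force-sync.

```c
#include <sqlite3.h>
#include <stddef.h>

/* Hypothetical wrapper: the main db handle plus an optional backup handle. */
struct replicated_db {
	sqlite3 *main;
	sqlite3 *backup;	/* NULL if no backup path was given */
};

/* Forward a statement to both databases; replication is done "manually".
 * If the main db fails, the backup is deliberately left untouched. */
static int rep_exec(struct replicated_db *rdb, const char *sql)
{
	int err = sqlite3_exec(rdb->main, sql, NULL, NULL, NULL);
	if (err != SQLITE_OK)
		return err;
	if (rdb->backup)
		err = sqlite3_exec(rdb->backup, sql, NULL, NULL, NULL);
	return err;
}
```

At setup, the force-sync would amount to calling something like full_backup(rdb->backup, rdb->main) from the earlier sketch, and the data_version check is what would let us skip that copy.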

@m-schmoock
Collaborator

@ZmnSCPxj
Is it exception-safe in scenarios where, e.g., writing to the first (main) database produces an IO error or something similarly bad? Is the write to the second (backup) database still done?

@ZmnSCPxj
Collaborator Author

An error that causes a database constraint violation on the main db should also cause a database constraint violation on the backup db (and would be a bug anyway). Transactions are committed on main before committing on backup, and if a disk-level error occurs at the main then backup is not updated.

If a disk-level error occurs on the backup but not on the main, then the main is updated, but we also abort after the failure on the backup. That is equivalent to an uncontrolled shutdown after updating the database but before sending anything to the counterparty, a scenario the protocol is prepared for. Either the non-updated backup (if, e.g., the disk-level failure was due to an NFS mount disappearing because some clumsy flesh foot tripped on the network cable, but the remote disk is still alive) or the updated main would be safe to resume with, again because the protocol is prepared for either scenario --- as long as a message has not yet been sent to the counterparty (and we only send that if both the main and backup db were updated OK), we are safe with either the pre- or post-transaction state.
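A sketch of the ordering described above, reusing the hypothetical struct replicated_db from the earlier sketch (illustrative only, not the PR's exact code):

```c
#include <sqlite3.h>
#include <stdlib.h>

/* Commit ordering sketch: main first, then backup; stop hard on failure.
 * `struct replicated_db` is the hypothetical wrapper from the sketch above. */
static void rep_commit(struct replicated_db *rdb)
{
	/* If this fails (disk-level error on main), we die here and the
	 * backup is never updated. */
	if (sqlite3_exec(rdb->main, "COMMIT;", NULL, NULL, NULL) != SQLITE_OK)
		abort();
	/* If the backup fails after main succeeded, we also abort, i.e.
	 * before anything is sent to the counterparty; the protocol can
	 * recover from either the pre- or post-transaction state. */
	if (rdb->backup &&
	    sqlite3_exec(rdb->backup, "COMMIT;", NULL, NULL, NULL) != SQLITE_OK)
		abort();
}
```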

Perhaps I should note, though, that sqlite3 will by nature keep either a journal or a WAL file alongside both the main and backup db, and when recovering you should include any journal or WAL too if present. Otherwise you could have a backup that is in a halfway-through-transaction state, which we are not safe for.

Of course, if your disk bitrots your data under you, then the main and backup can indeed go out-of-sync. But we do not have any bitrot protection in lightningd, even with a single database, so this is not a worsening at least. Use ZFS, or if you cannot, at least use BTRFS. And use ECC memory as well, because buffers are held in memory, after all.

@ZmnSCPxj ZmnSCPxj force-pushed the sqlite3-backup branch 2 times, most recently from 05abab9 to 3dcc776 Compare October 30, 2021 23:36
@cdecker
Member

cdecker commented Oct 31, 2021

Transactions are committed on main before committing on backup, and if a disk-level error occurs at the main then backup is not updated.

Is that not the wrong way around? Committing to the backup only after the main DB would allow commits to get lost if there is a crash in between. Admittedly, the risk of a crash hitting between commits, followed by a corruption of the main DB, is extremely unlikely, but I think the backup should get the commit first. But I also see your point about the io_loop not proceeding and acting like a sync barrier for the DB 🤔

@ZmnSCPxj
Collaborator Author

ZmnSCPxj commented Nov 8, 2021

@cdecker the LN safety model allows us to recover from either the pre-db-transaction or the post-db-transaction state.

Consider that your imagined case can be reversed:

  • Suppose we write to backup first before main.
  • The following happens:
    • We write to backup.
    • Backup media suffers existential failure.
    • Linux OOM killer kills lightningd before we write to main.
    • SysAd stops operation, acquires new backup media, and copies the current main DB to the new backup media.

Thus, whether we write backup first or main first, the same problem can occur regardless. Different media can write at different speeds, wire delays exist, etc.; in a relativistic world there is no "simultaneous", so a race condition will always exist.

As long as we get entire db-transactions, we should be fine; that is what channel_reestablish is for --- after all, a crash can happen before our local OS network stack has transferred the message to the NIC, so we must handle both the case where we crashed before the db-transaction commits and the case where we crashed after it commits. The only requirement is db atomicity, not synchronicity; we only need to ensure that all copies are up-to-date before sending any revocation keys.

Basically: whether we write to the backup first or the main first is arbitrary, because a race condition will always exist in either case. Fortunately, we make sure not to proceed unless we confirm both have synced. That is the only important case: either everything has synced and we can send revocation keys, or some copies are not up-to-date but we have not sent revocation keys (and channel_reestablish will handle either the up-to-date or the not-up-to-date state; the only important thing is that the counterparty never got the revocation keys).

@ZmnSCPxj ZmnSCPxj marked this pull request as ready for review November 8, 2021 05:06
@ZmnSCPxj
Collaborator Author

ZmnSCPxj commented Nov 8, 2021

Checked, and at least on my throwaway node it looks like we can recover the data safely. The backup is not a bit-for-bit identical copy (maybe I should add that to BACKUP.md to assuage users), but quick checking with manual queries suggests they have the same data regardless. Also checked that copying is not done if the backup DB has the same data_version as the main DB. Copying takes about 4 seconds for my throwaway node, which is tiny compared to the >40 seconds gossipd spends copying gossip_store, but my real node has an about 4.5x larger DB file, so it is still probably a good idea to avoid copying the entire db if the backup has the same data_version anyway.

For more extreme optimization, we should really launch a new thread or process and open the backup there, and just use a pipe or socket to send queries (including begin-transaction and commit) to the backup process/thread. We only need a response from the backup process/thread at commit time, and that is when it can send any error messages that happened in the backup. That ensures that slow backup media writes are at least done in parallel with main media writes. But that is an optimization; correctness comes first, and the feature is usable as-is.
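A very rough sketch of what such a backup worker could look like (entirely hypothetical names; it also assumes one newline-terminated statement per pipe write, which a real implementation would replace with proper length-prefixed framing):

```c
#include <sqlite3.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical worker state: the backup handle lives in its own thread;
 * the main thread streams SQL text down sql_fd and only waits for one
 * status byte on ack_fd after sending COMMIT. */
struct backup_worker {
	sqlite3 *backup;
	int sql_fd;	/* read end: newline-terminated SQL statements */
	int ack_fd;	/* write end: one status byte per COMMIT */
};

static void *backup_thread(void *arg)
{
	struct backup_worker *w = arg;
	char buf[4096];
	FILE *in = fdopen(w->sql_fd, "r");

	if (!in)
		return NULL;
	while (fgets(buf, sizeof(buf), in)) {
		int err = sqlite3_exec(w->backup, buf, NULL, NULL, NULL);
		/* Only COMMIT needs a reply; that is where any backup-side
		 * error gets reported back to the main thread. */
		if (strncmp(buf, "COMMIT", 6) == 0) {
			char status = (err == SQLITE_OK) ? 0 : 1;
			(void)write(w->ack_fd, &status, 1);
		}
	}
	fclose(in);
	return NULL;
}
```

The main thread would spawn this worker (e.g. with pthread_create), write each statement into sql_fd as it executes it locally, and block reading the single status byte only when it sends COMMIT, so slow backup media stalls only the commit rather than every write.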

@ZmnSCPxj
Collaborator Author

ZmnSCPxj commented Nov 8, 2021

Added note that the backup and main are not byte-for-byte identical, rebased on master.

doc/BACKUP.md (outdated review comment, resolved)
Collaborator

@vincenzopalazzo vincenzopalazzo left a comment


ack ef7a712

With some small comments on the docs.

doc/BACKUP.md (outdated review comment, resolved)
wallet/db_sqlite3.c (review comment, resolved)
@ZmnSCPxj
Collaborator Author

Updated docs as per review.

Collaborator

@vincenzopalazzo vincenzopalazzo left a comment


ack 3211983

@rustyrussell
Contributor

Nice work!

@rustyrussell rustyrussell merged commit a294683 into ElementsProject:master Nov 17, 2021
@ZmnSCPxj ZmnSCPxj deleted the sqlite3-backup branch November 17, 2021 03:02
@Sjors
Contributor

Sjors commented Jun 3, 2022

First of all this is great.

Second: it's a terrifying first-time experience. My lightningd.sqlite3 has grown to 200 MB, and although it very quickly creates the second file at identical size, it then takes ten minutes or so to finish… whatever it's doing. During that time there is no info, other than:

2022-06-03T12:45:05.450Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_channeld
2022-06-03T12:45:05.451Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_closingd
2022-06-03T12:45:05.452Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_connectd
2022-06-03T12:45:05.453Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_gossipd
2022-06-03T12:45:05.453Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_hsmd
2022-06-03T12:45:05.454Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_onchaind
2022-06-03T12:45:05.455Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_openingd
2022-06-03T12:45:05.456Z DEBUG   hsmd: pid 966079, msgfd 41

Some sort of log status about the migration would help calm the nerves.

On a related note, what is actually stored in lightningd.sqlite3? Is it the bare essentials needed to safely use channels, or does it contain other stuff like gossip? Since the backup file is probably on a slower, perhaps even remote, disk, I would think that the less stuff in there, the better. Non-radioactive, yet important, things like invoice metadata are probably something I could back up separately.

@jb55
Collaborator

jb55 commented Jun 7, 2022

Wow, thanks for reminding me about this PR @Sjors. I just tried this as well and it seems to work. I have a bit more peace of mind now.

@blckbx blckbx mentioned this pull request Jul 12, 2022