
wallet/db_sqlite3.c: Support direct replication of SQLITE3 backends. #4890

Merged: 1 commit merged into ElementsProject:master on Nov 17, 2021

Conversation

ZmnSCPxj
Collaborator

ChangeLog-Added: With the sqlite3:// scheme for the --wallet option, you can now specify a second file path for real-time database backup by separating it from the main file path with a : character.
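For example, `--wallet=sqlite3:///dba:/dbb` (the paths used in the discussion below) keeps the main database at `/dba` and a real-time replica at `/dbb`.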

Created as a draft since it is not yet sufficiently tested; lemme check a bit. Milestoned for the next version.

@ZmnSCPxj ZmnSCPxj added this to the next milestone Oct 27, 2021
@laanwj
Contributor

laanwj commented Oct 27, 2021

Having this built in would be really nice.

@cdecker
Member

cdecker commented Oct 27, 2021

This brings up an issue that we had with the backup plugin itself: how to enforce that lightningd is always started with the backup once it has been initialized. The problem is that if you start with --wallet=sqlite3:///dba and later switch to --wallet=sqlite3:///dba:/dbb, then the two will be out of sync; in the best case it will crash, but it can also result in hidden corruption, with some entries being misnumbered (autoincrement fields) or missing altogether.

The backup plugin verifies that the data_version matches and aborts if that's not the case. Is there something similar in this PR, and how could we address this better?

@cdecker cdecker self-requested a review October 27, 2021 14:36
@cdecker cdecker self-assigned this Oct 27, 2021
@ZmnSCPxj
Collaborator Author

The Backup API of SQLITE3 will always overwrite the backup copy --- in the case of this PR, we always start a full backup at db_sqlite3_setup, completely overwriting the backup copy: https://github.com/ZmnSCPxj/lightning/blob/b4198cc538c371a0cb13867f2b55e013cd8dca46/wallet/db_sqlite3.c#L102-L107

The upshot is that in your example, /dbb will always be forced into sync with /dba, i.e. it will get overwritten. See also https://sqlite.org/backup.html, and I quote:

The online backup API allows the contents of one database to be copied into another database, overwriting the original contents of the target database.

The API allows incremental backups, but the implementation here does not use them (incremental backups would be dangerous in our application, since some updates may not be picked up in an increment). Incremental backups are intended for cases where multiple updaters may exist on the source db, but that is not our use-case --- only the lightningd process is the updater, so we always ask for a full backup each time. However, the API also has an optimization: if the same source connection is used to update, the update is automatically applied to both the source and the destination backup and the backup process is marked as completed, which reduces overhead as well (or at least should do so; I have not actually read the SQLITE3 source code, just the docs).
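For reference, here is a minimal sketch of what such a full copy via the Online Backup API looks like in C (illustrative code only, not the PR's actual implementation; the function name is mine):

```c
#include <sqlite3.h>
#include <stdio.h>

/* Copy the whole "main" database of `src` over the "main" database of
 * `dst`, completely overwriting whatever `dst` contained before. */
static int full_backup(sqlite3 *dst, sqlite3 *src)
{
	sqlite3_backup *b = sqlite3_backup_init(dst, "main", src, "main");
	if (!b) {
		fprintf(stderr, "backup init: %s\n", sqlite3_errmsg(dst));
		return SQLITE_ERROR;
	}
	/* Passing -1 copies all remaining pages in one step, i.e. a full,
	 * non-incremental backup. */
	sqlite3_backup_step(b, -1);
	/* finish() releases the backup object and reports any error from step(). */
	return sqlite3_backup_finish(b);
}
```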

So in terms of correctness the code is correct; the issue is whether it is efficient enough not to excessively delay operation (it will probably delay startup, due to the initial copy).

@m-schmoock
Collaborator

I really like the idea!

@jb55
Collaborator

jb55 commented Oct 28, 2021 via email

@cdecker
Member

cdecker commented Oct 29, 2021

I was totally unaware that sqlite3 had a backup facility built in, hence my ignorance about its inner workings. I'll test this ASAP, and if it works, this is an excellent way of creating same-machine backups.

@ZmnSCPxj
Collaborator Author

The current code seems not to work (^^;;;). A backup file is generated, but it seems to be a snapshot of the file at the start, and is not updated regularly. Seems I misunderstood the SQLITE3 Backup API docs.

On the other hand this seems a fairly popular pull request. Maybe at the start we can check if both databases have a data_version and both have the same data_version, and if not, replicate using the Backup API. Then in normal operation we just do the same writes to both the main and backup databases.

@ZmnSCPxj
Collaborator Author

The current code does the replication "manually", i.e. it just forwards writes to both databases. We only use the Backup API at the start to force-sync the two databases. As an optimization, we can check data_version and skip the initial force-sync if it is equal (not yet implemented here). Running this now on a throwaway moneyless node; let me stew it for a few days and see.
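To make the approach concrete, a rough sketch of this "manual" replication (hypothetical struct and helper names, not the actual db_sqlite3.c code): every write statement is simply executed against both handles, and the full copy from the earlier Backup API sketch is only needed at setup to force-sync.

```c
#include <sqlite3.h>
#include <stddef.h>

/* Hypothetical wrapper: the main db handle plus an optional backup handle. */
struct replicated_db {
	sqlite3 *main;
	sqlite3 *backup;	/* NULL if no backup path was given */
};

/* Forward a statement to both databases; replication is done "manually".
 * If the main db fails, the backup is deliberately left untouched. */
static int rep_exec(struct replicated_db *rdb, const char *sql)
{
	int err = sqlite3_exec(rdb->main, sql, NULL, NULL, NULL);
	if (err != SQLITE_OK)
		return err;
	if (rdb->backup)
		err = sqlite3_exec(rdb->backup, sql, NULL, NULL, NULL);
	return err;
}
```

At setup, the force-sync would amount to calling something like full_backup(rdb->backup, rdb->main) from the earlier sketch, and the data_version check is what would let us skip that copy.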

@m-schmoock
Collaborator

@ZmnSCPxj
Is it exception-safe in scenarios where, e.g., writing to the first (main) database produces an IO error or something similarly bad? Is the write to the second (backup) database still done?

@ZmnSCPxj
Collaborator Author

An error that causes a database constraint violation on the main db should also cause a database constraint violation on the backup db (and would be a bug anyway). Transactions are committed on main before committing on backup, and if a disk-level error occurs at the main then backup is not updated.

If a disk-level error occurs on the backup but not on the main, then the main is updated, but we also abort after the failure on the backup. That is equivalent to an uncontrolled shutdown after updating the database but before sending anything to the counterparty, a scenario the protocol is prepared for. Either the non-updated backup (if, e.g., the disk-level failure was due to an NFS mount disappearing because some clumsy flesh foot tripped on the network cable, but the remote disk is still alive) or the updated main would be safe to resume with, again because the protocol is prepared for either scenario --- as long as a message has not yet been sent to the counterparty (and we only send that if both the main and backup db were updated OK), we are safe with either the pre- or post-transaction state.
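A sketch of the ordering described above, reusing the hypothetical struct replicated_db from the earlier sketch (illustrative only, not the PR's exact code):

```c
#include <sqlite3.h>
#include <stdlib.h>

/* Commit ordering sketch: main first, then backup; stop hard on failure.
 * `struct replicated_db` is the hypothetical wrapper from the sketch above. */
static void rep_commit(struct replicated_db *rdb)
{
	/* If this fails (disk-level error on main), we die here and the
	 * backup is never updated. */
	if (sqlite3_exec(rdb->main, "COMMIT;", NULL, NULL, NULL) != SQLITE_OK)
		abort();
	/* If the backup fails after main succeeded, we also abort, i.e.
	 * before anything is sent to the counterparty; the protocol can
	 * recover from either the pre- or post-transaction state. */
	if (rdb->backup &&
	    sqlite3_exec(rdb->backup, "COMMIT;", NULL, NULL, NULL) != SQLITE_OK)
		abort();
}
```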

Perhaps I should note, though, that sqlite3 will by nature keep either a journal or a WAL file alongside both the main and backup db, and when recovering you should include any journal or WAL too if present. Otherwise you could have a backup that is in a halfway-through-transaction state, which we are not safe for.

Of course, if your disk bitrots your data under you, then the main and backup can indeed go out-of-sync. But we do not have any bitrot protection in lightningd, even with a single database, so this is not a worsening at least. Use ZFS, or if you cannot, at least use BTRFS. And use ECC memory as well, because buffers are held in memory, after all.

@ZmnSCPxj ZmnSCPxj force-pushed the sqlite3-backup branch 2 times, most recently from 05abab9 to 3dcc776 Compare October 30, 2021 23:36
@cdecker
Member

cdecker commented Oct 31, 2021

Transactions are committed on main before committing on backup, and if a disk-level error occurs at the main then backup is not updated.

Is that not the wrong way around? Committing to the backup only after the main DB would allow commits to get lost if there is a crash in between. Admittedly, the risk of a crash hitting between commits, followed by a corruption of the main DB, is extremely unlikely, but I think the backup should get the commit first. But I also see your point about the io_loop not proceeding and acting like a sync barrier for the DB 🤔

@ZmnSCPxj
Collaborator Author

ZmnSCPxj commented Nov 8, 2021

@cdecker the LN safety model allows us to recover from either the pre-db-transaction or the post-db-transaction state.

Consider that your imagined case can be reversed:

  • Suppose we write to backup first before main.
  • The following happens:
    • We write to backup.
    • Backup media suffers existential failure.
    • Linux OOM killer kills lightningd before we write to main.
    • SysAd stops operation, acquires new backup media, and copies the current main DB to the new backup media.

Thus, whether we write backup first or main first, the same problem can occur regardless. Different media can write at different speeds, wire delays exist, etc.; in a relativistic world there is no "simultaneous", so a race condition will always exist.

As long as we get entire db-transactions, we should be fine; that is what channel_reestablish is for --- after all, a crash can happen before our local OS network stack has transferred the message to the NIC, so we must handle both the case where we crashed before the db-transaction commits and the case where we crashed after it commits. The only requirement is db atomicity, not synchronicity; we only need to ensure that all copies are up-to-date before sending any revocation keys.

Basically: whether we write to the backup first or the main first is arbitrary, because a race condition will always exist in either case. Fortunately, we make sure not to proceed unless we confirm both have synced. That is the only important case: either everything has synced and we can send revocation keys, or some copies are not up-to-date but we have not sent revocation keys (and channel_reestablish will handle either the up-to-date or the not-up-to-date state; the only important thing is that the counterparty never got the revocation keys).

@ZmnSCPxj ZmnSCPxj marked this pull request as ready for review November 8, 2021 05:06
@ZmnSCPxj
Collaborator Author

ZmnSCPxj commented Nov 8, 2021

Checked, and at least on my throwaway node it looks like we can recover the data safely. The backup is not a bit-for-bit identical copy (maybe I should add that to BACKUP.md to assuage users), but quick checking with manual queries suggests they have the same data regardless. Also checked that copying is not done if the backup DB has the same data_version as the main DB. Copying takes about 4 seconds for my throwaway node, which is tiny compared to the >40 seconds gossipd spends copying gossip_store, but my real node has an about 4.5x larger DB file, so it is still probably a good idea to avoid copying the entire db if the backup has the same data_version anyway.

For more extreme optimization, we should really launch a new thread or process and open the backup there, and just use a pipe or socket to send queries (including begin-transaction and commit) to the backup process/thread. We only need a response from the backup process/thread at commit time, and that is when it can send any error messages that happened in the backup. That ensures that slow backup media writes are at least done in parallel with main media writes. But that is an optimization; correctness comes first, and the feature is usable as-is.
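A very rough sketch of what such a backup worker could look like (entirely hypothetical names; it also assumes one newline-terminated statement per pipe write, which a real implementation would replace with proper length-prefixed framing):

```c
#include <sqlite3.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical worker state: the backup handle lives in its own thread;
 * the main thread streams SQL text down sql_fd and only waits for one
 * status byte on ack_fd after sending COMMIT. */
struct backup_worker {
	sqlite3 *backup;
	int sql_fd;	/* read end: newline-terminated SQL statements */
	int ack_fd;	/* write end: one status byte per COMMIT */
};

static void *backup_thread(void *arg)
{
	struct backup_worker *w = arg;
	char buf[4096];
	FILE *in = fdopen(w->sql_fd, "r");

	if (!in)
		return NULL;
	while (fgets(buf, sizeof(buf), in)) {
		int err = sqlite3_exec(w->backup, buf, NULL, NULL, NULL);
		/* Only COMMIT needs a reply; that is where any backup-side
		 * error gets reported back to the main thread. */
		if (strncmp(buf, "COMMIT", 6) == 0) {
			char status = (err == SQLITE_OK) ? 0 : 1;
			(void)write(w->ack_fd, &status, 1);
		}
	}
	fclose(in);
	return NULL;
}
```

The main thread would spawn this worker (e.g. with pthread_create), write each statement into sql_fd as it executes it locally, and block reading the single status byte only when it sends COMMIT, so slow backup media stalls only the commit rather than every write.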

@ZmnSCPxj
Collaborator Author

ZmnSCPxj commented Nov 8, 2021

Added note that the backup and main are not byte-for-byte identical, rebased on master.

doc/BACKUP.md (outdated review comment, resolved)
Collaborator

@vincenzopalazzo vincenzopalazzo left a comment


ack ef7a712

With some small comments on the docs.

doc/BACKUP.md (outdated review comment, resolved)
wallet/db_sqlite3.c (review comment, resolved)
@ZmnSCPxj
Collaborator Author

Updated docs as per review.

Collaborator

@vincenzopalazzo vincenzopalazzo left a comment


ack 3211983

@rustyrussell
Contributor

Nice work!

@rustyrussell rustyrussell merged commit a294683 into ElementsProject:master Nov 17, 2021
@ZmnSCPxj ZmnSCPxj deleted the sqlite3-backup branch November 17, 2021 03:02
@Sjors
Contributor

Sjors commented Jun 3, 2022

First of all this is great.

Second: it's a terrifying first-time experience. My lightningd.sqlite3 has grown to 200 MB, and although it very quickly creates the second file at identical size, it then takes ten minutes or so to finish… whatever it's doing. During that time there is no info, other than:

2022-06-03T12:45:05.450Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_channeld
2022-06-03T12:45:05.451Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_closingd
2022-06-03T12:45:05.452Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_connectd
2022-06-03T12:45:05.453Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_gossipd
2022-06-03T12:45:05.453Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_hsmd
2022-06-03T12:45:05.454Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_onchaind
2022-06-03T12:45:05.455Z DEBUG   lightningd: testing /home/lightning/libexec/c-lightning/lightning_openingd
2022-06-03T12:45:05.456Z DEBUG   hsmd: pid 966079, msgfd 41

Some sort of log status about the migration would help calm the nerves.

On a related note, what is actually stored in lightningd.sqlite3? Is it the bare essentials needed to safely use channels, or does it contain other stuff like gossip? Since the backup file is probably on a slower, perhaps even remote, disk, I would think that the less stuff in there, the better. Non-radioactive, yet important, things like invoice metadata are probably something I could back up separately.

@jb55
Collaborator

jb55 commented Jun 7, 2022

Wow, thanks for reminding me about this PR @Sjors. I just tried this as well and it seems to work. I have a bit more peace of mind now.

@blckbx blckbx mentioned this pull request Jul 12, 2022