rpc: Add sqlite format option for dumptxoutset #24952
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts: Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
Force-pushed from eaa4e63 to 21ee90d.
src/rpc/blockchain.cpp (outdated):

```cpp
    throw JSONRPCError(RPC_INTERNAL_ERROR, strprintf("Unable to open database file %s", path.u8string()));
}

ret = sqlite3_exec(db, "PRAGMA journal_mode=wal2;", nullptr, nullptr, nullptr);
```
Please add a comment explaining why we want `journal_mode` set to `wal2`, and ideally add a link to the SQLite documentation: https://www.sqlite.org/cgi/src/doc/wal2/doc/wal2.md
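For reference, the journal-mode trade-off can be illustrated with a minimal Python `sqlite3` sketch (this is an illustration, not the PR's C++ code; note that `wal2` only exists in the special SQLite wal2 branch linked above, so stock builds offer modes like `wal` and `off`):

```python
import os
import sqlite3
import tempfile

# Open a fresh database file, as the dump code would.
path = os.path.join(tempfile.mkdtemp(), "utxo.sqlite")
con = sqlite3.connect(path)

# Disabling the journal skips rollback bookkeeping entirely, which can
# speed up a one-shot bulk export where crash recovery of the output
# file is not needed. The PRAGMA returns the mode that is now in effect.
mode = con.execute("PRAGMA journal_mode=off;").fetchone()[0]
print(mode)  # 'off'
con.close()
```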
Btw, isn't `journal_mode = off` an even better match for what we want? I think that not using a journal might significantly improve write performance.
Yeah, we don't need/use rollbacks. It's just an export so that makes sense.
I see this is now addressed in 84a6e28, thanks!
EDIT: Never mind. I can't read...
Force-pushed from fad59d2 to 889f6c7.
tACK 889f6c7. Manually inspected a few random rows from the first 100 in the output DB, which looked good. I left a few additional comments inline.
I also see twice the time and speed on my side:
```cpp
        fs::rename(temppath, path);
        break;
    case UTXOSnapshotFormat::SQLITE:
        fs::remove(temppath);
```
Curious why we don't use the `temppath` for SQLite like we do for compact?
`temppath` adds ".incomplete" to the end of the filename. `path` is the path entered by the user, so the database is created using this parameter. `temppath` can eventually be removed from `CreateUTXOSnapshot()`, as this parameter is used only once and for logging purposes. I think the same information can be obtained from `afile`, as it is created from `temppath`.
Yes, it's not used in the SQLite branch (nothing actually gets written to it), but it is somewhat tightly coupled, and since we need to do the USE_SQLITE check, it makes it a little difficult to completely decouple. So I just thought to remove it at the end. It also acts as a decent indicator that nothing bad happened (although redundant).

Maybe this can be refactored in the future if we don't want it.
tACK 889f6c7
Force-pushed from 889f6c7 to 2d95dbb.
Approach ACK
Force-pushed from 3a761ef to 2d95f24.
Concept ACK
I first thought creating an external tool to convert from the compact-serialized format to SQLite was a viable alternative to implementing this directly in bitcoind, but it turned out to be a bad idea; one of the reasons is that the decompression logic has to be implemented again, which in one part even needs secp256k1 (#24628 (comment)).
Some initial thoughts:
- Is there any reason why we store prevoutHash and scriptPubKey with type TEXT rather than BLOB? Using the latter should reduce the size of the resulting database by almost 2x, and SQLite supports `select hex(column)` in case the hex representation is needed by the user.
- Not sure if it helps in this scenario or is worth investigating, but in some cases I've had good experiences tuning the `cache_size` parameter (https://www.sqlite.org/pragma.html#pragma_cache_size) in order to have less frequent writes (the default is 2 MB).
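As a quick illustration of the `cache_size` knob (a Python `sqlite3` sketch, not code from this PR; per the pragma docs, negative values are in KiB):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# PRAGMA cache_size reports pages (positive) or KiB (negative);
# a common default of -2000 corresponds to the 2 MB mentioned above.
default = con.execute("PRAGMA cache_size;").fetchone()[0]

# Request a larger ~64 MB page cache so dirty pages are flushed
# less frequently during a bulk insert workload.
con.execute("PRAGMA cache_size=-65536;")
tuned = con.execute("PRAGMA cache_size;").fetchone()[0]
print(default, tuned)  # tuned is -65536
con.close()
```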
One of the goals of this change is to enable trivial creation of JSON/CSV files with the data. This can be achieved simply by using one of the following:
If BLOB were used instead of hex stored in TEXT, this conversion step could not be done trivially, as one would have to specify a more complex SQL query such as:

However, it might still be worth a shot to require a more complicated SQL query if we really gain 2x the performance and cut the disk space to 50%.
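The TEXT-vs-BLOB size argument above can be checked directly. This is a hedged Python `sqlite3` sketch with a made-up one-column schema, not the PR's actual table layout:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE utxos_text (txid TEXT)")
con.execute("CREATE TABLE utxos_blob (txid BLOB)")

raw = bytes(range(32))  # stand-in for a 32-byte txid
con.execute("INSERT INTO utxos_text VALUES (?)", (raw.hex(),))
con.execute("INSERT INTO utxos_blob VALUES (?)", (raw,))

# Hex TEXT stores two characters per byte, so the stored payload doubles.
print(con.execute("SELECT length(txid) FROM utxos_text").fetchone()[0])  # 64
print(con.execute("SELECT length(txid) FROM utxos_blob").fetchone()[0])  # 32

# hex() recovers a readable form from the BLOB whenever the user needs it.
hexed = con.execute("SELECT hex(txid) FROM utxos_blob").fetchone()[0]
print(hexed == raw.hex().upper())  # True
con.close()
```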
Thanks! I'm happy to investigate tuning this. We're currently getting roughly double the time of a compact dump.
I think you can't get much better than this. You are writing approx 2x the data (because of the hex representation). IMO the only way to make the process fast is to use the BLOB type as suggested by @theStack above, at the expense of requiring a more complex query when working with the data.
I think that's a fair trade-off. Also, it should not be that much more complicated of a query.
@prusnak: Thanks for elaborating, easy conversion to CSV/JSON etc. makes sense! One thing that has to be taken care of is the endianness of the txid hash. Unfortunately Bitcoin displays (and accepts, in case of user input for RPCs) txids and block hashes as little-endian, but they are stored internally as big-endian: https://bitcoin.stackexchange.com/a/2069 (For the scriptPubKey there is no such endianness problem, as it is simply a sequence of bytes.) So as far as I can see, we have to decide between either easy conversion or ~2x performance / 50% size (though the latter is still just an assumption). One possible idea would be to store the BLOB already reversed, but this is kind of an in-between solution that possibly is even more confusing for users.
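The endianness pitfall described here can be made concrete with a small Python sketch, using the well-known genesis block hash as the example (illustration only, not PR code):

```python
# Internally, Bitcoin stores hashes in the reverse byte order of what
# users see in explorers and RPC output.
internal = bytes.fromhex(
    "6fe28c0ab6f1b372c1a6a246ae63f74f931e8365e15a089c68d6190000000000"
)

# Reversing the bytes recovers the familiar user-facing representation.
displayed = internal[::-1].hex()
print(displayed)
# 000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
```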
Concept ACK
Is it worth considering just writing it to stdout while iterating all UTXOs? That would make it easy for a consumer to bring it into any required format for further processing or analysis, without having Core make that decision. sqlite3 is able to capture from stdin as an import source, and most other formats like CSV or JSON can already be processed via the command line. You could even apply filtering while dumping all records.
It really sounds like we should be improving the current format (I don't think we need to worry about backwards compatibility too much), given the issues you've outlined, rather than adding a new format, more code, and a dependency we don't want. Otherwise we've still got to maintain the old format, which is compressed (convenient) but not really useful unless you write your own parser (clearly doable), and also maintain the new format, which is seemingly only being introduced because the current format isn't really usable for what everyone wants.
Force-pushed from 8fe0e7c to 01720ab.
The compact-serialized format was introduced as part of the AssumeUTXO project (see #16899) for creating and loading UTXO snapshots. Not sure if at this point changing the specification is desired/needed; I guess for this specific use-case the format is just fine (maybe @jamesob can comment on this?). So it seems that we really have two different goals here: on the one hand, a snapshot format to be saved/loaded by bitcoind or even other clients (and that in the most space-efficient way possible); on the other hand, providing a neutral, easy-to-process dump format that can be inspected by users. I agree that it would be the best solution if both of these goals could be met by a single format.
There's an issue in the latest push with configuring with
This allows RPC users of the `dumptxoutset` command to output to a SQLite DB format with human-readable fields. This approach allows for trivial export to JSON and CSV by using `sqlite3` (detailed in bitcoin#24628). There is a new optional 'format' parameter for `dumptxoutset` that can take on the values of either 'compact' or 'sqlite' with 'compact' as the default. This allows for backward compatibility. Closes bitcoin#24628
Force-pushed from 01720ab to c5506ac.
🐙 This pull request conflicts with the target branch and needs rebase.
This could be argued the other way around too: maybe it's good to have an internal and an external format. The internal format could be changed more easily when necessary for our internal purposes; it's like the block index, the UTXO index, etc. Someone can write a parser, but there are no guarantees of compatibility. Also, it only needs the data required to reconstruct from a snapshot, nothing more, nothing less. However, changes in the external format, a documented API, have to take external clients into account.
@fanquake @theStack yeah, the existing format is intended to be as space-efficient as possible since, ideally, that output will be commonly shared across the network to help get nodes going. That consideration, in conjunction with

makes me sympathetic to this changeset. I think as long as this feature gracefully degrades when sqlite support isn't compiled in, I'm Concept ACK and will be happy to review this soon.
Needs a trivial rebase for #24410:

Edit: I get the following error locally while running the

Looks like it expects error code
Concept NACK - I hope we don't go down this route. We've spent years trying to reduce the number of external dependencies in Bitcoin Core, especially in the option-less build. At this point in the project, I think adding new dependencies, or expanding the scope of existing ones, should be reserved for things that are actually important/mission critical, i.e. libsecp256k1, not for nice-to-have features that are going to be used by a (relatively) small number of developers and enthusiasts, and not by the majority of nodes running this piece of software. If anything, Bitcoin Core should be removing more code, and creating external tooling for new features, rather than continuing to accumulate functionality in this codebase because it's convenient. In this case, an external tool even exists already (I don't really understand why it needing libsecp as a dependency is a problem).

My longer-term concern is that if something like this is merged, in 3 months' time we'll have a new pull request to add support for using sqlite to dump out logs in JSON format, or some other thing that clearly shouldn't be a concern/maintenance burden of Bitcoin Core. This PR also gives us somewhat awkward build logic, where an RPC feature is enabled if a user compiles with a certain dependency that previously only controlled whether or not you got a certain type of wallet.

How is this going to work with #24202? Are we abandoning that PR in favour of this, or will we end up with a trifecta of output formats? In any case, this PR (rebased) is currently broken on arm64 macOS, when building with

```
  CXXLD    bitcoin-cli
  CXXLD    minisketch/test
  CXXLD    bitcoin-tx
  CXXLD    bitcoin-util
ar: -L/opt/homebrew/Cellar/sqlite/3.38.5/lib: No such file or directory
ar: -lsqlite3: No such file or directory
gmake[2]: *** [Makefile:5897: libbitcoin_node.a] Error 1
```
@fanquake: Great points raised that should be considered; I agree that an additional external dependency definitely has its drawbacks. Still, the somewhat uncomfortable fact is that right now we don't have a proper "dumb" (sic!) dump format for the UTXO set that only contains pure information and thus can easily be searched/processed. I think there should be one, as the current compact-serialized format has a different purpose and won't change. Considering that the UTXO set is one of the (if not the) most important data structures on all nodes (be it pruned or non-pruned), as it reflects the current state / distribution of funds with all (visible) spending conditions, I would argue it's worth making an exception here to the "we don't add convenience features if the same result can also be achieved somehow with external tools" rule. This enables the user to easily inspect that state without the need for further fiddling around between different formats and investing extra time and disk space.

I think it's pretty obvious that this PR would obsolete #24202; at least I can't come up with a good reason why anyone would still be interested in a text format if you can already have a superset in the form of a simple but still powerful database (converting from SQLite to CSV format is not a big deal). Though, if this PR is too controversial, it's probably still better to have #24202 instead of no proper UTXO set dump format at all. A compromise could be #24202 plus, as a follow-up, a trivial tool (e.g. in contrib/ or external) that converts the text output to a SQLite database? (Would maybe still be faster than the current approach of emitting a compact-serialized dump + converting that to SQLite format via utxo_to_sqlite). Very curious to hear others' opinions.
I think we should settle on two formats: a format for internal use and a format for external use. From there on, it's for people with custom scripts to analyze it or convert it into anything they want (it's not like end users have a use for a UTXO dump anyway). As the external format, I think SQLite is a surprisingly good choice for this kind of huge data dump (better than, say, JSON, or formats that require writing custom parsers). It should still be possible to compile the node without sqlite, though, for the people (most) who don't need external UTXO exports.

I agree.
I think answers to these questions would be illuminating for comparison to #24202:

If space is comparable between the two options, and a script that adapts CSV -> sqlite is easy (I think it would be) and relatively quick (not sure here), I would say that #24202 is preferable, since it would ship with all builds. We could even bundle such a script in

That said, I think, dependency considerations aside, sqlite is probably a much more practical output format for actual consumption. Reading sqlite into, say,
Did a quick comparison between a SQLite output that I created with this PR ~3 weeks ago (82401330 UTXO entries) and a text output with #24202 that I created just a few minutes ago (82801821 UTXO entries):

Both seem to take roughly ~2x the size of the compact-serialized format, with the text output even being a bit smaller (~10.25 GB vs ~11.59 GB). The vast majority of the space per entry is taken by txid and scriptPubKey, and since we store them as hex-encoded strings (both in this PR and in #24202), it makes sense that the sizes are in a similar range.

Pretty sure that the code complexity is low (it seems sqlite3 can even directly import CSV files, so it could be a very simple shell script: https://stackoverflow.com/a/1045961), while the runtime could be a bit of a show-stopper here. Will try to write such a script in the next few days and report the results then.
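For a sense of how small such a CSV-to-SQLite converter could be, here is a hedged Python sketch (Python rather than shell so it stays self-contained; the column names are assumptions, not the actual #24202 layout):

```python
import csv
import io
import sqlite3

def csv_to_sqlite(lines, con):
    """Load a header-prefixed CSV UTXO dump into a fresh `utxos` table."""
    reader = csv.reader(lines)
    header = next(reader)
    con.execute("CREATE TABLE utxos (%s)" % ", ".join(header))
    placeholders = ", ".join("?" * len(header))
    con.executemany("INSERT INTO utxos VALUES (%s)" % placeholders, reader)
    con.commit()

# Tiny fabricated sample; a real converter would stream from a dump file.
sample = io.StringIO(
    "txid,vout,value,coinbase,height,scriptpubkey\n"
    + "aa" * 32 + ",0,50.0,1,1,76a914\n"
    + "bb" * 32 + ",1,0.5,0,2,0014ab\n"
)
con = sqlite3.connect(":memory:")
csv_to_sqlite(sample, con)
print(con.execute("SELECT COUNT(*) FROM utxos").fetchone()[0])  # 2
```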
Actually, you can use SUBSTR and LENGTH (https://www.sqlite.org/lang_corefunc.html#substr), so in theory

```sql
SELECT hex(SUBSTR(txid, -1, -LENGTH(txid))), vout, value, coinbase, height, hex(scriptpubkey) FROM utxos
```

works. In general I think dumping to SQLite sounds like a fine idea. One point of contention, though: it would be nice if we could return the SQLite file over the network somehow, since not all RPC users have local filesystem access.
Unfortunately that doesn't work. The substring is still extracted from left to right without any reversing operation involved; the only thing possible is to specify the start and end offsets by counting from right to left:

If reversing in SQLite were possible, it would be very tempting to store the txid and scriptPubKey as BLOBs to have more compact dumps that are created faster.
Ah, interesting... since we know txids are 32 bytes, one option would be to do a 32-byte expansion of something like `substr('abc', 3, 1) || substr('abc', 2, 1) || substr('abc', 1, 1)` as a generated virtual table. This would have no storage overhead, but would have some overhead on retrieval.
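The 32-term expansion suggested here can be sanity-checked with Python's `sqlite3` module (a sketch against a made-up single-column table; for BLOB operands, `substr` and `||` operate on bytes):

```python
import sqlite3

# Build the 32-term expression that reads a 32-byte BLOB back to front,
# i.e. substr(txid, 32, 1) || substr(txid, 31, 1) || ... || substr(txid, 1, 1).
expr = " || ".join("substr(txid, %d, 1)" % i for i in range(32, 0, -1))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE utxos (txid BLOB)")
internal = bytes(range(32))  # stand-in for an internally-ordered txid
con.execute("INSERT INTO utxos VALUES (?)", (internal,))

# The expression yields the byte-reversed (user-facing) order.
flipped = con.execute("SELECT hex(%s) FROM utxos" % expr).fetchone()[0]
print(flipped == internal[::-1].hex().upper())  # True
con.close()
```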
Are you still working on this? cc @jamesob |
Will get back to it this week, been busy with other work :)
It's been several weeks since this comment with no additional updates, so I'm going to close this for now. If you do get around to working on this, please leave a comment so that it can be reopened. |