rework database construction and release process to use manifests #1652
I've been working on this on and off, as part of #1654 and #1619 and the scripts in #1641 (comment). I think the ... I attach an early draft script that does the file finding and the manifest reloading/updating, but it doesn't actually update the database: update-sqlite3-mom-dirs.py.txt. Other than fleshing out the ...
(Come to think of it, this is an excellent situation where plugins might come in handy. Most people using sourmash will probably not be constructing databases with hundreds of thousands of files!)
In chatting with @bluegenes, we broke it down into two different issues:
The latter is pretty straightforward, but the former is going to be pretty slow. Tessa suggested that we chunk the signatures into groups of (let's say) 20k and store them in zipfiles or directories. That seems pretty workable (50 such files would be a million sigs!), but we'd need some infrastructure around that, too...
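The chunking idea can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names, the `chunk-NNNN.zip` naming, and the flat layout inside each zip are assumptions, and the result is a plain zipfile rather than a sourmash zipfile collection with its own manifest.

```python
import zipfile
from itertools import islice
from pathlib import Path


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_chunked_zips(sig_paths, outdir, chunk_size=20_000, prefix="chunk"):
    """Write signature files into numbered zipfiles, `chunk_size` per zip.

    Hypothetical helper: returns the list of zipfile paths it created.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    created = []
    for n, batch in enumerate(chunked(sig_paths, chunk_size)):
        zip_path = outdir / f"{prefix}-{n:04d}.zip"
        # ZIP_STORED: no extra compression (assumes the inputs are already compressed)
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as zf:
            for path in batch:
                # arcname keeps only the file name; assumes names are unique
                zf.write(path, arcname=Path(path).name)
        created.append(zip_path)
    return created
```

With the default chunk size, a million signature files would end up in 50 zipfiles, matching the estimate above.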
More conversations with @bluegenes - I think our first attempt to improve database construction will end with:
🎉 well, that was easy! https://github.com/ctb/2021-sourmash-mom
OK, got it all working, it seems? Some of the output numbers are incorrect, so I'll fix that :)
tl;dr: ~1 minute and ~1 GB to get my grubby little paws on all the GTDB genome signatures for RS202.
I loaded all of the signatures from ...
Using the Manifest Of Manifests codebase, I then created manifests-of-manifests (MoMs or moms) containing the combined manifests of all the zip files, as well as a (much) smaller collection of signatures that @bluegenes created to round out things that wort didn't have.
This produced two sqlite databases that are not terribly large:
and then I grabbed the latest set of GTDB accessions:
(and then had to unmangle the column header, but whatever). Finally, I asked for all matching signatures across all mom databases (in this case, I didn't actually extract them, as that would have taken an hour or two :).
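To make the manifest-of-manifests lookup concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `mom` table, its columns, and both function names are hypothetical and are not taken from the 2021-sourmash-mom code; the point is only that combined manifest rows can be indexed by identifier and queried without opening any of the underlying zipfiles.

```python
import sqlite3


def build_mom(conn, manifests):
    """Combine per-collection manifest rows into one indexed table.

    `manifests` maps a collection location (e.g. a zipfile path) to a list
    of (ident, md5) rows. Schema and row layout are assumptions.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mom (ident TEXT, md5 TEXT, location TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS mom_ident ON mom (ident)")
    for location, rows in manifests.items():
        conn.executemany(
            "INSERT INTO mom (ident, md5, location) VALUES (?, ?, ?)",
            [(ident, md5, location) for ident, md5 in rows],
        )
    conn.commit()


def find_locations(conn, idents):
    """Return {ident: [locations]} for the idents present in the table."""
    found = {}
    for ident in idents:
        cur = conn.execute(
            "SELECT location FROM mom WHERE ident = ? ORDER BY location",
            (ident,),
        )
        locations = [row[0] for row in cur]
        if locations:
            found[ident] = locations
    return found
```

Because only the manifest rows are touched, a query like this says where each matching signature lives; extracting the signatures themselves is a separate and much slower step.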
One mildly neat realization coming out of #1664 is that for this kind of manifest stuff, the size of the underlying data doesn't matter - we have about the same number of signatures for the SRA as we do for genbank genomes, so all of the manifest stuff will work just fine. It's only the actual search that will be slower for the SRA data because it's so much bigger than the genbank genomes.
trying out using the NCBI assembly_summary.txt files, ref sourmash-bio/databases#7, it all seems pretty straightforward --
which gave
Note the added feature, "Wrote 35 unmatched values from picklist to '../genbank_build/xxx.csv'", which will be important for automation :)
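The unmatched-values report is easy to mimic outside sourmash, which may help when wiring it into automation. The sketch below is not sourmash's implementation; `split_picklist`, `write_unmatched`, and the single `ident` column are assumptions made for illustration.

```python
import csv


def split_picklist(pick_values, manifest_idents):
    """Split picklist values into (matched, unmatched), preserving order."""
    available = set(manifest_idents)
    matched, unmatched = [], []
    for value in pick_values:
        (matched if value in available else unmatched).append(value)
    return matched, unmatched


def write_unmatched(unmatched, csv_path, column="ident"):
    """Write unmatched values to a one-column CSV; return how many were written."""
    with open(csv_path, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow([column])  # header row
        for value in unmatched:
            writer.writerow([value])
    return len(unmatched)
```

A downstream step could then read the CSV and fetch or sketch only the missing genomes.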
all of genbank => 88k missing signatures from wort, it seems.
Our latest database release is pretty nice, but life is also getting much more complicated ;). The process @bluegenes (mostly) and I are using to build/release GTDB looks something like this:
With sourmash 4.2.0, we can now start using picklists with sourmash sig cat to construct the zipfile collections, and manifests are automatically produced from that point on. Future improvements such as lazy signature loading using manifests/manifests-of-manifests can also make the actual disk I/O etc. much simpler when selecting from large collections.
Separately, @luizirber has a different database building process that builds the "genbank microbial" databases, based (I think) mostly on the wort output as well as the assembly_report file.
This is all getting to be a lot to manage, and partly as a result we haven't produced a new genbank microbial database in a while.
I chatted briefly with Tessa about the idea of using manifests as the starting point for building databases.
The basic idea goes something like this -