
other makeshift strategies for large scale database search - the "greyhound" issue #1226

Closed
ctb opened this issue Oct 30, 2020 · 17 comments


ctb commented Oct 30, 2020

I'm on day 3 of a gather of the Hu S1 dataset against all genbank (500k+ genomes), and chatting with @taylorreiter about the find-the-unassigned script that @luizirber wrote,

https://github.com/taylorreiter/cosmo-kmers/blob/master/scripts/unassigned.py

it occurs to me that in some situations (such as environmental metagenomes), it might be more efficient to do a single linear pass across the 500k genomes looking for genomes with substantial containment, and then build an in-memory (or even on disk!) database of just those genomes and do the gather on that.

stopgap? yes. takes less than 4 days? probably 😄
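
For concreteness, a toy sketch of the idea, with plain Python sets of hashes standing in for sourmash signatures (the names prefetch and min_shared_hashes are made up for illustration, not sourmash API):

```python
# Toy sketch: one linear pass over the reference sigs, keeping only those with
# enough overlap with the query; gather then runs on this much smaller database.
from typing import Dict, Iterable, Set, Tuple

def prefetch(query_hashes: Set[int],
             ref_sigs: Iterable[Tuple[str, Set[int]]],
             min_shared_hashes: int = 3) -> Dict[str, Set[int]]:
    matches = {}
    for name, hashes in ref_sigs:                        # single linear pass
        if len(query_hashes & hashes) >= min_shared_hashes:
            matches[name] = hashes                       # keep for the in-memory DB
    return matches
```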


ctb commented Nov 2, 2020

ok, I implemented a simple script, prefetch-gather.py (gist here), that does a linear pass across all databases.

For hu-s1, it completed in ~90 minutes using ~30 GB of RAM for the ~700k genomes from the latest genbank. The actual sourmash gather is still running, 5 days in, and it's only about 2/3 done 😆

Usage:

./prefetch-gather.py --query outputs/sigs/SRR1976948.abundtrim.sig --db /home/irber/sourmash_databases/outputs/sbt/genbank-*x1e5*k31* --save-matches xxx.sig --output-unassigned xxx-unassigned.sig

@taylorreiter

huzzah!


ctb commented Nov 2, 2020

@luizirber said on slack:

a couple of dishwashing[1] hot takes:

  1. there is no reason why we couldn't build a reduced LcaDB-like index (hash -> [dataset IDs]) while iterating over all the sigs the first time (see the sketch below). Implementing gather on top of that is easy, and it is going to be way smaller than a regular LcaDB containing all the full signatures
  2. turns out that 1) is VERY similar to what mash screen does, but using the scaled MH of the queries lets us reuse data without having to redownload the full datasets!
  3. The prefiltering from the gist should be the default in the gather command (before doing DB searches)
  4. actually, 1 should also be the default impl of gather for LinearIndex. Since each sig calculation only depends on the sig, this can be parallelized...
  5. ... and since all that is needed is already implemented in Rust (sig loading, count common, or membership query for a hash), we can also unleash parallelism (a la MAG search), and finish this WAAAAAY faster (assigning hash to dataset ID is probably going to be way faster too)
  6. these approaches are doing membership queries; in a pinch we could use Bloom Filters to represent large scaled minhashes or, even better, we could use MQF... =]

[1]: dishwashing is great for deep thinking
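
A minimal sketch of item 1 above, again with plain sets of hashes standing in for signatures (the names here are illustrative, not greyhound's or sourmash's actual internals):

```python
# Reduced LcaDB-like index (hash -> [dataset IDs]), built in one pass over the
# reference sigs and keeping only hashes that also occur in the query.
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

def build_reduced_index(query_hashes: Set[int],
                        ref_sigs: Iterable[Tuple[str, Set[int]]]) -> Dict[int, List[str]]:
    index: Dict[int, List[str]] = defaultdict(list)
    for dataset_id, hashes in ref_sigs:   # each sig is handled independently -> parallelizable (items 4-5)
        for h in hashes & query_hashes:   # drop hashes the query doesn't contain
            index[h].append(dataset_id)
    return index

# toy usage
refs = [("genomeA", {1, 2, 3}), ("genomeB", {3, 4, 5})]
print(dict(build_reduced_index({2, 3, 4}, refs)))
# {2: ['genomeA'], 3: ['genomeA', 'genomeB'], 4: ['genomeB']}
```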


ctb commented Nov 2, 2020

> LcaDB-like index (hash -> [dataset IDs]) ... is going to be way smaller than a regular LcaDB containing all the full signatures

is it? I'm not sure it is - this is the main content of an LCA DB, I think?

@luizirber

I implemented 1 + 4 + 5 in greyhound[1], and tried it with some queries. For SRR606249 (-k 51 --scaled 1000), it took 15 minutes and 1.3 GB of memory (using 24 threads) for the newer genbank bacteria (673k genomes).

At a high level, greyhound is implemented in two steps: filtering and indexing (step 1) and gather (step 2).

For step 1, the query is loaded into memory and then all the reference sigs are loaded in parallel and checked for intersection size with the query (map). If they are above the threshold_bp, their hashes are added to an LCA-like DB (reduce). A Counter-like structure (in the Python sense) is built from the DB, for ordering references by intersection size with the query.
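
A toy sketch of step 1 in the same stand-in representation (sets of hashes); threshold_bp // scaled converts the bp threshold into a minimum number of shared hashes. The names are illustrative, not greyhound's actual code:

```python
# Step 1 sketch: check each reference's overlap with the query (map), keep the
# ones above threshold_bp in a hash -> [refs] revindex (reduce), and build a
# Counter ordering references by intersection size with the query.
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

def prefilter_and_index(query_hashes: Set[int],
                        ref_sigs: Iterable[Tuple[str, Set[int]]],
                        scaled: int = 1000,
                        threshold_bp: int = 50_000):
    min_hashes = max(1, threshold_bp // scaled)
    revindex: Dict[int, List[str]] = defaultdict(list)
    counter: Counter = Counter()
    for name, hashes in ref_sigs:            # the "map" part, parallelizable per sig
        shared = query_hashes & hashes
        if len(shared) >= min_hashes:        # threshold_bp filter
            counter[name] = len(shared)
            for h in shared:                 # the "reduce" part: merge into the revindex
                revindex[h].append(name)
    return counter, revindex
```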

For step 2, while there are matches in the counter, the top match is selected, and each reference in the counter has its count decreased if it shares hashes with the top match (using the revindex to find which references to decrease).
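
And a rough, self-contained sketch of step 2, assuming step 1 produced the counter, the revindex, and the per-reference shared hash sets (again, illustrative names and toy inputs rather than greyhound's code):

```python
# Step 2 sketch: repeatedly take the reference with the largest remaining overlap,
# then use the revindex to decrement every other reference sharing its hashes.
from collections import Counter
from typing import Dict, Iterator, List, Set, Tuple

def gather_step2(counter: Counter,
                 revindex: Dict[int, List[str]],
                 ref_hashes: Dict[str, Set[int]]) -> Iterator[Tuple[str, int]]:
    while counter:
        top, top_count = counter.most_common(1)[0]
        if top_count == 0:            # everything left is covered by earlier matches
            break
        yield top, top_count          # top_count = query hashes newly explained by this match
        for h in ref_hashes[top]:
            if h in revindex:         # hash not yet claimed by an earlier match
                for name in revindex.pop(h):
                    counter[name] -= 1
        del counter[top]

# toy inputs, as if produced by step 1: two references overlapping the query
ref_hashes = {"A": {1, 2, 3}, "B": {2, 3, 4}}
revindex = {1: ["A"], 2: ["A", "B"], 3: ["A", "B"], 4: ["B"]}
counter = Counter({"A": 3, "B": 3})
print(list(gather_step2(counter, revindex, ref_hashes)))   # [('A', 3), ('B', 1)]
```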

The 3rd step is doing stats/summaries properly, but for now I'm just printing the matches.

While this is already way faster than the other gather approaches (in LcaDB and SBTs), it could be even faster with some optimization of the mh.intersection method (which calls merge all the time, and that method can be simplified/optimized A LOT).

Oh, and another hot take: I think there is space for a load_with_query(path, query) method for LcaDB, where it only keeps in memory the subset that has hashes in the query. It is more practical to share one LcaDB instead of rebuilding it from reference sigs all the time.

[1]: if smol is the tiniest and cutest gather, then the fastest gather should be greyhound, right? =P
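
A hedged sketch of what such a load_with_query could look like; the on-disk layout here (a gzipped JSON mapping of hash -> dataset IDs) is invented for illustration and is not sourmash's actual LcaDB format:

```python
# Hypothetical load_with_query(path, query): load an LcaDB-like mapping but keep
# only entries whose hash occurs in the query.  (json.load still parses the whole
# file; a real implementation would want to stream and filter as it reads.)
import gzip
import json
from typing import Dict, List, Set

def load_with_query(path: str, query_hashes: Set[int]) -> Dict[int, List[str]]:
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fp:
        full: Dict[str, List[str]] = json.load(fp)   # {"<hash>": [dataset IDs], ...}
    return {int(h): ids for h, ids in full.items() if int(h) in query_hashes}
```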


ctb commented Nov 3, 2020

as a side note, I really like the simplicity of all our underlying algorithms. nothin' special, let's just reimplement everything 5 different ways to experiment...

@luizirber

> as a side note, I really like the simplicity of all our underlying algorithms. nothin' special, let's just reimplement everything 5 different ways to experiment...

... and five years of building tooling, tearing the software apart and rebuilding it again, shaving yaks, sharpening, automating, thinking a lot to find the simple algorithm, and finally being able to reorganize and move the whole ship quickly when needed =]


ctb commented Nov 3, 2020

...I wonder what deeply laid bug we'll find next? 🤪

@luizirber

> ...I wonder what deeply laid bug we'll find next? 🤪

The seeds of the yin are in the yang, every advance contains its own destruction =]


ctb commented Nov 5, 2020

I think this new strategy offers some nice output options not available with sourmash gather - in particular, I can see people being interested in all of the genomes that have containment in the metagenome, and not just the ones reported by the gather algorithm.

@luizirber

> I think this new strategy offers some nice output options not available with sourmash gather - in particular, I can see people being interested in all of the genomes that have containment in the metagenome, and not just the ones reported by the gather algorithm.

Yup, that's easy to output too (report after step 1, where the counter is already available). I was focusing on reporting after step 2 (gather).


ctb commented Nov 6, 2020

related to the previous comment, I expect "which genomes are relevant (by containment)" to be much less dependent on k-mer size than gather - this is relevant to stuff going on in genome-grist, where the accuracy of the hash matching rates sometimes seems quite sensitive to k-mer size (using mapping as a baseline with which to measure "accuracy").

so, we can do something like 'large scale containment search at k=whatever', and then run gather on just the resulting genomes across a variety of k-sizes to explore how gather results and mapping rates depend on k-mer size.


ctb commented Nov 10, 2020

do you think the sourmash gather CLI should take a --pre-fetch or --greyhound option, on by default?


ctb commented Dec 18, 2020

idle musings: it seems like this approach would support signature search at higher resolutions than currently.

my understanding/belief is that we go with a scaled of 2000 for genomes in part because SBTs and LCAs get too big and unwieldy when you use smaller scaled values. BUT, with the prefetch/greyhound approach, the primary costs would be in loading the larger signatures, and nothing else, right? So we could potentially get to higher resolutions.

@luizirber

> idle musings: it seems like this approach would support signature search at higher resolutions than currently.
>
> my understanding/belief is that we go with a scaled of 2000 for genomes in part because SBTs and LCAs get too big and unwieldy when you use smaller scaled values. BUT, with the prefetch/greyhound approach, the primary costs would be in loading the larger signatures, and nothing else, right? So we could potentially get to higher resolutions.

Makes sense. I've been using scaled=2000 with the GTDB sigs (31k), and it uses 4.4GB of memory for the full DB. Anecdotally, tests with p8808mo9.abundtrim.sig used as low as 200 MB.

The challenge then becomes recalculating the reference sigs with smaller scaled values (100?), and storing them efficiently. JSON + gzip for sigs is at the limit for size, but I'm not sure what would be a good format that balances the archival / self-describing / easy-to-parse / small trade-offs.


ctb commented Jan 10, 2021

greyhound is being merged into sourmash, soon-ish - #1238


ctb commented Sep 23, 2023

All or most of these ideas have been canonized in sourmash, sourmash-rs, or pyo3_branchwater. I do feel like nominating this issue as a "posterity" issue because it really unleashed a whole new wave of optimizations (massively parallel stuff a la greyhound, AND ridiculously scalable on disk storage a la mastiff)... but in any case I think we can close this issue :)
