
other makeshift strategies for large scale database search - the "greyhound" issue #1226

Closed
ctb opened this issue Oct 30, 2020 · 17 comments


ctb commented Oct 30, 2020

I'm on day 3 of a gather of the Hu S1 dataset against all genbank (500k+ genomes), and chatting with @taylorreiter about the find-the-unassigned script that @luizirber wrote,

https://github.com/taylorreiter/cosmo-kmers/blob/master/scripts/unassigned.py

it occurs to me that in some situations (such as environmental metagenomes), it might be more efficient to do a single linear pass across the 500k genomes looking for genomes with substantial containment, and then build an in-memory (or even on disk!) database of just those genomes and do the gather on that.

stopgap? yes. takes less than 4 days? probably 😄
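
For concreteness, a toy sketch of the idea, with plain Python sets of hashes standing in for sourmash signatures (the names prefetch and min_shared_hashes are made up for illustration, not sourmash API):

```python
# Toy sketch: one linear pass over the reference sigs, keeping only those with
# enough overlap with the query; gather then runs on this much smaller database.
from typing import Dict, Iterable, Set, Tuple

def prefetch(query_hashes: Set[int],
             ref_sigs: Iterable[Tuple[str, Set[int]]],
             min_shared_hashes: int = 3) -> Dict[str, Set[int]]:
    matches = {}
    for name, hashes in ref_sigs:                        # single linear pass
        if len(query_hashes & hashes) >= min_shared_hashes:
            matches[name] = hashes                       # keep for the in-memory DB
    return matches
```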


ctb commented Nov 2, 2020

ok, I implemented a simple script, prefetch-gather.py (gist here), that does a linear pass across all databases.

For hu-s1, it completed in ~90 minutes using ~30 GB of RAM for the ~700k genomes from the latest genbank. The actual sourmash gather is still running, 5 days in, and it's only about 2/3 done 😆

Usage:

./prefetch-gather.py --query outputs/sigs/SRR1976948.abundtrim.sig --db /home/irber/sourmash_databases/outputs/sbt/genbank-*x1e5*k31* --save-matches xxx.sig --output-unassigned xxx-unassigned.sig

@taylorreiter

huzzah!


ctb commented Nov 2, 2020

@luizirber said on slack:

a couple of dishwashing[1] hot takes:

  1. there is no reason why we couldn't build a reduced LcaDB-like index (hash -> [dataset IDs]) while iterating over all the sigs the first time (see the sketch below). Implementing gather on top of that is easy, and it is going to be way smaller than a regular LcaDB containing all the full signatures
  2. turns out that 1) is VERY similar to what mash screen does, but using the scaled MH of the queries lets us reuse data without having to redownload the full datasets!
  3. The prefiltering from the gist should be the default in the gather command (before doing DB searches)
  4. actually, 1 should also be the default impl of gather for LinearIndex. Since each sig calculation only depends on the sig, this can be parallelized...
  5. ... and since all that is needed is already implemented in Rust (sig loading, count common, or membership query for a hash), we can also unleash parallelism (a la MAG search), and finish this WAAAAAY faster (assigning hash to dataset ID is probably going to be way faster too)
  6. these approaches are doing membership queries; in a pinch we could use Bloom Filters to represent large scaled minhashes or, even better, we could use MQF... =]

[1]: dishwashing is great for deep thinking
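
A minimal sketch of item 1 above, again with plain sets of hashes standing in for signatures (the names here are illustrative, not greyhound's or sourmash's actual internals):

```python
# Reduced LcaDB-like index (hash -> [dataset IDs]), built in one pass over the
# reference sigs and keeping only hashes that also occur in the query.
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

def build_reduced_index(query_hashes: Set[int],
                        ref_sigs: Iterable[Tuple[str, Set[int]]]) -> Dict[int, List[str]]:
    index: Dict[int, List[str]] = defaultdict(list)
    for dataset_id, hashes in ref_sigs:   # each sig is handled independently -> parallelizable (items 4-5)
        for h in hashes & query_hashes:   # drop hashes the query doesn't contain
            index[h].append(dataset_id)
    return index

# toy usage
refs = [("genomeA", {1, 2, 3}), ("genomeB", {3, 4, 5})]
print(dict(build_reduced_index({2, 3, 4}, refs)))
# {2: ['genomeA'], 3: ['genomeA', 'genomeB'], 4: ['genomeB']}
```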


ctb commented Nov 2, 2020

> LcaDB-like index (hash -> [dataset IDs]) ... is going to be way smaller than a regular LcaDB containing all the full signatures

is it? I'm not sure it is - this is the main content of an LCA DB, I think?

@luizirber

I implemented 1 + 4 + 5 in greyhound[1], and tried it with some queries. For SRR606249 (-k 51 --scaled 1000), it took 15 minutes and 1.3 GB of memory (using 24 threads) for the newer genbank bacteria (673k genomes).

At a high level, greyhound is implemented in two steps: filtering and indexing (step 1) and gather (step 2).

For step 1, the query is loaded into memory and then all the reference sigs are loaded in parallel and checked for intersection size with the query (map). If they are above the threshold_bp, their hashes are added to an LCA-like DB (reduce). A Counter-like structure (in the Python sense) is built from the DB, for ordering references by intersection size with the query.
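
A toy sketch of step 1 in the same stand-in representation (sets of hashes); threshold_bp // scaled converts the bp threshold into a minimum number of shared hashes. The names are illustrative, not greyhound's actual code:

```python
# Step 1 sketch: check each reference's overlap with the query (map), keep the
# ones above threshold_bp in a hash -> [refs] revindex (reduce), and build a
# Counter ordering references by intersection size with the query.
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

def prefilter_and_index(query_hashes: Set[int],
                        ref_sigs: Iterable[Tuple[str, Set[int]]],
                        scaled: int = 1000,
                        threshold_bp: int = 50_000):
    min_hashes = max(1, threshold_bp // scaled)
    revindex: Dict[int, List[str]] = defaultdict(list)
    counter: Counter = Counter()
    for name, hashes in ref_sigs:            # the "map" part, parallelizable per sig
        shared = query_hashes & hashes
        if len(shared) >= min_hashes:        # threshold_bp filter
            counter[name] = len(shared)
            for h in shared:                 # the "reduce" part: merge into the revindex
                revindex[h].append(name)
    return counter, revindex
```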

For step 2, while there are matches in the counter, the top match is selected, and each reference in the counter has its count decreased if it shares hashes with the top match (using the revindex to find which references to decrease).
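
And a rough, self-contained sketch of step 2, assuming step 1 produced the counter, the revindex, and the per-reference shared hash sets (again, illustrative names and toy inputs rather than greyhound's code):

```python
# Step 2 sketch: repeatedly take the reference with the largest remaining overlap,
# then use the revindex to decrement every other reference sharing its hashes.
from collections import Counter
from typing import Dict, Iterator, List, Set, Tuple

def gather_step2(counter: Counter,
                 revindex: Dict[int, List[str]],
                 ref_hashes: Dict[str, Set[int]]) -> Iterator[Tuple[str, int]]:
    while counter:
        top, top_count = counter.most_common(1)[0]
        if top_count == 0:            # everything left is covered by earlier matches
            break
        yield top, top_count          # top_count = query hashes newly explained by this match
        for h in ref_hashes[top]:
            if h in revindex:         # hash not yet claimed by an earlier match
                for name in revindex.pop(h):
                    counter[name] -= 1
        del counter[top]

# toy inputs, as if produced by step 1: two references overlapping the query
ref_hashes = {"A": {1, 2, 3}, "B": {2, 3, 4}}
revindex = {1: ["A"], 2: ["A", "B"], 3: ["A", "B"], 4: ["B"]}
counter = Counter({"A": 3, "B": 3})
print(list(gather_step2(counter, revindex, ref_hashes)))   # [('A', 3), ('B', 1)]
```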

The 3rd step is doing stats/summaries properly, but for now I'm just printing the matches.

While this is already way faster than the other gather approaches (in LcaDB and SBTs), it could be even faster with some optimization of the mh.intersection method (which calls merge all the time, and that method can be simplified/optimized A LOT).

Oh, and another hot take: I think there is space for a load_with_query(path, query) method for LcaDB, where it only keeps in memory the subset that has hashes in the query. It is more practical to share one LcaDB instead of rebuilding it from reference sigs all the time.

[1]: if smol is the tiniest and cutest gather, then the fastest gather should be greyhound, right? =P
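
A hedged sketch of what such a load_with_query could look like; the on-disk layout here (a gzipped JSON mapping of hash -> dataset IDs) is invented for illustration and is not sourmash's actual LcaDB format:

```python
# Hypothetical load_with_query(path, query): load an LcaDB-like mapping but keep
# only entries whose hash occurs in the query.  (json.load still parses the whole
# file; a real implementation would want to stream and filter as it reads.)
import gzip
import json
from typing import Dict, List, Set

def load_with_query(path: str, query_hashes: Set[int]) -> Dict[int, List[str]]:
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fp:
        full: Dict[str, List[str]] = json.load(fp)   # {"<hash>": [dataset IDs], ...}
    return {int(h): ids for h, ids in full.items() if int(h) in query_hashes}
```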


ctb commented Nov 3, 2020

as a side note, I really like the simplicity of all our underlying algorithms. nothin' special, let's just reimplement everything 5 different ways to experiment...

@luizirber

> as a side note, I really like the simplicity of all our underlying algorithms. nothin' special, let's just reimplement everything 5 different ways to experiment...

... and five years of building tooling, tearing the software apart and rebuilding it again, shaving yaks, sharpening, automating, thinking a lot to find the simple algorithm, and finally being able to reorganize and move the whole ship quickly when needed =]


ctb commented Nov 3, 2020

...I wonder what deeply laid bug we'll find next? 🤪

@luizirber

> ...I wonder what deeply laid bug we'll find next? 🤪

The seeds of the yin are in the yang, every advance contains its own destruction =]


ctb commented Nov 5, 2020

I think this new strategy offers some nice output options not available with sourmash gather - in particular, I can see people being interested in all of the genomes that have containment in the metagenome, and not just the ones reported by the gather algorithm.

@luizirber

> I think this new strategy offers some nice output options not available with sourmash gather - in particular, I can see people being interested in all of the genomes that have containment in the metagenome, and not just the ones reported by the gather algorithm.

Yup, that's easy to output too (report after step 1, where the counter is already available). I was focusing on reporting after step 2 (gather).


ctb commented Nov 6, 2020

related to the previous comment, I expect "which genomes are relevant (by containment)" to be much less dependent on k-mer size than gather - this is relevant to stuff going on in genome-grist, where the accuracy of the hash matching rates sometimes seems quite sensitive to k-mer size (using mapping as a baseline with which to measure "accuracy").

so, we can do something like 'large scale containment search at k=whatever', and then run gather on just the resulting genomes across a variety of k-sizes to explore how gather results and mapping rates depend on k-mer size.


ctb commented Nov 10, 2020

do you think the sourmash gather CLI should take a --pre-fetch or --greyhound option, on by default?


ctb commented Dec 18, 2020

idle musings: it seems like this approach would support signature search at higher resolutions than currently.

my understanding/belief is that we go with a scaled of 2000 for genomes in part because SBTs and LCAs get too big and unwieldy when you use smaller scaled values. BUT, with the prefetch/greyhound approach, the primary costs would be in loading the larger signatures, and nothing else, right? So we could potentially get to higher resolutions.

@luizirber

> idle musings: it seems like this approach would support signature search at higher resolutions than currently.
>
> my understanding/belief is that we go with a scaled of 2000 for genomes in part because SBTs and LCAs get too big and unwieldy when you use smaller scaled values. BUT, with the prefetch/greyhound approach, the primary costs would be in loading the larger signatures, and nothing else, right? So we could potentially get to higher resolutions.

Makes sense. I've been using scaled=2000 with the GTDB sigs (31k), and it uses 4.4GB of memory for the full DB. Anecdotally, tests with p8808mo9.abundtrim.sig used as low as 200 MB.

The challenge then becomes recalculating the reference sigs with smaller scaled values (100?), and storing them efficiently. JSON + gzip for sigs is at the limit for size, but I'm not sure what would be a good format that balances the archival / self-describing / easy-to-parse / small trade-offs.


ctb commented Jan 10, 2021

greyhound is being merged into sourmash, soon-ish - #1238


ctb commented Sep 23, 2023

All or most of these ideas have been canonized in sourmash, sourmash-rs, or pyo3_branchwater. I do feel like nominating this issue as a "posterity" issue because it really unleashed a whole new wave of optimizations (massively parallel stuff a la greyhound, AND ridiculously scalable on disk storage a la mastiff)... but in any case I think we can close this issue :)
