prefetch-only `Index` classes and/or remote servers? #2229

ctb · 2022-08-21T13:34:25Z

@luizirber and I chatted a bit over Slack about the new mastiff service he built, which allows ~realtime search of the SRA public metagenomes (!!)

This, in turn, enables other things like realtime JavaScript dashboards for genome inclusion in metagenomes, etc. So that's cool.

One of the things that stuck with me is that there is an increasingly useful distinction between "prefetch" on databases and then further triage and reporting. Here, prefetch is our internal term for "give me all the overlaps that exist for this query", and it can be turned into containment searches or Jaccard similarity searches or other things easily; see #1392 for some background here.
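To make the prefetch distinction concrete, here is a minimal sketch in plain Python. It uses sets of integer hashes as stand-ins for sketches; the function names and the in-memory dict "database" are assumptions for illustration, not the sourmash API. The point is that once prefetch has reported the overlap sizes, containment and Jaccard can be derived without touching the signatures again.

```python
from typing import Dict, Iterator, Set, Tuple


def prefetch(query: Set[int], database: Dict[str, Set[int]]) -> Iterator[Tuple[str, int, int]]:
    """Yield (name, shared, match_size) for every sketch that overlaps the query.

    Hypothetical stand-in for a prefetch backend; sketches are plain hash sets.
    """
    for name, hashes in database.items():
        shared = len(query & hashes)
        if shared:
            yield name, shared, len(hashes)


def containment(shared: int, query_size: int) -> float:
    """Fraction of the query's hashes that are also present in the match."""
    return shared / query_size if query_size else 0.0


def jaccard(shared: int, query_size: int, match_size: int) -> float:
    """Shared hashes divided by the size of the union of the two sketches."""
    union = query_size + match_size - shared
    return shared / union if union else 0.0


if __name__ == "__main__":
    query = {1, 2, 3, 4}
    database = {"a": {1, 2, 9}, "b": {7, 8}}
    for name, shared, match_size in prefetch(query, database):
        print(name, containment(shared, len(query)), jaccard(shared, len(query), match_size))
```

Note that Jaccard needs the size of the match as well as the overlap, so a prefetch backend that only reports shared hash counts would support containment of the query but not Jaccard.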

For mastiff, luiz has, I think, primarily sped up this prefetch functionality. Actual Index-like functionality that requires access to the signatures is completely distinct, and would be hard to implement on top of mastiff directly, without providing access to the signatures.

So maybe there is a useful distinction here for future API development:

first, provide on disk or client/server prefetch-style access to massive databases;
second, provide on disk or client/server access to signatures (kind of Storage-like, now that I think of it?)
third, provide a generic mix-and-match Index class that combines the two to provide full Index class services that enable all the good things.
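A rough sketch of how the three pieces could fit together, in Python. Everything here is hypothetical: `PrefetchBackend`, `SignatureStore`, `CombinedIndex` and the in-memory implementations are illustrative names, not existing sourmash classes, and the `gather` method is a simplified greedy decomposition rather than sourmash's actual algorithm. It is only meant to show that prefetch needs overlap counts alone, while a gather-style analysis has to load the signatures.

```python
from typing import Dict, Iterable, List, Protocol, Set, Tuple


class PrefetchBackend(Protocol):
    """Piece one: reports (key, number of shared hashes) for a query."""
    def prefetch(self, query: Set[int]) -> Iterable[Tuple[str, int]]: ...


class SignatureStore(Protocol):
    """Piece two: Storage-like access to the hashes behind a key."""
    def load(self, key: str) -> Set[int]: ...


class InMemoryPrefetch:
    def __init__(self, sketches: Dict[str, Set[int]]):
        self.sketches = sketches

    def prefetch(self, query: Set[int]) -> Iterable[Tuple[str, int]]:
        for key, hashes in self.sketches.items():
            shared = len(query & hashes)
            if shared:
                yield key, shared


class InMemoryStore:
    def __init__(self, sketches: Dict[str, Set[int]]):
        self.sketches = sketches

    def load(self, key: str) -> Set[int]:
        return self.sketches[key]


class CombinedIndex:
    """Piece three: mixes any prefetch backend with any signature store."""
    def __init__(self, prefetcher: PrefetchBackend, store: SignatureStore):
        self.prefetcher = prefetcher
        self.store = store

    def prefetch(self, query: Set[int]) -> List[Tuple[str, int]]:
        return list(self.prefetcher.prefetch(query))

    def gather(self, query: Set[int]) -> List[Tuple[str, int]]:
        """Greedy decomposition: repeatedly take the match covering most remaining hashes."""
        remaining = set(query)
        # Only the prefetch hits need to be loaded from the store.
        candidates = {key: self.store.load(key) for key, _ in self.prefetcher.prefetch(query)}
        results = []
        while remaining and candidates:
            best = max(sorted(candidates), key=lambda k: len(candidates[k] & remaining))
            gained = len(candidates[best] & remaining)
            if gained == 0:
                break
            results.append((best, gained))
            remaining -= candidates.pop(best)
        return results
```

In this arrangement the prefetch backend could be a remote server and the store a local zip file (or the other way around), since `CombinedIndex` only depends on the two small interfaces.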

ref RPC more generally, #1644


ctb · 2022-08-21T13:49:28Z

incidentally, one way to jerry-rig this directly into our current API setup is to have the prefetch server return precisely two pieces of information: the number of shared hashes, and the md5sum (or other unique key). This can then be used as picklists for more detailed analyses of signatures.
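As a sketch of that idea: assuming the prefetch server hands back `(md5sum, shared_hashes)` pairs, a few lines of Python can turn them into a CSV usable as a picklist. The function name, the `shared_hashes` column, and the `min_shared` threshold are made up for illustration; only the `md5` column matters for picklist matching.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_picklist(hits: Iterable[Tuple[str, int]], out: TextIO, min_shared: int = 1) -> int:
    """Write prefetch hits as a CSV with an 'md5' column; return the number of rows kept.

    `hits` is assumed to be (md5sum, shared_hashes) pairs from a prefetch server.
    """
    writer = csv.writer(out)
    writer.writerow(["md5", "shared_hashes"])
    kept = 0
    for md5, shared in hits:
        if shared >= min_shared:
            writer.writerow([md5, shared])
            kept += 1
    return kept


if __name__ == "__main__":
    buf = io.StringIO()
    write_picklist([("abc", 5), ("def", 1)], buf, min_shared=2)
    print(buf.getvalue())
```

Saved as, say, `hits.csv`, a file like this should be usable with the existing picklist option in the form `--picklist hits.csv:md5:md5`, so that the detailed analysis only loads the signatures the prefetch step selected.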

this is quite different from the greyhound idea which is to use massive parallelism to search non-overlapping subsets of databases.

luizirber · 2022-08-21T19:49:42Z

> For mastiff, luiz has, I think, primarily sped up this prefetch functionality. Actual Index-like functionality that requires access to the signatures is completely distinct, and would be hard to implement on top of mastiff directly, without providing access to the signatures.

It is the same dilemma as with the LCA index: the signatures are not explicitly present, but you can recompute them. SBTs have both index and sigs explicitly in the distribution. In mastiff it is sidestepped by having two signature types: Internal (the sig is stored in rocksdb) and External (points to a path). In greyhound (#1943) the index also holds a Storage and can load from ZipStorage (the External in mastiff can easily be converted from a path to a file into a path inside any Storage), so the "distribution" becomes a zip file for the sig collection plus the files for the index per se.

> So maybe there is a useful distinction here for future API development:
>
> first, provide on disk or client/server prefetch-style access to massive databases;
>
> second, provide on disk or client/server access to signatures (kind of Storage-like, now that I think of it?)
>
> third, provide a generic mix-and-match Index class that combines the two to provide full Index class services that enable all the good things.

👍

> incidentally, one way to jerry-rig this directly into our current API setup is to have the prefetch server return precisely two pieces of information: the number of shared hashes, and the md5sum (or other unique key). This can then be used as picklists for more detailed analyses of signatures.

Technically this is what mastiff returns nowadays: number of shared hashes == containment * len(query), and the SRA accession is unique. Very easy to change to any other field available in the sig.

> this is quite different from the greyhound idea which is to use massive parallelism to search non-overlapping subsets of databases.

greyhound and mastiff are literally the same thing once they are built (API-wise). They both build in parallel, but greyhound is hard to serialize to disk, and mastiff gets it for free from being rocksdb-based.

One thing that #1943 is doing is splitting greyhound into a basic LinearIndex that uses massive parallelism for search, and from it a RevIndex can be built to optimize search (and avoid the parallel search, because you don't need to access all sigs anymore). The RevIndex::from_zipstorage method literally builds a LinearIndex first, and then calls .index() on it to generate a RevIndex. This is likely to be the API I'll use for the mastiff PR too (#2230).
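A toy Python illustration of that LinearIndex to RevIndex relationship (the real code in #1943 is Rust; the classes below only borrow the names and are not its API). The linear index has to visit every sketch for each query, which is where the parallelism helps; the inverted index maps each hash to the sketches containing it, so a search only touches sketches that actually share a hash with the query.

```python
from collections import Counter, defaultdict
from typing import Dict, Set


class RevIndex:
    """Inverted index: hash -> names of the sketches containing it."""
    def __init__(self, hash_to_names: Dict[int, Set[str]]):
        self.hash_to_names = hash_to_names

    def search(self, query: Set[int]) -> Dict[str, int]:
        counts: Counter = Counter()
        for h in query:
            for name in self.hash_to_names.get(h, ()):
                counts[name] += 1
        return dict(counts)


class LinearIndex:
    """Flat collection of sketches; search scans all of them."""
    def __init__(self, sketches: Dict[str, Set[int]]):
        self.sketches = sketches

    def search(self, query: Set[int]) -> Dict[str, int]:
        # Serial here for simplicity; this per-sketch loop is the part that parallelizes.
        results = {}
        for name, hashes in self.sketches.items():
            shared = len(query & hashes)
            if shared:
                results[name] = shared
        return results

    def index(self) -> RevIndex:
        """Build the inverted index from the linear one."""
        hash_to_names: Dict[int, Set[str]] = defaultdict(set)
        for name, hashes in self.sketches.items():
            for h in hashes:
                hash_to_names[h].add(name)
        return RevIndex(dict(hash_to_names))


if __name__ == "__main__":
    linear = LinearIndex({"a": {1, 2, 3}, "b": {3, 4}, "c": {9}})
    rev = linear.index()
    print(linear.search({1, 2, 3, 4}), rev.search({1, 2, 3, 4}))
```

Both searches return the same overlap counts; the difference is only in how much of the collection each one has to look at.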

ctb · 2024-07-14T17:07:24Z

an update, nearly two years later:

RocksDB indices don't need the raw sketches to support overlap/containment analysis; see "MRG: provide `--internal-storage` and `--no-internal-storage` for index", sourmash_plugin_branchwater#390 (comment).
RocksDB indices do require the raw sketches for gather.
@luizirber is implementing remote storage options for the latter case; see "feat: RocksDB storage and self-contained RevIndex with internal storage", #3250.

So in this sense they are in fact excellent examples of prefetch-only Index classes :)

ctb mentioned this issue Sep 5, 2022: supporting file loading over HTTP - thoughts and concerns #2257 (Open)

ctb mentioned this issue Jan 5, 2023: plugins/plug-in architecture #1353 (Closed)
