prefetch-only Index classes and/or remote servers? #2229

Open
ctb opened this issue Aug 21, 2022 · 3 comments

@ctb
Contributor

ctb commented Aug 21, 2022

@luizirber and I chatted a bit over Slack about the new mastiff service he built, which allows near-realtime search of the SRA public metagenomes (!!)

This, in turn, enables other things like realtime JavaScript dashboards for genome inclusion in metagenomes, etc. So that's cool.

One of the things that stuck with me is that there is an increasingly useful distinction between "prefetch" on databases and then further triage and reporting. Here, prefetch is our internal term for "give me all the overlaps that exist for this query", and it can be turned into containment searches or Jaccard similarity searches or other things easily; see #1392 for some background here.

For mastiff, luiz has, I think, primarily sped up this prefetch functionality. Actual Index-like functionality that requires access to the signatures is completely distinct, and would be hard to implement on top of mastiff directly, without providing access to the signatures.

So maybe there is a useful distinction here for future API development:

  • first, provide on disk or client/server prefetch-style access to massive databases;
  • second, provide on disk or client/server access to signatures (kind of Storage-like, now that I think of it?)
  • third, provide a generic mix-and-match Index class that combines the two to provide full Index class services that enable all the good things.
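To make the three layers concrete, here is a minimal Python sketch. It is not the sourmash API: the names `Prefetcher`, `SignatureStore`, `ScanPrefetcher`, `DictStore`, and `CombinedIndex` are all hypothetical, and signatures are simplified to plain sets of integer hashes. The point it illustrates is that containment needs only the prefetch layer, while Jaccard also needs the size of each match, which has to come from the signature layer.

```python
from typing import Dict, FrozenSet, List, Protocol, Tuple

Hashes = FrozenSet[int]  # assumption: a signature is just a set of hashes


class Prefetcher(Protocol):
    """Layer 1: report every overlap, as (key, number of shared hashes)."""
    def prefetch(self, query: Hashes) -> List[Tuple[str, int]]: ...


class SignatureStore(Protocol):
    """Layer 2: Storage-like access to full signatures by key."""
    def load(self, key: str) -> Hashes: ...


class ScanPrefetcher:
    """Toy prefetcher that scans everything; a real one could be remote."""
    def __init__(self, sigs: Dict[str, Hashes]) -> None:
        self._sigs = dict(sigs)

    def prefetch(self, query: Hashes) -> List[Tuple[str, int]]:
        hits = [(key, len(query & hashes)) for key, hashes in self._sigs.items()]
        return [(key, n) for key, n in hits if n > 0]


class DictStore:
    """Toy signature store backed by a dict."""
    def __init__(self, sigs: Dict[str, Hashes]) -> None:
        self._sigs = dict(sigs)

    def load(self, key: str) -> Hashes:
        return self._sigs[key]


class CombinedIndex:
    """Layer 3: mix and match any prefetcher with any signature store."""
    def __init__(self, prefetcher: Prefetcher, store: SignatureStore) -> None:
        self.prefetcher = prefetcher
        self.store = store

    def containment(self, query: Hashes) -> Dict[str, float]:
        # Needs only the prefetch results and the query itself.
        return {key: n / len(query) for key, n in self.prefetcher.prefetch(query)}

    def jaccard(self, query: Hashes) -> Dict[str, float]:
        # Needs the match size too, so each match is loaded from the store.
        out = {}
        for key, n in self.prefetcher.prefetch(query):
            match = self.store.load(key)
            out[key] = n / (len(query) + len(match) - n)
        return out
```

In this sketch the prefetcher and the store could live on different machines, since `CombinedIndex` only ever talks to them through the two small interfaces.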

ref RPC more generally, #1644

@ctb
Contributor Author

ctb commented Aug 21, 2022

incidentally, one way to jerry-rig this directly into our current API setup is to have the prefetch server return precisely two pieces of information: the number of shared hashes, and the md5sum (or other unique key). This can then be used as picklists for more detailed analyses of signatures.
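A rough sketch of that hand-off, assuming the prefetch server's reply has already been parsed into `(md5, n_shared)` pairs. The helper names and the CSV column names `md5` and `n_shared` are made up for illustration; the actual picklist column conventions should be checked against the sourmash docs.

```python
import csv
import io


def write_picklist(results, min_shared=1):
    """Turn (md5, n_shared) pairs into CSV text, keeping strong matches only."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["md5", "n_shared"])  # hypothetical column names
    for md5, n_shared in results:
        if n_shared >= min_shared:
            writer.writerow([md5, n_shared])
    return buf.getvalue()


def read_picklist_keys(text):
    """Read the CSV back and return the set of md5 keys to select."""
    return {row["md5"] for row in csv.DictReader(io.StringIO(text))}
```

The set returned by `read_picklist_keys` is then all that is needed to pull the matching signatures out of whatever collection holds them.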

this is quite different from the greyhound idea which is to use massive parallelism to search non-overlapping subsets of databases.

@luizirber
Member

> For mastiff, luiz has, I think, primarily sped up this prefetch functionality. Actual Index-like functionality that requires access to the signatures is completely distinct, and would be hard to implement on top of mastiff directly, without providing access to the signatures.

It is the same dilemma as with the LCA index: the signatures are not explicitly present, but you can recompute them. SBTs have both the index and the sigs explicitly in the distribution. In mastiff it is sidestepped by having two signature types, Internal (the sig is stored in rocksdb) and External (it points to a path). In greyhound (#1943) the index also holds a Storage and can load from ZipStorage (the External type in mastiff can easily be changed from a filesystem path to a path inside any Storage), and so the "distribution" becomes a zip file for the sig collection plus the files for the index per se.

> So maybe there is a useful distinction here for future API development:
>
>   • first, provide on disk or client/server prefetch-style access to massive databases;
>   • second, provide on disk or client/server access to signatures (kind of Storage-like, now that I think of it?)
>   • third, provide a generic mix-and-match Index class that combines the two to provide full Index class services that enable all the good things.

👍

> incidentally, one way to jerry-rig this directly into our current API setup is to have the prefetch server return precisely two pieces of information: the number of shared hashes, and the md5sum (or other unique key). This can then be used as picklists for more detailed analyses of signatures.

Technically this is what mastiff returns nowadays: number of shared hashes == containment * len(query), and the SRA accession is unique. Very easy to change to any other field available in the sig.
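As a quick check of that identity: containment of the query in a match is |Q ∩ M| / |Q|, so multiplying by len(query) gives back the overlap count. A small sketch, with a hypothetical helper name and toy hash sets:

```python
def shared_hashes(containment, query_size):
    """Recover the overlap count from containment; round() absorbs float error."""
    return round(containment * query_size)


# Toy example: 3 of the 8 query hashes are present in the match.
query = set(range(8))
match = {0, 1, 2, 100}
containment = len(query & match) / len(query)
n_shared = shared_hashes(containment, len(query))
```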

> this is quite different from the greyhound idea which is to use massive parallelism to search non-overlapping subsets of databases.

greyhound and mastiff are literally the same thing once they are built (API-wise). They both build in parallel, but greyhound is hard to serialize to disk, and mastiff gets it for free from being rocksdb-based.

One thing that #1943 is doing is splitting greyhound into a basic LinearIndex that uses massive parallelism for search, and from it a RevIndex can be built to optimize search (and avoid the parallel search, because you don't need to access all sigs anymore). The RevIndex::from_zipstorage method literally builds a LinearIndex first, and then calls .index() on it to generate a RevIndex. This is likely to be the API I'll use for the mastiff PR too (#2230).
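For readers who have not looked at the Rust code, the shape of that split can be sketched in Python. This only borrows the names `LinearIndex`, `RevIndex`, and `.index()` from the description above; it is not the greyhound or mastiff code. The `from_signatures` constructor, the dict-of-sets input, and the thread pool are assumptions made for the example.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


class LinearIndex:
    """Holds every signature and compares the query against each one."""
    def __init__(self, sigs):
        self.sigs = dict(sigs)  # key -> set of hashes

    def prefetch(self, query, workers=4):
        def overlap(item):
            key, hashes = item
            return key, len(query & hashes)

        # Each comparison is independent, so the scan parallelizes trivially.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            counts = list(pool.map(overlap, self.sigs.items()))
        return {key: n for key, n in counts if n > 0}

    def index(self):
        """Build a reverse index (hash -> keys) from the stored signatures."""
        by_hash = defaultdict(list)
        for key, hashes in self.sigs.items():
            for h in hashes:
                by_hash[h].append(key)
        return RevIndex(dict(by_hash))


class RevIndex:
    """Looks up only the query's hashes; never touches unrelated signatures."""
    def __init__(self, by_hash):
        self.by_hash = by_hash

    @classmethod
    def from_signatures(cls, sigs):
        # Mirrors the described flow: build a LinearIndex, then call .index().
        return LinearIndex(sigs).index()

    def prefetch(self, query):
        counts = Counter()
        for h in query:
            counts.update(self.by_hash.get(h, ()))
        return dict(counts)
```

Both classes answer the same prefetch question with the same result; the difference is that the linear scan costs work proportional to the whole collection, while the reverse lookup costs work proportional to the query.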

@ctb
Contributor Author

ctb commented Jul 14, 2024

an update, nearly two years later:

So in this sense they are in fact excellent examples of prefetch-only Index classes :)
