[MRG] refactor `gather` functionality for speed & modularity; provide `prefetch` functionality. #1370
Codecov Report
@@ Coverage Diff @@
## latest #1370 +/- ##
==========================================
+ Coverage 89.68% 95.29% +5.61%
==========================================
Files 124 99 -25
Lines 19966 17399 -2567
Branches 1515 1585 +70
==========================================
- Hits 17906 16581 -1325
+ Misses 1831 591 -1240
+ Partials 229 227 -2
Update: merged into this PR already. Some more thoughts now that I've gotten this first pass implemented - make prefetch a method on
@luizirber suggested adding some information to
another thought for changes - here or in another PR - is to make sure that the top-level APIs for search and Index are using Right now most of the top-level APIs do use
👍 👍 (but I think punt to another issue, this is already too long...) EDIT: punted to #1514
this is actually there --
but it relies on having actual matches. I'm not sure how to do a progress indicator given that
thanks for merging! did we want to add this unload branch in?
-> #1513
Fixes #1310.

This PR revamps the overall `gather` functionality in sourmash to be much more flexible and approximately 50% faster based on benchmarking, as well as (much) less memory intensive. The ultimate goal is to support oxidation of `gather` per #1226, but this also adds a lot of nice new functionality.

Changes and additions:

- `gather_databases` to use a new `CounterGather` interface that supports optimized `gather` functionality with `peek` and `consume`; these objects are returned by a new `Index.counter_gather(...)` method. See #1370 for detailed discussion and benchmarking.
- `prefetch` command-line subcommand that does a streaming pass across the provided databases, doing so either with a (potentially optimized) `Index.find(...)` call from #1392 ([MRG] Rework the `find` functionality for `Index` classes), or with a linear pass across all signatures yielded by `Index.signatures()`.
- `Index.gather(...)` functionality to use `Index.prefetch(...)` underneath by default;
- `sourmash gather` commands with all four combinations of prefetch and linear.
- `--save-prefetch` option on `sourmash gather` to save all of the overlapping signatures before digesting them down to the min set cover.
- `LazyLinearIndex(...)` class that defers signature selection until as late as possible, to support better prefetch.
- `sourmash search --save-matches`
- `Index.find(...)` to return an `IndexSearchResult` to preserve location.
- `signatures_with_location()` to `Index` base class, so as to preserve location in base `Index.find(...)` functionality

Note: this PR includes the new `--save-matches` behavior from #1493.

Motivation, inspiration, and related issues

The prefetch functionality is inspired by `prefetch_gather` in genome-grist as well as greyhound #1226.

This PR makes it fairly straightforward to support reporting of all matches with equal containment for large databases, which can help address #278, #707, #1366, and #1198.

In particular, we can now support better tie-breaking in gather-style searching (#1366), as well as "reverse" gather (#1198).
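
As a rough illustration of why a streaming prefetch pass makes equal-containment matches easy to report, here is a minimal sketch using plain Python sets in place of sketches. The names `prefetch`, `best_matches`, and the `threshold` parameter are hypothetical and are not the sourmash API; the real implementation works on `Index` objects and signatures.

```python
def prefetch(query, signatures, threshold=1):
    """Stream over (name, hashes) pairs and yield every overlap >= threshold.

    Toy stand-in only: 'signatures' is any iterable of (name, set_of_hashes).
    """
    query = set(query)
    for name, hashes in signatures:
        overlap = len(query & set(hashes))
        if overlap >= threshold:
            yield name, overlap


def best_matches(query, signatures, threshold=1):
    """Return the names of all matches tied for the largest overlap."""
    results = list(prefetch(query, signatures, threshold))
    if not results:
        return []
    top = max(overlap for _, overlap in results)
    # For a fixed query, equal overlap means equal containment of the query,
    # so every tied match can be reported rather than an arbitrary one.
    return [name for name, overlap in results if overlap == top]
```

Because the pass keeps every overlapping match rather than only the first best one, ties are available to whatever tie-breaking rule is applied afterwards.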
CounterGather interface

A key part of this PR is the addition and systematic usage of a `CounterGather` interface, based on @luizirber's PR #1311, which in turn is motivated by the greyhound issue and pull request.

The core code is located in `src/sourmash/index.py`, which provides a `CounterGather` class that collects and prioritizes matches for gather and also supports cross-database gather. This is a query-specific object created and returned by the `Index.counter_gather(...)` method, which (by default) uses `Index.prefetch(...)` underneath to do a single pass across the database and collect all possible matches.

The key piece of the refactoring is that `CounterGather` now provides two methods, `peek(query)` and `consume(...)`. `peek(query)` provides the best containment result from this counter, but does not adjust any of the internal information; `consume(...)` is used to remove a match (potentially from multiple `CounterGather` objects from multiple databases).

Below is some demo code implementing multi-database gather; it is essentially what is used by `_find_best` in `src/sourmash/search.py`.
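
A minimal sketch of how multi-database gather can be built on the `peek`/`consume` split, assuming toy counters that hold plain Python sets of hashes. `ToyCounter` and `multi_gather` are hypothetical names for illustration and are not the code in `src/sourmash/search.py`; the real `CounterGather` methods take and return different objects.

```python
class ToyCounter:
    """Hypothetical per-database, query-specific collection of matches."""

    def __init__(self, matches):
        # matches: dict of match name -> set of hashes shared with the query
        self.matches = {name: set(hashes) for name, hashes in matches.items()}

    def peek(self, query):
        """Return (name, intersection) for the best remaining match, or None.

        Read-only: the counter's internal state is not changed.
        """
        best = None
        for name, hashes in self.matches.items():
            found = hashes & query
            if found and (best is None or len(found) > len(best[1])):
                best = (name, found)
        return best

    def consume(self, found):
        """Remove already-assigned hashes from every stored match."""
        for name in list(self.matches):
            self.matches[name] -= found
            if not self.matches[name]:
                del self.matches[name]


def multi_gather(query, counters):
    """Greedy gather across several counters; yields (name, n_hashes)."""
    query = set(query)
    while query:
        # Peek at every counter, then keep the best result overall.
        results = [r for r in (c.peek(query) for c in counters) if r is not None]
        if not results:
            break
        name, found = max(results, key=lambda r: len(r[1]))
        yield name, len(found)
        # Remove the match from the query and from all counters.
        query -= found
        for c in counters:
            c.consume(found)
```

In this sketch the best match is chosen across all counters at each round, and `consume` is applied to every counter so that hashes assigned to a match in one database are not counted again in another.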
Implementation checklist etc.

TODO:

- `prefetch --linear` to force a linear pass across databases, e.g. for the current large GenBank SBTs.
- `--prefetch`
- `--save-prefetch` to `sourmash gather`
- `Index.prefetch` method at Python level
- `SaveSignaturesToLocation` code to a new PR.
- #1370, etc.
- `LazyLinearIndex` for laziness!
- `LazyLinearIndex` bool and len
- `ZipFileLinearIndex` bool
- `ValueError` catch for prefetch_databases