
[MRG] refactor gather functionality for speed & modularity; provide prefetch functionality. #1370

Merged
merged 258 commits into latest from add/prefetch_cli
May 10, 2021

Conversation

ctb
Contributor

@ctb ctb commented Mar 6, 2021

Fixes #1310.

This PR revamps the overall gather functionality in sourmash to be much more flexible, approximately 50% faster in benchmarks, and much less memory intensive. The ultimate goal is to support oxidation of gather per #1226, but this also adds a lot of nice new functionality.

Changes and additions:

  • refactors gather_databases to use a new CounterGather interface that supports optimized gather functionality with peek and consume; these objects are returned by a new Index.counter_gather(...) method. See the detailed discussion and benchmarking in this PR (#1370).
  • adds a prefetch command-line subcommand that does a streaming pass across the provided databases, either with a (potentially optimized) Index.find(...) call from #1392, or with a linear pass across all signatures yielded by Index.signatures().
  • refactors Index.gather(...) to use Index.prefetch(...) underneath by default.
  • tests all sourmash gather commands with all four combinations of prefetch and linear.
  • provides a --save-prefetch option on sourmash gather to save all of the overlapping signatures before digesting them down to the min set cover.
  • creates a new LazyLinearIndex(...) class that defers signature selection until as late as possible, to support better prefetch.
  • adds a test for the (previously untested) sourmash search --save-matches.
  • changes Index.find(...) to return an IndexSearchResult, to preserve location.
  • adds .signatures_with_location() to the Index base class, so as to preserve location in the base Index.find(...) functionality.
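
For intuition, the streaming prefetch pass described in the list above can be pictured as a one-pass filter over signatures. Below is a toy, set-based sketch (hash sets stand in for MinHash objects, and the scaled=1000 conversion from hash count to estimated bp overlap is an illustrative assumption); it is not sourmash's actual implementation:

```python
# Toy sketch of a streaming prefetch pass: scan all signatures once,
# yielding every one whose overlap with the query meets threshold_bp.
def prefetch(query_hashes, signatures, threshold_bp, scaled=1000):
    """Yield (name, overlap_bp) for each signature overlapping the query."""
    for name, hashes in signatures:
        # estimate bp overlap from the number of shared hashes
        overlap_bp = len(query_hashes & hashes) * scaled
        if overlap_bp >= threshold_bp:
            yield name, overlap_bp

query = {1, 2, 3, 4, 5}
db = [("a", {1, 2, 3}), ("b", {9}), ("c", {3, 4, 5, 6})]
matches = list(prefetch(query, db, threshold_bp=2000))
# matches == [("a", 3000), ("c", 3000)]
```

Because the pass is streaming, nothing beyond the matches themselves needs to be held in memory, which is what makes --save-prefetch cheap.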

Note: this PR includes the new --save-matches behavior from #1493.

Motivation, inspiration, and related issues

The prefetch functionality is inspired by prefetch_gather in genome-grist, as well as by greyhound (#1226).

This PR makes it fairly straightforward to support reporting of all matches with equal containment for large databases, which can help address #278, #707, #1366, and #1198.

In particular, we can now support better tie-breaking in gather-style searching (#1366), as well as "reverse" gather (#1198).

CounterGather interface

A key change in this PR is the addition and systematic use of a CounterGather interface, based on @luizirber's PR #1311, which in turn is motivated by the greyhound issue and pull request.

The core code is located in src/sourmash/index.py, which provides a CounterGather class that collects and prioritizes matches for gather and also supports cross-database gather. This object is a query-specific object created and returned by the Index.counter_gather(...) method which (by default) uses Index.prefetch(...) underneath to do a single pass across the database and collect all possible matches.

The key piece of the refactoring is that CounterGather now provides two methods, peek(query) and consume(...). peek(query) reports the best containment result from this counter without adjusting any internal state; consume(...) is then used to remove a match (potentially from multiple CounterGather objects across multiple databases).
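
As a rough illustration of the peek/consume pattern, here is a toy, set-based sketch using Python's collections.Counter; hash sets stand in for MinHash objects, and the method signatures are simplified, so this is not the actual src/sourmash/index.py API:

```python
from collections import Counter

class ToyCounterGather:
    """Toy peek/consume counter: tracks each stored match's overlap
    with a shrinking query, best match first."""
    def __init__(self, query_hashes):
        self.query = set(query_hashes)
        self.matches = {}          # name -> hash set
        self.counter = Counter()   # name -> current overlap with query

    def add(self, name, hashes):
        overlap = len(self.query & hashes)
        if overlap:
            self.matches[name] = set(hashes)
            self.counter[name] = overlap

    def peek(self):
        # report the current best match without changing any state
        if not self.counter:
            return None
        name, count = self.counter.most_common(1)[0]
        if count == 0:
            return None
        return name, self.query & self.matches[name]

    def consume(self, intersect):
        # remove the consumed hashes from the query and recount
        self.query -= intersect
        for name, hashes in self.matches.items():
            self.counter[name] = len(self.query & hashes)

cg = ToyCounterGather({1, 2, 3, 4, 5})
cg.add("a", {1, 2, 3})
cg.add("b", {3, 4})
order = []
while (best := cg.peek()) is not None:
    name, intersect = best
    order.append(name)
    cg.consume(intersect)   # greedy gather loop: "a" first, then "b"
```

Separating peek from consume is what makes cross-database gather possible: each database's counter can be peeked independently, and only the single best match is consumed from all of them.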

Below is some demo code implementing multi-database gather; it is essentially what is used by _find_best in src/sourmash/search.py.

    def gather(self, query, threshold_bp):
        results = []

        best_result = None
        best_intersect_mh = None

        # find the best score across multiple counters, without consuming       
        for counter in self.counters:
            result = counter.peek(query.minhash, query.minhash.scaled,
                                  threshold_bp)
            if result:
                (sr, intersect_mh) = result

                if best_result is None or sr.score > best_result.score:
                    best_result = sr
                    best_intersect_mh = intersect_mh

        if best_result:
            # remove the best result from each counter                          
            for counter in self.counters:
                counter.consume(best_intersect_mh)

            # and done!                                                         
            return [best_result]
        return []

Implementation checklist etc.

TODO:

@codecov

codecov bot commented Mar 6, 2021

Codecov Report

Merging #1370 (92a2511) into latest (18cd040) will increase coverage by 5.61%.
The diff coverage is 99.54%.


@@            Coverage Diff             @@
##           latest    #1370      +/-   ##
==========================================
+ Coverage   89.68%   95.29%   +5.61%     
==========================================
  Files         124       99      -25     
  Lines       19966    17399    -2567     
  Branches     1515     1585      +70     
==========================================
- Hits        17906    16581    -1325     
+ Misses       1831      591    -1240     
+ Partials      229      227       -2     
Flag Coverage Δ
python 95.29% <99.54%> (+0.36%) ⬆️
rust ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/sourmash/index.py 97.63% <97.81%> (+3.01%) ⬆️
tests/test_index.py 99.84% <99.61%> (-0.16%) ⬇️
src/sourmash/cli/__init__.py 95.69% <100.00%> (+0.04%) ⬆️
src/sourmash/cli/gather.py 100.00% <100.00%> (ø)
src/sourmash/cli/prefetch.py 100.00% <100.00%> (ø)
src/sourmash/commands.py 86.25% <100.00%> (+1.72%) ⬆️
src/sourmash/lca/lca_db.py 91.22% <100.00%> (ø)
src/sourmash/sbt.py 80.85% <100.00%> (ø)
src/sourmash/search.py 96.44% <100.00%> (+3.17%) ⬆️
src/sourmash/sourmash_args.py 94.41% <100.00%> (+0.04%) ⬆️
... and 37 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 18cd040...92a2511.

@ctb
Contributor Author

ctb commented Mar 6, 2021

Update: merged into this PR already.


Some more thoughts now that I've gotten this first pass implemented -

make prefetch a method on Index, maybe?

this fits with @luizirber's dishwasher thinking re greyhound and the 'counter_gather' method in #1311, but I would suggest instead having the prefetch method return another Index object that can be used for gather or search.

i.e., from this comment on the greyhound issue:

For step 1, the query is loaded in memory and then all the reference sigs are loaded in parallel and checked for intersection size with the query (map). If they are above the threshold_bp, the hashes are added to a LCA-like DB (reduce). A Counter-like (in the Python sense) structure is built from the DB, for ordering references by intersection size with the query.

For step 2, while there are matches in the counter the top match is selected, and each reference in the counter has their count decreased if they have hashes in the top match (using the revindex to find which references to decrease).

I would have step 1 return an Index data structure to support step 2. Then we'd use the normal gather code for everything else.

note: this is started in #1371
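
The two-step scheme quoted above (build a Counter plus a reverse index in one pass, then repeatedly take the top match and use the revindex to decrement overlapping references) can be sketched in toy form with plain Python structures. Set-based signatures and the scaled=1000 bp conversion are assumptions here; this is not greyhound's actual Rust implementation:

```python
from collections import Counter, defaultdict

def build_counter(query, refs, threshold_bp, scaled=1000):
    """Step 1: one pass over references; keep those above threshold_bp
    and build a hash -> {reference names} reverse index."""
    counter = Counter()
    revindex = defaultdict(set)
    for name, hashes in refs:
        intersect = query & hashes
        if len(intersect) * scaled >= threshold_bp:
            counter[name] = len(intersect)
            for h in intersect:
                revindex[h].add(name)
    return counter, revindex

def greedy_gather(query, refs, threshold_bp, scaled=1000):
    """Step 2: repeatedly select the top match; use the revindex to
    decrement every other reference sharing hashes with it."""
    counter, revindex = build_counter(query, refs, threshold_bp, scaled)
    hashes_by_name = dict(refs)
    order = []
    while counter:
        name, count = counter.most_common(1)[0]
        if count == 0:
            break
        order.append(name)
        del counter[name]
        for h in query & hashes_by_name[name]:
            for other in revindex[h]:
                if other in counter:
                    counter[other] -= 1
            revindex[h].clear()  # each hash is only consumed once
    return order
```

For example, with references a={1,2,3,4}, b={3,4,5}, c={5,6} against query {1..6}, "a" is taken first, after which "c" (2 novel hashes) outranks "b" (1 remaining hash).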

@ctb ctb mentioned this pull request May 8, 2021
@ctb
Contributor Author

ctb commented May 8, 2021

@luizirber suggested adding some information to sourmash prefetch and (presumably) gather --prefetch about how many signatures have been collected, and what input file it's on.

@ctb
Contributor Author

ctb commented May 9, 2021

another thought for changes - here or in another PR - is to make sure that the top-level APIs for search and Index are using SourmashSignature and not MinHash objects. That protects us from specific details of the sketch, and potentially lets us make use of fine-tuned, or more advanced, or just plain different sketch types.

Right now most of the top-level APIs do use query rather than query_mh, but the peek and consume methods and the make_gather_query function take MinHash directly. Just something to think about.

@luizirber
Member

luizirber commented May 10, 2021

another thought for changes - here or in another PR - is to make sure that the top-level APIs for search and Index are using SourmashSignature and not MinHash objects. That protects us from specific details of the sketch, and potentially lets us make use of fine-tuned, or more advanced, or just plain different sketch types.

Right now most of the top-level APIs do use query rather than query_mh, but the peek and consume methods and the make_gather_query function take MinHash directly. Just something to think about.

👍 👍

(but I think punt to another issue, this is already too long...)

EDIT: punted to #1514

@ctb
Contributor Author

ctb commented May 10, 2021

@luizirber suggested adding some information to sourmash prefetch and (presumably) gather --prefetch about how many signatures have been collected, and what input file it's on.

this is actually there --

loading signatures from 'gtdb-r202.genomic-reps.k31.zip'
total of 78 matching signatures.

but it relies on having actual matches. I'm not sure how to do a progress indicator given that Index.find doesn't provide any information about how many signatures have been searched.

@luizirber luizirber merged commit f60c44d into latest May 10, 2021
@luizirber luizirber deleted the add/prefetch_cli branch May 10, 2021 18:28
@ctb
Contributor Author

ctb commented May 10, 2021

One random thought on the memory stuff - I don't think we're passing in unload_data to the SBT, maybe that would do it?

Yeah, I used the hammer in add/prefetch_cli...add/prefetch_cli_unload and set unload to True as default, will report soon.

thanks for merging! did we want to add this unload branch in?

@luizirber
Member

thanks for merging! did we want to add this unload branch in?

-> #1513

Successfully merging this pull request may close these issues:

maybe add new command, sourmash prefetch