[MRG] Speed up sourmash gather by ignoring unidentifiable hashes #1613
@bluegenes I have one test to add, but this is ready to review. In particular, if you can try this out on #1552, I'd be much obliged :).
Codecov Report
@@ Coverage Diff @@
## latest #1613 +/- ##
==========================================
+ Coverage 81.29% 89.49% +8.19%
==========================================
Files 103 76 -27
Lines 10485 6843 -3642
Branches 1217 1228 +11
==========================================
- Hits 8524 6124 -2400
+ Misses 1753 510 -1243
- Partials 208 209 +1
huh. Using an old copy of @bluegenes's hanging sig from #1552, it still takes a really long time to run.
Ready for review and merge @bluegenes!
nice!
keywords used by genome-grist: known and unknown hashes.
This PR refactors the search.gather_databases(...) generator into an iterator class, GatherDatabases(...), that should be much faster.
The three main optimizations are:
MinHash.remove_many(...) method from "[MRG] Improving MinHash.remove_many(...) performance" (#1571).
Should fix #1552 but @bluegenes will need to confirm.
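For readers unfamiliar with the pattern, here is a rough sketch of what "generator turned into an iterator class" plus "ignore unidentifiable hashes" could look like. It is not the code in this PR and does not use sourmash's API: `ToyGather`, its attributes, and the use of plain Python sets in place of MinHash sketches are all hypothetical, chosen only to illustrate the idea. The assumption is that hashes present in no database can never be assigned to a match, so they can be dropped once in the constructor rather than carried through every round.

```python
class ToyGather:
    """Greedy gather over plain sets of hashes (hypothetical; not sourmash's API)."""

    def __init__(self, query_hashes, databases):
        # databases: dict mapping a name to an iterable of hashes (assumed shape).
        self.databases = {name: set(hashes) for name, hashes in databases.items()}
        query = set(query_hashes)
        # Union of every hash that appears in at least one database.
        known = set().union(*self.databases.values()) if self.databases else set()
        # Drop unidentifiable hashes once, up front; only count them.
        self.n_unknown = len(query - known)
        self.remaining = query & known

    def __iter__(self):
        return self

    def __next__(self):
        if not self.remaining:
            raise StopIteration
        best_name, best_overlap = None, set()
        # Sorted iteration makes tie-breaking deterministic (first name wins).
        for name in sorted(self.databases):
            overlap = self.remaining & self.databases[name]
            if len(overlap) > len(best_overlap):
                best_name, best_overlap = name, overlap
        if not best_overlap:
            raise StopIteration
        # Remove the hashes just assigned so later rounds see less work.
        self.remaining -= best_overlap
        return best_name, len(best_overlap)


if __name__ == "__main__":
    query = {1, 2, 3, 4, 5, 99, 100}
    dbs = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}
    gather = ToyGather(query, dbs)
    print("unknown hashes:", gather.n_unknown)
    for name, n_assigned in gather:
        print(name, n_assigned)
```

Keeping the state on an object, rather than in generator locals, means callers can inspect values such as the unknown-hash count before or between iterations, which is one plausible reason to prefer a class here.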