-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] fix issue with identifiers in the LCA code #1347
Conversation
Codecov Report
@@ Coverage Diff @@
## latest #1347 +/- ##
==========================================
+ Coverage 88.74% 94.11% +5.36%
==========================================
Files 123 96 -27
Lines 18124 14593 -3531
Branches 1399 1405 +6
==========================================
- Hits 16084 13734 -2350
+ Misses 1801 621 -1180
+ Partials 239 238 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
I think this is ready, @luizirber and @bluegenes. And I think it should go into 4.0 release, because it is a problem that is triggered by #1179, which is one of the "breaking backwards compatibility" PRs going into 4.0. |
Fixed! thanks @bluegenes! |
thanks! |
* update release docs for latest build config * update release notes to include #1347/lca fixes * add note about database compatibility * add numerical/results stability * update docs to include example of sourmash sketch stdin * update command line example with better clustering output
This fixes some problems with identifiers due to the changes to
SourmashSignature.name
in #1179.This should probably go into 4.0.0...
Brief writeup of my understanding of the issues so far:
sig.name
can be empty if no name is given on the signature, but this then causes problems in bothsourmash lca index
andLCA_Database.insert(...)
sourmash lca index
- if the identifiers are identical (in this case''
, but in whatever case) then the number of leftover signatures gets mis-reported.A particularly simple situation where this arises is in the test
test_index_empty_sketch_name
, where we:sourmash sketch dna
, without providing a name;lca index
;sig.name
is now empty, which triggers a problem inLCA_Database.insert(...)
with duplicate identifiers;record_no_lineage
is now miscounted as 1 instead of 2, because we used a set.This PR fixes the following issues:
sourmash sketch
andsourmash compute
no longer create empty signatures from empty files and stdin;sourmash sketch
andsourmash compute
setsig.filename
to empty string when filename is-
;sourmash lca index
,record_no_lineage
is now a list instead of set;sourmash lca index
,ident
is taken fromsig.filename
ifsig.name
is empty.LCA_Database
,ident
defaults tostr(sig)
TODO:
sketch
filename behavior on stdin (punted to adjust screed to allow replacement of stdin input #1348)lca_db.insert(...)
testssourmash sig describe
behaviorChecklist
make test
Did it pass the tests?make coverage
Is the new code covered?without a major version increment. Changing file formats also requires a
major version number increment.
changes were made?