[WIP] Fix problem where tree search is truncated incorrectly. #244

Merged
merged 35 commits into from
Oct 25, 2017

@ctb ctb commented May 20, 2017

Fixes #200, where tree search is truncated on containment measures (number of shared hashes) rather than similarity.

The test in here uses --best-only to demonstrate the behavior because it is easy, but the bug will also show up in other searches; this is the line of code in sbtmh.py that needs to be fixed:

if len(mins) and float(matches) / len(mins) >= threshold:

In this line, we also need to take into account the number of mins in the search signature. This requires modification of the data stored at each internal node of the SBT.
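
To illustrate why truncating on containment differs from truncating on similarity, here is a minimal sketch that uses plain Python sets in place of MinHash sketches. The helper names `containment` and `jaccard` are made up for this example and are not sourmash functions; the sets are toy data, and the code assumes Python 3 division.

```python
def containment(query, other):
    """Fraction of the query's hashes that also appear in `other`."""
    return len(query & other) / len(query) if query else 0.0


def jaccard(query, other):
    """Shared hashes divided by the size of the union of both sets."""
    union = query | other
    return len(query & other) / len(union) if union else 0.0


# Toy "signatures": sets of integers standing in for hash values.
query = set(range(10))
close = set(range(9)) | {100}   # nearly the same content as the query
big = set(range(100))           # contains the whole query plus much more

scores = {
    name: (containment(query, sig), jaccard(query, sig))
    for name, sig in [("close", close), ("big", big)]
}

# The two measures disagree about which candidate is the best match.
best_by_containment = max(scores, key=lambda name: scores[name][0])
best_by_jaccard = max(scores, key=lambda name: scores[name][1])
```

Here `big` wins on containment because it holds every query hash, while `close` wins on similarity because its union with the query is small. A search that prunes on the first measure can therefore discard the match that the second measure would rank highest.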

Additionally, this PR:

This PR needs some cleanup as well. The gather functionality should be revisited before a merge is requested, although it does pass all tests.

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

ctb commented Sep 20, 2017

@luizirber your thoughts on the use of max_n_below on the SBT are requested! Briefly, I decorate the SBT internal nodes with a new metadata dictionary and store that dictionary in the SBT JSON file.
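
As a rough sketch of the idea (not the code in this PR), the following decorates internal nodes of a small tree with a metadata dictionary and writes it out as JSON. The `Node` class, `annotate`, and `to_json` are hypothetical stand-ins rather than sourmash's SBT classes, and the sketch assumes that `max_n_below` means the largest number of hashes in any signature beneath a node.

```python
import json


class Node:
    """Hypothetical tree node for illustration; not sourmash's SBT node."""

    def __init__(self, name, children=(), n_hashes=0):
        self.name = name
        self.children = list(children)
        self.n_hashes = n_hashes   # only meaningful for leaves
        self.metadata = {}         # per-node metadata dictionary


def annotate(node):
    """Fill metadata['max_n_below'] bottom-up and return the subtree value."""
    if not node.children:
        return node.n_hashes
    largest = max(annotate(child) for child in node.children)
    # Assumed meaning: largest leaf signature size found below this node.
    node.metadata["max_n_below"] = largest
    return largest


def to_json(node):
    """Convert the tree to plain dicts so the metadata is saved with it."""
    return {
        "name": node.name,
        "metadata": node.metadata,
        "children": [to_json(child) for child in node.children],
    }


leaves = [Node("a", n_hashes=500), Node("b", n_hashes=1200), Node("c", n_hashes=80)]
inner = Node("inner", children=leaves[:2])
root = Node("root", children=[inner, leaves[2]])

annotate(root)
serialized = json.dumps(to_json(root))
```

Storing the value in the serialized tree means it does not have to be recomputed from the leaves on every search, at the cost of needing to rebuild or re-annotate trees saved before the metadata existed.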

query_ksize = query.minhash.ksize

# calculate the band size/resolution R for the genome
R_metagenome = sourmash_lib.MAX_HASH / float(orig_query.minhash.max_hash)
Member

We can use get_scaled_for_max_hash here
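
For readers unfamiliar with the suggestion, a helper of this kind would wrap the division shown in the snippet above. The sketch below is a stand-in written for this note, not the library function: the name `scaled_for_max_hash`, the value of `MAX_HASH`, and the rounding and zero handling are all assumptions, and the real `get_scaled_for_max_hash` may behave differently.

```python
# Assumption: hashes are 64-bit unsigned integers, so this is the largest value.
MAX_HASH = 2**64 - 1


def scaled_for_max_hash(max_hash):
    """Hypothetical helper: convert a max_hash cutoff to a scaled factor."""
    if max_hash == 0:
        # Assumed convention: 0 means no max_hash cutoff is in use.
        return 0
    return int(round(MAX_HASH / float(max_hash)))


# A cutoff that keeps roughly one hash in every thousand.
max_hash = MAX_HASH // 1000
scaled = scaled_for_max_hash(max_hash)
```

Putting the calculation behind one function keeps the float conversion and rounding in a single place instead of repeating the raw division at each call site.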

@luizirber
Member

Looking good so far!

ctb commented Oct 25, 2017

This is substantial enough that maybe it can be merged now - it has tests for the critical bug that it fixes, and resolving merge conflicts is getting annoying. @luizirber thoughts?

@luizirber
Member

Seems like tests/test-data/genome-s10+s11.fa.gz is missing, @ctb

ctb commented Oct 25, 2017 via email

ctb commented Oct 25, 2017

Huh. That's a weird py2.7 error....

@luizirber
Member

@ctb random scipy installation error is gone, but now the test is failing: https://travis-ci.org/dib-lab/sourmash/jobs/292597698#L630

@luizirber
Member

I can track down the error, pretty sure it is a division problem with py2...
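
For context, this is the general kind of Python 2 division pitfall being suspected here, not the diagnosed cause of the failing test. The function names are hypothetical. Because the sketch is written to run on Python 3, it uses `//` to reproduce what a bare `/` between two integers did under Python 2.

```python
def score_py2_style(matches, n_mins):
    """Mimic Python 2 integer division, which truncates the result."""
    return matches // n_mins


def score_portable(matches, n_mins):
    """Cast to float first so the ratio is the same on Python 2 and 3."""
    return float(matches) / n_mins


truncated = score_py2_style(7, 10)   # the fractional part is lost
portable = score_portable(7, 10)     # the fractional score is kept
```

A score that truncates to zero would never reach a fractional threshold, which is why such bugs tend to show up only on the Python 2 test run. Adding `from __future__ import division` at the top of a module is the other common way to get the same behavior on both versions.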

codecov-io commented Oct 25, 2017

Codecov Report

Merging #244 into master will increase coverage by 0.07%.
The diff coverage is 92.77%.

@@            Coverage Diff             @@
##           master     #244      +/-   ##
==========================================
+ Coverage   87.08%   87.15%   +0.07%     
==========================================
  Files          13       14       +1     
  Lines        2029     2087      +58     
  Branches       36       36              
==========================================
+ Hits         1767     1819      +52     
- Misses        261      267       +6     
  Partials        1        1

Impacted Files                  Coverage Δ
sourmash_lib/sbt.py             82.85% <100%> (+0.18%) ⬆️
sourmash_lib/commands.py        89.88% <100%> (-0.36%) ⬇️
sourmash_lib/sbtmh.py           85.24% <84.37%> (-0.76%) ⬇️
sourmash_lib/sourmash_args.py   94.36% <85.71%> (-1.02%) ⬇️
sourmash_lib/search.py          94.84% <94.84%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2b14671...4322140.

@luizirber luizirber merged commit 2c2552a into master Oct 25, 2017
@luizirber luizirber deleted the bug/sbt_similarity branch October 25, 2017 20:26
@ctb ctb mentioned this pull request Jun 2, 2018