Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Divide non-DNA MinHash ksize by 3 for external consumption. #1277

Merged
merged 41 commits into from
Jan 19, 2021

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jan 16, 2021

This is a "light touch" fix to #1271, which makes it so that the entire Python (and CLI) layer of sourmash does protein/hp/dayhoff ksizes correctly: DNA k-mer sizes are the same, protein k-mer sizes are the lengths of the actual amino acid k-mers being used, and translated DNA k-mers are translated into the correct protein k-mer size.

This PR changes almost everything on the command-line and Python API side to match this - in particular, command line selectors are now different.
The one exception is sourmash compute which still uses the "old" meaning of ksize for backwards compatibility reasons.

No changes apply to existing signature or database formats, so this is forwards and backwards compatible with 3.5!

The key changes are in the MinHash wrapper class in src/sourmash/minhash.py:

  • before passing ksize into the Rust layer, the constructor multiples the ksize by 3 for non-DNA minhashes.
  • when retrieving ksize from the Rust layer, the ksize property divides the ksize by 3 for non-DNA minhashes.

Along the way, I also had to fix the LCA database to save/load its database correctly, with the adjusted ksize for non-DNA databases.

Most of the rest of the changes are minor fixes to command-line parameters or output, except for -

Fixes #1271.
Includes #1019

TODO:

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

ctb and others added 30 commits June 5, 2020 09:43
@ctb ctb changed the title [WIP] Divide non-DNA signature ksizes by 3 for external consumption. [WIP] Divide non-DNA MinHash ksize by 3 for external consumption. Jan 16, 2021
@codecov
Copy link

codecov bot commented Jan 16, 2021

Codecov Report

Merging #1277 (3994e8b) into latest (5378d2d) will increase coverage by 0.02%.
The diff coverage is 98.26%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1277      +/-   ##
==========================================
+ Coverage   88.67%   88.69%   +0.02%     
==========================================
  Files         125      125              
  Lines       18231    18281      +50     
  Branches     1434     1440       +6     
==========================================
+ Hits        16167    16215      +48     
- Misses       1818     1819       +1     
- Partials      246      247       +1     
Flag Coverage Δ
python 93.96% <98.26%> (+<0.01%) ⬆️
rust 67.33% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
tests/test_bugs.py 100.00% <ø> (ø)
tests/test_sbt.py 100.00% <ø> (ø)
src/sourmash/minhash.py 94.18% <84.61%> (-0.54%) ⬇️
src/sourmash/cli/compute.py 100.00% <100.00%> (ø)
src/sourmash/command_compute.py 95.52% <100.00%> (+0.09%) ⬆️
src/sourmash/lca/lca_db.py 93.05% <100.00%> (+0.12%) ⬆️
src/sourmash/sourmash_args.py 91.92% <100.00%> (ø)
tests/test__minhash.py 99.26% <100.00%> (ø)
tests/test_cmd_signature.py 100.00% <100.00%> (ø)
tests/test_index.py 100.00% <100.00%> (ø)
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5378d2d...3994e8b. Read the comment docs.

@ctb
Copy link
Contributor Author

ctb commented Jan 16, 2021

ready for review @luizirber @bluegenes

@luizirber
Copy link
Member

I'll leave the divide by 3 comments to @bluegenes (since she uses it =])

The key changes are in the MinHash wrapper class in src/sourmash/minhash.py:

  • before passing ksize into the Rust layer, the constructor multiples the ksize by 3 for non-DNA minhashes.
  • when retrieving ksize from the Rust layer, the ksize property divides the ksize by 3 for non-DNA minhashes.

Should these be moved into Rust, then? (not asking you to do it, but since both operations are 'fixing' what is being sent/retrieved from the Rust layer...)

@ctb
Copy link
Contributor Author

ctb commented Jan 18, 2021

The key changes are in the MinHash wrapper class in src/sourmash/minhash.py:

  • before passing ksize into the Rust layer, the constructor multiples the ksize by 3 for non-DNA minhashes.
  • when retrieving ksize from the Rust layer, the ksize property divides the ksize by 3 for non-DNA minhashes.

Should these be moved into Rust, then? (not asking you to do it, but since both operations are 'fixing' what is being sent/retrieved from the Rust layer...)

Before I wrote this PR, I would have said "no" or "maybe" to moving it to the Rust layer. Now I'm on "maybe" or "yes" - it's not that big a change, in practice! (I'm impressed with how clean it turned out to be.)

@taylorreiter
Copy link
Contributor

No comment on mv to rust layer, but

before passing ksize into the Rust layer, the constructor multiples the ksize by 3 for non-DNA minhashes.
when retrieving ksize from the Rust layer, the ksize property divides the ksize by 3 for non-DNA minhashes.

This is wonderful. If I'm in protein space and request k = 9, I want sourmash to give me 9 amino acids. It's so so confusing when that is not the case. If @bluegenes has a different preference/opinion, hers is probably better than mine :)

The one exception is sourmash compute which still uses the "old" meaning of ksize for backwards compatibility reasons.

So we still have to do the weird 3 thing...but only in part of sourmash?

@ctb
Copy link
Contributor Author

ctb commented Jan 19, 2021

The one exception is sourmash compute which still uses the "old" meaning of ksize for backwards compatibility reasons.

So we still have to do the weird 3 thing...but only in part of sourmash?

yes, and that part is getting deprecated in 4.0 and removed in 5.0.

The reason to keep it working the same is so that we don't mess up people who are using 2.x and 3.x sourmash compute terminology in their workflows. An alternative would be to explicitly "break" protein calculations via sourmash compute and require people to use sourmash sketch - but that seems ungentle.

@bluegenes
Copy link
Contributor

As I understand it:

  • sourmash sketch and the python api will now take the direct protein ksize
  • sourmash compute remains as-is, which is direct protein ksize *3
  • internals all use DNA ksizes, no matter the input alphabets.

I'm 100% on board with these! I do think that a deprecation warning for compute and/or detailed help message for how sketch differs from compute might be helpful for 4.0, though as you point out, we're likely the only ones using protein functionality.

This PR gets close to not needing to worry about protein ksize conversion!

Main question -- What is happening with selectors? It's not intuitive to sketch at protein k=10 but need to use k=30 to select that sketch from a sigfile, particularly given the fact that we're pushing towards no manual ksize corrections in sketch / python api.

@bluegenes
Copy link
Contributor

bluegenes commented Jan 19, 2021

This is wonderful. If I'm in protein space and request k = 9, I want sourmash to give me 9 amino acids. It's so so confusing when that is not the case. If @bluegenes has a different preference/opinion, hers is probably better than mine :)

compute has some other negatives for protein sketches, mainly that it's difficult to use specific ksizes for the different protein alphabets if you're sketching all at once. So I've entirely abandoned compute in favor of sketch, so have no issue with compute remaining as-is 🙂 🤷‍♀️

@ctb
Copy link
Contributor Author

ctb commented Jan 19, 2021

As I understand it:

* `sourmash sketch` and the python api will now take the direct protein ksize

yes!

* `sourmash compute` remains as-is, which is direct protein ksize *3

yes!

* internals all use DNA ksizes, no matter the input alphabets.

actually, no - Python internals use "correct" ksizes!!

I'm 100% on board with these! I do think that a deprecation warning for compute and/or detailed help message for how sketch differs from compute might be helpful for 4.0, though as you point out, we're likely the only ones using protein functionality.

absolutely. I'm working on that in documentation PRs and will also put in a deprecation here (and think about backporting it).

This PR gets close to not needing to worry about protein ksize conversion!

I think, for you, you would simply never need to worry about it, yes! Rust users might, but that's an army of 1.6, for now (standard 1.5x multiplier for Luiz, plus 0.1 for me).

Main question -- What is happening with selectors? It's not intuitive to sketch at protein k=10 but need to use k=30 to select that sketch from a sigfile, particularly given the fact that we're pushing towards no manual ksize corrections in sketch / python api.

Selectors work with the divide-by-3 number, so it's all consistent. The only odd-command-out is compute.

I will take all of this as license to proceed towards making this a merge-able PR - thanks, all!

@ctb
Copy link
Contributor Author

ctb commented Jan 19, 2021

Ready for review and merge!

@luizirber
Copy link
Member

(I'll do a new PR for moving the checks into Rust, but easier to have this one merged and then use the tests to move the code around =])

@ctb
Copy link
Contributor Author

ctb commented Jan 19, 2021

yay w00t!

@ctb ctb merged commit fe5d210 into latest Jan 19, 2021
@ctb ctb deleted the change_protein_kmer branch January 19, 2021 21:59
@ctb ctb changed the title [WIP] Divide non-DNA MinHash ksize by 3 for external consumption. [MRG] Divide non-DNA MinHash ksize by 3 for external consumption. Jan 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

adjust protein k-mer sizes for sourmash sketch?
4 participants