Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute oxidation #845

Merged
merged 2 commits into from
Jan 21, 2020
Merged

Compute oxidation #845

merged 2 commits into from
Jan 21, 2020

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented Jan 14, 2020

This PR expose functionality in a way that doesn't requiring moving all the Python compute code into Rust.

  • I moved the internal functions from compute (make_minhashes, add_seq, build_siglist, save_siglist) to the outside scope, and made all the arguments explicit. It would be hard to refactor otherwise...
  • add_sequence and add_protein methods for SourmashSignature. The idea here is to only cross the FFI layer once for each sequence, and being able to control better what can be improved on the Rust side (like using rayon for parallelization?).
  • add_protein in MinHash was also moved to Rust, and it is 100x faster 🤣 /cc @olgabot
  • Benchmarks in Python (for add_protein) and in Rust (for add_sequence)
  • add_sequence in Rust is way simpler, and way faster. Turns out doing simpler things and letting the compiler optimize was better than my micro-optimizations =P
    • but there is still a trick: I pre-calculate the reverse complement for the full sequence, and then while moving the k-mer window forward for the sequence I move another one backwards for the RC. The code for that is not complicated, but I need to add a comment to explain it (because it is not obvious at first sight what is happening).
      let kmer = &sequence[i..i + ksize];
      let krc = &rc[len - ksize - i..len - i];

Funny bits:

  • It is a gigantic PR because the ComputeParameters setters and getters are... a lot. But it's also very repetitive code (and maybe could become a macro, but then it becomes harder to reason about/fix bugs later).
  • [RFC] making compute CLI command usable inside Python #245 explodes the args in the compute function signature, but since Rust doesn't have default arguments for functions I went in the other direction there (collapsing all the args into a ComputeParameters struct). Python's default arguments are so nice.

Previous blurb:


I'm using this branch for implementing decoct compute and figuring out how to:

  1. expose functionality in a way that doesn't requiring moving all the Python compute code into Rust
  2. allows some parallelization if using the Rust compute parts (implemented with rayon)

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@luizirber luizirber added this to the 3.2 milestone Jan 14, 2020
@codecov
Copy link

codecov bot commented Jan 14, 2020

Codecov Report

Merging #845 into master will decrease coverage by 12.29%.
The diff coverage is 64.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #845      +/-   ##
==========================================
- Coverage   91.19%   78.89%   -12.3%     
==========================================
  Files          69       82      +13     
  Lines        4916     6993    +2077     
  Branches        0      479     +479     
==========================================
+ Hits         4483     5517    +1034     
- Misses        433     1174     +741     
- Partials        0      302     +302
Flag Coverage Δ
#pytests 90.54% <98.48%> (?)
#rusttests 47.29% <15.21%> (?)
Impacted Files Coverage Δ
src/core/src/signature.rs 33.73% <0%> (ø)
sourmash/_minhash.py 97.21% <100%> (-0.13%) ⬇️
sourmash/cli/compute.py 100% <100%> (ø) ⬆️
sourmash/exceptions.py 79.41% <100%> (+1.28%) ⬆️
sourmash/signature.py 87.67% <100%> (+1.07%) ⬆️
src/core/src/sketch/minhash.rs 32.2% <18.91%> (ø)
sourmash/command_compute.py 97.2% <99.1%> (+1.04%) ⬆️
src/core/src/from.rs 71.66% <0%> (ø)
src/core/src/lib.rs 51.63% <0%> (ø)
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d25d786...69ce7a6. Read the comment docs.

@luizirber luizirber force-pushed the rust_compute_decoct branch 4 times, most recently from 4a47061 to ab08a76 Compare January 18, 2020 03:12
@ctb ctb mentioned this pull request Jan 18, 2020
@luizirber luizirber force-pushed the rust_compute_decoct branch 4 times, most recently from c63d815 to 9f547cd Compare January 18, 2020 22:48
@luizirber luizirber changed the title [WIP] Compute oxidation Compute oxidation Jan 18, 2020
@luizirber luizirber force-pushed the rust_compute_decoct branch 5 times, most recently from c6f4f84 to ddf4bc5 Compare January 21, 2020 04:29
@ctb
Copy link
Contributor

ctb commented Jan 21, 2020

I held off on fixing sourmash compute to use FileOutput (#853) because of this PR, but you might want to consider tossing it in here. Or I can do a PR into this branch. lmk.

@luizirber
Copy link
Member Author

I held off on fixing sourmash compute to use FileOutput (#853) because of this PR, but you might want to consider tossing it in here. Or I can do a PR into this branch. lmk.

I updated it in 03bd393, does it match what you did in the other commands?

Also ready for review and merge @ctb

Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work!

return [sig]


def save_siglist(siglist, output_fp, filename=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe change output_fp to something like output_filename here?

add benchmarks for add_sequence
add_protein in Rust
add a test for params, fix scaled default
allow building ComputeParameters from compute argparse args
@luizirber luizirber merged commit a601b4a into master Jan 21, 2020
@luizirber luizirber deleted the rust_compute_decoct branch January 21, 2020 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants