Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve sketching performance #860

Closed
kloetzl opened this issue Jan 24, 2020 · 8 comments
Closed

improve sketching performance #860

kloetzl opened this issue Jan 24, 2020 · 8 comments

Comments

@kloetzl
Copy link
Contributor

kloetzl commented Jan 24, 2020

Mash has recently integrated some changes making it faster. I think sourmash could also benefit from them. However, my Rust isn't good enough for an actual pull request. So instead I will just point out the issues and suggest solutions.

In minhash.rs#L684 _checkdna is repeatedly called on the same characters. Mash achieved a 30% performance boost by adding just one more counter.

The function _checkdna uses a match statement which compiles to a chain of cmp and jumps. Using a lookup table will give you much better performance.

Best,
Fabian

@ctb
Copy link
Contributor

ctb commented Jan 24, 2020 via email

@kloetzl
Copy link
Contributor Author

kloetzl commented Jan 24, 2020

No problem. I think this should get you about 12% of performance. Maybe more. See the profile below.

sourmash

@luizirber
Copy link
Member

Mash has recently integrated some changes making it faster. I think sourmash could also benefit from them. However, my Rust isn't good enough for an actual pull request. So instead I will just point out the issues and suggest solutions.

This is what I suggest if you want to try it:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
git clone https://github.com/dib-lab/sourmash.git
cd sourmash
cargo bench -- add_sequence

There are 4 benchmarks for add_sequence, and (I think) they cover most cases in the add_sequence method.

In minhash.rs#L684 _checkdna is repeatedly called on the same characters. Mash achieved a 30% performance boost by adding just one more counter.

Nice, I'll try it out! I'm also looking into the faster revcomp discussion, which can also help us.

The function _checkdna uses a match statement which compiles to a chain of cmp and jumps. Using a lookup table will give you much better performance.

@camillescott has suggestions too, for keeping track of valid/invalid positions (using a deque).

There was also a PR for the ntHash crate with similar suggestions (lookup vs match), and it was way faster.

Thanks for the great suggestions!

@luizirber
Copy link
Member

Oh, another point: @kloetzl, did you run your tests with sourmash installed from pip, or from latest master? #845 brings some improvements and avoids _checkdna calls for each k-mer if the sequence is valid (only calls it once for the full sequence)

@kloetzl
Copy link
Contributor Author

kloetzl commented Jan 24, 2020

I used the current master (aka a601b4a). But I can't remember whether my profile was done on a clean bacterial genome, or if it did contain some characters other than ACGT. It was probably a clean bacterial genome with only ACGTs. So my performance boost estimate are off, but still you will see an effect.

@kloetzl
Copy link
Contributor Author

kloetzl commented Jan 24, 2020

There was also a PR for the ntHash crate with similar suggestions (lookup vs match), and it was way faster.

I am trying to assemble a number of these super fast DNA processing routines into a project I call libdna. It will take some time and/or help until its finished, though.

@luizirber
Copy link
Member

Fixed in #861, thanks @kloetzl!

@kloetzl
Copy link
Contributor Author

kloetzl commented Jan 26, 2020

I am glad that I could help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants