improve sketching performance #860

kloetzl · 2020-01-24T14:42:11Z

Mash has recently integrated some changes making it faster. I think sourmash could also benefit from them. However, my Rust isn't good enough for an actual pull request. So instead I will just point out the issues and suggest solutions.

In minhash.rs#L684 _checkdna is repeatedly called on the same characters. Mash achieved a 30% performance boost by adding just one more counter.
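To illustrate the counter idea, here is a minimal sketch. It is not the sourmash or Mash code. `is_valid_base` and `valid_kmer_starts` are hypothetical names, and the sketch assumes only upper-case A, C, G and T count as valid. Each base is checked exactly once, and a running count of consecutive valid bases decides whether the k-mer ending at the current position is usable.

```rust
/// Hypothetical validity check (assumption: only upper-case ACGT is valid).
fn is_valid_base(b: u8) -> bool {
    matches!(b, b'A' | b'C' | b'G' | b'T')
}

/// Returns the start offset of every k-mer made up only of valid bases.
/// Every base is inspected once; `run` counts the consecutive valid
/// bases ending at the current position.
fn valid_kmer_starts(seq: &[u8], k: usize) -> Vec<usize> {
    let mut starts = Vec::new();
    if k == 0 {
        return starts;
    }
    let mut run = 0usize;
    for (i, &b) in seq.iter().enumerate() {
        if is_valid_base(b) {
            run += 1;
        } else {
            // An invalid base invalidates every k-mer that overlaps it.
            run = 0;
        }
        if run >= k {
            // The k-mer ending at position i starts at i + 1 - k.
            starts.push(i + 1 - k);
        }
    }
    starts
}
```

For the input `ACGTNACGT` with `k = 3`, the k-mers overlapping the `N` are skipped without re-checking any of the neighbouring bases.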

The function _checkdna uses a match statement which compiles to a chain of cmp and jumps. Using a lookup table will give you much better performance.
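A lookup table for this could look roughly like the sketch below. The names `VALID_BASE` and `is_valid_dna` are made up for illustration, and the accepted alphabet (upper-case ACGT only) is an assumption; the real `_checkdna` may accept a different set of characters.

```rust
/// 256-entry table built at compile time: `true` only for A, C, G, T.
/// (Assumption: upper-case only; extend the table if lower-case should pass.)
const VALID_BASE: [bool; 256] = {
    let mut table = [false; 256];
    table[b'A' as usize] = true;
    table[b'C' as usize] = true;
    table[b'G' as usize] = true;
    table[b'T' as usize] = true;
    table
};

/// Checks a whole sequence with one table load per byte
/// instead of a chain of comparisons and jumps.
#[inline]
fn is_valid_dna(seq: &[u8]) -> bool {
    seq.iter().all(|&b| VALID_BASE[b as usize])
}
```

Because a `u8` can only index positions 0 to 255, the table access never goes out of bounds, so the compiler can usually drop the bounds check.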

Best,
Fabian

ctb · 2020-01-24T14:46:42Z

thank you!! stuff like this is red meat to @luizirber I think :)

kloetzl · 2020-01-24T15:02:35Z

No problem. I think this should gain you about 12% in performance, maybe more. See the profile below.

luizirber · 2020-01-24T16:40:53Z

> Mash has recently integrated some changes making it faster. I think sourmash could also benefit from them. However, my Rust isn't good enough for an actual pull request. So instead I will just point out the issues and suggest solutions.

This is what I suggest if you want to try it:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
git clone https://github.com/dib-lab/sourmash.git
cd sourmash
cargo bench -- add_sequence

There are 4 benchmarks for add_sequence, and (I think) they cover most cases in the add_sequence method.

> In minhash.rs#L684 _checkdna is repeatedly called on the same characters. Mash achieved a 30% performance boost by adding just one more counter.

Nice, I'll try it out! I'm also looking into the faster revcomp discussion, which can also help us.
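For reference, a table-driven reverse complement can be sketched as follows. This is only an illustration of the general technique, not the code from the revcomp discussion or from sourmash. `COMPLEMENT` and `revcomp` are hypothetical names, and mapping every non-ACGT byte to `N` is an assumption.

```rust
/// Complement table built at compile time.
/// (Assumption: anything that is not upper-case ACGT maps to N.)
const COMPLEMENT: [u8; 256] = {
    let mut table = [b'N'; 256];
    table[b'A' as usize] = b'T';
    table[b'C' as usize] = b'G';
    table[b'G' as usize] = b'C';
    table[b'T' as usize] = b'A';
    table
};

/// Walks the sequence backwards and complements each base
/// with a single table load.
fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter().rev().map(|&b| COMPLEMENT[b as usize]).collect()
}
```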

> The function _checkdna uses a match statement which compiles to a chain of cmp and jumps. Using a lookup table will give you much better performance.

@camillescott has suggestions too, for keeping track of valid/invalid positions (using a deque).

There was also a PR for the ntHash crate with similar suggestions (lookup vs match), and it was way faster.

Thanks for the great suggestions!

luizirber · 2020-01-24T19:04:41Z

Oh, another point: @kloetzl, did you run your tests with sourmash installed from pip, or from the latest master? #845 brings some improvements and avoids _checkdna calls for each k-mer if the sequence is valid (it only calls it once for the full sequence).

kloetzl · 2020-01-24T20:00:41Z

I used the current master (aka a601b4a). ~~But I can't remember whether my profile was done on a clean bacterial genome, or if it did contain some characters other than ACGT.~~ It was probably a clean bacterial genome with only ACGTs. So my performance boost estimates are off, but you should still see an effect.

kloetzl · 2020-01-24T20:57:51Z

> There was also a PR for the ntHash crate with similar suggestions (lookup vs match), and it was way faster.

I am trying to assemble a number of these super fast DNA processing routines into a project I call libdna. It will take some time and/or help until it's finished, though.

luizirber · 2020-01-25T23:02:35Z

Fixed in #861, thanks @kloetzl!

kloetzl · 2020-01-26T06:50:10Z

I am glad that I could help.

luizirber mentioned this issue Jan 24, 2020

Improve sketching performance with lookup tables for complement and DNA validation #861

Merged

5 tasks

luizirber closed this as completed Jan 25, 2020

kloetzl mentioned this issue Jan 26, 2020

Improve sketching performance for DNA #865

Merged

5 tasks
