Runtime improvement #20

Open · wants to merge 2 commits into main
Conversation

jsgounot

Hi,

First, thanks for this software. I've been running a lot of Badread jobs for a project, and while waiting for them to finish I found it interesting to check why simulating reads takes so long. I profiled Badread with cProfile using this command line:

python -m cProfile -s cumtime badread-runner.py simulate --reference test/kleb_MN.fasta --quantity 1x --error_model nanopore2020 --qscore_model nanopore2020 --identity 87.5,97.5,5 --glitches 1000,25,25 --junk_reads 1 --random_reads 1 --chimeras 1 --seed 50 > test/res

with kleb_MN being a Klebsiella pneumoniae strain I have on disk.

Here is the head of the cProfile results with the current version:

         223062325 function calls (223053151 primitive calls) in 124.887 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    746/1    0.015    0.000  124.892  124.892 {built-in method builtins.exec}
        1    0.000    0.000  124.892  124.892 badread-runner.py:1(<module>)
        1    0.027    0.027  124.862  124.862 __main__.py:27(main)
        1    0.027    0.027  124.255  124.255 simulate.py:32(simulate)
      381    5.318    0.014  116.793    0.307 simulate.py:237(sequence_fragment)
      381    9.946    0.026   81.135    0.213 qscore_model.py:35(get_qscores)
  5255819    7.593    0.000   53.411    0.000 qscore_model.py:289(get_qscore)
  9289110   44.446    0.000   53.282    0.000 random.py:506(choices)
  4033291    5.213    0.000   17.898    0.000 error_model.py:131(add_errors_to_kmer)
   455719    8.872    0.000    8.872    0.000 {edlib.align}
  4642048    1.152    0.000    8.022    0.000 random.py:366(randint)
  4642048    3.351    0.000    6.869    0.000 random.py:292(randrange)
  9289110    4.252    0.000    6.866    0.000 random.py:540(<listcomp>)

You can see that a lot of time is spent computing the qscores, mostly because of the millions of random.choices calls.
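To illustrate why those per-call draws are costly, here is a hypothetical micro-benchmark (not part of the PR; the score alphabet, weights and count below are made up) comparing one random.choices call per qscore with a single vectorised numpy draw of the whole batch:

```python
# Hypothetical micro-benchmark (not from the PR): drawing one weighted
# choice at a time with random.choices vs. one batched numpy draw.
import random
import timeit

import numpy as np

scores = list(range(1, 41))    # made-up qscore alphabet
weights = [1.0] * len(scores)  # made-up (uniform) weights
n = 100_000                    # number of qscores to draw

def one_at_a_time():
    # one random.choices call per qscore, as in the original code path
    return [random.choices(scores, weights=weights)[0] for _ in range(n)]

def batched():
    # a single vectorised draw of all n qscores
    rng = np.random.default_rng()
    p = np.asarray(weights) / sum(weights)
    return rng.choice(scores, size=n, p=p)

print("per-call:", timeit.timeit(one_at_a_time, number=1))
print("batched: ", timeit.timeit(batched, number=1))
```

Each random.choices call rebuilds its cumulative-weight table, so the per-call loop repeats work that the batched draw does once.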

I decided to give it a try and replaced the random part with a numpy generator. For the numpy generator to be effective, we need to draw many choices up front and iterate over them later. One issue is that not all CIGAR strings occur with the same frequency, so generating a large batch of choices for every CIGAR string is wasteful.

The CIGARQProbDist object circumvents this issue by regenerating choices with an increasing buffer size each time all choices have been consumed. I also use an lru_cache from functools to avoid reformatting the same CIGAR strings over and over.
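A minimal sketch of that buffered-distribution idea, with assumed names (this is not the PR's actual CIGARQProbDist code, and the cached helper only stands in for the real CIGAR reformatting in Badread's qscore_model):

```python
# Sketch of the buffered approach: pre-draw weighted choices in batches
# with numpy, hand them out one at a time, and grow the batch on refill.
from functools import lru_cache

import numpy as np

class BufferedProbDist:
    """Pre-draws weighted choices in batches; the batch grows on each refill."""

    def __init__(self, values, weights, start_size=1000):
        self.values = list(values)
        self.probs = np.asarray(weights, dtype=float)
        self.probs /= self.probs.sum()   # normalise to a probability vector
        self.size = start_size
        self.rng = np.random.default_rng()
        self.buffer = iter(())           # empty: filled lazily on first get()

    def _refill(self):
        # draw self.size indices at once, then double the next batch size
        draws = self.rng.choice(len(self.values), size=self.size, p=self.probs)
        self.buffer = iter(draws)
        self.size *= 2

    def get(self):
        try:
            idx = next(self.buffer)
        except StopIteration:            # buffer exhausted: refill and retry
            self._refill()
            idx = next(self.buffer)
        return self.values[idx]

@lru_cache(maxsize=None)
def formatted_cigar(cigar):
    # hypothetical stand-in for the repeated CIGAR reformatting the PR
    # memoises; the real transformation lives in Badread's qscore_model
    return tuple(cigar)
```

Growing the buffer means rare CIGAR strings only ever pay for small batches, while frequent ones quickly reach large, cheap-per-draw batches.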

New profiling with the same command line looks like this:

         193134940 function calls (193125766 primitive calls) in 75.878 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    746/1    0.016    0.000   75.878   75.878 {built-in method builtins.exec}
        1    0.000    0.000   75.878   75.878 badread-runner.py:1(<module>)
        1    0.027    0.027   75.856   75.856 __main__.py:27(main)
        1    0.029    0.029   75.328   75.328 simulate.py:32(simulate)
      345    4.821    0.014   67.893    0.197 simulate.py:237(sequence_fragment)
      345    9.069    0.026   34.336    0.100 qscore_model.py:36(get_qscores)
  3892071    5.189    0.000   17.189    0.000 error_model.py:131(add_errors_to_kmer)
   454897    9.375    0.000    9.375    0.000 {edlib.align}
  3892071    6.134    0.000    9.175    0.000 random.py:506(choices)
  5254902    2.872    0.000    8.192    0.000 qscore_model.py:330(get_qscore)
  4426818    1.033    0.000    7.227    0.000 random.py:366(randint)
  4426818    3.000    0.000    6.194    0.000 random.py:292(randrange)
        1    0.000    0.000    5.193    5.193 error_model.py:88(__init__)
        1    0.080    0.080    5.193    5.193 error_model.py:107(load_from_file)

This is a substantial speed increase (1.6x). The only drawback is the increased memory footprint, but that should be fine in most cases.

@jsgounot (Author)

I forgot to mention this in the initial PR description. The current version stops generating new choices for a CIGAR value once the buffer size hits a limit (defined here as 1e6). The logic is to avoid generating an overlarge buffer, since cycling over 1e6 choices is very likely to provide good randomness across the available scores.
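A sketch of that cap, with assumed names (not the PR's code): buffers grow until they reach the limit, after which the final batch of draws is cycled over instead of being regenerated.

```python
# Sketch of the buffer-size cap described above: once a buffer reaches
# MAX_BUFFER, cycle over it forever rather than drawing new values.
import itertools

import numpy as np

MAX_BUFFER = int(1e6)  # the limit quoted in the comment above

def make_buffer(rng, values, probs, size):
    # draw `size` indices into `values` according to `probs`
    draws = rng.choice(len(values), size=size, p=probs)
    if size >= MAX_BUFFER:
        # at the cap: reuse these draws indefinitely, on the assumption
        # that 1e6 samples already cover the distribution well
        return itertools.cycle(draws)
    return iter(draws)
```

Below the cap the buffer is a plain iterator that signals exhaustion (triggering a larger redraw); at the cap it becomes an endless cycle, so no further random generation is needed for that CIGAR value.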
