Runtime improvement #20

Open · wants to merge 2 commits into main
Conversation

jsgounot

Hi,

First, thanks for this software. I've been running a lot of Badread jobs for a project, and while waiting for them to finish I found it interesting to check why simulating reads takes so long. I profiled Badread with cProfile using this command line:

python -m cProfile -s cumtime badread-runner.py simulate --reference test/kleb_MN.fasta --quantity 1x --error_model nanopore2020 --qscore_model nanopore2020 --identity 87.5,97.5,5 --glitches 1000,25,25 --junk_reads 1 --random_reads 1 --chimeras 1 --seed 50 > test/res

with kleb_MN being a Klebsiella pneumoniae strain I have on disk.

Here is the head of the cProfile results with the current version:

         223062325 function calls (223053151 primitive calls) in 124.887 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    746/1    0.015    0.000  124.892  124.892 {built-in method builtins.exec}
        1    0.000    0.000  124.892  124.892 badread-runner.py:1(<module>)
        1    0.027    0.027  124.862  124.862 __main__.py:27(main)
        1    0.027    0.027  124.255  124.255 simulate.py:32(simulate)
      381    5.318    0.014  116.793    0.307 simulate.py:237(sequence_fragment)
      381    9.946    0.026   81.135    0.213 qscore_model.py:35(get_qscores)
  5255819    7.593    0.000   53.411    0.000 qscore_model.py:289(get_qscore)
  9289110   44.446    0.000   53.282    0.000 random.py:506(choices)
  4033291    5.213    0.000   17.898    0.000 error_model.py:131(add_errors_to_kmer)
   455719    8.872    0.000    8.872    0.000 {edlib.align}
  4642048    1.152    0.000    8.022    0.000 random.py:366(randint)
  4642048    3.351    0.000    6.869    0.000 random.py:292(randrange)
  9289110    4.252    0.000    6.866    0.000 random.py:540(<listcomp>)

You can see that a lot of time is spent computing the qscores, mostly because of the millions of random.choices calls.
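To illustrate why those per-call draws are costly, here is a hypothetical micro-benchmark (not part of the PR; the score alphabet, weights and count below are made up) comparing one random.choices call per qscore with a single vectorised numpy draw of the whole batch:

```python
# Hypothetical micro-benchmark (not from the PR): drawing one weighted
# choice at a time with random.choices vs. one batched numpy draw.
import random
import timeit

import numpy as np

scores = list(range(1, 41))    # made-up qscore alphabet
weights = [1.0] * len(scores)  # made-up (uniform) weights
n = 100_000                    # number of qscores to draw

def one_at_a_time():
    # one random.choices call per qscore, as in the original code path
    return [random.choices(scores, weights=weights)[0] for _ in range(n)]

def batched():
    # a single vectorised draw of all n qscores
    rng = np.random.default_rng()
    p = np.asarray(weights) / sum(weights)
    return rng.choice(scores, size=n, p=p)

print("per-call:", timeit.timeit(one_at_a_time, number=1))
print("batched: ", timeit.timeit(batched, number=1))
```

Each random.choices call rebuilds its cumulative-weight table, so the per-call loop repeats work that the batched draw does once.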

I decided to give it a try and replaced the random part with a numpy generator. For the numpy generator to be effective, we need to draw many choices up front and iterate over them later. One issue is that not all CIGAR strings occur with the same frequency, so generating a large batch of choices for every CIGAR string is wasteful.

The CIGARQProbDist object circumvents this issue by regenerating choices with an increasing buffer size each time all choices have been consumed. I also use an lru_cache from functools to avoid reformatting the same CIGAR strings over and over.
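A minimal sketch of that buffered-distribution idea, with assumed names (this is not the PR's actual CIGARQProbDist code, and the cached helper only stands in for the real CIGAR reformatting in Badread's qscore_model):

```python
# Sketch of the buffered approach: pre-draw weighted choices in batches
# with numpy, hand them out one at a time, and grow the batch on refill.
from functools import lru_cache

import numpy as np

class BufferedProbDist:
    """Pre-draws weighted choices in batches; the batch grows on each refill."""

    def __init__(self, values, weights, start_size=1000):
        self.values = list(values)
        self.probs = np.asarray(weights, dtype=float)
        self.probs /= self.probs.sum()   # normalise to a probability vector
        self.size = start_size
        self.rng = np.random.default_rng()
        self.buffer = iter(())           # empty: filled lazily on first get()

    def _refill(self):
        # draw self.size indices at once, then double the next batch size
        draws = self.rng.choice(len(self.values), size=self.size, p=self.probs)
        self.buffer = iter(draws)
        self.size *= 2

    def get(self):
        try:
            idx = next(self.buffer)
        except StopIteration:            # buffer exhausted: refill and retry
            self._refill()
            idx = next(self.buffer)
        return self.values[idx]

@lru_cache(maxsize=None)
def formatted_cigar(cigar):
    # hypothetical stand-in for the repeated CIGAR reformatting the PR
    # memoises; the real transformation lives in Badread's qscore_model
    return tuple(cigar)
```

Growing the buffer means rare CIGAR strings only ever pay for small batches, while frequent ones quickly reach large, cheap-per-draw batches.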

New profiling with the same command line looks like this:

         193134940 function calls (193125766 primitive calls) in 75.878 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    746/1    0.016    0.000   75.878   75.878 {built-in method builtins.exec}
        1    0.000    0.000   75.878   75.878 badread-runner.py:1(<module>)
        1    0.027    0.027   75.856   75.856 __main__.py:27(main)
        1    0.029    0.029   75.328   75.328 simulate.py:32(simulate)
      345    4.821    0.014   67.893    0.197 simulate.py:237(sequence_fragment)
      345    9.069    0.026   34.336    0.100 qscore_model.py:36(get_qscores)
  3892071    5.189    0.000   17.189    0.000 error_model.py:131(add_errors_to_kmer)
   454897    9.375    0.000    9.375    0.000 {edlib.align}
  3892071    6.134    0.000    9.175    0.000 random.py:506(choices)
  5254902    2.872    0.000    8.192    0.000 qscore_model.py:330(get_qscore)
  4426818    1.033    0.000    7.227    0.000 random.py:366(randint)
  4426818    3.000    0.000    6.194    0.000 random.py:292(randrange)
        1    0.000    0.000    5.193    5.193 error_model.py:88(__init__)
        1    0.080    0.080    5.193    5.193 error_model.py:107(load_from_file)

This is a substantial speed increase (1.6x). The only drawback is the increased memory footprint, but that should be fine in most cases.

@jsgounot (Author)

I forgot to mention this in the initial PR description. The current version stops generating new choices for a CIGAR value once the buffer size hits a limit (defined here as 1e6). The logic is to avoid generating an overlarge buffer, since cycling over 1e6 choices is very likely to provide good randomness across the available scores.
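A sketch of that cap, with assumed names (not the PR's code): buffers grow until they reach the limit, after which the final batch of draws is cycled over instead of being regenerated.

```python
# Sketch of the buffer-size cap described above: once a buffer reaches
# MAX_BUFFER, cycle over it forever rather than drawing new values.
import itertools

import numpy as np

MAX_BUFFER = int(1e6)  # the limit quoted in the comment above

def make_buffer(rng, values, probs, size):
    # draw `size` indices into `values` according to `probs`
    draws = rng.choice(len(values), size=size, p=probs)
    if size >= MAX_BUFFER:
        # at the cap: reuse these draws indefinitely, on the assumption
        # that 1e6 samples already cover the distribution well
        return itertools.cycle(draws)
    return iter(draws)
```

Below the cap the buffer is a plain iterator that signals exhaustion (triggering a larger redraw); at the cap it becomes an endless cycle, so no further random generation is needed for that CIGAR value.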
