Slow run time of sourmash sig subtract when re-implemented in python API #2248
Comments
Put files here, https://github.com/ctb/2022-sourmash-slow-sig-subtract, and fixed them a bit so they'd run 😜.
I can't replicate with downsampled sigs (to scaled=1000, k=31 only) - I see
for the Python API version, and
so, basically identical. I will note that in the Python version, you don't need to reload
I will now re-run with the scaled=200 sketches and see what happens.
tl;dr no substantial difference in time or maxresident, but a massive difference in inputs and a lot more page faults. huh.
python
shell
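For anyone who wants to watch the same counters from inside the Python process, here is a minimal sketch using the standard-library `resource` module (Unix only). It assumes the numbers above came from `/usr/bin/time`, whose `maxresident`, `inputs`, and page-fault fields correspond to `ru_maxrss`, `ru_inblock`, and `ru_majflt`/`ru_minflt`. The helper names `usage_snapshot` and `usage_delta` and the placeholder workload are made up for illustration and are not part of sourmash.

```python
import resource


def usage_snapshot():
    """Return the counters that /usr/bin/time also reports, for this process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "maxrss": ru.ru_maxrss,        # peak resident set size (KB on Linux, bytes on macOS)
        "inputs": ru.ru_inblock,       # block input operations (reads that hit the disk)
        "minor_faults": ru.ru_minflt,  # page faults served without disk I/O
        "major_faults": ru.ru_majflt,  # page faults that required disk I/O
    }


def usage_delta(before, after):
    """Subtract two snapshots key by key."""
    return {key: after[key] - before[key] for key in before}


before = usage_snapshot()

# Placeholder workload: build a large list so the process touches new memory.
# Swap in the real signature loading / subtraction step here.
workload = [i * 2 for i in range(1_000_000)]

after = usage_snapshot()
delta = usage_delta(before, after)

for name, value in delta.items():
    print(f"{name}: {value}")
```

Taking a snapshot before and after each step (load, subtract, save) would show which step accounts for the extra inputs and page faults, rather than only seeing the totals for the whole run.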
I can replicate the inputs differences with a much smaller query - makes me suspect that
I'll think about it some more. But for now I cannot reproduce the performance delta itself.
I recently wrote a snakefile to subtract one signature from another. I noticed that my implementation of sourmash sig subtract, which I wrote using the python API but basically copied from sourmash src, was super slow. Signature loading was slow, but subtracting was the slowest step (discovered with liberal print statements). I then switched it to running sourmash sig subtract using the CLI, but wrapped in shell(), so still executed within python. That was also slow in the same ways as above. Lastly, I ran it just on the command line, and then it was fast again! I looked at memory while these were running and it consistently stayed around 3 GB (on a 64 GB machine). I also tried giving each rule 64 GB of memory in case snakemake was silently limiting resources, and there was no change in perceived runtime.
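As a rough illustration of timing each step separately instead of relying on print statements, here is a minimal standard-library sketch. It does not use the sourmash API: plain Python sets of random 64-bit integers stand in for the hashes in a sketch, and `timed`, `make_hashes`, and `subtract_hashes` are hypothetical helpers invented for this example. The subtraction is ordinary set difference, which is only assumed to resemble what the real command does.

```python
import random
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def make_hashes(n, seed):
    """Stand-in for loading a signature: a set of up to n random 64-bit hashes."""
    rng = random.Random(seed)
    return {rng.getrandbits(64) for _ in range(n)}


def subtract_hashes(from_hashes, others):
    """Remove every hash found in any of `others` from `from_hashes`."""
    remaining = set(from_hashes)
    for other in others:
        remaining -= other
    return remaining


with timed("load"):
    sample = make_hashes(200_000, seed=1)
    # Give the background set some overlap with the sample on purpose.
    background = make_hashes(200_000, seed=2) | set(list(sample)[:50_000])

with timed("subtract"):
    result = subtract_hashes(sample, [background])

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

Wrapping the real load and subtract calls in `timed(...)` blocks, once inside the snakemake rule and once in a plain script, would give directly comparable numbers for the two ways of running.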
I realize that the two approaches that use just python are written so that ALL signatures in the metadata file would be run before the rule finished, while the one that uses the CLI operates on one sample (two sigs) at a time. However, I was monitoring this just running on the one sample provided here, and the first two implementations below were WAY slower than just using the CLI. (This isn't really important for the reprex, as I only included one sample in the metadata table.)
I have a working solution for what I want to do, but I wanted to record what I felt like was weird behavior that I couldn't figure out an explanation for.
I'm including a snakefile here that reproduces the behavior I observed. The metadata file (attached) I included only runs the code for one sample (two SRA run accessions) so it will hopefully be ~quick. You'll need sourmash, pandas, sra tools, and snakemake in your environment for the snakefiles to run.
metadata.txt
Conda environment file: