how much memory does sourmash compare need?
#2299
Comments
some things we could do -
@ccbaumler this is a good benchmarking issue! some things to measure, and some things to fix! |
I should also say that I'm not 100% sure it was OOM, but it ran for quite some time, then I left for lunch, and when I returned my whole remote session had crashed (I was using OOM seems like the most likely explanation to me. But if sketches are held in memory, assuming ~80 GB for the matrix itself, and assuming there were some other processes running that required some memory, 40 GB remaining / 100k sketches means sketches would only have to be on the order of 400 KB each in memory to cause a problem. Is this plausible? Another idea would be to give the option of reading sketches from disk each time to save memory at the expense of speed, though if you don't have mmapped sketches and have to scan from the beginning each time, this could get real slow. |
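The back-of-envelope numbers in the comment above can be checked with a few lines of Python. This is only a sketch under stated assumptions: the ~80 GB figure is read as a dense 100k x 100k matrix of 8-byte floats, gigabytes are decimal, and the 40 GB of headroom is taken directly from the comment. None of this is measured sourmash behavior, and the function names are made up for illustration.

```python
# Rough memory arithmetic for an all-vs-all comparison.
# Assumptions (not confirmed in the thread): dense n x n matrix,
# float64 entries (8 bytes each), decimal units (1 GB = 1e9 bytes).

N_SKETCHES = 100_000
BYTES_PER_FLOAT64 = 8


def dense_matrix_bytes(n, bytes_per_entry=BYTES_PER_FLOAT64):
    """Bytes needed to hold a dense n x n similarity matrix."""
    return n * n * bytes_per_entry


def per_sketch_budget_bytes(remaining_bytes, n):
    """Average bytes available per sketch if all n are held in memory."""
    return remaining_bytes / n


matrix_gb = dense_matrix_bytes(N_SKETCHES) / 1e9
budget_kb = per_sketch_budget_bytes(40e9, N_SKETCHES) / 1e3

print(f"matrix: {matrix_gb:.0f} GB")
print(f"per-sketch budget: {budget_kb:.0f} KB")
```

Under these assumptions the matrix alone accounts for 80 GB, and 40 GB of remaining headroom works out to 400 KB per sketch, which matches the estimate in the comment.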
Hi, I am trying to run Here are some configurations (RAM and number of processes) I've tried:
Any ideas on what might be happening? Do I just need to throw more memory at it? Here are some example outputs of my job stats:
Thank you for any assistance you can provide, |
ahh, metagenome signatures 😱 . It's not completely surprising to me b/c I would guess that most of the memory usage is being taken up by just loading the sketches into memory. Still, ...suboptimal. Some details that would be helpful in terms of providing guidance
thanks! One specific recommendation: raise the scaled value when doing compare, e.g. You can also try pyo3_branchwater's multisearch command which should be much more memory efficient but this is still early-stage so, you know, buyer beware 😆 . The output is also not as convenient as sourmash compare's for viz, since it's just a sparse set of places where an above-threshold comparison was found. Last but not least, I'm not sure where @mr-eyes kSpider is, but it was built for this purpose, so: #2271 and https://dib-lab.github.io/kSpider/ OK, actually last: check out #2735. I'm challenged by the notion of using |
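To illustrate why raising the scaled value reduces memory, here is a conceptual sketch in plain Python. It is not sourmash's code: it assumes the usual fractional-sketch idea that a hash is kept only if it falls below roughly 2^64 / scaled, uses random integers as stand-ins for k-mer hashes, and the helper names `keep_threshold` and `downsample` are hypothetical.

```python
import random

MAX_HASH = 2**64  # assumed 64-bit hash space


def keep_threshold(scaled):
    """Hashes strictly below this value are kept at the given scaled value."""
    return MAX_HASH // scaled


def downsample(hashes, scaled):
    """Return the subset of hashes that survives at a (higher) scaled value."""
    limit = keep_threshold(scaled)
    return {h for h in hashes if h < limit}


# Random stand-ins for the hashes of one large sample (illustrative only).
rng = random.Random(42)
all_hashes = {rng.randrange(MAX_HASH) for _ in range(1_000_000)}

sketch_1k = downsample(all_hashes, 1000)
sketch_10k = downsample(sketch_1k, 10000)

print(len(sketch_1k), len(sketch_10k))
```

With uniformly distributed hashes, going from scaled=1000 to scaled=10000 keeps about a tenth of the hashes, so the in-memory size of each sketch shrinks by roughly the same factor, at the cost of resolution for small or low-abundance content.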
kSpider@dev branch has the most recent updates (yet no docs for it), but I am happy to help build/run it until released. Also, I think branchwater/multisearch can tackle this. The way |
And here's a bonus script for converting branchwater/multisearch results to a Newick file. |
You can also modify the script in sourmash-bio/sourmash_plugin_branchwater#111 to generate similar output to sourmash compare. You will simply need to do |
Hi @ctb and @mr-eyes, thank you for the detailed answers.
I will check out the other resources that you've listed and report back. Best, |
In the end it took me 800 GB for 1123 metagenomes:
|
wow! thank you for letting us know! |
@kescobo @vinisalazar the branchwater plugin for sourmash now has a It is also now pretty straightforward to install from conda-forge, which is nice :). I'm pretty sure @mr-eyes has a script to convert its output into a matrix format but I am unable to find it at the moment. Mo? (I don't think it's this one) Anyway, just wanted to drop by and say this 😆 . We haven't fixed sourmash compare yet, but ...eventually... |
@ctb you are correct, I have a script. I have created a PR for it here sourmash-bio/sourmash_plugin_branchwater#198 The PR will be ready after writing tests, though. |
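For readers wondering what such a conversion involves, here is a minimal sketch of turning sparse pairwise results into a dense, labeled matrix. It is not the script from the PR above. It assumes the results have already been parsed into hypothetical `(name_a, name_b, similarity)` tuples, that the measure is symmetric (for example Jaccard, not containment), and that pairs below the reporting threshold are simply absent and can be filled with 0.

```python
def pairs_to_dense(pairs, fill=0.0):
    """Build a symmetric dense matrix from sparse (a, b, similarity) tuples.

    Pairs missing from the input (below threshold) get `fill`;
    the diagonal is set to 1.0 (self-similarity).
    """
    names = sorted({name for a, b, _ in pairs for name in (a, b)})
    index = {name: i for i, name in enumerate(names)}
    size = len(names)
    matrix = [[fill] * size for _ in range(size)]
    for i in range(size):
        matrix[i][i] = 1.0
    for a, b, sim in pairs:
        i, j = index[a], index[b]
        matrix[i][j] = sim
        matrix[j][i] = sim
    return names, matrix


def to_tsv(names, matrix):
    """Render the matrix as TSV with a header row and row labels."""
    lines = ["\t".join([""] + names)]
    for name, row in zip(names, matrix):
        lines.append("\t".join([name] + [f"{value:.3f}" for value in row]))
    return "\n".join(lines)


# Hypothetical sparse results: only above-threshold pairs are present.
pairs = [("A", "B", 0.8), ("B", "C", 0.25)]
names, matrix = pairs_to_dense(pairs)
print(to_tsv(names, matrix))
```

Note that the dense matrix still grows quadratically with the number of sketches, so this only helps when the sparse comparison step, not the final matrix, is the memory bottleneck.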
For 14,663,820 pairwise comparisons done by branchwater |
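As a side note on the comparison count: 14,663,820 is exactly the number of unordered pairs among 5,416 items. Assuming the run was an all-vs-all comparison within a single collection, with self-comparisons excluded (the thread does not state this), the arithmetic can be checked as follows. The function names are illustrative.

```python
from math import comb, isqrt


def n_pairs(n):
    """Number of unordered pairs among n sketches: n * (n - 1) / 2."""
    return comb(n, 2)


def sketches_for_pairs(pairs):
    """Return n such that n * (n - 1) / 2 == pairs, or None if no such n."""
    n = (1 + isqrt(1 + 8 * pairs)) // 2
    return n if comb(n, 2) == pairs else None


print(sketches_for_pairs(14_663_820))
print(n_pairs(100_000))
```

The same formula shows why 100k sketches are demanding: all-vs-all over 100,000 sketches means roughly 5 billion unordered pairs.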
That's awesome, thank you @ctb / sourmash team. |
After an enhancement, it takes now 10 seconds to export the TSV dense matrix. |
FYI - #3134 contains a bunch of information about how to use the |
@kescobo ran into some out-of-memory errors when trying to do `sourmash compare` on 100k genomes, and we became curious about memory usage 😆
kevin:
me:
me:
kevin:
me:
Kevin:
me: