sourmash database creation parameter choices & consequences #778
Built the LCA DB for bacteria! Requires 18 GB of RAM to load, it looks like, and about 5 minutes to search for a single genome. But it's a small file at only 405MB! Yay?
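For concreteness, here's a hedged sketch of the LCA workflow being timed above; the taxonomy CSV, database, and signature names are placeholders, not the actual files from this run:

```
# Build an LCA database from signatures plus a taxonomy spreadsheet
# (placeholder filenames; match k-mer size/scaled to your sketches).
sourmash lca index bacteria-taxonomy.csv bacteria.lca.json sigs/*.sig

# Search a single genome's signature against it, recording peak RAM.
/usr/bin/time -v sourmash gather query-genome.sig bacteria.lca.json
```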
You've gotta love SBTs. 13 seconds to search all 500,000 bacterial genomes, in under 1 GB of RAM.
OTOH the SBT directory is 30 GB uncompressed! So, um, ok.
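For reference, a minimal sketch of the SBT side (sourmash 2.x wrote a `<name>.sbt.json` manifest plus a hidden `.sbt.<name>/` directory of node files; names here are placeholders):

```
# Build the SBT and search it.
sourmash index -k 31 bacteria sigs/*.sig
sourmash gather query-genome.sig bacteria.sbt.json

# The hidden node directory is where the uncompressed bloom filters live.
du -sh .sbt.bacteria/
```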
And you're using sourmash 2.0.1; it should be faster in newer versions.
That is a pretty small query for gather, but glad it was found quickly =]
Possible solutions: …

Another comment: For … This is not so useful for … A dirty version of this is in the …
compress internal nodes - is that Rust dependent?
No, it should work now (it's a khmer feature). Need to …
all done; here are the sizes of the SBTs: …
This is pretty out of date with the new .sbt.zip stuff. Closing as irrelevant.
A few notes from things posted to slack --
for fungi, d2 search took 8 seconds; d10 search took 35 seconds (see the sketch after these notes).
presumably this is because when you’re weeding out false hits beneath a node, you have to load an average of d/2 nodes to find the right one, or some such.
all with d2 for fungi alone.
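A sketch of that d2-vs-d10 comparison; note that the `-d` arity flag here is an assumption about the CLI and may be spelled differently (or only be available via the Python API) in your version:

```
# Build two SBTs over the same fungal signatures at different arities
# (the -d flag is assumed, not confirmed from this thread).
sourmash index -k 31 -d 2  fungi-d2  fungi/*.sig
sourmash index -k 31 -d 10 fungi-d10 fungi/*.sig

# Time the same query against both; higher arity means more children
# to load per node while weeding out false hits.
time sourmash search query.sig fungi-d2.sbt.json
time sourmash search query.sig fungi-d10.sbt.json
```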
One of the big obstacles to using larger bloom filters here is that we want to compress the bloom filters on disk b/c otherwise they get way too big. I assume the new buffer-based bloom filter stuff in Rust allows loading from gzipped files?
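To illustrate the disk-size pressure, a rough sketch (the directory name is a placeholder, and sourmash won't necessarily read the gzipped files back unless the loader supports it, which is exactly the open question above):

```
du -sh .sbt.bacteria/        # uncompressed node files (~30 GB in the run above)
gzip -9 -r .sbt.bacteria/    # gzip every bloom filter file in place
du -sh .sbt.bacteria/        # on-disk size after compression
```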