sourmash database creation parameter choices & consequences #778

ctb · 2019-11-27T14:07:12Z

A few notes from things posted to slack --

d10 SBTs are a lot slower than d2 SBTs.

for fungi, d2 search took 8 seconds; d10 search took 35 seconds.

presumably this is because when you’re weeding out false hits beneath a node, you have to load an average of d/2 nodes to find the right one, or some such.
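
To put rough numbers on that guess, here is a back-of-the-envelope sketch in Python. It assumes a perfectly balanced tree, a single root-to-leaf path, and an average of d/2 children loaded per level. `tree_depth`, `expected_node_loads`, and the leaf count of 1000 are made up for illustration; they are not sourmash functions or real database sizes.

```python
def tree_depth(n_leaves, d):
    """Number of levels a balanced d-ary tree needs to hold n_leaves leaves."""
    depth = 0
    capacity = 1
    while capacity < n_leaves:
        capacity *= d
        depth += 1
    return depth


def expected_node_loads(n_leaves, d):
    """Levels times d/2: the 'average of d/2 nodes per level' guess."""
    return tree_depth(n_leaves, d) * d / 2


if __name__ == "__main__":
    n_leaves = 1000  # hypothetical number of signatures
    for d in (2, 10):
        print(d, tree_depth(n_leaves, d), expected_node_loads(n_leaves, d))
```

Under these assumptions d10 loads somewhat more nodes than d2 despite the shallower tree, but the difference is smaller than the timing gap reported above, so this guess alone would not explain all of it.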

I also see really dramatic decreases in search time for larger bloom filters (like, duh)

-x1e4 - 74.05user
-x1e5 - 28.87user
-x1e6 - 8.54user

all with d2 for fungi alone.
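
One way to see why bigger filters help: the fuller a Bloom filter is, the more often it claims to contain a hash that is not actually below that node, and every such false hit sends the search into a dead-end subtree. A minimal sketch using the textbook false-positive approximation (1 - e^(-kn/m))^k follows. The 5000 hashes per node, the single hash function, and treating the -x value as the number of slots are all assumptions for illustration, not values taken from sourmash or khmer.

```python
import math


def bloom_false_positive_rate(n_items, n_slots, n_hashes=1):
    """Textbook approximation of a Bloom filter's false-positive rate."""
    return (1.0 - math.exp(-n_hashes * n_items / n_slots)) ** n_hashes


if __name__ == "__main__":
    n_items = 5000  # hypothetical number of hashes stored in one internal node
    for n_slots in (1e4, 1e5, 1e6):
        rate = bloom_false_positive_rate(n_items, n_slots)
        print(f"-x{n_slots:.0e}: false-positive rate ~ {rate:.4f}")
```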

One of the big obstacles to using larger Bloom filters here is that we want to compress them on disk, b/c otherwise they get way too big. I assume that the new buffer-based Bloom filter stuff in Rust allows loading from gzipped files??

ctb · 2019-11-28T12:57:11Z

Built the LCA DB for bacteria! Requires 18 GB of RAM to load, it looks like, and about 5 minutes to search for a single genome. But it's a small file at only 405MB! Yay?

% /usr/bin/time sourmash gather $Y outputs/lca/scaled/genbank-bacteria-k31-scaled10k.lca.json.gz
== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: AAEZMM010000001.1 Salmonella e... (k=31, DNA)
loaded 1 databases.


overlap     p_query p_match avg_abund
---------   ------- ------- ---------
4.7 Mbp      100.0%  100.0%       1.0    AAADRZ010000001.1 Salmonella enterica...

found 1 matches total;
the recovered matches hit 100.0% of the query

316.99user 35.32system 5:50.61elapsed 100%CPU (0avgtext+0avgdata 18766200maxresident)k
1227944inputs+24outputs (51major+31793920minor)pagefaults 0swaps

ctb · 2019-12-03T18:09:26Z

You've gotta love SBTs. 13 seconds to search all 500,000 bacterial genomes, in under 1 GB of RAM.

% /usr/bin/time sourmash gather $X outputs/trees/scaled/genbank-bacteria-d2-x1e5-k31.sbt.json 
== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: AAEZMM010000001.1 Salmonella e... (k=31, DNA)
loaded 1 databases.


overlap     p_query p_match avg_abund
---------   ------- ------- ---------
4.7 Mbp      100.0%  100.0%       1.0    AAEZMM010000001.1 Salmonella enterica...

found 1 matches total;
the recovered matches hit 100.0% of the query

12.86user 3.81system 0:33.25elapsed 50%CPU (0avgtext+0avgdata 1064704maxresident)k
1204000inputs+583984outputs (68major+396507minor)pagefaults 0swaps
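
For comparing runs like the two above, the summary line printed by `/usr/bin/time` can be turned into numbers. A small sketch, assuming the GNU time default format shown in this thread and that `maxresident` is reported in kilobytes of 1024 bytes; `parse_time_summary` is a hypothetical helper, not part of sourmash.

```python
import re

# Matches e.g. "12.86user 3.81system 0:33.25elapsed 50%CPU (... 1064704maxresident)k"
TIME_LINE = re.compile(
    r"([\d.]+)user\s+([\d.]+)system\s+(\S+)elapsed.*?(\d+)maxresident\)k"
)


def parse_time_summary(line):
    """Return CPU seconds and peak memory (GiB) from a GNU time summary line."""
    match = TIME_LINE.search(line)
    if match is None:
        raise ValueError("not a GNU time summary line")
    user, system, _elapsed, max_kb = match.groups()
    return {
        "user_s": float(user),
        "system_s": float(system),
        "max_rss_gib": int(max_kb) / 1024 / 1024,
    }


if __name__ == "__main__":
    lca = "316.99user 35.32system 5:50.61elapsed 100%CPU (0avgtext+0avgdata 18766200maxresident)k"
    sbt = "12.86user 3.81system 0:33.25elapsed 50%CPU (0avgtext+0avgdata 1064704maxresident)k"
    for label, line in (("LCA", lca), ("SBT", sbt)):
        print(label, parse_time_summary(line))
```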

ctb · 2019-12-03T18:12:33Z

OTOH the SBT directory is 30GB uncompressed! So, um, ok.

luizirber · 2019-12-03T18:51:58Z

> You've gotta love SBTs. 13 seconds to search all 500,000 bacterial genomes, in under 1 GB of RAM.
> % /usr/bin/time sourmash gather $X outputs/trees/scaled/genbank-bacteria-d2-x1e5-k31.sbt.json
> == This is sourmash version 2.0.1. ==

And you're using sourmash 2.0.1; it should be faster in newer versions.

> loaded query: AAEZMM010000001.1 Salmonella e... (k=31, DNA)

That is a pretty small query for gather, but glad it was found quickly =]

> OTOH the SBT directory is 30GB uncompressed! So, um, ok.

Possible solutions:

Short term: compress internal nodes. Should work already.
Medium term:
- read internal nodes from buffers, instead of the tempfile dance happening now.
  (The tempfile dance: the content of an internal node is read from storage, written to a tempfile, and then loaded into a Nodegraph from disk. This is a limitation in khmer, but the Rust Nodegraph can be loaded from a memory buffer.)
- load SBTs from compressed files (a zipfile), without having to decompress the file we distribute.
Long term: Dynamically sized internal nodes (with MQF).
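
To illustrate the difference between the tempfile dance and loading from a buffer, here is a generic sketch using only the Python standard library. It does not use the khmer or sourmash APIs: `parse_node` is a hypothetical stand-in for whatever actually deserializes a node, and the gzip-compressed payload is made up.

```python
import gzip
import io
import os
import tempfile


def parse_node(fileobj):
    """Hypothetical node loader; here it just returns the raw bytes."""
    return fileobj.read()


def load_via_tempfile(stored_bytes):
    """The 'tempfile dance': decompress, write to disk, then reopen and parse."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(gzip.decompress(stored_bytes))
        with open(path, "rb") as fp:
            return parse_node(fp)
    finally:
        os.remove(path)


def load_via_buffer(stored_bytes):
    """Decompress and parse straight from memory, with no file on disk."""
    with gzip.GzipFile(fileobj=io.BytesIO(stored_bytes)) as fp:
        return parse_node(fp)


if __name__ == "__main__":
    payload = bytes(1000)  # made-up node content
    stored = gzip.compress(payload)
    print(load_via_tempfile(stored) == load_via_buffer(stored))
```

Both paths yield the same bytes; the buffer version simply skips the extra write and read on disk.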

luizirber · 2019-12-03T18:57:05Z

Another comment: for search we only need to load each internal node once (it will never be checked again). This helps reduce total memory consumption, because we can unload an internal node after checking it. The feature/unload branch exposes this in the find function, but it needs more tests.

This is not as useful for gather, because internal nodes might be checked more than once, but it might lead to a "low-memory" mode that always unloads internal node data, or to a mixed approach where we cache the internal node data for frequently accessed nodes.

A dirty version of this is in the unassigned.py I wrote for @taylorreiter, but I would rather avoid having to dig into private fields like this and have a proper method instead =P
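
The "mixed approach" could look roughly like a small least-recently-used cache in front of the node loader. This is only a sketch of the idea, not the code in feature/unload or unassigned.py: `NodeCache`, `max_loaded`, and the `loader` callable are hypothetical names.

```python
from collections import OrderedDict


class NodeCache:
    """Keep at most max_loaded internal nodes in memory, evicting the least recently used."""

    def __init__(self, loader, max_loaded):
        self._loader = loader        # callable: node name -> node data
        self._max_loaded = max_loaded
        self._loaded = OrderedDict()

    def get(self, name):
        if name in self._loaded:
            # Cache hit: mark as most recently used.
            self._loaded.move_to_end(name)
            return self._loaded[name]
        data = self._loader(name)
        self._loaded[name] = data
        if len(self._loaded) > self._max_loaded:
            # "Unload" the node that has gone unused the longest.
            self._loaded.popitem(last=False)
        return data


if __name__ == "__main__":
    loads = []

    def fake_loader(name):
        loads.append(name)
        return b"data for " + name.encode()

    cache = NodeCache(fake_loader, max_loaded=2)
    for name in ("a", "b", "a", "c", "a", "b"):
        cache.get(name)
    print(loads)
```

Setting `max_loaded` very low approximates the always-unload "low-memory" mode, while a larger value keeps frequently visited nodes resident for gather.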

ctb · 2019-12-04T14:39:56Z

compress internal nodes - is that Rust dependent?

luizirber · 2019-12-05T23:03:50Z

> compress internal nodes - is that Rust dependent?

No, it should work now (it's a khmer feature). We need to either:

- change the indexing code to generate compressed nodes, or
- load an SBT, compress its internal nodes, and update the .sbt.json file (make a new command to port old SBTs?)
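
As a rough illustration of why compressing nodes should pay off, a sparsely filled bit array compresses very well with gzip. The sketch below uses made-up sizes (one million bits, one thousand of them set) and plain `gzip` from the standard library; it does not read or write real khmer Nodegraph files.

```python
import gzip
import random


def sparse_filter_bytes(n_bits, n_set, seed=0):
    """Build a mostly-zero bit array with roughly n_set random bits turned on."""
    rng = random.Random(seed)
    bits = bytearray(n_bits // 8)
    for _ in range(n_set):
        pos = rng.randrange(n_bits)
        bits[pos // 8] |= 1 << (pos % 8)
    return bytes(bits)


if __name__ == "__main__":
    raw = sparse_filter_bytes(n_bits=1_000_000, n_set=1_000)
    packed = gzip.compress(raw)
    print(f"raw: {len(raw)} bytes, gzipped: {len(packed)} bytes")
```

How much real nodes shrink depends on how full they are; nodes near the root of a tree hold more hashes and would be expected to compress less.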

ctb · 2019-12-06T15:07:37Z

all done; here are the sizes of the SBTs.

173M    .sbt.genbank-archaea-d10-k21
102M    .sbt.genbank-archaea-d10-x1e4-k21
102M    .sbt.genbank-archaea-d10-x1e4-k31
102M    .sbt.genbank-archaea-d10-x1e4-k51
113M    .sbt.genbank-archaea-d10-x1e5-k21
113M    .sbt.genbank-archaea-d10-x1e5-k31
113M    .sbt.genbank-archaea-d10-x1e5-k51
173M    .sbt.genbank-archaea-d10-x1e6-k21
173M    .sbt.genbank-archaea-d10-x1e6-k31
174M    .sbt.genbank-archaea-d10-x1e6-k51
623M    .sbt.genbank-archaea-d2-k21
176M    .sbt.genbank-archaea-d2-x1e4-k21
176M    .sbt.genbank-archaea-d2-x1e4-k31
177M    .sbt.genbank-archaea-d2-x1e4-k51
222M    .sbt.genbank-archaea-d2-x1e5-k21
222M    .sbt.genbank-archaea-d2-x1e5-k31
223M    .sbt.genbank-archaea-d2-x1e5-k51
623M    .sbt.genbank-archaea-d2-x1e6-k21
623M    .sbt.genbank-archaea-d2-x1e6-k31
624M    .sbt.genbank-archaea-d2-x1e6-k51
29G     .sbt.genbank-bacteria-d2-x1e5-k21
30G     .sbt.genbank-bacteria-d2-x1e5-k31
30G     .sbt.genbank-bacteria-d2-x1e5-k51
1.3G    .sbt.genbank-fungi-d10-k21
1.1G    .sbt.genbank-fungi-d10-x1e4-k21
1.1G    .sbt.genbank-fungi-d10-x1e4-k31
1.1G    .sbt.genbank-fungi-d10-x1e4-k51
1.1G    .sbt.genbank-fungi-d10-x1e5-k21
1.1G    .sbt.genbank-fungi-d10-x1e5-k31
1.1G    .sbt.genbank-fungi-d10-x1e5-k51
1.3G    .sbt.genbank-fungi-d10-x1e6-k21
1.3G    .sbt.genbank-fungi-d10-x1e6-k31
1.3G    .sbt.genbank-fungi-d10-x1e6-k51
2.7G    .sbt.genbank-fungi-d2-k21
1.1G    .sbt.genbank-fungi-d2-x1e4-k21
1.1G    .sbt.genbank-fungi-d2-x1e4-k31
1.1G    .sbt.genbank-fungi-d2-x1e4-k51
1.3G    .sbt.genbank-fungi-d2-x1e5-k21
1.3G    .sbt.genbank-fungi-d2-x1e5-k31
1.3G    .sbt.genbank-fungi-d2-x1e5-k51
2.7G    .sbt.genbank-fungi-d2-x1e6-k21
2.7G    .sbt.genbank-fungi-d2-x1e6-k31
2.8G    .sbt.genbank-fungi-d2-x1e6-k51
549M    .sbt.genbank-viral-d10-k21
322M    .sbt.genbank-viral-d10-x1e4-k21
328M    .sbt.genbank-viral-d10-x1e4-k31
335M    .sbt.genbank-viral-d10-x1e4-k51
334M    .sbt.genbank-viral-d10-x1e5-k21
340M    .sbt.genbank-viral-d10-x1e5-k31
348M    .sbt.genbank-viral-d10-x1e5-k51
549M    .sbt.genbank-viral-d10-x1e6-k21
555M    .sbt.genbank-viral-d10-x1e6-k31
563M    .sbt.genbank-viral-d10-x1e6-k51
2.6G    .sbt.genbank-viral-d2-k21
671M    .sbt.genbank-viral-d2-x1e4-k21
675M    .sbt.genbank-viral-d2-x1e4-k31
684M    .sbt.genbank-viral-d2-x1e4-k51
726M    .sbt.genbank-viral-d2-x1e5-k21
733M    .sbt.genbank-viral-d2-x1e5-k31
741M    .sbt.genbank-viral-d2-x1e5-k51
2.6G    .sbt.genbank-viral-d2-x1e6-k21
2.6G    .sbt.genbank-viral-d2-x1e6-k31
2.6G    .sbt.genbank-viral-d2-x1e6-k51
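
To compare these sizes across parameters, the listing can be parsed into records. A small sketch, assuming the lines are `du -h` style output (suffixes as powers of 1024) and that directory names follow the `.sbt.genbank-<domain>-d<N>[-x<size>]-k<K>` pattern seen above; `parse_line` is a hypothetical helper.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
PREFIX = ".sbt.genbank-"


def parse_size(text):
    """Convert a human-readable size such as '1.3G' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def parse_line(line):
    """Split one listing line into domain, d, x (None if absent), k, and bytes."""
    size, name = line.split()
    if not name.startswith(PREFIX):
        raise ValueError("unexpected directory name: " + name)
    domain, *fields = name[len(PREFIX):].split("-")
    record = {"domain": domain, "d": None, "x": None, "k": None,
              "bytes": parse_size(size)}
    for field in fields:
        key, value = field[0], field[1:]
        record[key] = float(value) if key == "x" else int(value)
    return record


if __name__ == "__main__":
    d2 = parse_line("623M    .sbt.genbank-archaea-d2-x1e6-k21")
    d10 = parse_line("173M    .sbt.genbank-archaea-d10-x1e6-k21")
    print(f"d2 is {d2['bytes'] / d10['bytes']:.1f}x the size of d10")
```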

ctb · 2020-07-18T14:25:37Z

This is pretty out of date with the new .sbt.zip stuff. Closing as irrelevant.

ctb changed the title ~~sourmash SBT parameter choices & consequences~~ sourmash database creation parameter choices & consequences Nov 28, 2019

luizirber mentioned this issue Dec 5, 2019: Expose an unload method for SBT nodes #784 (merged)

ctb closed this as completed Jul 18, 2020

luizirber mentioned this issue Jul 18, 2020: upgrade sourmash_databases soon (for sourmash 4.0) sourmash-bio/databases#10 (closed)
