sourmash database creation parameter choices & consequences #778

Closed
ctb opened this issue Nov 27, 2019 · 9 comments

ctb commented Nov 27, 2019

A few notes from things posted to Slack:

  • d10 SBTs are a lot slower than d2 SBTs.

for fungi, d2 search took 8 seconds; d10 search took 35 seconds.

presumably this is because when you’re weeding out false hits beneath a node, you have to load an average of d/2 nodes to find the right one, or some such.

  • I also see really dramatic decreases in search time for larger bloom filters (like, duh):

    -x 1e4: 74.05 s user time
    -x 1e5: 28.87 s user time
    -x 1e6: 8.54 s user time

all with d2 for fungi alone.

One of the big obstacles to using larger bloom filters here is that we want to compress them on disk, b/c otherwise they get way too big. I assume that the new buffer-based bloom filter stuff in Rust allows loading from gzipped files?
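For intuition on why larger filters search faster: a fuller Bloom filter returns more false positives, so more subtrees get explored for nothing. Below is a rough sketch using the textbook false-positive estimate for a filter split into independent tables. The number of tables and the 5,000-hash node are assumptions for illustration, not values taken from sourmash or from the runs above.

```python
import math


def bloom_fpr(n_items, table_size, n_tables=1):
    """Approximate false-positive rate of a Bloom filter.

    Assumes n_tables independent tables of table_size bits each, with every
    item setting one bit per table (the usual textbook approximation).
    """
    occupancy = 1.0 - math.exp(-n_items / table_size)
    return occupancy ** n_tables


if __name__ == "__main__":
    n_items = 5000  # hypothetical number of hashes under one internal node
    for size in (1e4, 1e5, 1e6):
        print(f"-x {size:g}: estimated FPR {bloom_fpr(n_items, size):.4f}")
```

The estimate only says that the false-positive rate falls as the table grows; how that translates into wall-clock time depends on tree shape and I/O.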

ctb changed the title from "sourmash SBT parameter choices & consequences" to "sourmash database creation parameter choices & consequences" on Nov 28, 2019

ctb commented Nov 28, 2019

Built the LCA DB for bacteria! Requires 18 GB of RAM to load, it looks like, and about 5 minutes to search for a single genome. But it's a small file at only 405MB! Yay?

% /usr/bin/time sourmash gather $Y outputs/lca/scaled/genbank-bacteria-k31-scaled10k.lca.json.gz
== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: AAEZMM010000001.1 Salmonella e... (k=31, DNA)
loaded 1 databases.


overlap     p_query p_match avg_abund
---------   ------- ------- ---------
4.7 Mbp      100.0%  100.0%       1.0    AAADRZ010000001.1 Salmonella enterica...

found 1 matches total;
the recovered matches hit 100.0% of the query

316.99user 35.32system 5:50.61elapsed 100%CPU (0avgtext+0avgdata 18766200maxresident)k
1227944inputs+24outputs (51major+31793920minor)pagefaults 0swaps


ctb commented Dec 3, 2019

You've gotta love SBTs. 13 seconds to search all 500,000 bacterial genomes, in under 1 GB of RAM.

% /usr/bin/time sourmash gather $X outputs/trees/scaled/genbank-bacteria-d2-x1e5-k31.sbt.json 
== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: AAEZMM010000001.1 Salmonella e... (k=31, DNA)
loaded 1 databases.


overlap     p_query p_match avg_abund
---------   ------- ------- ---------
4.7 Mbp      100.0%  100.0%       1.0    AAEZMM010000001.1 Salmonella enterica...

found 1 matches total;
the recovered matches hit 100.0% of the query

12.86user 3.81system 0:33.25elapsed 50%CPU (0avgtext+0avgdata 1064704maxresident)k
1204000inputs+583984outputs (68major+396507minor)pagefaults 0swaps
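To compare runs like these side by side, the GNU `time` summary line can be parsed into numbers. This is a small sketch: the helper name and the returned keys are made up here, and it assumes the default GNU `time` format, where `maxresident` is reported in kilobytes.

```python
import re

# Matches the first line of GNU time's default summary output.
TIME_RE = re.compile(
    r"(?P<user>[\d.]+)user\s+(?P<system>[\d.]+)system\s+"
    r"(?P<elapsed>[\d:.]+)elapsed.*?(?P<maxrss>\d+)maxresident"
)


def parse_time_line(line):
    """Extract CPU time, elapsed time, and peak memory from a GNU time line."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError("not a GNU time summary line")
    # Elapsed time is printed as [h:]m:s, so fold the fields left to right.
    elapsed = 0.0
    for field in match.group("elapsed").split(":"):
        elapsed = elapsed * 60 + float(field)
    return {
        "user_s": float(match.group("user")),
        "system_s": float(match.group("system")),
        "elapsed_s": elapsed,
        "maxrss_gib": int(match.group("maxrss")) / 1024 ** 2,
    }


if __name__ == "__main__":
    sbt = "12.86user 3.81system 0:33.25elapsed 50%CPU (0avgtext+0avgdata 1064704maxresident)k"
    print(parse_time_line(sbt))
```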


ctb commented Dec 3, 2019

OTOH the SBT directory is 30GB uncompressed! So, um, ok.

luizirber commented

> You've gotta love SBTs. 13 seconds to search all 500,000 bacterial genomes, in under 1 GB of RAM.
>
> % /usr/bin/time sourmash gather $X outputs/trees/scaled/genbank-bacteria-d2-x1e5-k31.sbt.json
> == This is sourmash version 2.0.1. ==

And you're using sourmash 2.0.1; it should be faster in newer versions.

> loaded query: AAEZMM010000001.1 Salmonella e... (k=31, DNA)

That is a pretty small query for gather, but glad it was found quickly =]

> OTOH the SBT directory is 30GB uncompressed! So, um, ok.

Possible solutions:

  • Short term: compress internal nodes. Should work already.
  • Medium term:
    • read internal nodes from buffers, instead of the tempfile dance happening now.
      (The tempfile dance: the content of an internal node is read from storage, written to a tempfile, and then loaded into a Nodegraph from disk. This is a limitation in khmer, but the Rust Nodegraph can be loaded from a memory buffer.)
    • load SBTs from compressed files (a zipfile), without having to decompress the file we distribute.
  • Long term: Dynamically sized internal nodes (with MQF).
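As a rough illustration of the medium-term idea (reading node data straight out of a zipfile into memory, with no tempfile and no unpacking step), here is a sketch using only the Python standard library. The member names and the all-zero payload are placeholders, not the real SBT layout, and the bytes would still need to be handed to whatever Nodegraph loader accepts a buffer.

```python
import io
import zipfile


def read_member(archive, name):
    """Return one archive member as bytes without extracting it to disk."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(name)


def build_demo_archive():
    """Build a small in-memory zip that stands in for a distributed SBT."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("internal.0", b"\x00" * 1024)  # placeholder node payload
        zf.writestr("demo.sbt.json", "{}")         # placeholder index file
    buf.seek(0)
    return buf


if __name__ == "__main__":
    payload = read_member(build_demo_archive(), "internal.0")
    print(len(payload), "bytes read straight from the archive")
```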


luizirber commented Dec 3, 2019

Another comment: for search we only need to load each internal node once (it will never be checked again). This helps reduce total memory consumption, because we can unload the internal node after checking it. The feature/unload branch exposes this in the find function, but it needs more tests.

This is not so useful for gather, because internal nodes might be checked more than once, but it might lead to a "low-memory" mode that always unloads internal node data, or a mixed approach where we cache the internal node data for frequently accessed nodes.

A dirty version of this is in the unassigned.py I wrote for @taylorreiter, but I would rather avoid having to dig into private fields like this and have a proper method =P
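The "mixed approach" could look something like a small LRU cache in front of the node loader. This is only a sketch of the idea: `NodeCache` and its `loader` callback are hypothetical and not part of sourmash. Setting `max_nodes=0` gives the always-unload low-memory mode.

```python
from collections import OrderedDict


class NodeCache:
    """Keep at most max_nodes internal-node payloads in memory (LRU eviction)."""

    def __init__(self, loader, max_nodes):
        self.loader = loader        # callable: node_id -> payload
        self.max_nodes = max_nodes  # 0 means never keep anything
        self.loads = 0              # number of times storage was actually read
        self._data = OrderedDict()

    def get(self, node_id):
        if node_id in self._data:
            # Cache hit: mark the node as most recently used.
            self._data.move_to_end(node_id)
            return self._data[node_id]
        payload = self.loader(node_id)
        self.loads += 1
        if self.max_nodes > 0:
            self._data[node_id] = payload
            if len(self._data) > self.max_nodes:
                # Drop the least recently used node to cap memory use.
                self._data.popitem(last=False)
        return payload


if __name__ == "__main__":
    cache = NodeCache(loader=lambda node_id: bytes(16), max_nodes=2)
    for node_id in (1, 1, 2, 3, 1):
        cache.get(node_id)
    print("storage reads:", cache.loads)
```

For a single search pass the cache buys nothing, since each node is visited once; it only pays off for gather-style access, where the same internal nodes come up repeatedly.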


ctb commented Dec 4, 2019

compress internal nodes - is that Rust dependent?

luizirber commented

> compress internal nodes - is that Rust dependent?

No, it should work now (it's a khmer feature). We need to either

  1. change the indexing code to generate compressed nodes, or
  2. load an SBT, compress its internal nodes, and update the .sbt.json file (make a new command to port old SBTs?)
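To get a feel for how much compression might buy, here is a toy comparison of gzip on a sparse bitmap versus a roughly half-full one. The bitmap is a stand-in built for this example, not khmer's on-disk Nodegraph format, and the sizes are arbitrary.

```python
import gzip
import random


def sparse_bitmap(n_bits, n_set, seed=0):
    """Return n_bits // 8 bytes with n_set randomly chosen bits switched on."""
    rng = random.Random(seed)
    bits = bytearray(n_bits // 8)
    for pos in rng.sample(range(n_bits), n_set):
        bits[pos // 8] |= 1 << (pos % 8)
    return bytes(bits)


def gzip_ratio(payload):
    """Compressed size divided by raw size (smaller is better)."""
    return len(gzip.compress(payload)) / len(payload)


if __name__ == "__main__":
    for n_set in (1_000, 500_000):
        ratio = gzip_ratio(sparse_bitmap(1_000_000, n_set))
        print(f"{n_set} of 1000000 bits set: gzip ratio {ratio:.3f}")
```

Mostly empty filters shrink a lot, while filters near half occupancy barely compress at all, so the benefit depends on how full each node is.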


ctb commented Dec 6, 2019

all done; here are the sizes of the SBTs.

173M    .sbt.genbank-archaea-d10-k21
102M    .sbt.genbank-archaea-d10-x1e4-k21
102M    .sbt.genbank-archaea-d10-x1e4-k31
102M    .sbt.genbank-archaea-d10-x1e4-k51
113M    .sbt.genbank-archaea-d10-x1e5-k21
113M    .sbt.genbank-archaea-d10-x1e5-k31
113M    .sbt.genbank-archaea-d10-x1e5-k51
173M    .sbt.genbank-archaea-d10-x1e6-k21
173M    .sbt.genbank-archaea-d10-x1e6-k31
174M    .sbt.genbank-archaea-d10-x1e6-k51
623M    .sbt.genbank-archaea-d2-k21
176M    .sbt.genbank-archaea-d2-x1e4-k21
176M    .sbt.genbank-archaea-d2-x1e4-k31
177M    .sbt.genbank-archaea-d2-x1e4-k51
222M    .sbt.genbank-archaea-d2-x1e5-k21
222M    .sbt.genbank-archaea-d2-x1e5-k31
223M    .sbt.genbank-archaea-d2-x1e5-k51
623M    .sbt.genbank-archaea-d2-x1e6-k21
623M    .sbt.genbank-archaea-d2-x1e6-k31
624M    .sbt.genbank-archaea-d2-x1e6-k51
29G     .sbt.genbank-bacteria-d2-x1e5-k21
30G     .sbt.genbank-bacteria-d2-x1e5-k31
30G     .sbt.genbank-bacteria-d2-x1e5-k51
1.3G    .sbt.genbank-fungi-d10-k21
1.1G    .sbt.genbank-fungi-d10-x1e4-k21
1.1G    .sbt.genbank-fungi-d10-x1e4-k31
1.1G    .sbt.genbank-fungi-d10-x1e4-k51
1.1G    .sbt.genbank-fungi-d10-x1e5-k21
1.1G    .sbt.genbank-fungi-d10-x1e5-k31
1.1G    .sbt.genbank-fungi-d10-x1e5-k51
1.3G    .sbt.genbank-fungi-d10-x1e6-k21
1.3G    .sbt.genbank-fungi-d10-x1e6-k31
1.3G    .sbt.genbank-fungi-d10-x1e6-k51
2.7G    .sbt.genbank-fungi-d2-k21
1.1G    .sbt.genbank-fungi-d2-x1e4-k21
1.1G    .sbt.genbank-fungi-d2-x1e4-k31
1.1G    .sbt.genbank-fungi-d2-x1e4-k51
1.3G    .sbt.genbank-fungi-d2-x1e5-k21
1.3G    .sbt.genbank-fungi-d2-x1e5-k31
1.3G    .sbt.genbank-fungi-d2-x1e5-k51
2.7G    .sbt.genbank-fungi-d2-x1e6-k21
2.7G    .sbt.genbank-fungi-d2-x1e6-k31
2.8G    .sbt.genbank-fungi-d2-x1e6-k51
549M    .sbt.genbank-viral-d10-k21
322M    .sbt.genbank-viral-d10-x1e4-k21
328M    .sbt.genbank-viral-d10-x1e4-k31
335M    .sbt.genbank-viral-d10-x1e4-k51
334M    .sbt.genbank-viral-d10-x1e5-k21
340M    .sbt.genbank-viral-d10-x1e5-k31
348M    .sbt.genbank-viral-d10-x1e5-k51
549M    .sbt.genbank-viral-d10-x1e6-k21
555M    .sbt.genbank-viral-d10-x1e6-k31
563M    .sbt.genbank-viral-d10-x1e6-k51
2.6G    .sbt.genbank-viral-d2-k21
671M    .sbt.genbank-viral-d2-x1e4-k21
675M    .sbt.genbank-viral-d2-x1e4-k31
684M    .sbt.genbank-viral-d2-x1e4-k51
726M    .sbt.genbank-viral-d2-x1e5-k21
733M    .sbt.genbank-viral-d2-x1e5-k31
741M    .sbt.genbank-viral-d2-x1e5-k51
2.6G    .sbt.genbank-viral-d2-x1e6-k21
2.6G    .sbt.genbank-viral-d2-x1e6-k31
2.6G    .sbt.genbank-viral-d2-x1e6-k51


ctb commented Jul 18, 2020

This is pretty out of date with the new .sbt.zip stuff. Closing as irrelevant.
