plugin: sourmash_plugin_containment_search #2970

Open
ctb opened this issue Feb 4, 2024 · 0 comments
Labels
plugin a plugin for sourmash!


ctb commented Feb 4, 2024

https://github.com/sourmash-bio/sourmash_plugin_containment_search/

From the README:

sourmash_plugin_containment_search: improved containment search for genomes in metagenomes

This plugin adds a command, sourmash scripts mgsearch, that
provides new & nicer output for searching genomes against metagenomes.

Installation

pip install sourmash_plugin_containment_search

Usage

This command:

sourmash scripts mgsearch query.sig metagenome.sig [ metagenome2.sig ...] \
    [ -o output.csv ]

will search for the query genome query.sig in one or more
metagenome.sig files, producing decent human-readable output and
(optionally) useful CSV outputs.

For example,

sourmash scripts mgsearch ../sourmash/podar-ref/0.fa.sig ../sourmash/SRR606249.trim.k31.sig.gz

produces:

Loaded query signature: CP001472.1 Acidobacterium capsulatum ATCC 51196, com...

p_genome avg_abund   p_metag   metagenome name
-------- ---------   -------   ---------------
 100.0%    55.4         3.1%   SRR606249

This plugin will work with all the standard sourmash database types, too.

Note that the metagenomes must have been sketched with -p abund.
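
As a rough illustration of why abundance information is needed, here is a minimal sketch of how the three output columns could be computed from toy data. This is not the plugin's code. It assumes that p_genome is the fraction of query hashes found in the metagenome, avg_abund is the mean metagenome abundance of those shared hashes, and p_metag is the abundance-weighted fraction of the metagenome that the shared hashes account for. The function name and the plain set/dict inputs are hypothetical stand-ins for real sketches.

```python
def containment_stats(query_hashes, metagenome_abunds):
    """Toy containment statistics for one query genome vs. one metagenome.

    query_hashes: set of integer hash values from the query sketch.
    metagenome_abunds: dict mapping hash value -> abundance in the metagenome
        (this is the information that sketching with abundance tracking keeps).

    Returns (p_genome, avg_abund, p_metag) under the assumed definitions
    given above; the real plugin may define these columns differently.
    """
    if not query_hashes or not metagenome_abunds:
        return 0.0, 0.0, 0.0

    # Hashes present in both the query genome and the metagenome.
    shared = {h for h in query_hashes if h in metagenome_abunds}
    if not shared:
        return 0.0, 0.0, 0.0

    # Fraction of the genome that is found in the metagenome.
    p_genome = len(shared) / len(query_hashes)

    # Total and mean abundance of the shared hashes in the metagenome.
    shared_abund = sum(metagenome_abunds[h] for h in shared)
    avg_abund = shared_abund / len(shared)

    # Abundance-weighted fraction of the metagenome explained by the genome.
    p_metag = shared_abund / sum(metagenome_abunds.values())

    return p_genome, avg_abund, p_metag


# Toy example: every query hash is present, but most of the metagenome's
# weighted content comes from a hash that the query does not have.
query = {1, 2, 3, 4}
metagenome = {1: 10, 2: 20, 3: 30, 4: 40, 5: 900}
p_genome, avg_abund, p_metag = containment_stats(query, metagenome)
print(f"{p_genome:.1%} {avg_abund:.1f} {p_metag:.1%}")
```

Without abundances, only the first column could be computed; the other two depend on how often each shared hash occurs in the metagenome.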

Backstory: Why this command?

sourmash search supports sample x sample search broadly, perhaps too
broadly, and its output formats aren't that helpful.

sourmash prefetch supports metagenome overlap search against many
genomes, which is the reverse of this use case. Moreover,
prefetch doesn't provide weighted results,
and its output isn't friendly.

sourmash gather has friendly and useful output, but can't be used to
calculate the overlap between a single query genome and many subject
metagenomes.

There is also some interest in
reverse containment search.

The manysearch command of
the sourmash branchwater plugin
also does a nice containment search like this plugin, but it doesn't
provide nice human-readable output and it also doesn't provide
weighted results. (manysearch is, however, much lower memory &
probably a fair bit faster because it's mostly in Rust.)

Advanced info: implementation details

This command is streaming, in the sense that it will load each
metagenome, calculate the match, and then discard the metagenome.
Hence its peak memory usage should be driven by the size of the query
plus the size of the largest metagenome.
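
The streaming pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the plugin's implementation: the function name, the (name, loader) pairs, and the dict-based metagenomes are all hypothetical, and real sketches would be loaded from signature files or databases.

```python
def stream_search(query_hashes, metagenome_loaders):
    """Yield (name, containment) for each metagenome, one at a time.

    query_hashes: set of integer hash values for the query genome.
    metagenome_loaders: iterable of (name, load) pairs, where load() is a
        hypothetical callable returning a dict of hash -> abundance.

    Only the query and the current metagenome are held in memory at once.
    """
    for name, load in metagenome_loaders:
        abunds = load()  # load one metagenome

        # Fraction of query hashes found in this metagenome.
        n_shared = sum(1 for h in query_hashes if h in abunds)
        containment = n_shared / len(query_hashes) if query_hashes else 0.0

        del abunds  # discard the metagenome before moving to the next one
        yield name, containment


# Toy usage: two small "metagenomes" loaded lazily.
loaders = [
    ("mg1", lambda: {1: 1, 2: 1}),
    ("mg2", lambda: {3: 5}),
]
for name, containment in stream_search({1, 2, 3, 4}, loaders):
    print(name, containment)
```

Because results are yielded as they are computed, output for early metagenomes can be reported before later ones are loaded.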

@ctb ctb added the plugin a plugin for sourmash! label Feb 4, 2024
@ctb ctb changed the title New plugin: sourmash_plugin_containment_search plugin: sourmash_plugin_containment_search Feb 4, 2024