From 5e6fdb9c5a527c866b371a230dfe49c7d1811011 Mon Sep 17 00:00:00 2001
From: "C. Titus Brown"
Date: Sun, 4 Feb 2024 21:07:21 -0800
Subject: [PATCH] MRG: update the CLI docs and help for `search --containment` and `prefetch` (#2971)

Adds useful information about the order of containment searches:
* `search --containment A B` reports A contained in B;
* `prefetch A B` reports B contained in A;

Fixes https://github.com/sourmash-bio/sourmash/issues/2968.
---
 doc/command-line.md        | 12 +++++++++++-
 src/sourmash/cli/search.py |  3 +++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/doc/command-line.md b/doc/command-line.md
index ee5421a6be..4697797e1b 100644
--- a/doc/command-line.md
+++ b/doc/command-line.md
@@ -325,6 +325,13 @@ Match information can be saved to a CSV file with `-o/--output`; with
 `-o`, all matches above the threshold will be saved, not just those
 printed to stdout (which are limited to `-n/--num-results`).
 
+The `--containment` flag calculates the containment of the query in
+database matches; this is an asymmetric order-dependent measure,
+unlike Jaccard. Here, `search --containment Q A B C D` will report the
+containment of `Q` in each of `A`, `B`, `C`, and `D`. This is opposite
+to the order used by `prefetch`, where the composite sketch (e.g. metagenomes)
+is the query, and the matches are contained items (e.g. genomes).
+
 As of sourmash 4.2.0, `search` supports `--picklist`, to
 [select a subset of signatures to search, based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures). This
 can be used to search only a small subset of a large collection, or to
@@ -477,7 +484,10 @@ The `prefetch` subcommand searches a collection of scaled signatures for
 matches in a large database, using containment. It is similar to `search
 --containment`, while taking a `--threshold-bp` argument like `gather`
 does for thresholding matches (instead of using Jaccard
-similarity or containment).
+similarity or containment). Note that `prefetch` uses the composite
+sketch (e.g. a metagenome) as the query, and finds all matching
+subjects (e.g. genomes) from the database - the arguments are in the
+opposite order from `search --containment`.
 
 `sourmash prefetch` is intended to select a subset of a large database
 for further processing. As such, it can search very large collections
diff --git a/src/sourmash/cli/search.py b/src/sourmash/cli/search.py
index fc37367d2e..2c11873963 100644
--- a/src/sourmash/cli/search.py
+++ b/src/sourmash/cli/search.py
@@ -35,6 +35,9 @@
 
 [1] https://en.wikipedia.org/wiki/Jaccard_index
 
+When `--containment` is provided, the containment of the query in each
+of the search signatures or databases is reported.
+
 ---
 """
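
The order dependence described in this patch can be illustrated with a minimal sketch on plain Python sets. This is not sourmash's API: the `containment` and `jaccard` helpers and the toy `genome` and `metagenome` hash sets are hypothetical, chosen only to show why swapping the arguments changes a containment value but not a Jaccard value.

```python
def containment(query: set, subject: set) -> float:
    """Fraction of the query's hashes that also occur in the subject."""
    if not query:
        return 0.0
    return len(query & subject) / len(query)


def jaccard(a: set, b: set) -> float:
    """Shared hashes divided by all distinct hashes; symmetric in a and b."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy data (hypothetical): a small genome and a larger metagenome
# that share 50 hashes (the integers 50..99).
genome = set(range(0, 100))         # 100 hashes
metagenome = set(range(50, 1050))   # 1000 hashes

# Direction matters for containment: 50/100 versus 50/1000.
genome_in_metagenome = containment(genome, metagenome)
metagenome_in_genome = containment(metagenome, genome)

# Jaccard gives the same value in either argument order.
symmetric = jaccard(genome, metagenome) == jaccard(metagenome, genome)

print(genome_in_metagenome, metagenome_in_genome, symmetric)
```

Under the ordering stated in the commit message, `search --containment Q A` would correspond to `containment(Q, A)` in this sketch, while `prefetch Q A` would correspond to `containment(A, Q)`.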