From ccb1d491807767ab9d3c9b005ff21ac953c34883 Mon Sep 17 00:00:00 2001
From: Colton Baumler <63077899+ccbaumler@users.noreply.github.com>
Date: Wed, 17 Jan 2024 12:26:37 -0800
Subject: [PATCH] [MRG] Adding threshold-bp and scaled relationship to faqs (#2930)

This pull request is in response to, and fixes, #2929.

I have added a short dialogue about threshold-bp, as referenced in other
documents, as well as how I understand its function.
---
 doc/faq.md | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/doc/faq.md b/doc/faq.md
index bafbcd72ee..df17c56726 100644
--- a/doc/faq.md
+++ b/doc/faq.md
@@ -141,6 +141,25 @@ them with a lower scaled value.
 
 Please also see [What resolution should my signatures be?](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-create-them).
 
+## What threshold-bp value should I use with `sourmash prefetch` and `sourmash gather`?
+
+The parameter `--threshold-bp` sets the minimum estimated overlap for reporting
+a match, in both the `gather` and `prefetch` commands. The default is 50kb, and
+this works well for microbial-genome-scale work, where the genomes are often
+quite large (one or more megabases).
+
+In case you need more sensitivity, setting `--threshold-bp=0` will return any
+match that shares at least one hash. This will also increase potential
+false positives, however.
+
+We have found a good intermediate threshold is 3 times the `scaled` value, e.g.
+`--threshold-bp=3000` for a scaled value of 1000. This requires at least three
+overlapping hashes before a match is reported. If you are using a lower scaled
+value (a higher density sketch) because you are looking for matches between
+shorter sequences, then setting threshold-bp to 3 times that scaled value will
+take advantage of the increased sensitivity to short matches without introducing
+more false positives.
+
 ## How do k-mer-based analyses compare with read mapping?
 
 tl;dr very well! But it's a bit one sided: if k-mers match, reads will