[MRG] Adding threshold-bp and scaled relationship to faqs #2930

Merged: 2 commits, Jan 17, 2024
19 changes: 19 additions & 0 deletions doc/faq.md
@@ -141,6 +141,25 @@ them with a lower scaled value.

Please also see [What resolution should my signatures be?](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-create-them).

## What threshold-bp value should I use with `sourmash prefetch` and `sourmash gather`?

The parameter `--threshold-bp` sets the minimum estimated overlap for reporting
a match, in both the `gather` and `prefetch` commands. The default is 50kb, and
this works well for microbial-genome-scale work, where the genomes are often
quite large (one or more megabases).

If you need more sensitivity, setting `--threshold-bp=0` will report any
match that shares at least one hash. However, this also increases the number
of potential false positives.

We have found that a good intermediate threshold is 3 times the `scaled` value,
e.g. `--threshold-bp=3000` for a scaled value of 1000. Because each hash
represents roughly `scaled` base pairs of sequence, this requires at least
three overlapping hashes before a match is reported. If you are using a lower
scaled value (a higher density sketch) because you are looking for matches
between shorter sequences, then setting `--threshold-bp` to 3 times that scaled
value will take advantage of the increased sensitivity to short matches without
introducing more false positives.
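
The arithmetic behind this rule of thumb can be sketched in a few lines of
Python. This is only an illustration, not sourmash's own code: the helper names
`min_overlapping_hashes` and `suggested_threshold_bp` are hypothetical, and the
sketch assumes each hash stands for roughly `scaled` base pairs and that a
match always needs at least one shared hash.

```python
import math


def min_overlapping_hashes(threshold_bp: int, scaled: int) -> int:
    """Approximate number of shared hashes implied by a threshold-bp value.

    Assumption: one hash represents about `scaled` base pairs, and a
    match always needs at least one shared hash (the threshold-bp=0 case).
    """
    if scaled <= 0:
        raise ValueError("scaled must be a positive integer")
    if threshold_bp < 0:
        raise ValueError("threshold_bp must be non-negative")
    return max(1, math.ceil(threshold_bp / scaled))


def suggested_threshold_bp(scaled: int, min_hashes: int = 3) -> int:
    """Threshold-bp value that requires `min_hashes` shared hashes."""
    if scaled <= 0 or min_hashes <= 0:
        raise ValueError("scaled and min_hashes must be positive integers")
    return scaled * min_hashes


if __name__ == "__main__":
    # Compare the intermediate threshold across a few scaled values.
    for scaled in (1000, 100, 10):
        bp = suggested_threshold_bp(scaled)
        hashes = min_overlapping_hashes(bp, scaled)
        print(f"scaled={scaled}: threshold-bp={bp} ({hashes} hashes)")
```

Under these assumptions, the 50kb default at a scaled value of 1000 corresponds
to about 50 shared hashes, which is why a lower threshold helps when matches
between shorter sequences are expected.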

## How do k-mer-based analyses compare with read mapping?

tl;dr very well! But it's a bit one sided: if k-mers match, reads will