
consider adding --max-containment option for some sourmash utilities #1343

Closed
bluegenes opened this issue Feb 20, 2021 · 3 comments · Fixed by #1346

Comments

@bluegenes
Contributor

bluegenes commented Feb 20, 2021

Given the work we've been doing with maximum containment --> genome distance metrics, there are some places where it might be useful to return the maximum containment between signatures, rather than directional containment. This could be enabled with a --max-containment option for certain commands.

Cases I can think of:

  • compare
  • search -- e.g. my current desired use case is sourmash search --max-containment --best-only to find the best match for an input sig (or list of input sigs) to a list of cluster founders.

--max-containment is likely not useful for gather-style metagenome applications, primarily because the direction of containment there (intersection / reference genome hashes) is already ideal.
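For anyone following along, here is a minimal sketch of the difference between directional containment and max containment, using plain Python sets as stand-ins for sketch hashes. The function names and the toy hash values are assumptions for illustration only; this is not the sourmash API.

```python
def containment(a: set, b: set) -> float:
    """Fraction of a's hashes that are also in b (directional)."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


def max_containment(a: set, b: set) -> float:
    """Larger of the two directional containments; symmetric in a and b."""
    return max(containment(a, b), containment(b, a))


# Toy hash sets standing in for two sketches (made-up values).
founder = set(range(100))    # 100 hashes
query = set(range(90, 110))  # 20 hashes, 10 of them shared with founder

print(containment(query, founder))      # 10 shared / 20 query hashes
print(containment(founder, query))      # 10 shared / 100 founder hashes
print(max_containment(query, founder))  # same value in either argument order
```

Because the result does not depend on which signature is treated as the query, it suits the "best match to a cluster founder" case, where neither direction is privileged.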

@ctb
Contributor

ctb commented Feb 22, 2021

it may not be possible to do this efficiently on SBTs - thinking out loud,

  • SBTs provide Jaccard similarity and containment searches for a query Q in a database D
  • can there be a match S in D that has low Jaccard similarity and containment but high max containment? this would be a match that cannot be found using current SBT, but would be reported for high max containment.
  • I can't think about that clearly, but maybe a way to rephrase it is to ask about doing containment searches (Q in D) and (all d in D against Q) and taking the best match?
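A toy calculation suggests the answer to the second bullet is yes, at least arithmetically: a small S that sits entirely inside a large Q scores low on Jaccard and on containment of Q in S, but has max containment 1.0. The sketch below works from set sizes only; the `metrics` helper and the sizes are hypothetical, and it assumes "containment" means shared hashes divided by query size. It says nothing about how the SBT prunes its search.

```python
def metrics(shared: int, q_size: int, s_size: int) -> dict:
    """Three scores computed from the same shared-hash count."""
    union = q_size + s_size - shared
    return {
        "jaccard": shared / union,
        "containment_q_in_s": shared / q_size,
        "max_containment": shared / min(q_size, s_size),
    }


# Hypothetical sizes: a large query Q and a small match S fully inside Q.
m = metrics(shared=50, q_size=1000, s_size=50)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Only the denominator changes between the three scores; the shared-hash count in the numerator is the same for all of them.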

perhaps this issue on reverse containment is related? #1198

@ctb
Contributor

ctb commented Feb 22, 2021

so uh that was silly of me,

yes, this is straightforward. the SBT guarantees results by largest number of hashes shared; max containment would just be changing the denominator.

@ctb
Contributor

ctb commented Feb 23, 2021

so, uh, not quite as straightforward as I'd hoped, but we'll see :)
