Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"reverse" containment #1198

Closed
phiweger opened this issue Sep 22, 2020 · 26 comments
Closed

"reverse" containment #1198

phiweger opened this issue Sep 22, 2020 · 26 comments
Labels
code herein lies code plugin_todo Write a plugin for this! python Pull requests that update Python code

Comments

@phiweger
Copy link

I recently had this use case: I have a large database of say phages, and I want to know if they are contained in my query genome.

Ideally, I'd like to index them in an SBT and then search using containment but currently this is not possible right? Bc/

sourmash search --containment genome index asks whether the genome is contained in records from the index. Is there a way to efficiently do the reverse, ie sourmash search --containment index genome?

Thanks!

@ctb
Copy link
Contributor

ctb commented Sep 23, 2020

hi @phiweger, no, nothing is available via the command line, I'm afraid. It's fairly straightforward via the API, tho.

I'm wondering if sourmash search --containment genome index --threshold=0 would end up giving you the right candidates? You'd have to post process the results to sort different is all... hmm.

@phiweger
Copy link
Author

how would you solve this using the API (I just need a pointer to the main functions to look at)?

@ctb
Copy link
Contributor

ctb commented Sep 23, 2020 via email

@ctb
Copy link
Contributor

ctb commented Sep 25, 2020

This at least executes 😁

file reverse-gather.py:

#! /usr/bin/env python
import sys
import sourmash
from sourmash.sourmash_args import load_file_as_signatures, load_dbs_and_sigs
import argparse


def main():
    p = argparse.ArgumentParser()
    p.add_argument('many_sigs', help='collection to query')
    p.add_argument('dbs', help='databases to query', nargs='+')
    args = p.parse_args()

    query_list = list(load_file_as_signatures(args.many_sigs))
    # grab first query to use for load_dbs_and_sigs
    first_query = query_list[0]

    dbs = load_dbs_and_sigs(args.dbs, first_query, True)

    for query_sig in query_list:
        results = []
        for db, filename, dbtype in dbs:
            results.extend(db.gather(query_sig))

        results.sort(reverse=True)

        print('query:', query_sig.name())
        if not results:
            print('  ** no matches **')
        else:
            for (score, match_sig, filename) in results:
                print(f'   {score:.3f} {match_sig.name()} in {filename}')
    
    return 0


if __name__ == '__main__':
    sys.exit(main())

@ctb ctb added code herein lies code python Pull requests that update Python code labels Sep 25, 2020
@phiweger
Copy link
Author

Oh wow, thank you very much. I'll give this a spin and report back!

@ctb
Copy link
Contributor

ctb commented Sep 26, 2020 via email

@phiweger
Copy link
Author

@ctb is many_sigs a single file with multiple signatures and dbs an SBT?

@phiweger
Copy link
Author

Ah yes, just did tested this empirically ;) It works! Thanks a lot, this will be very useful to me.

@ctb
Copy link
Contributor

ctb commented Sep 28, 2020

Excellent!

This script is using some of the newer API calls behind the following statement here in the docs --

Most commands will load signatures automatically from indexed databases (SBT and LCA formats) as well as from signature files, and you can load signatures from stdin using - on the command line.

so you should be able to use any database or collection of sigs as input for many_sigs. Glad it's working!

@ctb
Copy link
Contributor

ctb commented Sep 28, 2020

(and glad it's helpful!)

@phiweger phiweger reopened this Oct 28, 2020
@phiweger
Copy link
Author

Sorry to bother, I initially commented out a line so for completeness there seems to be a minor issue:

python reverse-gather.py some.sig some.sbt.zip

Traceback (most recent call last):
    results.sort(reverse=True)
TypeError: '<' not supported between instances of 'SourmashSignature' and 'SourmashSignature'

Commenting out results.sort(reverse=True) makes the script run, and it seems from eyeballing the results already are sorted?

Thanks @ctb

@phiweger
Copy link
Author

Does db.gather(query_sig) use the "winner-takes-all-kmers" heuristic? If so, I guess reordering the result is not necessary.

@ctb
Copy link
Contributor

ctb commented Oct 28, 2020

the results are sorted for a single database, but not when retrieved for multiple databases, I think.

Not sure why the sort isn't working... OH! It's because there are conditions where sort is looking at the second element of the list (because the containment, the first element, is equal). If you do something like

results.sort(key=lambda x:x[0], reverse=True)

that should fix it.

@olgabot
Copy link
Collaborator

olgabot commented Nov 3, 2020

👋 Hello! I'm also interested in a CLI-level usage. Here's the use case:

  • Build SBT db of single cells
  • Query with bulk RNA-seq signature --> decompose/deconvolve the bulk signature into putative single cell types

Is the Python API level the main way to go right now?

@ctb
Copy link
Contributor

ctb commented Nov 3, 2020

for now, yes.

I'd suggest writing/modifying a script (and posting it here :), and then updating it as you find it does (or doesn't) suite your needs - then we can make it more generic and include it in sourmash proper!

@ctb
Copy link
Contributor

ctb commented Nov 3, 2020

personally I'm a fan of the CLI usage in the prefetch script posted in #1126:

cmd.py --query <list of query filenames and indices> --db <list of db filenames and indices>

as it gives you a lot of flexibility. That code also makes full use of the load_file_as_signatures API which I think is the right way to go.

@ctb
Copy link
Contributor

ctb commented Mar 6, 2021

note that sourmash prefetch looks like it might give you the necessary info at the CLI - see #1370.

@phiweger
Copy link
Author

phiweger commented Mar 17, 2022

@ctb db.gather() in the script above only returns the best hit -- what is the equivalent to return all hits ("recursive gather")?

I tried

cnt = db.counter_gather(first_query, 0)
assert len(cnt.counter) == len(cnt.siglist)

and get something that looks like what I am looking for. Am I correct to think that .counter gives me the number of shared hashes and the corresponding index in .siglist is the matching signature? I so, I can calculate

threshold_bp = count * scaled, correct?

@ctb
Copy link
Contributor

ctb commented Mar 17, 2022

hi @phiweger please try this, which makes use of APIs available in sourmash 4 -

#! /usr/bin/env python
import sys
import sourmash
from sourmash.sourmash_args import load_file_as_signatures, load_dbs_and_sigs
import argparse


def main():
    p = argparse.ArgumentParser()
    p.add_argument('many_sigs', help='collection to query')
    p.add_argument('dbs', help='databases to query', nargs='+')
    args = p.parse_args()

    query_list = list(load_file_as_signatures(args.many_sigs))
    # grab first query to use for load_dbs_and_sigs
    first_query = query_list[0]

    dbs = load_dbs_and_sigs(args.dbs, first_query, False)

    for query_sig in query_list:
        results = []
        for db in dbs:
            results.extend(db.prefetch(query_sig, threshold_bp=0))

        print(results[:5])

        results.sort(reverse=True, key=lambda x: x.score)

        print('query:', query_sig.name)
        if not results:
            print('  ** no matches **')
        else:
            for sr in results:
                print(f'   {sr.score:.3f} {sr.signature.name} in {sr.location}')
    
    return 0


if __name__ == '__main__':
    sys.exit(main())

@ctb
Copy link
Contributor

ctb commented Mar 17, 2022

note that if you run this with a set of query signatures with different k-mer or scaled or whatnot, you'll get an incompatible MinHash error somewhere in there, because the script doesn't make sure that all query sketches are compatible. I can fix this up if it's important, but it adds a lot of boilerplate code that's annoying to look at so I left it out for now :)

@phiweger
Copy link
Author

Thanks! Prefetch will treat each sig independently correct? Like, say sig A has 10 hashes in common w/ my query ("QandA") and sig B has 15 in common ("QandB"), then the intersection of QandA and QandB is not necessarily empty. Like, for recursive gather, I have to take the top prefetch hit, remove the hashes from the query, then repeat?

@ctb
Copy link
Contributor

ctb commented Mar 17, 2022

yes, correct.

we do have internal APIs for gather that you can use if you want to do the min-set-cov/gather; lmk.

@phiweger
Copy link
Author

Would be really great if you could point me to some code to implement this/ not reinvent the wheel. Thx a lot!

@ctb
Copy link
Contributor

ctb commented Mar 20, 2022

try out the below!

I'm not sure it does what you want - it does treat each query independently - but hopefully the code is reasonably clear as to what it's actually doing :)

#! /usr/bin/env python
import sys
import sourmash
from sourmash.sourmash_args import load_file_as_signatures, load_dbs_and_sigs
from sourmash.search import GatherDatabases
import argparse


def main():
    p = argparse.ArgumentParser()
    p.add_argument('many_sigs', help='collection to query')
    p.add_argument('dbs', help='databases to query', nargs='+')
    args = p.parse_args()

    query_list = list(load_file_as_signatures(args.many_sigs))
    # grab first query to use for load_dbs_and_sigs
    first_query = query_list[0]

    dbs = load_dbs_and_sigs(args.dbs, first_query, False)

    for query_sig in query_list:
        results = []
        for gather_result, weighted_missed in GatherDatabases(query_sig, dbs,
                                                              threshold_bp=0):

            results.append(gather_result)

        print(results[:5])

        # see src/sourmash/search.py, GatherResult namedtuple
        results.sort(reverse=True, key=lambda x: x.intersect_bp)

        print('query:', query_sig.name)
        if not results:
            print('  ** no matches **')
        else:
            for gr in results:
                print(f'   {gr.intersect_bp} {gr.match.name}')
    
    return 0


if __name__ == '__main__':
    sys.exit(main())

@phiweger
Copy link
Author

thanks a lot @ctb I will report

@ctb ctb added the plugin_todo Write a plugin for this! label Sep 23, 2023
@ctb
Copy link
Contributor

ctb commented Feb 5, 2024

I think the new plugin #2970 provides the necessary functionality in a nicely packaged way. 🎉

@ctb ctb closed this as completed Feb 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code herein lies code plugin_todo Write a plugin for this! python Pull requests that update Python code
Projects
None yet
Development

No branches or pull requests

3 participants