-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implications of multiple signatures in a single .sig #1624
Comments
hi @gustavo-salazar, if you have multiple signatures in the query file, you will need to pick one using the k-mer size/moltype selectors ( if you have multiple signatures passed as subjects to search, the "correct" ksize/moltype will be picked out and used for the search. |
aah, so the idea of multiple signatures in the same file is to have multiple signatures of the same sequence file, and they are in the same file, because they are using different settings That's where I was confused I thought I could create a single file that groups signatures of multiple files. One of our goals is to allow our users to choose multiple files, e.g. all the contig files of a metagenomic sample. And what I plan to do is to create signatures for each of those files, and then run gather for each of those signatures. When I noticed the signature file was an array I thought I would be able to construct a single file and run gather only once; hence this question. Thanks for your help |
yep, that's the common use case! although there are lots of other use cases -
Yep! Note that something similar was requested in #1198 and is straightforward to do with the Python API. We don't currently support this via the default gather CLI because until recently there wasn't any way to optimize the gather search given multiple queries - 5 queries would take 5 times as long as 1 query. So it seemed like an unnecessary complication. But, with the addition of prefetch functionality in 4.1.0, we could potentially merge multiple query signatures, do a prefetch with a single pass across the database to pull out all relevant matches, and then run gather on all the results. (This is easy to do with the API.) You might look into As a side note, picklists (#1588 - see docs) and manifests (#1590) will probably both be in the next release, and this will enable support for rapid and flexible subsetting of large databases and/or results from searching them. Happy to explain the performance tradeoffs more, if and when you're interested. For an example use case for this functionality, one of my big use cases for picklists is to do taxonomic subsetting of searches, so you'd be able to say, "only search/return results for Firmicutes", and depending on the database type, you'd potentially get a much faster search of just the Firmicute signatures.
Great questions, keep on asking! I'll try to keep my answers from being too long and confusing...! |
Your answers are awesome! I just need time and brain power to process them! My current implementations is literally just calling |
👍
Right - and I'd like to continue iterating on the API to meet your needs, so that you don't end up with a boatload of custom code that then needs to be refactored to match latest/greatest sourmash code. Very happy to have this be a two way street - you tell us what API you're using and what stability guarantees you'd like, and we'll put them in our tests so that at the very least we know when we're breaking stuff that you are using (and can bump the version numbers appropriately). |
Hey again,
So I noticed that the
.sig
files can either be a single signature, or an array of signatures. What are the implications of including multiple signatures in a single file? particularly interested for thegather
command.Is it the same to run
gather
once for a file with multiple signatures, than run it for each signature? What about the CSV output? 1 csv includes the results for all the signatures or multiple files are created?The text was updated successfully, but these errors were encountered: