adding lineage manipulation & taxonomy reporting in more places in sourmash? #969
Comments
(Sorry, wrong button)
Another example: this is the script I'm using for thesis work and 2020-cami: https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py It's using (Converting from an SBT to an LCA index like in https://github.com/luizirber/2020-cami/blob/f7f34d3903cd7d2b6bc2ac0471f7d53a42aa86b2/rules/build_indices.smk#L64L66 without depending on external code would be pretty cool. I guess the
Totally agree, and I think it also hurts functionality. Deposited genomes in GenBank or RefSeq don't change (though they might get new versions), but the accession-to-taxid mapping CAN change (see: the recent split of the Lactobacillus genus). It is so hard to deal with the NCBI taxonomy because it changes and doesn't have stable downloads for a specific date. So generating taxinfo and putting it into SBTs has the benefit of capturing the taxonomy at one point in time (and being very convenient to use), but the drawback of potentially being outdated.
Yup, but would also like to see some option to override it somehow (re: outdated taxonomies) |
OK, a few more thoughts and some summarization --
Out of scope --
|
ETE3 is a great way for filtering/manipulating/etc. NCBI taxonomy that I've used FWIW |
another piece of taxonomy related functionality - newick output, ref #915 |
random thoughts --
|
That's kind of what https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py is doing, and since the CAMI profiling output requires summarizing for each level the info is already there too (could also ask for a specific rank too, I guess) |
On Mon, May 04, 2020 at 03:25:51PM -0700, Luiz Irber wrote:
> there is definitely a Kraken-style use case of "what lineages are in this metagenome", both with and without abundance. This could be something that the taxonomy module does _after_ a gather run, and is a place where multi-level output ("just kingdom, please!" or "all the way to strain!") would be useful.
That's kind of what https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py is doing, and since the CAMI profiling output requires summarizing for each level the info is already there too (could also ask for a specific rank too, I guess)
yep - we have plenty of examples of this (sourmash lca summarize, for example).
What I'm suggesting here most specifically is that we provide scripts that
do this _after_ gather is run, and separate out the "search for genomes" from
the "summarize taxonomy" functionality in a way that currently is only done
in ad hoc scripts in other projects. I have a prototype of this myself, over
in charcoal.
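A minimal sketch of what "summarize taxonomy after gather" could look like, not the charcoal prototype or any existing sourmash command. It assumes the gather CSV has `name` and `f_unique_to_query` columns, that the identifier is the first space-separated token of `name`, and that lineages arrive as a plain dict; `RANKS`, `summarize_at_rank`, and the toy data are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Assumed rank order; a real taxonomy may use more or fewer ranks.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def summarize_at_rank(gather_rows, ident_to_lineage, rank):
    """Sum f_unique_to_query per lineage, truncating each lineage at `rank`."""
    depth = RANKS.index(rank) + 1
    totals = defaultdict(float)
    for row in gather_rows:
        # Assumption: the identifier is the first token of the match name.
        ident = row["name"].split(" ")[0]
        lineage = ident_to_lineage.get(ident)
        key = ("unassigned",) if lineage is None else tuple(lineage[:depth])
        totals[key] += float(row["f_unique_to_query"])
    return dict(totals)


# Toy gather output and lineages, for illustration only.
GATHER_CSV = """name,f_unique_to_query
GCF_1 Escherichia coli,0.5
GCF_2 Shigella flexneri,0.25
GCF_3 unknown,0.125
"""
ENTERO = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
          "Enterobacterales", "Enterobacteriaceae"]
IDENT_TO_LINEAGE = {
    "GCF_1": ENTERO + ["Escherichia", "Escherichia coli"],
    "GCF_2": ENTERO + ["Shigella", "Shigella flexneri"],
}

gather_rows = list(csv.DictReader(io.StringIO(GATHER_CSV)))
family_summary = summarize_at_rank(gather_rows, IDENT_TO_LINEAGE, "family")
genus_summary = summarize_at_rank(gather_rows, IDENT_TO_LINEAGE, "genus")
```

Asking for `"family"` merges the two matches into one entry, while `"genus"` keeps them apart, which is the multi-level output mentioned above. Matches without a lineage are collected under `("unassigned",)` rather than dropped.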
|
But... that's what I described 🤣 This is exactly what happens in 2020-cami: I run gather first, and then use the gather output CSV to summarize taxonomy (taking a pre-calculated accession-to-taxid file for a specific database, because it takes a long time to process the full |
On Mon, May 04, 2020 at 06:56:09PM -0700, Luiz Irber wrote:
But... that's what I described :rofl:
fine, fine.
|
I've started playing around with separating revindex and lineage information, over in dib-lab/charcoal, just_taxonomy.py. I created a A few observations from this strategy -
|
here, from silva, is a nice example of multiple taxonomies being available in a database.
|
mmseqs2 does nice stuff with multiple taxonomies, it looks like - search for "taxonomy" on https://github.com/soedinglab/mmseqs2/wiki |
note gather-to-tax.py |
I was talking with @bluegenes and after checking @erikyoung85 work in #1131 and the
On the Rust side it would be something like

    use std::collections::{BTreeMap, HashMap};

    type LineagePairs = BTreeMap<String, String>;

    pub struct TaxInfo {
        lid_to_lineage: HashMap<u32, LineagePairs>,
        ident_to_lid: HashMap<String, u32>,
    }

    impl TaxInfo {
        // Look up the lineage ID for an identifier, then its lineage.
        fn get_lineage(&self, ident: &str) -> Option<LineagePairs> {
            let lid = self.ident_to_lid.get(ident)?;
            self.lid_to_lineage.get(lid).cloned()
        }
    }

And I think it can be serialized/deserialized with some processing to the current format. On the Python side it will be similar to

Feature-wise, I think it is also important to be able to override a TaxInfo on the command line (or provide a different one for Index in Python/Rust), especially since we "precompute" some desired ranks for LCA today, and they might not match other use cases (the CAMI profile format allows providing your own rank, for example). Critically, the important info in the |
I was trying to process an existing LCA index with this proposed format, and... is
Or should lineage be optional? That is OK too, I guess. The |
re #969 (comment), I am ok with doing the Rust API so that that is easy, but I would suggest getting the current functionality oxidized and merged first before a split. Otherwise you end up with too many moving parts. |
yes, lineage is optional in LCA databases. |
random aside - splitting revindex and taxonomy would mean that we can update taxonomy without updating revindex, which currently is impossible in LCA databases. |
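As a rough illustration of that aside (a toy model, not the LCA database format): if the reverse index only maps hashes to identifiers, and lineages live in a separate identifier-to-lineage table, then swapping the table changes the taxonomic answer without touching the index. All function names and data below are hypothetical.

```python
from collections import defaultdict


def build_revindex(ident_to_hashes):
    """Map each hash to the set of identifiers containing it (no taxonomy)."""
    revindex = defaultdict(set)
    for ident, hashes in ident_to_hashes.items():
        for h in hashes:
            revindex[h].add(ident)
    return dict(revindex)


def lca(lineages):
    """Longest common prefix of a collection of lineage tuples."""
    lineages = list(lineages)
    if not lineages:
        return ()
    prefix = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        prefix.append(names[0])
    return tuple(prefix)


def classify_hash(hashval, revindex, taxonomy):
    """LCA for one hash; identifiers without a lineage are skipped."""
    idents = revindex.get(hashval, set())
    return lca(taxonomy[i] for i in idents if i in taxonomy)


# Toy data: the index is built once, the taxonomy is replaced afterwards.
revindex = build_revindex({"A": {1, 2}, "B": {2, 3}})
old_taxonomy = {
    "A": ("Bacteria", "Firmicutes", "Lactobacillus"),
    "B": ("Bacteria", "Firmicutes", "Lactobacillus"),
}
new_taxonomy = {
    "A": ("Bacteria", "Firmicutes", "Lactobacillus"),
    "B": ("Bacteria", "Firmicutes", "Limosilactobacillus"),
}

before = classify_hash(2, revindex, old_taxonomy)
after = classify_hash(2, revindex, new_taxonomy)
```

Hash `2` is shared by both genomes, so after the (toy) genus split its LCA moves up one rank, while `revindex` itself is unchanged. The same pattern would allow applying two different taxonomies to the same index.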
(an interesting short research paper might be to compare and contrast gather and LCA approaches on the same database directly) (and also GTDB and NCBI taxonomy result comparisons by finding the same k-mers, then applying different taxonomies) |
We are getting closer to being able to separate out the lineage stuff from the revindex stuff in LCA, so we could envision combining searches on signatures and SBTs with taxonomic reporting (where available).
The basic idea here is that we combine taxonomy with search and gather on signatures/SBTs/LCAs, by connecting identifiers with lineages -- do dynamically what sourmash lca index does once.
This falls under use cases of taxonomic filtering and reporting.
What I don't know how to do is actually apply it in reality. A clunky (but perhaps functional) way that I'd been idly thinking of would be to build a taxonomy command-line interface that ingested the CSVs output by sourmash search and gather and did various kinds of taxonomic reporting and manipulation on them, when given a taxonomy or lineage database.
I don't really want to rewrite / extend gather to also have taxonomy... ugly complex code.
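The core of such a tool might be a join between a results CSV and a lineage spreadsheet. The sketch below is only an assumption about how that could work: the `ident` column in the lineage file, the `name` column in the results, the first-token identifier convention, and the semicolon-joined `lineage` output column are all illustrative choices, and `annotate_with_lineage` is a hypothetical function, not part of sourmash.

```python
import csv
import io


def annotate_with_lineage(results_fp, lineages_fp, out_fp, ident_column="name"):
    """Copy a search/gather CSV to out_fp, adding a 'lineage' column.

    Returns the number of rows for which no lineage was found.
    """
    # Load the lineage table: one 'ident' column, remaining columns are ranks.
    reader = csv.DictReader(lineages_fp)
    rank_columns = [c for c in reader.fieldnames if c != "ident"]
    ident_to_lineage = {}
    for row in reader:
        names = [row[c] for c in rank_columns if row[c]]
        ident_to_lineage[row["ident"]] = ";".join(names)

    results = csv.DictReader(results_fp)
    fieldnames = list(results.fieldnames) + ["lineage"]
    writer = csv.DictWriter(out_fp, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()

    n_missing = 0
    for row in results:
        # Assumption: the identifier is the first token of the match name.
        ident = row[ident_column].split(" ")[0]
        row["lineage"] = ident_to_lineage.get(ident, "")
        if not row["lineage"]:
            n_missing += 1
        writer.writerow(row)
    return n_missing


# Toy inputs, for illustration only.
results_csv = "name,f_match\nGCF_1 E. coli,0.9\nGCF_9 mystery,0.1\n"
lineages_csv = "ident,superkingdom,phylum\nGCF_1,Bacteria,Proteobacteria\n"

out = io.StringIO()
missing = annotate_with_lineage(io.StringIO(results_csv), io.StringIO(lineages_csv), out)
annotated = list(csv.DictReader(io.StringIO(out.getvalue())))
```

Because the function only reads and writes CSV, gather and search stay untouched, and a different lineage file can be supplied on each run.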
It might be possible to add taxonomy/lineage-aware selector functionality on signatures and databases, e.g. "run a gather on this database, but ignore all results that are not Archaea."
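One way to picture a lineage-aware selector, as a sketch under the same assumptions as before (identifier is the first token of `name`, lineages come from a plain dict; `make_lineage_selector` and `filter_results` are hypothetical names):

```python
def make_lineage_selector(ident_to_lineage, required_prefix):
    """Return a predicate that is True for identifiers under required_prefix."""
    required = tuple(required_prefix)

    def selector(ident):
        lineage = tuple(ident_to_lineage.get(ident, ()))
        return lineage[:len(required)] == required

    return selector


def filter_results(results, selector):
    """Keep only result rows whose identifier passes the selector."""
    return [r for r in results if selector(r["name"].split(" ")[0])]


# Toy data, for illustration only.
lineages = {
    "GCA_1": ("Archaea", "Euryarchaeota"),
    "GCF_2": ("Bacteria", "Proteobacteria"),
}
results = [
    {"name": "GCA_1 some archaeon", "f_match": 0.8},
    {"name": "GCF_2 some bacterium", "f_match": 0.6},
    {"name": "GCF_3 no lineage", "f_match": 0.1},
]

only_archaea = make_lineage_selector(lineages, ["Archaea"])
archaeal_results = filter_results(results, only_archaea)
```

Note that this filters results after the fact. Applying the same predicate to the database before running gather is a different operation and would probably give different results, since excluded genomes could no longer claim hashes.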
Should also look at sourmash lca gather to see about collapsing results taxonomically...
Thoughts welcome!
related conversations