[WIP] explore reverse indexing in LCA v2 databases #604

Closed
wants to merge 5 commits into from
8 changes: 4 additions & 4 deletions doc/classifying-signatures.md
@@ -82,13 +82,13 @@ output structured taxonomic information, and these are what you should look
to if you are interested in doing classification.

The command `lca gather` applies the `gather` algorithm to search an
-LCA database; it reports taxonomy.
+LCA database; it reports taxonomy where available.

It's important to note that taxonomy based on k-mers is very, very
specific and if you get a match, it's pretty reliable. On the
converse, however, k-mer identification is very brittle with respect
to evolutionary divergence, so if you don't get a match it may only mean
-that the particular species isn't known.
+that the particular species or genus isn't known.
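
The brittleness described in this paragraph can be illustrated with a toy example. This is a minimal sketch, not sourmash code: the helper names `kmers` and `jaccard`, the sequence, and `k = 5` are all made up for illustration (sourmash itself works with much larger k values and MinHash sketches rather than raw k-mer sets).

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


k = 5  # illustrative only; real analyses use far larger k
original = "ACGTTGCATGCCGATAGCTTAGGCTAAC"

# Introduce a single substitution (T -> C) in the middle of the sequence.
mutated = original[:14] + "C" + original[15:]

same = jaccard(kmers(original, k), kmers(original, k))
diverged = jaccard(kmers(original, k), kmers(mutated, k))

# One substitution can remove up to k of the original k-mers at once,
# because every k-mer overlapping that position changes.
lost = kmers(original, k) - kmers(mutated, k)

print(same, diverged, len(lost))
```

A shared k-mer is strong evidence of relatedness, but each point difference wipes out every k-mer that spans it, so similarity drops quickly as sequences diverge.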

## Abundance weighting

@@ -109,9 +109,9 @@ We suggest the following approach:
* build some signatures and do some searches, to get some basic familiarity
with sourmash;

-* explore the available databases;
+* explore the available databases using `search` and `gather`;

-* then ask questions [via the issue tracker](https://github.com/dib-lab/sourmash/issues) and we will do our best to help you out!
+* then ask questions [via the issue tracker](https://github.com/dib-lab/sourmash/issues) and we will do our best to help you out with your specific research question!

This helps us figure out what people are actually interested in doing, and
any help we provide via the issue tracker will eventually be added into the
1 change: 1 addition & 0 deletions sourmash/lca/__init__.py
@@ -1,4 +1,5 @@
from .command_index import index
+from .command_revindex import revindex
from .command_classify import classify
from .command_summarize import summarize_main
from .command_rankinfo import rankinfo_main
2 changes: 2 additions & 0 deletions sourmash/lca/__main__.py
@@ -6,6 +6,7 @@
import argparse

from . import classify, index, summarize_main, rankinfo_main, gather_main
+from . import revindex
from .command_compare_csv import compare_csv
from ..logging import set_quiet, error

@@ -31,6 +32,7 @@ def main(sysv_args):

     commands = {'classify': classify,
                 'index': index,
+                'revindex': revindex,
                 'summarize': summarize_main,
                 'rankinfo': rankinfo_main,
                 'gather': gather_main,
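
For readers unfamiliar with this pattern, the hunk above registers the new subcommand in a name-to-function table. The following is a minimal, self-contained sketch of that kind of dictionary dispatch. It is not the code from `sourmash/lca/__main__.py`: the stand-in `index` and `revindex` functions and the `dispatch` helper are hypothetical and exist only to make the example runnable.

```python
# Hypothetical stand-ins for the real subcommand entry points.
def index(args):
    return ('index', args)


def revindex(args):
    return ('revindex', args)


# Name -> callable table; adding a subcommand means adding one entry.
commands = {'index': index,
            'revindex': revindex}


def dispatch(argv):
    """Look up argv[0] in the table and call it with the remaining args."""
    if not argv or argv[0] not in commands:
        raise ValueError('unknown subcommand; choose from: '
                         + ', '.join(sorted(commands)))
    return commands[argv[0]](argv[1:])


print(dispatch(['revindex', 'db.lca.json']))
```

With this layout, wiring in a new subcommand needs only the import and one dictionary entry, which matches the two-line change shown in this file.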
7 changes: 6 additions & 1 deletion sourmash/lca/command_gather.py
@@ -28,7 +28,9 @@ def format_lineage(lineage_tup):
     present = [ l.rank for l in lineage_tup if l.name ]
     d = dict(lineage_tup) # rank: value

-    if 'genus' in present:
+    if not present:
+        name = '- (match w/no lineage assignment)'
+    elif 'genus' in present:
         genus = d['genus']
         if 'strain' in present:
             name = d['strain']
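
The new first branch handles matches whose lineage has no named ranks. The sketch below shows that guard in isolation. It is a simplified illustration, not the PR's `format_lineage`: the `LineagePair` namedtuple is assumed here to mirror the `(rank, name)` pairs the hunk iterates over, and `describe_lineage` is a hypothetical helper that returns the lowest named rank instead of reproducing the genus/strain formatting, which is not fully visible in this diff.

```python
from collections import namedtuple

# Assumed shape of one lineage entry: a (rank, name) pair.
LineagePair = namedtuple('LineagePair', ['rank', 'name'])

NO_LINEAGE = '- (match w/no lineage assignment)'


def describe_lineage(lineage_tup):
    """Return a display name, or a placeholder if no rank is named."""
    # Ranks that actually carry a name, in the order given.
    present = [pair.rank for pair in lineage_tup if pair.name]
    if not present:
        return NO_LINEAGE

    d = dict(lineage_tup)  # rank -> name
    # Simplification: report the last (most specific) named rank.
    return d[present[-1]]


lineage = (LineagePair('superkingdom', 'Bacteria'),
           LineagePair('genus', 'Escherichia'))

print(describe_lineage(lineage))
print(describe_lineage(()))
```

Checking for an empty `present` list first means every later branch can assume at least one named rank exists.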
@@ -96,6 +98,9 @@ def gather_signature(query_sig, dblist, ignore_abundance):
         if not assignments:
             break

+        # @CTB here would be where we could start looking at taxonomy
+        # instead of distinct signatures.
+
         # count the distinct signatures.
         counts = Counter()
         for hashval, match_set in assignments.items():
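
The body of this loop is cut off in the diff, so the following is only a sketch of the general counting idea, under the assumption that each hash value contributes one count to every signature in its match set. The `assignments` data and the signature labels are invented for illustration.

```python
from collections import Counter

# Hypothetical data: hash value -> set of signatures containing that hash.
assignments = {
    101: {'sigA', 'sigB'},
    202: {'sigA'},
    303: {'sigA', 'sigC'},
}

# Count how many query hashes each signature accounts for.
counts = Counter()
for hashval, match_set in assignments.items():
    for match in match_set:
        counts[match] += 1

# The signature covering the most hashes is the strongest candidate.
best_match, n_hashes = counts.most_common(1)[0]
print(best_match, n_hashes)
```

The `@CTB` comment added in this hunk suggests that a taxonomy-based variant would aggregate at this same point, keyed on lineage rather than on individual signatures.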