Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How do I build a lineage spreadsheet for GTDB taxonomy and signatures? #1095

Open
phiweger opened this issue Jul 11, 2020 · 8 comments
Open

Comments

@phiweger
Copy link

I swear I tried to find out how to do this before filing an issue :)

The GTDB taxonomy has this format:

GB_GCA_002849265.1      d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus
RS_GCF_000637235.1      d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus

I have sketched all genomes in a folder "sigs" and they are named like:

GB_GCA_002849265.1_genomic.fna.gz.sig
RS_GCF_000637235.1_genomic.fna.gz.sig

They were NOT "named" using sourmash compute --name-from-first ....

What I don't understand is what should be the accession in the LCA taxonomy file so that it is linked to the signatures in my "sigs" directory?

I tried this script which returns an LCA taxonomy file like

accession,filename,superkingdom,phylum,class,order,family,genus,species
GCF_900045825,GCF_900045825.1_genomic.fna.gz,d__Bacteria,p__Firmicutes,c__Bacilli,o__Staphylococcales,f__Staphylococcaceae,g__Staphylococcus,s__Staphylococcus aureus

and then run

sourmash lca index -C 3 --require-taxonomy --split-identifiers lca_tax.csv out --traverse-directory sigs

but

examining spreadsheet headers...
** assuming column 'accession' is identifiers in spreadsheet
145904 distinct identities in spreadsheet out of 145904 rows.
24706 distinct lineages in spreadsheet out of 145904 rows.
ERROR: no hash values found - are there any signatures?

Any help is greatly appreciated.

@ctb
Copy link
Contributor

ctb commented Jul 11, 2020

There's no good way to do this within sourmash. Suggestions welcome! I used snakemake --

https://github.com/dib-lab/sourmash_databases/blob/master/gtdb/Snakefile#L123

@ctb
Copy link
Contributor

ctb commented Jul 11, 2020

(the key is this line, which looks up the name in a spreadsheet - https://github.com/dib-lab/sourmash_databases/blob/master/gtdb/Snakefile#L129)

@phiweger
Copy link
Author

ah, and sourmash lca index matches signatures based on the "accession" column in the csv and the "name" field in the signature?

@ctb
Copy link
Contributor

ctb commented Jul 11, 2020 via email

@phiweger
Copy link
Author

well, only if you want to :)

@phiweger
Copy link
Author

I ended up just working around the problem -- I don't see a very clean API to match names from the sourmash compute step and the csv required in sourmash lca index. Probably some kind of helper script?

Code for future reference:

import json
from glob import glob
import os

from tqdm import tqdm


'''
1. Add taxonomy identifier to signature. In our case we'll use the one in
the GTDB taxonomy csv file of the form GB_GCA_002163135.1, ie just the
filename prefix.
'''

# !mkdir sigs_renamed
files = glob('sigs/*.sig')
for path in tqdm(files):
    with open(path, 'r') as file:
        sig = json.load(file)
        fn = sig[0]['filename']
        name = os.path.basename(fn).replace('_genomic.fna.gz', '')
        sig[0]['name'] = name
        with open('sigs_renamed/' + os.path.basename(path), 'w+') as out:
            json.dump(sig, out)


'''
2. Now reformat the taxonomy provided by the GTDB.
'''

with open('taxonomy.csv', 'r') as file, \
     open('taxonomy.refmt.csv', 'w+') as out:

    # header
    out.write('accession,superkingdom,phylum,class,order,family,genus,species\n')
    for line in file:
        name, tax = line.strip().split('\t')
        tax = [i.split('__')[1] for i in tax.split(';')]
        out.write(f'{name},' + ','.join(tax) + '\n')

@ctb
Copy link
Contributor

ctb commented Jul 12, 2020

thanks for the code!

Other thoughts -

  • could support md5sum and/or filename for matching.

please leave this open and I'll work on it as I'm inspired :)

@ctb ctb changed the title Create custom LCA index for GTDB taxonomy How do I build a lineage spreadsheet for GTDB taxonomy and signatures? Jul 12, 2020
@ctb
Copy link
Contributor

ctb commented May 4, 2022

this is now built into our various database release processes; see #2015 for a guide, and Taylor put together an R script to do it, here: #1941 (comment). Still no standalone script that does it and is in version control tho :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants