How do I build a lineage spreadsheet for GTDB taxonomy and signatures? #1095

phiweger · 2020-07-11T12:56:34Z

I swear I tried to find out how to do this before filing an issue :)

The GTDB taxonomy has this format:

GB_GCA_002849265.1      d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus
RS_GCF_000637235.1      d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus

I have sketched all genomes in a folder "sigs" and they are named like:

GB_GCA_002849265.1_genomic.fna.gz.sig
RS_GCF_000637235.1_genomic.fna.gz.sig

They were NOT "named" using sourmash compute --name-from-first ....

What I don't understand is what the accession in the LCA taxonomy file should be so that it links to the signatures in my "sigs" directory.

I tried this script which returns an LCA taxonomy file like

accession,filename,superkingdom,phylum,class,order,family,genus,species
GCF_900045825,GCF_900045825.1_genomic.fna.gz,d__Bacteria,p__Firmicutes,c__Bacilli,o__Staphylococcales,f__Staphylococcaceae,g__Staphylococcus,s__Staphylococcus aureus

and then ran

sourmash lca index -C 3 --require-taxonomy --split-identifiers lca_tax.csv out --traverse-directory sigs

but

examining spreadsheet headers...
** assuming column 'accession' is identifiers in spreadsheet
145904 distinct identities in spreadsheet out of 145904 rows.
24706 distinct lineages in spreadsheet out of 145904 rows.
ERROR: no hash values found - are there any signatures?

Any help is greatly appreciated.

ctb · 2020-07-11T12:59:52Z

There's no good way to do this within sourmash. Suggestions welcome! I used snakemake --

https://github.com/dib-lab/sourmash_databases/blob/master/gtdb/Snakefile#L123

ctb · 2020-07-11T13:00:17Z

(the key is this line, which looks up the name in a spreadsheet - https://github.com/dib-lab/sourmash_databases/blob/master/gtdb/Snakefile#L129)

phiweger · 2020-07-11T13:05:32Z

ah, and sourmash lca index matches signatures based on the "accession" column in the csv and the "name" field in the signature?

ctb · 2020-07-11T13:16:42Z

yep! ...I should probably document that, eh?
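
Given that matching rule, one quick way to see why no hash values were found is to compare the identifiers stored in the signature files with the "accession" column of the spreadsheet. The following is only a minimal sketch, not part of sourmash: the function names are made up for illustration, and it assumes that a signature without a "name" field is identified by its "filename" field instead.

```python
import csv
import glob
import json
import os


def signature_names(sig_dir):
    """Map each .sig path to the identifier stored in its first record.

    Assumption: when 'name' is absent, 'filename' is used instead.
    """
    names = {}
    for path in sorted(glob.glob(os.path.join(sig_dir, '*.sig'))):
        with open(path) as fh:
            records = json.load(fh)  # a .sig file is a JSON list
        first = records[0]
        names[path] = first.get('name') or first.get('filename', '')
    return names


def spreadsheet_accessions(csv_path, column='accession'):
    """Collect the identifiers listed in the lineage spreadsheet."""
    with open(csv_path, newline='') as fh:
        return {row[column] for row in csv.DictReader(fh)}


def unmatched_signatures(sig_dir, csv_path):
    """Return {path: identifier} for signatures missing from the spreadsheet."""
    accessions = spreadsheet_accessions(csv_path)
    return {path: name
            for path, name in signature_names(sig_dir).items()
            if name not in accessions}
```

Calling `unmatched_signatures('sigs', 'lca_tax.csv')` lists every signature whose identifier has no exact counterpart in the spreadsheet. The comparison is a plain string match, so it does not reproduce any identifier trimming that `--split-identifiers` may apply.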

phiweger · 2020-07-11T15:18:43Z

well, only if you want to :)

phiweger · 2020-07-11T18:33:21Z

I ended up just working around the problem -- I don't see a very clean API to match names from the sourmash compute step and the csv required in sourmash lca index. Probably some kind of helper script?

Code for future reference:

import json
from glob import glob
import os

from tqdm import tqdm


# 1. Add a taxonomy identifier to each signature. Here we use the one from
# the GTDB taxonomy file, of the form GB_GCA_002163135.1, i.e. just the
# filename prefix.

# Create the output directory if it does not exist yet.
os.makedirs('sigs_renamed', exist_ok=True)

files = glob('sigs/*.sig')
for path in tqdm(files):
    # A .sig file is a JSON list of signature records.
    with open(path, 'r') as file:
        sig = json.load(file)
    # Only the first record is renamed; this assumes one record per file.
    fn = sig[0]['filename']
    name = os.path.basename(fn).replace('_genomic.fna.gz', '')
    sig[0]['name'] = name
    with open(os.path.join('sigs_renamed', os.path.basename(path)), 'w') as out:
        json.dump(sig, out)


# 2. Now reformat the taxonomy provided by the GTDB.

# Despite its name, the input file is tab-separated: an identifier,
# then a ';'-joined lineage.
with open('taxonomy.csv', 'r') as file, \
     open('taxonomy.refmt.csv', 'w') as out:

    # header
    out.write('accession,superkingdom,phylum,class,order,family,genus,species\n')
    for line in file:
        name, tax = line.strip().split('\t')
        # Drop the rank prefixes (d__, p__, ...), keeping only the taxon names.
        tax = [i.split('__', 1)[1] for i in tax.split(';')]
        out.write(f'{name},' + ','.join(tax) + '\n')
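
To make the reformatting step concrete, here is the same transformation applied to a single GTDB line, written as a small function. This is a sketch under the same assumption as the script above (tab between identifier and lineage); `gtdb_line_to_row` and `RANKS` are illustrative names, not part of sourmash or GTDB tooling.

```python
RANKS = ['superkingdom', 'phylum', 'class', 'order',
         'family', 'genus', 'species']


def gtdb_line_to_row(line):
    """Turn one GTDB taxonomy line into a spreadsheet row (list of fields)."""
    identifier, lineage = line.rstrip('\n').split('\t')
    taxa = lineage.split(';')
    if len(taxa) != len(RANKS):
        raise ValueError(f'expected {len(RANKS)} ranks, got {len(taxa)}')
    # Strip the rank prefix, e.g. 'g__Staphylococcus' -> 'Staphylococcus'.
    return [identifier] + [taxon.split('__', 1)[1] for taxon in taxa]


example = ('GB_GCA_002849265.1\t'
           'd__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;'
           'f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus')
row = gtdb_line_to_row(example)
# row == ['GB_GCA_002849265.1', 'Bacteria', 'Firmicutes', 'Bacilli',
#         'Staphylococcales', 'Staphylococcaceae', 'Staphylococcus',
#         'Staphylococcus aureus']
```

The first field is the full GTDB identifier, including the GB_/RS_ prefix and the version suffix, which is the same string that step 1 writes into each signature's "name" field.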

ctb · 2020-07-12T14:20:04Z

thanks for the code!

Other thoughts -

could support md5sum and/or filename for matching.

please leave this open and I'll work on it as I'm inspired :)
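
A rough sketch of what matching on name, filename, or md5sum could look like, done outside sourmash by reading the signature JSON directly. Everything here is hypothetical: `candidate_keys` and `match_signature` are made-up helpers, and the sketch assumes each record keeps its sketches in a "signatures" list whose entries carry an "md5sum" field.

```python
import json
import os


def candidate_keys(sig_path):
    """List (kind, value) identifiers that a spreadsheet row could match."""
    with open(sig_path) as fh:
        record = json.load(fh)[0]  # first record of the JSON list
    keys = []
    if record.get('name'):
        keys.append(('name', record['name']))
    if record.get('filename'):
        keys.append(('filename', os.path.basename(record['filename'])))
    # Assumed layout: one md5sum per sketch in the 'signatures' list.
    for sketch in record.get('signatures', []):
        if sketch.get('md5sum'):
            keys.append(('md5sum', sketch['md5sum']))
    return keys


def match_signature(sig_path, accessions):
    """Return the first (kind, value) found in accessions, else None."""
    for kind, value in candidate_keys(sig_path):
        if value in accessions:
            return kind, value
    return None
```

Trying the name first keeps the current behaviour, with filename and md5sum as fallbacks for signatures that were never explicitly named.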

ctb · 2022-05-04T14:41:23Z

this is now built into our various database release processes; see #2015 for a guide, and Taylor put together an R script to do it, here: #1941 (comment). Still no standalone script that does it and is in version control tho :).

ctb changed the title ~~Create custom LCA index for GTDB taxonomy~~ How do I build a lineage spreadsheet for GTDB taxonomy and signatures? Jul 12, 2020

ctb mentioned this issue Dec 28, 2020

building large LCA databases for genbank subsets #1264

Closed
