How do I build a lineage spreadsheet for GTDB taxonomy and signatures? #1095
Comments
There's no good way to do this within sourmash. Suggestions welcome! I used snakemake -- https://github.com/dib-lab/sourmash_databases/blob/master/gtdb/Snakefile#L123
(the key is this line, which looks up the name in a spreadsheet - https://github.com/dib-lab/sourmash_databases/blob/master/gtdb/Snakefile#L129)
On Sat, Jul 11, 2020 at 06:05:43AM -0700, Adrian Viehweger wrote:
ah, and `sourmash lca index` matches signatures based on the "accession" column in the csv and the "name" field in the signature?
yep!
...I should probably document that, eh?
well, only if you want to :)
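For future readers, the matching rule confirmed above (the "accession" column in the csv against the "name" field in the signature) can be checked before running `sourmash lca index`. This is only a minimal sketch: it assumes each signature is a JSON file whose first record may carry a `name` field, and that the lineage csv has an `accession` header. The helper names (`load_accessions`, `signature_name`, `unmatched_names`) are hypothetical and not part of sourmash.

```python
import csv
import json


def load_accessions(csv_path):
    """Return the set of values in the 'accession' column of a lineage csv."""
    with open(csv_path, newline='') as fh:
        return {row['accession'] for row in csv.DictReader(fh)}


def signature_name(sig_path):
    """Return the 'name' field of the first record in a signature JSON file.

    Returns None when the signature was sketched without a name.
    """
    with open(sig_path) as fh:
        return json.load(fh)[0].get('name')


def unmatched_names(sig_paths, csv_path):
    """List (path, name) pairs whose name has no row in the lineage csv."""
    accessions = load_accessions(csv_path)
    missing = []
    for path in sig_paths:
        name = signature_name(path)
        if name not in accessions:
            missing.append((path, name))
    return missing
```

An empty list from `unmatched_names(glob('sigs/*.sig'), 'taxonomy.refmt.csv')` would suggest that every signature has a lineage row to be matched against; the file names here are just the ones used elsewhere in this thread.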
I ended up just working around the problem -- I don't see a very clean API to match names from the
Code for future reference:
import json
from glob import glob
import os
from tqdm import tqdm

'''
1. Add taxonomy identifier to signature. In our case we'll use the one in
the GTDB taxonomy csv file of the form GB_GCA_002163135.1, ie just the
filename prefix.
'''
os.makedirs('sigs_renamed', exist_ok=True)  # output directory must exist
files = glob('sigs/*.sig')
for path in tqdm(files):
    with open(path, 'r') as file:
        sig = json.load(file)
    fn = sig[0]['filename']
    name = os.path.basename(fn).replace('_genomic.fna.gz', '')
    sig[0]['name'] = name
    with open('sigs_renamed/' + os.path.basename(path), 'w+') as out:
        json.dump(sig, out)

'''
2. Now reformat the taxonomy provided by the GTDB.
'''
with open('taxonomy.csv', 'r') as file, \
        open('taxonomy.refmt.csv', 'w+') as out:
    # header
    out.write('accession,superkingdom,phylum,class,order,family,genus,species\n')
    for line in file:
        # each line: accession <TAB> d__...;p__...;...;s__...
        name, tax = line.strip().split('\t')
        tax = [i.split('__')[1] for i in tax.split(';')]
        out.write(f'{name},' + ','.join(tax) + '\n')
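A slightly more defensive variant of step 2, offered only as a sketch and not as what the sourmash database release process does: it validates each rank prefix and writes rows through the `csv` module so that a field containing a comma would be quoted. It assumes seven ranks per line with the prefixes `d`, `p`, `c`, `o`, `f`, `g`, `s`; the names `gtdb_line_to_row` and `reformat_taxonomy` are hypothetical.

```python
import csv
import io

# Assumed GTDB rank prefixes, in order (domain ... species).
RANK_PREFIXES = ['d', 'p', 'c', 'o', 'f', 'g', 's']
HEADER = ['accession', 'superkingdom', 'phylum', 'class',
          'order', 'family', 'genus', 'species']


def gtdb_line_to_row(line):
    """Convert 'ACCESSION<TAB>d__X;p__Y;...;s__Z' into a flat list of fields."""
    accession, lineage = line.strip().split('\t')
    ranks = lineage.split(';')
    if len(ranks) != len(RANK_PREFIXES):
        raise ValueError(f'expected 7 ranks, got {len(ranks)}: {line!r}')
    names = []
    for prefix, rank in zip(RANK_PREFIXES, ranks):
        found, _, name = rank.partition('__')
        if found != prefix:
            raise ValueError(f'unexpected rank prefix {found!r} in {line!r}')
        names.append(name)
    return [accession] + names


def reformat_taxonomy(lines, out):
    """Write the header plus one csv row per non-empty input line to `out`."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(HEADER)
    for line in lines:
        if line.strip():
            writer.writerow(gtdb_line_to_row(line))
```

Usage would look like `reformat_taxonomy(open('taxonomy.csv'), open('taxonomy.refmt.csv', 'w', newline=''))`. Raising on a malformed line is a design choice here, so that a truncated lineage is noticed early instead of producing a shifted row.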
thanks for the code! Other thoughts -
please leave this open and I'll work on it as I'm inspired :)
this is now built into our various database release processes; see #2015 for a guide, and Taylor put together an R script to do it, here: #1941 (comment). Still no standalone script that does it and is in version control tho :).
I swear I tried to find out how to do this before filing an issue :)
The GTDB taxonomy has this format:
I have sketched all genomes in a folder "sigs" and they are named like:
They were NOT "named" using `sourmash compute --name-from-first ...`.
What I don't understand is what should be the accession in the LCA taxonomy file so that it is linked to the signatures in my "sigs" directory?
I tried this script which returns an LCA taxonomy file like
and then run
but
Any help is greatly appreciated.