Generating sample table when updating database with MAGs - GTDB taxID and Accession Number #40

Open
PeterCx opened this issue Jan 19, 2023 · 5 comments

@PeterCx

PeterCx commented Jan 19, 2023

Hi there,

I am trying to update the GTDB r207 database that I downloaded using Struo2 with my own MAGs. It is not clear how to obtain some of the required information, including "ncbi_organism_name", "gtdb_taxid" and "accession".

I have annotated my MAGs using GTDB-Tk. Using FastANI, I have de-replicated my genomes, removing those with 95% ANI. This has left me with ~3000 MAGs. Given that these MAGs are not close to any other genome in GTDB, I don't understand how I can get a taxid. I have attached the current information I have from GTDB about my MAGs.
GTDB_MAG_Information.txt

Your help is greatly appreciated.

Kind regards,

P

@nick-youngblut
Contributor

You should get the GTDB taxids via https://github.com/shenwei356/gtdb-taxdump

I used that taxdump for setting taxids in GTDB-r207

@PeterCx
Author

PeterCx commented Jan 20, 2023

Hi Nick,

Thanks for your response. A few things are still not clear to me. I have the GTDB taxids for r207, obtained through the link above, but it's not clear how I generate taxids for my own MAGs. I have used the command below, which I found here:

gtdb_to_taxdump.py \
  TaxID/gtdbtk.bac120.summary.tsv \
  https://data.gtdb.ecogenomic.org/releases/release207/207.0/bac120_taxonomy_r207.tsv.gz \
  > TaxID/taxID_info.tsv

This gives a taxid in the output file taxID_info.tsv.
How do I get the ncbi_organism_name and accession required for the database update? I am confused because most of my MAGs cannot be assigned a taxonomy beyond the genus level.

Many thanks

P

@nick-youngblut
Contributor

nick-youngblut commented Jan 20, 2023

You could go from NCBI taxids for each of your MAGs to GTDB taxids, via gtdb_to_taxdump.py.

Another approach is getting the GTDB taxids directly from the GTDB taxdump created by https://github.com/shenwei356/gtdb-taxdump. You would probably need to create your own script for this, however. The process would likely be MAGs => GTDB-Tk (GTDB taxonomy) => map taxonomy to gtdb-taxdump => get GTDB taxids
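The "map taxonomy to gtdb-taxdump" step could be sketched roughly as follows. This is only an illustration under some assumptions: that the taxdump contains a standard NCBI-style names.dmp, and that names are stored there without the rank prefix (e.g. `GCA-2688965`, not `g__GCA-2688965`). The function names, MAG IDs and taxids below are made up for the example. Note that GTDB can reuse the same name at different ranks (family and genus here), so the sketch keeps every matching taxid instead of picking one.

```python
import io
from collections import defaultdict

# Made-up names.dmp excerpt; these taxids are placeholders, not real taxdump values.
NAMES_DMP = (
    "101\t|\tAenigmatarchaeia\t|\t\t|\tscientific name\t|\n"
    "102\t|\tGCA-2688965\t|\t\t|\tscientific name\t|\n"
    "103\t|\tGCA-2688965\t|\t\t|\tscientific name\t|\n"
    "104\t|\tGCA-2688965 sp002688965\t|\t\t|\tscientific name\t|\n"
)


def load_name_to_taxids(names_dmp):
    """Map each scientific name to the list of taxids that carry it."""
    name_to_taxids = defaultdict(list)
    for line in names_dmp:
        # names.dmp fields are separated by "\t|\t" and the line ends with "\t|"
        fields = [f.strip() for f in line.rstrip("\n").rstrip("|").split("|")]
        if len(fields) >= 4 and fields[3] == "scientific name":
            name_to_taxids[fields[1]].append(int(fields[0]))
    return dict(name_to_taxids)


def lowest_classified_name(classification):
    """Return (rank_prefix, name) for the lowest rank that has a name."""
    for token in reversed(classification.split(";")):
        prefix, _, name = token.strip().partition("__")
        if name:  # an empty name such as "s__" means unclassified at this rank
            return prefix, name
    return None, None


name_to_taxids = load_name_to_taxids(io.StringIO(NAMES_DMP))

# Hypothetical GTDB-Tk classifications; MAG_002 has no species assignment.
lineage = "d__Archaea;p__Aenigmatarchaeota;c__Aenigmatarchaeia;o__GW2011-AR5;f__GCA-2688965;g__GCA-2688965;"
mags = {
    "MAG_001": lineage + "s__GCA-2688965 sp002688965",
    "MAG_002": lineage + "s__",
}

results = {}
for mag, classification in mags.items():
    prefix, name = lowest_classified_name(classification)
    results[mag] = (prefix, name, name_to_taxids.get(name, []))
    print(mag, prefix, name, results[mag][2])
```

For a MAG that is only classified to genus level, this returns the taxid(s) of the genus name. Where several taxids share a name, the rank in nodes.dmp would be needed to choose between them.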

@vinisalazar

Hi @nick-youngblut, I am experiencing a similar issue. I have a GTDB-Tk output file with GTDB taxonomies, but I don't have any TaxIDs. How do I go about the "map taxonomy to gtdb-taxdump" step?

Thank you for any assistance you can provide.

@vinisalazar

vinisalazar commented May 31, 2023

This is what I ended up doing:

import pandas as pd

# `df` is assumed to be already loaded, with "genome" and "gtdb_classification" columns

# Create lineage dataframe based on gtdb_classification column
# This is what a cell looks like: 'd__Archaea;p__Aenigmatarchaeota;c__Aenigmatarchaeia;o__GW2011-AR5;f__GCA-2688965;g__GCA-2688965;s__GCA-2688965 sp002688965'
# Keep the genome ID as the index so each name stays linked to its MAG
gtdb_lineages = df.set_index("genome")["gtdb_classification"].str.split(";", expand=True)

# Write a function to extract the scientific name
def get_sci_name_from_row(row):
    """This reads a Pandas Series (a row) and returns the lowest level scientific name."""
    # Iterate over the ranks from lowest (species) to highest (domain)
    for value in reversed(row.to_list()):
        # Skip missing cells, then trim the 3-character prefix denoting the rank (e.g. 's__')
        if isinstance(value, str) and (value_fmt := value[3:]):
            return value_fmt
        # if it isn't classified at this rank, go to the higher tax rank
    return None

# Export to a text file (no header, so every line is genome,name)
gtdb_lineages.apply(get_sci_name_from_row, axis=1).to_csv("scinames.csv", header=False)

Now I run that through TaxonKit (my --data-dir in TaxonKit is set to the GTDB r207 taxdump):

cut -f 2 -d , scinames.csv | taxonkit name2taxid > taxids.csv

This gives me a text file with the TaxIDs from the custom GTDB taxdump. I hope it helps.
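Since the `cut` step drops the first column, the TaxIDs may need to be joined back to the MAGs afterwards. A minimal sketch of that join is below. It assumes scinames.csv has two columns with no header row (first the row label written by pandas, taken here to be the genome ID, then the name), and that the TaxonKit output is tab-separated `name<TAB>taxid`, with one line per matching taxid and an empty taxid when nothing matches. The inline data and taxids are made up for illustration.

```python
import csv
import io

# Made-up stand-ins for scinames.csv and the taxonkit name2taxid output
SCINAMES_CSV = (
    "MAG_001,GCA-2688965 sp002688965\n"
    "MAG_002,GCA-2688965\n"
    "MAG_003,Unknownia\n"
)
TAXIDS_TSV = (
    "GCA-2688965 sp002688965\t104\n"
    "GCA-2688965\t102\n"
    "GCA-2688965\t103\n"
    "Unknownia\t\n"
)


def read_taxids(handle):
    """Collect every taxid reported for each name (an empty list if none)."""
    taxids = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        name = row[0]
        taxid = row[1] if len(row) > 1 else ""
        taxids.setdefault(name, [])
        if taxid and taxid not in taxids[name]:
            taxids[name].append(taxid)
    return taxids


def join_genomes(scinames_handle, taxids):
    """Return (genome, name, taxids) tuples in the order of scinames.csv."""
    table = []
    for row in csv.reader(scinames_handle):
        if len(row) < 2:
            continue  # skip malformed or empty lines
        genome, name = row[0], row[1]
        table.append((genome, name, taxids.get(name, [])))
    return table


taxids = read_taxids(io.StringIO(TAXIDS_TSV))
table = join_genomes(io.StringIO(SCINAMES_CSV), taxids)
for genome, name, ids in table:
    print(genome, name, ",".join(ids) or "NA", sep="\t")
```

Names that map to more than one TaxID, or to none, show up explicitly in the joined table and can be resolved by hand.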

Best,
Vini
