Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sketch names in GTDB database use NCBI taxonomy #3006

Open
ccbaumler opened this issue Feb 16, 2024 · 20 comments
Open

sketch names in GTDB database use NCBI taxonomy #3006

ccbaumler opened this issue Feb 16, 2024 · 20 comments

Comments

@ccbaumler
Copy link
Contributor

I am working on this pangenome database idea at https://github.com/ctb/2022-database-covers/. I may have found an error in the taxonomic classification while trying to figure some stuff out with the scripts.

For example, from the gtdb-rs214.lineages.csv:

GCA_000398885.1,f,d__Bacteria,p__Pseudomonadota,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia ruysiae

When I extract that signature from the GTDB db:

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /home/baumlerc/2022-database-covers/ruysiae-or-coli.zip
signature: GCA_000398885.1 Escherichia coli KTE33 strain=KTE33, Esch_coli_KTE33_V1
source file: /dev/fd/63
md5: 5fa7a957bf01b4a8bfbc2b3265e79d34
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 4489
sum hashes: 4489
signature license: CC0

loaded 1 signatures total, from 1 files

The GenBank link implies that this should be considered E. coli
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000398885.1/

@bluegenes
Copy link
Contributor

@ctb
Copy link
Contributor

ctb commented Feb 16, 2024

they probably changed it. they do that.

@ccbaumler
Copy link
Contributor Author

I can rerun the script to create the lineage spreadsheet, but I don't see how a GenBank update could affect the GTDB and lineage file. We created those at the same time and they should be accurate to each other. The only way GenBank update could affect this is if we recreated one file and not the other.

@ctb
Copy link
Contributor

ctb commented Feb 16, 2024

wait I don't understand. apologies if I'm missing something obvious 😓

GTDB is distinct from GenBank taxonomy. They use the same genome accessions, but the taxonomies are intentionally not concordant.

when you say "the genbank link implies..." that is something that genbank changes from time to time.

@ctb
Copy link
Contributor

ctb commented Feb 16, 2024

here, the genbank link matches what's in the GTDB zipfile, which is an assembly record named based on the Genbank classification.

the GTDB folk have decided it's a different Escherichia, and the taxonomy file for GTDB assigns that GenBank access to s__Escherichia ruysiae.

Probably the most confusing thing I see here is that we did not actually rename the sketch in the zip file to match the GTDB classification, but kept it as the GenBank name... computers are hard 😭

@ccbaumler
Copy link
Contributor Author

Thank you for walking through everything! I see my own misunderstanding clearly now.

Would it be best to update the signature name with the species rank name from the lineage spreadsheet and a unique identifier?

"Computers are hard." - Titus

@ctb
Copy link
Contributor

ctb commented Feb 17, 2024

Would it be best to update the signature name with the species rank name from the lineage spreadsheet and a unique identifier?

sounds like work

@ccbaumler
Copy link
Contributor Author

It's this kind of wisdom that keeps me showing up day after day.

@ctb
Copy link
Contributor

ctb commented Feb 17, 2024

so, some slightly more sober thoughts ;).

I agree that it's confusing that when you find matches with search/gather in a database named 'gtdb', you get the NCBI genome names, which basically use taxonomic names from NCBI.

The options to fix this would appear to be, roughly -

  • rename the sketches in the gtdb databases to match the GTDB taxonomy. This is certainly do-able, but breaks the principle that we're sketching the genomes/using the genome names as they are given to us by the source - which is NCBI. It might make tracking down the source of genome sketches just one step harder, too. Here it helps that would leave the accessions the same, since GTDB uses the same accession information as GenBank.
  • remove the names altogether and just go with accession. This would be annoying to me (and presumably other users) because I like seeing slightly more informative output in sourmash gather and so on.
  • leave it as it is - no work needs to be done. Certainly the easiest thing to do :).

I'm tempted to leave this issue open (maybe changing the issue title), or maybe create a new one, and then see what we feel like doing the next time GTDB does a release and we need to update.

I agree it is mildly confusing as it is, but we've been doing this for years and you're the first person who has noticed - kudos on paying attention 😆 ! - so maybe it's a minor confusion in a tapestry of many larger confusions? :). As it is, GTDB mostly agrees with NCBI taxonomy at the family/genus level, too, so it's not that jarring a disconnect in practice.

@ccbaumler
Copy link
Contributor Author

I think if we add more info into the signature name and discuss it in the documentation, it would alleviate any confusion. It could also show in the output that your return tax has some contradictions to investigate as well.

Last night I was looking through the script @bluegenes linked and the metadata files that were used to create the lineage files. I think we could update the sig names throughout the database to include the gtdb_genome_representative accession with the gtdb_taxonomy species name. Something like:

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /home/baumlerc/2022-database-covers/ruysiae-or-coli.zip

signature: GCA_000398885.1 Escherichia coli KTE33 strain=KTE33, Esch_coli_KTE33_V1 (GTDB_GCA_021307345.1 GTDB_Escherichia ruysiae)

source file: /dev/fd/63
md5: 5fa7a957bf01b4a8bfbc2b3265e79d34
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 4489
sum hashes: 4489
signature license: CC0

loaded 1 signatures total, from 1 files

Appendix

First few columns of bac120_metadata_r214.tsv:

accession ambiguous_bases checkm_completeness checkm_contamination checkm_marker_count checkm_marker_lineage checkm_marker_set_count checkm_strain_heterogeneity coding_bases coding_density contig_count gc_count gc_percentage genome_size gtdb_genome_representative gtdb_representative gtdb_taxonomy
GB_GCA_000398885.1 0 98.61 0.04 1173 f__Enterobacteriaceae (UID5124) 336 0 3884601 84.7628338874941 220 2275196 50.6641441436627 4582906 GB_GCA_021307345.1 f d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia ruysiae

The two columns in bac120_taxonomy_r214.tsv:

GB_GCA_021307345.1 d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia ruysiae

@luizirber
Copy link
Member

Back when I added NCBI genome calculation to wort I started building the name for the sig using these fields from assembly_summary_genbank.txt. At that point in time the only place we had to save this metadata was inside the signature, and it was this name that was pulled when reporting on gather (and was the solution for this item from Titus list:

remove the names altogether and just go with accession. This would be annoying to me (and presumably other users) because I like seeing slightly more informative output in sourmash gather and so on.

There is a whole discussion to be had on splitting "fast-changing" metadata (taxonomy, names) from "slow-changing" metadata/data (accession ID, the actual hashes in the minhash), and "name" is one of these fields that fall more in "fast" than "slow". But it is sort of the only one we have at the signature level.

But, turns out we have a better solution for collections nowadays: manifests!

We can add an extra column to the manifest for "preferred"/"overwrite" name, and use that (if present) when outputting in gather/search/any other place reporting names. Manifest are easy to change/update, don't mess with the signatures (so we can reuse NCBI sigs for GTDB, for example), and fit nicely into the "fast"/"slow" divide above.

@ctb
Copy link
Contributor

ctb commented Feb 17, 2024

We can add an extra column to the manifest for "preferred"/"overwrite" name, and use that (if present) when outputting in gather/search/any other place reporting names. Manifest are easy to change/update, don't mess with the signatures (so we can reuse NCBI sigs for GTDB, for example), and fit nicely into the "fast"/"slow" divide above.

👍

@ccbaumler
Copy link
Contributor Author

Thanks for all the great explanations, ideas and details!

One point I'm having trouble rectifying in my mind is the making signatures and taxonomies from GTDB data but naming the signature as NCBI taxonomic name. Intuitively, I would expect the sig "name" to be the GTDB taxonomic name.

@ctb
Copy link
Contributor

ctb commented Feb 17, 2024

Separation of name and taxonomy! A sourmash superpower (in many ways) is that we can key multiple different taxonomies on the same accession. But the flip side is that the name (which contains the accession plus some human-readable text) is distinct from taxonomy.

(IMO the content is what matters mostly, anyway. Names are ephemeral. Taxonomies change. Content is (closer to) eternal!

@ccbaumler
Copy link
Contributor Author

ccbaumler commented Feb 17, 2024

Truth, the representative genomic signature is the most important part of the database. While names are ephemeral, we are still using them. My understanding is we are using them in contradictory ways by making the database name and the signature names from two different sources.

And, for taxonomic profiling, would it be more intuitive for users reading their gather output for the database to be called NCBI instead of GTDB because the signature names derive from the NCBI and not the GTDB?

Then when using alternate taxonomic naming databases, they would expect to see some opposing taxonomies... ?

@ctb
Copy link
Contributor

ctb commented Feb 17, 2024

And, for taxonomic profiling, would it be more intuitive for users reading their gather output for the database to be called NCBI instead of GTDB because the signature names derive from the NCBI and not the GTDB?

but 'gather' doesn't produce taxonomic output... tax metagenome does.

What we'd really want to call the database is something like gtdb-subset-of-ncbi. Which feels likely to be even more confusing.

@ccbaumler
Copy link
Contributor Author

Please correct me if I misunderstood your comment about liking to see the output. I was referring to the fifth column of gather which returns the signature names of the database against the query.

If I use the GTDB sourmash database but it is returning NCBI taxonomies in that output... I don't know. Doesn't seem to be presenting the output intuitively in my mind. Especially when the output could differ from GTDB like the initial comment on this thread points out.

@ctb ctb changed the title Species level taxonomic mis-classification of at least one signature sketch names in GTDB database use NCBI taxonomy Feb 18, 2024
@ctb
Copy link
Contributor

ctb commented Feb 18, 2024

yes, I understand.

someone needs to propose a specific solution, and then someone needs to do the work. The person doing the work generally gets a large say in which solution is chosen :).

@ctb
Copy link
Contributor

ctb commented Feb 23, 2024

We can add an extra column to the manifest for "preferred"/"overwrite" name, and use that (if present) when outputting in gather/search/any other place reporting names. Manifest are easy to change/update, don't mess with the signatures (so we can reuse NCBI sigs for GTDB, for example), and fit nicely into the "fast"/"slow" divide above.

After talking through this with @ccbaumler a bit, I am endorsing this solution a bit more strongly -

  • renaming things is something we want to be able to do! adding an override name column to manifests strikes me as a good minimal change that enables that easily, and is also debuggable (as opposed to simply updating the name in a manifest and using that as a default override, which ...is not very explicit and hence not very debuggable)
  • if we do that, we should also consider adding a 'deprecated' boolean column per how might we distribute "diff" or patch databases? #985 (comment)

@ctb
Copy link
Contributor

ctb commented Feb 23, 2024

if we're going to upgrade manifest contents to include deprecated/removed and preferred_name, we could also:

this would (in my opinionated opinion) be distinct from adding other metadata fields, tags, and so on, as explored in my comments on sourmash-bio/branchwater#18

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants