how to handle suppressed records in databases? #2037

Open
bluegenes opened this issue May 5, 2022 · 5 comments
Comments

@bluegenes
Contributor

GCA_905332505.2 is part of gtdb-rs207 (https://gtdb.ecogenomic.org/genome?gid=GCA_905332505.2), but has been suppressed (see https://www.ncbi.nlm.nih.gov/assembly/GCA_905332505.2).

Genome/proteome download from NCBI fails (due to suppression).

Since wort sketches files as they become available, I believe we had genomic signatures available to include in our database. We do not have the same luxury for our protein database.

If we use the same taxonomy file between genome and proteome databases, there will be a "missing" identifier in the protein database. I think this might affect taxonomy functions?

I'm sure this won't be the only time this happens -- would be nice to handle this sort of case safely.
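One way to spot this kind of gap before running taxonomy functions is to compare the identifiers in the taxonomy file against the identifiers actually present in each database. The following is a minimal sketch, not sourmash's own behavior: it assumes the taxonomy is a CSV with an `ident` column, and the function names, the second accession, and the lineage values are hypothetical placeholders.

```python
import csv
import io


def taxonomy_idents(csv_text, ident_column="ident"):
    """Collect the identifiers listed in a taxonomy CSV (column name is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[ident_column] for row in reader}


def missing_from_database(tax_idents, db_idents):
    """Return identifiers that the taxonomy lists but the database lacks."""
    return sorted(set(tax_idents) - set(db_idents))


# Placeholder data: only GCA_905332505.2 comes from this issue.
taxonomy_csv = """ident,superkingdom,phylum
GCA_905332505.2,d__Bacteria,p__Placeholder
GCA_000000001.1,d__Bacteria,p__Placeholder
"""

# Hypothetical protein database that could not include the suppressed genome.
protein_db_idents = {"GCA_000000001.1"}

missing = missing_from_database(taxonomy_idents(taxonomy_csv), protein_db_idents)
print(missing)
```

Reporting the difference explicitly makes it possible to decide per database whether a missing identifier should be an error or just a warning.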

@taylorreiter
Contributor

kblin/ncbi-genome-download#138

It seems others have had this issue as well. I can't find the assembly_summary_historical.txt file suggested to have download information

@taylorreiter
Contributor

As a species representative, this genome will be downloadable from the GTDB FTP.

So I guess we download the whole thing... and just take the one little genome we want?

https://twitter.com/apcamargo_/status/1529881238164492289?s=20&t=aAe7UmO9hp3tVZgbyeebAw

data.ace.uq.edu.au/public/gtdb/data/releases/release207

@luizirber
Member

> kblin/ncbi-genome-download#138
>
> It seems others have had this issue as well. I can't find the assembly_summary_historical.txt file suggested to have download information

I think the name changed to assembly_summary_genbank_historical.txt
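If that file is available, looking up a single accession could be sketched roughly as below. This assumes the file is tab-separated, that comment lines start with `#`, and that one of those comment lines names the columns, including `assembly_accession`. The function name `find_assembly` is hypothetical, and the sample text is invented with a reduced set of columns, so check it against the real file before relying on it.

```python
import io


def find_assembly(summary_text, accession):
    """Return the row for `accession` as a dict, or None if it is not listed."""
    header = None
    for line in io.StringIO(summary_text):
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Assumption: the column names sit on a '#' line containing
            # 'assembly_accession'; other '#' lines are plain comments.
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_accession" in fields:
                header = fields
            continue
        if header is None or not line:
            continue
        row = dict(zip(header, line.split("\t")))
        if row.get("assembly_accession") == accession:
            return row
    return None


# Invented sample with only three columns, for illustration.
sample = (
    "#   See the README for a description of the columns\n"
    "# assembly_accession\tversion_status\tftp_path\n"
    "GCA_905332505.2\tsuppressed\tna\n"
)

row = find_assembly(sample, "GCA_905332505.2")
print(row)
```

Reading the column names from the file itself, rather than hard-coding positions, keeps the sketch tolerant of columns being added or reordered.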

@bluegenes
Contributor Author

This genome is on farm: /home/tereiter/gtdb_genomes_reps_rs207/gtdb_genomes_reps_r207/GCA/905/332/505/GCA_905332505.2_genomic.fna.gz
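The path above suggests that the representatives directory nests each genome under its accession prefix and three groups of three digits. A small sketch of deriving that relative path from an accession follows. The layout is inferred from this one path only, and the function name `genome_relpath` and the handling of `GCF` accessions are assumptions.

```python
import re


def genome_relpath(accession):
    """Build the relative path of a genome inside the representatives directory.

    Layout inferred from a single example path, so treat it as an assumption.
    """
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.(\d+)", accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    prefix, digits, _version = match.groups()
    # Split the nine digits into three directory levels: 905 / 332 / 505.
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([prefix, *parts, f"{accession}_genomic.fna.gz"])


relpath = genome_relpath("GCA_905332505.2")
print(relpath)
```

With a helper like this, a list of suppressed accessions could be mapped to files in a local copy of the GTDB representatives instead of being re-downloaded from NCBI.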

@jorondo1

Running into something similar here: I have 131 samples, with ~950 species identified across them using sourmash gather, 188 of which have been suppressed from NCBI for various reasons. That's a big chunk! I used gtdb-rs214-reps.

Is it because these were suppressed after the db was prepared? If I want to avoid this, since I need to fetch the genomes of the species I find in my samples, would the best solution be to create the reference database myself?


4 participants