GTDB database download and usage

Nick Youngblut edited this page Nov 13, 2021 · 5 revisions

Description

This tutorial describes how to download and utilize the custom databases generated by Struo2 from various GTDB releases.

See the main README for a list of databases.

For this tutorial, we will be downloading and using the GTDB r202 metagenome profiling databases. All files can be found at the Struo2 ftp site.
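
The commands in this tutorial refer to a `$DBDIR` variable that is never set explicitly. A minimal sketch, assuming you want the databases in a local directory called `custom_dbs` (a hypothetical name; use any path with enough free disk space):

```shell
# Assumption: databases are stored in ./custom_dbs unless DBDIR is already set
export DBDIR="${DBDIR:-custom_dbs}"
mkdir -p "$DBDIR"
echo "Databases will be downloaded to: $DBDIR"
```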

File download

Simple method: helper script

Use the database_download.py utility script in ./util_scripts/ to download pre-built custom Struo2 databases. For example, to download the GTDB-r202 Kraken2/Bracken databases (and associated files):

# requires `requests` and `bs4` python packages
# using 4 threads in this example
./util_scripts/database_download.py -t 4 -r 202 -d kraken2 metadata taxdump phylogeny -- custom_dbs

taxdump files

The taxdump files are used for creating/updating the Kraken2 database.

GTDB taxdump

By default, the pipeline uses custom GTDB taxIDs generated with gtdb_to_taxdump from the GTDB taxonomy. To download the custom taxdump files:

wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/taxdump/taxdump.tar.gz
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/taxdump/taxdump.tar.gz.md5
md5sum --check $DBDIR/taxdump.tar.gz.md5
tar -pzxvf $DBDIR/taxdump.tar.gz --directory $DBDIR
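
Note that `md5sum --check` resolves the file names listed in the `.md5` file relative to the current working directory, so the check can fail if you run it from outside `$DBDIR`. The helper below is a small sketch (the function name `verify_md5` is hypothetical) that runs the check from inside the download directory. It assumes the `.md5` files list bare file names, which is not confirmed here.

```shell
# Verify an md5 file from inside the directory that holds the downloads.
# Assumption: the .md5 file lists file names relative to that directory.
# The function body is a subshell, so the caller's working directory is unchanged.
verify_md5() (
  cd "$1" && md5sum -c "$2"   # -c is the short form of --check
)

# Example (after the wget commands above):
# verify_md5 "$DBDIR" taxdump.tar.gz.md5
```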

UniRef

UniRef databases are used for annotating genes. UniRef IDs are required for HUMAnN3. You do not need any UniRef databases if you are only creating Kraken2/Bracken databases.

IF using mmseqs search for gene annotation

mmseqs UniRef database(s)

See the mmseqs2 wiki on database downloading

# you must have mmseqs2 installed
# Example of downloading UniRef50 (WARNING: slow!)
mmseqs databases --remove-tmp-files 1 --threads 4 UniRef50 $DBDIR/mmseqs2/UniRef50 data/mmseqs2_TMP

IF using diamond blastp for gene annotation

HUMAnN3 UniRef diamond database(s)

See the "Download a translated search database" section of the humann3 docs.

# Example download of UniRef50 DIAMOND database
wget --directory-prefix $DBDIR http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_annotated/uniref50_annotated_v201901.tar.gz
tar -pzxvf $DBDIR/uniref50_annotated_v201901.tar.gz --directory $DBDIR

UniRef50-90 index

Optional, but recommended

This index maps annotations from UniRef90 clusters to UniRef50 clusters. With it, you can annotate genes against UniRef90 only and then map those annotations to UniRef50 cluster IDs. Without it, you would have to annotate against both UniRef90 and UniRef50, which requires many more gene queries against UniRef.

wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/install/uniref_2019.01/uniref50-90.pkl

Kraken2 database

wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/database.kraken
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/database.kraken.md5
md5sum --check $DBDIR/database.kraken.md5
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/hash.k2d
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/opts.k2d
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/taxo.k2d
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/k2d.md5
md5sum --check $DBDIR/k2d.md5

Bracken database

You only need the files for the read size that matches the length of your reads (e.g., "100mers" for 100 bp reads).

# 100 bp
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/bracken/database100mers.kmer_distrib
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/bracken/database100mers.kraken
# 150 bp
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/bracken/database150mers.kmer_distrib
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/bracken/database150mers.kraken
# md5sum check
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/bracken/database_mers.md5
md5sum --check $DBDIR/database_mers.md5
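
Since only one read length is needed, the download can be parameterized. The sketch below uses a hypothetical helper, `bracken_urls`, that prints the two file URLs for a given read length, following the naming pattern of the commands above. Only the 100 and 150 bp files are shown in this tutorial, so other read lengths are an assumption to check against the ftp site.

```shell
# Base URL taken from the wget commands above
BRACKEN_URL="http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/bracken"

# Print the two Bracken file URLs for one read length (e.g. 100 or 150)
bracken_urls() {
  for ext in kmer_distrib kraken; do
    echo "${BRACKEN_URL}/database${1}mers.${ext}"
  done
}

# Example: download only the 150 bp files (wget reads the URLs from stdin)
# bracken_urls 150 | wget --directory-prefix "$DBDIR" --input-file -
```

If database_mers.md5 lists files for read lengths you did not download, GNU `md5sum --check --ignore-missing` skips the absent files instead of reporting them as failures.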

HUMAnN3 database

You can choose between UniRef50 and UniRef90. The latter will be more sensitive, but HUMAnN3 will take substantially longer to complete.

# bowtie2 database
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/all_genes_annot.1.bt2l
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/all_genes_annot.2.bt2l
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/all_genes_annot.3.bt2l
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/all_genes_annot.4.bt2l
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/all_genes_annot.rev.1.bt2l
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/all_genes_annot.rev.2.bt2l
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/bt2l.md5
md5sum --check $DBDIR/bt2l.md5
# DIAMOND database
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/protein_database/uniref50_201901.dmnd
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/protein_database/uniref50_201901.md5
md5sum --check $DBDIR/uniref50_201901.md5
# move the DIAMOND database into its own subdirectory
mkdir -p $DBDIR/protein_database
mv $DBDIR/uniref50_201901.dmnd $DBDIR/protein_database/uniref50_201901.dmnd
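
Before running HUMAnN3, it can help to confirm that every file from the steps above is in place. The following is a minimal sketch with a hypothetical function name, `check_humann3_db`. It assumes the layout produced by the commands above: the bowtie2 index files directly in the database directory and the DIAMOND file under `protein_database/`.

```shell
# Report any expected HUMAnN3 database file that is missing or empty.
# File names are taken from the download commands above.
# Usage: check_humann3_db DIRECTORY
check_humann3_db() {
  missing=0
  for f in all_genes_annot.1.bt2l all_genes_annot.2.bt2l \
           all_genes_annot.3.bt2l all_genes_annot.4.bt2l \
           all_genes_annot.rev.1.bt2l all_genes_annot.rev.2.bt2l \
           protein_database/uniref50_201901.dmnd; do
    if [ ! -s "$1/$f" ]; then
      echo "missing or empty: $1/$f" >&2
      missing=1
    fi
  done
  # Exit status 0 only if every file was found
  return "$missing"
}

# Example:
# check_humann3_db "$DBDIR" && echo "HUMAnN3 database looks complete"
```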

Database usage

Kraken2/Bracken

Kraken2

# assuming paired-end input reads
# --report takes the report file name; "--output -" suppresses the per-read output
kraken2 --db $DBDIR --report sample.kreport --output - --paired {input.read1} {input.read2}

Bracken

# assuming 150 bp reads
bracken -r 150 -d $DBDIR -i sample.kreport -o bracken_output
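
The two steps can be chained per sample. This sketch wraps them in a hypothetical function, `profile_sample`, with a `DRY_RUN=1` mode that prints the commands instead of running them. The output file names (`SAMPLE.kreport`, `SAMPLE.bracken`) and the default read length of 150 are illustrative assumptions, not Struo2 conventions.

```shell
# Print (DRY_RUN=1) or run the Kraken2 and Bracken steps for one paired-end sample.
# Usage: profile_sample SAMPLE READ1 READ2 [READ_LEN]
profile_sample() {
  sample="$1"; r1="$2"; r2="$3"; len="${4:-150}"
  run=""
  [ "${DRY_RUN:-0}" = "1" ] && run="echo"
  # Bracken only runs if Kraken2 succeeded
  $run kraken2 --db "$DBDIR" --report "${sample}.kreport" --output - \
    --paired "$r1" "$r2" &&
  $run bracken -r "$len" -d "$DBDIR" -i "${sample}.kreport" -o "${sample}.bracken"
}

# Example: preview the commands for a sample with 100 bp reads
# DRY_RUN=1 profile_sample sampleA sampleA_R1.fq.gz sampleA_R2.fq.gz 100
```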

HUMAnN3

humann3 \
  --nucleotide-database $DBDIR \
  --protein-database $DBDIR/protein_database \
  --input-format fastq \
  --output-basename hm3 \
  --input {input.reads} \
  --output output