The PacMAN decision support system integrates detections from various sources by connecting to the OBIS database. Data publishing to OBIS typically happens through an Integrated Publishing Toolkit (IPT) instance, and this is also the case for the PacMAN monitoring campaigns. Before sequence data from eDNA sampling can be published to OBIS, it needs to be processed and formatted as a Darwin Core Archive. Processing involves quality control and trimming of sequences, ASV inference, and taxonomic annotation. These steps are taken care of by the PacMAN bioinformatics pipeline.
The PacMAN pipeline uses a number of taxonomic annotation algorithms. The main taxonomic assignment, which is used to populate the scientificName field in Darwin Core, comes from the naïve Bayesian RDP Classifier. RDP Classifier calculates a probability for every possible taxonomic annotation using k-mer frequencies, and then applies a bootstrapping procedure to obtain a confidence score for each taxonomic level.
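The bootstrapping idea can be illustrated with a toy classifier. The sketch below is a simplification, not the actual RDP Classifier implementation: the reference sequences and the scoring rule are made up for illustration, and only the subsampling-and-voting scheme (RDP subsamples a fraction of the query's k-mers in each iteration) reflects the real procedure.

```python
import random
from collections import Counter

def kmers(seq, k=8):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

# Hypothetical tiny reference set: taxonomy -> reference sequence.
REFERENCE = {
    ("family X", "genus Y", "species A"): "ACGTACGTGGCCAATTGGCCAACCGGTT",
    ("family X", "genus Y", "species B"): "ACGTACGTGGCCAATTGGCCAACCGGAA",
    ("family X", "genus Z", "species C"): "TTGCACGTGGCCTATTGGACAACCGGTT",
}

def classify(query_kmers, k=8):
    """Assign the taxon whose reference shares the most k-mers with the query."""
    best, best_score = None, -1
    for taxon, seq in REFERENCE.items():
        ref = set(kmers(seq, k))
        score = sum(1 for km in query_kmers if km in ref)
        if score > best_score:
            best, best_score = taxon, score
    return best

def bootstrap_confidence(query, k=8, n=100):
    """Classify n random subsamples of the query's k-mers; the fraction of
    iterations voting for a name at each rank is its confidence score."""
    qk = kmers(query, k)
    subset = max(1, len(qk) // 8)            # subsample 1/8 of the k-mers
    votes = [Counter() for _ in range(3)]    # one Counter per taxonomic rank
    for _ in range(n):
        sample = random.choices(qk, k=subset)
        taxon = classify(sample, k)
        for rank, name in enumerate(taxon):
            votes[rank][name] += 1
    return [{name: count / n for name, count in v.items()} for v in votes]
```

With this setup, a query identical to one of the references is assigned confidently at the family level, while species-level confidence drops because subsamples often fail to include the few k-mers that distinguish closely related species.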
The pipeline also includes a VSEARCH step, which uses k-mer based candidate selection followed by global alignment to find the closest matches in a reference database. VSEARCH also provides a pairwise identity score for each match.
| Algorithm | Results |
|---|---|
| RDP Classifier | family X (confidence 1), genus Y (confidence 0.9), species B (confidence 0.3) |
| VSEARCH | family X, genus Y, species A (identity 0.997)<br>family X, genus Y, species B (identity 0.995)<br>family X, genus Y, species B (identity 0.995)<br>family X, genus Y, species A (identity 0.992)<br>family X, genus Z, species C (identity 0.983)<br>family X, genus Z, species C (identity 0.975) |
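One way to make sense of a list of VSEARCH hits like the one above is a consensus taxonomy: keep the hits within a small identity margin of the best hit, then retain each taxonomic rank only where all retained hits agree. This is an illustrative sketch, not necessarily the procedure used by the pipeline; the hits mirror the example table:

```python
def vsearch_consensus(hits, margin=0.005):
    """Consensus taxonomy over hits within `margin` of the best identity.
    Ranks are kept from the top down until the retained hits disagree."""
    best = max(identity for _, identity in hits)
    kept = [taxon for taxon, identity in hits if identity >= best - margin]
    consensus = []
    for names in zip(*kept):          # iterate ranks: family, genus, species
        if len(set(names)) == 1:
            consensus.append(names[0])
        else:
            break                     # stop at the first ambiguous rank
    return tuple(consensus)

# Hits from the example table: (taxonomy, pairwise identity).
hits = [
    (("family X", "genus Y", "species A"), 0.997),
    (("family X", "genus Y", "species B"), 0.995),
    (("family X", "genus Y", "species B"), 0.995),
    (("family X", "genus Y", "species A"), 0.992),
    (("family X", "genus Z", "species C"), 0.983),
    (("family X", "genus Z", "species C"), 0.975),
]
```

With the default margin, the top four hits are retained; they agree on family X and genus Y but split between species A and B, so the consensus stops at genus level, in line with the RDP Classifier's low species-level confidence in the example.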
The PacMAN bioinformatics pipeline is a workflow based on commonly used bioinformatics tools and custom scripts. The pipeline can be run using the Snakemake workflow management system. Snakemake takes care of installing the necessary dependencies in Conda environments, and of running the different steps of the pipeline in the correct order.
In addition to installing Conda and Snakemake locally, it's also possible to run the pipeline using Docker. In this case, the pipeline is encapsulated in a Docker container, and the data folders are mounted as volumes.
The following files are required to run the pipeline:
- Configuration file
- Manifest
- Sample metadata
- Raw sequences
- RDP reference database
- VSEARCH reference database
See the data preparation section in the pipeline README for example files and reference database downloads. Structure the files like this:
└── data
├── config_files
│ ├── config.yaml
│ ├── manifest.csv
│ └── sample_data.csv
├── raw_sequences
│ ├── USP-24-01-172_S172_L001_R1_001.fastq.gz
│ └── USP-24-01-172_S172_L001_R2_001.fastq.gz
└── reference_databases
├── COI_ncbi_1_50000_pcr_pga_taxon_derep_clean_sintax.fasta
└── COI_terrimporter
├── bergeyTrainingTree.xml
├── genus_wordConditionalProbList.txt
├── logWordPrior.txt
├── rRNAClassifier.properties
└── wordConditionalProbIndexArr.txt
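Before launching the pipeline, it can help to verify that this layout is in place. The following pre-flight check is a hypothetical convenience script, not part of the pipeline; the paths follow the example tree above:

```python
from pathlib import Path

# Paths expected by the pipeline, following the example directory tree.
REQUIRED = [
    "data/config_files/config.yaml",
    "data/config_files/manifest.csv",
    "data/config_files/sample_data.csv",
    "data/raw_sequences",
    "data/reference_databases",
]

def check_layout(root="."):
    """Return the list of required paths missing under `root`."""
    return [p for p in REQUIRED if not (Path(root) / p).exists()]

if __name__ == "__main__":
    missing = check_layout()
    if missing:
        print("Missing inputs:", ", ".join(missing))
    else:
        print("Input layout looks complete.")
```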
Run the pipeline with Snakemake or Docker using either of the following commands:
snakemake --use-conda --configfile data/config_files/config.yaml --rerun-incomplete --printshellcmds
docker run --platform linux/amd64 \
-v $(pwd)/data:/pipeline/data \
-v $(pwd)/results:/pipeline/results \
-v $(pwd)/.snakemake:/pipeline/.snakemake \
pieterprovoost/pacman-pipeline
Sign in to the OBIS JupyterHub to explore an example pipeline result. To visualize the taxonomic composition of the dataset, run the following code in an R notebook:
library(dplyr)
library(psadd)
library(phyloseq)

# Read the phyloseq object produced by the pipeline
ps <- readRDS("shared/example_results/05-dwca/phyloseq_object.rds")

# Keep only the main taxonomic ranks in the taxonomy table
tax_table(ps) <- tax_table(ps) %>%
  as.data.frame() %>%
  select(phylum, class, order, family, genus, species) %>%
  as.matrix(rownames.force = TRUE)

# Generate an interactive Krona plot, grouped by eventID
plot_krona(ps, output = "krona_plot", variable = "eventID")
- OBIS SG 12 training: training materials on R, JupyterHub, git, and DNADerivedData.
- First PacMAN training: training materials on data management, R, and the PacMAN bioinformatics pipeline.
- PacMAN pipeline: the PacMAN bioinformatics pipeline.