The PacMAN decision support system integrates detections from various sources by connecting to the OBIS database. Data publishing to OBIS typically happens through an Integrated Publishing Toolkit (IPT) instance, and this is also the case for the PacMAN monitoring campaigns. Before sequence data from eDNA sampling can be published to OBIS, it needs to be processed and formatted as a Darwin Core Archive. Processing involves quality control and trimming of sequences, ASV inference, and taxonomic annotation. These steps are taken care of by the PacMAN bioinformatics pipeline.
The PacMAN pipeline uses a number of taxonomic annotation algorithms. The main taxonomic assignment, which is used to populate the scientificName field in Darwin Core, comes from the naïve Bayesian RDP Classifier. RDP Classifier calculates a probability for every possible taxonomic annotation using k-mer frequencies, and then applies a bootstrapping procedure to obtain a confidence score for each taxonomic level.
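The bootstrapping idea can be illustrated with a toy classifier. The sketch below is a simplification, not the actual RDP Classifier implementation: the reference sequences and the scoring rule are made up for illustration, and only the subsampling-and-voting scheme (RDP subsamples a fraction of the query's k-mers in each iteration) reflects the real procedure.

```python
import random
from collections import Counter

def kmers(seq, k=8):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

# Hypothetical tiny reference set: taxonomy -> reference sequence.
REFERENCE = {
    ("family X", "genus Y", "species A"): "ACGTACGTGGCCAATTGGCCAACCGGTT",
    ("family X", "genus Y", "species B"): "ACGTACGTGGCCAATTGGCCAACCGGAA",
    ("family X", "genus Z", "species C"): "TTGCACGTGGCCTATTGGACAACCGGTT",
}

def classify(query_kmers, k=8):
    """Assign the taxon whose reference shares the most k-mers with the query."""
    best, best_score = None, -1
    for taxon, seq in REFERENCE.items():
        ref = set(kmers(seq, k))
        score = sum(1 for km in query_kmers if km in ref)
        if score > best_score:
            best, best_score = taxon, score
    return best

def bootstrap_confidence(query, k=8, n=100):
    """Classify n random subsamples of the query's k-mers; the fraction of
    iterations voting for a name at each rank is its confidence score."""
    qk = kmers(query, k)
    subset = max(1, len(qk) // 8)            # subsample 1/8 of the k-mers
    votes = [Counter() for _ in range(3)]    # one Counter per taxonomic rank
    for _ in range(n):
        sample = random.choices(qk, k=subset)
        taxon = classify(sample, k)
        for rank, name in enumerate(taxon):
            votes[rank][name] += 1
    return [{name: count / n for name, count in v.items()} for v in votes]
```

With this setup, a query identical to one of the references is assigned confidently at the family level, while species-level confidence drops because subsamples often fail to include the few k-mers that distinguish closely related species.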
The pipeline also includes a VSEARCH step, which uses k-mer based candidate selection followed by global alignment to find the closest matches in a reference database. VSEARCH also provides a pairwise identity score for each match.
| Algorithm | Results |
|---|---|
| RDP Classifier | family X (confidence 1), genus Y (confidence 0.9), species B (confidence 0.3) |
| VSEARCH | family X, genus Y, species A (identity 0.997)<br>family X, genus Y, species B (identity 0.995)<br>family X, genus Y, species B (identity 0.995)<br>family X, genus Y, species A (identity 0.992)<br>family X, genus Z, species C (identity 0.983)<br>family X, genus Z, species C (identity 0.975) |
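One way to make sense of a list of VSEARCH hits like the one above is a consensus taxonomy: keep the hits within a small identity margin of the best hit, then retain each taxonomic rank only where all retained hits agree. This is an illustrative sketch, not necessarily the procedure used by the pipeline; the hits mirror the example table:

```python
def vsearch_consensus(hits, margin=0.005):
    """Consensus taxonomy over hits within `margin` of the best identity.
    Ranks are kept from the top down until the retained hits disagree."""
    best = max(identity for _, identity in hits)
    kept = [taxon for taxon, identity in hits if identity >= best - margin]
    consensus = []
    for names in zip(*kept):          # iterate ranks: family, genus, species
        if len(set(names)) == 1:
            consensus.append(names[0])
        else:
            break                     # stop at the first ambiguous rank
    return tuple(consensus)

# Hits from the example table: (taxonomy, pairwise identity).
hits = [
    (("family X", "genus Y", "species A"), 0.997),
    (("family X", "genus Y", "species B"), 0.995),
    (("family X", "genus Y", "species B"), 0.995),
    (("family X", "genus Y", "species A"), 0.992),
    (("family X", "genus Z", "species C"), 0.983),
    (("family X", "genus Z", "species C"), 0.975),
]
```

With the default margin, the top four hits are retained; they agree on family X and genus Y but split between species A and B, so the consensus stops at genus level, in line with the RDP Classifier's low species-level confidence in the example.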
The PacMAN bioinformatics pipeline is a workflow based on commonly used bioinformatics tools and custom scripts. The pipeline can be run using the Snakemake workflow management system. Snakemake takes care of installing the necessary dependencies in Conda environments, and of running the different steps of the pipeline in the correct order.
In addition to installing Conda and Snakemake locally, it's also possible to run the pipeline using Docker. In this case, the pipeline is encapsulated in a Docker container, and the data folders are mounted as volumes.
The following files are required to run the pipeline:
- Configuration file
- Manifest
- Sample metadata
- Raw sequences
- RDP reference database
- VSEARCH reference database
See the data preparation section in the pipeline README for example files and reference database downloads. Structure the files like this:
└── data
├── config_files
│ ├── config.yaml
│ ├── manifest.csv
│ └── sample_data.csv
├── raw_sequences
│ ├── USP-24-01-172_S172_L001_R1_001.fastq.gz
│ └── USP-24-01-172_S172_L001_R2_001.fastq.gz
└── reference_databases
├── COI_ncbi_1_50000_pcr_pga_taxon_derep_clean_sintax.fasta
└── COI_terrimporter
├── bergeyTrainingTree.xml
├── genus_wordConditionalProbList.txt
├── logWordPrior.txt
├── rRNAClassifier.properties
└── wordConditionalProbIndexArr.txt
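Before launching the pipeline, it can help to verify that this layout is in place. The following pre-flight check is a hypothetical convenience script, not part of the pipeline; the paths follow the example tree above:

```python
from pathlib import Path

# Paths expected by the pipeline, following the example directory tree.
REQUIRED = [
    "data/config_files/config.yaml",
    "data/config_files/manifest.csv",
    "data/config_files/sample_data.csv",
    "data/raw_sequences",
    "data/reference_databases",
]

def check_layout(root="."):
    """Return the list of required paths missing under `root`."""
    return [p for p in REQUIRED if not (Path(root) / p).exists()]

if __name__ == "__main__":
    missing = check_layout()
    if missing:
        print("Missing inputs:", ", ".join(missing))
    else:
        print("Input layout looks complete.")
```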
Run the pipeline with Snakemake or Docker using either of the following commands:
snakemake --use-conda --configfile data/config_files/config.yaml --rerun-incomplete --printshellcmds
docker run --platform linux/amd64 \
-v $(pwd)/data:/pipeline/data \
-v $(pwd)/results:/pipeline/results \
-v $(pwd)/.snakemake:/pipeline/.snakemake \
pieterprovoost/pacman-pipeline
Sign in to the OBIS JupyterHub to explore an example pipeline result. To visualize the taxonomic composition of the dataset, run the following code in an R notebook:
library(dplyr)
library(psadd)
library(phyloseq)

# Read the phyloseq object produced by the pipeline
ps <- readRDS("shared/example_results/05-dwca/phyloseq_object.rds")

# Keep only the main taxonomic ranks in the taxonomy table
tax_table(ps) <- tax_table(ps) %>%
  as.data.frame() %>%
  select(phylum, class, order, family, genus, species) %>%
  as.matrix(rownames.force = TRUE)

# Generate an interactive Krona plot, grouped by eventID
plot_krona(ps, output = "krona_plot", variable = "eventID")
- OBIS SG 12 training: training materials on R, JupyterHub, git, and DNADerivedData.
- First PacMAN training: training materials on data management, R, and the PacMAN bioinformatics pipeline.
- PacMAN pipeline: the PacMAN bioinformatics pipeline.