Clinical research pipeline for exploring variants in PacBio HiFi whole genome (WGS) data (Hg38)

crg2-pacbio is a research pipeline aimed at discovering clinically relevant variants from PacBio HiFi whole genome sequence data in a family-based manner. crg2-pacbio uses Snakemake and Conda to manage jobs and software dependencies.

crg2-pacbio takes as input the following VCFs output by PacBio's WGS pipeline:

DeepVariant joint-genotyped small variant VCF
pbsv joint-genotyped structural variant VCF
TRGT VCFs, one per family member

Small variant annotation and report

annotate variants with VEP and vcfanno
generate a gemini variant database
generate a rare (less than 1% maximum gnomAD population AF) variant report with medium to high impact variants. Report generation scripts are derived from cre, but copied to this repo for convenience. Report details are described here; additional columns included are allele frequencies and counts from the CoLoRSDB cohort.

Small variant panel report

The annotation pipeline is the same as above, but only variants in a gene panel are considered. The panel report includes variants of any impact (i.e. it includes non-coding variants and intronic variants).
To generate a gene panel from an HPO text file exported from PhenomeCentral or G4RD, add the HPO filepath to config["run"]["hpo"].
The first time you run the pipeline, you will also need to generate Ensembl and RefSeq gene files as well as an HGNC gene mapping file.
Download and unzip Ensembl gtf: wget -qO- https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz | gunzip -c > Homo_sapiens.GRCh38.112.gtf
Download and unzip RefSeq gff: wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz | gunzip -c > GRCh38_latest_genomic.gff
Download RefSeq chromosome mapping file: wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_assembly_report.txt
Run script to parse the above files: python scripts/clean_gtf.py --ensembl_gtf /path/to/Homo_sapiens.GRCh38.112.gtf --refseq_gff3 /path/to/GRCh38_latest_genomic.gff --refseq_assembly /path/to/GRCh38_latest_assembly_report.txt
Add the paths to the output files, Homo_sapiens.GRCh38.112.gtf_subset.csv and GRCh38_latest_genomic.gff_subset.csv, to the config["gene"]["ensembl"] and config["gene"]["refseq"] fields.
You will also need the HGNC alias file: download this from https://www.genenames.org/download/custom/ using the default fields. Add the path this file to config["gene"]["hgnc"].

Structural variant annotation and report

annotate variants using SnpEFF and AnnotSV
filter out variants under 50bp
annotate SVs with custom Python script
a description of the report and fields can be found here

Repeat expansion outlier report

TRGT must have previously been run on each sample against the 937,122 repeats originally released by the Genome in a Bottle tandem repeat benchmarking project
this module is derived from the find-outlier-expansions workflow developed by Egor Dolzhenko (PacBio), Adam English (Baylor College of Medicine), Tom Mokveld (PacBio), Giulia Del Gobbo (CHEO), and Madeline Couse (SickKids)
construct a repeat database from 98 HPRC samples and the individuals from the family of interest
identify repeats with outlying size in family members
annnotate repeat outliers with custom Python script

Pathogenic repeat loci report

genotype repeats per-sample using TRGTv1.0.0 against the pathogenic repeat loci BED file provided by TRGT (the GIAB 937,122 catalog only contains 50/56 loci). Note that TRGT requires BAMs; these must be added to the samples.tsv file
merge sample VCFs into multi-sample family VCF
annotate repeat loci with custom Python script

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Clinical research pipeline for exploring variants in PacBio HiFi whole genome (WGS) data (Hg38)

Files

README.md

Latest commit

History

README.md

File metadata and controls

Clinical research pipeline for exploring variants in PacBio HiFi whole genome (WGS) data (Hg38)