Performance Assessment

Performance Benchmarking

Intro

The following is a variant identification analysis performed on CTAT-Mutations pipeline outputs to assess the pipeline's accuracy and performance. To validate its accuracy, the CTAT-Mutations pipeline was applied to the GM12878 cell line, the Genome in a Bottle (GIAB) reference sample, and then benchmarked against the well-curated, high-confidence reference variants provided by the GIAB consortium. Performance is measured with Precision-Recall (PR) and Receiver Operating Characteristic (ROC) curves plotted under several different coverage thresholds.

Outline

Intro
Dataset
Analysis

Dataset

For the following analysis, the GM12878 **Genome in a Bottle** (GIAB) cell line was chosen. The RNA was sequenced to a depth of 80 million reads, as 150-bp paired-end reads on an Illumina NextSeq (SRA accession SRS2267720). GIAB is a public-private-academic consortium hosted by the National Institute of Standards and Technology (NIST) to develop reference methods, reference data, and reference standards for research purposes.

The dataset has the following advantages:

The transcriptome, exome, and whole genome have been deeply sequenced for this sample, allowing accurate identification of variants from RNA and DNA of the same individual.
Matched RNA and DNA samples make it possible to verify RNA SNP calls: the RNA variant calls are compared with the variants present in the more reliable DNA data, which gives confidence in the RNA-seq variant calls.
The GM12878 cell line has been extensively studied, and SNPs detected in its genome have been continuously deposited into dbSNP.

These features make GM12878 a good candidate for evaluating the precision and sensitivity of the CTAT-Mutations pipeline.

RNA-seq reads:

https://www.ncbi.nlm.nih.gov/sra/?term=SRR5665260

High Confidence Regions:

https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed

Reference SNPs:

https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz

Exome BAM file (when the exome is used as a reference):

ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/project.NIST_NIST7035_H7AP8ADXX_TAAGGCGA_1_NA12878.bwa.markDuplicates.bam

Analysis

CTAT-Mutations frames SNV refinement as a class-imbalance classification problem and INDEL refinement as a regression-based classification problem, simultaneously targeting both somatic and germline variants. The models used in the CTAT-Mutations pipeline undergo hyperparameter optimization with standard 10-fold cross-validation designed to optimize their F1 scores. The F1 score is the harmonic mean of precision and recall, so optimizing it balances the trade-off between precision (positive predictive value) and recall (sensitivity). There are five tree-based models (ADABOOST, GB, NGB, SGB, RF) and three non-tree-based models (SVML, SVM-RBF, LR).

For benchmarking purposes, the reference variants provided by the GIAB consortium and the variants called from exome sequencing are used as known (true) variant references.

Performance measures:

TP (True Positive): variant detected by the CTAT-Mutations pipeline that matches the reference
FP (False Positive): variant detected by the CTAT-Mutations pipeline but not found in the reference
FN (False Negative): variant found in the reference but not detected by the CTAT-Mutations pipeline
TN (True Negative): 3.2e9 - (TP + FP + FN), where 3.2e9 is the approximate size of the human genome in base pairs
False Positive Rate (FPR) = FP / (FP + TN)
Sensitivity (SN) = TP / (TP + FN)
Positive Predictive Value (PPV) = TP / (TP + FP)
False Discovery Rate (FDR) = 1 - PPV
F1 = (2 * SN * PPV) / (SN + PPV)
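
The measures above can be sketched in a few lines of Python. This is only an illustration and not code from the pipeline: the `Metrics` class and the `compute_metrics` function are hypothetical names introduced here, and the example counts are taken from the RF row of the GIAB SNV table further down.

```python
from dataclasses import dataclass

# Approximate human genome size in bp, as used in the TN definition above.
GENOME_SIZE = 3.2e9


@dataclass
class Metrics:
    sensitivity: float
    ppv: float
    fdr: float
    fpr: float
    f1: float


def compute_metrics(tp: int, fp: int, fn: int,
                    genome_size: float = GENOME_SIZE) -> Metrics:
    """Derive the performance measures from raw TP/FP/FN counts."""
    tn = genome_size - (tp + fp + fn)   # every remaining position counts as TN
    sn = tp / (tp + fn)                 # sensitivity (recall)
    ppv = tp / (tp + fp)                # positive predictive value (precision)
    return Metrics(
        sensitivity=sn,
        ppv=ppv,
        fdr=1 - ppv,
        fpr=fp / (fp + tn),
        f1=2 * sn * ppv / (sn + ppv),
    )


# Counts from the RF row of the GIAB SNV summary table.
rf = compute_metrics(tp=28314, fp=420, fn=1851)
print(f"SN={rf.sensitivity:.6f} PPV={rf.ppv:.6f} F1={rf.f1:.6f} FPR={rf.fpr:.3e}")
```

Because TN is dominated by the genome size, the false positive rate stays very small for every model, which is one reason precision-recall measures are more informative than ROC-style measures for this kind of imbalanced problem.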

Symbol to Algorithm reference:

Symbol	Algorithm
ADABOOST	AdaBoost (Adaptive Boosting)
GB	Gradient Boosting
SGB	Stochastic Gradient Boosting
NGB	Natural Gradient Boosting
RF	Random Forest
SVML	Support Vector Machine with a Linear kernel
SVM-RBF	Support Vector Machine with a Radial Basis Function kernel
LR	Logistic Regression

The following plots show F1 scores for the CTAT-Mutations pipeline outputs produced with each algorithm, along with a baseline output.

GIAB Reference Variants

Summary statistics:

SNV

Type	tp	fp	fn	sn	ppv
RF	28314	420	1851	0.938637	0.985383
SGB	28187	636	1978	0.934427	0.977934
GB	27895	525	2270	0.924747	0.981527
NGB	27919	673	2246	0.925543	0.976462
ADABOOST	27916	993	2249	0.925443	0.965651
SVM-RBF	27944	1199	2221	0.926372	0.958858
SVML	28233	4326	1932	0.935952	0.867134
LR	28491	4997	1674	0.944505	0.850782
BASELINE	29321	12239	844	0.972021	0.705510
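
The table reports sensitivity and PPV but not F1. As a rough sketch (not part of the pipeline), F1 can be recomputed from the counts and used to rank the models. The `SNV_ROWS` list and the `f1_score` helper are names made up for this illustration, and the counts are copied from the table above.

```python
# (type, tp, fp, fn) copied from the GIAB SNV summary table.
SNV_ROWS = [
    ("RF", 28314, 420, 1851),
    ("SGB", 28187, 636, 1978),
    ("GB", 27895, 525, 2270),
    ("NGB", 27919, 673, 2246),
    ("ADABOOST", 27916, 993, 2249),
    ("SVM-RBF", 27944, 1199, 2221),
    ("SVML", 28233, 4326, 1932),
    ("LR", 28491, 4997, 1674),
    ("BASELINE", 29321, 12239, 844),
]


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; algebraically equal to 2*SN*PPV / (SN + PPV)."""
    return 2 * tp / (2 * tp + fp + fn)


# Sort models from highest to lowest F1.
ranked = sorted(SNV_ROWS, key=lambda row: f1_score(*row[1:]), reverse=True)

for name, tp, fp, fn in ranked:
    print(f"{name}\t{f1_score(tp, fp, fn):.4f}")
```

The same helper can be applied to the INDEL rows or to the exome-referenced tables by swapping in their counts.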

INDEL

Type	tp	fp	fn	sn	ppv
GB	2942	554	703	0.807133	0.841533
ADABOOST	2945	588	700	0.807956	0.833569
SGB	2958	720	687	0.811523	0.804241
NGB	2941	1024	704	0.806859	0.741740
RF	2949	1206	696	0.809053	0.709747
SVM-RBF	2920	1499	725	0.801097	0.660783
SVML	2863	2402	782	0.785460	0.543780
BASELINE	3010	2947	635	0.825789	0.505288

Exome Variants as Reference

When a true variant reference is not available, variants identified from exome reads (or an exome BAM file) are used as the true reference. For this benchmarking analysis, the exome alignment for the GIAB data was performed following GATK's best practices.
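
One way to picture how RNA-seq calls are scored against an exome-derived truth set is a plain set comparison. The sketch below is an illustration under stated assumptions, not the pipeline's actual matching logic: it assumes variants are matched exactly on (chromosome, position, ref, alt), and the `Variant` type, the `compare_calls` function, and the toy data are all hypothetical.

```python
from typing import NamedTuple, Set, Tuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def compare_calls(called: Set[Variant],
                  reference: Set[Variant]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for a call set against a reference set."""
    tp = len(called & reference)   # called and present in the reference
    fp = len(called - reference)   # called but absent from the reference
    fn = len(reference - called)   # in the reference but not called
    return tp, fp, fn


# Toy data for illustration only; these are not real GM12878 variants.
rna_calls = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 75, "G", "GA"),
}
exome_calls = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr2", 75, "G", "GA"),
    Variant("chr3", 10, "T", "C"),
}

tp, fp, fn = compare_calls(rna_calls, exome_calls)
print(f"TP={tp} FP={fp} FN={fn}")
```

In practice the comparison would also need to be restricted to regions covered by both assays, since a variant outside the exome capture targets cannot be confirmed or refuted by exome data.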

SNV

Type	tp	fp	fn	sn	ppv
SGB	16550	464	2117	0.886591	0.972728
GB	16507	427	2159	0.884335	0.974784
NGB	16485	492	2181	0.883157	0.971020
RF	16268	354	2399	0.871484	0.978703
ADABOOST	16456	587	2210	0.881603	0.965558
SVM-RBF	16675	1075	1991	0.893335	0.939437
LR	16942	2238	1724	0.907640	0.883316
SVML	16925	2240	1741	0.906729	0.883120
BASELINE	17094	2818	1572	0.915783	0.858477

INDEL

Type	tp	fp	fn	sn	ppv
ADABOOST	1592	397	1084	0.594918	0.800402
GB	1585	395	1091	0.592302	0.800505
SGB	1600	483	1076	0.597907	0.768123
NGB	1620	709	1056	0.605381	0.695578
RF	1624	815	1052	0.606876	0.665847
SVM-RBF	1604	848	1072	0.599402	0.654160
BASELINE	1670	1547	1006	0.624066	0.519117
SVML	1620	1432	1056	0.605381	0.530799

Workflow

The flowchart below represents the exome and RNA-seq mutation detection workflows. The RNA-labeled steps represent the CTAT-Mutations pipeline workflow for RNA-seq data. The WXS-labeled steps represent the steps performed outside of the CTAT-Mutations pipeline on whole-exome sequencing (WXS) data. The WXS data is used for performance benchmarking, to assess the accuracy of the pipeline's variant calling and filtering on RNA-seq data in the absence of true reference SNPs.
