Performance Assessment

M. Brown edited this page Apr 15, 2021 · 30 revisions

Performance Benchmarking

Intro

This page describes a variant identification analysis performed on CTAT-Mutations pipeline outputs to assess the pipeline's accuracy and performance. To validate accuracy, the CTAT-Mutations pipeline was applied to the GM12878 cell line, the Genome in a Bottle (GIAB) reference sample, and its calls were benchmarked against the well-curated, high-confidence reference variants provided by the GIAB consortium. Performance is measured with Precision-Recall (PR) and Receiver Operating Characteristic (ROC) curves plotted under several different coverage thresholds.

Dataset

For the following analysis, the GM12878 **Genome in a Bottle** (GIAB) cell line was chosen. The RNA was sequenced to a depth of 80 million reads, as 150-bp paired-end reads on an Illumina NextSeq (SRA accession SRS2267720). GIAB is a public-private-academic consortium hosted by the National Institute of Standards and Technology (NIST) to develop reference methods, reference data, and reference standards for research purposes.

The dataset has the following advantages:

  1. The transcriptome, exome, and whole genome have been deeply sequenced for these samples, allowing accurate identification of variants from RNA and DNA of the same individual.

  2. Matched RNA and DNA samples lend confidence to the RNA SNP calls: the RNA variant calls are compared with the variants present in the more reliable DNA samples. The GM12878 cell line has been extensively studied, and SNPs detected in its genome have been continuously deposited into dbSNP. These features make GM12878 a good candidate for evaluating the precision and sensitivity of the CTAT-Mutations pipeline.

RNAseq reads:

https://www.ncbi.nlm.nih.gov/sra/?term=SRR5665260 

High Confidence Regions:

https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed

Reference SNPs:

https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz

Exome BAM file (used when the exome serves as the reference):

ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/project.NIST_NIST7035_H7AP8ADXX_TAAGGCGA_1_NA12878.bwa.markDuplicates.bam

Analysis

CTAT-Mutations frames SNV refinement as a class-imbalanced classification problem and INDEL refinement as a regression-based classification problem, targeting both somatic and germline variants. The models used in the CTAT-Mutations pipeline undergo hyperparameter optimization with standard 10-fold cross-validation designed to maximize their F1 scores. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high; it is chosen as the optimization target because it balances precision (PPV) against recall (sensitivity). There are five tree-based models (ADABOOST, GB, NGB, SGB, RF) and three other models (SVML, SVM-RBF, LR), of which SVML and LR are linear.
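As an illustration of selecting a hyperparameter by 10-fold cross-validated F1, here is a minimal, standard-library-only sketch. It is not the pipeline's code: the toy 1-D k-nearest-neighbour classifier stands in for the models listed above, the synthetic imbalanced data, the grid `[1, 5, 15]`, the use of a plain grid search, and all function names are assumptions made for the example.

```python
import random


def f1_score(y_true, y_pred):
    """F1 from boolean labels and predictions (0.0 when TP, FP and FN are all zero)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def k_folds(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(xs, ys, train, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda i: abs(xs[i] - x))[:n_neighbors]
    return 2 * sum(ys[i] for i in nearest) > n_neighbors


def cv_f1(xs, ys, n_neighbors, folds):
    """Mean F1 over the folds, each fold held out once for evaluation."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        preds = [knn_predict(xs, ys, train, xs[i], n_neighbors) for i in held_out]
        scores.append(f1_score([ys[i] for i in held_out], preds))
    return sum(scores) / len(scores)


# Synthetic, imbalanced toy data (hypothetical): 180 negatives, 20 positives.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(180)] + [rng.gauss(2, 1) for _ in range(20)]
ys = [False] * 180 + [True] * 20

folds = k_folds(len(xs), 10, rng)   # the same 10 folds are reused for every candidate
grid = [1, 5, 15]                   # candidate values of the hyperparameter
results = {k: cv_f1(xs, ys, k, folds) for k in grid}
best = max(results, key=results.get)
print(results, "best n_neighbors:", best)
```

Reusing the same folds for every candidate keeps the comparison between hyperparameter values fair, since each is scored on identical held-out sets.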

For benchmarking purposes, the reference variants provided by the GIAB consortium and variants called from exome sequencing are used as the known (true) variant references.

Performance measures:

TP (True Positive): SNP detected by the CTAT-Mutations pipeline and present in the reference
FP (False Positive): SNP detected by the CTAT-Mutations pipeline but not found in the reference
FN (False Negative): SNP found in the reference but not detected by the CTAT-Mutations pipeline
TN (True Negative): 3.2e9 - (TP + FP + FN), where 3.2e9 approximates the number of positions in the human genome
False Positive Rate (FPR) = FP / (FP + TN)
Sensitivity (SN) = TP / (TP + FN)
Positive Predictive Value (PPV) = TP / (TP + FP)
False Discovery Rate (FDR) = 1 - PPV
F1 = (2 * SN * PPV) / (SN + PPV)
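The performance measures above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline's own evaluation code: the function names, the representation of a variant as a `(chrom, pos, ref, alt)` tuple, and the toy variants are assumptions made for the example. The constant 3.2e9 is taken from the TN definition.

```python
GENOME_SIZE = 3.2e9  # constant used in the TN definition


def count_calls(called, reference):
    """Compare two sets of variants and return (tp, fp, fn)."""
    tp = len(called & reference)   # called and present in the reference
    fp = len(called - reference)   # called but absent from the reference
    fn = len(reference - called)   # in the reference but not called
    return tp, fp, fn


def performance(tp, fp, fn, genome_size=GENOME_SIZE):
    """Derive the performance measures from the raw counts."""
    tn = genome_size - (tp + fp + fn)
    sn = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return {
        "fpr": fp / (fp + tn),
        "sn": sn,
        "ppv": ppv,
        "fdr": 1 - ppv,
        "f1": 2 * sn * ppv / (sn + ppv),
    }


# Toy call sets (hypothetical variants, for illustration only).
called = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
reference = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr3", 10, "T", "C")}
print(count_calls(called, reference))

# Counts from the RF row of the GIAB SNV summary statistics.
m = performance(28314, 420, 1851)
print(f"SN={m['sn']:.6f} PPV={m['ppv']:.6f} F1={m['f1']:.6f}")
```

Because TN is dominated by the genome size, the false positive rate is extremely small for every model, which is why the tables report SN and PPV rather than FPR.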

Symbol to Algorithm reference:

| Symbol | Algorithm |
|---|---|
| ADABOOST | AdaBoost |
| GB | Gradient Boosting |
| SGB | Stochastic Gradient Boosting |
| NGB | Natural Gradient Boosting |
| RF | Random Forest |
| SVML | Support Vector Machine with a Linear kernel |
| SVM-RBF | Support Vector Machine with a Radial Basis Function kernel |
| LR | Logistic Regression |

The following plots show F1 scores for CTAT-Mutations pipeline outputs produced with each algorithm, alongside a baseline output.

GIAB Reference Variants

SNV F1

Summary statistics:

SNV

| Type | TP | FP | FN | SN | PPV |
|---|---|---|---|---|---|
| RF | 28314 | 420 | 1851 | 0.938637 | 0.985383 |
| SGB | 28187 | 636 | 1978 | 0.934427 | 0.977934 |
| GB | 27895 | 525 | 2270 | 0.924747 | 0.981527 |
| NGB | 27919 | 673 | 2246 | 0.925543 | 0.976462 |
| ADABOOST | 27916 | 993 | 2249 | 0.925443 | 0.965651 |
| SVM-RBF | 27944 | 1199 | 2221 | 0.926372 | 0.958858 |
| SVML | 28233 | 4326 | 1932 | 0.935952 | 0.867134 |
| LR | 28491 | 4997 | 1674 | 0.944505 | 0.850782 |
| BASELINE | 29321 | 12239 | 844 | 0.972021 | 0.705510 |
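Since the table reports SN and PPV but not F1 itself, the ranking can be recovered from the counts. The short sketch below uses the equivalent count form F1 = 2*TP / (2*TP + FP + FN) on three rows copied from the table above; the function name and the choice of rows are illustrative assumptions.

```python
# (name, tp, fp, fn) copied from three rows of the GIAB SNV table
rows = [
    ("LR", 28491, 4997, 1674),
    ("BASELINE", 29321, 12239, 844),
    ("RF", 28314, 420, 1851),
]


def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of counts; equal to 2*SN*PPV / (SN + PPV)."""
    return 2 * tp / (2 * tp + fp + fn)


# Sort from highest to lowest F1.
ranked = sorted(rows, key=lambda r: f1_from_counts(*r[1:]), reverse=True)
for name, tp, fp, fn in ranked:
    print(f"{name:<9} F1 = {f1_from_counts(tp, fp, fn):.4f}")
```

The baseline has the highest sensitivity of the three but the lowest F1, because its large number of false positives pulls PPV down.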

INDEL

| Type | TP | FP | FN | SN | PPV |
|---|---|---|---|---|---|
| GB | 2942 | 554 | 703 | 0.807133 | 0.841533 |
| ADABOOST | 2945 | 588 | 700 | 0.807956 | 0.833569 |
| SGB | 2958 | 720 | 687 | 0.811523 | 0.804241 |
| NGB | 2941 | 1024 | 704 | 0.806859 | 0.741740 |
| RF | 2949 | 1206 | 696 | 0.809053 | 0.709747 |
| SVM-RBF | 2920 | 1499 | 725 | 0.801097 | 0.660783 |
| SVML | 2863 | 2402 | 782 | 0.785460 | 0.543780 |
| BASELINE | 3010 | 2947 | 635 | 0.825789 | 0.505288 |

Exome Variants as Reference

When a true variant reference is not available, variants identified in the exome reads (BAM file) are used as the reference instead. For this benchmarking analysis, the exome alignment for the GIAB data was performed following GATK's best practices.

SNV Ex F1

SNV

| Type | TP | FP | FN | SN | PPV |
|---|---|---|---|---|---|
| SGB | 16550 | 464 | 2117 | 0.886591 | 0.972728 |
| GB | 16507 | 427 | 2159 | 0.884335 | 0.974784 |
| NGB | 16485 | 492 | 2181 | 0.883157 | 0.971020 |
| RF | 16268 | 354 | 2399 | 0.871484 | 0.978703 |
| ADABOOST | 16456 | 587 | 2210 | 0.881603 | 0.965558 |
| SVM-RBF | 16675 | 1075 | 1991 | 0.893335 | 0.939437 |
| LR | 16942 | 2238 | 1724 | 0.907640 | 0.883316 |
| SVML | 16925 | 2240 | 1741 | 0.906729 | 0.883120 |
| BASELINE | 17094 | 2818 | 1572 | 0.915783 | 0.858477 |

INDEL

| Type | TP | FP | FN | SN | PPV |
|---|---|---|---|---|---|
| ADABOOST | 1592 | 397 | 1084 | 0.594918 | 0.800402 |
| GB | 1585 | 395 | 1091 | 0.592302 | 0.800505 |
| SGB | 1600 | 483 | 1076 | 0.597907 | 0.768123 |
| NGB | 1620 | 709 | 1056 | 0.605381 | 0.695578 |
| RF | 1624 | 815 | 1052 | 0.606876 | 0.665847 |
| SVM-RBF | 1604 | 848 | 1072 | 0.599402 | 0.654160 |
| BASELINE | 1670 | 1547 | 1006 | 0.624066 | 0.519117 |
| SVML | 1620 | 1432 | 1056 | 0.605381 | 0.530799 |

Workflow

The flowchart below represents the exome and RNA-seq mutation detection workflows. The RNA-labeled steps represent the CTAT-Mutations pipeline workflow for RNA-seq data. The WXS-labeled steps are performed outside of the CTAT-Mutations pipeline on whole-exome sequencing (WXS) data. The WXS data is used for performance benchmarking, to assess the accuracy of the pipeline's variant calling and filtering on RNA-seq data in the absence of true reference SNPs.