Performance Assessment
The following variant identification analysis was performed on CTAT-Mutation pipeline outputs to assess the pipeline's accuracy and performance. To validate accuracy, the CTAT-Mutation pipeline was applied to the GM12878 cell line, the Genome in a Bottle (GIAB) reference sample, and then benchmarked against the well-curated, high-confidence reference variants provided by the GIAB consortium. Performance is measured with Precision-Recall (PR) and Receiver Operating Characteristic (ROC) curves plotted under several coverage thresholds.
For this analysis, the GM12878 **Genome in a Bottle** (GIAB) cell line was chosen. The RNA was sequenced to a depth of 80 million, with 150-bp paired-end reads generated on an Illumina NextSeq (SRA accession: SRS2267720). GIAB is a public-private-academic consortium hosted by the National Institute of Standards and Technology (NIST) to develop reference methods, reference data, and reference standards for research purposes.
The dataset has the following advantages:
- The transcriptome, exome, and whole genome have been deeply sequenced for these samples, allowing accurate identification of variants from RNA and DNA of the same individual.
- Matching RNA and DNA samples increases certainty in RNA SNP calls. The RNA variant calls are compared with the variants present in the more reliable DNA samples, which gives confidence in the RNA-seq variant calls.
- The GM12878 cell line has been extensively studied, and SNPs detected in its genome have been continuously deposited into dbSNP.

These features make GM12878 a good candidate for evaluating the precision and sensitivity of the CTAT-mutations pipeline.
RNAseq reads:
https://www.ncbi.nlm.nih.gov/sra/?term=SRR5665260
High Confidence Regions:
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed
Reference SNPs:
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz
Exome bam file (when Exome is used as a reference):
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/project.NIST_NIST7035_H7AP8ADXX_TAAGGCGA_1_NA12878.bwa.markDuplicates.bam
CTAT-Mutations frames SNV refinement as a class-imbalanced classification problem and INDEL refinement as a regression-based classification problem, simultaneously targeting both somatic and germline variants. The models used in the CTAT-Mutation pipeline undergo hyperparameter optimization with standard 10-fold cross-validation designed to maximize their F1 scores. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high. It is therefore chosen as the optimization target because it balances sensitivity (recall) against positive predictive value (precision). There are five tree-based models (ADABOOST, GB, NGB, SGB, RF) and three linear or kernel-based models (SVML, SVM-RBF, LR).
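To make the selection criterion concrete, the sketch below shows how a hyperparameter can be chosen by mean F1 over 10 cross-validation folds. It is a minimal, standard-library-only illustration and not the pipeline's code: the toy imbalanced data, the 1-D k-nearest-neighbour stand-in model, the candidate values, and the helper names `f1_score`, `k_fold_indices`, `knn_predict`, and `select_by_cv_f1` are all hypothetical.

```python
import random
from statistics import mean

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall = 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def k_fold_indices(n, n_folds, seed=0):
    """Shuffle 0..n-1 and deal the indices into n_folds disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def knn_predict(train_x, train_y, x, k):
    """Toy 1-D k-nearest-neighbour classifier (stand-in for a real model)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0

def select_by_cv_f1(xs, ys, candidates, n_folds=10):
    """Return (best_value, mean_f1) over the candidate hyperparameter values."""
    folds = k_fold_indices(len(xs), n_folds)
    results = {}
    for k in candidates:
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(xs)) if i not in held]
            tx = [xs[i] for i in train]
            ty = [ys[i] for i in train]
            pred = [knn_predict(tx, ty, xs[i], k) for i in held_out]
            scores.append(f1_score([ys[i] for i in held_out], pred))
        results[k] = mean(scores)  # average F1 across the held-out folds
    best = max(results, key=results.get)
    return best, results[best]

# Hypothetical imbalanced toy data: roughly 20% positives.
rng = random.Random(42)
ys = [1 if rng.random() < 0.2 else 0 for _ in range(200)]
xs = [rng.gauss(2.0 if y else 0.0, 1.0) for y in ys]

best_k, best_f1 = select_by_cv_f1(xs, ys, candidates=[1, 3, 5, 7])
print(best_k, round(best_f1, 3))
```

The folds here are plain shuffled splits. With strongly imbalanced classes, a stratified split that keeps the positive fraction similar in every fold is a common alternative.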
For benchmarking purposes, the reference variants provided by the GIAB consortium and variants called from exome sequencing are used as the known/true variant references.
TP (True Positive): SNP detected by the CTAT-mutations pipeline that matches the reference
FP (False Positive): SNP detected by the CTAT-mutations pipeline but not found in the reference
FN (False Negative): SNP found in the reference but not detected by the CTAT-mutations pipeline
TN (True Negative): 3.2e9 - (TP + FP + FN), where 3.2e9 approximates the number of positions in the human genome
False Positive Rate (FPR) = FP / (FP + TN)
Sensitivity (SN) = TP / (TP + FN)
Positive Predictive Value (PPV) = TP / (TP + FP)
False Discovery Rate (FDR) = 1 - PPV
F1 = (2 * SN * PPV) / (SN + PPV)
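The definitions above can be sketched in a few lines of Python. This is an illustrative sketch rather than the pipeline's implementation: representing a variant as a `(chrom, pos, ref, alt)` tuple, the toy call sets, and the function names `classify_calls` and `performance` are assumptions made for the example.

```python
GENOME_SIZE = 3.2e9  # approximate genome size used in the TN definition above

def classify_calls(called, reference):
    """Count TP, FP, FN by comparing called variants against the reference set."""
    called, reference = set(called), set(reference)
    tp = len(called & reference)   # called and in the reference
    fp = len(called - reference)   # called but not in the reference
    fn = len(reference - called)   # in the reference but not called
    return tp, fp, fn

def _ratio(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def performance(tp, fp, fn, genome_size=GENOME_SIZE):
    """Compute TN, FPR, SN, PPV, FDR, and F1 from the three counts."""
    tn = genome_size - (tp + fp + fn)
    sn = _ratio(tp, tp + fn)
    ppv = _ratio(tp, tp + fp)
    return {
        "tn": tn,
        "fpr": _ratio(fp, fp + tn),
        "sn": sn,
        "ppv": ppv,
        "fdr": 1 - ppv,
        "f1": _ratio(2 * sn * ppv, sn + ppv),
    }

# Hypothetical toy call sets, keyed by (chrom, pos, ref, alt).
toy_reference = [("1", 100, "A", "G"), ("1", 250, "C", "T"), ("2", 75, "G", "A")]
toy_called = [("1", 100, "A", "G"), ("2", 75, "G", "A"), ("3", 9, "T", "C")]
print(classify_calls(toy_called, toy_reference))

# Counts taken from the RF row of the SNV table benchmarked against GIAB.
rf = performance(tp=28314, fp=420, fn=1851)
print(f"sn={rf['sn']:.6f} ppv={rf['ppv']:.6f} f1={rf['f1']:.4f}")
```

Because TN is dominated by the genome size, the false positive rate is tiny for every model, which is one reason sensitivity and PPV are the more informative columns in the tables.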
Symbol | Algorithm |
---|---|
ADABOOST | AdaBoost |
GB | Gradient Boosting |
SGB | Stochastic Gradient Boosting |
NGB | Natural Gradient Boosting |
RF | Random Forest |
SVML | Support Vector Machine with a Linear kernel |
SVM-RBF | Support Vector Machine with a Radial Basis Function kernel |
LR | Logistic Regression |
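The following results summarize CTAT-Mutation pipeline outputs for each algorithm, along with a baseline output, benchmarked against the GIAB reference variants. Columns give true positives (tp), false positives (fp), false negatives (fn), sensitivity (sn), and positive predictive value (ppv).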
SNV
Type | tp | fp | fn | sn | ppv |
---|---|---|---|---|---|
RF | 28314 | 420 | 1851 | 0.938637 | 0.985383 |
SGB | 28187 | 636 | 1978 | 0.934427 | 0.977934 |
GB | 27895 | 525 | 2270 | 0.924747 | 0.981527 |
NGB | 27919 | 673 | 2246 | 0.925543 | 0.976462 |
ADABOOST | 27916 | 993 | 2249 | 0.925443 | 0.965651 |
SVM-RBF | 27944 | 1199 | 2221 | 0.926372 | 0.958858 |
SVML | 28233 | 4326 | 1932 | 0.935952 | 0.867134 |
LR | 28491 | 4997 | 1674 | 0.944505 | 0.850782 |
BASELINE | 29321 | 12239 | 844 | 0.972021 | 0.705510 |
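F1 is not listed as a column, but it can be recovered from the counts, since F1 = 2TP / (2TP + FP + FN) is algebraically the same as the SN/PPV form. The short sketch below does this for the SNV rows above. The counts are copied from the table, while the names `snv_counts` and `f1_from_counts` are illustrative.

```python
# (tp, fp, fn) counts copied from the SNV table above.
snv_counts = {
    "RF": (28314, 420, 1851),
    "SGB": (28187, 636, 1978),
    "GB": (27895, 525, 2270),
    "NGB": (27919, 673, 2246),
    "ADABOOST": (27916, 993, 2249),
    "SVM-RBF": (27944, 1199, 2221),
    "SVML": (28233, 4326, 1932),
    "LR": (28491, 4997, 1674),
    "BASELINE": (29321, 12239, 844),
}

def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in counts; equals 2*SN*PPV / (SN + PPV)."""
    return 2 * tp / (2 * tp + fp + fn)

# Rank the models from highest to lowest F1.
ranked = sorted(snv_counts, key=lambda name: f1_from_counts(*snv_counts[name]), reverse=True)
for name in ranked:
    print(f"{name:9s} F1 = {f1_from_counts(*snv_counts[name]):.4f}")
```

The baseline has the highest sensitivity but by far the most false positives, so its F1 is the lowest. The refinement models trade a small loss in sensitivity for a large gain in PPV.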
INDEL
Type | tp | fp | fn | sn | ppv |
---|---|---|---|---|---|
GB | 2942 | 554 | 703 | 0.807133 | 0.841533 |
ADABOOST | 2945 | 588 | 700 | 0.807956 | 0.833569 |
SGB | 2958 | 720 | 687 | 0.811523 | 0.804241 |
NGB | 2941 | 1024 | 704 | 0.806859 | 0.741740 |
RF | 2949 | 1206 | 696 | 0.809053 | 0.709747 |
SVM-RBF | 2920 | 1499 | 725 | 0.801097 | 0.660783 |
SVML | 2863 | 2402 | 782 | 0.785460 | 0.543780 |
BASELINE | 3010 | 2947 | 635 | 0.825789 | 0.505288 |
When a true variant reference is not available, variants identified in exome reads (or BAM) are used as the true reference. For this benchmarking analysis, exome alignment of the GIAB data was performed following GATK's Best Practices.
SNV
Type | tp | fp | fn | sn | ppv |
---|---|---|---|---|---|
SGB | 16550 | 464 | 2117 | 0.886591 | 0.972728 |
GB | 16507 | 427 | 2159 | 0.884335 | 0.974784 |
NGB | 16485 | 492 | 2181 | 0.883157 | 0.971020 |
RF | 16268 | 354 | 2399 | 0.871484 | 0.978703 |
ADABOOST | 16456 | 587 | 2210 | 0.881603 | 0.965558 |
SVM-RBF | 16675 | 1075 | 1991 | 0.893335 | 0.939437 |
LR | 16942 | 2238 | 1724 | 0.907640 | 0.883316 |
SVML | 16925 | 2240 | 1741 | 0.906729 | 0.883120 |
BASELINE | 17094 | 2818 | 1572 | 0.915783 | 0.858477 |
INDEL
Type | tp | fp | fn | sn | ppv |
---|---|---|---|---|---|
ADABOOST | 1592 | 397 | 1084 | 0.594918 | 0.800402 |
GB | 1585 | 395 | 1091 | 0.592302 | 0.800505 |
SGB | 1600 | 483 | 1076 | 0.597907 | 0.768123 |
NGB | 1620 | 709 | 1056 | 0.605381 | 0.695578 |
RF | 1624 | 815 | 1052 | 0.606876 | 0.665847 |
SVM-RBF | 1604 | 848 | 1072 | 0.599402 | 0.654160 |
BASELINE | 1670 | 1547 | 1006 | 0.624066 | 0.519117 |
SVML | 1620 | 1432 | 1056 | 0.605381 | 0.530799 |
The flowchart below represents the exome and RNA-seq mutation detection workflows. The RNA-labeled steps represent the CTAT-Mutation pipeline workflow for RNA-seq data. The WXS-labeled steps are performed outside of the CTAT-Mutation pipeline on whole-exome sequencing (WXS) data. The WXS data is used for performance benchmarking, to assess the accuracy of the pipeline's variant calling and filtering on RNA-seq data in the absence of true reference SNPs.