Misleading StrandOddsRatio value when using --max-mnp-distance #5698

tfenne · 2019-02-21T00:55:46Z

Bug Report

Affected tool(s) or class(es)

HaplotypeCaller with --max-mnp-distance filter.

Affected version(s)

Latest public release version [4.1.0.0]
Latest master branch as of [date of test?]

Description

I think there's a problem with the StrandOddsRatio (SOR) annotation and the --map-mnp-distance flag. I'm looking at a small region of NA24143 (one of the GIAB samples). There's a pair of SNPs in very close proximity. When called without the MNP output I get a pair of variants as follows (some info removed for clarity), coordinates are HG19:

chr4  5743509  .  C  T  5903.03  .  FS=0.000;QD=25.36;SOR=9.825  GT:AD:DP:GQ:PL  1/1:0,135:135:99:5917,406,0
chr4  5743512  .  T  C  2766.60  .  FS=0.000;QD=21.12;SOR=0.983  GT:AD:DP:GQ:PL  0/1:57,74:131:99:2774,0,2060

I'm trying to get permission to share the BAM over this region, but the key information is that every single read that spans or is in proximity to these variants is on the R strand. There is zero F strand coverage.

This seems reasonable. It's a bit odd to me that the first SNP which is hom-var has a SOR value of 9.825, but it's homozygous so it's more or less irrelevant. Looking at the code, I think the problem here is that the code avoids divide-by-zero errors by adding pseudo-counts of 1.0 to the table, which for homozygous variants with no coverage on one strand creates a weird situation. I think it would be better to just detect if all coverage is on one strand and short-circuit the calculation, but I digress.

The real problem comes when running with --max-mnp-distance 5. Then I get this single variant:

chr4  5743509  .  CTAT  TTAC,TTAT  5506.10  .  FS=0.000;QD=25.36;SOR=9.750  GT:AD:DP:GQ:PL  1/2:0,74,56:130:99:5523,2213,2060,3016,0,2774

Now I have a het variant with an SOR of 9.75. This seems really wrong to me - note how FS is 0.0. Again all coverage of all alleles is on one strand. And the het SNP that forms part of this MNP had an SOR of 0.983 when called independently. Since the first SNP is hom-var and the second is het, I would have expected the SOR value for the MNP call to closely mirror that of the het SNP.

My suspicion is that what's going on here is probably that the calculation is being run using the contingency table for the hom-var SNP that's first in the MNP, perhaps filtered to only reads that span the whole MNP, since the value is marginally lower.

Steps to reproduce

I will try and post a BAM snippet later, but essentially to reproduce I think a synthetic test case with the following properties would work:

Coverage on only one strand
A HomVar SNP at position n
A Het SNP at position n+2
Calls made with --max-mnp-distance 5

Expected behavior

SOR value should be close in value to that of the het-snp called in isolation.

Actual behavior

SOR value is in fact similar to that of the HomVar snp.

The text was updated successfully, but these errors were encountered:

ldgauthier · 2019-02-21T17:14:57Z

@tfenne the bam would definitely be useful for the MNP issue (although likely not necessary), but if you can give me just the 2x2 contingency table (the FORMAT SB) then I can take a look at that pseudo count issue.

ldgauthier · 2019-02-21T17:21:40Z

Actually the contingency tables for both variants in the split case and for the MNP would be super useful. You have to run in GVCF mode to generate them.

tfenne · 2019-02-21T20:49:26Z

Thanks @ldgauthier! I think this is what you're asking for:

# gVCF without --max-mnp-distance
chr4  5743509  .  C     T,<NON_REF>          5888.77  .  DP=135;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=486000,135                                                            GT:AD:DP:GQ:PGT:PID:PL:PS:SB  1|1:0,135,0:135:99:0|1:5743509_C_T:5917,406,0,5917,406,5917:5743509:0,0,0,135
chr4  5743512  .  T     C,<NON_REF>          2745.77  .  BaseQRankSum=-1.002;DP=131;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=471600,131;ReadPosRankSum=2.613  GT:AD:DP:GQ:PGT:PID:PL:PS:SB  0|1:57,74,0:131:99:0|1:5743509_C_T:2774,0,2060,2945,2283,5228:5743509:0,57,0,74

# gVCF with --max-mnp-distance 5
chr4  5743509  .  CTAT  TTAC,TTAT,<NON_REF>  5485.73  .  DP=136;ExcessHet=3.0103;MLEAC=1,1,0;MLEAF=0.500,0.500,0.00;RAW_MQandDP=489600,136                                                   GT:AD:DP:GQ:PL:SB             1/2:0,74,56,0:130:99:5523,2213,2060,3016,0,2774,5388,2261,2994,5376:0,0,0,130

I should also be able to share the BAM later today - I just need to sanitize it a little before sharing.

ldgauthier · 2019-02-22T20:29:04Z

Perfect, thanks!

I'm pretty sure I've got it, in which case I don't need the bam. The contingency table is ref vs. not ref. For the 0|1 GT, those 57 reads are considered reference, but in the 1/2 case those 57 reads are considered alts. On one level, that seems fair to me because none of those reads match the reference at both positions. It does make for a less than informative comparison, though. But is measuring the bias of one alt against another informative for filtering? For MQRankSum I stand by comparing the alt mapping to the reference, but I have to think about this one some more.

Using allele-specific annotations as I've implemented them isn't going to help here either because all the comparisons (strand bias or rank sum) are made against the reference. I definitely agree that the SOR for the homVar is going to be bad for filtering and I'll fix it.

tfenne · 2019-02-22T20:37:26Z

Thanks @ldgauthier, that's definitely thought provoking. Perhaps you are right that for het-non-ref (i.e. 1/2 and similar) genotypes we should consider skipping the strand bias test. I had been assuming, incorrectly, that in a single sample case with a 1/2 genotype that the test would be based on the called alleles, not ref vs. alt.

Just thinking out loud about this, I wonder if computing SOR on the 1/2 alleles would catch a small number of incorrect genotypes? I've been doing something analogous with allele balance filtering - whereby I compute a per-sample allele balance and directionality. If the genotype is imbalanced towards ref, I filter the variant. If the genotype is imbalanced towards a different allele I "correct" the genotype to be homozygous for that allele. In testing on reference samples from GIAB and PlatGen we are seeing that correct a handful of genotypes where the sample is really hom-alt, but called het due to a high error rate.

I'll read up on how the allele-specific strand bias works because I've never looked at that before. But I wonder if it's possible from the available annotations to get a SOR-like value for each allele that is that allele vs all other alleles?

tfenne · 2019-02-22T22:04:12Z

One more question related to this. In playing around I've noticed that if I run HC in GVCF mode with -A AS_StrandOddsRatio it will output a table like this into the gVCF: AS_SB_TABLE=0,0|34,24|21,33|0,0. But when I run GenotypeGVCFs this gets reduced to AS_SOR=1.085, and the original AS_SB_TABLE annotation is removed. Is there any way to get GenotypeGVCFs to carry the table forward into the genotypes VCF?

sooheelee · 2019-03-14T22:35:40Z

Hi @tfenne, it may be that https://hail.is/ will allow you to do what you wish to do. Might be worth checking out.

ldgauthier · 2019-03-15T18:15:31Z

@tfenne there's no commandline option now, but it would be easy to add one. The line that's removing AS_SB_TABLE is here:

gatk/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/VariantAnnotatorEngine.java

Line 224 in 69f4006

variantAnnotations.remove(currentASannotation.getRawKeyName());

You'll probably need to add a new field to VariantAnnotatorEngine to be set via GenotypeGVCFs, but that's easy peasy.

I can toss this in with the F1R2 fix, but it might be a couple of days.

…ations (#5833) * Update A-type and R-type FORMAT annotations in GenotypeGVCFs * Add option to keep combined AS annotations, as requested in #5698

nh13 mentioned this issue Feb 21, 2019

Understanding the implementation of the StrandOddsRatio #5700

Closed

ldgauthier mentioned this issue Feb 22, 2019

SOR way too high for hom-vars/het-non-refs at the ends of exons #5711

Open

nh13 mentioned this issue Mar 21, 2019

F1R2 and F2R2 annotations not updated by GenotypeGvcfs #5704

Closed

ldgauthier added a commit that referenced this issue Mar 25, 2019

Add option to keep combined AS annotations, as requested in #5698

3040848

ldgauthier mentioned this issue Mar 25, 2019

Trim per-allele FORMAT annotations and optionally retain raw AS annotations #5833

Merged

tfenne mentioned this issue Jul 3, 2019

GenotypeGVCFs --keep-combined-raw-annotations doesn't subset to output alleles #6029

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Misleading StrandOddsRatio value when using --max-mnp-distance #5698

Misleading StrandOddsRatio value when using --max-mnp-distance #5698

tfenne commented Feb 21, 2019

ldgauthier commented Feb 21, 2019

ldgauthier commented Feb 21, 2019

tfenne commented Feb 21, 2019

ldgauthier commented Feb 22, 2019

tfenne commented Feb 22, 2019

tfenne commented Feb 22, 2019

sooheelee commented Mar 14, 2019

ldgauthier commented Mar 15, 2019

Misleading StrandOddsRatio value when using --max-mnp-distance #5698

Misleading StrandOddsRatio value when using --max-mnp-distance #5698

Comments

tfenne commented Feb 21, 2019

Bug Report

Affected tool(s) or class(es)

Affected version(s)

Description

Steps to reproduce

Expected behavior

Actual behavior

ldgauthier commented Feb 21, 2019

ldgauthier commented Feb 21, 2019

tfenne commented Feb 21, 2019

ldgauthier commented Feb 22, 2019

tfenne commented Feb 22, 2019

tfenne commented Feb 22, 2019

sooheelee commented Mar 14, 2019

ldgauthier commented Mar 15, 2019