Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misleading StrandOddsRatio value when using --max-mnp-distance #5698

Open
1 of 2 tasks
tfenne opened this issue Feb 21, 2019 · 8 comments
Open
1 of 2 tasks

Misleading StrandOddsRatio value when using --max-mnp-distance #5698

tfenne opened this issue Feb 21, 2019 · 8 comments

Comments

@tfenne
Copy link
Contributor

tfenne commented Feb 21, 2019

Bug Report

Affected tool(s) or class(es)

HaplotypeCaller with --max-mnp-distance filter.

Affected version(s)

  • Latest public release version [4.1.0.0]
  • Latest master branch as of [date of test?]

Description

I think there's a problem with the StrandOddsRatio (SOR) annotation and the --map-mnp-distance flag. I'm looking at a small region of NA24143 (one of the GIAB samples). There's a pair of SNPs in very close proximity. When called without the MNP output I get a pair of variants as follows (some info removed for clarity), coordinates are HG19:

chr4  5743509  .  C  T  5903.03  .  FS=0.000;QD=25.36;SOR=9.825  GT:AD:DP:GQ:PL  1/1:0,135:135:99:5917,406,0
chr4  5743512  .  T  C  2766.60  .  FS=0.000;QD=21.12;SOR=0.983  GT:AD:DP:GQ:PL  0/1:57,74:131:99:2774,0,2060

I'm trying to get permission to share the BAM over this region, but the key information is that every single read that spans or is in proximity to these variants is on the R strand. There is zero F strand coverage.

This seems reasonable. It's a bit odd to me that the first SNP which is hom-var has a SOR value of 9.825, but it's homozygous so it's more or less irrelevant. Looking at the code, I think the problem here is that the code avoids divide-by-zero errors by adding pseudo-counts of 1.0 to the table, which for homozygous variants with no coverage on one strand creates a weird situation. I think it would be better to just detect if all coverage is on one strand and short-circuit the calculation, but I digress.

The real problem comes when running with --max-mnp-distance 5. Then I get this single variant:

chr4  5743509  .  CTAT  TTAC,TTAT  5506.10  .  FS=0.000;QD=25.36;SOR=9.750  GT:AD:DP:GQ:PL  1/2:0,74,56:130:99:5523,2213,2060,3016,0,2774

Now I have a het variant with an SOR of 9.75. This seems really wrong to me - note how FS is 0.0. Again all coverage of all alleles is on one strand. And the het SNP that forms part of this MNP had an SOR of 0.983 when called independently. Since the first SNP is hom-var and the second is het, I would have expected the SOR value for the MNP call to closely mirror that of the het SNP.

My suspicion is that what's going on here is probably that the calculation is being run using the contingency table for the hom-var SNP that's first in the MNP, perhaps filtered to only reads that span the whole MNP, since the value is marginally lower.

Steps to reproduce

I will try and post a BAM snippet later, but essentially to reproduce I think a synthetic test case with the following properties would work:

  1. Coverage on only one strand
  2. A HomVar SNP at position n
  3. A Het SNP at position n+2
  4. Calls made with --max-mnp-distance 5

Expected behavior

SOR value should be close in value to that of the het-snp called in isolation.

Actual behavior

SOR value is in fact similar to that of the HomVar snp.

@ldgauthier
Copy link
Contributor

@tfenne the bam would definitely be useful for the MNP issue (although likely not necessary), but if you can give me just the 2x2 contingency table (the FORMAT SB) then I can take a look at that pseudo count issue.

@ldgauthier
Copy link
Contributor

Actually the contingency tables for both variants in the split case and for the MNP would be super useful. You have to run in GVCF mode to generate them.

@tfenne
Copy link
Contributor Author

tfenne commented Feb 21, 2019

Thanks @ldgauthier! I think this is what you're asking for:

# gVCF without --max-mnp-distance
chr4  5743509  .  C     T,<NON_REF>          5888.77  .  DP=135;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=486000,135                                                            GT:AD:DP:GQ:PGT:PID:PL:PS:SB  1|1:0,135,0:135:99:0|1:5743509_C_T:5917,406,0,5917,406,5917:5743509:0,0,0,135
chr4  5743512  .  T     C,<NON_REF>          2745.77  .  BaseQRankSum=-1.002;DP=131;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=471600,131;ReadPosRankSum=2.613  GT:AD:DP:GQ:PGT:PID:PL:PS:SB  0|1:57,74,0:131:99:0|1:5743509_C_T:2774,0,2060,2945,2283,5228:5743509:0,57,0,74

# gVCF with --max-mnp-distance 5
chr4  5743509  .  CTAT  TTAC,TTAT,<NON_REF>  5485.73  .  DP=136;ExcessHet=3.0103;MLEAC=1,1,0;MLEAF=0.500,0.500,0.00;RAW_MQandDP=489600,136                                                   GT:AD:DP:GQ:PL:SB             1/2:0,74,56,0:130:99:5523,2213,2060,3016,0,2774,5388,2261,2994,5376:0,0,0,130

I should also be able to share the BAM later today - I just need to sanitize it a little before sharing.

@ldgauthier
Copy link
Contributor

Perfect, thanks!

I'm pretty sure I've got it, in which case I don't need the bam. The contingency table is ref vs. not ref. For the 0|1 GT, those 57 reads are considered reference, but in the 1/2 case those 57 reads are considered alts. On one level, that seems fair to me because none of those reads match the reference at both positions. It does make for a less than informative comparison, though. But is measuring the bias of one alt against another informative for filtering? For MQRankSum I stand by comparing the alt mapping to the reference, but I have to think about this one some more.

Using allele-specific annotations as I've implemented them isn't going to help here either because all the comparisons (strand bias or rank sum) are made against the reference. I definitely agree that the SOR for the homVar is going to be bad for filtering and I'll fix it.

@tfenne
Copy link
Contributor Author

tfenne commented Feb 22, 2019

Thanks @ldgauthier, that's definitely thought provoking. Perhaps you are right that for het-non-ref (i.e. 1/2 and similar) genotypes we should consider skipping the strand bias test. I had been assuming, incorrectly, that in a single sample case with a 1/2 genotype that the test would be based on the called alleles, not ref vs. alt.

Just thinking out loud about this, I wonder if computing SOR on the 1/2 alleles would catch a small number of incorrect genotypes? I've been doing something analogous with allele balance filtering - whereby I compute a per-sample allele balance and directionality. If the genotype is imbalanced towards ref, I filter the variant. If the genotype is imbalanced towards a different allele I "correct" the genotype to be homozygous for that allele. In testing on reference samples from GIAB and PlatGen we are seeing that correct a handful of genotypes where the sample is really hom-alt, but called het due to a high error rate.

I'll read up on how the allele-specific strand bias works because I've never looked at that before. But I wonder if it's possible from the available annotations to get a SOR-like value for each allele that is that allele vs all other alleles?

@tfenne
Copy link
Contributor Author

tfenne commented Feb 22, 2019

One more question related to this. In playing around I've noticed that if I run HC in GVCF mode with -A AS_StrandOddsRatio it will output a table like this into the gVCF: AS_SB_TABLE=0,0|34,24|21,33|0,0. But when I run GenotypeGVCFs this gets reduced to AS_SOR=1.085, and the original AS_SB_TABLE annotation is removed. Is there any way to get GenotypeGVCFs to carry the table forward into the genotypes VCF?

@sooheelee
Copy link
Contributor

Hi @tfenne, it may be that https://hail.is/ will allow you to do what you wish to do. Might be worth checking out.

@ldgauthier
Copy link
Contributor

@tfenne there's no commandline option now, but it would be easy to add one. The line that's removing AS_SB_TABLE is here:

variantAnnotations.remove(currentASannotation.getRawKeyName());

You'll probably need to add a new field to VariantAnnotatorEngine to be set via GenotypeGVCFs, but that's easy peasy.

I can toss this in with the F1R2 fix, but it might be a couple of days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants