GenotypeGVCFs --keep-combined-raw-annotations doesn't subset to output alleles #6029

Closed
1 of 2 tasks
tfenne opened this issue Jul 3, 2019 · 4 comments
@tfenne
Contributor

tfenne commented Jul 3, 2019

Bug Report

Affected tool(s) or class(es)

GenotypeGVCFs with --keep-combined-raw-annotations

Affected version(s)

  • Latest public release version [version?]
  • Latest master branch as of (not tested)

Description

@ldgauthier was kind enough to introduce the --keep-combined-raw-annotations option for us after the discussion in issue #5698, and we've been using it extensively. We recently noticed a problem that affects a small fraction of variants though.

We're noticing this with AS_SB_TABLE, but it probably applies to all annotations that are per-allele or per-alt-allele. The problem is that when GenotypeGVCFs runs, it may choose to output only a subset of the alleles present in the gVCF. When it does this, it does not appear to update the annotations to remove the values for the dropped alleles. The result is an annotation with more values than there are alleles, and no safe or predictable way to interpret it: looking only at the resulting VCF, you don't know the original ordering of the alleles or which ones were removed. In my case this happens primarily at homopolymer sites and occasionally at STRs with larger repeat units.

I've attached a zip file - AS_SB_TABLE_bug.zip - which contains a one-record gVCF, the command to generate the VCF and the resulting VCF, which should be sufficient to demonstrate the problem and reproduce it.

Here's what an offending variant looks like:

chr1    100366446       .       GTT     G       562.64  .       AC=1;AF=0.500;AN=2;AS_SB_TABLE=19,6|16,6|4,0|2,2|1,1;...;REF_BASES=ATGTTTTTTTGTTTTTTTTTT;RPA=13,11;RU=T;ReadPosRankSum=-1.296e+00;SOR=0.534;STR    GT:AD:DP:F1R2:F2R1:GQ:PL        0/1:25,22:57:19,16:4,4:99:570,0,819
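
For anyone who wants to check their own output for this, here is a rough sketch of how the mismatch could be detected. It assumes AS_SB_TABLE holds one `|`-separated entry per allele with REF first, which is consistent with the record above (19+6 and 16+6 match the AD values of 25 and 22). The function names are mine, not part of GATK, and the record is the one above trimmed to the fields the check needs.

```python
def allele_count(vcf_line: str) -> int:
    """Number of alleles in the record: REF plus each comma-separated ALT."""
    fields = vcf_line.rstrip("\n").split("\t")
    alt = fields[4]
    n_alt = 0 if alt == "." else len(alt.split(","))
    return 1 + n_alt


def info_value(vcf_line: str, key: str):
    """Return the raw string value of an INFO key, or None if it is absent."""
    fields = vcf_line.rstrip("\n").split("\t")
    for entry in fields[7].split(";"):
        name, sep, value = entry.partition("=")
        if name == key and sep:
            return value
    return None


def as_sb_table_mismatch(vcf_line: str) -> bool:
    """True if AS_SB_TABLE has a different number of entries than alleles.

    Assumes one '|'-separated entry per allele, REF included.
    """
    raw = info_value(vcf_line, "AS_SB_TABLE")
    if raw is None:
        return False
    return len(raw.split("|")) != allele_count(vcf_line)


# The offending record, trimmed to the first eight columns and a few INFO keys.
record = "\t".join([
    "chr1", "100366446", ".", "GTT", "G", "562.64", ".",
    "AC=1;AF=0.500;AN=2;AS_SB_TABLE=19,6|16,6|4,0|2,2|1,1",
])

print(allele_count(record))          # 2
print(as_sb_table_mismatch(record))  # True
```

Here the record has two alleles (GTT and G) but five AS_SB_TABLE entries, so the check flags it.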

Steps to reproduce

See attached zip file.

Expected behavior

All per-allele and per-alt-allele annotations should be subset to only the values for the alleles that are output in the resulting VCF.
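
To make the expected behavior concrete, here is a minimal sketch of the subsetting, not GATK's actual implementation. The helper name `subset_per_allele` is hypothetical, and so is the original allele list: the gVCF alleles are in the attached zip, not in this issue text, so the list below is made up purely for illustration. It also assumes kept alleles can be matched to the original ones by their string representation.

```python
def subset_per_allele(values, original_alleles, kept_alleles):
    """Keep only the per-allele values that belong to the output alleles.

    values:           one entry per original allele, REF first
    original_alleles: alleles as ordered in the input gVCF record
    kept_alleles:     alleles written to the output VCF, in output order
    """
    if len(values) != len(original_alleles):
        raise ValueError("expected exactly one value per original allele")
    index = {allele: i for i, allele in enumerate(original_alleles)}
    # A KeyError here means an output allele was not in the input record.
    return [values[index[allele]] for allele in kept_alleles]


# HYPOTHETICAL input alleles; only GTT and G are known from the output record.
original = ["GTT", "G", "GT", "GTTT", "<NON_REF>"]
values = "19,6|16,6|4,0|2,2|1,1".split("|")
kept = ["GTT", "G"]

print("|".join(subset_per_allele(values, original, kept)))  # 19,6|16,6
```

With that subsetting applied, the record above would carry AS_SB_TABLE=19,6|16,6, one entry each for GTT and G.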

Actual behavior

The values for all input alleles are emitted, including those for alleles that are not in the output.

@ldgauthier
Contributor

I've been working on similar issues in other tools. Should be an easy enough fix.

@ldgauthier
Contributor

@tfenne I have a PR that should clean this up. If there's anything else you'd like to see specifically in the tests, let me know.

@droazen
Collaborator

droazen commented Nov 21, 2019

@ldgauthier Was your PR for this merged? Can we close this one?

@ldgauthier
Contributor

Yes, #6079 took care of this.
