Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added GVCF mode for VariantContext type determination #1544

Merged
merged 3 commits into from
Sep 22, 2021

Conversation

michaelgatzen
Copy link
Contributor

@michaelgatzen michaelgatzen commented Apr 9, 2021

Related to broadinstitute/gatk#7111

  • Usually, NON_REF alleles will be considered SYMBOLIC. Therefore, if a VariantContext contains the alleles A*,C,<NON_REF>, the resulting type would be MIXED. For GVCF files, however, it would be helpful that this would be considered a SNP. When true is passed as the optional argument ignoreNonRef to VariantContext.getType() NON_REF alleles will be ignored for type determination. If only NON_REF alleles are seen at a given site, the type will remain to be SYMBOLIC.
  • Default behavior will not change. This type determination is only performed if true is passed for the optional ignoreNonRef argument to VariantContext.getType()
  • Moderate refactoring of VariantContext type determination
    • This was necessary because the type caching needs to distinguish between ignoreNonRef being true or false
    • Changed return type of determineType and determinePolymorphicType from void to VariantContext.Type, otherwise multiple code branches would be necessary depending on which caching variable to set
    • Added unit test to catch if the cache separation works
  • Added unit tests

- Usually, NON_REF alleles will be considered SYMBOLIC. Therefore, if a VariantContext contains the alleles `A*,C,<NON_REF>`, the resulting type would be MIXED. For GVCF files, however, it would be helpful that this would be considered a SNP.
- Default behavior will not change, only if true is passed for the optional ignoreNonRef argument to getType()
- Added unit tests
@michaelgatzen
Copy link
Contributor Author

@ldgauthier regarding broadinstitute/gatk#7111

I implemented the option to ignore NON_REF for type determination here, by passing an argument boolean ignoreNonRef to VariantContext.getType(). You mentioned that you would prefer the *-allele to be treated as symbolic for GVCF support. Since we're trying to change this class anyway, I could also rename this argument to e.g. boolean gvcfMode and in addition to ignoring NON_REF, also treat the *-allele as symbolic. What do you (and/or others) think?

@ldgauthier
Copy link
Contributor

The * allele isn't specific to GVCFs, so I have to think a little more about the name.

My first choice for implementation would be to add Allele::wouldBeStarAllele as another OR on Allele::wouldBeSymbolicAllele, but that would probably be a BREAKING CHANGE!!! We could try a GATK branch with the change and see what happens to the integration tests.

@michaelgatzen
Copy link
Contributor Author

Right, thanks for your input. If I understand your suggestion correctly, this would be independent of this PR though, as with the change that you proposed we'd still have to treat NON_REF different to other symbolic alleles for correct GVCF variant type determination. So I would suggest merging this PR either way (this would be a non-breaking change) and dealing with the * allele separately, unless anyone disagrees.

- This was necessary because the type caching needs to distinguish between ignoreNonRef being true or false
- Changed return type of `determineType` and `determinePolymorphicType` from `void` to `VariantContext.Type`, otherwise multiple code branches would be necessary depending on which caching variable to set
- Added unit test to catch if the cache separation works
@ldgauthier
Copy link
Contributor

So is your proposal to keep the name gvcfmode and just ignore non-ref (not *)?

@michaelgatzen
Copy link
Contributor Author

michaelgatzen commented Apr 12, 2021

My proposal would to keep ignoreNonRef as an argument to VariantContext::getType() instead of gvcfMode, as this makes it clearer what's going on, which I think is good for such a "low-level" function, and yes, only ignore NON_REF and do nothing about *.

@michaelgatzen
Copy link
Contributor Author

The argument to GATK SelectVariants would be called --ignore-non-ref-in-types per broadinstitute/gatk#7193, and there would be a warning if a GVCF file is read and this argument is not set. But I'm of course happy to change things.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaelgatzen This seems sane, I'm not sure if the boolean input value overload is better/worse than a new named method.

I have a question about the return value for only non-refs as well.

- Fixed bug when the type determination would output SYMBOLIC when it should output NO_VARIATION
- Moved new tests ignoring the nonRef allele to a different method
- Included tests that check if expectedType(alleles) == expectedTypeIgnoringNonRef(alleles + nonRef)
@codecov-commenter
Copy link

codecov-commenter commented Sep 20, 2021

Codecov Report

Merging #1544 (d2778b8) into master (d4464dd) will increase coverage by 0.425%.
The diff coverage is 83.333%.

@@               Coverage Diff               @@
##              master     #1544       +/-   ##
===============================================
+ Coverage     69.417%   69.842%   +0.425%     
- Complexity      8939      9638      +699     
===============================================
  Files            604       702       +98     
  Lines          35631     37619     +1988     
  Branches        5921      6111      +190     
===============================================
+ Hits           24734     26274     +1540     
- Misses          8549      8897      +348     
- Partials        2348      2448      +100     
Impacted Files Coverage Δ
.../htsjdk/variant/variantcontext/VariantContext.java 78.456% <83.333%> (+0.083%) ⬆️
...dk/samtools/seekablestream/SeekableHTTPStream.java 57.353% <0.000%> (-1.471%) ⬇️
src/main/java/htsjdk/io/AsyncWriterPool.java 72.222% <0.000%> (-1.389%) ⬇️
src/main/java/htsjdk/samtools/SamFiles.java 92.683% <0.000%> (ø)
src/main/java/htsjdk/samtools/SAMFileHeader.java 67.251% <0.000%> (ø)
...a/htsjdk/samtools/reference/ReferenceSequence.java 83.333% <0.000%> (ø)
...beta/codecs/variants/vcf/vcfv4_3/VCFCodecV4_3.java 77.778% <0.000%> (ø)
.../htsjdk/beta/io/bundle/SeekableStreamResource.java 72.414% <0.000%> (ø)
...tsjdk/beta/codecs/reads/htsget/HtsgetBAMCodec.java 66.667% <0.000%> (ø)
.../htsjdk/beta/plugin/registry/HtsCodecRegistry.java 85.714% <0.000%> (ø)
... and 117 more

@@ -108,109 +108,190 @@ public void testDetermineTypes() {

// test REF
List<Allele> alleles = Arrays.asList(Tref);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I figured this could be done programmatically with a dataprovider instead of needing quite so much boilerplate. Sorry for so much typing!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I wasn't sure if we use data providers in htsjdk, so I just went ahead and did this, but it wasn't much effort anyway. If you want me to change it I can

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, we do. This test didn't for some reason, so I can see why it's confusing.

@michaelgatzen
Copy link
Contributor Author

Let me know if there are any more comments, otherwise from my side everything would be good to be merged

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@lbergelson lbergelson merged commit aac46ee into samtools:master Sep 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants