-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ReferenceConfidenceVariantContextMerger fixes for spanning deletions, attribute merging. #4680
Conversation
return Integer.parseInt(stringValue); | ||
// Use the VCF header's declared type for the given attribute to ensure that all the values for that attribute | ||
// across all the VCs being merged have the same boxed representation. Some VCs have a serialized value of "0" | ||
// for FLOAT attributes, with no embedded decimal point, but we still need to box those into Doubles, or the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of all the terrible things that happen to annotations in htsjdk, at least it guarantees some decimal precision (VCFEncoder.formatVCFDouble()). I'm guessing this was an issue related to GenomicsDB output (which uses htslib)? I mentioned this to Karthik a while ago (#4047 (comment)) It should have been addressed by #4261, but I haven't tested that explicitly. Not that I object to the belt and suspenders approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ldgauthier Good to know - thanks. At least two users have reported this, one of them seemed to implicate bcftools, but that may be a red herring. I do think htsjdk can produce such output though, since it appears to rely on the in-memory representation of the value provided by the caller, rather than the header, to determine the format. I’d love to tighten that up in the next round of htsjdk updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests look good and I have really only one sticking issue that probably should be fixed.
@@ -174,6 +180,9 @@ public VariantContext merge(final List<VariantContext> vcs, final Locatable loc, | |||
for (final Allele a : vc.getAlternateAlleles()) { | |||
if (a.isSymbolic()) { | |||
result.add(a); | |||
} else if ( a == Allele.SPAN_DEL ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really gross but it looks like the best we can do. It really seems like a bug to me that an Allele.SPAN_DEL is not symbolic but i guess thats an HTSJDK issue isn't it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it really does seem like it should be symbolic. And there is a proposal/discussion about that here. Not sure how disruptive that change would be, but for now, this matches GATK3.
private Comparable<?> parseNumericInfoAttributeValue(final VCFHeader vcfHeader, final String key, final String stringValue) { | ||
final VCFInfoHeaderLine infoLine = vcfHeader.getInfoHeaderLine(key); | ||
if (infoLine == null) { | ||
warning.warn(String.format("No header line for INFO attribute %s was found in the input header", key)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning seems problematic. If it gets triggered it will probably get spit out for EVERY read in a VCF which would result in a torrential span. Perhaps us a OneShotLogger here to save peoples poor poor bash histories...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code leading up to this I don't expect this warning will be a problem for any non-reducible annotations except i'm not sure i completely trust our VCFStandardHeaderLines to have the correct types. We have a number of issues with the header lines either here or in HTSJDK being wrong. This is still better I would say.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent point - but warning
is a OneShotLogger
. Also, I did keep the fallback code for the old behavior if the header line isn't found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, good point about warning
. The next point is that currently its just a dumb implementation that only emits once. If there are multiple error states in a file its probably best to open multiple loggers so if both are triggered they will each be written out, warning is currently used for a different case to do with merging and numerical parsing. maybe split them into 2 loggers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Split out into a second logger, with a slightly more generic message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I rebased this on top of my master to pick up my last CombineGVCFs branch/fix.
f8602c4
to
3db1e4c
Compare
Codecov Report
@@ Coverage Diff @@
## master #4680 +/- ##
===============================================
+ Coverage 79.842% 79.847% +0.004%
- Complexity 17332 17334 +2
===============================================
Files 1074 1074
Lines 62924 62938 +14
Branches 10182 10186 +4
===============================================
+ Hits 50240 50254 +14
Misses 8704 8704
Partials 3980 3980
|
@jamesemery I think this is ready - not sure if you want to take one more quick look at it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One typo you should probably fix and a variable that should have a better name but after that you can merge 👍
private Comparable<?> parseNumericInfoAttributeValue(final VCFHeader vcfHeader, final String key, final String stringValue) { | ||
final VCFInfoHeaderLine infoLine = vcfHeader.getInfoHeaderLine(key); | ||
if (infoLine == null) { | ||
oneShotHeaderLineLogger.warn(String.format("At least one attribute was found (%s) for which there no corresponding header line", key)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there no -> there is no
protected final VariantAnnotatorEngine annotatorEngine; | ||
protected final OneShotLogger warning = new OneShotLogger(this.getClass()); | ||
protected final OneShotLogger oneShotHeaderLineLogger = new OneShotLogger(this.getClass()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you're at it, could you rename warning to something like nonNumericalAttributeWarning
or something to that end so its clearer in the future.
… use correct types for median calculation.
3db1e4c
to
7ba0a9f
Compare
… use correct types for median calculation. (broadinstitute#4680)
Fixes for #4525 and #4633. Longer term we should do a more significant rewrite/refactoring.