-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sample set and annotation improvements for SVConcordance #8211
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly questions for my own edification.
The sites-only test seems weak, but honestly I didn't go back to see what the expected behavior is.
* on the intersection of sample sets of the two VCFs, but all other annotations including variant truth status | ||
* and allele frequency use all records and samples available. See output header for descriptions | ||
* of the specific fields. For multi-allelic CNVs, only a copy state concordance metric is | ||
* annotated. Allele frequencies will be recalculated automatically if unavailable in the provided VCFs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a different tool that outputs a table or confusion matrix for concordance? That's what I'm used to seeing for concordance tools.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No but that's a good idea for the future. Right now we're using this for filtering so the important thing is the vcf annotations.
@@ -388,14 +395,6 @@ public static final Map<String, Object> keyValueArraysToMap(final String[] keys, | |||
return map; | |||
} | |||
|
|||
// Note that strands may not be valid | |||
public static SVCallRecord newCallRecordWithLengthAndTypeAndChrom2(final Integer length, final GATKSVVCFConstants.StructuralVariantAnnotationType svtype, final String chrom2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this because we changed the output to better conform to the 4.3 spec a while ago?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't remember exactly when/where this was used, but I noticed it's unused now. I may have needed correct strands for new tests in the last PR and just used a more general method.
numCnvMatches = null; | ||
} else if (numCnvMatches != null && result.booleanValue()) { | ||
numCnvMatches++; | ||
if (samples == null || samples.contains(sample)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on how this is called, does the Guava set intersection return null or empty set if there's no samples in common? Given that this is a public method, maybe it's worth accounting for both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, Guava will never return null there. The case samples == null
is just used for testing right now.
attributes.put(GATKSVVCFConstants.VAR_PPV_INFO, metrics.VAR_PPV); | ||
attributes.put(GATKSVVCFConstants.VAR_SENSITIVITY_INFO, metrics.VAR_SENSITIVITY); | ||
attributes.put(GATKSVVCFConstants.VAR_SPECIFICITY_INFO, metrics.VAR_SPECIFICITY); | ||
attributes.put(GATKSVVCFConstants.GENOTYPE_CONCORDANCE_INFO, Double.isNaN(metrics.GENOTYPE_CONCORDANCE) ? null : metrics.GENOTYPE_CONCORDANCE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have also found that people doesn't especially like NaNs in their VCFs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think pysam might crash with those too.
private boolean hasAlleleFrequencyAnnotations(final SVCallRecord record) { | ||
Utils.nonNull(record); | ||
final Map<String, Object> truthAttr = record.getAttributes(); | ||
return (truthAttr.containsKey(GATKSVVCFConstants.TRUTH_ALLELE_COUNT_INFO) && truthAttr.get(GATKSVVCFConstants.TRUTH_ALLELE_COUNT_INFO) != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the truth VCF is going to have a special format? What's the benefit of requiring this and not using any old VCF with AC/AF/AN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Should be looking for regular AC/AF/AN here. I've fixed this and added regression tests.
@@ -227,4 +227,152 @@ public void testSelf() { | |||
} | |||
} | |||
} | |||
|
|||
@Test | |||
public void testSelfTruthSubset() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very elegant
} | ||
|
||
@Test | ||
public void testSelfEvalSubset() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of duplicate code here, so I might be inclined to suggest a data provider, but it's also sort of a weird case where we would have a genericified test that expects perfect concordance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I've created a assertPerfectConcordance()
to do the checking at least.
final Pair<VCFHeader, List<VariantContext>> outputVcf = VariantContextTestUtils.readEntireVCFIntoMemory(output.getAbsolutePath()); | ||
final List<SVCallRecord> inputEvalVariants = VariantContextTestUtils.readEntireVCFIntoMemory(evalVcfPath).getValue() | ||
.stream().map(SVCallRecordUtils::create).collect(Collectors.toList()); | ||
Assert.assertEquals(outputVcf.getValue().size(), inputEvalVariants.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a little lax. Are there new annotations you expect in the output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I added a check for the proper number of true positive annotations and that the genotype concordance is .
4083cdc
to
fd6c700
Compare
.
instead ofNaN
for example. Integration test added for this case.