Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sample set and annotation improvements for SVConcordance #8211

Merged
merged 1 commit into from
Mar 21, 2023

Conversation

mwalker174
Copy link
Contributor

  • Relaxes restrictions for allowed samples in SVConcordance: the tool can now accept eval/truth VCFs with arbitrary sample sets and will have genotype concordance metrics computed on the intersection of the sample sets. All available samples are still used for AF/AC annotations. Integration tests added for cases when the samples sets are overlapping but not equal.
  • Small additional improvements for sites-only VCFs: concordance annotations will now be . instead of NaN for example. Integration test added for this case.
  • Improved behavior for eval AF annotations: these will not be recalculated if they already exist.
  • Improved behavior for truth AF annotations: these will now only be recalculated if they don't exist in the input truth VCF.
  • Updated tool doc

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly questions for my own edification.

The sites-only test seems weak, but honestly I didn't go back to see what the expected behavior is.

* on the intersection of sample sets of the two VCFs, but all other annotations including variant truth status
* and allele frequency use all records and samples available. See output header for descriptions
* of the specific fields. For multi-allelic CNVs, only a copy state concordance metric is
* annotated. Allele frequencies will be recalculated automatically if unavailable in the provided VCFs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a different tool that outputs a table or confusion matrix for concordance? That's what I'm used to seeing for concordance tools.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No but that's a good idea for the future. Right now we're using this for filtering so the important thing is the vcf annotations.

@@ -388,14 +395,6 @@ public static final Map<String, Object> keyValueArraysToMap(final String[] keys,
return map;
}

// Note that strands may not be valid
public static SVCallRecord newCallRecordWithLengthAndTypeAndChrom2(final Integer length, final GATKSVVCFConstants.StructuralVariantAnnotationType svtype, final String chrom2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because we changed the output to better conform to the 4.3 spec a while ago?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember exactly when/where this was used, but I noticed it's unused now. I may have needed correct strands for new tests in the last PR and just used a more general method.

numCnvMatches = null;
} else if (numCnvMatches != null && result.booleanValue()) {
numCnvMatches++;
if (samples == null || samples.contains(sample)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on how this is called, does the Guava set intersection return null or empty set if there's no samples in common? Given that this is a public method, maybe it's worth accounting for both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, Guava will never return null there. The case samples == null is just used for testing right now.

attributes.put(GATKSVVCFConstants.VAR_PPV_INFO, metrics.VAR_PPV);
attributes.put(GATKSVVCFConstants.VAR_SENSITIVITY_INFO, metrics.VAR_SENSITIVITY);
attributes.put(GATKSVVCFConstants.VAR_SPECIFICITY_INFO, metrics.VAR_SPECIFICITY);
attributes.put(GATKSVVCFConstants.GENOTYPE_CONCORDANCE_INFO, Double.isNaN(metrics.GENOTYPE_CONCORDANCE) ? null : metrics.GENOTYPE_CONCORDANCE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have also found that people doesn't especially like NaNs in their VCFs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think pysam might crash with those too.

private boolean hasAlleleFrequencyAnnotations(final SVCallRecord record) {
Utils.nonNull(record);
final Map<String, Object> truthAttr = record.getAttributes();
return (truthAttr.containsKey(GATKSVVCFConstants.TRUTH_ALLELE_COUNT_INFO) && truthAttr.get(GATKSVVCFConstants.TRUTH_ALLELE_COUNT_INFO) != null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the truth VCF is going to have a special format? What's the benefit of requiring this and not using any old VCF with AC/AF/AN?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Should be looking for regular AC/AF/AN here. I've fixed this and added regression tests.

@@ -227,4 +227,152 @@ public void testSelf() {
}
}
}

@Test
public void testSelfTruthSubset() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very elegant

}

@Test
public void testSelfEvalSubset() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of duplicate code here, so I might be inclined to suggest a data provider, but it's also sort of a weird case where we would have a genericified test that expects perfect concordance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I've created a assertPerfectConcordance() to do the checking at least.

final Pair<VCFHeader, List<VariantContext>> outputVcf = VariantContextTestUtils.readEntireVCFIntoMemory(output.getAbsolutePath());
final List<SVCallRecord> inputEvalVariants = VariantContextTestUtils.readEntireVCFIntoMemory(evalVcfPath).getValue()
.stream().map(SVCallRecordUtils::create).collect(Collectors.toList());
Assert.assertEquals(outputVcf.getValue().size(), inputEvalVariants.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little lax. Are there new annotations you expect in the output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I added a check for the proper number of true positive annotations and that the genotype concordance is .

@mwalker174 mwalker174 force-pushed the mw_sv_concordance_sample_sets branch from 4083cdc to fd6c700 Compare March 20, 2023 17:55
@mwalker174 mwalker174 merged commit e68f066 into master Mar 21, 2023
@mwalker174 mwalker174 deleted the mw_sv_concordance_sample_sets branch March 21, 2023 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants