Sample set and annotation improvements for SVConcordance #8211

mwalker174 · 2023-02-16T19:23:32Z

Relaxes restrictions on allowed samples in SVConcordance: the tool now accepts eval/truth VCFs with arbitrary sample sets and computes genotype concordance metrics on the intersection of the sample sets. All available samples are still used for AF/AC annotations. Integration tests added for cases where the sample sets overlap but are not equal.
Small additional improvements for sites-only VCFs: for example, concordance annotations will now be . instead of NaN. Integration test added for this case.
Improved behavior for eval AF annotations: these will not be recalculated if they already exist.
Improved behavior for truth AF annotations: these will now only be recalculated if they don't exist in the input truth VCF.
Updated tool doc
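
As a rough illustration of the sample-set behavior described above, here is a minimal sketch using only the JDK. The class and method names (SampleIntersectionSketch, sharedSamples, genotypeConcordance) and the use of plain genotype strings are assumptions made for the example; they are not the tool's actual code, which works on SVCallRecord objects.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not the SVConcordance implementation.
class SampleIntersectionSketch {

    // Samples present in both VCFs, in eval order.
    static Set<String> sharedSamples(final List<String> evalSamples, final List<String> truthSamples) {
        final Set<String> shared = new LinkedHashSet<>(evalSamples);
        shared.retainAll(new HashSet<>(truthSamples));
        return shared;
    }

    // Fraction of shared samples whose genotype strings agree.
    // Returns NaN when no sample can be compared (e.g. sites-only input).
    static double genotypeConcordance(final Map<String, String> evalGenotypes,
                                      final Map<String, String> truthGenotypes,
                                      final Set<String> shared) {
        int compared = 0;
        int matches = 0;
        for (final String sample : shared) {
            final String evalGt = evalGenotypes.get(sample);
            final String truthGt = truthGenotypes.get(sample);
            if (evalGt == null || truthGt == null) {
                continue; // genotype missing on one side, skip
            }
            compared++;
            if (evalGt.equals(truthGt)) {
                matches++;
            }
        }
        return compared == 0 ? Double.NaN : (double) matches / compared;
    }

    public static void main(final String[] args) {
        final Set<String> shared = sharedSamples(List.of("A", "B", "C"), List.of("B", "C", "D"));
        final double gc = genotypeConcordance(
                Map.of("A", "0/1", "B", "0/1", "C", "1/1"),
                Map.of("B", "0/1", "C", "0/1", "D", "0/0"),
                shared);
        System.out.println(shared + " -> " + gc); // prints "[B, C] -> 0.5"
    }
}
```

Samples A and D are ignored for genotype concordance because each appears in only one file; per the description, they would still contribute to AF/AC annotations.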

ldgauthier

Mostly questions for my own edification.

The sites-only test seems weak, but honestly I didn't go back to see what the expected behavior is.

ldgauthier · 2023-02-23T17:18:33Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVConcordance.java

+ * on the intersection of sample sets of the two VCFs, but all other annotations including variant truth status
+ * and allele frequency use all records and samples available. See output header for descriptions
+ * of the specific fields. For multi-allelic CNVs, only a copy state concordance metric is
+ * annotated. Allele frequencies will be recalculated automatically if unavailable in the provided VCFs.


Is there a different tool that outputs a table or confusion matrix for concordance? That's what I'm used to seeing for concordance tools.

No, but that's a good idea for the future. Right now we're using this for filtering, so the important thing is the VCF annotations.

ldgauthier · 2023-02-23T17:25:35Z

src/test/java/org/broadinstitute/hellbender/tools/sv/SVTestUtils.java

@@ -388,14 +395,6 @@ public static final Map<String, Object> keyValueArraysToMap(final String[] keys,
        return map;
    }

-    // Note that strands may not be valid
-    public static SVCallRecord newCallRecordWithLengthAndTypeAndChrom2(final Integer length, final GATKSVVCFConstants.StructuralVariantAnnotationType svtype, final String chrom2) {


Is this because we changed the output to better conform to the 4.3 spec a while ago?

I can't remember exactly when/where this was used, but I noticed it's unused now. I may have needed correct strands for new tests in the last PR and just used a more general method.

ldgauthier · 2023-02-23T18:15:49Z

src/main/java/org/broadinstitute/hellbender/tools/sv/concordance/SVConcordanceAnnotator.java

-                    numCnvMatches = null;
-                } else if (numCnvMatches != null && result.booleanValue()) {
-                    numCnvMatches++;
+            if (samples == null || samples.contains(sample)) {


Based on how this is called, does the Guava set intersection return null or an empty set if there are no samples in common? Given that this is a public method, maybe it's worth accounting for both.

Good question, Guava will never return null there. The case samples == null is just used for testing right now.
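
The distinction discussed in this thread can be sketched with plain JDK collections. The class and method names below are hypothetical; only the condition itself (samples == null || samples.contains(sample)) comes from the diff. The sketch assumes, as stated in the reply, that null means "no restriction" while an empty set means "no sample qualifies".

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch of the null-versus-empty semantics discussed above.
class SampleFilterSketch {

    // null      -> no restriction, every sample is included
    // empty set -> nothing in common, every sample is excluded
    static boolean includeSample(final Set<String> samples, final String sample) {
        return samples == null || samples.contains(sample);
    }

    public static void main(final String[] args) {
        System.out.println(includeSample(null, "A"));                   // prints "true"
        System.out.println(includeSample(Collections.emptySet(), "A")); // prints "false"
        System.out.println(includeSample(Set.of("A", "B"), "A"));       // prints "true"
    }
}
```

Because an intersection of disjoint sample sets is empty rather than null, the empty case simply skips every genotype, which is where a NaN concordance value can arise.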

ldgauthier · 2023-02-23T18:23:08Z

src/main/java/org/broadinstitute/hellbender/tools/sv/concordance/SVConcordanceAnnotator.java

-            attributes.put(GATKSVVCFConstants.VAR_PPV_INFO, metrics.VAR_PPV);
-            attributes.put(GATKSVVCFConstants.VAR_SENSITIVITY_INFO, metrics.VAR_SENSITIVITY);
-            attributes.put(GATKSVVCFConstants.VAR_SPECIFICITY_INFO, metrics.VAR_SPECIFICITY);
+            attributes.put(GATKSVVCFConstants.GENOTYPE_CONCORDANCE_INFO, Double.isNaN(metrics.GENOTYPE_CONCORDANCE) ? null : metrics.GENOTYPE_CONCORDANCE);


I have also found that people don't especially like NaNs in their VCFs.

Yeah, I think pysam might crash with those too.
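
A minimal sketch of the NaN handling in this hunk: an undefined metric is stored as null so that it can be written as the VCF missing value "." rather than the literal NaN. The names nanToNull and render are hypothetical, and render is only a stand-in for whatever the VCF writer does with a null attribute.

```java
// Hypothetical sketch: maps undefined metrics to null, as in the diff above.
class NanToMissingSketch {

    // Mirrors the pattern in the diff: NaN becomes null, anything else is kept.
    static Double nanToNull(final double value) {
        return Double.isNaN(value) ? null : Double.valueOf(value);
    }

    // Toy stand-in for VCF text output: a null value is shown as ".".
    static String render(final Object value) {
        return value == null ? "." : String.valueOf(value);
    }

    public static void main(final String[] args) {
        System.out.println(render(nanToNull(Double.NaN))); // prints "."
        System.out.println(render(nanToNull(0.5)));        // prints "0.5"
    }
}
```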

ldgauthier · 2023-02-23T18:30:44Z

src/main/java/org/broadinstitute/hellbender/tools/sv/concordance/SVConcordanceAnnotator.java

+    private boolean hasAlleleFrequencyAnnotations(final SVCallRecord record) {
+        Utils.nonNull(record);
+        final Map<String, Object> truthAttr = record.getAttributes();
+        return (truthAttr.containsKey(GATKSVVCFConstants.TRUTH_ALLELE_COUNT_INFO) && truthAttr.get(GATKSVVCFConstants.TRUTH_ALLELE_COUNT_INFO) != null)


So the truth VCF is going to have a special format? What's the benefit of requiring this and not using any old VCF with AC/AF/AN?

Good catch! Should be looking for regular AC/AF/AN here. I've fixed this and added regression tests.
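
For illustration, a check against the standard AC/AF/AN INFO keys might look like the sketch below. The fixed code is not shown in this thread, so the exact keys consulted and the choice to require all three are assumptions; the class name is hypothetical and a plain Map stands in for the record's attributes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not the code that was merged.
class AlleleFrequencyCheckSketch {

    // Standard VCF INFO keys for allele count, allele frequency and allele number.
    static final String[] STANDARD_KEYS = {"AC", "AF", "AN"};

    // True only if every key is present with a non-null value.
    // Map.get returns null for both an absent key and a null value.
    static boolean hasAlleleFrequencyAnnotations(final Map<String, Object> attributes) {
        for (final String key : STANDARD_KEYS) {
            if (attributes.get(key) == null) {
                return false;
            }
        }
        return true;
    }

    public static void main(final String[] args) {
        final Map<String, Object> attributes = new HashMap<>();
        attributes.put("AC", 3);
        attributes.put("AN", 10);
        System.out.println(hasAlleleFrequencyAnnotations(attributes)); // prints "false"
        attributes.put("AF", 0.3);
        System.out.println(hasAlleleFrequencyAnnotations(attributes)); // prints "true"
    }
}
```

With a check of this shape, any VCF that already carries AC/AF/AN would keep its existing values, and recalculation would happen only when they are absent.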

ldgauthier · 2023-02-28T18:14:41Z

src/test/java/org/broadinstitute/hellbender/tools/walkers/sv/SVConcordanceIntegrationTest.java

@@ -227,4 +227,152 @@ public void testSelf() {
            }
        }
    }
+
+    @Test
+    public void testSelfTruthSubset() {


Very elegant

ldgauthier · 2023-02-28T18:20:13Z

src/test/java/org/broadinstitute/hellbender/tools/walkers/sv/SVConcordanceIntegrationTest.java

+    }
+
+    @Test
+    public void testSelfEvalSubset() {


There's a lot of duplicate code here, so I might be inclined to suggest a data provider, but it's also sort of a weird case where we would have a genericified test that expects perfect concordance.

Okay, I've created an assertPerfectConcordance() helper to do the checking at least.

ldgauthier · 2023-02-28T18:21:17Z

src/test/java/org/broadinstitute/hellbender/tools/walkers/sv/SVConcordanceIntegrationTest.java

+        final Pair<VCFHeader, List<VariantContext>> outputVcf = VariantContextTestUtils.readEntireVCFIntoMemory(output.getAbsolutePath());
+        final List<SVCallRecord> inputEvalVariants = VariantContextTestUtils.readEntireVCFIntoMemory(evalVcfPath).getValue()
+                .stream().map(SVCallRecordUtils::create).collect(Collectors.toList());
+        Assert.assertEquals(outputVcf.getValue().size(), inputEvalVariants.size());


This seems a little lax. Are there new annotations you expect in the output?

Sure, I added a check for the proper number of true positive annotations and that the genotype concordance is .

ldgauthier approved these changes Feb 28, 2023

Use sample set intersection for concordance

fd6c700

mwalker174 force-pushed the mw_sv_concordance_sample_sets branch from 4083cdc to fd6c700 Compare March 20, 2023 17:55

mwalker174 merged commit e68f066 into master Mar 21, 2023

mwalker174 deleted the mw_sv_concordance_sample_sets branch March 21, 2023 15:29
