(SV) output merged single VCF for new interpretation tool #4996
Conversation
Force-pushed ece04d2 to 0158e61
Codecov Report
@@ Coverage Diff @@
## master #4996 +/- ##
===============================================
- Coverage 86.385% 86.376% -0.009%
- Complexity 28822 28835 +13
===============================================
Files 1791 1791
Lines 133561 133762 +201
Branches 14902 14919 +17
===============================================
+ Hits 115377 115538 +161
- Misses 12791 12818 +27
- Partials 5393 5406 +13
Force-pushed 601eff3 to ccd5f87
This looks mostly fine; I have some minor suggestions about how to restructure a little of the code.
@@ -269,11 +269,12 @@ public void validate() {

     public enum SvEvidenceFilterType {DENSITY, XGBOOST}

-    public static class DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection implements Serializable {
+    public static class VariantsDiscoveryFromContigsAlignmentsSparkArgumentCollection implements Serializable {
Why change this name? I thought it read better the old way. Actually, both versions have inconsistent pluralization; it should be either DiscoverVariantsFromContigAlignmentsSparkArgumentCollection or VariantDiscoveryFromContigAlignmentsSparkArgumentCollection.
Picked DiscoverVariantsFromContigAlignmentsSparkArgumentCollection
as suggested.
@@ -269,11 +269,12 @@ public void validate() {

     public enum SvEvidenceFilterType {DENSITY, XGBOOST}

-    public static class DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection implements Serializable {
+    public static class VariantsDiscoveryFromContigsAlignmentsSparkArgumentCollection implements Serializable {
         private static final long serialVersionUID = 1L;

         public static final int GAPPED_ALIGNMENT_BREAK_DEFAULT_SENSITIVITY = STRUCTURAL_VARIANT_SIZE_LOWER_BOUND; // alignment with gap of size >= 50 will be broken apart.
         public static final int CHIMERIC_ALIGNMENTS_HIGHMQ_THRESHOLD = 60;
I'd add a TODO comment saying that this parameter will be removed when we remove the old tool.
done
@@ -43,9 +43,11 @@
 import org.broadinstitute.hellbender.utils.read.GATKRead;
 import org.broadinstitute.hellbender.utils.read.SAMRecordToGATKReadAdapter;
 import scala.Serializable;
+import scala.Tuple3;
I think this import isn't used.
done
import org.broadinstitute.hellbender.tools.spark.sv.utils.SVIntervalTree;
import org.broadinstitute.hellbender.tools.spark.sv.utils.SVUtils;
import org.broadinstitute.hellbender.tools.spark.sv.utils.SVVCFWriter;
import org.broadinstitute.hellbender.utils.Utils;
import org.broadinstitute.hellbender.utils.io.IOUtils;
import org.broadinstitute.hellbender.utils.read.GATKRead;
import scala.Tuple2;
import scala.Tuple3;
unused import
done
// TODO: 1/10/18 bring back read annotation, see ticket 4228
final List<VariantContext> simpleChimeraVariants =
        extractSimpleVariants(contigsByPossibleRawTypes.simple, svDiscoveryInputMetaData, outputPrefixWithSampleName);
contigsByPossibleRawTypes.simple.unpersist(false);
I think you just moved this code around for this PR, but looking at it, I'm not sure I like this method having the side effect of calling unpersist on the internal RDDs of contigsByPossibleRawTypes. To avoid side effects, I'd prefer to pull the unpersist calls out to the calling method. Actually, since you've got them bundled in one object, why not make an unpersist method on AssemblyContigsClassifiedByAlignmentSignatures that unpersists all of the underlying RDDs? Then it's just a one-line invocation from the callers of this method.
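A minimal sketch of what that unpersist method could look like. JavaRDD is replaced here by a tiny stub so the example is self-contained and runnable; the real fields on AssemblyContigsClassifiedByAlignmentSignatures are Spark RDDs, and the field names below (simple, complex, unknown) are illustrative assumptions, not the actual GATK fields.

```java
import java.util.Arrays;

public class UnpersistSketch {
    // Minimal stand-in for org.apache.spark.api.java.JavaRDD: only tracks
    // whether the data is still persisted.
    static class StubRDD {
        boolean persisted = true;
        StubRDD unpersist(boolean blocking) { persisted = false; return this; }
    }

    // Hypothetical shape of AssemblyContigsClassifiedByAlignmentSignatures;
    // field names are illustrative.
    static class AssemblyContigsClassifiedByAlignmentSignatures {
        final StubRDD simple = new StubRDD();
        final StubRDD complex = new StubRDD();
        final StubRDD unknown = new StubRDD();

        // The suggested one-shot cleanup: unpersist every underlying RDD so
        // callers make a single, explicit invocation instead of relying on
        // side effects buried in the variant-extraction methods.
        void unpersist() {
            for (StubRDD rdd : Arrays.asList(simple, complex, unknown)) {
                rdd.unpersist(false); // non-blocking, as in the original calls
            }
        }
    }

    public static void main(String[] args) {
        AssemblyContigsClassifiedByAlignmentSignatures contigs =
                new AssemblyContigsClassifiedByAlignmentSignatures();
        contigs.unpersist(); // one line at the call site
        System.out.println(contigs.simple.persisted && contigs.complex.persisted); // prints false
    }
}
```

The point of the design is that the caller, not the extraction method, owns the lifecycle of the cached RDDs.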
Thanks! That's much better.
    }
}

public static final class SVAlnFilter implements StructuralVariantFilter {
Write out SVAlignmentLengthFilter
done
@@ -154,4 +153,27 @@ public static void writeSAMRecords(final List<GATKRead> reads, final Set<String>
         samRecords.sort(localComparator);
         SVFileUtils.writeSAMFile( outputPath, samRecords.iterator(), cloneHeader, true);
     }

     /**
      * this exist because for whatever reason,
Put in a TODO comment reminding us to get to the bottom of this behavior. If it's something wrong with our data, we should fix it, otherwise we should file a bug against htsjdk.
Added TODO.
I've talked with Louis, and he has been aware of the issue for a few months (it happens to other VCFs too, not only ours). Hopefully it will be resolved.
@@ -113,6 +88,12 @@ private ExtractedSimpleVariants(final List<VariantContext> reInterpretZeroOrOneS
     public List<VariantContext> getReInterpretMultiSegmentsCalls() {
         return reInterpretMultiSegmentsCalls;
     }

     public List<VariantContext> getMergedReInterpretCalls() {
getMergedReinterpretedCalls
done
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////

public static final Stream<String> expectedAltAlleleHeaderKeysInVCF
I think that these should be in test code somewhere, not in main code.
Agree.
I put these here because otherwise it's difficult to remind people to add new annotations to our VCF header; and if we don't, the VCF file is, technically speaking, corrupt.
I tried to use some Java language features to declare the fields here and test that the keys are all present in the test VCF files we have, but that turned out to be ugly.
Do you have any suggestions?
What if we added some assertions to our integration tests that check whether everything is in the header? We could collect all of the alt allele, INFO, and FORMAT annotation keys we see in the variants in StructuralVariationDiscoveryPipelineSparkIntegrationTest.svDiscoveryVCFEquivalenceTest and assert that they have corresponding header line entries. Would that address your concern? You could split that work off of this PR, too, and make a ticket for it.
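A hedged sketch of that integration-test idea. The htsjdk/GATK types are replaced by plain collections so the example is self-contained: in the real test the per-variant keys would come from the VariantContexts read back in svDiscoveryVCFEquivalenceTest, and the declared keys from the VCFHeader. All names below are illustrative stand-ins.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class HeaderKeyCheckSketch {
    // Returns the annotation keys used by the variants that have no
    // corresponding header definition; an empty result means the header
    // is complete for these variants.
    static Set<String> missingHeaderKeys(List<Set<String>> perVariantKeys,
                                         Set<String> headerDefinedKeys) {
        Set<String> missing = new TreeSet<>();
        for (Set<String> keys : perVariantKeys) {
            for (String key : keys) {
                if (!headerDefinedKeys.contains(key)) {
                    missing.add(key);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Header declares SVTYPE and SVLEN only.
        Set<String> header = new HashSet<>(Arrays.asList("SVTYPE", "SVLEN"));
        // Two variants; the second uses an undeclared annotation key.
        List<Set<String>> variants = Arrays.asList(
                new HashSet<>(Arrays.asList("SVTYPE", "SVLEN")),
                new HashSet<>(Arrays.asList("SVTYPE", "MAPPING_QUALITIES")));
        // In the real integration test one would assert this set is empty.
        System.out.println(missingHeaderKeys(variants, header)); // prints [MAPPING_QUALITIES]
    }
}
```

The same walk would cover ALT allele IDs as well as INFO/FORMAT keys.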
}

 // for variants-detected from assembly
-// todo: create an alternate assembly file and link to it with breakpoint IDs according to the VCF spec
-{
+{// todo: create an alternate assembly file and link to it with breakpoint IDs according to the VCF spec
Do we already do this? Do you know what needs to happen to make this spec-compliant?
Yes, we already do this in the sense that the assembly contig names are recorded.
To be spec-compliant (specifically spec version 4.2, section 5.4.2; note that what we are doing here does NOT violate the spec, we simply offer the convenience of getting the exact inserted sequence without having to parse another file, at the cost of a larger file):
- We currently always pull the inserted sequence (or whatever sequence we need) from those contigs and place it directly in the VCF records. As you can imagine, these sequences can sometimes be quite long.
- BND variants are going to be reported differently, as the inserted sequence between the novel adjacencies is currently reported explicitly in the VCF records. Again, convenience vs. file size.
- The assembly file needs to be revised when the assembler gives the "reverse strand" representation of the event. This could be a bit more complicated since the assembly file, assuming it is going to be a GFA rather than a SAM file, has connection information, so revising it means the connections need to be revised as well (which can be a little tricky).
OK! Just checking to see if you wanted to keep this todo comment around or if it could be resolved.
@cwhelan Thanks for the review! I've incorporated most of the suggested changes, except the comment on expectedAltAlleleHeaderKeysInVCF in GATKSVVCFConstants.java, for which I don't yet know the best option.
        new SVAlnFilter(svDiscoveryInputMetaData.getDiscoverStageArgs().minAlignLength));
for (final VariantContext variant : variants) {
    String svType = variant.getAttributeAsString(GATKSVVCFConstants.SVTYPE, "");
    if (svType.equals(GATKSVVCFConstants.SYMB_ALT_ALLELE_DEL) || svType.equals(GATKSVVCFConstants.SYMB_ALT_ALLELE_INS) || svType.equals(GATKSVVCFConstants.SYMB_ALT_ALLELE_DUP)) {
Sorry, this is some miscommunication: two types of variants are being treated specially here, BND and INV.
For BND, the record simply doesn't have an SVLEN now in our implementation, and IMO it should not.
INV records, in this new interpretation tool, have exactly length 0, following the technical definition of SVLEN, which reads (at the beginning of Section 3, VCF spec version 4.2):
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
In the stable versioned interpretation tool, SVLEN for inversions is essentially "the number of reference bases inverted + the length of inserted sequence at the breakpoints", so this is a breaking change.
BTW, I think the Number field of SVLEN is wrong; it should be Number=A, but the spec being the spec, I cannot change it yet.
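To make the distinction concrete, here is a minimal illustration of the quoted definition ("Difference in length between REF and ALT alleles"): under it a pure inversion has SVLEN 0, while the older tool's inversion SVLEN counted the inverted reference bases. The method name is illustrative, not GATK code.

```java
public class SvLenSketch {
    // SVLEN taken literally from the VCF 4.2 definition: the difference in
    // length between the ALT and REF alleles.
    static int svLen(String ref, String alt) {
        return alt.length() - ref.length();
    }

    public static void main(String[] args) {
        System.out.println(svLen("ACGT", "A"));    // deletion of 3 bases: prints -3
        System.out.println(svLen("A", "ATTT"));    // insertion of 3 bases: prints 3
        System.out.println(svLen("ACGT", "TGCA")); // pure inversion: prints 0
    }
}
```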
final List<VariantContext> inversions = extractInversions();// TODO: 6/29/18 placeholder

// merged output
final List<VariantContext> merged = new ArrayList<>(20_000); // estimated size
done
    variantsWithFilterApplied.add(postHocFilterVariant(variant, filters));
}

final String out = outputPrefixWithSampleName + "merged_simple.vcf";
done
👍 This looks good now except for my two minor comments.
Force-pushed 45f8106 to 8cdcd26
* refactor some utilities and GATKSVVCFConstants and GATKSVVCFHeaderLines
* group methods in StructuralVariationDiscoveryPipelineSpark by functionality
* bring MAX_ALIGN_LENGTH and MAPPING_QUALITIES annotations from CPX variants to re-interpreted simple variants
* add new CLI argument and filter assembly-based variants based on annotations MAPPING_QUALITIES and MAX_ALIGN_LENGTH
* filter out variants of size < 50
Force-pushed 8cdcd26 to ff37d02
One more step towards using this new tool.
Does:
* <INS>, <DEL>, <DUP>, <INV> calls (there will be more <INV> calls, but that cannot happen until someone takes a look at PR Inversion breakpoint linking (no merge yet) #4789 and checks if the proposed algorithm makes sense)
* MAPPING_QUALITIES and MAX_ALIGN_LENGTH; the default value is chosen after some experimentation using the CHM PacBio as truth and the branch sh-sv-interlvatree-eval.
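The minimum-size filtering described in the PR ("filter out variants of size < 50") can be sketched as follows. The interface name mirrors the StructuralVariantFilter mentioned in the review, but this filter class and its body are illustrative stand-ins, not the GATK implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SizeFilterSketch {
    // Assumed value of STRUCTURAL_VARIANT_SIZE_LOWER_BOUND from the PR text.
    static final int STRUCTURAL_VARIANT_SIZE_LOWER_BOUND = 50;

    // Interface name taken from the review; the single-method shape here is
    // a simplification.
    interface StructuralVariantFilter {
        boolean pass(int svLen);
    }

    // Illustrative filter: keep only variants whose absolute SVLEN reaches
    // the lower bound (deletions carry negative SVLEN, hence abs()).
    static final class SVLenFilter implements StructuralVariantFilter {
        public boolean pass(int svLen) {
            return Math.abs(svLen) >= STRUCTURAL_VARIANT_SIZE_LOWER_BOUND;
        }
    }

    public static void main(String[] args) {
        List<Integer> svLens = Arrays.asList(-120, 30, 49, 50, 600);
        StructuralVariantFilter f = new SVLenFilter();
        List<Integer> kept = svLens.stream().filter(f::pass).collect(Collectors.toList());
        System.out.println(kept); // prints [-120, 50, 600]
    }
}
```

In GATK the real version would set a FILTER field via something like postHocFilterVariant rather than dropping records outright.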