
(SV) output merged single VCF for new interpretation tool #4996

Merged
merged 1 commit into master from sh-sv-exp-single-vcf on Aug 2, 2018

Conversation

SHuang-Broad
Contributor

One more step towards using this new tool.

Does:

  • output a single VCF containing <INS>, <DEL>, <DUP>, and <INV> calls (there will be more <INV> calls, but that cannot happen until someone reviews PR Inversion breakpoint linking (no merge yet) #4789 and checks whether the proposed algorithm makes sense)
  • introduce downstream filtering parameters: since this new tool applies more permissive filters on the MQ and alignment length of the assembly contigs' mappings, VCF records can now be filtered based on the annotations MAPPING_QUALITIES and MAX_ALIGN_LENGTH; the default values were chosen after some experimentation using the CHM PacBio data as truth and the branch
    sh-sv-interlvatree-eval.
  • clean up VCF headers and related tests.
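The post-hoc filtering idea described above can be sketched as follows. This is a minimal illustration, not the tool's code: records are modeled as plain annotation maps rather than htsjdk VariantContext objects, and the threshold constants are placeholders, not the PR's tuned defaults.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Sketch of filtering records on the MAPPING_QUALITIES and MAX_ALIGN_LENGTH annotations. */
public class PostHocFilterSketch {

    // hypothetical thresholds; the PR tunes its defaults against CHM PacBio truth data
    static final int MIN_MQ = 20;
    static final int MIN_ALIGN_LENGTH = 30;

    static boolean passes(final Map<String, Object> info) {
        @SuppressWarnings("unchecked")
        final List<Integer> mqs =
                (List<Integer>) info.getOrDefault("MAPPING_QUALITIES", Collections.emptyList());
        final int maxAlnLen = (Integer) info.getOrDefault("MAX_ALIGN_LENGTH", 0);
        // keep the record only if some supporting mapping is confident enough
        // and the best alignment is long enough
        return mqs.stream().anyMatch(mq -> mq >= MIN_MQ) && maxAlnLen >= MIN_ALIGN_LENGTH;
    }

    public static void main(String[] args) {
        final Map<String, Object> good =
                Map.of("MAPPING_QUALITIES", List.of(60, 10), "MAX_ALIGN_LENGTH", 150);
        final Map<String, Object> bad =
                Map.of("MAPPING_QUALITIES", List.of(5), "MAX_ALIGN_LENGTH", 10);
        System.out.println(passes(good)); // true
        System.out.println(passes(bad));  // false
    }
}
```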

@codecov-io

codecov-io commented Jul 12, 2018

Codecov Report

Merging #4996 into master will decrease coverage by 0.009%.
The diff coverage is 84.463%.

@@               Coverage Diff               @@
##              master     #4996       +/-   ##
===============================================
- Coverage     86.385%   86.376%   -0.009%     
- Complexity     28822     28835       +13     
===============================================
  Files           1791      1791               
  Lines         133561    133762      +201     
  Branches       14902     14919       +17     
===============================================
+ Hits          115377    115538      +161     
- Misses         12791     12818       +27     
- Partials        5393      5406       +13
Impacted Files Coverage Δ Complexity Δ
.../AssemblyContigAlignmentsConfigPickerUnitTest.java 99.057% <ø> (ø) 30 <0> (ø) ⬇️
...ry/alignment/ContigAlignmentsModifierUnitTest.java 99.194% <ø> (ø) 19 <0> (ø) ⬇️
...very/alignment/AlignedContigGeneratorUnitTest.java 97.17% <ø> (ø) 10 <0> (ø) ⬇️
...ery/inference/SimpleNovelAdjacencyInterpreter.java 80.882% <ø> (ø) 11 <0> (ø) ⬇️
...ender/tools/spark/sv/utils/GATKSVVCFConstants.java 0% <ø> (-75%) 0 <0> (-1)
.../sv/discovery/inference/CpxVariantInterpreter.java 79.839% <ø> (ø) 26 <0> (ø) ⬇️
.../DiscoverVariantsFromContigAlignmentsSAMSpark.java 82.143% <ø> (ø) 7 <0> (ø) ⬇️
...lignment/AssemblyContigAlignmentsConfigPicker.java 91.563% <ø> (ø) 108 <0> (ø) ⬇️
...s/spark/sv/discovery/SvDiscoveryInputMetaData.java 100% <ø> (ø) 7 <0> (ø) ⬇️
...ry/inference/ImpreciseVariantDetectorUnitTest.java 100% <ø> (ø) 5 <0> (ø) ⬇️
... and 19 more

@SHuang-Broad SHuang-Broad force-pushed the sh-sv-exp-single-vcf branch 6 times, most recently from 601eff3 to ccd5f87 Compare July 25, 2018 19:36
@cwhelan cwhelan self-assigned this Jul 25, 2018
Member

@cwhelan cwhelan left a comment


This looks mostly fine; I have some minor suggestions about how to restructure a little of the code.

@@ -269,11 +269,12 @@ public void validate() {

public enum SvEvidenceFilterType {DENSITY, XGBOOST}

public static class DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection implements Serializable {
public static class VariantsDiscoveryFromContigsAlignmentsSparkArgumentCollection implements Serializable {
Member

Why change this name? I thought it read better the old way. Actually, though, both have inconsistent pluralization; it should be either DiscoverVariantsFromContigAlignmentsSparkArgumentCollection or VariantDiscoveryFromContigAlignmentsSparkArgumentCollection.

Contributor Author

Picked DiscoverVariantsFromContigAlignmentsSparkArgumentCollection as suggested.

@@ -269,11 +269,12 @@ public void validate() {

public enum SvEvidenceFilterType {DENSITY, XGBOOST}

public static class DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection implements Serializable {
public static class VariantsDiscoveryFromContigsAlignmentsSparkArgumentCollection implements Serializable {
private static final long serialVersionUID = 1L;

public static final int GAPPED_ALIGNMENT_BREAK_DEFAULT_SENSITIVITY = STRUCTURAL_VARIANT_SIZE_LOWER_BOUND; // alignment with gap of size >= 50 will be broken apart.
public static final int CHIMERIC_ALIGNMENTS_HIGHMQ_THRESHOLD = 60;
Member

I'd add a TODO comment saying that this parameter will be removed when we remove the old tool.

Contributor Author

done

@@ -43,9 +43,11 @@
import org.broadinstitute.hellbender.utils.read.GATKRead;
import org.broadinstitute.hellbender.utils.read.SAMRecordToGATKReadAdapter;
import scala.Serializable;
import scala.Tuple3;
Member

I think this import isn't used.

Contributor Author

done

import org.broadinstitute.hellbender.tools.spark.sv.utils.SVIntervalTree;
import org.broadinstitute.hellbender.tools.spark.sv.utils.SVUtils;
import org.broadinstitute.hellbender.tools.spark.sv.utils.SVVCFWriter;
import org.broadinstitute.hellbender.utils.Utils;
import org.broadinstitute.hellbender.utils.io.IOUtils;
import org.broadinstitute.hellbender.utils.read.GATKRead;
import scala.Tuple2;
import scala.Tuple3;
Member

unused import

Contributor Author

done

// TODO: 1/10/18 bring back read annotation, see ticket 4228
final List<VariantContext> simpleChimeraVariants =
extractSimpleVariants(contigsByPossibleRawTypes.simple, svDiscoveryInputMetaData, outputPrefixWithSampleName);
contigsByPossibleRawTypes.simple.unpersist(false);
Member

I think you just moved this code around for this PR, but looking at it I'm not sure I like this method having the side effect of calling unpersist on the internal RDDs of contigsByPossibleRawTypes. Just to avoid side effects I'd prefer to pull the unpersist calls out into the calling method. Actually, since you've got them bundled in one object, why not make an unpersist method on AssemblyContigsClassifiedByAlignmentSignatures that unpersists all of the underlying RDDs? Then it's just a one-line invocation from the callers of this method.
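The suggested refactor could look roughly like the following self-contained sketch. The `Cacheable` interface stands in for Spark's `JavaRDD` so the example compiles without Spark, and the class and field names are loose assumptions modeled on the PR, not its actual code.

```java
import java.util.List;

public class UnpersistSketch {
    /** Stand-in for the unpersist capability of Spark's JavaRDD. */
    interface Cacheable { void unpersist(boolean blocking); }

    /** Hypothetical container mirroring AssemblyContigsClassifiedByAlignmentSignatures. */
    static final class ClassifiedContigs {
        final Cacheable unknown, simple, complex;

        ClassifiedContigs(Cacheable unknown, Cacheable simple, Cacheable complex) {
            this.unknown = unknown;
            this.simple = simple;
            this.complex = complex;
        }

        /** One call releases every cached dataset, so callers avoid scattered side effects. */
        void unpersist() {
            for (final Cacheable c : List.of(unknown, simple, complex)) {
                c.unpersist(false); // non-blocking, matching unpersist(false) in the PR
            }
        }
    }
}
```

With this in place, a caller invokes `contigsByPossibleRawTypes.unpersist()` once instead of unpersisting each RDD inside an interpretation method.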

Contributor Author

Thanks! That's much better.

}
}

public static final class SVAlnFilter implements StructuralVariantFilter {
Member

Write out SVAlignmentLengthFilter

Contributor Author

done

@@ -154,4 +153,27 @@ public static void writeSAMRecords(final List<GATKRead> reads, final Set<String>
samRecords.sort(localComparator);
SVFileUtils.writeSAMFile( outputPath, samRecords.iterator(), cloneHeader, true);
}

/**
* this exists because, for whatever reason,
Member

Put in a TODO comment reminding us to get to the bottom of this behavior. If it's something wrong with our data, we should fix it, otherwise we should file a bug against htsjdk.

Contributor Author

added todo.
I've talked with Louis and he has been aware of the issue for a few months (it happens not only to our VCF). Hopefully it will be resolved.

@@ -113,6 +88,12 @@ private ExtractedSimpleVariants(final List<VariantContext> reInterpretZeroOrOneS
public List<VariantContext> getReInterpretMultiSegmentsCalls() {
return reInterpretMultiSegmentsCalls;
}

public List<VariantContext> getMergedReInterpretCalls() {
Member

getMergedReinterpretedCalls

Contributor Author

done


///////////////////////////////////////////////////////////////////////////////////////////////////////////////////

public static final Stream<String> expectedAltAlleleHeaderKeysInVCF
Member

I think that these should be in test code somewhere, not in main code.

Contributor Author

Agree.
I put these here because otherwise it's difficult to remind people to add new annotations to our VCF header, and if we don't, the VCF file is, technically speaking, corrupt.
I tried to use some Java language features to declare the fields here and test that the keys are all present in the test VCF files we have, but that turned out to be ugly.
Do you have any suggestions?

Member

What if we added some assertions to our integration test that checked whether everything is in the header? We could collect all of the alt alleles and info and format annotation keys we see in the variants in StructuralVariationDiscoveryPipelineSparkIntegrationTest.svDiscoveryVCFEquivalenceTest and assert that they have corresponding header line entries. Would that address your concern? You could split that work off of this PR, too, and make a ticket for it.
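The assertion proposed here can be sketched as below. This is a minimal illustration of the idea, not the integration-test code: records and the header are modeled with plain collections rather than htsjdk types, and the method name is hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Sketch: collect every annotation key observed in the emitted records and
 * report any that lack a corresponding header declaration; an empty result
 * means the header is complete.
 */
public class HeaderCompletenessSketch {

    static Set<String> keysMissingFromHeader(final List<Map<String, Object>> records,
                                             final Set<String> declaredKeys) {
        final Set<String> seen = new HashSet<>();
        for (final Map<String, Object> record : records) {
            seen.addAll(record.keySet()); // gather INFO keys actually used
        }
        seen.removeAll(declaredKeys);     // whatever remains is undeclared
        return seen;
    }
}
```

An integration test would then assert `keysMissingFromHeader(...).isEmpty()` after collecting keys (and symbolic ALT allele IDs) from the pipeline's output records.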

}

// for variants-detected from assembly
// todo: create an alternate assembly file and link to it with breakpoint IDs according to the VCF spec
{
{// todo: create an alternate assembly file and link to it with breakpoint IDs according to the VCF spec
Member

Do we already do this? Do you know what needs to happen to make this spec-compliant?

Contributor Author

Yes, we are already doing this in the sense that the assembly contig names are recorded.

To be spec-compliant (specifically spec version 4.2, section 5.4.2 — what we are doing in this case does NOT violate the spec; we basically offer the convenience of getting the exact inserted sequence without having to parse another file, at the cost of a larger file size):

  • we currently always pull the inserted sequence (or whatever sequence we need) from those contigs and place it directly in the VCF records; as you can imagine, these sequences can sometimes be quite long.
  • BND variants are going to be reported differently, as the inserted sequence between the novel adjacencies is currently reported explicitly in the VCF records. Again, convenience vs. file size.
  • the assembly file needs to be revised when the assembler gives the "reverse strand" representation of the event; this could be a little more complicated since the assembly file, assuming it will be a GFA rather than a SAM file, carries connection information, so revising it means the connections need to be revised as well (which can be a little tricky).

Member

OK! Just checking to see if you wanted to keep this todo comment around or if it could be resolved.

Contributor Author

@SHuang-Broad SHuang-Broad left a comment


@cwhelan Thanks for the review! I've incorporated most of the suggested changes except the comment for GATKSVVCFConstants.java.expectedAltAlleleHeaderKeysInVCF, which I don't know what the best option is yet.


new SVAlnFilter(svDiscoveryInputMetaData.getDiscoverStageArgs().minAlignLength));
for (final VariantContext variant : variants) {
String svType = variant.getAttributeAsString(GATKSVVCFConstants.SVTYPE, "");
if (svType.equals(GATKSVVCFConstants.SYMB_ALT_ALLELE_DEL) || svType.equals(GATKSVVCFConstants.SYMB_ALT_ALLELE_INS) || svType.equals(GATKSVVCFConstants.SYMB_ALT_ALLELE_DUP)) {
Contributor Author

Sorry, this is a miscommunication:

Two types of variants are treated specially here: BND and INV.
BND records simply don't have an SVLEN in our implementation now, and IMO they should not.
INV records, in this new interpretation tool, will have SVLEN exactly 0, following the technical definition of SVLEN, which reads (at the beginning of Section 3, VCF spec version 4.2):

##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">

In the stable versioned interpretation tool, SVLEN for inversions is essentially "the number of reference bases inverted + the length of the inserted sequence at the breakpoints", so this is a breaking change.

BTW, I think the Number field of SVLEN is wrong; it should be Number=A, but the spec being the spec, I cannot change it, yet....
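The spec definition quoted here can be illustrated with a tiny sketch, computed for explicit (non-symbolic) alleles; the class and method names are hypothetical. Under "difference in length between REF and ALT alleles", an inversion rewrites exactly the bases it spans, so its SVLEN comes out to 0 — matching the new tool's behavior.

```java
/** Illustration of the SVLEN definition for explicit REF/ALT allele strings. */
public class SvLenSketch {

    static int svLen(final String refAllele, final String altAllele) {
        // SVLEN = difference in length between REF and ALT alleles
        return altAllele.length() - refAllele.length();
    }

    public static void main(String[] args) {
        System.out.println(svLen("A", "ATTTT"));     // insertion: 4
        System.out.println(svLen("ACGTA", "A"));     // deletion: -4
        System.out.println(svLen("ACGTA", "TACGT")); // inversion-like rewrite: 0
    }
}
```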

final List<VariantContext> inversions = extractInversions();// TODO: 6/29/18 placeholder

// merged output
final List<VariantContext> merged = new ArrayList<>(20_000); // estimated size
Contributor Author

done

variantsWithFilterApplied.add(postHocFilterVariant(variant, filters));
}

final String out = outputPrefixWithSampleName + "merged_simple.vcf";
Contributor Author

done


Member

@cwhelan cwhelan left a comment

👍 This looks good now except for my two minor comments.

@SHuang-Broad SHuang-Broad force-pushed the sh-sv-exp-single-vcf branch 2 times, most recently from 45f8106 to 8cdcd26 Compare August 1, 2018 20:51
  * refactor some utilities and GATKSVVCFConstants and GATKSVVCFHeaderLines
  * group methods in StructuralVariationDiscoveryPipelineSpark by functionality
  * bring MAX_ALIGN_LENGTH and MAPPING_QUALITIES annotations from CPX variants to re-interpreted simple variants
  * add new CLI argument and filter assembly based variants based on annotation MAPPING_QUALITIES, MAX_ALIGN_LENGTH
  * filter out variants of size < 50
@SHuang-Broad SHuang-Broad merged commit 2810890 into master Aug 2, 2018
@SHuang-Broad SHuang-Broad deleted the sh-sv-exp-single-vcf branch August 2, 2018 01:24