
Inversion breakpoint linking (no merge yet) #4789

Closed
wants to merge 6 commits

Conversation

@SHuang-Broad (Contributor) commented May 19, 2018

This is for demo purposes only, so the code is not yet ready to be merged:

Description

background & goal

Currently, there are two parallel code paths for structural variation breakpoint location and type inference using local-assembly contig alignment signatures in the pipeline StructuralVariationDiscoveryPipelineSpark:

  • the stable code path: iteratively scans neighboring chimeric alignment pairs of a contig and outputs inversion breakpoints as the symbolic variant <INV>, annotated with INV55 or INV33 to signal whether it is the left or right breakpoint of the assumed inversion.
  • the experimental code path: separates the alignment pre-processing step from the inference step and studies the alignments as a whole; in addition to outputting insertion, deletion, and small duplication calls as the stable path does, it outputs
    • BND records representing assembled breakpoints whose type could not be completely determined using only the contig alignments; this includes putative inversion breakpoints
    • complex (<CPX>) variants from assembly contigs with more than 2 alignments

The tool proposed in this PR is based on manual review of a callset generated a long time ago (but still useful for studying how to filter inversion breakpoints), and is designed to be integrated with the experimental code path.

proposed algo

input:

  • the "INV55/INV33"-annotated BND records output by the upstream experimental code path
    • BND's have related concepts of MATE and PARTNER (see figure below, left)
    • MATE: novel adjacency, i.e. contiguity on sample that is absent on reference (e.g. mobile element insertions, deletions)
    • PARTNER: novel disruption, i.e. contiguity on reference disrupted on sample (e.g. insertions, deletions)

[figure "inversion_demo": BND MATE/PARTNER concepts (left); table of CPX-derived inversion calls (right)]

  • complex variants detected by the upstream experimental code path; sometimes inversion calls are incorporated as part of a larger, more complex event, and the logic implemented in the upstream code theoretically allows for arbitrarily complex rearrangements; shown above on the right is a table of inversion calls (TP/FP 12/7, using PacBio calls on the CHM-1 & CHM-13 cell lines as truth) extracted from the <CPX> calls; they were extracted by the tool proposed in PR #4602 ((SV) re-interpreting CPX records by experimental interpretation tool).
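
To make the MATE concept concrete, here is a pair of illustrative INV55-annotated BND records in VCF breakend notation (hypothetical positions, IDs, and the INV55 INFO key are made up for illustration, not taken from the PR's callset); MATEID is the standard VCF field linking the two records of one novel adjacency, and the t]p] form of the ALT allele encodes the ++ connection that INV55 signals:

```
#CHROM  POS   ID           REF  ALT           QUAL  FILTER  INFO
chr1    1000  INV55_bnd_1  A    A]chr1:5000]  .     .       SVTYPE=BND;MATEID=INV55_bnd_2;INV55
chr1    5000  INV55_bnd_2  T    T]chr1:1000]  .     .       SVTYPE=BND;MATEID=INV55_bnd_1;INV55
```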

stages:

  • Primitive filter on breakpoints

    • low MQ of the assembly contig mappings that evidenced the BND records
    • suspiciously large distance between mates (mate pairs whose distance is over $10^5$ bp, ~1/3 of the input, see below, are more likely to be artifacts or dispersed/segmental duplications)
    • overlap with CPX calls (supposedly these should already be captured, or they are more complex than what the logic proposed here can comprehend)

The mates are then converted to intervals bounded by the mates' locations. These "normal-sized" variants are sent down for further analysis (a sketch of this filter follows).
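
A minimal sketch of the primitive filter in plain Java, with a hypothetical Bnd record standing in for the INV55/INV33-annotated BND VariantContexts; the MQ cutoff is an assumed placeholder (the PR does not state one here), while the $10^5$ bp mate-distance cutoff is the one discussed above:

```java
import java.util.List;

final class PrimitiveFilterSketch {
    // Hypothetical stand-in for an INV55/INV33-annotated BND record.
    record Bnd(String chr, int pos, int matePos, int contigMq, boolean overlapsCpx) {}

    static final int MIN_MQ = 30;                  // assumed cutoff, for illustration only
    static final int MAX_MATE_DISTANCE = 100_000;  // the 10^5 bp cutoff discussed above

    static List<Bnd> primitiveFilter(final List<Bnd> bnds) {
        return bnds.stream()
                .filter(b -> b.contigMq() >= MIN_MQ)                                // low-MQ evidence
                .filter(b -> Math.abs(b.matePos() - b.pos()) <= MAX_MATE_DISTANCE)  // mates too far apart
                .filter(b -> !b.overlapsCpx())                                      // overlaps a CPX call
                .toList();
    }
}
```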

  • Filtering based on overlap signatures. Here we have several possible scenarios (total ~130 pairs of mates):

    • no overlappers (~50 mate pairs, balanced between ++/--)
    • multiple overlappers (~10 mate pairs, balanced between ++/--)
    • uniquely overlapping pairs of mate pairs (~60 mate pairs)
      • which overlap with the same type (++/++ or --/--, ~10 mate pairs)
      • which overlap with the opposite type (++/-- overlap, ~50 mate pairs). These are the overlapping pairs sent down for breakpoint linking, expecting a maximum of 20-30 inversion calls (a sketch of this classification follows the type-inference list below).
  • Type inference. Here we have four possible cases, each signaling what rearrangement could be involved (a primed block is inverted):

    • INV55 interval left/right boundary upstream of INV33 interval's left/right boundary: ABC -> B'
    • INV33 interval left/right boundary upstream of INV55 interval's left/right boundary: ABC -> AC'B'A'C
    • INV33 interval contains INV55 interval: ABC -> ABA' or AB'A'
    • INV55 interval contains INV33 interval: ABC -> C'BC or C'B'C

Note that for the last two cases, where an inverted dispersed duplication is guaranteed, the two possible alternate alleles are reverse complements (inversions) of each other; hence signatures of contig alignments alone are not enough, and alignments of short reads within the affected region cannot break the degeneracy either.
So we need to attach left and right flanking regions to the affected region, align short reads back to these two haplotypes, and study the pair orientations of the alignments to break the degeneracy.
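
A minimal sketch, in plain Java with hypothetical types, of the overlap classification and the four-way case dispatch described above; the interval representation, names, and boundary arithmetic are illustrative assumptions, not the real tool's code:

```java
import java.util.ArrayList;
import java.util.List;

final class InversionLinkingSketch {
    enum BndType { INV55, INV33 } // ++ and -- respectively

    // Hypothetical interval derived from a pair of BND mates.
    record MatePairInterval(String chr, int start, int end, BndType type) {
        boolean overlaps(final MatePairInterval o) {
            return chr.equals(o.chr) && start <= o.end && o.start <= end;
        }
    }

    // Overlap-signature classification: keep only intervals overlapping exactly one
    // other interval, of the opposite type. (A real implementation would also
    // require the overlap to be mutually unique.)
    static List<MatePairInterval[]> uniqueOppositeTypePairs(final List<MatePairInterval> intervals) {
        final List<MatePairInterval[]> pairs = new ArrayList<>();
        for (int i = 0; i < intervals.size(); i++) {
            final MatePairInterval iv = intervals.get(i);
            final List<Integer> overlappers = new ArrayList<>();
            for (int j = 0; j < intervals.size(); j++) {
                if (j != i && iv.overlaps(intervals.get(j))) overlappers.add(j);
            }
            if (overlappers.size() == 1 && overlappers.get(0) > i) { // count each pair once
                final MatePairInterval other = intervals.get(overlappers.get(0));
                if (other.type() != iv.type()) pairs.add(new MatePairInterval[]{iv, other});
            }
        }
        return pairs;
    }

    // The four scenarios from the type-inference list above.
    enum Scenario { FIVE_UPSTREAM_OF_THREE, THREE_UPSTREAM_OF_FIVE,
                    THREE_CONTAINS_FIVE, FIVE_CONTAINS_THREE }

    static Scenario infer(final int fiveLeft, final int fiveRight,
                          final int threeLeft, final int threeRight) {
        // Containment checks first, since they are the more specific cases.
        if (threeLeft <= fiveLeft && fiveRight <= threeRight) return Scenario.THREE_CONTAINS_FIVE; // ABC -> ABA' or AB'A'
        if (fiveLeft <= threeLeft && threeRight <= fiveRight) return Scenario.FIVE_CONTAINS_THREE; // ABC -> C'BC or C'B'C
        return fiveLeft < threeLeft ? Scenario.FIVE_UPSTREAM_OF_THREE
                                    : Scenario.THREE_UPSTREAM_OF_FIVE;
    }
}
```

The two containment scenarios are dispatched first because, as noted above, they are the degenerate inverted-dispersed-duplication cases that subsequently need short-read pair orientations to be resolved.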

output:

  • VCF containing the inversion and flanking deletion and dispersed duplication calls (together with CPX-derived inversion calls, the total number of INV calls is ~30)

  • BED file of filtered BND's citing the reason for filtering

Relevant files, including an IGV session, can be found in this bucket

Todo:

  • Large inversions. The primitive size-based filtering step and the filtering step requiring matched mate pairs will undoubtedly cause false negatives, as we sometimes cannot expect all breakpoints to be assembled for inversions complicated by copy-number events. The inversions currently captured tend to be small (sizes in bp):
Min.     1st Qu.    Median      Mean     3rd Qu.       Max.
77.0       188.0     359.0    1144.5       823.2    12697.0
  • run checks against reference annotations: known segmental duplications, RC-STRs, centromeres, as well as checking for consistent pair support from short reads

    • Question: is this pre-filtering a bad idea; if not, how to report?

    • Question: is the number $10^5$ reasonable?

    • Question: any other suggestion on filtering criteria?

  • Corner cases here & there in the proposed tool

@SHuang-Broad changed the title from "Sh sv inversion demo" to "Inversion breakpoint linking (no merge yet)" on May 19, 2018
@codecov-io commented May 25, 2018

Codecov Report

Merging #4789 into master will decrease coverage by 0.47%.
The diff coverage is 9.266%.

@@              Coverage Diff               @@
##              master     #4789      +/-   ##
==============================================
- Coverage     86.657%   86.187%   -0.47%     
- Complexity     29049     29057       +8     
==============================================
  Files           1808      1813       +5     
  Lines         134688    135515     +827     
  Branches       14938     15043     +105     
==============================================
+ Hits          116716    116796      +80     
- Misses         12559     13300     +741     
- Partials        5413      5419       +6
| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| .../DiscoverVariantsFromContigAlignmentsSAMSpark.java | 82.143% <ø> (ø) | 7 <0> (ø) ⬇️ |
| ...covery/inference/CpxVariantReInterpreterSpark.java | 100% <ø> (ø) | 5 <0> (ø) ⬇️ |
| ...inference/LinkedInversionBreakpointsInference.java | 0% <0%> (ø) | 0 <0> (?) |
| ...y/inference/InversionBreakendInterpreterSpark.java | 0% <0%> (ø) | 0 <0> (?) |
| ...ools/spark/sv/utils/ReferenceSequencesAligner.java | 0% <0%> (ø) | 0 <0> (?) |
| ...s/spark/sv/discovery/SvDiscoveryInputMetaData.java | 100% <100%> (ø) | 7 <0> (ø) ⬇️ |
| .../sv/StructuralVariationDiscoveryPipelineSpark.java | 89.189% <100%> (ø) | 14 <0> (ø) ⬇️ |
| ...iscovery/inference/InversionBreakendPreFilter.java | 30.601% <30.601%> (ø) | 5 <5> (?) |
| ...iscoverFromLocalAssemblyContigAlignmentsSpark.java | 69.048% <48%> (-3.145%) | 9 <1> (ø) |
| ...ellbender/tools/spark/sv/utils/SVLocalContext.java | 5.747% <5.747%> (ø) | 1 <1> (?) |

... and 7 more

@SHuang-Broad force-pushed the sh-sv-inversion-demo branch 3 times, most recently from e5317f3 to efb8340 on June 14, 2018 18:19
@SHuang-Broad force-pushed the sh-sv-inversion-demo branch 3 times, most recently from 91e37be to 062d542 on June 24, 2018 20:18
@SHuang-Broad force-pushed the sh-sv-inversion-demo branch 5 times, most recently from c3658ff to 1992778 on June 29, 2018 23:03
@SHuang-Broad force-pushed the sh-sv-inversion-demo branch 2 times, most recently from c774c29 to da83a81 on July 9, 2018 20:50
@SHuang-Broad force-pushed the sh-sv-inversion-demo branch 2 times, most recently from ad35d45 to 322a7f6 on July 12, 2018 14:14
@cwhelan (Member) left a comment:

I've looked through at a high level, although I'd prefer not to do a line-by-line code review until you think this is ready to be merged. Here are some comments on your proposed algorithm description:

  • I would of course prefer not to have a hard filter on length. This would mean we would never call a large inversion even if it exists. What about some other filters more specifically aimed at the artifacts that cause these false large calls? I think it's a good idea to check annotations -- i.e. do the mates lie at two regions that are segmental duplications of each other, or does one side of the mate look like a transposable element insertion? I guess it's OK to put in a tool with this limit temporarily, though.
  • Building in an ability to check the copy number of the region in which an inversion breakpoint lies (by checking against a CNV call for example) would probably be really helpful.
  • When you say 'overlaps with CPX' it might be helpful to have a more particular set of criteria. A larger inversion event might span over a smaller complex event.
  • I'd check to see if there are additional filters implemented in SV-Pipe that you could apply here.
  • I don't think it's that necessary to provide reports on things that didn't become breakpoints. If they still exist as BNDs in the original output, perhaps there are some codes we could add as INFO field annotations to indicate that they were considered for inversion calling but filtered, and the reason for doing so.
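
A minimal sketch of the annotation idea in the last bullet, assuming a hypothetical INFO key INV_BND_FILTER (the key name and reason strings are made up for illustration); VariantContextBuilder is htsjdk's standard way to add an attribute to a copy of an existing record:

```java
import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.variantcontext.VariantContextBuilder;

final class BndFilterAnnotationSketch {
    // Return a copy of the BND record annotated with the reason it was
    // considered for inversion calling but filtered out.
    static VariantContext markConsideredButFiltered(final VariantContext bnd, final String reason) {
        return new VariantContextBuilder(bnd)
                .attribute("INV_BND_FILTER", reason) // hypothetical INFO key, e.g. "MATES_TOO_FAR"
                .make();
    }
}
```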

Question about the implementation:

  • Does this even have to be a spark tool? It looks like you are just reading the variants into a parallel spark context, filtering, and then collecting them to actually process them. Why not just make this a non-spark tool and process it all in memory on one node?

.makeInterpretation(preprocessingResult, pathToFastqsDir, reference, localLogger);

// TODO: 5/16/18 artifact that we currently don't have sample columns for discovery VCF (until genotyping code is in)
final String sampleId = "sample";

Member commented:

Maybe just take sample name as a parameter until this gets sorted out.

final InvBreakEndContext threePrimeBreakEnd;

final String chr;
final int fiveIntervalLeftBoundary;

Member commented:

Why not store these as a pair of SimpleIntervals?

* since we currently don't have records with more than one mate,
* hence the only one mate is enough.
*/
private static final class PrimitiveFilteringResult {

Member commented:

This is sort of another version of a conversation we've had before, but I still think that this pattern of splitting a collection into multiple smaller collections and holding them all together in an object is inefficient and leads to tons of extra and unnecessary code. Why not add a filter or filters field to InvBreakEndContext and simply make one pass through the list populating it? Then you could get rid of the PrimitiveFilteringResult and AnnotatedFilteredInterval classes altogether, I think.
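
A minimal sketch of the single-pass pattern suggested here, with a hypothetical stand-in for InvBreakEndContext carrying a mutable filters field (the field names and cutoffs are illustrative placeholders, not the PR's code):

```java
import java.util.ArrayList;
import java.util.List;

final class SinglePassAnnotationSketch {
    // Hypothetical stand-in for InvBreakEndContext with a mutable 'filters' field.
    static final class InvBreakEnd {
        final int mappingQuality;
        final int mateDistance;
        final List<String> filters = new ArrayList<>(); // empty means "passes"
        InvBreakEnd(final int mq, final int dist) { mappingQuality = mq; mateDistance = dist; }
    }

    // One pass through the list, annotating in place; no PrimitiveFilteringResult-style
    // object holding several sub-collections is needed.
    static void annotate(final List<InvBreakEnd> records) {
        for (final InvBreakEnd r : records) {
            if (r.mappingQuality < 30) r.filters.add("LOW_MQ");           // assumed cutoff
            if (r.mateDistance > 100_000) r.filters.add("MATES_TOO_FAR"); // assumed cutoff
        }
    }
}
```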

* {@link VariantContext#getAttributeAsString(String, String)}
* gives back an array ......
*/
public final Stream<String> makeSureAttributeIsList(final String attributeKey) {

Member commented:

It would be nice if we could figure out what's going on with this and fix it, whether it's something to do with the values we put in it or a bug in VariantContext. If the latter it'd be great to file a bug against htsjdk to fix.

return getID().endsWith("_1");
}

public static Stream<BreakEndSVContext> getOneEnd(final Stream<BreakEndSVContext> breakEndVariants,

Member commented:

Not sure I understand the name for this method. Which end is it trying to return?

private final Map<InvBreakEndContext, List<String>> multipleOverlappers; // variants who overlaps multiple other records (note: not using map because VC.equals() is hard)
private final Set<Tuple2<InvBreakEndContext, InvBreakEndContext>> uniqueStrangeOverlappers; // pairs of variants (INV55/55, INV33/33) that overlap each other uniquely

private final Set<OverlappingPair> uniqueHetOverlappers; // pairs of variants (INV55/33, INV33/55) that overlap each other uniquely, pairs that will be analyzed

Member commented:

Why do you call these het? I don't see what this has to do with genotype.


private final Set<OverlappingPair> uniqueHetOverlappers; // pairs of variants (INV55/33, INV33/55) that overlap each other uniquely, pairs that will be analyzed

OverlapBasedFilteringResult(final Set<InvBreakEndContext> noOverlappers,

Member commented:

Have you thought about storing these breakpoints in a graph structure?

Then you could traverse nodes in the graph and get their category on the fly instead of splitting them into multiple collections.
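
A minimal sketch of that graph idea, with breakpoints as vertices and pairwise overlaps as edges; a vertex's category then falls out of its degree at traversal time (generic vertex type and category names are illustrative assumptions):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class OverlapGraphSketch<V> {
    private final Map<V, Set<V>> adjacency = new HashMap<>();

    // Record that two breakpoints' intervals overlap.
    void addEdge(final V a, final V b) {
        adjacency.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Degree-based category, computed on the fly instead of being stored
    // in separate collections.
    String category(final V v) {
        final int degree = adjacency.getOrDefault(v, Set.of()).size();
        if (degree == 0) return "NO_OVERLAPPER";
        if (degree == 1) return "UNIQUE_OVERLAPPER";
        return "MULTIPLE_OVERLAPPERS";
    }
}
```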

* is NOT necessary
* to resolve which variants are here.
*/
abstract static class NonDegenerateScenario extends OverlappingPairHandler {

Member commented:

In general I think naming abstract subclasses after their superclass is a good idea for readability, i.e. call this NonDegenerateOverlappingPairHandler.

overlappingPair.threeIntervalLeftBoundary, overlappingPair.fiveIntervalRightBoundary);
final VariantContextBuilder inversionBuilder = makeInversion(invertedInterval, Allele.create(refBase, true));

if ( overlappingPair.threeIntervalLeftBoundary - overlappingPair.fiveIntervalLeftBoundary - 1 > 49 ) {

Member commented:

Don't we have the 50 value stored as a constant somewhere? Also, what's the rationale for using 50 bases here?

}
}

private static final class FiveInterleavesThree extends NonDegenerateScenario {

Member commented:

A lot of the code between this class and the three -> five class below looks pretty similar. It'd be good to find opportunities for reuse to remove code duplication.

@SHuang-Broad (Contributor, Author) commented:

I've looked through at a high level, although I'd prefer not to do a line-by-line code review until you think this is ready to be merged. Here are some comments on your proposed algorithm description:

Answer: thanks for looking! That's what I intended, as I'm not sure whether the algorithm itself would face strong criticism; if it did, time spent on coding would not be worth it IMO. In fact, ideally I'd like to write a design doc or something similar, and only code after the design is agreed on.


I would of course prefer not to have a hard filter on length. This would mean we would never call a large inversion even if it exists.

Answer: Totally agree. Now looking back, it gets clearer to me that this proposal contains two parts, the filtering part and the breakpoint-linking part, separated into two major classes, InversionBreakendPreFilter and LinkedInversionBreakpointsInference. That being said, it doesn't make much sense to separate them into two PRs, because currently the filtering part is designed around the linking part, i.e. it is trying to check which BND's are suitable for the logic implemented in the linking part, and if the logic isn't applicable to a BND, the BND simply slips through without generating any new interpretations. So InversionBreakendPreFilter is a filter and a classifier at the same time; its function is really diverting different BND's to be handled by different logics, and it definitely should be improved.
If you buy this argument: I am also fully aware of the code design issue, namely that it is preferable NOT to divert, gather, and send through different handlers like it currently is for calling variants from the assembly contigs; instead it should be a single stream pass through all the BND's. I'll try to follow the preferred design.

What about some other filters more specifically aimed at the artifacts that cause these false large calls? I think it's a good idea to check annotations -- i.e. do the mates lie at two regions that are segmental duplications of each other, or does one side of the mate look like a transposable element insertion? I guess it's OK to put in a tool with this limit temporarily, though.

Answer: Agree. And I think these kinds of checks are not only good, but mandatory for a good caller, i.e. taking advantage of prior knowledge. I am also thinking about improving it by checking if there is short-read evidence supporting the BND's (sometimes there isn't any, which makes it strange for the alignments of the assembly to point to such novel adjacencies in the first place).
However, I might have to push back a little on this particular request at this point, because such features can be added later like you said, and now is probably not the best time to do it, since the linking logic is unlikely to be affected by the presence of these reference annotations.
One note though, and I think you don't mean it literally either: the large ones are not necessarily artifacts. As we discussed during the group meetings, they are "filtered" partly because they are more likely to overlap with multiple other BND mates, and are more likely to be artifacts or mobile element insertions. In other words, they are not sent for linking because they are less likely to be suited to the linking logic in LinkedInversionBreakpointsInference. But if they do get sent, the linking is totally agnostic about size.

Building in an ability to check the copy number of the region in which an inversion breakpoint lies (by checking against a CNV call for example) would probably be really helpful.

Answer: Agree. And it shouldn't be too hard to do that. But again, maybe in a different PR?

When you say 'overlaps with CPX' it might be helpful to have a more particular set of criteria. A larger inversion event might span over a smaller complex event.

Answer: Note taken. I'll improve on this in the next polished implementation that is ready for line-by-line review.

I'd check to see if there are additional filters implemented in SV-Pipe that you could apply here.

Answer: Yes. But I'm not sure if by SV-Pipe you mean the RDTest repo? Sorry, I wasn't around when I was added to that repo, so I don't know the context. In addition, I believe this could be another future improvement?

I don't think it's that necessary to provide reports on things that didn't become breakpoints. If they still exist as BNDs in the original output, perhaps there are some codes we could add as INFO field annotations to indicate that they were considered for inversion calling but filtered, and the reason for doing so.

Answer: Agree. And I think the final format can definitely change, as long as the resolved variants, be they deletions, inverted dispersed duplications, and/or real inversions, are registered in a VCF. The BED file that is currently produced is for development use, i.e. for me to check why a certain BND is not sent for the analysis so that I can develop ideas on how to catch them in the future (you know, BED files are IGV-friendly), and it can certainly be dropped from the final output.
Regarding adding another INFO field saying they are not used for THIS PARTICULAR logic: I prefer not to do it this way, because we can add more logics in the future capturing/resolving these BND's that are not suitable for THIS PARTICULAR logic. I guess in general my personal preference is to put less algorithm-related information in the VCF for analysts (less reading for them), and to produce add-on files for tool developers. What are your thoughts?

Does this even have to be a spark tool? It looks like you are just reading the variants into a parallel spark context, filtering, and then collecting them to actually process them. Why not just make this a non-spark tool and process it all in memory on one node?

Answer: Agree. It doesn't have to be, at least in theory, and it probably would be faster, as we don't need to incur the Spark overhead for such a typically small job. But (I'm saying too many buts...) up to this point all SV tools live under the package hellbender.tools.spark.sv, so I'm following suit here. Note that the two classes' main interface methods mention nothing about RDDs (that's on purpose).
On the other hand, this is an engineering question, I believe, and it depends on whether we want to put as much of the discovery code as possible into StructuralVariationDiscoveryPipelineSpark (the last commit actually hooks the two classes into it, so a single invocation of the tool produces more variants), or whether we go the WDL route for pipelining the whole process.


All in all, I think the comments and critiques are generally about the "filtering"/"classifying" part, and the most serious concern about it is false negatives. Am I understanding correctly? If so, given that the filtering step is only picking the BND's that are suitable for the linking logic, I can imagine the false-negative problem being solved in the future by other logics (e.g. a more relaxed requirement on pair matching, or not even requiring matched INV55/INV33 pairs, etc.). In fact, that's what I'm planning on.
The other part of the problem is how much I can accomplish in this single PR, and how large it should be: the forever-existing problem for new tools.

@SHuang-Broad force-pushed the sh-sv-inversion-demo branch 2 times, most recently from b2b3950 to 3c1d12f on August 7, 2018 21:05
@SHuang-Broad force-pushed the sh-sv-inversion-demo branch 2 times, most recently from f1024a2 to d8f404f on August 17, 2018 14:19
@SHuang-Broad (Contributor, Author) commented:

Closing, as it won't be merged anytime soon (only I care about it now).
