Inversion breakpoint linking (no merge yet) #4789
Conversation
dbe6675
to
5fafc90
Compare
Codecov Report

@@             Coverage Diff              @@
##             master     #4789     +/-   ##
==============================================
- Coverage     86.657%   86.187%   -0.47%
- Complexity   29049     29057     +8
==============================================
  Files        1808      1813      +5
  Lines        134688    135515    +827
  Branches     14938     15043     +105
==============================================
+ Hits         116716    116796    +80
- Misses       12559     13300     +741
- Partials     5413      5419      +6
I've looked through it at a high level, although I'd prefer not to do a line-by-line code review until you think this is ready to be merged. Here are some comments on your proposed algorithm description:
- I would of course prefer not to have a hard filter on length. This would mean we would never call a large inversion even if it exists. What about some other filters more specifically aimed at the artifacts that cause these false large calls? I think it's a good idea to check annotations -- i.e. do the mates lie in two regions that are segmental duplications of each other, or does one side of the pair look like a transposable element insertion? I guess it's OK to put in a tool with this limit temporarily, though.
- Building in an ability to check the copy number of the region in which an inversion breakpoint lies (by checking against a CNV call for example) would probably be really helpful.
- When you say 'overlaps with CPX', it might be helpful to have a more specific set of criteria. A larger inversion event might span a smaller complex event.
- I'd check to see if there are additional filters implemented in SV-Pipe that you could apply here.
- I don't think it's that necessary to provide reports on things that didn't become breakpoints. If they still exist as BNDs in the original output, perhaps there are codes we could add as INFO-field annotations to indicate that they were considered for inversion calling but filtered out, and the reason for doing so.
Question about the implementation:
- Does this even have to be a Spark tool? It looks like you are just reading the variants into a parallel Spark context, filtering, and then collecting them to actually process them. Why not just make this a non-Spark tool and process it all in memory on one node?
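A minimal sketch of what the non-Spark version might look like, assuming GATK's standard VariantWalker base class; the tool name, the SVTYPE check, and the in-memory collection are hypothetical illustration, not the PR's code:

```java
import htsjdk.variant.variantcontext.VariantContext;
import org.broadinstitute.hellbender.engine.FeatureContext;
import org.broadinstitute.hellbender.engine.ReadsContext;
import org.broadinstitute.hellbender.engine.ReferenceContext;
import org.broadinstitute.hellbender.engine.VariantWalker;

import java.util.ArrayList;
import java.util.List;

public final class LinkInversionBreakpointsSketch extends VariantWalker {
    private final List<VariantContext> bnds = new ArrayList<>();

    @Override
    public void apply(final VariantContext variant, final ReadsContext reads,
                      final ReferenceContext ref, final FeatureContext features) {
        if ("BND".equals(variant.getAttributeAsString("SVTYPE", ""))) {
            bnds.add(variant); // typical BND callsets are small enough to hold in memory
        }
    }

    @Override
    public Object onTraversalSuccess() {
        // filtering + breakpoint linking would run here, after all records are collected
        return bnds.size();
    }
}
```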
.makeInterpretation(preprocessingResult, pathToFastqsDir, reference, localLogger);

// TODO: 5/16/18 artifact that we currently don't have sample columns for discovery VCF (until genotyping code is in)
final String sampleId = "sample";
Maybe just take sample name as a parameter until this gets sorted out.
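A minimal sketch of that suggestion using Barclay's @Argument, keeping the current hard-coded value as the default; the argument name is hypothetical:

```java
import org.broadinstitute.barclay.argparser.Argument;

public final class SampleNameArgumentSketch {
    @Argument(fullName = "sample-name",
              doc = "Sample name to write into the discovery VCF until genotyping code lands",
              optional = true)
    private String sampleId = "sample"; // the current placeholder becomes the default
}
```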
final InvBreakEndContext threePrimeBreakEnd;

final String chr;
final int fiveIntervalLeftBoundary;
Why not store these as a pair of SimpleIntervals?
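For illustration, a sketch of what that refactor might look like, assuming GATK's SimpleInterval (contig, 1-based start, inclusive end); the class, field, and accessor names are hypothetical:

```java
import org.broadinstitute.hellbender.utils.SimpleInterval;

final class BreakpointIntervalsSketch {
    private final SimpleInterval fiveInterval;  // replaces chr + separate boundary ints
    private final SimpleInterval threeInterval;

    BreakpointIntervalsSketch(final String chr,
                              final int fiveLeft, final int fiveRight,
                              final int threeLeft, final int threeRight) {
        this.fiveInterval  = new SimpleInterval(chr, fiveLeft, fiveRight);
        this.threeInterval = new SimpleInterval(chr, threeLeft, threeRight);
    }

    // former int fields become delegations
    int fiveIntervalLeftBoundary()  { return fiveInterval.getStart(); }
    int threeIntervalLeftBoundary() { return threeInterval.getStart(); }
}
```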
* since we currently don't have records with more than one mate,
* hence the single mate is enough.
*/
private static final class PrimitiveFilteringResult {
This is sort of another version of a conversation we've had before, but I still think that this pattern of splitting a collection into multiple smaller collections and holding them all together in an object is inefficient and leads to tons of extra and unnecessary code. Why not add a filter or filters field to InvBreakEndContext and simply make one pass through the list populating it? Then you could get rid of the PrimitiveFilteringResult and AnnotatedFilteredInterval classes altogether, I think.
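A minimal sketch of that single-pass alternative, with a stand-in interface since the real InvBreakEndContext lives in this PR; the method names and the two example checks are placeholders for whatever filters are actually applied:

```java
import java.util.ArrayList;
import java.util.List;

final class OnePassFilterSketch {
    // minimal stand-in for the PR's InvBreakEndContext; these methods are hypothetical
    interface InvBreakEndContext {
        int impliedLength();
        boolean overlapsCpxCall();
        void setFilters(List<String> failedFilters);
    }

    static void applyFilters(final List<InvBreakEndContext> variants) {
        for (final InvBreakEndContext v : variants) {
            final List<String> failed = new ArrayList<>();
            if (v.impliedLength() > 100_000) failed.add("TOO_LONG");     // the 10^5 size cut
            if (v.overlapsCpxCall())         failed.add("OVERLAPS_CPX");
            v.setFilters(failed); // empty list == record proceeds to breakpoint linking
        }
    }
}
```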
* {@link VariantContext#getAttributeAsString(String, String)}
* gives back an array ......
*/
public final Stream<String> makeSureAttributeIsList(final String attributeKey) {
It would be nice if we could figure out what's going on with this and fix it, whether it's something to do with the values we put in it or a bug in VariantContext. If the latter, it'd be great to file a bug against htsjdk to fix.
return getID().endsWith("_1");
}

public static Stream<BreakEndSVContext> getOneEnd(final Stream<BreakEndSVContext> breakEndVariants,
Not sure I understand the name for this method. Which end is it trying to return?
private final Map<InvBreakEndContext, List<String>> multipleOverlappers; // variants that overlap multiple other records (note: not using map because VC.equals() is hard)
private final Set<Tuple2<InvBreakEndContext, InvBreakEndContext>> uniqueStrangeOverlappers; // pairs of variants (INV55/55, INV33/33) that overlap each other uniquely

private final Set<OverlappingPair> uniqueHetOverlappers; // pairs of variants (INV55/33, INV33/55) that overlap each other uniquely; these pairs will be analyzed
Why do you call these het? I don't see what this has to do with genotype.
private final Set<OverlappingPair> uniqueHetOverlappers; // pairs of variants (INV55/33, INV33/55) that overlap each other uniquely; these pairs will be analyzed

OverlapBasedFilteringResult(final Set<InvBreakEndContext> noOverlappers,
Have you thought about storing these breakpoints in a graph structure? Then you could traverse nodes in the graph and get their category on the fly instead of splitting them into multiple collections.
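A rough sketch of that idea, with breakpoints as nodes and mutual overlap as edges, so a record's category falls out of its degree on the fly; this is generic illustration code, not tied to the PR's classes:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class OverlapGraphSketch<V> {
    private final Map<V, Set<V>> adjacency = new HashMap<>();

    void addNode(final V v) {
        adjacency.computeIfAbsent(v, k -> new HashSet<>());
    }

    void addOverlapEdge(final V a, final V b) {
        adjacency.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // category is computed from the node's degree instead of pre-split collections
    String categorize(final V v) {
        final int degree = adjacency.getOrDefault(v, Collections.emptySet()).size();
        if (degree == 0) return "NO_OVERLAPPER";
        if (degree == 1) return "UNIQUE_PAIR";
        return "MULTIPLE_OVERLAPPERS";
    }
}
```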
* is NOT necessary
* to resolve which variants are here.
*/
abstract static class NonDegenerateScenario extends OverlappingPairHandler {
In general I think naming abstract subclasses after their superclass is a good idea for readability, i.e. call this NonDegenerateOverlappingPairHandler.
overlappingPair.threeIntervalLeftBoundary, overlappingPair.fiveIntervalRightBoundary);
final VariantContextBuilder inversionBuilder = makeInversion(invertedInterval, Allele.create(refBase, true));

if ( overlappingPair.threeIntervalLeftBoundary - overlappingPair.fiveIntervalLeftBoundary - 1 > 49 ) {
Don't we have the 50 value stored as a constant somewhere? Also, what's the rationale for using 50 bases here?
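If no shared constant exists yet, hoisting the magic number into a named threshold with a documented rationale would make the intent explicit; the name below is hypothetical, and it's worth checking whether GATK already defines an equivalent SV-size floor:

```java
final class ThresholdsSketch {
    // 50 bp is the conventional floor below which events are left to indel callers
    static final int MIN_FLANKING_EVENT_SIZE = 50;
}

// the "> 49" test above would then read as:
// if (threeLeft - fiveLeft - 1 >= ThresholdsSketch.MIN_FLANKING_EVENT_SIZE) { ... }
```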
}
}

private static final class FiveInterleavesThree extends NonDegenerateScenario {
A lot of the code between this class and the three -> five class below looks pretty similar. It'd be good to find opportunities for reuse to remove code duplication.
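One possible shape for that reuse, sketched as a template method: the shared driver lives in one abstract parent, and the two interleaving orders override only what differs. Everything here is a hypothetical stand-in, not the PR's actual classes:

```java
abstract class InterleavingHandlerSketch {
    // shared flow, identical for both interleaving orders
    final void handle(final int fiveLeft, final int fiveRight,
                      final int threeLeft, final int threeRight) {
        emitInversion(invertedStart(fiveLeft, threeLeft), invertedEnd(fiveRight, threeRight));
    }

    abstract int invertedStart(int fiveLeft, int threeLeft);
    abstract int invertedEnd(int fiveRight, int threeRight);

    void emitInversion(final int start, final int end) {
        System.out.printf("INV %d-%d%n", start, end); // placeholder for VariantContext building
    }
}

final class FiveInterleavesThreeSketch extends InterleavingHandlerSketch {
    @Override int invertedStart(final int fiveLeft, final int threeLeft) { return threeLeft; }
    @Override int invertedEnd(final int fiveRight, final int threeRight) { return fiveRight; }
}
```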
Answer: thanks for looking! That's what I intended, as I'm not sure whether the algorithm itself will face strong criticism. If it does, the time spent on coding isn't worth it, IMO. In fact, ideally I'd like to write a design doc or something similar, and only code after the design is agreed on.
Answer: Totally agree. Looking back now, it has become clearer to me that this proposal contains two parts, the filtering part and the breakpoint linking part, separated into two major classes.
Answer: Agree. And I think this kind of checking is not only good but mandatory for a good caller, i.e. one that takes advantage of prior knowledge. I am also thinking about improving it by checking whether there is short-read evidence supporting the BNDs (sometimes there isn't any, which makes it strange that the assembly alignments point to such novel adjacencies in the first place).
Answer: Agree. And it shouldn't be too hard to do that. But again, maybe in a different PR?
Answer: Note taken. I'll improve on this in the next polished implementation that is ready for line-by-line review.
Answer: Yes. But I'm not sure if by SV-Pipe you mean the RDTest repo? Sorry, I wasn't around when I was added to that repo, so I don't know the context. In addition, I believe this could be another future improvement?
Answer: Agree. And I think the final format can definitely change, as long as the resolved variants, be they deletions, inverted dispersed duplications, and/or real inversions, are registered in a VCF. The BED file that is currently produced is for development use, i.e. for me to check why a certain BND is not sent for analysis so that I can develop ideas on how to catch it in the future (BED files are IGV-friendly, you know), and it can certainly be dropped from the final output.
Answer: Agree. It doesn't have to be, at least in theory, and it probably would be faster since we don't need to incur the Spark overhead for such a typically small job. But (I'm saying too many buts...) up to this point all SV tools are under the package

All in all, I think the comments and critiques are generally about the "filtering"/"classifying" part, and the most serious concern there is false negatives. Am I understanding correctly? If so, given that the filtering step only picks the BNDs that are suitable for the linking logic, I can imagine the false-negative problem being solved in the future by other logic (e.g. a more relaxed requirement on pair matching, or not even requiring matched INV55/INV33 pairs, etc.). In fact, that's what I'm planning on.
Closing as it won't be merged anytime soon (only I care about it now).
This is for demo purposes only, so the code is not yet ready to be merged.
Description
background & goal
Currently, there are two parallel code paths for structural variation breakpoint location and type inference using local assembly contig alignment signatures in the pipeline StructuralVariationDiscoveryPipelineSpark:

* <INV> records, annotated with INV55 and INV33 for signaling whether a record is the left or the right breakpoint of the assumed inversion
* complex (<CPX>) variants from assembly contigs with more than 2 alignments

The tool proposed in this PR is based on manual review of a callset generated a long time ago (but still useful for studying the filtering of inversion breakpoints), and is designed to be integrated with the experimental code path.
proposed algo
input:

* BND records output by the upstream experimental code path, carrying MATE and PARTNER annotations (see figure below, left)
  * MATE: novel adjacency, i.e. contiguity on the sample that is absent on the reference (e.g. mobile element insertions, deletions)
  * PARTNER: novel disruption, i.e. contiguity on the reference that is disrupted on the sample (e.g. insertions, deletions)
* <CPX> calls, extracted by the tool proposed in PR #4602 ("(SV) re-interpreting CPX records by experimental interpretation tool")

stages:
1. Primitive filter on breakpoints:
   * filter if the implied event size is larger than $10^5$ (see Question below)
   * filter if a record overlaps with CPX calls (supposedly these should be captured already, or are more complex than what can be comprehended by the logic proposed here)

   The surviving mates are then converted to intervals bounded by the mates' locations. These "normal sized" variants are sent down for further analysis.

2. Filtering based on overlap signatures (total ~130 pairs of mates). Here we have several possible scenarios: no overlapper, multiple overlappers, uniquely overlapping same-type pairs (INV55/INV55, INV33/INV33), and uniquely overlapping mixed pairs (INV55/INV33), which are the pairs sent on for analysis (a toy classification sketch follows this stages list).
3. Type inference. Here we have four possible cases, each signaling what could be involved (a primed block is inverted):
   * ABC -> B'
   * ABC -> AC'B'A'C
   * ABC -> ABA' or AB'A'
   * ABC -> C'BC or C'B'C
Note that for the last two cases, where an inverted dispersed duplication is guaranteed, the two possible alternate alleles are reverse complements (i.e. inversions) of each other; hence signatures of contig alignments alone are not enough, and alignments of short reads within the affected region cannot break the degeneracy either. So we need to attach the left and right flanking regions to the affected region, align short reads back to these two candidate haplotypes, and study the pair orientations of the alignments to break the degeneracy.
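As a concrete illustration of the overlap-signature classification in stage 2 above, here is a toy sketch using htsjdk's OverlapDetector; the interval values and category names are made up, and the real tool operates on InvBreakEndContext records rather than bare intervals:

```java
import htsjdk.samtools.util.OverlapDetector;
import org.broadinstitute.hellbender.utils.SimpleInterval;

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OverlapSignatureSketch {
    public static void main(final String[] args) {
        // toy BND-derived intervals; real input comes from the mates' locations
        final List<SimpleInterval> intervals = Arrays.asList(
                new SimpleInterval("chr1", 100, 200),
                new SimpleInterval("chr1", 150, 250),
                new SimpleInterval("chr1", 500, 600));
        final OverlapDetector<SimpleInterval> detector = OverlapDetector.create(intervals);
        for (final SimpleInterval si : intervals) {
            final Set<SimpleInterval> others = new HashSet<>(detector.getOverlaps(si));
            others.remove(si); // a record trivially overlaps itself
            final String category = others.isEmpty()    ? "NO_OVERLAPPER"
                                  : others.size() == 1  ? "UNIQUE_PAIR"          // candidate for linking
                                  :                       "MULTIPLE_OVERLAPPERS"; // set aside for now
            System.out.println(si + " -> " + category);
        }
    }
}
```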
output:
* VCF containing the inversion and flanking deletion and dispersed duplication calls (together with the CPX-derived inversion calls, the total number of INV calls is ~30)
* BED file on filtered BNDs, citing the reason for filtering
Relevant files, including an IGV session, can be found in this bucket.
Todo:

* run checks against reference annotations: known Seg. Dup., RC-STR, centromere, as well as consistent pair support from short reads
  * Question: is this pre-filtering a bad idea; if not, how should it be reported?
  * Question: is the number $10^5$ reasonable?
  * Question: any other suggestions on filtering criteria?
* corner cases here and there in the proposed tool