Overhaul tests on SV assembly-based non-complex breakpoint and type inference code #4835
Conversation
Here are some comments for you, mostly focused on the refactoring and other code changes. I haven't been able to go through all of the tests in great detail but the few I've looked at seemed to make sense. Thanks for adding all these tests.
final VariantContextBuilder builder0 = new VariantContextBuilder(firstVar);
builder0.attribute(linkKey, secondVar.getID());

final VariantContextBuilder builder1 = new VariantContextBuilder(secondVar);
builder1.attribute(linkKey, firstVar.getID());
extra whitespace
removed
final Broadcast<SAMSequenceDictionary> broadcastSequenceDictionary,
final Broadcast<SVIntervalTree<VariantContext>> broadcastCNVCalls,
final String sampleId) throws IOException {
if ( inferredType instanceof BreakEndVariantType ) {
Since we're doing a lot of switching on the inferred type here and below on line 179, as a refactoring suggestion what if you moved this method to SVType subclasses? It seems like a good candidate for it.
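A minimal sketch of what that refactoring could look like; every class and method name here is made up for illustration and is not GATK's actual API:
// Hypothetical sketch only: each subtype supplies its own record-building logic,
// so callers no longer need "if (inferredType instanceof BreakEndVariantType)".
abstract class SvTypeSketch {
    abstract String buildRecord(String sampleId);   // hypothetical hook
}

final class BreakEndSketch extends SvTypeSketch {
    @Override
    String buildRecord(String sampleId) {
        return "BND record for " + sampleId;        // BND-specific branch lives here
    }
}

final class SimpleSvSketch extends SvTypeSketch {
    @Override
    String buildRecord(String sampleId) {
        return "simple SV record for " + sampleId;  // simple-variant branch lives here
    }
}

class PolymorphicDispatchDemo {
    public static void main(String[] args) {
        SvTypeSketch inferredType = new BreakEndSketch();
        System.out.println(inferredType.buildRecord("testSample"));
    }
}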
final Broadcast<SAMSequenceDictionary> broadcastSequenceDictionary,
final Broadcast<SVIntervalTree<VariantContext>> broadcastCNVCalls,
final String sampleId) {
if (broadcastCNVCalls == null || end < 0)
Why would end be less than zero here? That seems like a strange condition that should be caught upstream somewhere.
I'm still not sure about this end < 0 check; do you know why it's here?
EDIT
Sorry I cannot reply (it's "outdated"....), so I edited your message.
I forgot to remove them. Previously the check was there to guard against taking in the end of a BND record, which up to that point (with previous implementations) was negative (end was later converted to simply take the value of POS to populate VariantContextBuilder.stop(end); both the previous and the current implementation use POS for the value of stop when constructing the VC).
Is this referring to SvTYPE.NO_APPLICABLE_END? If so, please reference that. Also, what if we added methods hasApplicableEnd and hasApplicableLength to SvType and used those first instead of checking for negative numbers?
EDIT
Sorry I cannot reply (it's "outdated"....), so I edited your message.
Updated as suggested.
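A minimal, hypothetical sketch of what those guard methods could look like; the names hasApplicableEnd, hasApplicableLength, and NO_APPLICABLE_END come from this thread, while the sentinel values and everything else are assumed and not GATK's actual SvType:
// Hypothetical sketch, not the real SvType: callers ask the type whether END/SVLEN
// are meaningful instead of comparing against negative sentinel values themselves.
abstract class SvTypeWithGuardsSketch {
    static final int NO_APPLICABLE_END = -1;   // assumed sentinel
    static final int NO_APPLICABLE_LEN = -1;   // assumed sentinel

    final int end;
    final int svLen;

    SvTypeWithGuardsSketch(final int end, final int svLen) {
        this.end = end;
        this.svLen = svLen;
    }

    boolean hasApplicableEnd()    { return end != NO_APPLICABLE_END; }
    boolean hasApplicableLength() { return svLen != NO_APPLICABLE_LEN; }
}
A caller would then write if (svType.hasApplicableEnd()) rather than if (end < 0).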
private static boolean indicatesInv55(boolean bracketPointsLeft, boolean basesFirst) {
return (bracketPointsLeft && basesFirst);
@Override
public boolean equals(final Object o) {
You could remove these equals and hashcode methods if they are just calling super.
done
}

@Override
public boolean equals(final Object o) {
again could remove these
done
VariantContext actual = AnnotatedVariantProducer.produceAnnotatedVcFromAssemblyEvidence(del_chr20_28831535_29212083, simpleNovelAdjacencyAndChimericAlignmentEvidence, referenceBroadcast, refSeqDictBroadcast, null, "testSample").make();
VariantContextTestUtils.assertVariantContextsAreEqual(actual, expected, Collections.singletonList(HQ_MAPPINGS));

// cnv call available
Thanks for adding this test
// these are used for testing
public final boolean expectedFirstContigRegionHasLaterReferenceMapping;
public final SimpleChimera manuallyCuratedSimpleChimera;
Can you describe what you mean by "manually curated"? Is it just the expected values for these results? If so, can you rename them to just "expected"?
done
this.manuallyCuratedBiPathBubble = manuallyCuratedBiPathBubble;
this.manuallyCuratedSVTypes = manuallyCuratedSVTypes;
this.manuallyCuratedVariants = manuallyCuratedVariants;
this.inferencer = inferencer;
Why do you have to store the inferencer class along with the test data? Is there any way that you could make selecting the inferencer part of the test too? It seems like it allows the potentially confusing condition of writing a test but then choosing an inferencer that wouldn't actually be chosen by the system in an integration setting. Alternatively, why not just have different data providers for each inferencer subtype?
After talking in person, I think renaming it expectedInferencerClass better conveys the idea.
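A tiny, hypothetical sketch of that idea: the test data records which inferencer class the production code is expected to choose, and the test asserts against it. Only the name expectedInferencerClass comes from this thread; the surrounding classes are made up.
// Hypothetical sketch: test data carries the expected inferencer class,
// and the test checks the class actually chosen by the system against it.
class BreakpointTestCaseSketch {
    final Class<?> expectedInferencerClass;
    BreakpointTestCaseSketch(final Class<?> expectedInferencerClass) {
        this.expectedInferencerClass = expectedInferencerClass;
    }
}

class ExpectedInferencerDemo {
    public static void main(String[] args) {
        // stand-ins: pretend the system chose a StringBuilder "inferencer"
        BreakpointTestCaseSketch testCase = new BreakpointTestCaseSketch(StringBuilder.class);
        Object chosenInferencer = new StringBuilder();
        System.out.println(testCase.expectedInferencerClass.isInstance(chosenInferencer)); // true
    }
}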
public static final TestDataForSimpleSV forComplexTanDup_2to3_short_noPseudoHom_minus;

static {
try ( final ByteArrayOutputStream outputStream = new ByteArrayOutputStream() ){
Why is there a single outputStream shared by all the tests? It looks like most of the tests call reset on it before writing anything to it. Why have a shared one at all -- why not just do it in each test?
Also, do you need output streams at all? It looks like you are just using them to concatenate the portions of the chimera (flanks and homology). Why not just use string concatenation?
Was trying to avoid multiple try blocks. But using Strings removed the need for streams.
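A minimal sketch of the switch from a shared ByteArrayOutputStream to plain String concatenation; the variable names here are made up for illustration:
// Sketch only: build the contig sequence by concatenating Strings instead of
// resetting and writing into a shared ByteArrayOutputStream.
class ContigSequenceConcatDemo {
    public static void main(String[] args) {
        String leftFlank  = "ACGTACGT";
        String homology   = "GG";
        String rightFlank = "TTTTACGT";

        // before (roughly): outputStream.reset(); outputStream.write(leftFlank.getBytes()); ...
        // after: no stream, no try-with-resources, no IOException to declare
        byte[] contigSeq = (leftFlank + homology + rightFlank).getBytes();
        System.out.println(new String(contigSeq));
    }
}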
* Return a list of two entries for positive and reverse strand representations.
*/
private static List<TestDataForSimpleSV>
forSimpleDeletion(final ByteArrayOutputStream outputStream) throws IOException {
It would be nice in these tests if you could add a few comments to each one demarcating the input test data and the expected results. Otherwise it's hard to figure out where one stops and the other starts.
named all variables that are expected accordingly
Codecov Report
@@ Coverage Diff @@
## master #4835 +/- ##
===============================================
+ Coverage 80.468% 80.828% +0.359%
- Complexity 17852 19207 +1355
===============================================
Files 1093 1091 -2
Lines 64295 68880 +4585
Branches 10378 11597 +1219
===============================================
+ Hits 51737 55674 +3937
- Misses 8504 8996 +492
- Partials 4054 4210 +156
@cwhelan Hi Chris, I've incorporated most of the changes you suggested and separated them into 3 commits.
Would you take a look again when possible? Thanks!
public class AnnotatedVariantProducerUnitTest extends GATKBaseTest {

/**
* Hack to force trigger test data generation.
*/
@BeforeClass
private void makeSureDataIsAvailable() {
if(!SimpleSVDiscoveryTestDataProvider.testDataInitialized) {
new SimpleSVDiscoveryTestDataProvider();
if(!AssemblyBasedSVDiscoveryTestDataProviderForSimpleSV.testDataInitialized) {
yep. removed
final List<VariantContext> expectedVariants) throws IOException {
if (inferredTypes.size() == 1) {
VariantContextTestUtils.assertVariantContextsAreEqual(produceAnnotatedVcFromAssemblyEvidence(inferredTypes.get(0), novelAdjacencyAndAssemblyEvidence, broadcastReference, broadcastSequenceDictionary, null, sampleId).make(),
expectedVariants.get(0), Arrays.asList(CONTIG_NAMES, HQ_MAPPINGS)); // these two are omitted because: 1) HQ_MAPPINGS equal testing code is wrong (saying 1!=1), 2) CONTIG_NAMES will be tested in another test
I actually debugged this; if I recall correctly, it is saying that an integer 1 is not equal to a String "1" (note that HQ_MAPPINGS holds a simple integer), which is a strange feature of VC. So I stopped there.
This attribute is later tested in testMiscCases().
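The mismatch being described, sketched in plain Java (no htsjdk involved here; the point is simply that an annotation stored as an Integer in memory does not compare equal to the same value once it has been rendered and re-read as text):
// Toy illustration of "1 != 1": the in-memory attribute is an Integer,
// the round-tripped attribute is a String, so equals() is false.
class AttributeTypeMismatchDemo {
    public static void main(String[] args) {
        Object inMemoryValue = 1;      // e.g. HQ_MAPPINGS set as an int on the builder
        Object parsedValue   = "1";    // same annotation after a text round trip
        System.out.println(inMemoryValue.equals(parsedValue)); // false
    }
}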
/**
* @return the inferred type could be
* 1) a single entry for simple variants, or
* 2) a list of two entries for "replacement" case where the ref- and alt- path are both >=
*    {@link StructuralVariationDiscoveryArgumentCollection#STRUCTURAL_VARIANT_SIZE_LOWER_BOUND} bp long, or
* 3) a list of two entries with BND mates.
* It is safe, for now, to assume that ({@link SimpleSVType} and {@link BreakEndVariantType} are never mixed)
*/
@VisibleForTesting
public List<SvType> toSimpleOrBNDTypes(final ReferenceMultiSource reference, final SAMSequenceDictionary referenceDictionary) {

switch (type) {
Note taken.
List<VariantContext> annotatedVariants = discoverVariantsFromChimeras(svDiscoveryInputMetaData, parsedContigAlignments);
@SuppressWarnings("deprecation")
List<VariantContext> annotatedVariants =
org.broadinstitute.hellbender.tools.spark.sv.discovery.inference.ContigChimericAlignmentIterativeInterpreter
I need to suppress the deprecation warning, so I cannot import the class. That's my understanding after reading https://stackoverflow.com/a/20909204/2666587
The method in question is deprecated because it is the method that I'm trying to phase out, and it is used in three places:
- DiscoverVariantsFromContigAlignmentsSAMSpark: a tool I'm planning to phase out
- StructuralVariationDiscoveryPipelineSpark: before that tool is phased out, I need to call into it
- SegmentedCpxVariantSimpleVariantExtractor: for extracting simple variants from CPX calls (a working but really not beautiful solution)
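A self-contained sketch of the pattern in the snippet above (all names made up; only the idea comes from this thread): the deprecated member is referenced where it is used, the warning is suppressed on that declaration, and since @SuppressWarnings cannot be attached to an import statement, the real code spells out the fully qualified class name instead of importing it.
// Hypothetical, self-contained illustration of suppressing a deprecation warning
// at the use site rather than on an import.
class DeprecationSuppressionDemo {

    static class LegacyInterpreter {                 // stand-in for the phased-out worker class
        @Deprecated
        static String discoverVariantsFromChimeras() { return "annotated variants"; }
    }

    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        // in the real code the class is written fully qualified
        // (org.broadinstitute.hellbender...ContigChimericAlignmentIterativeInterpreter)
        String annotatedVariants = LegacyInterpreter.discoverVariantsFromChimeras();
        System.out.println(annotatedVariants);
    }
}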
@@ -136,7 +120,7 @@ protected void runTool(final JavaSparkContext ctx) {
StructuralVariationDiscoveryPipelineSpark.broadcastCNVCalls(ctx, getHeaderForReads(),
discoverStageArgs.cnvCallsFile);

final String vcfOutputFile = prefixForOutput + "_" + SVUtils.getSampleId(getHeaderForReads()) + "_inv_del_ins.vcf";
final String vcfOutputFile = prefixForOutput + (prefixForOutput.endsWith("/") ? "" : "_") + SVUtils.getSampleId(getHeaderForReads()) + "_inv_del_ins.vcf";
Addressed with a new method getVcfOutputPath(), and did the same for SvDiscoverFromLocalAssemblyContigAlignmentsSpark.
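A rough sketch of what such a helper might look like, pieced together only from the two snippets quoted in this thread; the real getVcfOutputPath() may differ, and the directory check here uses isDirectory whereas the quoted code checks Files.exists:
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch: if the output prefix is an existing directory, write into it;
// otherwise treat the prefix as the leading part of the output file name.
class VcfOutputPathDemo {

    static String getVcfOutputPath(final String prefixForOutput, final String sampleId) {
        final String separator = Files.isDirectory(Paths.get(prefixForOutput)) ? "/" : "_";
        return prefixForOutput + separator + sampleId + "_inv_del_ins.vcf";
    }

    public static void main(String[] args) {
        System.out.println(getVcfOutputPath("/tmp", "testSample"));       // /tmp/testSample_inv_del_ins.vcf
        System.out.println(getVcfOutputPath("/tmp/runA", "testSample"));  // /tmp/runA_testSample_inv_del_ins.vcf (if runA does not exist)
    }
}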
return ins;
} else {
if (narl.getStrandSwitch() == StrandSwitch.NO_SWITCH) {
return narl.getComplication().getInsertedSequenceForwardStrandRep();
using suggested change. Thanks!
private final boolean indicatesInv55;
@VisibleForTesting
static String extractInsertedSequence(final NovelAdjacencyAndAltHaplotype narl, final boolean forUpstreamLoc) {
final String ins = narl.getComplication().getInsertedSequenceForwardStrandRep();
There are actually two issues at hand here:
- We could have a reverse strand representation of the event; this is covered by BreakpointComplications.reverseComplementIfNecessary()
- We could have a right/downstream BND record, for which the inserted sequence (and the homology as well) must be the RC of that of the left/upstream BND record.
The second case is very confusing, and is the case handled here.
Picture the left breakpoint of an inversion, where the ref is
GGGGGGG .......... TTTTTTTTT
and the alt is (+ strand rep)
GGGGGGGXXXXXXAAAAAAA.....
where XXXXXX is inserted (assume its RC is Y), and the (-) strand rep is
..........TTTTTTTTYYYYYYYCCCCCCC.
Unfortunately, this (-) strand rep is also the (+) strand representation of the downstream BND record.
And this holds for homology as well....... 🙀
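A tiny, self-contained illustration of the second point (toy sequences and a hand-rolled reverse complement; not the actual GATK code): the downstream/right BND mate reports the reverse complement of the inserted sequence that the upstream/left mate reports.
// Toy sketch: same inserted sequence, reported as-is for the upstream mate
// and reverse-complemented for the downstream mate.
class BndInsertedSequenceDemo {

    static String reverseComplement(final String seq) {
        final StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            switch (seq.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'G': sb.append('C'); break;
                case 'C': sb.append('G'); break;
                default:  sb.append('N');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        final String insertedForwardStrandRep = "AACG";
        System.out.println("upstream mate:   " + insertedForwardStrandRep);                     // AACG
        System.out.println("downstream mate: " + reverseComplement(insertedForwardStrandRep));  // CGTT
    }
}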
Looks better @SHuang-Broad. A few more comments and questions.
// manually remove inserted sequence information from RPL event-produced DEL, when it can be linked with an INS
if (linkedVariants._1 instanceof SimpleSVType.Deletion)
return Arrays.asList(builder1.rmAttribute(GATKSVVCFConstants.INSERTED_SEQUENCE).rmAttribute(GATKSVVCFConstants.INSERTED_SEQUENCE_LENGTH).rmAttribute(GATKSVVCFConstants.SEQ_ALT_HAPLOTYPE).make(),
Can you shorten these lines for readability?
done
@VisibleForTesting
static String extractInsertedSequence(final NovelAdjacencyAndAltHaplotype narl, final boolean forUpstreamLoc) {
final String ins = narl.getComplication().getInsertedSequenceForwardStrandRep();
if (ins.isEmpty()) {
Does reverse complement throw an exception if you give it an empty string? If not, you could skip this check and make this method a 2-liner.
no it doesn't, according to a unit test I wrote temporarily. So updated as suggested. Thanks!
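What that 2-liner could look like, sketched with stand-in arguments; this assumes an RC helper along the lines of htsjdk's SequenceUtil.reverseComplement (which returns the empty string unchanged), and the real method takes the NovelAdjacencyAndAltHaplotype rather than a plain String:
import htsjdk.samtools.util.SequenceUtil;

// Hedged sketch of the simplified method: no empty-string special case needed.
class ExtractInsertedSequenceSketch {

    static String extractInsertedSequence(final String insertedSeqForwardStrandRep,
                                          final boolean forUpstreamLoc) {
        return forUpstreamLoc ? insertedSeqForwardStrandRep
                              : SequenceUtil.reverseComplement(insertedSeqForwardStrandRep);
    }

    public static void main(String[] args) {
        System.out.println(extractInsertedSequence("AACG", false));       // CGTT
        System.out.println(extractInsertedSequence("", false).isEmpty()); // true
    }
}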
List<VariantContext> annotatedVariants = discoverVariantsFromChimeras(svDiscoveryInputMetaData, parsedContigAlignments);
@SuppressWarnings("deprecation")
List<VariantContext> annotatedVariants =
org.broadinstitute.hellbender.tools.spark.sv.discovery.inference.ContigChimericAlignmentIterativeInterpreter
OK, but if we're still using it why bother deprecating it? I'd use the actual deprecation tag for parts of a public API or something, but here we don't really need to warn ourselves about it, do we? It just seems counterproductive to deprecate the method and then add warning suppressors everywhere we call it.
EDIT
Sorry I cannot reply (it's "outdated"....), so I edited your message.
Agree. So I removed the deprecated annotation on the method, and fixed the warning suppressions everywhere.
} finally {
narlsAndSources.unpersist();
private String getVcfOutputPath() {
if ( java.nio.file.Files.exists(Paths.get(prefixForOutput)) ) {
Can't you import this class?
corrected.
@@ -287,14 +301,49 @@ public boolean equals(Object obj) {
* NOTE THAT this does not handle case when one of the duplication copy is inverted;
* for that see {@link IntraChrStrandSwitchBreakpointComplications} (possibly partially).
*/
abstract static class SmallDuplicationBreakpointComplications extends BreakpointComplications {
abstract public static class SmallDuplicationBreakpointComplications extends BreakpointComplications {
Put public first in the keywords list.
done
IntraChrStrandSwitchBreakpointComplications(final SimpleChimera simpleChimera, final byte[] contigSeq, final boolean firstAfterSecond) {
resolveComplicationForSimpleStrandSwitch(simpleChimera, contigSeq, firstAfterSecond);
// if ( simpleChimera.isCandidateInvertedDuplication() ) {
remove commented code
removed
@@ -343,7 +366,8 @@ void resolveComplications(final SimpleChimera simpleChimera, final byte[] contig
protected BNDTypeBreakpointsInference(final SimpleChimera simpleChimera,
final byte[] contigSequence,
final SAMSequenceDictionary referenceDictionary) {
super(simpleChimera, contigSequence);
super(simpleChimera, contigSequence, referenceDictionary);
altHaplotypeSequence = new byte[]{};
But we do have the assembled haplotype. It's true that it's somewhat redundant, but I was thinking it might be nice to have there for consistency with the other variant types. I forget -- do we include any padding in the alt haplotype, or is it really just the inserted sequence in this case?
EDIT
Sorry I cannot reply (it's "outdated"....), so I edited your message.
No this doesn't pad around the breakpoints; I think Valentine does some padding when he does genotyping, but not here.
I made a last, optional commit that puts back this annotation for BND records, but I'm not sure how much value it adds.
final boolean firstSegmentNeighborsHeadAlignment = basicInfo.forwardStrandRep ? (firstSegment.getStart() - head.referenceSpan.getEnd() == 1)
: (head.referenceSpan.getStart() - firstSegment.getEnd() == 1);
if ( ! firstSegmentNeighborsHeadAlignment )
final boolean expectedCase = basicInfo.forwardStrandRep ? (firstSegment.getStart() - head.referenceSpan.getEnd() == 1)
The old name for this variable (firstSegmentNeighborsHeadAlignment) seemed more descriptive.
reverted back
if (isCandidateForFatInsertion()) {
return new SimpleInterval(leftJustifiedLeftRefLoc.getContig(), leftJustifiedLeftRefLoc.getStart(), leftJustifiedRightRefLoc.getEnd());
} else
throw new UnsupportedOperationException("trying to get interval from a novel adjacency that is no a fat insertion: " + toString());
"is not a"
corrected
👍
Looks good, apart from the optional commit, which I think we should hold back to be its own unit of work, as discussed in person.
b0008d8 to aee9931:
… inference code
Also
* expand cases on BND format variants
* change to how VariantContext are extracted from NARL and SVType
* refactor the tool DiscoverVariantsFromContigAlignmentsSAMSpark into a new worker class ContigChimericAlignmentIterativeInterpreter
aee9931 to 7c291ba:
… inference code (#4835)
Also
* expand cases on BND format variants
* change to how VariantContext are extracted from NARL and SVType
* refactor the tool DiscoverVariantsFromContigAlignmentsSAMSpark into a new worker class ContigChimericAlignmentIterativeInterpreter
This is to overhaul tests on SV assembly-based non-complex breakpoint and type inference code.
What is done:
Classes affected in main:
- AnnotatedVariantProducer: methods are grouped together and renamed to reflect whether the annotations added are assembly-specific or use short reads
- BreakEndVariantType: more detailed types (mostly about how the alt allele is laid out, with the ref bases and square brackets)
- BreakpointComplications and BreakpointsInference: mostly to add trivial methods used only in tests
- DiscoverVariantsFromContigAlignmentsSAMSpark: now a thin CLI, where functionalities are refactored into the new class ContigChimericAlignmentIterativeInterpreter
- SimpleChimera: new documented and tested method firstContigRegionRefSpanAfterSecond, plus trivial test-related code