-
Notifications
You must be signed in to change notification settings - Fork 588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an option to merge intervals for better GGVCFs performance on GDB exome input #5741
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5741 +/- ##
===============================================
+ Coverage 87.042% 87.063% +0.021%
- Complexity 32163 32173 +10
===============================================
Files 1974 1978 +4
Lines 147466 147493 +27
Branches 16232 16213 -19
===============================================
+ Hits 128357 128412 +55
+ Misses 13189 13171 -18
+ Partials 5920 5910 -10
|
4ab8c1c
to
430c7c1
Compare
Tests pass! (after rebasing to remove the offending ReadsPipeline Spark test) |
@droazen can you reassign this? I'm guessing it's going to be a while before Louis gets to it and this is a feature we're using in production. It would be great to get this in the release. |
@jamesemery can you review this one when you get a chance? It's only ~100 lines or so. |
430c7c1
to
3cc21d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally don't want to expose certain engine arguments to client (non-walker) tools. Most of my comments pertain to refactoring this branch so you can avoid unduly exposing fields in GATKTool that can be avoided.
@@ -111,6 +111,15 @@ | |||
return getTraversalParameters(sequenceDict).getIntervalsForTraversal(); | |||
} | |||
|
|||
/** | |||
* Get the largest interval per contig that contains the intervals specified on the command line. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
largest
-> smallest
or minimal
or some other rewording to clarify what this is doing.
@@ -157,7 +157,7 @@ | |||
* Walker base classes (ReadWalker, etc.) are responsible for hooking these intervals up to | |||
* their particular driving data source. | |||
*/ | |||
List<SimpleInterval> userIntervals; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would prefer not to expose these engine fields for safety purposes. As of right now these fields are only exposed to engine walkers etc... I would say revert this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm... it looks like the tests/etc would be made somewhat easier if there were a getter for this. It should be harmless to add a getUserIntervals()
that just returns an unmodifiable list of these simple intervals.
@@ -432,7 +432,7 @@ void initializeFeatures() { | |||
* Package-private so that engine classes can access it, but concrete tool child classes cannot. | |||
* May be overridden by traversals that require custom initialization of intervals. | |||
*/ | |||
void initializeIntervals() { | |||
protected void initializeIntervals() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we don't necessarily want to expose the initializeIntervals()
argument to tool authors (mostly because we don't want to worry about tool authors changing the internal user intervals in unsafe ways. Since what you primarily need right here is to maybe mutate the user intervals on this line: userIntervals = intervalArgumentCollection.getIntervals(sequenceDictionary);
I would consider adding a seperate method to do this.
Add a protected method transformTraversalIntervals(getIntervals, sequenceDictionary)
that returns a list of intervals and make that what GATKTool sets as userIntervals
. Then you can add the mutation you need to genomicsDBImport and leave the default behavior untouched without exposing anything dangerously by just making the default behavior return the supplied interval list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you revert the change of initializeIntervals()
to protected?
@@ -189,13 +217,22 @@ public void onTraversalStart() { | |||
|
|||
final VCFHeader inputVCFHeader = getHeaderForVariants(); | |||
|
|||
if (hasUserSuppliedIntervals()) { | |||
if (mergeInputIntervals) { | |||
intervals = intervalArgumentCollection.getSpanningIntervals(getBestAvailableSequenceDictionary()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be shared code across the two places (here and initializeIntervals()
). Alternatively you could add a safe getter to userIntervals
that will let you access these intervals after the merge.
Utils.resetRandomGenerator(); | ||
runCommandLine(args); | ||
|
||
// if this isn't working it will take *FOREVER* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this as a solution for our tests. This could easily cause inexplicable failures on travis that nobody knows how to solve. TestNG has a test timeout function that might be applicable but i'm not sure I like the risk that introduces when there are different hardware configurations to worry about. I think you should expose the userIntervalList with a getter and then use that to explicitly assert that the intervals were merged correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do need to test that the mergeIntervals arg works as expected. I added an actual comparison. If you're still concerned about the runtime if something breaks, then I can reduce the number of intervals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two minor comments. Once those are addressed this can be merged.
@@ -432,7 +432,7 @@ void initializeFeatures() { | |||
* Package-private so that engine classes can access it, but concrete tool child classes cannot. | |||
* May be overridden by traversals that require custom initialization of intervals. | |||
*/ | |||
void initializeIntervals() { | |||
protected void initializeIntervals() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you revert the change of initializeIntervals()
to protected?
@@ -232,6 +245,32 @@ public void assertMatchingGenotypesFromTileDB(File input, File expected, Locatab | |||
runGenotypeGVCFSAndAssertSomething(genomicsDBUri, expected, NO_EXTRA_ARGS, VariantContextTestUtils::assertVariantContextsHaveSameGenotypes, reference); | |||
} | |||
|
|||
@Test(dataProvider = "getGVCFsForGenomicsDBOverMultipleIntervals") | |||
public void testGenotypeGVCFsMultiIntervalGDBQuery(File input, File expected, List<Locatable> intervals, String reference) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you really expect this test to be quick and only take 0.06 seconds we should probably have a contingency plan in place so this fails gracefully on travis (my fear is of a world where this fails and the signal is travis times out due to inactivity). You could add to this test a testNG timeout option timeOut = 1000
which would be 1000 ms or some more reasonable gross overestimation of the runtime that we expect to fail if this branch didin't work and use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops! Looks like I gave you a poor choice of number. I forgot to account for the runtime units being in terms of minutes. Sorry about that...
}; | ||
} | ||
|
||
@Test(dataProvider = "TestGetTransformedTraversalIntervalsProvider") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this test
Change test timeout expectations (TIL the value is in milliseconds)
@@ -238,7 +238,7 @@ public void testGenotypesOnly(File input, File expected, List<String> extraArgs, | |||
} | |||
|
|||
//this only tests single-sample | |||
@Test(dataProvider = "getGVCFsForGenomicsDB", timeOut = 1000) | |||
@Test(dataProvider = "getGVCFsForGenomicsDB", timeOut = 1000000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops! Looks like I gave you a poor choice of number. I forgot to account for the runtime units being in terms of minutes...
@@ -232,6 +245,32 @@ public void assertMatchingGenotypesFromTileDB(File input, File expected, Locatab | |||
runGenotypeGVCFSAndAssertSomething(genomicsDBUri, expected, NO_EXTRA_ARGS, VariantContextTestUtils::assertVariantContextsHaveSameGenotypes, reference); | |||
} | |||
|
|||
@Test(dataProvider = "getGVCFsForGenomicsDBOverMultipleIntervals") | |||
public void testGenotypeGVCFsMultiIntervalGDBQuery(File input, File expected, List<Locatable> intervals, String reference) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops! Looks like I gave you a poor choice of number. I forgot to account for the runtime units being in terms of minutes. Sorry about that...
Rather than start off the exome genotyping task with a step to merge the query intervals for better GDB performance, make it an option as for Import in #5540
This is necessary to get the exome joint calling pipeline onto official GATK release dockers.