Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an option to merge intervals for better GGVCFs performance on GDB exome input #5741

Merged
merged 6 commits into from
Mar 28, 2019

Conversation

ldgauthier
Copy link
Contributor

Rather than start off the exome genotyping task with a step to merge the query intervals for better GDB performance, make it an option as for Import in #5540

This is necessary to get the exome joint calling pipeline onto official GATK release dockers.

@codecov-io
Copy link

codecov-io commented Mar 1, 2019

Codecov Report

Merging #5741 into master will increase coverage by 0.021%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #5741       +/-   ##
===============================================
+ Coverage     87.042%   87.063%   +0.021%     
- Complexity     32163     32173       +10     
===============================================
  Files           1974      1978        +4     
  Lines         147466    147493       +27     
  Branches       16232     16213       -19     
===============================================
+ Hits          128357    128412       +55     
+ Misses         13189     13171       -18     
+ Partials        5920      5910       -10
Impacted Files Coverage Δ Complexity Δ
...ls/genomicsdb/GenomicsDBImportIntegrationTest.java 88.546% <ø> (+0.076%) 81 <0> (+1) ⬆️
...va/org/broadinstitute/hellbender/GATKBaseTest.java 98.214% <100%> (-1.786%) 8 <0> (+1)
...dinstitute/hellbender/engine/GATKToolUnitTest.java 91.611% <100%> (+0.595%) 79 <8> (+8) ⬆️
.../hellbender/tools/genomicsdb/GenomicsDBImport.java 78.439% <100%> (+0.243%) 55 <2> (+2) ⬆️
...er/tools/walkers/GenotypeGVCFsIntegrationTest.java 85.714% <100%> (+1.602%) 42 <3> (+3) ⬆️
...titute/hellbender/tools/walkers/GenotypeGVCFs.java 91.286% <100%> (+1.009%) 103 <0> (+9) ⬆️
...org/broadinstitute/hellbender/engine/GATKTool.java 91.204% <100%> (+0.041%) 102 <1> (+1) ⬆️
...alkers/mutect/filtering/ReadOrientationFilter.java 78.947% <0%> (-15.497%) 10% <0%> (+1%)
...walkers/mutect/filtering/Mutect2VariantFilter.java 77.778% <0%> (-11.111%) 9% <0%> (-1%)
...nder/utils/io/DeleteRecursivelyOnExitPathHook.java 70% <0%> (-5%) 3% <0%> (ø)
... and 83 more

@ldgauthier
Copy link
Contributor Author

Tests pass! (after rebasing to remove the offending ReadsPipeline Spark test)

@ldgauthier
Copy link
Contributor Author

@droazen can you reassign this? I'm guessing it's going to be a while before Louis gets to it and this is a feature we're using in production. It would be great to get this in the release.

@droazen droazen requested review from jamesemery and removed request for lbergelson March 25, 2019 19:36
@droazen droazen assigned jamesemery and unassigned lbergelson Mar 25, 2019
@droazen
Copy link
Collaborator

droazen commented Mar 25, 2019

@jamesemery can you review this one when you get a chance? It's only ~100 lines or so.

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally don't want to expose certain engine arguments to client (non-walker) tools. Most of my comments pertain to refactoring this branch so you can avoid unduly exposing fields in GATKTool that can be avoided.

@@ -111,6 +111,15 @@
return getTraversalParameters(sequenceDict).getIntervalsForTraversal();
}

/**
* Get the largest interval per contig that contains the intervals specified on the command line.
Copy link
Collaborator

@jamesemery jamesemery Mar 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

largest -> smallest or minimal or some other rewording to clarify what this is doing.

@@ -157,7 +157,7 @@
* Walker base classes (ReadWalker, etc.) are responsible for hooking these intervals up to
* their particular driving data source.
*/
List<SimpleInterval> userIntervals;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would prefer not to expose these engine fields for safety purposes. As of right now these fields are only exposed to engine walkers etc... I would say revert this change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... it looks like the tests/etc would be made somewhat easier if there were a getter for this. It should be harmless to add a getUserIntervals() that just returns an unmodifiable list of these simple intervals.

@@ -432,7 +432,7 @@ void initializeFeatures() {
* Package-private so that engine classes can access it, but concrete tool child classes cannot.
* May be overridden by traversals that require custom initialization of intervals.
*/
void initializeIntervals() {
protected void initializeIntervals() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we don't necessarily want to expose the initializeIntervals() argument to tool authors (mostly because we don't want to worry about tool authors changing the internal user intervals in unsafe ways. Since what you primarily need right here is to maybe mutate the user intervals on this line: userIntervals = intervalArgumentCollection.getIntervals(sequenceDictionary); I would consider adding a seperate method to do this.

Add a protected method transformTraversalIntervals(getIntervals, sequenceDictionary) that returns a list of intervals and make that what GATKTool sets as userIntervals. Then you can add the mutation you need to genomicsDBImport and leave the default behavior untouched without exposing anything dangerously by just making the default behavior return the supplied interval list.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert the change of initializeIntervals() to protected?

@@ -189,13 +217,22 @@ public void onTraversalStart() {

final VCFHeader inputVCFHeader = getHeaderForVariants();

if (hasUserSuppliedIntervals()) {
if (mergeInputIntervals) {
intervals = intervalArgumentCollection.getSpanningIntervals(getBestAvailableSequenceDictionary());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be shared code across the two places (here and initializeIntervals()). Alternatively you could add a safe getter to userIntervals that will let you access these intervals after the merge.

Utils.resetRandomGenerator();
runCommandLine(args);

// if this isn't working it will take *FOREVER*
Copy link
Collaborator

@jamesemery jamesemery Mar 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this as a solution for our tests. This could easily cause inexplicable failures on travis that nobody knows how to solve. TestNG has a test timeout function that might be applicable but i'm not sure I like the risk that introduces when there are different hardware configurations to worry about. I think you should expose the userIntervalList with a getter and then use that to explicitly assert that the intervals were merged correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do need to test that the mergeIntervals arg works as expected. I added an actual comparison. If you're still concerned about the runtime if something breaks, then I can reduce the number of intervals.

@jamesemery jamesemery assigned ldgauthier and kreinold and unassigned jamesemery and kreinold Mar 25, 2019
Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor comments. Once those are addressed this can be merged.

@@ -432,7 +432,7 @@ void initializeFeatures() {
* Package-private so that engine classes can access it, but concrete tool child classes cannot.
* May be overridden by traversals that require custom initialization of intervals.
*/
void initializeIntervals() {
protected void initializeIntervals() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert the change of initializeIntervals() to protected?

@@ -232,6 +245,32 @@ public void assertMatchingGenotypesFromTileDB(File input, File expected, Locatab
runGenotypeGVCFSAndAssertSomething(genomicsDBUri, expected, NO_EXTRA_ARGS, VariantContextTestUtils::assertVariantContextsHaveSameGenotypes, reference);
}

@Test(dataProvider = "getGVCFsForGenomicsDBOverMultipleIntervals")
public void testGenotypeGVCFsMultiIntervalGDBQuery(File input, File expected, List<Locatable> intervals, String reference) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you really expect this test to be quick and only take 0.06 seconds we should probably have a contingency plan in place so this fails gracefully on travis (my fear is of a world where this fails and the signal is travis times out due to inactivity). You could add to this test a testNG timeout option timeOut = 1000 which would be 1000 ms or some more reasonable gross overestimation of the runtime that we expect to fail if this branch didin't work and use that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops! Looks like I gave you a poor choice of number. I forgot to account for the runtime units being in terms of minutes. Sorry about that...

};
}

@Test(dataProvider = "TestGetTransformedTraversalIntervalsProvider")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this test

@jamesemery jamesemery assigned ldgauthier and unassigned jamesemery Mar 27, 2019
Change test timeout expectations (TIL the value is in milliseconds)
@@ -238,7 +238,7 @@ public void testGenotypesOnly(File input, File expected, List<String> extraArgs,
}

//this only tests single-sample
@Test(dataProvider = "getGVCFsForGenomicsDB", timeOut = 1000)
@Test(dataProvider = "getGVCFsForGenomicsDB", timeOut = 1000000)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops! Looks like I gave you a poor choice of number. I forgot to account for the runtime units being in terms of minutes...

@@ -232,6 +245,32 @@ public void assertMatchingGenotypesFromTileDB(File input, File expected, Locatab
runGenotypeGVCFSAndAssertSomething(genomicsDBUri, expected, NO_EXTRA_ARGS, VariantContextTestUtils::assertVariantContextsHaveSameGenotypes, reference);
}

@Test(dataProvider = "getGVCFsForGenomicsDBOverMultipleIntervals")
public void testGenotypeGVCFsMultiIntervalGDBQuery(File input, File expected, List<Locatable> intervals, String reference) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops! Looks like I gave you a poor choice of number. I forgot to account for the runtime units being in terms of minutes. Sorry about that...

@droazen droazen merged commit 0f43867 into master Mar 28, 2019
@droazen droazen deleted the ldg_GGVCFsSpanIntervals branch March 28, 2019 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants