-
Notifications
You must be signed in to change notification settings - Fork 588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Definine GenomicsDB "partitions" over the span of the input intervals in #5540
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5540 +/- ##
===============================================
+ Coverage 87.068% 87.075% +0.007%
- Complexity 31324 31332 +8
===============================================
Files 1921 1921
Lines 144579 144598 +19
Branches 15949 15952 +3
===============================================
+ Hits 125882 125908 +26
+ Misses 12905 12896 -9
- Partials 5792 5794 +2
|
order to dramatically improve exome performance
db1038b
to
cb54aad
Compare
@lbergelson you were totally right. I pulled out a flag to turn on interval merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ldgauthier Looks good, needs a test for the new method. Thank you for checking into the behavior. We could potentially drive this down into the interval argument collection itself as a new merging mode, but I'm not sure if that's useful in general or a liability when other people incorrectly specify it.
shortName = MERGE_INPUT_INTERVALS_LONG_NAME, | ||
doc = "Boolean flag to import all data in between intervals. Improves performance using large lists of " + | ||
"intervals, as in exome sequencing, especially if GVCF data only exists for specified intervals.") | ||
private boolean mergeInputIntervals = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good... I guess the other alternative would be to add a new interval merging rule to the interval argument collection and push this behavior into the engine.
final ArgumentsBuilder args = new ArgumentsBuilder(); | ||
args.addArgument(GenomicsDBImport.WORKSPACE_ARG_LONG_NAME, workspace); | ||
intervals.forEach(interval -> args.addArgument("L", IntervalUtils.locatableToString(interval))); | ||
vcfInputs.forEach(vcf -> args.addArgument("V", vcf)); | ||
args.addArgument("batch-size", String.valueOf(batchSize)); | ||
args.addArgument(GenomicsDBImport.VCF_INITIALIZER_THREADS_LONG_NAME, String.valueOf(threads)); | ||
if (useBufferSize) | ||
if (mergeIntervals) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could use args.addBooleanArgument(GenomicsDBImport.MERGE_INPUT_INTERVALS_LONG_NAME, mergeIntervals)
and drop the if
@@ -151,14 +151,16 @@ public void testGenomicsDBImportFileInputsWithMultipleIntervals() throws IOExcep | |||
testGenomicsDBImporter(LOCAL_GVCFS, MULTIPLE_INTERVALS, COMBINED_MULTI_INTERVAL, b38_reference_20_21, true, 1); | |||
} | |||
|
|||
private void testGenomicsDBImportWith1000Intervals() throws IOException { | |||
@Test | |||
public void testGenomicsDBImportWith1000Intervals() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rename this to say something about the interval merging.
src/main/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBImport.java
Show resolved
Hide resolved
d5d4564
to
e745f5e
Compare
Comments addressed -- back to you @lbergelson |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Looks good 👍
… exome input (#5741) Rather than start off the exome genotyping task with a step to merge the query intervals for better GDB performance, make it an option as for Import in #5540 This is necessary to get the exome joint calling pipeline onto official GATK release dockers. GenotypeGVCFs also performs better reading from a GenomicsDB as a single interval
order to dramatically improve exome performance
This makes an 81X speedup in the 1000interval test I put back in. Exomes are just about unrunnable without some sort of interval manipulation. The 65 sample exome joint calling callset output was exactly the same with this version. I also did a comparison for a chr1 and chr20 import (so that we were looking at two intervals that were far apart) and the runtime was the same. This is a huge improvement for some typical use cases.