-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fasta gz support in gatk #5120
fasta gz support in gatk #5120
Conversation
c6bf892
to
2870e36
Compare
Codecov Report
@@ Coverage Diff @@
## master #5120 +/- ##
===============================================
+ Coverage 86.797% 86.799% +0.002%
- Complexity 29706 29711 +5
===============================================
Files 1820 1820
Lines 137384 137406 +22
Branches 15151 15160 +9
===============================================
+ Hits 119245 119267 +22
+ Misses 12651 12635 -16
- Partials 5488 5504 +16
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review complete, back to @lbergelson. Found one bug plus a couple of other issues, but looks good overall.
@@ -94,26 +77,21 @@ public CachingIndexedFastaSequenceFile(final Path fasta, final FastaSequenceInde | |||
* @param cacheSize the size of the cache to use in this CachingIndexedFastaReader, must be >= 0 | |||
* @param preserveCase If true, we will keep the case of the underlying bases in the FASTA, otherwise everything is converted to upper case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs javadoc for the preserveIUPAC
argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
this(fasta, index, cacheSize, false, false); | ||
private static ReferenceSequenceFile requireIndex(Path fasta, ReferenceSequenceFile referenceSequenceFile) { | ||
if( !referenceSequenceFile.isIndexed()) { | ||
throw new UserException("Couldn't create indexed fasta: " + fasta.toAbsolutePath()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Fasta index not found, but required" would be a better error message, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error reporting here is kind of bad, there's no easy way to tell what the problem was. I added a gross block of error diagnosing code, let me know what you think...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I punted here and moved the error checking all to be in the initial checks. This throws gatk error now if it ends up here without being indexed since we should be preventing that up front.
if( !referenceSequenceFile.isIndexed()) { | ||
throw new UserException("Couldn't create indexed fasta: " + fasta.toAbsolutePath()); | ||
} | ||
return referenceSequenceFile; | ||
} | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why keep any of these constructors public? Why not force all clients to go through the checkAndCreate()
factory methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote it to be only constructors and to make them all perform the necessary safety checks. It's much less confusing now I think.
@@ -136,7 +114,7 @@ public CachingIndexedFastaSequenceFile(final Path fasta) throws FileNotFoundExce | |||
* @param fasta The file to open. | |||
* @param preserveCase If true, we will keep the case of the underlying bases in the FASTA, otherwise everything is converted to upper case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constructor docs should note that IUPAC bases will not be preserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -221,7 +194,7 @@ public static CachingIndexedFastaSequenceFile checkAndCreate(final Path fastaPat | |||
* @param fasta The file to open. | |||
* @param cacheSize the size of the cache to use in this CachingIndexedFastaReader, must be >= 0 | |||
*/ | |||
public CachingIndexedFastaSequenceFile(final Path fasta, final long cacheSize ) throws FileNotFoundException { | |||
public CachingIndexedFastaSequenceFile(final Path fasta, final long cacheSize ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be encapsulated, clients should go through checkAndCreate()
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add overload of checkAndCreate()
that lets the client specify the cache size, and any other overloads necessary to replicate the functionality of the constructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this the other way around, getting rid of check and create and making all the constructors safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the unused constructor to set cache size, but exposing the full argument constructor as public so people can do what they want
// times. | ||
final int minExpectedHits = (int) Math.floor( | ||
(Math.min(caching.getCacheSize(), contig.getSequenceLength()) - querySize * 2.0) / STEP_SIZE); | ||
caching.printEfficiency(Level.WARN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, WARN
seems inappropriate here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does
int middleStart = (contig.getSequenceLength() - querySize) / 2; | ||
int middleStop = middleStart + querySize; | ||
|
||
logger.warn(String.format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here too I question the use of warn()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check
if (stop <= contig.getSequenceLength()) { | ||
ReferenceSequence grabMiddle = caching.getSubsequenceAt(contig.getSequenceName(), middleStart, | ||
middleStop); | ||
ReferenceSequence cachedVal = caching.getSubsequenceAt(contig.getSequenceName(), start, stop); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this test be asserting that you get a high cache hit rate, given that you query a nearby interval first?
private int testCases(final IndexedFastaSequenceFile original, | ||
final IndexedFastaSequenceFile casePreserving, | ||
final IndexedFastaSequenceFile allUpper, | ||
private int testCases(final ReferenceSequenceFile original, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method name is terrible...how about mixedCasesTestHelper()
or something that ties it to the testMixedCasesInExample()
test above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
final IndexedFastaSequenceFile allUpper, | ||
private int testCases(final ReferenceSequenceFile original, | ||
final ReferenceSequenceFile casePreserving, | ||
final ReferenceSequenceFile allUpper, | ||
final String contig, final int start, final int stop ) { | ||
final String orig = fetchBaseString(original, contig, start, stop); | ||
final String keptCase = fetchBaseString(casePreserving, contig, start, stop); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Um, why does this test call toUpperCase()
explicitly on allUpper
(see line below this one)? Shouldn't it be testing that allUpper
returns uppercase bases on its own?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that's really weird and wrong. Removed that and the test is fine so I guess it was a no-op.
Looking forward to this feature! |
@lbergelson Ping me once this is ready for a final pass |
2870e36
to
66a5f6c
Compare
@droazen Could you take another look at this? I had to hit a lot of files to do the constructor update and close the readers where they weren't being closed already. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final review complete, back to @lbergelson. Merge after addressing remaining comments.
src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCaller.java
Outdated
Show resolved
Hide resolved
@@ -257,5 +250,13 @@ public void closeTool() { | |||
if ( hcEngine != null ) { | |||
hcEngine.shutdown(); | |||
} | |||
|
|||
if ( referenceReader != null){ | |||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you take my suggestion above, you can get rid of this try/catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it makes more sense to put the close()
call in HaplotypeCallerEngine.shutdown()
, which is already called from closeTool()
, and get rid of the tool-level referenceReader
field completely. This would also be more consistent with Mutect2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems wrong to have HaplotypeCallerEngine
close the reader since it's passed in externally. It should be the responsibility of the caller to close it. In HaplotypeCallerSpark
we currently broadcast an instance of ReferenceMultiSource which is reused between calls and should not be closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it's gross having the reader as a new field in the tool itself, and inconsistent with existing patterns. I think it's better to transfer responsibility for the reference reader to the HaplotypeCallerEngine
at construction time, and let the shutdown()
call there govern when it gets closed. The Spark version can always refrain from calling shutdown() until it's ready to do so, can't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess so. It looks like it never bothers calling shutdown anyway... so it won't be effected... Maybe we should fix that also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, on further thought, I don't think the spark version will ever be able to properly clean up if we make this change, not without changing how it uses readers at least. Since it should be calling shutdown whenever it's done with a haplotypecaller engine, but the readers persist through multiple engines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it but I'm not sure if it's ideal.
src/main/java/org/broadinstitute/hellbender/utils/fasta/CachingIndexedFastaSequenceFile.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/fasta/CachingIndexedFastaSequenceFile.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/fasta/CachingIndexedFastaSequenceFile.java
Outdated
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/engine/AssemblyRegionUnitTest.java
Show resolved
Hide resolved
...g/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngineUnitTest.java
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/utils/read/ReadUtilsUnitTest.java
Show resolved
Hide resolved
.../java/org/broadinstitute/hellbender/utils/fasta/CachingIndexedFastaSequenceFileUnitTest.java
Outdated
Show resolved
Hide resolved
.../java/org/broadinstitute/hellbender/utils/fasta/CachingIndexedFastaSequenceFileUnitTest.java
Outdated
Show resolved
Hide resolved
66a5f6c
to
e1ebd3a
Compare
I think I'm cursed. This is failing the oracle build again. |
* GATK can now read fasta files that are bgzipped This requires both a .fai and .gzi index A gzi index can be created using the bgzip program the same .fai can be used for the zipped and unzipped version of a file * replacing mixed use of CachingIndexFastaFile constructors and static factory methods with a single set of constructors * closing instances of CachingIndexFastaFile
e66322a
to
21203af
Compare
It was cursed with a bad cache. |
this is a bit of a wip but probably good enough to take a look at for sanity.