Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fasta gz support in gatk #5120

Merged
merged 3 commits into from
Sep 20, 2018
Merged

fasta gz support in gatk #5120

merged 3 commits into from
Sep 20, 2018

Conversation

lbergelson
Copy link
Member

this is a bit of a wip but probably good enough to take a look at for sanity.

@codecov-io
Copy link

codecov-io commented Aug 22, 2018

Codecov Report

Merging #5120 into master will increase coverage by 0.002%.
The diff coverage is 78.713%.

@@               Coverage Diff               @@
##              master     #5120       +/-   ##
===============================================
+ Coverage     86.797%   86.799%   +0.002%     
- Complexity     29706     29711        +5     
===============================================
  Files           1820      1820               
  Lines         137384    137406       +22     
  Branches       15151     15160        +9     
===============================================
+ Hits          119245    119267       +22     
+ Misses         12651     12635       -16     
- Partials        5488      5504       +16
Impacted Files Coverage Δ Complexity Δ
...oadinstitute/hellbender/engine/AssemblyRegion.java 84.496% <ø> (ø) 48 <0> (ø) ⬇️
...er/tools/walkers/rnaseq/OverhangFixingManager.java 94.079% <ø> (ø) 67 <0> (ø) ⬇️
.../walkers/rnaseq/OverhangFixingManagerUnitTest.java 98.83% <ø> (ø) 31 <0> (ø) ⬇️
...plotypecaller/ReadLikelihoodCalculationEngine.java 100% <ø> (ø) 0 <0> (ø) ⬇️
...haplotypecaller/HaplotypeCallerEngineUnitTest.java 96.296% <ø> (ø) 6 <0> (ø) ⬇️
...institute/hellbender/exceptions/UserException.java 67.606% <0%> (-2.467%) 3 <0> (ø)
...lbender/engine/AssemblyRegionIteratorUnitTest.java 84.444% <100%> (+0.175%) 11 <6> (ø) ⬇️
...llbender/engine/filters/VariantFilterUnitTest.java 90.244% <100%> (ø) 6 <0> (ø) ⬇️
...tute/hellbender/utils/GenomeLocParserUnitTest.java 88.435% <100%> (-0.263%) 46 <2> (ø)
.../readthreading/ReadThreadingAssemblerUnitTest.java 98.718% <100%> (+0.011%) 38 <1> (+1) ⬆️
... and 19 more

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review complete, back to @lbergelson. Found one bug plus a couple of other issues, but looks good overall.

@@ -94,26 +77,21 @@ public CachingIndexedFastaSequenceFile(final Path fasta, final FastaSequenceInde
* @param cacheSize the size of the cache to use in this CachingIndexedFastaReader, must be >= 0
* @param preserveCase If true, we will keep the case of the underlying bases in the FASTA, otherwise everything is converted to upper case
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs javadoc for the preserveIUPAC argument.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

this(fasta, index, cacheSize, false, false);
private static ReferenceSequenceFile requireIndex(Path fasta, ReferenceSequenceFile referenceSequenceFile) {
if( !referenceSequenceFile.isIndexed()) {
throw new UserException("Couldn't create indexed fasta: " + fasta.toAbsolutePath());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Fasta index not found, but required" would be a better error message, I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error reporting here is kind of bad, there's no easy way to tell what the problem was. I added a gross block of error diagnosing code, let me know what you think...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I punted here and moved the error checking all to be in the initial checks. This throws gatk error now if it ends up here without being indexed since we should be preventing that up front.

if( !referenceSequenceFile.isIndexed()) {
throw new UserException("Couldn't create indexed fasta: " + fasta.toAbsolutePath());
}
return referenceSequenceFile;
}

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why keep any of these constructors public? Why not force all clients to go through the checkAndCreate() factory methods?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote it to be only constructors and to make them all perform the necessary safety checks. It's much less confusing now I think.

@@ -136,7 +114,7 @@ public CachingIndexedFastaSequenceFile(final Path fasta) throws FileNotFoundExce
* @param fasta The file to open.
* @param preserveCase If true, we will keep the case of the underlying bases in the FASTA, otherwise everything is converted to upper case
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constructor docs should note that IUPAC bases will not be preserved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -221,7 +194,7 @@ public static CachingIndexedFastaSequenceFile checkAndCreate(final Path fastaPat
* @param fasta The file to open.
* @param cacheSize the size of the cache to use in this CachingIndexedFastaReader, must be >= 0
*/
public CachingIndexedFastaSequenceFile(final Path fasta, final long cacheSize ) throws FileNotFoundException {
public CachingIndexedFastaSequenceFile(final Path fasta, final long cacheSize ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be encapsulated, clients should go through checkAndCreate() instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add overload of checkAndCreate() that lets the client specify the cache size, and any other overloads necessary to replicate the functionality of the constructors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this the other way around, getting rid of check and create and making all the constructors safe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the unused constructor to set cache size, but exposing the full argument constructor as public so people can do what they want

// times.
final int minExpectedHits = (int) Math.floor(
(Math.min(caching.getCacheSize(), contig.getSequenceLength()) - querySize * 2.0) / STEP_SIZE);
caching.printEfficiency(Level.WARN);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, WARN seems inappropriate here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does

int middleStart = (contig.getSequenceLength() - querySize) / 2;
int middleStop = middleStart + querySize;

logger.warn(String.format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too I question the use of warn()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check

if (stop <= contig.getSequenceLength()) {
ReferenceSequence grabMiddle = caching.getSubsequenceAt(contig.getSequenceName(), middleStart,
middleStop);
ReferenceSequence cachedVal = caching.getSubsequenceAt(contig.getSequenceName(), start, stop);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this test be asserting that you get a high cache hit rate, given that you query a nearby interval first?

private int testCases(final IndexedFastaSequenceFile original,
final IndexedFastaSequenceFile casePreserving,
final IndexedFastaSequenceFile allUpper,
private int testCases(final ReferenceSequenceFile original,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method name is terrible...how about mixedCasesTestHelper() or something that ties it to the testMixedCasesInExample() test above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

final IndexedFastaSequenceFile allUpper,
private int testCases(final ReferenceSequenceFile original,
final ReferenceSequenceFile casePreserving,
final ReferenceSequenceFile allUpper,
final String contig, final int start, final int stop ) {
final String orig = fetchBaseString(original, contig, start, stop);
final String keptCase = fetchBaseString(casePreserving, contig, start, stop);
Copy link
Contributor

@droazen droazen Aug 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Um, why does this test call toUpperCase() explicitly on allUpper (see line below this one)? Shouldn't it be testing that allUpper returns uppercase bases on its own?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's really weird and wrong. Removed that and the test is fine so I guess it was a no-op.

@droazen droazen assigned lbergelson and unassigned droazen Aug 28, 2018
@sooheelee
Copy link
Contributor

Looking forward to this feature!

@droazen
Copy link
Contributor

droazen commented Sep 12, 2018

@lbergelson Ping me once this is ready for a final pass

@lbergelson
Copy link
Member Author

@droazen Could you take another look at this? I had to hit a lot of files to do the constructor update and close the readers where they weren't being closed already.

@droazen droazen assigned droazen and unassigned lbergelson Sep 13, 2018
Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final review complete, back to @lbergelson. Merge after addressing remaining comments.

@@ -257,5 +250,13 @@ public void closeTool() {
if ( hcEngine != null ) {
hcEngine.shutdown();
}

if ( referenceReader != null){
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you take my suggestion above, you can get rid of this try/catch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it makes more sense to put the close() call in HaplotypeCallerEngine.shutdown(), which is already called from closeTool(), and get rid of the tool-level referenceReader field completely. This would also be more consistent with Mutect2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems wrong to have HaplotypeCallerEngine close the reader since it's passed in externally. It should be the responsibility of the caller to close it. In HaplotypeCallerSpark we currently broadcast an instance of ReferenceMultiSource which is reused between calls and should not be closed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it's gross having the reader as a new field in the tool itself, and inconsistent with existing patterns. I think it's better to transfer responsibility for the reference reader to the HaplotypeCallerEngine at construction time, and let the shutdown() call there govern when it gets closed. The Spark version can always refrain from calling shutdown() until it's ready to do so, can't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess so. It looks like it never bothers calling shutdown anyway... so it won't be effected... Maybe we should fix that also.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, on further thought, I don't think the spark version will ever be able to properly clean up if we make this change, not without changing how it uses readers at least. Since it should be calling shutdown whenever it's done with a haplotypecaller engine, but the readers persist through multiple engines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it but I'm not sure if it's ideal.

@lbergelson
Copy link
Member Author

I think I'm cursed. This is failing the oracle build again.

* GATK can now read fasta files that are bgzipped
  This requires both a .fai and .gzi index
  A gzi index can be created using the bgzip program
  the same .fai can be used for the zipped and unzipped version of a file

* replacing mixed use of CachingIndexFastaFile constructors and static
factory methods with a single set of constructors

* closing instances of CachingIndexFastaFile
@lbergelson
Copy link
Member Author

It was cursed with a bad cache.

@lbergelson lbergelson dismissed droazen’s stale review September 20, 2018 02:22

responded to comments

@lbergelson lbergelson merged commit 6e352bb into master Sep 20, 2018
@lbergelson lbergelson deleted the lb_fasta_gz branch September 20, 2018 02:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants