HaplotypeCallerSpark doesn't write g.vcf.gz files #4274

Closed
lbergelson opened this issue Jan 26, 2018 · 4 comments

@lbergelson (Member)

If you ask HaplotypeCallerSpark for a g.vcf.gz, it outputs a base-pair-resolution GVCF with no reference blocking. This is due to confusion in hadoop-bam / VariantSparkSink.

It works fine if you write an uncompressed g.vcf.
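For context, a properly banded GVCF collapses runs of non-variant sites into reference blocks marked with an END attribute; the broken compressed output instead emits one <NON_REF> record per base. Roughly like this (coordinates and sample fields invented for illustration):

	# banded, as the uncompressed g.vcf correctly produces
	20	10000117	.	T	<NON_REF>	.	.	END=10000210	GT:DP:GQ	0/0:38:99

	# unbanded, one record per base, as the broken g.vcf.gz contains
	20	10000117	.	T	<NON_REF>	.	.	.	GT:DP:GQ	0/0:38:99
	20	10000118	.	A	<NON_REF>	.	.	.	GT:DP:GQ	0/0:40:99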

The root cause is a conditional statement in KeyIgnoringVCFOutputFormat.getRecordWriter(TaskAttemptContext ctx):

		// isCompressed, file, conf, and codec are set up earlier in the
		// method from the TaskAttemptContext's configuration
		if (!isCompressed) {
			return getRecordWriter(ctx, file);
		} else {
			FileSystem fs = file.getFileSystem(conf);
			return getRecordWriter(ctx, codec.createOutputStream(fs.create(file)));
		}

The two branches call two different overloads of getRecordWriter:

getRecordWriter(TaskAttemptContext ctx, Path out)

getRecordWriter(TaskAttemptContext ctx, OutputStream outputStream)

The first is public and is overridden in our code to provide GVCF writers; the second is private and doesn't know about our GVCF writer. We could override getRecordWriter(ctx), but we would need a VCFRecordWriter constructor that takes a stream and propagates the ctx, and no such constructor exists.
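For the record, one possible shape for the fix, sketched as a fragment of KeyIgnoringVCFOutputFormat (a guess at the design, not an actual patch; it assumes hadoop-bam adds the missing stream-plus-ctx constructor):

	// Hypothetical: make the stream-taking overload protected so subclasses
	// (like our GVCF-aware output format) can override it, mirroring the
	// public Path-taking overload.
	protected RecordWriter<K, VariantContextWritable> getRecordWriter(
			TaskAttemptContext ctx, OutputStream outputStream) throws IOException {
		// Needs a new KeyIgnoringVCFRecordWriter constructor that takes a
		// stream AND propagates the ctx; as noted above, none exists today.
		return new KeyIgnoringVCFRecordWriter<K>(outputStream, header, true, ctx);
	}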

@lbergelson (Member Author)

See #4275 for the temporary workaround.

@lbergelson (Member Author)

This will likely require a fix in hadoop-bam, unless we either copy most of the hadoop-bam code into GATK or someone comes up with a cleverer solution than I have.

lbergelson added a commit that referenced this issue Jan 26, 2018
* this is currently broken, see #4274
* add a check to HaplotypeCallerSpark and VariantSparkSink and throw a clear exception in this case
* added test for GVCF writing in VariantSparkSink which previously didn't exist
* added new UserException.UnimplementedFeature class
* closes #4275
lbergelson added a commit that referenced this issue Jan 29, 2018
* this is currently broken, see #4274
* add a check to HaplotypeCallerSpark and VariantSparkSink and throw a clear exception in this case
* added test for GVCF writing in VariantSparkSink which previously didn't exist
* added new UserException.UnimplementedFeature class
* closes #4275
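The check those commits describe presumably looks something like the following sketch (reconstructed from the commit message, not copied from the diff; the method name and message text are invented, while UserException.UnimplementedFeature is the class the commits say was added):

	// Invented guard: refuse compressed GVCF output until it actually works.
	private static void checkGvcfOutputIsSupported(final String outputPath) {
		final boolean isGvcf = outputPath.endsWith(".g.vcf") || outputPath.endsWith(".g.vcf.gz");
		if (isGvcf && outputPath.endsWith(".gz")) {
			throw new UserException.UnimplementedFeature(
					"Writing compressed GVCFs (g.vcf.gz) is not yet supported in Spark, see #4274");
		}
	}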
@droazen droazen added this to the Engine-4.1 milestone Jan 29, 2018
@lbergelson (Member Author)

lbergelson commented Jan 29, 2018

Similarly, see #4303 for the inability to write g.bcf files, a much lower-priority problem...

droazen pushed a commit that referenced this issue Jan 30, 2018
* prevent users from requesting g.vcf.gz in Spark

* this is currently broken, see #4274
* add a check to HaplotypeCallerSpark and VariantSparkSink and throw a clear exception in this case
* added test for GVCF writing in VariantSparkSink which previously didn't exist
* added new UserException.UnimplementedFeature class
* closes #4275
lbergelson added a commit that referenced this issue Jan 31, 2018
* prevent users from requesting g.vcf.gz in Spark

* this is currently broken, see #4274
* add a check to HaplotypeCallerSpark and VariantSparkSink and throw a clear exception in this case
* added test for GVCF writing in VariantSparkSink which previously didn't exist
* added new UserException.UnimplementedFeature class
* closes #4275
@droazen droazen modified the milestones: Engine-4.1, Engine-1Q2018 Feb 5, 2018
@droazen (Contributor)

droazen commented Feb 5, 2018

@tomwhite Should be a fairly easy fix in Hadoop-BAM, we think.

@droazen droazen modified the milestones: Engine-1Q2018, Engine-2Q2018 Apr 6, 2018
lbergelson pushed a commit that referenced this issue May 10, 2018
* Support g.vcf.gz files in Spark tools
* fixes #4274 
* upgrade hadoop-bam 7.9.1 -> 7.10.0
* Remove bcf files from Spark tests since Spark currently can't write bcf files correctly
   * this is tracked by #4303
   * a file named .bcf is produced, but the file is actually encoded as a vcf
   * updated tests to verify that the file extension matches the actual datatype in the file
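A check of that sort might be sketched as follows (the helper is invented for illustration; the discriminators are standard: a real BCF2 file is BGZF/gzip-wrapped and its decompressed stream starts with the magic bytes "BCF", whereas the broken output is plain VCF text starting with "##fileformat=VCF"):

	import java.io.File;
	import java.io.FileInputStream;
	import java.io.IOException;
	import java.io.InputStream;
	import java.nio.charset.StandardCharsets;
	import java.util.zip.GZIPInputStream;

	// Invented test helper: sniff whether a file that claims to be .bcf
	// really contains BCF2 data rather than misnamed VCF text.
	final class VariantFileSniffer {
		static boolean looksLikeBcf(final File file) {
			// BGZF is gzip-compatible, so GZIPInputStream can read the first block.
			try (InputStream in = new GZIPInputStream(new FileInputStream(file))) {
				final byte[] magic = in.readNBytes(3);
				return magic.length == 3
						&& "BCF".equals(new String(magic, StandardCharsets.US_ASCII));
			} catch (IOException e) {
				return false; // not even gzip-compressed, so certainly not BCF2
			}
		}
	}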