HaplotypeCallerSpark doesn't write g.vcf.gz files #4274

Closed
lbergelson opened this issue Jan 26, 2018 · 4 comments

@lbergelson (Member)

If you ask HaplotypeCallerSpark for a g.vcf.gz, it outputs a base-pair-resolution GVCF with no reference blocking. This is due to confusion in hadoop-bam / VariantSparkSink.

It works fine if you write an uncompressed g.vcf.
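For context, a properly banded GVCF collapses runs of non-variant sites into reference blocks marked with an END attribute; the broken compressed output instead emits one <NON_REF> record per base. Roughly like this (coordinates and sample fields invented for illustration):

	# banded, as the uncompressed g.vcf correctly produces
	20	10000117	.	T	<NON_REF>	.	.	END=10000210	GT:DP:GQ	0/0:38:99

	# unbanded, one record per base, as the broken g.vcf.gz contains
	20	10000117	.	T	<NON_REF>	.	.	.	GT:DP:GQ	0/0:38:99
	20	10000118	.	A	<NON_REF>	.	.	.	GT:DP:GQ	0/0:40:99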

The root cause is a conditional statement in KeyIgnoringVCFOutputFormat.getRecordWriter(TaskAttemptContext ctx):

		// isCompressed, file, conf, and codec are set up earlier in the
		// method from the TaskAttemptContext's configuration
		if (!isCompressed) {
			return getRecordWriter(ctx, file);
		} else {
			FileSystem fs = file.getFileSystem(conf);
			return getRecordWriter(ctx, codec.createOutputStream(fs.create(file)));
		}

The two branches call two different overloads of getRecordWriter:

getRecordWriter(TaskAttemptContext ctx, Path out)

getRecordWriter(TaskAttemptContext ctx, OutputStream outputStream)

The first is public and is overridden in our code to provide GVCF writers; the second is private and doesn't know about our GVCF writer. We could override getRecordWriter(ctx), but we would need a VCFRecordWriter constructor that takes a stream and propagates the ctx, and no such constructor exists.
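For the record, one possible shape for the fix, sketched as a fragment of KeyIgnoringVCFOutputFormat (a guess at the design, not an actual patch; it assumes hadoop-bam adds the missing stream-plus-ctx constructor):

	// Hypothetical: make the stream-taking overload protected so subclasses
	// (like our GVCF-aware output format) can override it, mirroring the
	// public Path-taking overload.
	protected RecordWriter<K, VariantContextWritable> getRecordWriter(
			TaskAttemptContext ctx, OutputStream outputStream) throws IOException {
		// Needs a new KeyIgnoringVCFRecordWriter constructor that takes a
		// stream AND propagates the ctx; as noted above, none exists today.
		return new KeyIgnoringVCFRecordWriter<K>(outputStream, header, true, ctx);
	}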

@lbergelson (Member Author)

See #4275 for the temporary workaround.

@lbergelson (Member Author)

This will likely require a fix in hadoop-bam, unless we either copy most of the hadoop-bam code into GATK or someone comes up with a cleverer solution than I have.

lbergelson added a commit that referenced this issue Jan 26, 2018
* this is currently broken, see #4274
* add a check to HaplotypeCallerSpark and VariantSparkSink and throw a clear exception in this case
* added test for GVCF writing in VariantSparkSink which previously didn't exist
* added new UserException.UnimplementedFeature class
* closes #4275
lbergelson added a commit that referenced this issue Jan 29, 2018
* this is currently broken, see #4274
* add a check to HaplotypeCallerSpark and VariantSparkSink and throw a clear exception in this case
* added test for GVCF writing in VariantSparkSink which previously didn't exist
* added new UserException.UnimplementedFeature class
* closes #4275
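The check those commits describe presumably looks something like the following sketch (reconstructed from the commit message, not copied from the diff; the method name and message text are invented, while UserException.UnimplementedFeature is the class the commits say was added):

	// Invented guard: refuse compressed GVCF output until it actually works.
	private static void checkGvcfOutputIsSupported(final String outputPath) {
		final boolean isGvcf = outputPath.endsWith(".g.vcf") || outputPath.endsWith(".g.vcf.gz");
		if (isGvcf && outputPath.endsWith(".gz")) {
			throw new UserException.UnimplementedFeature(
					"Writing compressed GVCFs (g.vcf.gz) is not yet supported in Spark, see #4274");
		}
	}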
@droazen droazen added this to the Engine-4.1 milestone Jan 29, 2018
@lbergelson (Member Author)

lbergelson commented Jan 29, 2018

Similarly, see #4303 for the inability to write g.bcf files, a much lower-priority problem...

droazen pushed a commit that referenced this issue Jan 30, 2018
* prevent users from requesting g.vcf.gz in Spark

* this is currently broken, see #4274
* add a check to HaplotypeCallerSpark and VariantSparkSink and throw a clear exception in this case
* added test for GVCF writing in VariantSparkSink which previously didn't exist
* added new UserException.UnimplementedFeature class
* closes #4275
lbergelson added a commit that referenced this issue Jan 31, 2018
* prevent users from requesting g.vcf.gz in Spark

* this is currently broken, see #4274
* add a check to HaplotypeCallerSpark and VariantSparkSink and throw a clear exception in this case
* added test for GVCF writing in VariantSparkSink which previously didn't exist
* added new UserException.UnimplementedFeature class
* closes #4275
@droazen droazen modified the milestones: Engine-4.1, Engine-1Q2018 Feb 5, 2018
@droazen (Contributor)

droazen commented Feb 5, 2018

@tomwhite Should be a fairly easy fix in Hadoop-BAM, we think.

@droazen droazen modified the milestones: Engine-1Q2018, Engine-2Q2018 Apr 6, 2018
lbergelson pushed a commit that referenced this issue May 10, 2018
* Support g.vcf.gz files in Spark tools
* fixes #4274 
* upgrade hadoop-bam 7.9.1 -> 7.10.0
* Remove bcf files from Spark tests since Spark currently can't write bcf files correctly
   * this is tracked by #4303
   * a file named .bcf is produced, but the file is actually encoded as a vcf
   * updated tests to verify that the file extension matches the actual datatype in the file
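A check of that sort might be sketched as follows (the helper is invented for illustration; the discriminators are standard: a real BCF2 file is BGZF/gzip-wrapped and its decompressed stream starts with the magic bytes "BCF", whereas the broken output is plain VCF text starting with "##fileformat=VCF"):

	import java.io.File;
	import java.io.FileInputStream;
	import java.io.IOException;
	import java.io.InputStream;
	import java.nio.charset.StandardCharsets;
	import java.util.zip.GZIPInputStream;

	// Invented test helper: sniff whether a file that claims to be .bcf
	// really contains BCF2 data rather than misnamed VCF text.
	final class VariantFileSniffer {
		static boolean looksLikeBcf(final File file) {
			// BGZF is gzip-compatible, so GZIPInputStream can read the first block.
			try (InputStream in = new GZIPInputStream(new FileInputStream(file))) {
				final byte[] magic = in.readNBytes(3);
				return magic.length == 3
						&& "BCF".equals(new String(magic, StandardCharsets.US_ASCII));
			} catch (IOException e) {
				return false; // not even gzip-compressed, so certainly not BCF2
			}
		}
	}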