
[DISQ-29] Use a separate instance of VCFCodec for each partition #30

Merged
tomwhite merged 5 commits into disq-bio:master from vcf-concurrency-fix on Sep 18, 2018

Conversation

tomwhite
Member

This exposes the bug in #29 by using all available cores on the machine for testing (in BaseTest). (I was also able to reproduce the bug when testing on large files.)

The fix relies on the latest htsjdk release.
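For context, a rough sketch of the per-partition approach (this is an illustration, not the PR's actual code; it assumes the Spark 2.x Java API, that `lines` is a JavaRDD<String> of VCF data lines, and that `headerBc`/`versionBc` are broadcasts of the immutable header and version):

// Build a fresh VCFCodec inside each partition rather than broadcasting one
// shared instance: the codec keeps mutable parsing state and is not safe to
// share between partitions running concurrently on the same executor.
JavaRDD<VariantContext> variants = lines.mapPartitions(records -> {
    VCFCodec codec = new VCFCodec();
    codec.setVCFHeader(headerBc.getValue(), versionBc.getValue());
    List<VariantContext> out = new ArrayList<>();
    while (records.hasNext()) {
        out.add(codec.decode(records.next()));
    }
    return out.iterator();
});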

@@ -31,7 +32,7 @@

  public BamRecordWriter(Configuration conf, Path file, SAMFileHeader header) throws IOException {
    this.out = file.getFileSystem(conf).create(file);
-   BlockCompressedOutputStream compressedOut = new BlockCompressedOutputStream(out, null);
+   BlockCompressedOutputStream compressedOut = new BlockCompressedOutputStream(out, (File) null);
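A note on the cast (my reading, not stated in the thread): recent htsjdk versions have both a BlockCompressedOutputStream(OutputStream, File) and a BlockCompressedOutputStream(OutputStream, Path) constructor, so a bare null second argument is ambiguous to the compiler and the cast selects the File overload.

// The cast only disambiguates the overload; the stream still has no associated file.
BlockCompressedOutputStream compressedOut = new BlockCompressedOutputStream(out, (File) null);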
Contributor

I hate htsjdk.

Anyone object to making an assertion in this library that all method parameters should be nonnull by default?

// in the package's package-info.java
@ParametersAreNonnullByDefault
package foo;

import javax.annotation.ParametersAreNonnullByDefault;

https://github.com/google/guava/wiki/UsingAndAvoidingNullExplained#using-and-avoiding-null

Contributor

@heuermh I'm not very familiar with the various nullness annotations. What does adding that do? Is it just a statement of principle or is enforced somehow?

Contributor

As far as I know, it's a statement of principle, although there may be tools that use the annotation in static code analysis.

Throughout the Guava library, method parameters are checked via Preconditions to fail fast on nulls. Adding a dependency on Guava comes with its own set of problems, though.

I'd be happy to simply say we're not going to design our API to accept nulls as method parameters.
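For what it's worth, the JDK alone supports the fail-fast pattern; a minimal sketch using the BamRecordWriter constructor from this diff purely as an illustration (not something this PR changes):

import java.util.Objects;

public BamRecordWriter(Configuration conf, Path file, SAMFileHeader header) throws IOException {
    // Fail fast on nulls without pulling in Guava's Preconditions.
    Objects.requireNonNull(conf, "conf must not be null");
    Objects.requireNonNull(file, "file must not be null");
    Objects.requireNonNull(header, "header must not be null");
    this.out = file.getFileSystem(conf).create(file);
    // ... rest of the constructor unchanged ...
}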

Member Author

+1 to not allowing nulls in the API. I'd rather not have a Guava dependency though.

Contributor

+1 also for non-null (I am trying the same in htsjdk-next). It might also be worth looking at the Bean Validation project for constraints in the library (I am planning to propose that for htsjdk-next too).
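For reference, a minimal sketch of what a Bean Validation parameter constraint could look like (assuming javax.validation on the classpath; illustrative only, not part of this PR):

import javax.validation.constraints.NotNull;

public class Example {
    // @NotNull on parameters is only enforced when an executable validator
    // (e.g. Hibernate Validator, typically wired in via CDI or Spring) intercepts
    // the call; the annotation alone has no effect at runtime.
    public void write(@NotNull String path) {
        System.out.println("writing to " + path);
    }
}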

@@ -96,14 +97,18 @@ private static InputStream bufferAndDecompressIfNecessary(final InputStream in)
}
enableBGZFCodecs(conf);

-   Broadcast<VCFCodec> vcfCodecBroadcast = jsc.broadcast(getVCFCodec(jsc, path));
+   VCFCodec vcfCodec = getVCFCodec(jsc, path);
+   VCFHeader header = vcfCodec.getHeader();
Contributor

Could you add a comment for future generations explaining why this is necessary? The version thing is very non-obvious.

Member Author

Done

Contributor

@magicDGS magicDGS left a comment

Just some comments


@@ -33,7 +34,7 @@ public VcfRecordWriter(Configuration conf, Path file, VCFHeader header, String e
boolean compressed =
extension.endsWith(BGZFCodec.DEFAULT_EXTENSION) || extension.endsWith(".gz");
if (compressed) {
-   out = new BlockCompressedOutputStream(out, null);
+   out = new BlockCompressedOutputStream(out, (File) null);
Contributor

I really miss having a method in htsjdk to set the filename/source (as a String) on bgzip streams...

@@ -96,14 +97,20 @@ private static InputStream bufferAndDecompressIfNecessary(final InputStream in)
}
enableBGZFCodecs(conf);

-   Broadcast<VCFCodec> vcfCodecBroadcast = jsc.broadcast(getVCFCodec(jsc, path));
+   VCFCodec vcfCodec = getVCFCodec(jsc, path);
Contributor

Can be final (also in other parts of the code)

Member Author

Done

@@ -20,7 +20,7 @@
public static void setup() {
SparkConf sparkConf = new SparkConf();
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
jsc = new JavaSparkContext("local", "myapp", sparkConf);
jsc = new JavaSparkContext("local[*]", "myapp", sparkConf);
Contributor

Would it be possible to have both kinds of tests?

Member Author

It would need some more work. I think it's most useful to test with multiple cores as it's most likely to expose concurrency issues (as it did in this case).

Contributor

But having consistency between both versions might also be beneficial, right? Maybe we can open an issue if someone wants to look at refactoring to test both...

@tomwhite
Member Author

Are we happy for this to go in now?

@magicDGS
Contributor

👍

@tomwhite tomwhite merged commit b612485 into disq-bio:master Sep 18, 2018
@tomwhite tomwhite deleted the vcf-concurrency-fix branch September 18, 2018 08:22
@tomwhite
Member Author

Thanks @magicDGS!

@lbergelson
Contributor

Whoops, I was assigning the wrong ticket...
