[DISQ-29] Use a separate instance of VCFCodec for each partition #30
Conversation
@@ -31,7 +32,7 @@
   public BamRecordWriter(Configuration conf, Path file, SAMFileHeader header) throws IOException {
     this.out = file.getFileSystem(conf).create(file);
-    BlockCompressedOutputStream compressedOut = new BlockCompressedOutputStream(out, null);
+    BlockCompressedOutputStream compressedOut = new BlockCompressedOutputStream(out, (File) null);
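For context on why the `(File) null` cast is needed: when a class has overloads taking different, unrelated reference types, a bare `null` argument is ambiguous and fails to compile, so the call site must pick an overload explicitly. A minimal stand-alone sketch of the mechanism (hypothetical `open` methods, not htsjdk's actual constructor signatures):

```java
import java.io.File;

// Hypothetical sketch of overload ambiguity: with two overloads taking
// different reference types, a bare null argument does not compile and
// the caller must cast to select one overload.
public class Main {
    static String open(File file) { return "File overload"; }
    static String open(String source) { return "String overload"; }

    public static void main(String[] args) {
        // open(null);                         // does not compile: reference to open is ambiguous
        System.out.println(open((File) null)); // cast selects the File overload
    }
}
```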
I hate htsjdk.
Anyone object to making an assertion in this library that all method parameters should be nonnull by default?
@ParametersAreNonnullByDefault
package foo;

import javax.annotation.ParametersAreNonnullByDefault;
https://github.com/google/guava/wiki/UsingAndAvoidingNullExplained#using-and-avoiding-null
@heuermh I'm not very familiar with the various nullness annotations. What does adding that do? Is it just a statement of principle, or is it enforced somehow?
As far as I know, a statement of principle, although there may be tools that use that annotation in static code analysis.
Throughout the Guava library, method parameters are checked via Preconditions to fail fast on nulls. Adding a dependency on Guava comes with its own set of problems, though.
I'd be happy to simply say we're not going to design our API to accept nulls as method parameters.
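For what it's worth, fail-fast null checks don't require a Guava dependency: the JDK's own `java.util.Objects.requireNonNull` does the same job as `Preconditions.checkNotNull`. A minimal sketch (the method and parameter names are illustrative, not the library's real API):

```java
import java.util.Objects;

// Sketch of fail-fast null checking at an API boundary using only the JDK
// (java.util.Objects), avoiding a Guava Preconditions dependency.
// The buildWriter method is hypothetical, for illustration only.
public class Main {
    static String buildWriter(String conf, String file) {
        Objects.requireNonNull(conf, "conf must not be null");
        Objects.requireNonNull(file, "file must not be null");
        return conf + ":" + file;
    }

    public static void main(String[] args) {
        System.out.println(buildWriter("conf", "out.bam"));
        try {
            buildWriter(null, "out.bam"); // fails fast at the API boundary
        } catch (NullPointerException e) {
            System.out.println(e.getMessage());
        }
    }
}
```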
+1 to not allowing nulls in the API. I'd rather not have a Guava dependency though.
+1 also for non-null (I am trying the same in htsjdk-next). Maybe it is worth checking the Bean Validation project for constraints in the library (I am planning to propose this in htsjdk-next as well).
@@ -96,14 +97,18 @@ private static InputStream bufferAndDecompressIfNecessary(final InputStream in)
     }
     enableBGZFCodecs(conf);

-    Broadcast<VCFCodec> vcfCodecBroadcast = jsc.broadcast(getVCFCodec(jsc, path));
+    VCFCodec vcfCodec = getVCFCodec(jsc, path);
+    VCFHeader header = vcfCodec.getHeader();
Could you add a comment explaining why this is necessary, for future generations? The version thing is very non-obvious.
Done
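To illustrate the underlying issue the PR title describes: a stateful, non-thread-safe decoder must not be shared across concurrently executing tasks, so each partition should construct its own instance. A minimal, hypothetical sketch of that pattern using plain threads in place of Spark partitions (the `StatefulCodec` class is a stand-in, not the real `VCFCodec`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a stateful, non-thread-safe decoder like VCFCodec.
// Mirroring the PR's fix: rather than broadcasting one shared codec, each
// task (like each Spark partition) constructs its own instance.
public class Main {
    static class StatefulCodec {
        private final StringBuilder buf = new StringBuilder(); // mutable internal state

        String decode(String line) {
            buf.setLength(0);
            buf.append("decoded:").append(line);
            return buf.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int n = i;
            // The codec is created inside the task, so no mutable state
            // is ever shared between threads.
            results.add(pool.submit(() -> new StatefulCodec().decode("rec" + n)));
        }
        for (Future<String> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

Broadcasting is still fine for immutable data (like the header); it is only the mutable decoder state that needs to be per-partition.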
Just some comments
@@ -33,7 +34,7 @@ public VcfRecordWriter(Configuration conf, Path file, VCFHeader header, String e
     boolean compressed =
         extension.endsWith(BGZFCodec.DEFAULT_EXTENSION) || extension.endsWith(".gz");
     if (compressed) {
-      out = new BlockCompressedOutputStream(out, null);
+      out = new BlockCompressedOutputStream(out, (File) null);
I really miss having a method in htsjdk to set the filename/source (as a String) on bgzip streams...
@@ -96,14 +97,20 @@ private static InputStream bufferAndDecompressIfNecessary(final InputStream in)
     }
     enableBGZFCodecs(conf);

-    Broadcast<VCFCodec> vcfCodecBroadcast = jsc.broadcast(getVCFCodec(jsc, path));
+    VCFCodec vcfCodec = getVCFCodec(jsc, path);
Can be final (also in other parts of the code)
Done
@@ -20,7 +20,7 @@
   public static void setup() {
     SparkConf sparkConf = new SparkConf();
     sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
-    jsc = new JavaSparkContext("local", "myapp", sparkConf);
+    jsc = new JavaSparkContext("local[*]", "myapp", sparkConf);
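For readers unfamiliar with Spark master URLs: `"local"` runs with a single worker thread, while `"local[*]"` uses as many worker threads as there are logical cores, which is what makes concurrency bugs like this one surface in tests. A tiny sketch of what `[*]` resolves to on the current machine (Spark itself is not needed to see the number):

```java
// "local[*]" asks Spark for one worker thread per logical core;
// this just prints the core count that would be used on this machine.
// "local" (no brackets) would use exactly one thread instead.
public class Main {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("local[*] would run with " + cores + " threads");
    }
}
```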
Would it be possible to have both kinds of tests?
It would need some more work. I think it's most useful to test with multiple cores as it's most likely to expose concurrency issues (as it did in this case).
But having consistency between both versions might also be beneficial, right? Maybe we can open an issue in case someone wants to look at refactoring to test both...
7556608 to d3370a1
Are we happy for this to go in now?
👍
Thanks @magicDGS!
Whoops, was assigning the wrong ticket..
This exposes the bug in #29 by using all available cores on a machine for testing (in BaseTest). (I was also able to reproduce the bug when testing on large files.) The fix relies on the latest htsjdk release.