Adds a class to represent expected insert-size distribution (normal a… #4827

vruano · 2018-05-30T20:44:18Z

…nd log-normal distributed) parametrized by insert size mean and stddev

vruano · 2018-05-30T20:45:34Z

I use this class to let the user pass a Normal or LogNormal expected distribution for insert sizes. Note that the String-parameter constructor allows the class to be used as an @Argument type in a tool.

@tedsharpe please review.

tedsharpe

Other than the fact that we have already determined the empirical distribution from the data (see classes LibraryStatistics and IntHistogram.CDF), and a vague uneasiness about asking users for parameters that they're unlikely to know and that vary from experiment to experiment, this is fine with me except for a couple of minor class organization suggestions.

tedsharpe · 2018-05-30T21:45:25Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/InsertSizeDistribution.java

+/**
+ * Holds the information characterizing and insert size distribution.
+ */
+public class InsertSizeDistribution implements Serializable {


I thought we had a convention for class layout that put all fields first, then constructors, then other stuff. I could be wrong.

tedsharpe · 2018-05-30T21:46:35Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/InsertSizeDistribution.java

+        return dist;
+    }
+
+    public double average() {


Maybe put this down with the other methods like min, max, density? Should it be called "mean" rather than "average"?

codecov-io · 2018-06-01T21:46:26Z

Codecov Report

Merging #4827 into master will increase coverage by 0.001%.
The diff coverage is 78.341%.

@@               Coverage Diff               @@
##              master     #4827       +/-   ##
===============================================
+ Coverage     86.735%   86.737%   +0.001%     
- Complexity     29312     30073      +761     
===============================================
  Files           1810      1817        +7     
  Lines         135549    139369     +3820     
  Branches       15031     15803      +772     
===============================================
+ Hits          117569    120884     +3315     
- Misses         12566     12973      +407     
- Partials        5414      5512       +98

Impacted Files	Coverage Δ	Complexity Δ
.../spark/sv/InsertSizeDistributionShapeUnitTest.java	`100% <100%> (ø)`	`11 <11> (?)`
...adinstitute/hellbender/utils/IntHistogramTest.java	`96.512% <100%> (+1.274%)`	`17 <4> (+4)`	⬆️
...er/tools/spark/sv/InsertSizeDistributionShape.java	`62.143% <62.143%> (ø)`	`14 <14> (?)`
...lbender/tools/spark/sv/InsertSizeDistribution.java	`63.83% <63.83%> (ø)`	`12 <12> (?)`
...llbender/tools/spark/sv/evidence/ReadMetadata.java	`87.715% <66.667%> (-1.703%)`	`57 <0> (ø)`
...ute/hellbender/tools/spark/utils/IntHistogram.java	`88.82% <81.081%> (-2.38%)`	`22 <4> (+3)`
...nder/tools/spark/sv/evidence/ReadMetadataTest.java	`96.629% <93.478%> (-3.371%)`	`12 <8> (+7)`
...tools/spark/sv/InsertSizeDistributionUnitTest.java	`95.699% <95.699%> (ø)`	`21 <21> (?)`
...iscoverFromLocalAssemblyContigAlignmentsSpark.java	`85.019% <0%> (-5.508%)`	`17% <0%> (-7%)`
...ariationDiscoveryPipelineSparkIntegrationTest.java	`92.982% <0%> (-2.894%)`	`20% <0%> (+6%)`
... and 53 more

vruano · 2018-08-23T01:08:24Z

Added support for the empirical distribution read-metadata txt file generated in the pipeline.
Also rebased to current master.

vruano · 2018-08-23T21:40:01Z

@tedsharpe
Could you review the last commit adding support for the empirical distribution?

tedsharpe · 2018-08-24T14:04:40Z

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/InsertSizeDistribution.java

+
+        AbstractRealDistribution fromMeanAndStdDeviation(final double mean, final double stddev);
+
+        default AbstractRealDistribution fromReadMetadataFile(final String meanString) {


It seems rather fragile to parse a text file for this information. We already have a means of serializing the ReadMetadata to a file. Why don't you just read that?
You could change the ReadMetadata code to always write read-metadata.txt as well as read-metadata.bin.

tedsharpe

Still not clear to me how or why you'd use the normal and log-normal distribution code, but your call.

vruano · 2018-08-31T03:05:05Z

Added a pure "empirical" isize distribution where the probability of each size is determined by the fraction of cases with that insert size, plus some smoothing so that insert sizes with a zero count don't get zero probability.

Also, I moved the supported isize distribution "shapes" (normal, lognormal, empirical) into an enum.
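
For illustration, a minimal sketch of the smoothing idea described above. It assumes simple additive (pseudo-count) smoothing; the thread does not say which smoothing scheme the PR actually uses, and the class name `SmoothedInsertSizeDensity` and its parameters are hypothetical.

```java
/**
 * Hypothetical helper, not the PR's code: turns per-insert-size counts into
 * probabilities using additive smoothing (an assumed scheme).
 */
final class SmoothedInsertSizeDensity {

    private final double[] probabilities;

    /**
     * @param counts      counts[i] is the number of fragments observed with insert size i.
     * @param pseudoCount value added to every count so that unseen sizes keep some mass; must be > 0.
     */
    SmoothedInsertSizeDensity(final long[] counts, final double pseudoCount) {
        if (counts.length == 0 || !(pseudoCount > 0)) {
            throw new IllegalArgumentException("need at least one count and a positive pseudo-count");
        }
        double total = 0;
        for (final long count : counts) {
            if (count < 0) {
                throw new IllegalArgumentException("counts cannot be negative");
            }
            total += count + pseudoCount;
        }
        probabilities = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            probabilities[i] = (counts[i] + pseudoCount) / total;
        }
    }

    /** Smoothed probability of the given insert size; 0 outside the tabulated range. */
    double probability(final int insertSize) {
        return insertSize >= 0 && insertSize < probabilities.length ? probabilities[insertSize] : 0.0;
    }
}
```

With counts `{0, 3, 1}` and a pseudo-count of 1, insert size 0 gets probability 1/7 rather than 0, and the three probabilities still sum to 1.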

vruano · 2018-08-31T03:05:37Z

@tedsharpe please check on the last commit.

tedsharpe · 2018-09-04T15:53:11Z

This is difficult to review because there isn't any client code: I don't know how this is going to be used. I suspect I'd have a lot of "YAGNI" comments if I knew.
For example, you are basing all your implementations on Apache's AbstractIntegerDistribution. That class, it seems to me, is really intended to allow you to do sampling from a distribution. But I suspect you won't be sampling, you'll only be asking questions about density. If so, there's a lot of baggage that gets pulled into your anonymous implementations of this class: random number generators, boundary information, etc. Lots of extra boilerplate.

Couldn't this be clearer if reorganized as an abstract class implementing AbstractIntegerDistribution, 3 concrete classes for each case (rather than the current anonymous classes), a factory that takes a spec and returns the correct distribution, and a simple enum class?
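
A rough sketch of what that layout could look like, under several assumptions: to stay self-contained it uses a small hypothetical `InsertSizeDensity` interface instead of Apache's `AbstractIntegerDistribution`, the spec syntax `SHAPE(mean,stddev)` is invented for the example, and the empirical case is left out. None of these names come from the PR.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical minimal contract: only density queries, no sampling machinery. */
interface InsertSizeDensity {
    double density(int insertSize);
}

final class NormalInsertSizeDensity implements InsertSizeDensity {
    private final double mean;
    private final double stddev;

    NormalInsertSizeDensity(final double mean, final double stddev) {
        this.mean = mean;
        this.stddev = stddev;
    }

    @Override
    public double density(final int insertSize) {
        final double z = (insertSize - mean) / stddev;
        return Math.exp(-0.5 * z * z) / (stddev * Math.sqrt(2 * Math.PI));
    }
}

final class LogNormalInsertSizeDensity implements InsertSizeDensity {
    private final double mu;    // mean of log(insert size)
    private final double sigma; // stddev of log(insert size)

    /** Converts the insert-size mean and stddev into the log-scale parameters. */
    LogNormalInsertSizeDensity(final double mean, final double stddev) {
        final double logVariance = Math.log1p((stddev * stddev) / (mean * mean));
        this.sigma = Math.sqrt(logVariance);
        this.mu = Math.log(mean) - 0.5 * logVariance;
    }

    @Override
    public double density(final int insertSize) {
        if (insertSize <= 0) {
            return 0.0; // the log-normal has no mass at or below zero
        }
        final double z = (Math.log(insertSize) - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (insertSize * sigma * Math.sqrt(2 * Math.PI));
    }
}

/** The simple enum: one constant per parametric shape. */
enum InsertSizeShape {
    NORMAL {
        @Override
        InsertSizeDensity create(final double mean, final double stddev) {
            return new NormalInsertSizeDensity(mean, stddev);
        }
    },
    LOGNORMAL {
        @Override
        InsertSizeDensity create(final double mean, final double stddev) {
            return new LogNormalInsertSizeDensity(mean, stddev);
        }
    };

    abstract InsertSizeDensity create(double mean, double stddev);
}

/** Factory that takes a spec such as "NORMAL(300,150)" (assumed syntax). */
final class InsertSizeDensityFactory {
    private static final Pattern SPEC =
            Pattern.compile("\\s*(\\w+)\\s*\\(\\s*([^,\\s]+)\\s*,\\s*([^)\\s]+)\\s*\\)\\s*");

    private InsertSizeDensityFactory() {}

    static InsertSizeDensity fromSpec(final String spec) {
        final Matcher matcher = SPEC.matcher(spec);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("unparsable distribution spec: " + spec);
        }
        // valueOf and parseDouble both throw IllegalArgumentException (or a subclass) on bad input.
        final InsertSizeShape shape = InsertSizeShape.valueOf(matcher.group(1).toUpperCase(Locale.ROOT));
        final double mean = Double.parseDouble(matcher.group(2));
        final double stddev = Double.parseDouble(matcher.group(3));
        if (!(mean > 0) || !(stddev > 0)) {
            throw new IllegalArgumentException("mean and stddev must be positive: " + spec);
        }
        return shape.create(mean, stddev);
    }
}
```

In this sketch a bad spec fails immediately with an exception instead of falling through to a file-reading path; an empirical distribution would be built through its own entry point.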

It seems weird that the distributions you allow users to realize using a spec are both two-tailed distributions, when fragment size is a one-tailed distribution.

It seems awkward that failure to parse a distribution spec leads to a code path where you try to extract a file name and read serialized read metadata. Wouldn't it be clearer to have two completely distinct code paths with a different program argument for the empirical case?

The read metadata gives per library distributions. It seems suspect that you are folding them all together. Different libraries can have rather different fragment size stats.

Still don't like that you're providing the possibility of reading the metadata text file. Seems fragile. Why don't you modify the ReadMetadata code to always produce just the data you need. Then you could eliminate the text-file code. And you could simplify the code that processes the serialized ReadMetadata which now has this awkward code path: CDF -> density -> sum across libs -> density+CDF stored in memory. If you have the CDF you can trivially produce density on demand.
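
To make the last point concrete, here is a minimal sketch of computing density on demand from a stored CDF. It assumes the CDF is available as a plain array where `cdf[i]` is the probability of an insert size of at most `i`; the class `CdfBackedDensity` is hypothetical and does not reflect the actual `IntHistogram.CDF` API.

```java
/** Hypothetical wrapper, not the PR's code: answers density queries from a stored CDF. */
final class CdfBackedDensity {

    // cdf[i] = P(insert size <= i); assumed non-decreasing with values in [0, 1].
    private final double[] cdf;

    CdfBackedDensity(final double[] cdf) {
        if (cdf.length == 0) {
            throw new IllegalArgumentException("the CDF cannot be empty");
        }
        for (int i = 1; i < cdf.length; i++) {
            if (cdf[i] < cdf[i - 1]) {
                throw new IllegalArgumentException("the CDF must be non-decreasing");
            }
        }
        this.cdf = cdf.clone();
    }

    /** P(insert size <= x); 0 below the range, the last stored value above it. */
    double cumulative(final int x) {
        return x < 0 ? 0.0 : cdf[Math.min(x, cdf.length - 1)];
    }

    /** P(insert size == x), computed as the difference of adjacent CDF values. */
    double density(final int x) {
        return cumulative(x) - cumulative(x - 1);
    }
}
```

Only the CDF is held in memory here; no separate density array is needed.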

Notwithstanding all this, if you're happy with the code as it stands, feel free to merge.
Back to you, review done.

vruano · 2018-09-04T17:33:10Z

Noted, will merge. I would rather address those concerns in separate pull request(s) later on.

vruano assigned tedsharpe May 30, 2018

vruano requested a review from tedsharpe May 30, 2018 20:44

tedsharpe approved these changes May 30, 2018

vruano force-pushed the vrr_isd branch from 6cca79f to b8e29f6 Compare August 23, 2018 01:06

vruano force-pushed the vrr_isd branch from b8e29f6 to 4e3682c Compare August 23, 2018 01:13

tedsharpe reviewed Aug 24, 2018

tedsharpe approved these changes Aug 24, 2018

vruano force-pushed the vrr_isd branch 3 times, most recently from 325f87b to 2eff1aa Compare August 30, 2018 22:15

vruano added 4 commits August 30, 2018 19:45

Adds a class to represent expected insert-size distribution (normal a…

f8c54d0

…nd log-normal distributed) parametrized by insert size mean and stddev

Addressed comments, added support for empirical distribution in read-…

f04ffd8

…metadata file

Adds the hability of reading serialized forms of the read-metadata da…

32cbb11

…ta structure. Adds test to check that it works.

Added the Empirical Insert-Size-Distribution Shape (arbitrary fractio…

b5a198a

…n for each isize). Move the different shape types (normal, lognormal and empirical) into an enum.

vruano force-pushed the vrr_isd branch from 2eff1aa to b5a198a Compare August 30, 2018 23:51

vruano merged commit 01d5ea2 into master Sep 4, 2018

vruano deleted the vrr_isd branch September 4, 2018 17:34

vruano mentioned this pull request Sep 4, 2018

Improvements to the Insert-Size Distribution framework for SV. #5153

Open
