Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds a class to represent expected insert-size distribution (normal a… #4827

Merged
merged 4 commits into from
Sep 4, 2018

Conversation

vruano
Copy link
Contributor

@vruano vruano commented May 30, 2018

…nd log-normal distributed) parametrized by

insert size mean and stddev

@vruano
Copy link
Contributor Author

vruano commented May 30, 2018

I use this class to allow the user to pass a Normal or LogNormal expected distribution for insert sizes; notice that the String parameter constructor allows for it to use as an [at]Argument type in a tool.

@tedsharpe please review.

Copy link
Contributor

@tedsharpe tedsharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the fact that we have already determined the empirical distribution from the data (see classes LibraryStatistics and IntHistogram.CDF), and a vague uneasiness about asking users for parameters that they're unlikely to know and that vary from experiment to experiment, this is fine with me except for a couple of minor class organization suggestions.

/**
* Holds the information characterizing and insert size distribution.
*/
public class InsertSizeDistribution implements Serializable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had a convention for class layout that put all fields first, then constructors, then other stuff. I could be wrong.

return dist;
}

public double average() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put this down with the other methods like min, max, density? Should it be called "mean" rather than "average"?

@codecov-io
Copy link

codecov-io commented Jun 1, 2018

Codecov Report

Merging #4827 into master will increase coverage by 0.001%.
The diff coverage is 78.341%.

@@               Coverage Diff               @@
##              master     #4827       +/-   ##
===============================================
+ Coverage     86.735%   86.737%   +0.001%     
- Complexity     29312     30073      +761     
===============================================
  Files           1810      1817        +7     
  Lines         135549    139369     +3820     
  Branches       15031     15803      +772     
===============================================
+ Hits          117569    120884     +3315     
- Misses         12566     12973      +407     
- Partials        5414      5512       +98
Impacted Files Coverage Δ Complexity Δ
.../spark/sv/InsertSizeDistributionShapeUnitTest.java 100% <100%> (ø) 11 <11> (?)
...adinstitute/hellbender/utils/IntHistogramTest.java 96.512% <100%> (+1.274%) 17 <4> (+4) ⬆️
...er/tools/spark/sv/InsertSizeDistributionShape.java 62.143% <62.143%> (ø) 14 <14> (?)
...lbender/tools/spark/sv/InsertSizeDistribution.java 63.83% <63.83%> (ø) 12 <12> (?)
...llbender/tools/spark/sv/evidence/ReadMetadata.java 87.715% <66.667%> (-1.703%) 57 <0> (ø)
...ute/hellbender/tools/spark/utils/IntHistogram.java 88.82% <81.081%> (-2.38%) 22 <4> (+3)
...nder/tools/spark/sv/evidence/ReadMetadataTest.java 96.629% <93.478%> (-3.371%) 12 <8> (+7)
...tools/spark/sv/InsertSizeDistributionUnitTest.java 95.699% <95.699%> (ø) 21 <21> (?)
...iscoverFromLocalAssemblyContigAlignmentsSpark.java 85.019% <0%> (-5.508%) 17% <0%> (-7%)
...ariationDiscoveryPipelineSparkIntegrationTest.java 92.982% <0%> (-2.894%) 20% <0%> (+6%)
... and 53 more

@vruano
Copy link
Contributor Author

vruano commented Aug 23, 2018

Added support for the empirical distribution read-metadata txt file generated in the pipeline.
Also rebased to current master.

@vruano
Copy link
Contributor Author

vruano commented Aug 23, 2018

@tedsharpe
Could you review the last commit adding support for the empirical distribution?


AbstractRealDistribution fromMeanAndStdDeviation(final double mean, final double stddev);

default AbstractRealDistribution fromReadMetadataFile(final String meanString) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems rather fragile to parse a text file for this information. We already have a means of serializing the ReadMetadata to a file. Why don't you just read that?
You could change the ReadMetadata code to always write read-metadata.txt as well as read-metadata.bin.

Copy link
Contributor

@tedsharpe tedsharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not clear to me how or why you'd use the normal and log-normal distribution code, but your call.

@vruano vruano force-pushed the vrr_isd branch 3 times, most recently from 325f87b to 2eff1aa Compare August 30, 2018 22:15
…nd log-normal distributed) parametrized by

insert size mean and stddev
…ta structure. Adds test to check that it works.
…n for each isize).

Move the different shape types (normal, lognormal and empirical) into an enum.
@vruano
Copy link
Contributor Author

vruano commented Aug 31, 2018

Added a pure "empirical" isize distribution where the prob of each size is determine by the fraction of cases with that insert size + some smoothing so that 0-count insert size don't have 0 probability.

Also I moved the supported isize distribution "shapes" (normal, lognormal, empirical) to a enum.

@vruano
Copy link
Contributor Author

vruano commented Aug 31, 2018

@tedsharpe please check on the last commit.

@tedsharpe
Copy link
Contributor

This is difficult to review because there isn't any client code: I don't know how this is going to be used. I suspect I'd have a lot of "YAGNI" comments if I knew.
For example, you are basing all your implementations on Apache's AbstractIntegerDistribution. That class, it seems to me, is really intended to allow you to do sampling from a distribution. But I suspect you won't be sampling, you'll only be asking questions about density. If so, there's a lot of baggage that gets pulled into your anonymous implementations of this class: random number generators, boundary information, etc. Lots of extra boilerplate.

Couldn't this be clearer if reorganized as an abstract class implementing AbstractIntegerDistribution, 3 concrete classes for each case (rather than the current anonymous classes), a factory that takes a spec and returns the correct distribution, and a simple enum class?

It seems weird that the distributions you allow users to realize using a spec are both two-tailed distributions, when fragment size is a one-tailed distribution.

It seems awkward that failure to parse a distribution spec leads to a code path where you try to extract a file name and read serialized read metadata. Wouldn't it be clearer to have two completely distinct code paths with a different program argument for the empirical case?

The read metadata gives per library distributions. It seems suspect that you are folding them all together. Different libraries can have rather different fragment size stats.

Still don't like that you're providing the possibility of reading the metadata text file. Seems fragile. Why don't you modify the ReadMetadata code to always produce just the data you need. Then you could eliminate the text-file code. And you could simplify the code that processes the serialized ReadMetadata which now has this awkward code path: CDF -> density -> sum across libs -> density+CDF stored in memory. If you have the CDF you can trivially produce density on demand.

Notwithstanding all this, if you're happy with the code as it stands, feel free to merge.
Back to you, review done.

@vruano
Copy link
Contributor Author

vruano commented Sep 4, 2018

Noted, will mege, I would rather address those concerns in a separate pull-request(s), later on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants