Adds a class to represent expected insert-size distribution (normal a… #4827
Conversation
I use this class to allow the user to pass a Normal or LogNormal expected distribution for insert sizes; note that the String-parameter constructor allows it to be used as an @Argument type in a tool. @tedsharpe please review.
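The String-parameter constructor mentioned above is what lets the type double as a command-line argument. A minimal sketch of the idea, assuming a hypothetical spec grammar like `N(mean,stddev)` for normal and `LnN(mean,stddev)` for log-normal (the actual syntax in the PR is not shown here):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a String constructor that parses a spec such as
// "N(350,120)" so the type can be used directly as an @Argument. The
// "N(...)"/"LnN(...)" grammar is an assumption, not the PR's actual syntax.
public class InsertSizeSpec {
    private static final Pattern SPEC =
            Pattern.compile("(N|LnN)\\(\\s*([0-9.]+)\\s*,\\s*([0-9.]+)\\s*\\)");

    public final String shape;
    public final double mean;
    public final double stddev;

    public InsertSizeSpec(final String spec) {
        final Matcher m = SPEC.matcher(spec.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("bad insert-size spec: " + spec);
        }
        shape = m.group(1);
        mean = Double.parseDouble(m.group(2));
        stddev = Double.parseDouble(m.group(3));
    }
}
```

Because the constructor takes a single String, an argument-parsing framework can instantiate it directly from the raw command-line value.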
Other than the fact that we have already determined the empirical distribution from the data (see classes LibraryStatistics and IntHistogram.CDF), and a vague uneasiness about asking users for parameters that they're unlikely to know and that vary from experiment to experiment, this is fine with me except for a couple of minor class organization suggestions.
/**
 * Holds the information characterizing an insert size distribution.
 */
public class InsertSizeDistribution implements Serializable {
I thought we had a convention for class layout that put all fields first, then constructors, then other stuff. I could be wrong.
    return dist;
}

public double average() {
Maybe put this down with the other methods like min, max, density? Should it be called "mean" rather than "average"?
Codecov Report
@@ Coverage Diff @@
## master #4827 +/- ##
===============================================
+ Coverage 86.735% 86.737% +0.001%
- Complexity 29312 30073 +761
===============================================
Files 1810 1817 +7
Lines 135549 139369 +3820
Branches 15031 15803 +772
===============================================
+ Hits 117569 120884 +3315
- Misses 12566 12973 +407
- Partials 5414 5512 +98
Added support for the empirical distribution based on the read-metadata txt file generated in the pipeline.
@tedsharpe
AbstractRealDistribution fromMeanAndStdDeviation(final double mean, final double stddev);

default AbstractRealDistribution fromReadMetadataFile(final String meanString) {
It seems rather fragile to parse a text file for this information. We already have a means of serializing the ReadMetadata to a file. Why don't you just read that?
You could change the ReadMetadata code to always write read-metadata.txt as well as read-metadata.bin.
Still not clear to me how or why you'd use the normal and log-normal distribution code, but your call.
Force-pushed from 325f87b to 2eff1aa
…nd log-normal distributed) parametrized by insert size mean and stddev
…ta structure. Adds test to check that it works.
…n for each isize). Move the different shape types (normal, lognormal and empirical) into an enum.
Added a pure "empirical" isize distribution where the probability of each size is determined by the fraction of cases with that insert size, plus some smoothing so that 0-count insert sizes don't have 0 probability. Also, I moved the supported isize distribution "shapes" (normal, lognormal, empirical) to an enum.
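The "fraction of cases plus smoothing" idea described above can be sketched as simple add-one (Laplace) smoothing over an insert-size histogram. The PR may use a different smoothing scheme; this is only an illustration of why zero-count sizes end up with non-zero probability:

```java
// Hedged sketch of an empirical insert-size distribution with add-one
// (Laplace) smoothing: each size's probability is (count + 1) / (total + bins),
// so sizes observed zero times still get a small non-zero probability.
public class EmpiricalInsertSizes {
    private final double[] probs;

    public EmpiricalInsertSizes(final long[] counts) {
        long total = 0;
        for (final long c : counts) {
            total += c;
        }
        probs = new double[counts.length];
        final double denom = total + counts.length; // +1 per bin in the numerator
        for (int i = 0; i < counts.length; i++) {
            probs[i] = (counts[i] + 1.0) / denom;
        }
    }

    public double probability(final int insertSize) {
        return insertSize >= 0 && insertSize < probs.length ? probs[insertSize] : 0.0;
    }
}
```

With the `+1` in every bin the probabilities still sum to 1, which is the property the smoothed empirical distribution must preserve.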
@tedsharpe please check the last commit.
This is difficult to review because there isn't any client code: I don't know how this is going to be used. I suspect I'd have a lot of "YAGNI" comments if I knew.

Couldn't this be clearer if reorganized as an abstract class implementing AbstractIntegerDistribution, 3 concrete classes for each case (rather than the current anonymous classes), a factory that takes a spec and returns the correct distribution, and a simple enum class?

It seems weird that the distributions you allow users to realize using a spec are both two-tailed distributions, when fragment size is a one-tailed distribution.

It seems awkward that failure to parse a distribution spec leads to a code path where you try to extract a file name and read serialized read metadata. Wouldn't it be clearer to have two completely distinct code paths with a different program argument for the empirical case?

The read metadata gives per-library distributions. It seems suspect that you are folding them all together. Different libraries can have rather different fragment size stats.

Still don't like that you're providing the possibility of reading the metadata text file. Seems fragile. Why don't you modify the ReadMetadata code to always produce just the data you need. Then you could eliminate the text-file code. And you could simplify the code that processes the serialized ReadMetadata, which now has this awkward code path: CDF -> density -> sum across libs -> density+CDF stored in memory. If you have the CDF you can trivially produce density on demand.

Notwithstanding all this, if you're happy with the code as it stands, feel free to merge.
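The reorganization the reviewer suggests (an enum of shapes, concrete classes per shape instead of anonymous classes, and a factory that maps a spec onto the right one) can be sketched roughly as below. All names here are hypothetical, and a small interface stands in for the Apache Commons Math distribution base class the real code would extend:

```java
// Sketch of the suggested layout: one enum of shapes plus a factory returning
// a concrete class per shape. Class and method names are hypothetical; the
// real code would extend Commons Math's AbstractIntegerDistribution instead
// of this stand-in interface.
public class DistributionFactory {
    public enum Shape { NORMAL, LOG_NORMAL, EMPIRICAL }

    public interface InsertSizeModel {
        double mean();
    }

    static final class NormalModel implements InsertSizeModel {
        final double mu, sigma;
        NormalModel(final double mu, final double sigma) { this.mu = mu; this.sigma = sigma; }
        @Override public double mean() { return mu; }
    }

    static final class LogNormalModel implements InsertSizeModel {
        final double mu, sigma; // parameters of the underlying normal
        LogNormalModel(final double mu, final double sigma) { this.mu = mu; this.sigma = sigma; }
        // mean of a log-normal is exp(mu + sigma^2 / 2)
        @Override public double mean() { return Math.exp(mu + sigma * sigma / 2.0); }
    }

    public static InsertSizeModel of(final Shape shape, final double a, final double b) {
        switch (shape) {
            case NORMAL:     return new NormalModel(a, b);
            case LOG_NORMAL: return new LogNormalModel(a, b);
            default:
                // the empirical case would take a histogram, not two parameters,
                // which is the reviewer's argument for a separate code path
                throw new IllegalArgumentException("EMPIRICAL needs a histogram");
        }
    }
}
```

Separating the empirical case out of the parametric factory also addresses the reviewer's point about the spec-parse-failure fallback: the two cases would simply never share a code path.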
Noted, will merge. I would rather address those concerns in separate pull request(s) later on.