-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add binomial and Poisson distributions #96
Conversation
Not sure what dependencies that are allowed by the library, but POSIX.1-2001 defines both the tgamma and lgamma function for computing the gamma function and ln-Gamma function, respectively. Hence the log gamma function might only be necessary to define if the intention is to not be dependent on POSIX. |
I don't think the crate should depend on POSIX, it's also supposed to work on Windows, after all. That being said, we could conditionally compile to use the POSIX lgamma on systems supporting it, but I'm not sure if it's worth the effort. Also, it looks like this PR might be made obsolete by the future changes in the API anyway, so... ;) |
I don't know about obselete. There's been no discussion yet of whether distributions should be moved to a separate crate; having these distributions would still be useful. |
@fizyk20 are you happy to update this now? I think we could merge now.
|
Awesome! Rebased, fixed and squashed. If any other changes are necessary, please let me know. |
2ced48c
to
b7c59c6
Compare
Thanks! Are you force-pushing? (Was told page was out of date twice.) Just let me know when this is ready for review.. |
Yes, I was - I realised that my rustfmt ran automatically and modified pretty much the whole crate, so I amended the commit so that the changes are minimal. Should be ready now 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you remembered my aversions to rustfmt 🤣
Just had a quick look at the API and doc; looks good! I'd still like to go over the maths though I don't expect any probleems.
src/distributions/log_gamma.rs
Outdated
|
||
/// Calculates ln(gamma(x)) (natural logarithm of the gamma | ||
/// function) using the Lanczos approximation with g=5 | ||
pub fn log_gamma(x: f64) -> f64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to see a little more doc on this.
Edit: sorry, it's not exported so this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, I added more before I noticed the edit :P Well, it won't hurt, and may help in reading the code ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good anyway 👍
src/distributions/binomial.rs
Outdated
} | ||
|
||
impl Distribution<u64> for Binomial { | ||
fn sample<R: Rng>(&self, rng: &mut R) -> u64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're using R: Rng + ?Sized
for now. (I'm not sure whether to change to R: RngCore + ?Sized
; I'm actually surprised this compiles since it differs from the trait.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it wouldn't compile if some code tried to use it with an unsized R
- but I'd expect it to notice the difference, too. Maybe it's something that could be reported to the compiler team?
src/distributions/binomial.rs
Outdated
/// `n`, `p`. Panics if `p <= 0` or `p >= 1`. | ||
pub fn new(n: u64, p: f64) -> Binomial { | ||
assert!(p > 0.0, "Binomial::new called with `p` <= 0"); | ||
assert!(p < 1.0, "Binomial::new called with `p` >= 1"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why quote p
in the assert messages? I think better not to (also for λ later)
src/distributions/poisson.rs
Outdated
} | ||
|
||
impl Distribution<u64> for Poisson { | ||
fn sample<R: Rng>(&self, rng: &mut R) -> u64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R: Rng + ?Sized
again
src/distributions/poisson.rs
Outdated
/// The Poisson distribution `Poisson(lambda)`. | ||
/// | ||
/// This distribution has density function: `f(k) = lambda^k * | ||
/// exp(-lambda) / k!` for `k >= 0`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better not to split the code block over multiple lines IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments; still needs more review of the maths (which are not trivial).
if expected < 25.0 { | ||
let mut lresult = 0.0; | ||
for _ in 0 .. self.n { | ||
if rng.gen::<f64>() < p { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitdicker what do you think about this probability test? I've been wondering if we should add a dedicated Bernoulli distribution for more accurate sampling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should first learn a lot before I can make any meaningful comments w.r.t. the distributions...
This single line is pretty much the Bernoulli distribution? It might be more generally useful than gen_weighted_bool
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we should add a Bernoulli distribution, and it would be nice to have it reasonably accurate for small p
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can do better than just rng.gen::<f64>() < p
, then sure, this is a good idea, otherwise I'm not really convinced that it makes sense to make it a separate distribution.
And as for doing better, that's a bit over my head, so unfortunately I won't be able to help...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fill you in @fizyk20, @pitdicker already did quite a bit of work implementing higher-precision floating point sampling, since the default method uses the same precision over the 0-1 range despite the format being able to represent a lot more close to 0 — however, we seem to have decided not to use this sampling method by default. There's also the thing that we use a small offset, which normally isn't an issue, but might be for correct sampling of small probabilities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost available with Rng::gen_bool(p)
from #308.
src/distributions/poisson.rs
Outdated
lambda: lambda, | ||
exp_lambda: (-lambda).exp(), | ||
log_lambda: lambda.ln(), | ||
magic_val: lambda * lambda.ln() - log_gamma(1.0 + lambda), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why call lambda.ln()
twice? Also, the choice of method could be made here and stored in an enum; this reduces the number of parameters since exp_lambda
and the last two parameters are not used simultaneously. Maybe also cache (2.0 * self.lambda).sqrt()
since SQRT is slow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an oversight, I wouldn't be surprised if the compiler optimised that, though.
And sure, caching the sqrt sounds reasonable.
fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> u64 { | ||
// using the algorithm from Numerical Recipes in C | ||
|
||
// for low expected values use the Knuth method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to use the inverse transform method for small samples, since it only requires 1 random sample? I don't know a lot about this topic unfortunately.
https://en.wikipedia.org/wiki/Poisson_distribution#Generating_Poisson-distributed_random_variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I don't know, my knowledge is also limited. Maybe it is a good idea, sampling just once sounds attractive. I mostly just ported the algorithm from "Numerical Recipes", but it is possible that something else could be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm; unless an expert in this area turns up (unlikely), perhaps the best we can do is implement some tests (e.g. plot a high-resolution histogram), then say this is good enough for now the best we can do.
One small thing to add yet are benchmarks. |
|
||
mod float; | ||
mod integer; | ||
mod log_gamma; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This module also needs the cfg
gate for std-only
I have just gone over the code here and in "Numerical Recipes in C" side by side. It is a little different and uses much clearer function names. But it seems like a clean translation to me. We should get more expert in the various distribution sampling methods, and may want to pick different algorithms in the future. But that will take time, I think this PR is just good to have in the meantime. My result with plotting 100.000 samples of the binomial distribution (using simple spreadsheet): #[test]
fn test_binomial_distr() {
let mut results = [0; 41];
// let binomial = Binomial::new(20, 0.5);
// let binomial = Binomial::new(20, 0.7);
let binomial = Binomial::new(40, 0.5);
let mut rng = ::test::rng(123);
for _ in 0..100_000 {
let sample = binomial.sample(&mut rng);
if sample <= 40 { results[sample as usize] += 1 }
}
let sum = results.iter().sum::<u64>() as f64;
for sample in results.iter() {
println!("{}", *sample as f64 / sum);
}
panic!();
} And the same for Poisson: #[test]
fn test_poisson_distr() {
let mut results = [0; 21];
// let poisson = Poisson::new(1.0);
// let poisson = Poisson::new(4.0);
let poisson = Poisson::new(10.0);
let mut rng = ::test::rng(123);
for _ in 0..100_000 {
let sample = poisson.sample(&mut rng);
if sample <= 20 { results[sample as usize] += 1 }
}
let sum = results.iter().sum::<u64>() as f64;
for sample in results.iter() {
println!("{}", *sample as f64 / sum);
}
panic!();
} It is very primitive, but both look very plausible to me when compared to Wikipedia (Binomial, Poisson) 😄. What is left to finally get this PR over the finish line (after 2+ years)?
@dhardy What do you think of merging this PR, and I make a PR with those tiny fix-ups? |
The distributions are slow at the moment though:
|
Ok. I created a tracker: #310 |
Actually I have a branch ready. |
🎉 |
Add binomial and Poisson distributions
This pull request adds the binomial and Poisson distributions using "step-by-step" generation for low expected values and the rejection method for higher ones.
The code is heavily based on the algorithms presented in "Numerical Recipes in C" - it's obviously not just copy-pasted and the algorithms are rather widely known, but I'm still not sure if that doesn't cause any licensing problems.