Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast log gammas #2860

Closed
droazen opened this issue Jun 5, 2017 · 1 comment
Closed

Fast log gammas #2860

droazen opened this issue Jun 5, 2017 · 1 comment

Comments

@droazen
Copy link
Contributor

droazen commented Jun 5, 2017

@davidbenjamin commented on Tue Jun 07 2016

Calculating log gammas is an expensive part of the allele fraction model. We could speed this with negligible loss of accuracy by caching a few tens of thousands of values from 0 to 100 or 1000 and using linear interpolation.

This issue can be closed by implementing such caching or by showing that it doesn't significantly improve performance.


@davidbenjamin commented on Tue Jun 07 2016

The JVM running on my laptop does 10 million log gammas per second, which is about three times as expensive as logarithms. The allele fraction model needs to calculate 4 log gammas per het, so if you have 25,000 hets all the log gammas in the model likelihood take 1/100 of a second.

To get MLEs for each parameter (minor allele fractions, outlier probability etc) might require 100 evaluations each, so we're probably dealing with 10 seconds of log gammas per iteration to find the posterior mode. Getting a few hundred MCMC samples is probably more expensive but roughly comparable.

These numbers are manageable but get expensive when we relearn the model at every iteration of segment merging. In my opinion it makes sense to come back to this issue after we have a new segmentation strategy. We'll see how pressing it is then.


@samuelklee commented on Wed Jun 08 2016

To clarify, I think this is primarily an issue for WGS, where we have ~1.5 million hets. From the logs in /dsde/working/lichtens/wgs/out_case_chip_wgs/acnv/*out it looks like finding the MLE takes ~10 minutes (which is roughly consistent with your estimate), but 200 MCMC iterations takes ~1 hr.

Naive profiling of the AlleleFractionModeller tests suggests that around ~60% CPU is going toward log gammas, so if we can improve on this I think it might be an easy win. But we should perhaps profile more carefully.

However, I agree that changing our segmentation is more pressing! Note that oversegmentation (typically 1000+ segments) hurts us by both by increasing the number of MAF parameters and by increasing the number of similar-segment merge iterations required to smooth things (looks like the WGS samples hit the limit of 25 merge iterations = ~25 hrs). Turning off refitting between iterations helps, perhaps at the cost of smoothness of the final result, but you're still looking at 2+ hours for the initial and final fit.

Just to note, other possibilities for cutting down the runtime include trimming down the number of hets for WGS, changing similar-segment merging so that we can locally refit only the MAF for the newly created segment, ditching MCMC, etc.

@LeeTL1220 can you make the plots for your WGS runs so we can see what these hets look like?


@davidbenjamin commented on Thu Jun 09 2016

Ahhhhh, I get it!

@samuelklee
Copy link
Contributor

Subsampling seems to be the way to go, see #2858. For the record, I did try to implement caching, but this results in excessive cache checking. In general, I think a better solution is to structure code so that expensive global quantities are not unnecessarily recomputed locally.

At some point, this sort of undesirable recomputation snuck in during a refactoring of the allele-fraction likelihood code, probably when we tried to make the method for computing site likelihoods pull double duty based on the presence or absence of an allelic PoN. With an allelic PoN, we need to compute a log gamma at each site based on the site-specific bias hyperparameters; without a PoN, we only need to do this once for all sites, since the bias hyperparameters are now global, but the code naively recomputes it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants