Fast log gammas #2860

droazen · 2017-06-05T17:49:50Z

@davidbenjamin commented on Tue Jun 07 2016

Calculating log gammas is an expensive part of the allele fraction model. We could speed this with negligible loss of accuracy by caching a few tens of thousands of values from 0 to 100 or 1000 and using linear interpolation.

This issue can be closed by implementing such caching or by showing that it doesn't significantly improve performance.

@davidbenjamin commented on Tue Jun 07 2016

The JVM running on my laptop does 10 million log gammas per second, which is about three times as expensive as logarithms. The allele fraction model needs to calculate 4 log gammas per het, so if you have 25,000 hets all the log gammas in the model likelihood take 1/100 of a second.

To get MLEs for each parameter (minor allele fractions, outlier probability etc) might require 100 evaluations each, so we're probably dealing with 10 seconds of log gammas per iteration to find the posterior mode. Getting a few hundred MCMC samples is probably more expensive but roughly comparable.

These numbers are manageable but get expensive when we relearn the model at every iteration of segment merging. In my opinion it makes sense to come back to this issue after we have a new segmentation strategy. We'll see how pressing it is then.

@samuelklee commented on Wed Jun 08 2016

To clarify, I think this is primarily an issue for WGS, where we have ~1.5 million hets. From the logs in /dsde/working/lichtens/wgs/out_case_chip_wgs/acnv/*out it looks like finding the MLE takes ~10 minutes (which is roughly consistent with your estimate), but 200 MCMC iterations takes ~1 hr.

Naive profiling of the AlleleFractionModeller tests suggests that around ~60% CPU is going toward log gammas, so if we can improve on this I think it might be an easy win. But we should perhaps profile more carefully.

However, I agree that changing our segmentation is more pressing! Note that oversegmentation (typically 1000+ segments) hurts us by both by increasing the number of MAF parameters and by increasing the number of similar-segment merge iterations required to smooth things (looks like the WGS samples hit the limit of 25 merge iterations = ~25 hrs). Turning off refitting between iterations helps, perhaps at the cost of smoothness of the final result, but you're still looking at 2+ hours for the initial and final fit.

Just to note, other possibilities for cutting down the runtime include trimming down the number of hets for WGS, changing similar-segment merging so that we can locally refit only the MAF for the newly created segment, ditching MCMC, etc.

@LeeTL1220 can you make the plots for your WGS runs so we can see what these hets look like?

@davidbenjamin commented on Thu Jun 09 2016

Ahhhhh, I get it!

samuelklee · 2017-10-10T22:02:39Z

Subsampling seems to be the way to go, see #2858. For the record, I did try to implement caching, but this results in excessive cache checking. In general, I think a better solution is to structure code so that expensive global quantities are not unnecessarily recomputed locally.

At some point, this sort of undesirable recomputation snuck in during a refactoring of the allele-fraction likelihood code, probably when we tried to make the method for computing site likelihoods pull double duty based on the presence or absence of an allelic PoN. With an allelic PoN, we need to compute a log gamma at each site based on the site-specific bias hyperparameters; without a PoN, we only need to do this once for all sites, since the bias hyperparameters are now global, but the code naively recomputes it.

droazen mentioned this issue Jun 5, 2017

Fast log gammas broadinstitute/gatk-protected#542

Closed

droazen added Copy Number tools performance labels Jun 5, 2017

droazen assigned davidbenjamin Jun 5, 2017

samuelklee closed this as completed Oct 10, 2017

samuelklee mentioned this issue Dec 5, 2017

Added code and WDL to complete ModelSegments CNV pipeline. #3913

Merged

samuelklee mentioned this issue Jan 14, 2019

Added MinibatchSliceSampler and replaced naive subsampling in ModelSegments. #5575

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fast log gammas #2860

Fast log gammas #2860

droazen commented Jun 5, 2017

samuelklee commented Oct 10, 2017

Fast log gammas #2860

Fast log gammas #2860

Comments

droazen commented Jun 5, 2017

samuelklee commented Oct 10, 2017