Numerical/convergence problems with some models #840
Many thanks for the report and the effort you put into the benchmarking. We are aware of the low effective sample size issue reported some time ago (based on a simple Gaussian model). We recently ran some tests breaking down the NUTS components to see what happens. We found that naive HMC and HMC with dual averaging are very similar to Stan's (based on ESS and adaptation results), while our NUTS does not have comparable ESS despite similar adaptation results. We ended up guessing that this is probably because the NUTS we use is in fact different from Stan's and DynamicHMC's: we use slice sampling to obtain samples from the doubling tree, while Stan and DynamicHMC use multinomial sampling. This led to our WIP PR to change ours to multinomial sampling: TuringLang/AdvancedHMC.jl#79. I will check these models with our multinomial sampling version (when it's available) and see how it works soon.
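Roughly, the two selection rules differ as sketched below (an illustration of the idea, not AdvancedHMC's actual internals):

```julia
using Random

# logw[i]: log density (up to a constant) of the i-th state in the trajectory.

# Slice sampling (original NUTS): draw a slice level under the initial state's
# density, then pick uniformly among the states above that level.
function slice_select(rng::AbstractRNG, logw::Vector{Float64}, init::Int)
    level = logw[init] + log(rand(rng))   # log u, with u ~ Uniform(0, exp(logw[init]))
    eligible = findall(>=(level), logw)   # states inside the slice
    return eligible[rand(rng, 1:length(eligible))]
end

# Multinomial sampling (Stan, DynamicHMC): pick a state with probability
# proportional to exp(logw), stabilised by subtracting the maximum.
function multinomial_select(rng::AbstractRNG, logw::Vector{Float64})
    w = exp.(logw .- maximum(logw))
    p = cumsum(w ./ sum(w))
    return something(findfirst(>=(rand(rng)), p), length(logw))
end
```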
Just a side note: Turing now drops the samples from the adaptation phase by default, so there is no need to discard them manually.
Oops. I think MCMCBenchmarks still throws away the burn-in samples (which means the burn-in samples are now discarded twice). As this is expected to be the default behaviour of Turing in the future, I will create a PR for MCMCBenchmarks.jl.
@itsdfish Regarding 1, I'm getting different results from MCMCBenchmarks.jl and one of the scripts you (or Rob) previously shared with me (diag_test.jl.zip). To be more specific:
Any idea why this happens? PS: I'm running things on the recent release of Turing.jl and this branch of AdvancedHMC.jl (https://github.com/TuringLang/AdvancedHMC.jl/tree/kx/multinomial).
Hi Kai- I think I figured out the problem. In the first set of analyses, summary_sigma_ess_ps.pdf is effective sample size per second for the sigma parameter, which you understandably mistook for raw effective sample size. Currently, we output distributions of effective sample size, but it is probably reasonable to add means as well (or use them instead). In the meantime, you can use this to generate the plot you are looking for:
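Something along these lines works (a sketch; the column names :sampler, :Nd and :mu_ess are assumptions about the results DataFrame):

```julia
using DataFrames, Statistics, StatsPlots

# `results` is the DataFrame produced by the benchmark run; column names assumed.
df = combine(groupby(results, [:sampler, :Nd]), :mu_ess => mean => :mu_ess_mean)
@df df plot(:Nd, :mu_ess_mean, group = :sampler,
            xlabel = "Data size (Nd)", ylabel = "Mean ESS (mu)")
```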
On my system, I obtain something like this: summary_mu_ess.pdf. I hope that helps; please bear with us until we add documentation.
P.S. I should note those figures were generated with the previous version of Turing and the default parameters.
Re: #840 (comment) I see. Thanks for the clarification! This makes sense to me now. Re: #840 (comment) Thanks!
Hi- In case you are unaware, the Poisson regression still fails to converge with the new multinomial sampling. I am using the master branches of Turing and AdvancedHMC because they have bug fixes described here and here:
Thanks for pointing this out again. Can you give me some advice on interpreting the results, i.e. how do you assess convergence for the Poisson regression model?
No problem. Generally speaking, I look for an effective sample size that is at least 30% of the number of saved samples (300 in this case) and rhat < 1.05. That is just a rule of thumb, but the metrics for ESS and rhat above are pretty bad. If it would be helpful, I can provide those metrics for both Stan and Turing, which would give a better idea of how Turing is performing.
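In code, the rule of thumb looks roughly like this (a sketch; the summary column names vary across MCMCChains versions):

```julia
using MCMCChains, DataFrames

# Rule of thumb: ESS at least 30% of the saved samples, and rhat below 1.05.
function converged(chain; ess_frac = 0.3, rhat_max = 1.05)
    n_saved = size(chain, 1) * size(chain, 3)   # iterations × chains
    s = DataFrame(summarystats(chain))          # columns may be :ess/:ess_bulk, :rhat/:r_hat
    all(s.ess .>= ess_frac * n_saved) && all(s.rhat .<= rhat_max)
end
```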
Thanks a lot @itsdfish! I investigated a bit and found it might be due to a pretty dumb reason. As you know, AHMC was slow, so we set the default maximum tree depth of NUTS in Turing to 5 (although it's 10 by default in AHMC, it's overridden in Turing). I changed it to 10 and re-ran the code you provided, which gives me:
Is this more comparable to Stan? If not, some numbers from Stan would be helpful for me to diagnose further. Thanks!
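For reference, the override looks like this (a sketch; `model` is a placeholder and the keyword name may differ across Turing versions):

```julia
using Turing

# max_depth defaulted to 5 in Turing at the time; 10 matches AHMC's and Stan's default.
chain = sample(model, NUTS(1000, 0.65; max_depth = 10), 2000)
```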
No problem. Thanks for looking into this. What you have is a big improvement. However, for Stan, rhat is still lower, and more importantly, ESS is about 50-100% larger in many cases. I used 1000 samples, 1000 warmup samples and four chains, as in the original post:
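For reference, a sketch of the matching CmdStan.jl setup (`poisson_model` holds the Stan code and `stan_data` the data; both are placeholders):

```julia
using CmdStan

# 1000 saved samples, 1000 warmup samples, and four chains.
stanmodel = Stanmodel(name = "poisson", model = poisson_model,
                      nchains = 4, num_samples = 1000, num_warmup = 1000)
rc, samples, cnames = stan(stanmodel, stan_data)
```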
Thanks! Can you also post the other stats?
@itsdfish You can get everything shown really quickly by calling `describe(chain, showall=true)`.
Thanks @cpfiffer. Here are the stats:
Let me know if you need anything else.
It seems that the results vary a lot based on the synthetic data. I can also get something like the results below, which look pretty nice.
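Since the synthetic data drive much of the run-to-run variability, pinning the RNG before generating them makes the runs directly comparable:

```julia
using Random

Random.seed!(2019)   # any fixed seed; ensures the same synthetic data across runs
```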
I wondered the same thing. I ran a benchmark with 20 reps of 1000 samples, 1000 warmup samples, sample size [10,20], and one chain. My plots are not working for some reason, but here are tables of the mean ESS for key parameters:
I also saw a fairly high number of numerical errors (10-15 per model application).
I ran a slightly different benchmark to avoid missing values in the plots. Here are the parameters:
Key results:
I'm not sure why ESS is lower for Turing compared to CmdStan. I suspect the numerical errors might provide a clue. The speed difference between Turing and DynamicHMC might be due to the higher number of allocations in Turing.
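One way to probe the allocation hypothesis is to compare the totals that `@time` reports for the same model under each sampler (`model` is a placeholder):

```julia
using Turing

# `@time` prints total allocations alongside runtime; large gaps between the
# Turing and DynamicHMC runs of the same model would support this explanation.
@time chain = sample(model, NUTS(1000, 0.65), 2000)
```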
The ESS issue for the Poisson model should be resolved when the generalised U-turn PR is merged (TuringLang/AdvancedHMC.jl#91).
The numbers look like the following on my local machine:

6×7 DataFrame
│ Row │ sampler │ Nd │ a0_ess_mean │ a1_ess_mean │ a0_sig_ess_mean │ tree_depth_mean │ epsilon_mean │
│ │ String │ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────────┼───────┼─────────────┼─────────────┼─────────────────┼─────────────────┼──────────────┤
│ 1 │ CmdStanNUTS │ 1 │ 208.297 │ 208.776 │ 272.023 │ 7.24086 │ 0.0161274 │
│ 2 │ AHMCNUTS │ 1 │ 203.079 │ 204.569 │ 274.711 │ 7.23374 │ 0.0157469 │
│ 3 │ CmdStanNUTS │ 2 │ 201.638 │ 202.592 │ 263.016 │ 7.46326 │ 0.0119092 │
│ 4 │ AHMCNUTS │ 2 │ 213.495 │ 215.527 │ 252.686 │ 7.43646 │ 0.0121216 │
│ 5 │ CmdStanNUTS │ 5 │ 174.845 │ 174.25 │ 223.66 │ 7.71896 │ 0.00812359 │
│ 6 │ AHMCNUTS │ 5 │ 200.267 │ 200.766 │ 230.273 │ 7.75796 │ 0.00801717 │
Thanks for the update, Kai. Effective sample size looks much better! I look forward to trying out the changes when they are merged.
It's merged and just waiting for a new release (JuliaRegistries/General#2797).
Is this fixed now with the new AHMC release?
It's quite different from the benchmark I did a few weeks ago using MCMCBenchmarks (see http://xuk.ai/assets/StanCon-AHMC.pdf). Are you using the master branch or something?
Yeah. That is very odd. I obtained the same results with MCMCBenchmarks master and 0.5.1. Looking through the history, the only changes made to Hierachical_Poisson_Models.jl since your last commit were for DynamicHMC, so I don't think a bug was introduced into the model. I looked over the changes to the source code of MCMCBenchmarks, but nothing seemed problematic at first glance. The changes I made did not seem to affect the Gaussian model, because CmdStan and Turing have comparable ESS there. I'll try to look into this more over the weekend.
Julia 1.1.1
I couldn't find a problem with MCMCBenchmarks. I obtained similar results with AdvancedHMC 0.2.3. Here is a minimal working example:
Thanks for the code. The results you shared don't include ESS. Running the MWE you provided gives me:
Is it the same as yours? If not, I suspect your Turing is somehow not using the updated version of AHMC.
Sorry. I should have included the summary (note that the non-stationary trace plot suggests low ESS). Here is my summary:
I believe I have the latest package versions:
This is perplexing.
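One quick sanity check is to print the exact versions in the active environment:

```julia
using Pkg

Pkg.status()   # lists the resolved versions of Turing, AdvancedHMC, etc.
```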
I was using master. Looking over the differences between master and the recent release (v0.6.23...master), it seems the recent release is still using 5 as the default tree depth. @yebai @cpfiffer I think we really need to make a new release. What is stopping us from doing so? Is it updating the documentation to reflect the new interface?
@mohamed82008 can you give #908 another try?
Sure.
@itsdfish This should be fixed by release 0.7. Can you try re-running the benchmarks?
@yebai, you can find the benchmarks within the results subfolders for each model here. Here is a summary of what I found.
|
Many thanks, @itsdfish! These are helpful for planning Turing's next release.
This is more likely caused by the model itself and its parameterisation.
This is consistent with @xukai92's results, and it is good to have them reproduced independently.
We will revisit the issue of switching the default AD engine to reverse mode soon, since Distributions are more compatible with
This is likely from the overhead incurred by Turing's DSL. @mohamed82008 made a lot of progress in making sure model functions are type-stable, which recently led to a leap in Turing's performance. But there is further room for improvement; we'll look into how to reduce Turing's overhead further.
My pleasure. I'm glad it was helpful. I tried Tracker and Zygote in the past without much success: models that required seconds with ForwardDiff required more than an hour with Zygote and Tracker. I'm not sure what the problem was. I can try to find those benchmarks if that information would be helpful. In the meantime, I am amenable to reparameterizing the SDT and Linear Regression models. Please let me know if you or other Turing members have any recommendations.
We started to benchmark Turing and AHMC on different AD backends. So far, what we know is that when there is control flow, Tracker (high-dimensional) and ForwardDiff (low-dimensional) still outperform Zygote.
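A sketch of switching backends for such a comparison (the backend symbols shown are from recent Turing releases and may differ in older ones; `model` is a placeholder):

```julia
using Turing

Turing.setadbackend(:forwarddiff)   # or :tracker, :zygote
chain = sample(model, NUTS(1000, 0.65), 2000)
```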
Any progress on this? I've seen a lot of changes in MCMCBenchmarks.jl, and I wonder what its current status is. BTW, I saw you were discussing a numerical issue in tpapp/DynamicHMC.jl#95; does AHMC also have this issue?
Hi Kai- Thanks for the update. Most of the recent changes to MCMCBenchmarks have been for maintenance and reorganization. As of right now, there is one optimization I can add. I'll try to add it within the next week and update the benchmarks to reflect the changes in Turing and AdvancedHMC. Please let me know if there are more optimizations you would like me to incorporate. Yeah, I have been experiencing several issues with DynamicHMC. I tried re-working the log likelihood function of the LBA to be more numerically stable, but it didn't really solve the problem. I think part of the issue might be related to implementational differences in HMC and how the model is initialized. Unfortunately, I am at an impasse with that particular issue.
Thanks. Maybe you could watch https://github.com/TuringLang/TuringExamples/tree/master/benchmarks, which contains the updated implementations using whatever new features or optimizations we have. In fact, maybe we should figure out a way to keep models in a single place and allow them to be easily reused in different places. Note that our simple benchmark scripts only test the plain computational performance (using static HMC), but MCMCBenchmarks.jl does much more.
IIRC the Turing and DynamicHMC versions share the same LBA-related computation, so I guess there should be no issue there. It would be useful to check whether initialization matters by using the same starting point, to rule out the effect of the first guess.
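A sketch of that check (hypothetical: the keyword for initial values was `init_theta` in Turing releases around this time and has been renamed since, so check your version's docs; `model` and `n_params` are placeholders):

```julia
using Turing

init = zeros(n_params)   # shared starting point for both samplers
chain = sample(model, NUTS(1000, 0.65), 2000; init_theta = init)
```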
Closing now, since AHMC has gained many numerical stability improvements over the past year. Please reopen if more help is needed.
Hi-
While benchmarking various MCMC samplers in Julia, @goedman and I have encountered a few cases in which Turing produces many numerical errors and/or fails to converge. Here are some examples for your consideration. I included model code below for ease of reading.
The first case is the Linear Ballistic Accumulator. Although the chains converge in some cases, including the example below, there are usually between 5 and 10 numerical errors at the beginning. What we found in general is that increasing the sample size from 10 to 100 or more leads to a lower effective sample size (e.g. <200 out of 1,000 for k, tau, v[1] and v[2]) and yields poorer convergence in many cases (rhat between 1.02 and 1.08). It is not clear to me why this is happening or whether it is related to the numerical errors that occur sporadically. In either case, you might consider adding this to your validation routine, because the inter-correlations between parameters make this model a challenging test case.
The second case is a hierarchical Poisson model based on Rob's Statistical Rethinking package. The code below illustrates a common problem in which there are numerical errors and poor convergence. The results do not appear to depend on the delta/target acceptance rate parameter.
Results:
Model Code:
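A minimal Turing sketch of the hierarchical Poisson model described above (priors and data layout are illustrative, not the exact benchmark code; parameter names follow the tables earlier in the thread):

```julia
using Turing

# Log link with population intercept a0, slope a1, and group-level
# intercepts a0s whose scale is a0_sig.
@model function hierarchical_poisson(y, x, group, n_groups)
    a0 ~ Normal(0, 10)
    a1 ~ Normal(0, 1)
    a0_sig ~ truncated(Cauchy(0, 1), 0, Inf)
    a0s ~ filldist(Normal(0, a0_sig), n_groups)
    for i in eachindex(y)
        y[i] ~ Poisson(exp(a0 + a0s[group[i]] + a1 * x[i]))
    end
end

# chain = sample(hierarchical_poisson(y, x, group, n_groups), NUTS(1000, 0.8), 1000)
```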