
Numerical/convergence problems with some models #840

Closed

itsdfish opened this issue Jul 6, 2019 · 41 comments

Comments

@itsdfish
Contributor

itsdfish commented Jul 6, 2019

Hi-

While benchmarking various MCMC samplers in Julia, @goedman and I have encountered a few cases in which Turing produces many numerical errors and/or fails to converge. Here are some examples for your consideration. I included model code below for ease of reading.

  1. The first case is the Linear Ballistic Accumulator (LBA). Although the chains converge in some cases, including the example below, there are usually between 5 and 10 numerical errors at the beginning. In general, we found that increasing the sample size from 10 to 100 or more leads to a lower effective sample size (e.g. <200 out of 1,000 for k, tau, v[1] and v[2]) and poorer convergence in many cases (r_hat between 1.02 and 1.08). It is not clear to me why this happens or whether it is related to the numerical errors that occur sporadically. In either case, you might consider adding this model to your validation routine, because the inter-correlations between its parameters make it a challenging test case.

  2. The second case is a hierarchical Poisson model based on Rob's Statistical Rethinking package. The code below illustrates a common problem in which there are numerical errors and poor convergence. The results do not appear to depend on the delta/target acceptance rate parameter.

Results:


│ Row │ parameters │ mean       │ std       │ naive_se   │ mcse      │ ess     │ r_hat   │
│     │ Symbol     │ Float64    │ Float64   │ Float64    │ Float64   │ Any     │ Any     │
├─────┼────────────┼────────────┼───────────┼────────────┼───────────┼─────────┼─────────┤
│ 1   │ a0         │ 0.711879   │ 0.827281  │ 0.0130805  │ 0.126825  │ 16.0643 │ 2.1534  │
│ 2   │ a0_sig     │ 0.324261   │ 0.0971365 │ 0.00153586 │ 0.0123472 │ 33.0278 │ 1.26012 │
│ 3   │ a0s[1]     │ -0.178984  │ 0.154675  │ 0.00244563 │ 0.0192009 │ 22.3411 │ 1.30793 │
│ 4   │ a0s[2]     │ -0.216737  │ 0.206755  │ 0.00326908 │ 0.0284051 │ 24.3501 │ 1.35725 │
│ 5   │ a0s[3]     │ 0.166555   │ 0.164518  │ 0.00260126 │ 0.0217475 │ 16.6985 │ 1.48057 │
│ 6   │ a0s[4]     │ 0.182247   │ 0.176997  │ 0.00279857 │ 0.0240585 │ 27.9606 │ 1.31519 │
│ 7   │ a0s[5]     │ -0.204573  │ 0.14484   │ 0.00229011 │ 0.0179342 │ 34.9558 │ 1.20347 │
│ 8   │ a0s[6]     │ 0.037053   │ 0.148093  │ 0.00234156 │ 0.0186322 │ 22.4169 │ 1.30494 │
│ 9   │ a0s[7]     │ 0.398469   │ 0.203508  │ 0.00321774 │ 0.0283307 │ 23.4493 │ 1.38911 │
│ 10  │ a0s[8]     │ -0.521049  │ 0.156409  │ 0.00247304 │ 0.0194112 │ 27.6936 │ 1.24186 │
│ 11  │ a0s[9]     │ -0.0622397 │ 0.23819   │ 0.00376612 │ 0.0329554 │ 16.0643 │ 1.70797 │
│ 12  │ a0s[10]    │ 0.0355108  │ 0.155785  │ 0.00246318 │ 0.0200219 │ 22.9582 │ 1.31811 │
│ 13  │ a1         │ 0.544897   │ 0.0981347 │ 0.00155165 │ 0.0149694 │ 16.0643 │ 2.02386 │

  3. The last case is error/warning reporting in DynamicNUTS. When an error occurs, the model crashes. This has made it difficult for us to benchmark DynamicNUTS, because an occasional numerical error may occur while applying the model hundreds of times. A guard along the lines sketched below would let a benchmark loop survive such crashes.
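
A minimal sketch of such a guard (the names are illustrative, not part of the MCMCBenchmarks API):

function sample_with_retry(run_sampler; max_attempts=5)
    # run_sampler is a zero-argument closure that runs one sampler, e.g.
    # () -> sample(lbaModel(data, N, Nc), config)
    for attempt in 1:max_attempts
        try
            return run_sampler()
        catch e
            @warn "sampler crashed; retrying" attempt exception=e
        end
    end
    error("sampler failed after $max_attempts attempts")
end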

Model Code:

  1. LBA:
using Turing,Random,Distributions,Parameters
import Distributions: pdf,logpdf,rand
export LBA,pdf,logpdf,rand
###################################################################################
#                              LBA Functions
###################################################################################
struct LBA{T1,T2,T3,T4} <: ContinuousUnivariateDistribution
    ν::T1
    A::T2
    k::T3
    τ::T4
    σ::Float64
end

Base.broadcastable(x::LBA)=Ref(x)

LBA(;τ,A,k,ν,σ=1.0) = LBA(ν,A,k,τ,σ)

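# Return the index and finishing time of the first accumulator to reach threshold
# (the smallest positive dt); returns (1, -1.0) when no accumulator finishes.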
function selectWinner(dt)
    if any(x->x >0,dt)
        mi,mv = 0,Inf
        for (i,t) in enumerate(dt)
            if (t > 0) && (t < mv)
                mi = i
                mv = t
            end
        end
    else
        return 1,-1.0
    end
    return mi,mv
end

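# Rejection-sample drift rates until at least one is positive (the LBA requires a winner).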
function sampleDriftRates(ν,σ)
    noPositive=true
    v = similar(ν)
    while noPositive
        v = [rand(Normal(d,σ)) for d in ν]
        noPositive = !any(x->x>0,v)
    end
    return v
end

function rand(d::LBA)
    @unpack τ,A,k,ν,σ = d
    b=A+k
    N = length(ν)
    v = sampleDriftRates(ν,σ)
    a = rand(Uniform(0,A),N)
    dt = @. (b-a)/v
    choice,mn = selectWinner(dt)
    rt = τ .+ mn
    return choice,rt
end

function rand(d::LBA,N::Int)
    data = Array{Tuple{Int64,Float64},1}(undef,N)
    for i in 1:N
        data[i] = rand(d)
    end
    return data
end

logpdf(d::LBA,choice,rt) = log(pdf(d,choice,rt))

function logpdf(d::LBA,data::T) where {T<:NamedTuple}
    return sum(logpdf.(d,data...))
end

function logpdf(dist::LBA,data::Array{<:Tuple,1})
    LL = 0.0
    for d in data
        LL += logpdf(dist,d...)
    end
    return LL
end

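# Defective density of (choice c, rt): the winner's density times the losers'
# survivor functions, normalized by the probability of at least one positive drift.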
function pdf(d::LBA,c,rt)
    @unpack τ,A,k,ν,σ = d
    b=A+k; den = 1.0
    rt < τ ? (return 1e-10) : nothing
    for (i,v) in enumerate(ν)
        if c == i
            den *= dens(d,v,rt)
        else
            den *= (1-cummulative(d,v,rt))
        end
    end
    pneg = pnegative(d)
    den = den/(1-pneg)
    den = max(den,1e-10)
    isnan(den) ? (return 0.0) : (return den)
end

logpdf(d::LBA,data::Tuple) = logpdf(d,data...)

function dens(d::LBA,v,rt)
    @unpack τ,A,k,ν,σ = d
    dt = rt-τ; b=A+k
    n1 = (b-A-dt*v)/(dt*σ)
    n2 = (b-dt*v)/(dt*σ)
    dens = (1/A)*(-v*cdf(Normal(0,1),n1) + σ*pdf(Normal(0,1),n1) +
        v*cdf(Normal(0,1),n2) - σ*pdf(Normal(0,1),n2))
    return dens
end

function cummulative(d::LBA,v,rt)
    @unpack τ,A,k,ν,σ = d
    dt = rt-τ; b=A+k
    n1 = (b-A-dt*v)/(dt*σ)
    n2 = (b-dt*v)/(dt*σ)
    cm = 1 + ((b-A-dt*v)/A)*cdf(Normal(0,1),n1) -
        ((b-dt*v)/A)*cdf(Normal(0,1),n2) + ((dt*σ)/A)*pdf(Normal(0,1),n1) -
        ((dt*σ)/A)*pdf(Normal(0,1),n2)
    return cm
end

function pnegative(d::LBA)
    @unpack ν,σ=d
    p=1.0
    for v in ν
        p*= cdf(Normal(0,1),-v/σ)
    end
    return p
end
###################################################################################
#                                Generate Data
###################################################################################
Random.seed!(34)
v=[1.0,1.5,2.0]
A=.8
k=.2
tau=.4
N = 10
Nc = 3
data = rand(LBA(ν=v,A=A,k=k,τ=tau),N)
###################################################################################
#                                Turing Model
###################################################################################
@model lbaModel(data,N,Nc) = begin
    mn=minimum(x->x[2],data)
    tau ~ TruncatedNormal(.4,.1,0.0,mn)
    A ~ TruncatedNormal(.8,.4,0.0,Inf)
    k ~ TruncatedNormal(.2,.3,0.0,Inf)
    v = Vector{Real}(undef,Nc)
    v ~ [Normal(0,3)]
    for i in 1:N
        data[i] ~ LBA(;ν=v,τ=tau,A=A,k=k)
    end
end
Nchains = 4
Nsamples = 2000
Nadapt = 1000
delta = .8
config = NUTS(Nsamples,Nadapt,delta)
chns = reduce(chainscat, map(x->sample(lbaModel(data,N,Nc),config),1:Nchains))
chain = chns[(Nadapt+1):end,:,:]
println(chain)
  2. Hierarchical Poisson:
using Turing,Random,Distributions
###################################################################################
#                                Model Functions
###################################################################################
function simulatePoisson(;Nd=1,Ns=10,a0=1.0,a1=.5,a0_sig=.3,kwargs...)
    N = Nd*Ns
    y = fill(0,N)
    x = fill(0.0,N)
    idx = similar(y)
    i = 0
    for s in 1:Ns
        a0s = rand(Normal(0,a0_sig))
        logpop = rand(Normal(9,1.5))
        λ = exp(a0 + a0s + a1*logpop)
        for nd in 1:Nd
            i+=1
            x[i] = logpop
            idx[i] = s
            y[i] = rand(Poisson(λ))
        end
    end
    return (y=y,x=x,idx=idx,N=N,Ns=Ns)
end

###################################################################################
#                                Generate Data
###################################################################################
Random.seed!(49)
data = simulatePoisson()
###################################################################################
#                                Turing Model
###################################################################################
@model poisson(y,x,idx,N,Ns) = begin
    a0 ~ Normal(0, 10)
    a1 ~ Normal(0, 1)
    a0_sig ~ Truncated(Cauchy(0, 1), 0, Inf)
    a0s = Vector{Real}(undef,Ns)
    a0s ~ [Normal(0, a0_sig)]
    for i ∈ 1:N
        λ = exp(a0 + a0s[idx[i]] + a1*x[i])
        y[i] ~ Poisson(λ)
    end
end

Nchains = 4
Nsamples = 2000
Nadapt = 1000
delta = .8
config = NUTS(Nsamples,Nadapt,delta)
chns = reduce(chainscat, map(x->sample(poisson(data...),config),1:Nchains))
chain = chns[(Nadapt+1):end,:,:]
println(chain)

@xukai92
Member

xukai92 commented Jul 6, 2019

Many thanks for the report and for the effort you put into the benchmarking.

We are aware of the low effective sample size issue reported some time ago (based on a simple Gaussian model). We recently ran some tests breaking down the NUTS components to see what happens. We found that the naive HMC and HMC with dual averaging are very similar to Stan's (based on ESS and adaptation results), while the NUTS does not have comparable ESS despite similar adaptation results. We ended up guessing that it's probably because the NUTS we use is in fact different from Stan's and DynamicHMC's: we use slice sampling to obtain samples from the doubling tree, while Stan and DynamicHMC use multinomial sampling. This led to our WIP PR to change ours to multinomial sampling: TuringLang/AdvancedHMC.jl#79

I will check these models with our multinomial sampling version (when it's available) and see how it works soon.

@xukai92
Member

xukai92 commented Jul 7, 2019

Just a side note: Turing now drops the samples from the adaptation phase by default, so there is no need to do chain = chns[(Nadapt+1):end,:,:] at the end. As far as I checked, this is handled correctly in MCMCBenchmarks.jl. Is that right?
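
E.g., the LBA example above could presumably be reduced to (a sketch reusing the names from the original post):

config = NUTS(Nsamples, Nadapt, delta)
chns = reduce(chainscat, map(_ -> sample(lbaModel(data, N, Nc), config), 1:Nchains))
println(chns)  # adaptation samples are already dropped; no manual slicing needed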

@xukai92
Member

xukai92 commented Jul 7, 2019

Oops. I think it still throws away the burn-in samples (which means burn-in samples are discarded twice). As this is expected to be the default behaviour of Turing in the future, I will create a PR over MCMCBenchmarks.jl.

@xukai92
Member

xukai92 commented Jul 8, 2019

@itsdfish Regarding 1, I'm running into different results on MCMCBenchmarks.jl and one of the scripts you (or Rob) previously shared with me (diag_test.jl.zip). To be more specific:

  • For MCMCBenchmarks.jl, I did AHMCconfig = Turing.NUTS(2000,1000,.85; metricT=AdvancedHMC.DiagEuclideanMetric) to make sure Turing.jl is using the diagonal adaptation, and simply ran the Gaussian model (I commented out some lines, as reported in "How do I suppose to run this repo locally?" StatisticalRethinkingJulia/MCMCBenchmarks.jl#25).
    • I looked at summary_sigma_ess_ps.pdf and summary_mu_ess_ps.pdf (summary.zip), which show Turing's ESS decreasing as the number of samples increases.
  • For diag_test.jl, I changed the number of samples used from 5 to 30 to 300 and ran the script 3 times.
    • I looked at the ESS plots and found that they do increase as the number of samples increases (n=5,30,300.zip).

Any idea why this happens?

PS: I'm running things on the recent release of Turing.jl and this branch of AdvancedHMC.jl (https://github.com/TuringLang/AdvancedHMC.jl/tree/kx/multinomial).

@itsdfish
Contributor Author

itsdfish commented Jul 8, 2019

Hi Kai-

I think I figured out the problem. In the first set of analyses, summary_sigma_ess_ps.pdf is effective sample size per second for the sigma parameter, which you understandably mistook for raw effective sample size. Currently, we output distributions of effective sample size, but it is probably reasonable to also add means (or use them instead). In the meantime, you can use this to generate the plot you are looking for:

ess = plotsummary(results,:Nd,:ess,(:sampler,);save=true,dir=dir)

On my system, I obtain something like this:

summary_mu_ess.pdf
summary_sigma_ess.pdf

I hope that helps and please bear with us until we add documentation.

@itsdfish
Contributor Author

itsdfish commented Jul 8, 2019

P.S. I should note those figures were generated with the previous version of Turing and the default parameters.

@xukai92
Member

xukai92 commented Jul 9, 2019

Re: #840 (comment)

I see. Thanks for the clarification! This makes sense to me now.

Re: #840 (comment)

Thanks!

@itsdfish
Contributor Author

Hi-

In case you are unaware, the Poisson regression still fails to converge with the new multinomial sampling. I am using the master branches of Turing and AdvancedHMC because they include the bug fixes described here and here:

Summary Statistics

│ Row │ parameters │ mean      │ std      │ naive_se   │ mcse      │ ess     │ r_hat   │
│     │ Symbol     │ Float64   │ Float64  │ Float64    │ Float64   │ Any     │ Any     │
├─────┼────────────┼───────────┼──────────┼────────────┼───────────┼─────────┼─────────┤
│ 1   │ a0         │ 0.2888    │ 0.931632 │ 0.0147304  │ 0.122927  │ 34.5533 │ 1.10511 │
│ 2   │ a0_sig     │ 0.354791  │ 0.132084 │ 0.00208843 │ 0.0150491 │ 43.128  │ 1.15244 │
│ 3   │ a0s[1]     │ -0.121081 │ 0.187378 │ 0.0029627  │ 0.0253339 │ 19.5141 │ 1.25898 │
│ 4   │ a0s[2]     │ -0.26234  │ 0.26919  │ 0.00425627 │ 0.0381923 │ 23.1839 │ 1.32593 │
│ 5   │ a0s[3]     │ 0.230442  │ 0.200536 │ 0.00317076 │ 0.0263961 │ 21.5578 │ 1.18891 │
│ 6   │ a0s[4]     │ 0.164603  │ 0.235531 │ 0.00372407 │ 0.0338144 │ 21.3094 │ 1.3637  │
│ 7   │ a0s[5]     │ -0.161985 │ 0.189921 │ 0.00300291 │ 0.0269448 │ 18.9195 │ 1.34745 │
│ 8   │ a0s[6]     │ 0.100176  │ 0.197166 │ 0.00311746 │ 0.0262556 │ 21.0273 │ 1.21602 │
│ 9   │ a0s[7]     │ 0.35988   │ 0.265921 │ 0.00420458 │ 0.0379563 │ 22.7123 │ 1.33589 │
│ 10  │ a0s[8]     │ -0.477994 │ 0.186985 │ 0.0029565  │ 0.0260595 │ 19.9471 │ 1.32371 │
│ 11  │ a0s[9]     │ 0.0519478 │ 0.254021 │ 0.00401642 │ 0.032194  │ 27.8074 │ 1.11512 │
│ 12  │ a0s[10]    │ 0.0844027 │ 0.186093 │ 0.00294239 │ 0.0255003 │ 20.0431 │ 1.23501 │
│ 13  │ a1         │ 0.59155   │ 0.112818 │ 0.00178381 │ 0.0150285 │ 33.7903 │ 1.13195 │

@xukai92
Member

xukai92 commented Aug 1, 2019

Thanks for pointing this out again. Can you give me some advice on interpreting the results, i.e. how do you tell whether the Poisson regression model has converged?

@itsdfish
Contributor Author

itsdfish commented Aug 1, 2019

No problem. Generally speaking, I look for an effective sample size that is at least 30% of the number of saved samples (300 in this case) and r_hat < 1.05. That is just a rule of thumb, but the ess and r_hat metrics above are pretty bad.
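
As a sketch, the rule of thumb can be checked mechanically (assuming a summary table with ess and r_hat columns, as printed above):

# Illustrative rule-of-thumb check; 1000 saved samples per chain here.
passes(ess, r_hat; n_saved=1000) = ess >= 0.3 * n_saved && r_hat < 1.05
passes(16.0643, 2.1534)  # false: the a0 row in the table above fails badly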

If it would be helpful, I can provide those metrics for both Stan and Turing, which would provide a better idea of how Turing is performing.

@xukai92
Member

xukai92 commented Aug 1, 2019

Thanks a lot @itsdfish! I investigated a bit and found it might be due to a pretty dumb reason. As you know, AHMC was slow, so we set the default maximum tree depth of NUTS in Turing to 5 (although it is 10 by default in AHMC, it is overridden in Turing).
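
(In more recent Turing releases this limit is exposed as a keyword argument; assuming such a version, the override looks roughly like:)

# Assumes a Turing version whose NUTS constructor exposes max_depth.
config = NUTS(1000, 0.65; max_depth=10)  # n_adapts, target acceptance rate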

I changed it to 10 and re-run the code you provided, which gives me:

Summary Statistics

│ Row │ parameters │ mean      │ std       │ naive_se   │ mcse       │ ess     │ r_hat   │
│     │ Symbol     │ Float64   │ Float64   │ Float64    │ Float64    │ Any     │ Any     │
├─────┼────────────┼───────────┼───────────┼────────────┼────────────┼─────────┼─────────┤
│ 1   │ a0         │ 0.374652  │ 0.818352  │ 0.0129393  │ 0.0539238  │ 270.994 │ 1.01078 │
│ 2   │ a0_sig     │ 0.324592  │ 0.0997393 │ 0.00157702 │ 0.00347919 │ 628.047 │ 1.00069 │
│ 3   │ a0s[1]     │ -0.116291 │ 0.137415  │ 0.00217273 │ 0.0082362  │ 268.881 │ 1.02334 │
│ 4   │ a0s[2]     │ -0.237583 │ 0.189071  │ 0.00298948 │ 0.0111819  │ 305.1   │ 1.018   │
│ 5   │ a0s[3]     │ 0.230482  │ 0.153759  │ 0.00243114 │ 0.00991924 │ 257.542 │ 1.02266 │
│ 6   │ a0s[4]     │ 0.183472  │ 0.155971  │ 0.00246613 │ 0.00899458 │ 314.569 │ 1.02092 │
│ 7   │ a0s[5]     │ -0.161092 │ 0.123878  │ 0.00195868 │ 0.00736051 │ 238.432 │ 1.02905 │
│ 8   │ a0s[6]     │ 0.0958219 │ 0.144607  │ 0.00228645 │ 0.00883009 │ 273.993 │ 1.02407 │
│ 9   │ a0s[7]     │ 0.38144   │ 0.181887  │ 0.00287588 │ 0.0109348  │ 306.522 │ 1.02002 │
│ 10  │ a0s[8]     │ -0.473231 │ 0.130584  │ 0.00206472 │ 0.00796292 │ 248.984 │ 1.03119 │
│ 11  │ a0s[9]     │ 0.0374965 │ 0.217759  │ 0.00344308 │ 0.0135209  │ 277.704 │ 1.01794 │
│ 12  │ a0s[10]    │ 0.0889275 │ 0.138067  │ 0.00218303 │ 0.00894153 │ 225.447 │ 1.03354 │
│ 13  │ a1         │ 0.580512  │ 0.0967509 │ 0.00152977 │ 0.00626249 │ 280.587 │ 1.0102  │

Is this more comparable to Stan? If not, some numbers from Stan would be helpful for me to diagnose further. Thanks!

@itsdfish
Contributor Author

itsdfish commented Aug 2, 2019

No problem. Thanks for looking into this. What you have is a big improvement. However, Stan's r_hat is still lower and, more importantly, its ESS is about 50-100% larger in many cases. I used 1000 samples, 1000 warmup samples, and four chains, as in the original post:

Object of type Chains, with data of type 1000×20×4 Array{Float64,3}

Iterations        = 1001:2000
Thinning interval = 1
Chains            = 1, 2, 3, 4
Samples per chain = 1000
internals         = lp__, accept_stat__, stepsize__, treedepth__, n_leapfrog__, divergent__, energy__
parameters        = a0, a0s.1, a0s.2, a0s.3, a0s.4, a0s.5, a0s.6, a0s.7, a0s.8, a0s.9, a0s.10, a1, a0_sig

2-element Array{ChainDataFrame,1}

Summary Statistics

│ Row │ parameters │ mean        │ std      │ naive_se   │ mcse       │ ess     │ r_hat   │
│     │ Symbol     │ Float64     │ Float64  │ Float64    │ Float64    │ Any     │ Any     │
├─────┼────────────┼─────────────┼──────────┼────────────┼────────────┼─────────┼─────────┤
│ 1   │ a0         │ 2.91917     │ 1.17409  │ 0.018564   │ 0.0552791  │ 506.497 │ 1.0086  │
│ 2   │ a0_sig     │ 0.391864    │ 0.12055  │ 0.00190606 │ 0.0039154  │ 726.929 │ 1.00551 │
│ 3   │ a0s.1      │ 0.0110985   │ 0.226923 │ 0.00358797 │ 0.0116164  │ 422.175 │ 1.0054  │
│ 4   │ a0s.2      │ -0.00752534 │ 0.183497 │ 0.00290134 │ 0.00973214 │ 493.862 │ 1.00652 │
│ 5   │ a0s.3      │ 0.440896    │ 0.14912  │ 0.00235779 │ 0.00842415 │ 423.263 │ 1.00453 │
│ 6   │ a0s.4      │ 0.0697749   │ 0.158924 │ 0.00251281 │ 0.00883451 │ 398.797 │ 1.00345 │
│ 7   │ a0s.5      │ 0.0722853   │ 0.162671 │ 0.00257205 │ 0.00903263 │ 398.417 │ 1.00382 │
│ 8   │ a0s.6      │ -0.337904   │ 0.260617 │ 0.00412071 │ 0.0131449  │ 427.474 │ 1.00582 │
│ 9   │ a0s.7      │ -0.200556   │ 0.14555  │ 0.00230135 │ 0.00826782 │ 386.612 │ 1.00383 │
│ 10  │ a0s.8      │ 0.500156    │ 0.143458 │ 0.00226827 │ 0.00824504 │ 375.848 │ 1.00383 │
│ 11  │ a0s.9      │ -0.0801571  │ 0.283813 │ 0.00448748 │ 0.0141514  │ 515.564 │ 1.00849 │
│ 12  │ a0s.10     │ -0.505856   │ 0.196079 │ 0.00310029 │ 0.0102691  │ 502.543 │ 1.0072  │
│ 13  │ a1         │ 0.277116    │ 0.132474 │ 0.0020946  │ 0.0062143  │ 495.771 │ 1.00839 │

@xukai92
Member

xukai92 commented Aug 2, 2019

Thanks! Can you also post the other stats (i.e. lp__, accept_stat__, stepsize__, treedepth__, n_leapfrog__, divergent__, energy__)?

@cpfiffer
Member

cpfiffer commented Aug 2, 2019

@itsdfish You can get everything shown really quickly by calling

describe(chain, showall=true)

@itsdfish
Contributor Author

itsdfish commented Aug 2, 2019

Thanks @cpfiffer. Here are the stats:

Summary Statistics

│ Row │ parameters    │ mean       │ std         │ naive_se   │ mcse       │ ess     │ r_hat      │
│     │ Symbol        │ Float64    │ Float64     │ Float64    │ Float64    │ Any     │ Any        │
├─────┼───────────────┼────────────┼─────────────┼────────────┼────────────┼─────────┼────────────┤
│ 1   │ accept_stat__ │ 0.941958   │ 0.080377    │ 0.00127087 │ 0.00162432 │ 4000.0  │ 1.00481    │
│ 2   │ divergent__   │ 0.0        │ 0.0         │ 0.0        │ 0.0        │ NaN     │ NaN        │
│ 3   │ energy__      │ -1.03149e5 │ 3.74854     │ 0.0592696  │ 0.107191   │ 928.861 │ 1.00126    │
│ 4   │ lp__          │ 1.03156e5  │ 2.71593     │ 0.0429426  │ 0.0884147  │ 754.944 │ 1.00319    │
│ 5   │ n_leapfrog__  │ 457.064    │ 358.381     │ 5.6665     │ 8.96849    │ 2417.05 │ 1.00766    │
│ 6   │ stepsize__    │ 0.00463429 │ 0.000334517 │ 5.28918e-6 │ 5.35589e-5 │ 16.0643 │ 1.09619e13 │
│ 7   │ treedepth__   │ 7.91725    │ 1.31371     │ 0.0207716  │ 0.0425757  │ 747.621 │ 1.01743    │

Let me know if you need anything else.

@xukai92
Member

xukai92 commented Aug 2, 2019

It seems that the results vary a lot depending on the synthetic data. I can also get something like the results below, which look pretty nice.

Summary Statistics

│ Row │ parameters │ mean      │ std       │ naive_se   │ mcse       │ ess     │ r_hat    │
│     │ Symbol     │ Float64   │ Float64   │ Float64    │ Float64    │ Any     │ Any      │
├─────┼────────────┼───────────┼───────────┼────────────┼────────────┼─────────┼──────────┤
│ 1   │ a0         │ 0.432889  │ 0.809888  │ 0.0128055  │ 0.0437089  │ 428.974 │ 1.01011  │
│ 2   │ a0_sig     │ 0.326824  │ 0.0956446 │ 0.00151227 │ 0.00349493 │ 795.304 │ 1.00914  │
│ 3   │ a0s[1]     │ -0.11389  │ 0.138769  │ 0.00219413 │ 0.00539469 │ 528.726 │ 1.00073  │
│ 4   │ a0s[2]     │ -0.217714 │ 0.185011  │ 0.00292528 │ 0.00984951 │ 370.074 │ 1.02207  │
│ 5   │ a0s[3]     │ 0.23187   │ 0.152519  │ 0.00241154 │ 0.00667814 │ 542.281 │ 0.999885 │
│ 6   │ a0s[4]     │ 0.200843  │ 0.156146  │ 0.00246889 │ 0.00791065 │ 373.861 │ 1.02153  │
│ 7   │ a0s[5]     │ -0.156917 │ 0.126857  │ 0.00200579 │ 0.00504307 │ 445.973 │ 1.00776  │
│ 8   │ a0s[6]     │ 0.103235  │ 0.146701  │ 0.00231954 │ 0.0057391  │ 539.697 │ 0.999679 │
│ 9   │ a0s[7]     │ 0.402339  │ 0.181189  │ 0.00286486 │ 0.00955511 │ 373.021 │ 1.0222   │
│ 10  │ a0s[8]     │ -0.466462 │ 0.130819  │ 0.00206844 │ 0.0049619  │ 550.583 │ 1.00673  │
│ 11  │ a0s[9]     │ 0.028516  │ 0.215491  │ 0.00340721 │ 0.00934239 │ 515.235 │ 1.00134  │
│ 12  │ a0s[10]    │ 0.0914416 │ 0.143152  │ 0.00226343 │ 0.00607738 │ 494.816 │ 1.0008   │
│ 13  │ a1         │ 0.572657  │ 0.0956253 │ 0.00151197 │ 0.0052389  │ 417.607 │ 1.01326  │

@itsdfish
Contributor Author

itsdfish commented Aug 2, 2019

I wondered the same thing. I ran a benchmark with 20 repetitions of 1000 samples, 1000 warmup samples, sample sizes of [10,20], and one chain. My plots are not working for some reason, but here are tables of the mean ESS for the key parameters:

 by(results,[:sampler,:Ns],:a0_sig_ess=>mean)
  6×3 DataFrame
│ Row │ sampler     │ Ns     │ a0_sig_ess_mean │
│     │ Symbol⍰     │ Int64⍰ │ Float64         │
├─────┼─────────────┼────────┼─────────────────┤
│ 1   │ CmdStanNUTS │ 10     │ 275.658         │
│ 2   │ AHMCNUTS    │ 10     │ 141.072         │
│ 3   │ DHMCNUTS    │ 10     │ 534.693         │
│ 4   │ CmdStanNUTS │ 20     │ 436.461         │
│ 5   │ AHMCNUTS    │ 20     │ 335.877         │
│ 6   │ DHMCNUTS    │ 20     │ 863.51          │

 by(results,[:sampler,:Ns],:a0_ess=>mean)
  6×3 DataFrame
│ Row │ sampler     │ Ns     │ a0_ess_mean │
│     │ Symbol⍰     │ Int64⍰ │ Float64     │
├─────┼─────────────┼────────┼─────────────┤
│ 1   │ CmdStanNUTS │ 10     │ 233.283     │
│ 2   │ AHMCNUTS    │ 10     │ 95.9255     │
│ 3   │ DHMCNUTS    │ 10     │ 804.726     │
│ 4   │ CmdStanNUTS │ 20     │ 168.452     │
│ 5   │ AHMCNUTS    │ 20     │ 135.125     │
│ 6   │ DHMCNUTS    │ 20     │ 921.038     │

 by(results,[:sampler,:Ns],:a1_ess=>mean)
  6×3 DataFrame
│ Row │ sampler     │ Ns     │ a1_ess_mean │
│     │ Symbol⍰     │ Int64⍰ │ Float64     │
├─────┼─────────────┼────────┼─────────────┤
│ 1   │ CmdStanNUTS │ 10     │ 229.268     │
│ 2   │ AHMCNUTS    │ 10     │ 94.719      │
│ 3   │ DHMCNUTS    │ 10     │ 811.847     │
│ 4   │ CmdStanNUTS │ 20     │ 166.287     │
│ 5   │ AHMCNUTS    │ 20     │ 132.555     │
│ 6   │ DHMCNUTS    │ 20     │ 915.751     │

(v1.1) pkg> st Turing
    Status `~/.julia/environments/v1.1/Project.toml`
  [0bf59076] AdvancedHMC v0.2.2 #master (https://github.com/TuringLang/AdvancedHMC.jl.git)
  [31c24e10] Distributions v0.21.1
  [f6369f11] ForwardDiff v0.10.3
  [c7f686f2] MCMCChains v0.3.11
  [4c63d2b9] StatsFuns v0.8.0
  [9f7883ad] Tracker v0.2.2
  [fce5fe82] Turing v0.6.23 #master (https://github.com/TuringLang/Turing.jl.git)
  [e88e6eb3] Zygote v0.3.2

I also saw a fairly high number of numerical errors (10-15 per model application).

@itsdfish
Contributor Author

itsdfish commented Aug 3, 2019

I ran a slightly different benchmark to avoid missings in plots. Here are the parameters:

  • 1000 samples
  • 1000 warmup samples
  • 10 Ns (number of "units")
  • [1,2,5] Nd (data points per unit)
  • 25 repetitions per combination
  • 1 chain

Key results:

  • ESS: Turing < CmdStan < DynamicHMC (note that DynamicHMC uses a different NUTS configuration that may not be optimized across models)
  • Number of allocations for Turing is about 1 order of magnitude higher than for DynamicHMC
  • Speed: Turing < CmdStan < DynamicHMC, with Turing still about 1.5 orders of magnitude slower than DynamicHMC
(v1.1) pkg> st Turing
    Status `~/.julia/environments/v1.1/Project.toml`
  [0bf59076] AdvancedHMC v0.2.2 #master (https://github.com/TuringLang/AdvancedHMC.jl.git)
  [31c24e10] Distributions v0.21.1
  [f6369f11] ForwardDiff v0.10.3
  [c7f686f2] MCMCChains v0.3.11
  [4c63d2b9] StatsFuns v0.8.0
  [9f7883ad] Tracker v0.2.2
  [fce5fe82] Turing v0.6.23 #master (https://github.com/TuringLang/Turing.jl.git)
  [e88e6eb3] Zygote v0.3.2

I'm not sure why ESS is lower for Turing compared to CmdStan. I suspect the numerical errors might provide a clue. The speed difference between Turing and DynamicHMC might be due to the higher number of allocations in Turing.
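
One quick, illustrative way to probe the allocation hypothesis is to benchmark a single log-density evaluation of the LBA model above (a sketch using BenchmarkTools, with illustrative parameter values):

using BenchmarkTools
d = LBA(ν=[1.0, 1.5, 2.0], A=0.8, k=0.2, τ=0.4)
@btime logpdf($d, (1, 0.6))  # reports time and number of allocations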

Results.zip

@xukai92
Member

xukai92 commented Aug 19, 2019

The ESS issue for the Poisson model should be resolved when the generalised U-turn PR is merged (TuringLang/AdvancedHMC.jl#91).

@xukai92
Member

xukai92 commented Aug 19, 2019

The numbers on my local machine look like the following.

6×7 DataFrame
│ Row │ sampler     │ Nd    │ a0_ess_mean │ a1_ess_mean │ a0_sig_ess_mean │ tree_depth_mean │ epsilon_mean │
│     │ String      │ Int64 │ Float64     │ Float64     │ Float64         │ Float64         │ Float64      │
├─────┼─────────────┼───────┼─────────────┼─────────────┼─────────────────┼─────────────────┼──────────────┤
│ 1   │ CmdStanNUTS │ 1     │ 208.297     │ 208.776     │ 272.023         │ 7.24086         │ 0.0161274    │
│ 2   │ AHMCNUTS    │ 1     │ 203.079     │ 204.569     │ 274.711         │ 7.23374         │ 0.0157469    │
│ 3   │ CmdStanNUTS │ 2     │ 201.638     │ 202.592     │ 263.016         │ 7.46326         │ 0.0119092    │
│ 4   │ AHMCNUTS    │ 2     │ 213.495     │ 215.527     │ 252.686         │ 7.43646         │ 0.0121216    │
│ 5   │ CmdStanNUTS │ 5     │ 174.845     │ 174.25      │ 223.66          │ 7.71896         │ 0.00812359   │
│ 6   │ AHMCNUTS    │ 5     │ 200.267     │ 200.766     │ 230.273         │ 7.75796         │ 0.00801717   │

@itsdfish
Contributor Author

Thanks for the update, Kai. Effective sample size looks much better! I look forward to trying out the changes when they are merged.

@xukai92
Member

xukai92 commented Aug 19, 2019

It's merged and just waiting for a new release (JuliaRegistries/General#2797).

@yebai
Member

yebai commented Sep 26, 2019

Is this fixed now with the new AHMC release?

@itsdfish
Contributor Author

I don't think so. A comparison between CmdStan and Turing with AdvancedHMC 0.2.5 shows that the effective sample size problems persist. Here are some examples from a Poisson regression model.

(attached: density_a0_sig_ess, density_a1_ess, and density_a0_ess plots)

@xukai92
Member

xukai92 commented Sep 26, 2019

It's quite different from the benchmark I did a few weeks ago using MCMCBenchmarks (see http://xuk.ai/assets/StanCon-AHMC.pdf). Are you using the master branch or something?

@itsdfish
Contributor Author

itsdfish commented Sep 27, 2019

Yeah, that is very odd. I obtained the same results with MCMCBenchmarks master and 0.5.1. Looking through the history, the only changes made to Hierachical_Poisson_Models.jl since your last commit were for DynamicHMC, so I don't think a bug was introduced into the model.

I looked over the changes to the MCMCBenchmarks source code, but nothing seemed problematic at first glance. The changes I made did not appear to affect the Gaussian model, for which CmdStan and Turing have comparable ESS.

I'll try to look into this more over the weekend.

(v1.1) pkg> st MCMCBenchmarks
    Status `~/.julia/environments/v1.1/Project.toml`
  [0bf59076] AdvancedHMC v0.2.5
  [6e4b80f9] BenchmarkTools v0.4.3
  [336ed68f] CSV v0.5.12
  [593b3428] CmdStan v5.2.0
  [a93c6f00] DataFrames v0.19.4
  [31c24e10] Distributions v0.21.1
  [bbc10e6e] DynamicHMC v2.1.0
  [f6369f11] ForwardDiff v0.10.3
  [6fdf6af0] LogDensityProblems v0.9.1
  [72ce6e51] MCMCBenchmarks v0.5.1
  [c7f686f2] MCMCChains v0.3.14
  [d96e819e] Parameters v0.12.0
  [189a3867] Reexport v0.2.0
  [295af30f] Revise v2.2.0
  [276daf66] SpecialFunctions v0.7.2
  [f3b207a7] StatsPlots v0.12.0
  [84d833dd] TransformVariables v0.3.6
  [fce5fe82] Turing v0.6.23

Julia 1.1.1

@itsdfish
Contributor Author

I couldn't find a problem with MCMCBenchmarks. I obtained similar results with AdvancedHMC 0.2.3. Here is a minimal working example:

using Turing, StatsPlots, Distributions, SpecialFunctions, Random
import Distributions: logpdf

function simulatePoisson(; Nd=1, Ns=10, a0=1.0, a1=.5, a0_sig=.3, kwargs...)
  N = Nd * Ns
  y = fill(0, N)
  x = fill(0.0, N)
  idx = similar(y)
  i = 0
  for s in 1:Ns
    a0s = rand(Normal(0, a0_sig))
    logpop = rand(Normal(9, 1.5))
    λ = exp(a0 + a0s + a1 * logpop)
    for nd in 1:Nd
      i += 1
      x[i] = logpop
      idx[i] = s
      y[i] = rand(Poisson(λ))
    end
  end
  return (y=y, x=x, idx=idx, N=N, Ns=Ns)
end

struct LogPoisson{T<:Real} <: DiscreteUnivariateDistribution
    logλ::T
end

function logpdf(lp::LogPoisson, k::Int)
    return k * lp.logλ - exp(lp.logλ) - lgamma(k + 1)
end

@model AHMCpoisson(y, x, idx, N, Ns) = begin
  a0 ~ Normal(0, 10)
  a1 ~ Normal(0, 1)
  a0_sig ~ Truncated(Cauchy(0, 1), 0, Inf)
  a0s ~ MvNormal(zeros(Ns), a0_sig)
  for i ∈ 1:N
    λ = a0 + a0s[idx[i]] + a1 * x[i]
    y[i] ~ LogPoisson(λ)
  end
end

Random.seed!(39)

data = simulatePoisson(;Ns = 10)

chain = sample(AHMCpoisson(data...),NUTS(2000,1000,.8))

println(chain)

plot(chain[:a0])

(attached: Poisson trace plot of a0)

@xukai92
Member

xukai92 commented Sep 27, 2019

Thanks for the code. The results you shared don't include ESS. Running the MWE you provided gives me:

Object of type Chains, with data of type 1000×22×1 Array{Union{Missing, Real},3}

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
internals         = acceptance_rate, hamiltonian_energy, is_accept, log_density, lp, n_steps, numerical_error, step_size, tree_depth
parameters        = a0, a0_sig, a0s[1], a0s[2], a0s[3], a0s[4], a0s[5], a0s[6], a0s[7], a0s[8], a0s[9], a0s[10], a1

2-element Array{ChainDataFrame,1}

Summary Statistics

│ Row │ parameters │ mean       │ std      │ naive_se   │ mcse       │ ess     │ r_hat    │
│     │ Symbol     │ Float64    │ Float64  │ Float64    │ Float64    │ Any     │ Any      │
├─────┼────────────┼────────────┼──────────┼────────────┼────────────┼─────────┼──────────┤
│ 1   │ a0         │ 0.895491   │ 1.12556  │ 0.0355933  │ 0.0642669  │ 215.637 │ 0.999022 │
│ 2   │ a0_sig     │ 0.357993   │ 0.110865 │ 0.00350587 │ 0.00676386 │ 288.628 │ 1.00146  │
│ 3   │ a0s[1]     │ -0.142888  │ 0.129212 │ 0.00408605 │ 0.0075781  │ 236.867 │ 1.00811  │
│ 4   │ a0s[2]     │ 0.467269   │ 0.120416 │ 0.0038079  │ 0.00804799 │ 216.413 │ 1.00967  │
│ 5   │ a0s[3]     │ -0.0813931 │ 0.15794  │ 0.00499452 │ 0.00803419 │ 217.796 │ 1.00461  │
│ 6   │ a0s[4]     │ -0.375486  │ 0.140475 │ 0.0044422  │ 0.00739035 │ 231.58  │ 1.0056   │
│ 7   │ a0s[5]     │ -0.328649  │ 0.152504 │ 0.00482261 │ 0.00799011 │ 218.841 │ 1.00629  │
│ 8   │ a0s[6]     │ 0.401141   │ 0.135664 │ 0.00429008 │ 0.00832431 │ 222.68  │ 1.00878  │
│ 9   │ a0s[7]     │ -0.0521814 │ 0.301167 │ 0.00952373 │ 0.0192478  │ 218.042 │ 1.0003   │
│ 10  │ a0s[8]     │ 0.0837627  │ 0.12595  │ 0.00398288 │ 0.00940983 │ 206.132 │ 1.00909  │
│ 11  │ a0s[9]     │ -0.186746  │ 0.164949 │ 0.00521613 │ 0.0120216  │ 202.941 │ 1.00255  │
│ 12  │ a0s[10]    │ 0.117228   │ 0.124092 │ 0.00392415 │ 0.00842289 │ 218.936 │ 1.01107  │
│ 13  │ a1         │ 0.498655   │ 0.120539 │ 0.00381179 │ 0.00655568 │ 213.745 │ 0.999026 │

Quantiles

│ Row │ parameters │ 2.5%      │ 25.0%      │ 50.0%      │ 75.0%      │ 97.5%     │
│     │ Symbol     │ Float64   │ Float64    │ Float64    │ Float64    │ Float64   │
├─────┼────────────┼───────────┼────────────┼────────────┼────────────┼───────────┤
│ 1   │ a0         │ -1.45633  │ 0.164452   │ 0.955111   │ 1.60549    │ 3.10455   │
│ 2   │ a0_sig     │ 0.212185  │ 0.284172   │ 0.335621   │ 0.405613   │ 0.638829  │
│ 3   │ a0s[1]     │ -0.412466 │ -0.221602  │ -0.142009  │ -0.0622003 │ 0.11225   │
│ 4   │ a0s[2]     │ 0.231944  │ 0.392252   │ 0.463197   │ 0.545739   │ 0.710186  │
│ 5   │ a0s[3]     │ -0.37997  │ -0.191058  │ -0.0732723 │ 0.0245348  │ 0.228448  │
│ 6   │ a0s[4]     │ -0.645226 │ -0.46758   │ -0.368753  │ -0.280846  │ -0.100005 │
│ 7   │ a0s[5]     │ -0.625378 │ -0.438088  │ -0.327296  │ -0.217246  │ -0.028741 │
│ 8   │ a0s[6]     │ 0.109995  │ 0.305765   │ 0.408784   │ 0.49118    │ 0.65865   │
│ 9   │ a0s[7]     │ -0.605775 │ -0.257336  │ -0.0606962 │ 0.13829    │ 0.59724   │
│ 10  │ a0s[8]     │ -0.178198 │ 0.00384925 │ 0.0885033  │ 0.166566   │ 0.330785  │
│ 11  │ a0s[9]     │ -0.49266  │ -0.297873  │ -0.1814    │ -0.0735794 │ 0.126346  │
│ 12  │ a0s[10]    │ -0.121548 │ 0.0320237  │ 0.119097   │ 0.198276   │ 0.357121  │
│ 13  │ a1         │ 0.263909  │ 0.420791   │ 0.493074   │ 0.577344   │ 0.74598   │

Is it the same as yours?

If not, I suspect your Turing is somehow not using the updated version of AHMC?

@itsdfish
Contributor Author

Sorry. I should have included the summary (note that the non-stationary trace plot suggests low ESS).

Here is my summary:

Object of type Chains, with data of type 1000×23×1 Array{Union{Missing, Float64},3}

Log evidence      = 0.0
Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
internals         = acceptance_rate, eval_num, hamiltonian_energy, is_accept, log_density, lp, n_steps, numerical_error, step_size, tree_depth
parameters        = a0, a0_sig, a0s[1], a0s[2], a0s[3], a0s[4], a0s[5], a0s[6], a0s[7], a0s[8], a0s[9], a0s[10], a1

2-element Array{ChainDataFrame,1}

Summary Statistics

│ Row │ parameters │ mean      │ std       │ naive_se   │ mcse      │ ess     │ r_hat   │
│     │ Symbol     │ Float64   │ Float64   │ Float64    │ Float64   │ Any     │ Any     │
├─────┼────────────┼───────────┼───────────┼────────────┼───────────┼─────────┼─────────┤
│ 1   │ a0         │ 0.25734   │ 0.904652  │ 0.0286076  │ 0.180165  │ 25.7848 │ 1.00111 │
│ 2   │ a0_sig     │ 0.385031  │ 0.0934959 │ 0.0029566  │ 0.0195804 │ 22.513  │ 1.05103 │
│ 3   │ a0s[1]     │ -0.118888 │ 0.154196  │ 0.00487609 │ 0.0421908 │ 7.26326 │ 1.01569 │
│ 4   │ a0s[2]     │ 0.487679  │ 0.149209  │ 0.0047184  │ 0.042932  │ 7.49517 │ 0.99988 │
│ 5   │ a0s[3]     │ -0.11255  │ 0.149933  │ 0.00474129 │ 0.0384453 │ 9.50847 │ 1.00325 │
│ 6   │ a0s[4]     │ -0.38551  │ 0.152533  │ 0.00482353 │ 0.0398215 │ 8.28392 │ 1.00414 │
│ 7   │ a0s[5]     │ -0.359556 │ 0.150769  │ 0.00476773 │ 0.0395436 │ 9.23062 │ 1.0063  │
│ 8   │ a0s[6]     │ 0.391066  │ 0.14374   │ 0.00454546 │ 0.0396994 │ 7.97456 │ 1.00353 │
│ 9   │ a0s[7]     │ 0.128352  │ 0.281772  │ 0.0089104  │ 0.0640121 │ 15.0364 │ 1.00179 │
│ 10  │ a0s[8]     │ 0.11543   │ 0.152279  │ 0.00481549 │ 0.0433674 │ 7.43558 │ 1.00167 │
│ 11  │ a0s[9]     │ -0.116702 │ 0.182076  │ 0.00575775 │ 0.045569  │ 9.90864 │ 1.02162 │
│ 12  │ a0s[10]    │ 0.131333  │ 0.151793  │ 0.00480012 │ 0.0441841 │ 7.69638 │ 1.00594 │
│ 13  │ a1         │ 0.564955  │ 0.0932821 │ 0.00294984 │ 0.017341  │ 28.2881 │ 1.00017 │

Quantiles

│ Row │ parameters │ 2.5%      │ 25.0%      │ 50.0%     │ 75.0%       │ 97.5%      │
│     │ Symbol     │ Float64   │ Float64    │ Float64   │ Float64     │ Float64    │
├─────┼────────────┼───────────┼────────────┼───────────┼─────────────┼────────────┤
│ 1   │ a0         │ -1.55721  │ -0.350089  │ 0.280959  │ 0.915745    │ 1.85898    │
│ 2   │ a0_sig     │ 0.22418   │ 0.320691   │ 0.379533  │ 0.443339    │ 0.587711   │
│ 3   │ a0s[1]     │ -0.454146 │ -0.222612  │ -0.103852 │ -0.00940392 │ 0.168154   │
│ 4   │ a0s[2]     │ 0.203723  │ 0.385027   │ 0.491964  │ 0.580074    │ 0.790241   │
│ 5   │ a0s[3]     │ -0.390252 │ -0.212872  │ -0.10848  │ -0.00921342 │ 0.182705   │
│ 6   │ a0s[4]     │ -0.692409 │ -0.495817  │ -0.364626 │ -0.274441   │ -0.121393  │
│ 7   │ a0s[5]     │ -0.627087 │ -0.474513  │ -0.355538 │ -0.248264   │ -0.0676558 │
│ 8   │ a0s[6]     │ 0.123875  │ 0.285813   │ 0.392087  │ 0.491312    │ 0.673782   │
│ 9   │ a0s[7]     │ -0.424979 │ -0.0495103 │ 0.11791   │ 0.320994    │ 0.670225   │
│ 10  │ a0s[8]     │ -0.189626 │ 0.0250555  │ 0.118156  │ 0.216578    │ 0.406702   │
│ 11  │ a0s[9]     │ -0.503479 │ -0.238112  │ -0.113226 │ 0.0132313   │ 0.228342   │
│ 12  │ a0s[10]    │ -0.161768 │ 0.0182823  │ 0.145505  │ 0.236064    │ 0.444482   │
│ 13  │ a1         │ 0.395641  │ 0.501628   │ 0.560841  │ 0.629406    │ 0.754552   │

I believe I have the latest package versions:

(v1.1) pkg> st Turing
    Status `~/.julia/environments/v1.1/Project.toml`
  [0bf59076] AdvancedHMC v0.2.5
  [31c24e10] Distributions v0.21.1
  [f6369f11] ForwardDiff v0.10.3
  [c7f686f2] MCMCChains v0.3.14
  [189a3867] Reexport v0.2.0
  [276daf66] SpecialFunctions v0.7.2
  [4c63d2b9] StatsFuns v0.8.0
  [9f7883ad] Tracker v0.2.3
  [fce5fe82] Turing v0.6.23
  [e88e6eb3] Zygote v0.3.4

This is perplexing.

@xukai92
Member

xukai92 commented Sep 27, 2019

I was using master.
I downgraded to the recent release, v0.6.23.
I confirm this issue does exist on v0.6.23.

Looking over the differences between master and the recent release (v0.6.23...master), it seems the release is still using 5 as the default tree depth.

@yebai @cpfiffer I think we really need to make a new release. What stops us from doing so? Is it the update of the documentation to reflect the new interface?

@yebai
Member

yebai commented Sep 27, 2019

I think we can make a new release once #917, #908 and #915 are merged.

@yebai
Member

yebai commented Sep 27, 2019

@mohamed82008 can you give #908 another try?

@mohamed82008
Member

Sure

@yebai
Member

yebai commented Oct 3, 2019

@itsdfish This should be fixed by release 0.7. Can you try re-running the benchmarks?

@itsdfish
Contributor Author

itsdfish commented Oct 4, 2019

@yebai, you can find the benchmarks within the results subfolders for each model here.

Here is a summary of what I found.

  1. Turing consistently produced numerical errors for linear regression (see below). There were a lot more numerical errors for SDT. Numerical errors were not a problem for the other models.
┌ Warning: The current proposal will be rejected due to numerical error(s).
│   isfiniteθ = true
│   isfiniter = false
│   isfiniteℓπ = false
│   isfiniteℓκ = false
  2. ESS was comparable to Stan for all of the models except linear regression. This difference could be due to the use of MvNormal in Turing and DynamicHMC (but not Stan).

  3. Turing was consistently slower than Stan and was more sensitive to increases in the number of data points and/or parameters. I suspect this is partially a result of reverse vs. forward autodiff.

  4. Turing has a somewhat large startup cost compared to DynamicHMC.

  5. Turing is nearly 2 times slower than DynamicHMC (regardless of sample size) for linear regression. I've found this to be the case for other models not included in MCMCBenchmarks.

@yebai
Member

yebai commented Oct 4, 2019

Many thanks, @itsdfish! These are helpful for planning Turing's next release.

Turing consistently produced numerical errors for linear regression (see below). There were a lot more numerical errors for SDT. Numerical errors were not a problem for the other models.

This is more likely caused by the model itself and its parameterisation.

ESS was comparable to Stan for all of the models except linear regression. This difference could be due to the use of MvNormal in Turing and DynamicHMC (but not Stan).

This is consistent with @xukai92's results, and it is good to have it reproduced independently.

Turing was consistently slower than Stan and was more sensitive to increases in the number of data points and/or parameters. I suspect this is partially a result of reverse vs. forward autodiff.

We will revisit the issue of switching the default AD engine to reverse mode soon, since Distributions is now more compatible with Tracker via DistributionsAD.jl.
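
(For reference, the AD backend can already be switched per session. A sketch, assuming a Turing version that exposes setadbackend; the accepted symbol names have changed across releases:)

Turing.setadbackend(:forwarddiff)  # ForwardDiff; older releases use :forward_diff
Turing.setadbackend(:tracker)      # Tracker reverse mode; older releases use :reverse_diff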

Turing has a somewhat large startup cost compared to DynamicHMC.
Turing is nearly 2 times slower than DynamicHMC (regardless of sample size) for linear regression. I've found this to be the case for other models not included in MCMCBenchmarks.

This is likely due to the overhead incurred by Turing's DSL. @mohamed82008 has made a lot of progress in making sure model functions are type-stable, which recently led to a leap in Turing's performance. But there is further room for improvement. We'll look into how to further reduce Turing's overhead in v0.8.

@itsdfish
Contributor Author

itsdfish commented Oct 4, 2019

My pleasure. I'm glad it was helpful.

I tried Tracker and Zygote in the past without much success. Models that required seconds with ForwardDiff required more than an hour with Zygote and Tracker. I'm not sure what the problem was. I can try to find those benchmarks if that information would be helpful.

In the meantime, I am amenable to reparameterizing the SDT and linear regression models. Please let me know if you or other Turing members have any recommendations.
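
(For concreteness, one standard reparameterization is the non-centered form, shown here on the hierarchical Poisson model from the MWE above; a sketch, not a recommendation from the Turing team:)

@model poissonNC(y, x, idx, N, Ns) = begin
    a0 ~ Normal(0, 10)
    a1 ~ Normal(0, 1)
    a0_sig ~ Truncated(Cauchy(0, 1), 0, Inf)
    a0s_raw ~ MvNormal(zeros(Ns), 1.0)  # unit-scale latents
    a0s = a0_sig .* a0s_raw             # implies a0s ~ Normal(0, a0_sig)
    for i in 1:N
        y[i] ~ LogPoisson(a0 + a0s[idx[i]] + a1 * x[i])
    end
end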

@xukai92
Member

xukai92 commented Feb 16, 2020

I tried Tracker and Zygote in the past without much success. Models that required seconds with ForwardDiff required more than an hour with Zygote and Tracker. I'm not sure what the problem was. I can try to find those benchmarks if that information would be helpful.

We have started to benchmark Turing and AHMC on different AD backends. So far, what we know is that when there is control flow, Tracker (high dimensional) and ForwardDiff (low dimensional) still outperform Zygote.

In the meantime, I am amenable to reparameterizing the SDT and linear regression models. Please let me know if you or other Turing members have any recommendations.

Have you made any progress on this? I've seen a lot of changes in MCMCBenchmarks.jl and wonder what its current status is.

BTW, I saw you were discussing a numerical issue in tpapp/DynamicHMC.jl#95. Does AHMC also have this issue?

@itsdfish
Contributor Author

itsdfish commented Feb 16, 2020

Hi Kai-

Thanks for the update. Most of the recent changes to MCMCBenchmarks have been for maintenance and reorganization. As of right now, there is one optimization I can add. I'll try to add it within the next week and update the benchmarks to reflect the changes in Turing and AdvancedHMC. Please let me know if there are more optimizations you would like me to incorporate.

Yeah. I have been experiencing several issues with DynamicHMC. I tried re-working the log likelihood function of the LBA to be more numerically stable, but it didn't really solve the problem. I think part of the issue might be related to implementational differences in HMC and how the model is initialized. Unfortunately, I am at an impasse with that particular issue.

@xukai92
Member

xukai92 commented Feb 16, 2020

I'll try to add that within the next week and update the benchmarks to reflect changes in Turing and AdvancedHMC. Please let me know if there are more optimizations you would like me to incorporate.

Thanks. Maybe you could watch https://github.com/TuringLang/TuringExamples/tree/master/benchmarks, which contains updated implementations using whatever new features or optimizations we have. In fact, maybe we should figure out a way to keep the models in a single place so they can easily be reused elsewhere.

Note that our simple benchmark scripts only test plain computational performance (using static HMC), whereas MCMCBenchmarks.jl does much more.

I tried re-working the log likelihood function of the LBA to be more numerically stable, but it didn't really solve the problem. I think part of the issue might be related to implementation differences in HMC and how the model is initialized. Unfortunately, I am at an impasse with that particular issue.

IIRC the Turing and DynamicHMC versions share the same LBA-related computation, so I guess there should be no issues there. It would be useful to check whether initialization matters by using the same starting point, which rules out the initial guess. A sketch of that check follows.
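
(A sketch, assuming a Turing version whose sample call accepts a fixed starting point; the keyword has been init_theta in some releases and init_params in newer ones:)

# Illustrative: start the sampler from a fixed point (values hypothetical).
θ0 = [0.4, 0.8, 0.2, 1.0, 1.5, 2.0]  # tau, A, k, v[1], v[2], v[3]
chain = sample(lbaModel(data, N, Nc), NUTS(1000, 0.8), 2000; init_params=θ0)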

@yebai
Member

yebai commented Jan 28, 2021

Closing now, since AHMC has gained many numerical stability improvements over the past year. Please reopen if more help is needed.

@yebai yebai closed this as completed Jan 28, 2021