
Parallelizing the benchmarks #11

Closed
itsdfish opened this issue May 21, 2019 · 30 comments

@itsdfish (Collaborator)

Hi Rob-

I put together two versions for running the simulations in parallel. I recommend the version on branch parallel_2, as it is simpler and more efficient. You can find the example in the Examples folder. I defined a parallel version of benchmark as follows:

 function pbenchmark(samplers, simulate, Nd, Nreps=100)
     results = DataFrame()
     # run the full set of samplers for a single data size nd
     pfun(nd) = benchmark!(samplers, results, simulate, Nreps; Nd=nd, Nsamples=2000, Nadapt=1000, delta=.8)
     # farm each level of Nd out to its own worker and stack the returned DataFrames
     presults = pmap(nd -> pfun(nd), Nd)
     return vcat(presults...)
 end

Each level of Nd runs on a separate processor, so it should speed up the simulation without interfering with the time measurements. One of the problems with parallel_1 was that it was only as fast as the slowest sampler, which is Turing. This approach is faster and simpler in terms of setting a random seed. The important thing is to compile Stan beforehand so that its compile time is not included in the benchmarks, but that is already the case.
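For reference, here is a rough sketch of how pbenchmark would be driven from a top-level script; the worker count, data sizes, and the samplers/simulate objects are placeholders rather than the exact definitions in the Examples folder:

using Distributed
addprocs(3)                                 # roughly one worker per level of Nd

@everywhere using MCMCBenchmarks
@everywhere Turing.turnprogress(false)      # keep progress output from polluting the timings

# `samplers` (the tuple of NUTS configurations) and `simulate` (the data generator)
# come from the model file that is included on every worker.
Nd = [10, 100, 1000]                        # data sizes; each level runs on its own worker
results = pbenchmark(samplers, simulate, Nd, 50)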

I made two other minor changes. I removed the print setting from the runSampler function and put it in the top-level script: @everywhere Turing.turnprogress(false). I also put a call to allowmissing!(results) in the sampler loop because the previous setup was order-dependent (e.g. it expected epsilon to already be in the results DataFrame, which may not always be true). Unfortunately, allowmissing! does not work with an empty DataFrame, so it has to go inside the loop. Let me know if you would like me to merge with master.

@goedman (Member) commented May 21, 2019

Great! Go ahead and merge parallel_2 into master.

@itsdfish (Collaborator, Author)

Sounds good. I merged to master.

@goedman (Member) commented May 21, 2019

Testing benchmark() on OSX required something like ProjDir = @__DIR__; cd(ProjDir) in the first @everywhere block (otherwise it can't load the models file).
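In other words, something like the following near the top of the script (the model file name is the Gaussian example; treat the exact layout as a sketch of the fix rather than the final script):

@everywhere begin
    ProjDir = @__DIR__          # or a hard-coded project path if @__DIR__ misbehaves on a worker
    cd(ProjDir)
    include(joinpath(ProjDir, "Gaussian_Models.jl"))   # load the models file on every worker
end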

A bit further down, just under #pyplot, there is a line cd(pwd) that doesn't work (I think, on OSX at least), and inside stancode read_samples() fails because it can't find the correct tmp dir.

If I update that statement to cd(ProjDir) it starts to run, although the very first time I get:

ERROR: LoadError: On worker 2:
ArgumentError: cannot parse "0.046591 seconds (Sampling)" as Float64
_parse_failure at ./parse.jl:368 (repeats 2 times)
#tryparse_internal#353 at ./parse.jl:364 [inlined]

which I haven't completely figured out yet. It could be a race condition on a .csv file, e.g. if multiple calls to stancode write to the same .csv file.

I'll let it run a bit and see what the results df will look like.

@itsdfish (Collaborator, Author)

Interesting. I cannot reproduce this on Ubuntu yet. Perhaps one way to test this hypothesis is to set Nd = [10] so that only one instance of CmdStan runs.

@itsdfish (Collaborator, Author)

I performed a different test on my system. I set Nd = [1,1,1] to increase the chance of a race condition. It produced this error:

On worker 3:
UndefVarError: idx not defined
read_samples at /home/dfish/.julia/packages/CmdStan/LQpQf/src/utilities/read_samples.jl:90
read_samples at /home/dfish/.julia/packages/CmdStan/LQpQf/src/utilities/read_samples.jl:20 [inlined]

@itsdfish (Collaborator, Author)

After running the code several times, I was able to produce a similar error. I wonder if it is possible to deactivate the writing or, if not, whether it is possible to write to separate files?

On worker 3:
ArgumentError: cannot parse "1.32748-1.46964" as Float64
_parse_failure at ./parse.jl:370 (repeats 2 times)
#tryparse_internal#349 at ./parse.jl:366 [inlined]
tryparse_internal at ./parse.jl:364 [inlined]
#parse#350 at ./parse.jl:376 [inlined]
parse at ./parse.jl:376 [inlined]
_broadcast_getindex_evalf at ./broadcast.jl:578 [inlined]
_broadcast_getindex at ./broadcast.jl:561 [inlined]
getindex at ./broadcast.jl:511 [inlined]
macro expansion at ./broadcast.jl:843 [inlined]
macro expansion at ./simdloop.jl:73 [inlined]
copyto! at ./broadcast.jl:842 [inlined]
copyto! at ./broadcast.jl:797 [inlined]
copy at ./broadcast.jl:773 [inlined]
materialize at ./broadcast.jl:753 [inlined]
read_samples at /home/dfish/.julia/packages/CmdStan/LQpQf/src/utilities/read_samples.jl:82
read_samples at /home/dfish/.julia/packages/CmdStan/LQpQf/src/utilities/read_samples.jl:20 [inlined]
#stan#4 at /home/dfish/.julia/packages/CmdStan/LQpQf/src/main/stancode.jl:230
#stan at ./tuple.jl:0
#runSampler#14 at /home/dfish/.julia/dev/MCMCBenchmarks/src/MCMCBenchmarks.jl:104
#runSampler at ./none:0 [inlined]
macro expansion at ./util.jl:213 [inlined]
#benchmark!#10 at /home/dfish/.julia/dev/MCMCBenchmarks/src/MCMCBenchmarks.jl:77
#benchmark! at ./none:0 [inlined]
pfun at /home/dfish/.julia/dev/MCMCBenchmarks/Models/Gaussian/Gaussian_Models.jl:113 [inlined]
#62 at /home/dfish/.julia/dev/MCMCBenchmarks/Models/Gaussian/Gaussian_Models.jl:114
#112 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/process_messages.jl:269
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/process_messages.jl:56
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/process_messages.jl:269 [inlined]
#111 at ./task.jl:259
(::getfield(Base, Symbol("##696#698")))(::Task) at asyncmap.jl:178
foreach(::getfield(Base, Symbol("##696#698")), ::Array{Any,1}) at abstractarray.jl:1866
maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Array{Int64,1}) at asyncmap.jl:178
#async_usemap#681 at asyncmap.jl:154 [inlined]
#async_usemap at none:0 [inlined]
#asyncmap#680 at asyncmap.jl:81 [inlined]
#asyncmap at none:0 [inlined]
#pmap#213(::Bool, ::Int64, ::Nothing, ::Array{Any,1}, ::Nothing, ::Function, ::Function, ::WorkerPool, ::Array{Int64,1}) at pmap.jl:126
pmap(::Function, ::WorkerPool, ::Array{Int64,1}) at pmap.jl:101
#pmap#223(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Array{Int64,1}) at pmap.jl:156
pmap at pmap.jl:156 [inlined]
pbenchmark(::Tuple{CmdStanNUTS{Stanmodel},AHMCNUTS{typeof(AHMCGaussian),Turing.Inference.NUTS{Turing.Core.ForwardDiffAD{40},Union{}}},DHMCNUTS{typeof(sampleDHMC),Int64}}, ::Function, ::Array{Int64,1}, ::Int64) at Gaussian_Models.jl:114
top-level scope at none:0

@goedman (Member) commented May 21, 2019

Setting Nd = [10] solves it, so I think the hypothesis is likely correct. The idx error also looks possible if the number of names on that first non-commented line is mangled.

A solution could be to separate the tmp dirs or to use a more elaborate naming scheme for the .csv files.

What do you mean by 'deactivating the writing'?

@itsdfish (Collaborator, Author)

I was wondering if there was a way to prevent Stan from writing the output files, since they are not necessary for our purposes. That is one way to prevent the race conditions. Another way would be a try/catch, but that might interfere with the timing to some degree.

If all else fails, I had an idea similar to yours: write to separate subdirectories within tmp. That would involve creating the subdirectories and modifying the tmpdir field of the Stan configuration object. That leaves us with (1) preventing Stan from writing the files if possible, (2) a try/catch statement if possible, (3) creating separate directories, or (4) some other naming scheme. Do any of these seem better from your perspective?
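For what it's worth, a rough sketch of what (3) could look like, assuming the sampler wrapper exposes the Stanmodel as s.model and that its tmpdir field can simply be reassigned (both are assumptions, not verified against CmdStan.jl):

function isolate_tmpdir!(s, id)                      # hypothetical helper, called once per worker
    dir = joinpath(s.model.tmpdir, "worker_$id")
    isdir(dir) || mkpath(dir)
    s.model.tmpdir = dir                             # each worker now writes its own .csv files
    return s
end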

@itsdfish (Collaborator, Author) commented May 22, 2019

If I understand correctly, rstan provides sample_file (see also diagnostic_file), which will not store a copy of the samples if a directory is not provided (or set to NULL). Would it be possible to add this feature to CmdStan.jl?

Also, it appears that samples are saved to the csv file periodically during sampling. I think that is another reason to have an option not to save. It may interfere with time measurement.

itsdfish reopened this May 22, 2019
@goedman (Member) commented May 22, 2019

Where would we get ess and r_hat values from in that case?

@goedman (Member) commented May 22, 2019

Given that dHMC and CmdStan are orders of magnitude faster than Turing (today), we could consider running just those for timing purposes. But let me think a bit more about what other options there are.

@itsdfish (Collaborator, Author)

Good question. Based on your comment, it seems like the ess and r_hat calculations are taken from the sample output files. Is that true? I assumed that they were accumulated in an internal array in CmdStan.jl. If they are calculated from the sample output files, I suppose we would need some other solution or some way of accumulating the samples inside CmdStan.

Which subset of samplers were you suggesting we include? I think including Turing in particular is a good idea precisely because it performs so poorly, at least right now. CmdStan is good to include because it's more or less the gold standard. Ideally, it would be nice to benchmark all of the major NUTS samplers to keep track of improvements/regressions and to help other people choose packages. In the worst-case scenario, I think we can do that on a single core.

@goedman (Member) commented May 22, 2019

Maybe they are accumulated in cmdstan (the C++ binary), but I doubt it. The cmdstan binary creates the .csv files, and after the fact you can call a secondary binary, stansummary, to generate the summary. Stansummary reads the .csv files back in to create that summary. CmdStan.jl essentially orchestrates setting up the input data, the Stan language program, the compilation phases, and the subsequent execution of cmdstan. RStan and PyStan use the C++ API directly and store the chains in appropriate R and Python structures while the simulation takes place; CmdStan.jl creates the Julia Chains only after cmdstan has completed.

What I meant is not to drop Turing, but that we could maybe accept the time it takes to perform an extra run of CmdStan and dHMC just to obtain the timing info (without reading the output files back in).

@itsdfish (Collaborator, Author)

Gotcha. Sounds good.

@goedman (Member) commented May 22, 2019

One question slowly bubbling to the surface is whether running in parallel is a high priority for MCMCBenchmarks right now. I think the current sequential setup takes somewhat longer, but it does produce interesting snapshots for different mcmc options.

At some point I would like to go back and look into using BenchmarkTools because over time we could benefit from their improvements, including parallelism.

@itsdfish (Collaborator, Author)

Good question. I don't think parallelism is a high priority or even necessary; it is more of a bonus feature. Considering that there isn't an elegant solution at the moment, I might eventually try one that involves separate subdirectories for each instance of CmdStan. I have a pretty good idea of how that might work.

The biggest roadblock I encountered with BenchmarkTools is that it does not collect output from the benchmarked function, which we need for ess and the like. I also noticed that it produces results similar to @time for longer-running functions. Do you foresee a solution to that problem?

@goedman (Member) commented May 22, 2019

What I had in mind with running the extra simulation is:

# chris_test_5a.jl

using MCMCBenchmarks

Random.seed!(38445)

ProjDir = @__DIR__
cd(ProjDir)

# Define user likelihood distribution

import Distributions: logpdf, pdf
mutable struct mydist{T1,T2} <: ContinuousUnivariateDistribution
    μ::T1
    σ::T2
end

function pdf(dist::mydist, x::Float64)
  @unpack μ,σ=dist
  pdf(Normal(μ,σ), x)
end

struct ChrisProblem5a{TY <: AbstractVector}
    "Observations."
    y::TY
end;

# Very constrained prior on μ. Flat σ.

function (problem::ChrisProblem5a)(θ)
    @unpack y = problem   # extract the data
    @unpack μ, σ = θ
    loglikelihood(Normal(μ, σ), y) + logpdf(Normal(0, 1), μ) + 
    logpdf(Truncated(Cauchy(0,1),0,Inf), σ)
end;

# Define problem with data and inits.
function dhmc_bm(data::Dict, Nsamples=2000)
  
  N = data["N"]
  obs = data["y"]
  p = ChrisProblem5a(obs);
  p((μ = 0.0, σ = 2.0))

  # Write a function to return properly dimensioned transformation.

  problem_transformation(p::ChrisProblem5a) =
      as((μ  = as(Real, -25, 25), σ = asℝ₊), )

# Use ForwardDiff for the gradient.

  P = TransformedLogDensity(problem_transformation(p), p)
  #∇P = LogDensityRejectErrors(ADgradient(:ReverseDiff, P));
  ∇P = LogDensityRejectErrors(ADgradient(:ForwardDiff, P));

# Sample from the posterior.

  chain, NUTS_tuned = NUTS_init_tune_mcmc(∇P, Nsamples, 
    report=ReportSilent());

  # Undo the transformation to obtain the posterior from the chain.

  posterior = TransformVariables.transform.(Ref(problem_transformation(p)), get_position.(chain));
  
  # Set variable names; this will be automated using θ

  parameter_names = ["μ", "σ"]

  # Create a3d

  a3d = Array{Float64, 3}(undef, Nsamples, 2, 1);
  for i in 1:Nsamples
    a3d[i, 1, 1] = values(posterior[i][1])
    a3d[i, 2, 1] = values(posterior[i][2])
  end

  chns = MCMCChains.Chains(a3d,
    parameter_names,
    Dict(
      :parameters => parameter_names,
    )
  );
  
  chns
end

BenchmarkTools.DEFAULT_PARAMETERS.samples = 50

Ns = [100, 500, 1000]

chns = Vector{MCMCChains.Chains}(undef, length(Ns))
t = Vector{BenchmarkTools.Trial}(undef, length(Ns))

for (i, N) in enumerate(Ns)
  data = Dict("y" => rand(Normal(0,1),N), "N" => N)
  t[i] = @benchmark dhmc_bm($data, $N)
  chns[i] = dhmc_bm(data, N)
end

t[1] |> display
println()
t[end] |> display

describe(chns[end])

What's interesting is that @benchmark chooses the number of simulation runs based on the actual time variation observed, e.g. more samples, longer times, fewer runs needed:

julia> include("/Users/rob/.julia/dev/MCMCBenchmarks/benchmarks/initial/dHMC/dHMC_bm_01.jl")
BenchmarkTools.Trial: 
  memory estimate:  9.25 MiB
  allocs estimate:  159467
  --------------
  minimum time:     21.915 ms (0.00% GC)
  median time:      25.526 ms (0.00% GC)
  mean time:        26.690 ms (8.63% GC)
  maximum time:     33.933 ms (0.00% GC)
  --------------
  samples:          50
  evals/sample:     1

BenchmarkTools.Trial: 
  memory estimate:  14.93 MiB
  allocs estimate:  261547
  --------------
  minimum time:     175.151 ms (0.00% GC)
  median time:      206.505 ms (2.51% GC)
  mean time:        204.424 ms (2.09% GC)
  maximum time:     248.813 ms (2.06% GC)
  --------------
  samples:          25
  evals/sample:     1

@itsdfish (Collaborator, Author)

Interesting. Are you suggesting this as a replacement or a supplement to generate quick benchmarks? As a replacement you would lose rich information about the variation and correlation structure of the metrics.

Are you sure that it is based on time variation? My understanding is that there is a default maximum time of 5 seconds. See:

dump(BenchmarkTools.DEFAULT_PARAMETERS)

0.204 * 25 ≈ 5 seconds

@goedman (Member) commented May 22, 2019

Well, you're definitely right that it simply stopped at the 5 second limit!

I wasn't aiming at making them faster, just at making sure compilation times etc. are dealt with properly for the timing info while still having all the info we need to update the results DataFrame when calling updateResults!().

@itsdfish (Collaborator, Author)

Ok. That makes sense.

In my experience, @time gives similar times to @benchmark for long-running functions. This might be worth testing before making a large investment.

@itsdfish (Collaborator, Author) commented May 22, 2019

What do you think about using @timed? It returns the computed value, run time, garbage collection time, number of allocations, and memory usage. It seems like this would give us most of the metrics with little effort or modification.
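For example (sum(rand(10^7)) just stands in for a call to runSampler; Base.gc_alloc_count converts the GC stats into an allocation count the way @time does):

val, elapsed, bytes, gctime, gcstats = @timed sum(rand(10^7))
mb     = bytes / 1_000_000                # megabytes allocated
gcpct  = 100 * gctime / elapsed           # percent of run time spent in garbage collection
allocs = Base.gc_alloc_count(gcstats)     # number of allocations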

@itsdfish (Collaborator, Author) commented May 23, 2019

Hi Rob-

After reading through several sources on benchmarking, including this blog, my understanding is that we need to do two things to get valid benchmarks:

  1. run the relevant code inside a function or use interpolation (e.g. $) with @benchmark (see the short illustration after this list)
  2. run the relevant code repeatedly to average out noise
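A minimal illustration of point 1: without interpolation the global is part of what gets measured, with $ only the call itself is.

using BenchmarkTools
data = rand(1000)
@benchmark sum(data)      # `data` is a non-constant global, so extra dispatch/allocation overhead is included
@benchmark sum($data)     # interpolated: benchmarks the call on a concretely typed value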

Our benchmarks are good on that front. After giving it careful thought, it seems like BenchmarkTools is not the right tool for our purposes and, as far as I can tell, it doesn't add much. However, I think your more general idea of including other performance metrics is really useful. I added a branch called performance, which includes garbage collection time, megabytes allocated, number of allocations, etc. One of the interesting findings is that Turing spends more time in garbage collection for some reason. Here is one of the plots:
density_gctime.pdf

Here is the gc percentage (CmdStan only has gc for the Julia overhead, of course):

density_gcpercent.pdf

I'm not sure whether gc should be a constant amount or a percentage of total time when code is optimized.

@itsdfish (Collaborator, Author)

P.S. I ran out of time before work, but my preliminary inspection suggested that Turing has a really high number of memory allocations. This could explain its performance problem. I wonder if something in Turing could be cached or computed in place.

@goedman (Member) commented May 23, 2019

Thanks Chris, yes, you are right. It's just that I like to explore several different lines of thought before committing. I need to complete the LinearRegression example.

@itsdfish (Collaborator, Author)

Exploring multiple options is prudent from a design perspective. Unless you have any concerns, I can merge the new metrics when I return home. I can also add some graphs for the new metrics.

@goedman (Member) commented May 23, 2019

Yes, by all means. I suggest we remove the old benchmark, performance and diagnostics folders.

@itsdfish (Collaborator, Author) commented May 23, 2019

It just occurred to me that the solution to the original problem is very simple. Each file is named after the model name, so I should be able to modify the model name based on the processor id, e.g. in modifyConfig!:

s.model.name = string(s.modelname,myid())

I can't believe I missed that! I think there might be one small complication with initializing the stan file in tmp, but that should be easy to solve. I'll add those changes later today. I think this should work.
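Roughly what I have in mind for modifyConfig! (field names are from memory, so treat this as a sketch rather than the final code):

# Give each worker its own model name so cmdstan writes distinct files,
# e.g. something like Gaussian2_samples_1.csv on worker 2 instead of every
# worker clobbering the same Gaussian_samples_1.csv.
function modifyConfig!(s::CmdStanNUTS; kwargs...)
    s.model.name = string(s.modelname, myid())
    return s
end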

@itsdfish (Collaborator, Author)

I fixed the race condition problem and it passed a stress test in which Nd = [1,1,1,1].

@goedman (Member) commented May 23, 2019

Very nice Chris! And adding the gc and allocation data is very useful.

It looks like OSX handles some aspects slightly differently (addprocs, including the model file, the directory settings), so I'm trying to come up with a way that works for both of us. Right now I need several edits each time I update.

@itsdfish (Collaborator, Author)

Great. Thanks for fixing that.

I just pushed some improvements to the way jobs are distributed to the processors. The number of reps is now split as evenly as possible across the processors (e.g. reps = [13,13,12,12] for 50 total reps distributed across four processors). Previously one job would run much longer than the others, which was inefficient.
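The split itself is just integer division with the remainder spread over the first few workers; a small sketch of the idea (the function name is illustrative):

function splitreps(Nreps, nworkers)
    reps = fill(div(Nreps, nworkers), nworkers)   # everyone gets the base share
    reps[1:rem(Nreps, nworkers)] .+= 1            # the first few workers absorb the remainder
    return reps                                   # splitreps(50, 4) == [13, 13, 12, 12]
end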

itsdfish added a commit that referenced this issue on Mar 25, 2020: CompatHelper: add new compat entry for "Requires" at version "1.0"