Parallelizing the benchmarks #11
Great! Go ahead and merge parallel_2 into master.
Sounds good. I merged to master.
Testing benchmark() on OSX required something like [...]. A bit further down, just under [...], if I update that statement to [...] I get an error which I haven't completely figured out yet. Could be a race condition in a .csv file, e.g. if multiple calls to stancode write to the same .csv file. I'll let it run a bit and see what the results df will look like.
Interesting. I cannot reproduce this on Ubuntu yet. Perhaps one way to test this hypothesis is to set [...]
I performed a different test on my system. I set Nd = [1,1,1] to increase the chance of a race condition. It produced this error: [...]
After running the code several times, I was able to produce a similar error. I wonder if it is possible to deactivate the writing or, if not, whether it is possible to write to separate files?
Setting Nd=[10] solves it, so I think the hypothesis is likely correct. The idx error also looks possible if the number of names on that first non-commented line is mangled. A solution could be to separate the tmp dirs or use a more elaborate naming scheme for the .csv files. What do you mean by "deactivating the writing"?
I was wondering if there was a way to prevent Stan from writing the output files, since they are not necessary for our purposes. That is one way to prevent the race conditions. Another way would be to have a try/catch, but that might interfere with the timing to some degree. If all else fails, I had an idea that was similar to yours: write to separate subdirectories within tmp. That would involve creating the subdirectories and modifying the tmpdir field of the Stan configuration object. That leaves us with (1) preventing Stan from writing the files if possible, (2) a try/catch statement if possible, (3) creating separate directories, or (4) some other naming scheme. Does any of these seem better from your perspective?
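A minimal sketch of option (3), assuming each worker process gets its own scratch directory; worker_tmpdir is a hypothetical helper, and setting a tmpdir field on the Stan configuration object is the mechanism mentioned above:

```julia
using Distributed

# Option (3) sketch: give each worker its own scratch directory so parallel
# CmdStan runs never write their .csv output to the same location.
function worker_tmpdir(base = tempdir())
    dir = joinpath(base, "cmdstan_worker_$(myid())")
    isdir(dir) || mkpath(dir)
    return dir
end

# Hypothetical usage: point each worker's model at its own directory before
# sampling, e.g. stanmodel.tmpdir = worker_tmpdir()
```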
If I understand correctly, rstan provides [...]. Also, it appears that samples are saved to the .csv file periodically during sampling. I think that is another reason to have an option not to save. It may interfere with time measurement.
Where would we get ess and r_hat values from in that case?
Given that dHMC and CmdStan are orders of magnitude faster than Turing (today), we could consider running just those for timing purposes. But let me think a bit more about what other options there are.
Good question. Based on your comment, it seems like the ess and r_hat calculations were taken from the sample output files. Is that true? I assumed that they were accumulated in an internal array in CmdStan.jl. If they are calculated from the sample output files, I suppose that would mean we would need some other solution or some way of accumulating the samples inside CmdStan. Which subset of samplers were you suggesting to include? I think including Turing in particular is a good idea because it performs so poorly, at least right now. CmdStan is good to include because it's more or less the gold standard. Ideally, it would be nice to benchmark all of the major NUTS samplers to keep track of improvements/regressions and to help other people choose packages. In the worst-case scenario, I think we can do that on a single core.
Maybe they are accumulated in cmdstan (the C++ binary), but I doubt it. The cmdstan binary creates the .csv files, and after the fact you can call a secondary binary program, stansummary, to generate the summary. Stansummary reads the .csv files back in to create that summary. CmdStan.jl kind of orchestrates setting up the input data, the Stan language program, the compilation phases, and the subsequent execution of cmdstan. RStan and PyStan use the C++ API directly and store the chains in appropriate R and Python structures while the simulation takes place; CmdStan.jl creates the Julia Chains after cmdstan has completed. What I meant is not to drop Turing, but we could maybe accept the time it takes to perform an extra run of CmdStan and dHMC just to obtain the timing info (and not read the output files back in).
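For context, a rough sketch of that two-step workflow driven from Julia; the binary and file names here are illustrative, not taken from the thread:

```julia
# The compiled model binary writes one .csv file per chain; stansummary is a
# separate binary that reads those files back in to build the summary table.
run(`./bernoulli sample data file=bernoulli.data.R output file=chain_1.csv`)
run(`stansummary chain_1.csv`)
```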
Gotcha. Sounds good.
One question slowly bubbling to the surface is whether running in parallel is a high priority for MCMCBenchmarks right now. I think the current sequential setup takes somewhat longer, but it does produce interesting [...]. At some point I would like to go back and look into using BenchmarkTools, because over time we could benefit from their improvements, including parallelism.
Good question. I don't think parallelism is a high priority or even necessary. It is more of a bonus feature. Considering that you don't have an elegant solution, I might eventually try a solution that involves separate subdirectories for each instance of CmdStan. I have a pretty good idea of how that might work. The biggest roadblock I encountered with BenchmarkTools is that it does not collect output from the function, which we need for ess and the like. I also noticed that it produces similar results to [...]
What I had in mind with running the extra simulation is: [...]
What's interesting is that benchmark chooses the number of simulation runs based on the actual time variation observed, e.g. more samples, longer times, fewer runs needed: [...]
Interesting. Are you suggesting this as a replacement or a supplement to generate quick benchmarks? As a replacement you would lose rich information about the variation and correlation structure of the metrics. Are you sure that it is based on time variation? My understanding is that there is a default maximum time of 5 seconds. See: [...]
0.204 * 25 ≈ 5 seconds
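For reference, a small BenchmarkTools sketch of the default time budget; the benchmarked expression is arbitrary:

```julia
using BenchmarkTools

# BenchmarkTools keeps collecting samples until its time budget is exhausted;
# the default budget is 5 seconds, which is why ~0.204 s per run stops near 25 runs.
BenchmarkTools.DEFAULT_PARAMETERS.seconds   # 5.0 by default
b = @benchmarkable sum(rand(10_000))
run(b; seconds = 10)                        # raise the budget for this one run
```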
Well, you're definitely right that it simply stopped at the 5 second limit! I wasn't aiming at making them faster, just making sure compilation times etc. are dealt with properly for the timing info while we still have all the info we need to update the results DataFrame when calling updateResults!().
Ok. That makes sense. In my experience, [...]
What do you think about using @timed? It returns the computed value, run time, garbage collection time, number of allocations, and memory usage. It seems like this would give us most of the metrics with little effort and modification.
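A quick sketch of what @timed gives back; on recent Julia versions it returns a named tuple (older versions return a plain tuple in the same order):

```julia
# @timed lives in Base, so no extra packages are needed.
stats = @timed sum(rand(10^6))
stats.value                                 # the computed result
stats.time                                  # elapsed seconds
stats.gctime                                # seconds spent in garbage collection
stats.bytes                                 # bytes allocated
Base.gc_alloc_count(stats.gcstats)          # total number of allocations
gc_pct = 100 * stats.gctime / stats.time    # gc share of the total run time
```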
Hi Rob- After reading through several sources on benchmarking, including this blog, my understanding is that we need to do two things to get valid benchmarks: [...]
Our benchmarks are good on that front. After giving it careful thought, it seems like BenchmarkTools is not the right tool for our purposes and, as far as I can tell, it doesn't add much. However, I think your more general idea of including other performance metrics is really useful. I added a branch called performance, which includes garbage collection time, megabytes allocated, number of allocations, etc. One of the interesting findings is that Turing is spending more time in garbage collection for some reason. Here is one of the plots: [plot]. Here is gc percentage (CmdStan only has gc for the Julia overhead, of course): [plot]. I'm not sure if gc should be a constant or a percentage of total time when code is optimized.
P.S. I ran out of time before work, but my preliminary inspection suggested Turing had a really high number of memory allocations. This could explain its performance problem. I wonder if something in Turing could be cached and computed in place. |
Thanks Chris, yes, you are right. It's just that I like to explore several different lines of thought before committing. I need to complete the LinearRegression example.
Exploring multiple options is prudent from a design perspective. Unless you have any concerns, I can merge the new metrics when I return home. I can also add some graphs for the new metrics. |
Yes, by all means. I suggest we remove the old benchmark, performance, and diagnostics folders.
It just occurred to me that the solution to the original problem is very simple. Each file is named after the model name, so I should be able to modify the model name based on the processor id, e.g. in [...]
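A sketch of that idea, with placeholder names (base_name, model_str) rather than the actual MCMCBenchmarks code:

```julia
using Distributed

# Make the Stan model name unique per worker so each process writes its own
# output files instead of racing on the same .csv names.
base_name = "linear_regression"              # placeholder model name
unique_name = string(base_name, "_", myid())
# stanmodel = Stanmodel(name = unique_name, model = model_str)
```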
I can't believe I missed that! I think there might be one small complication initializing the stan file in tmp, but that should be easy to solve. I'll add those changes later today. I think this should work.
I fixed the race condition problem and it passed a stress test in which [...]
Very nice Chris! And adding the gc and allocation data is very useful. It looks like OSX handles some aspects slightly differently (addprocs, the include of the model file, the directory settings), so I'm trying to come up with a way that works for both of us. Right now I need several edits each time I update.
Great. Thanks for fixing that. I just pushed some improvements to the way jobs are distributed to the processors. The number of reps is split as evenly as possible across the processors (e.g. reps = [13,13,12,12] for 50 total reps distributed across four processors). Previously one job would run much longer than the others, making it inefficient.
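A small sketch of that splitting rule; split_reps is a hypothetical helper name:

```julia
# Split `reps` total repetitions as evenly as possible across `np` processors,
# e.g. split_reps(50, 4) == [13, 13, 12, 12].
function split_reps(reps::Int, np::Int)
    base, extra = divrem(reps, np)
    return [base + (i <= extra ? 1 : 0) for i in 1:np]
end
```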
Hi Rob-
I put together two versions for running the simulations in parallel. I recommend the version on branch parallel_2, as it is simpler and more efficient. You can find the example in the Examples folder. I defined a parallel version of benchmark as follows: [...]
Each level of Nd runs on a separate processor, so it should speed up the simulation without interfering with the time measurements. One of the problems with parallel_1 was that it was only as fast as the slowest sampler, which is Turing. This is faster and simpler in terms of setting a random seed. The important thing is to compile Stan first so that compile time is not included in the benchmarks, but this is always the case.
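The actual definition was not captured above; purely as an illustration, handing each level of Nd to its own worker might look roughly like this (the benchmark call signature is assumed, not verified):

```julia
using Distributed, DataFrames

# Illustration only: each element of Nd is sent to its own worker via pmap, so
# the samplers within a level still run sequentially and their timings are not
# disturbed by the other levels.
function pbenchmark_sketch(samplers, simulate, Nreps, Nd)
    results = pmap(nd -> benchmark(samplers, simulate, Nreps; Nd = nd), Nd)
    return vcat(results...)
end
```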
I made two other minor changes. I removed the print setting from the runSampler function and put this in the top-level script: @everywhere Turing.turnprogress(false). I also put a call to allowmissing!(results) in the sampler loop because the previous setup was order dependent (e.g. it expected epsilon to already be in the results DataFrame, which may not always be true). Unfortunately, allowmissing! does not work with an empty DataFrame, so it has to go in the loop. Let me know if you would like me to merge with master.