Benchmark speed for autodiff in RigidBodyTree, MultibodyPlant and AcrobotPlant #8482

hongkai-dai · 2018-03-30T16:57:37Z

@mposa noticed that the autodiff computation in RigidBodyTree is a lot slower than that in AcrobotPlant (which writes the dynamics equation manually). I did a quick benchmark test on the three classes, and here are the results of computing the mass matrix 1000 times:

Time (ms)	double	AutoDiffXd	AutoDiffUpTo73d
AcrobotPlant	0.137	1.277	0.251
RigidBodyTree	2.199	253.413	116.395
MultibodyPlant	2.230	312.497	NA

From this table, we know that

For AcrobotPlant, autodiff works really well. Notice that AutoDiffUpTo73d takes only about 2x as long as the double version. If we used numerical differencing to compute the gradient for AcrobotPlant, forward difference would take about 4x as long as the double version (there are 4 variables to take the gradient with respect to). So auto-differentiation is faster than a naive implementation of numerical gradient, and also more accurate.
We failed to observe the same speed-up using autodiff in RigidBodyTree and MultibodyPlant. The AutoDiffXd version takes more than 100x as long as the double version (about 115x for RigidBodyTree and about 140x for MultibodyPlant). Numerical gradient with forward difference would take about 4x as long, and central difference about 8x. So in this case autodiff is significantly slower than numerical differentiation.
Using AutoDiffUpTo73d is about 2x to 5x faster than AutoDiffXd.
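
To illustrate the "more accurate" point above, here is a minimal sketch of forward-mode autodiff using dual numbers, compared against a forward difference. This is not Drake or Eigen code: the `Dual` class, the `cos` helper, and `mass_term` (a made-up scalar shaped like one entry of an acrobot-style mass matrix) are all assumptions for illustration only.

```python
import math


class Dual:
    """A value plus one directional derivative (forward-mode autodiff)."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def cos(x):
    """Cosine that propagates derivatives when given a Dual."""
    if isinstance(x, Dual):
        return Dual(math.cos(x.value), -math.sin(x.value) * x.deriv)
    return math.cos(x)


def mass_term(q):
    # Hypothetical stand-in for one mass-matrix entry: a + b*cos(q).
    return 2.0 + 1.5 * cos(q)


q0 = 0.7
exact = -1.5 * math.sin(q0)

# Autodiff: seed dq/dq = 1 and read the derivative off the result.
autodiff = mass_term(Dual(q0, 1.0)).deriv

# Forward difference: one extra function evaluation per variable.
h = 1e-6
forward_diff = (mass_term(q0 + h) - mass_term(q0)) / h

autodiff_error = abs(autodiff - exact)
forward_diff_error = abs(forward_diff - exact)
print("autodiff error:", autodiff_error)
print("forward difference error:", forward_diff_error)
```

The forward difference carries a truncation error proportional to `h`, while the dual-number result is exact up to floating-point rounding. The cost side of the comparison (the 4x and 8x estimates above) is not modeled here.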

The benchmark code is in https://github.com/hongkai-dai/drake/blob/benchmark_autodiff/examples/acrobot/benchmark_autodiff.cc

@amcastro-tri @mposa @sherm1 @SeanCurtis-TRI @edrumwri


sherm1 · 2018-03-30T22:08:54Z

The contrast between the good behavior on AcrobotPlant and the bad behavior with RB/MBPlant makes me think we are misusing AutoDiff there somehow.

amcastro-tri · 2018-04-02T20:53:07Z

Is there some optimization that the compiler is just not able to do or are we really performing more floating point operations?

rpoyner-tri · 2020-12-08T22:39:33Z

Some updates, now 2+ years later.

the basic complaint still stands -- MBP-based acrobot is still slower than bespoke AcrobotPlant.
the original program used MBP's CalcMassMatrixViaInverseDynamics(), which seems a bit unfair.
RBTree is gone, so I did not try to measure it.
AutoDiffUpTo73d is effectively gone (sea of template errors), so I gave up trying to revive it.
I've made a somewhat modernized google-benchmark version of the original program
- a big difference is that the new program invalidates state every iteration to avoid just vacuous cache returns
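
As a rough illustration of why invalidating state matters for this kind of benchmark, here is a toy sketch. `CachedPlant` is a hypothetical stand-in and does not model Drake's actual Context or cache machinery; it only shows that a loop which never changes state measures cache hits rather than the calculation.

```python
class CachedPlant:
    """Hypothetical object that caches an expensive result until state changes."""

    def __init__(self):
        self.state = 0.0
        self._cache = None
        self.evaluations = 0  # Counts how often the real work runs.

    def set_state(self, value):
        self.state = value
        self._cache = None  # Changing state invalidates the cached result.

    def mass_matrix(self):
        if self._cache is None:
            self.evaluations += 1
            self._cache = [[2.0 + self.state, 0.5], [0.5, 1.0]]
        return self._cache


def run(iterations, invalidate):
    """Call mass_matrix() repeatedly; return how many real evaluations ran."""
    plant = CachedPlant()
    for i in range(iterations):
        if invalidate:
            plant.set_state(float(i))
        plant.mass_matrix()
    return plant.evaluations


cached_only = run(1000, invalidate=False)
with_invalidation = run(1000, invalidate=True)
print(cached_only, with_invalidation)  # prints "1 1000"
```

Without invalidation the loop does the real work once and then returns the cached value 999 times, so the timing would mostly reflect cache lookups.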

Here are some informal numbers (I haven't put my new benchmark under cassie-level controls yet):

rico@Puget-161804-10:~/checkout/drake$ bazel-bin/examples/acrobot/benchmark_autodiff
2020-12-08T16:47:50-05:00
Running bazel-bin/examples/acrobot/benchmark_autodiff
Run on (48 X 3500 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x24)
L1 Instruction 32 KiB (x24)
L2 Unified 256 KiB (x24)
L3 Unified 30720 KiB (x2)
Load Average: 0.03, 0.03, 0.02
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
AcrobotFixtureD/AcrobotDoubleMassMatrix        157 ns          157 ns      3781194
AcrobotFixtureADX/AcrobotAdxMassMatrix         367 ns          367 ns      1904789
MultibodyFixtureD/MbDMassMatrix               1289 ns         1289 ns       546203
MultibodyFixtureADX/MbAdxMassMatrix          37739 ns        37739 ns        18548
MultibodyFixtureD/MbDMassMatrixVia            2016 ns         2016 ns       340540
MultibodyFixtureADX/MbAdxMassMatrixVia       59170 ns        59169 ns        11743

Observations:

Acrobot/double is a bit slower; this could be attributable to state invalidation in my new benchmark program
Acrobot/autodiff is relatively quite a bit faster; 2.3x slower than double instead of 9.3x slower
Multibody/CalcMassMatrixViaInverseDynamics()/double is comparable to the old number (2016 vs. 2230)
the double=>autodiff penalty for MBP is now about 30x, instead of the old 140x.
- this is true for both the CalcMassMatrixViaInverseDynamics() calculation and the faster CalcMassMatrix() calculation
contemporary MBP/autodiff beats old RBT/autodiff, regardless of scalar type
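
For anyone who wants to re-derive the ratios quoted above, this short sketch recomputes them from the two sets of numbers in this thread. The only assumption is the unit conversion: the 2018 table reports milliseconds per 1000 calls, which is numerically equal to microseconds per call.

```python
# 2018 numbers: milliseconds per 1000 calls (equivalently, microseconds per call).
old_ms_per_1000 = {
    "acrobot_double": 0.137,
    "acrobot_adx": 1.277,
    "rbt_adx": 253.413,
    "rbt_adx73": 116.395,
    "mbp_double": 2.230,
    "mbp_adx": 312.497,
}

# 2020 numbers: nanoseconds per call, from the google-benchmark output.
new_ns = {
    "acrobot_double": 157,
    "acrobot_adx": 367,
    "mbp_double": 1289,
    "mbp_adx": 37739,
    "mbp_double_via": 2016,
    "mbp_adx_via": 59170,
}

acrobot_old = old_ms_per_1000["acrobot_adx"] / old_ms_per_1000["acrobot_double"]
acrobot_new = new_ns["acrobot_adx"] / new_ns["acrobot_double"]
mbp_old = old_ms_per_1000["mbp_adx"] / old_ms_per_1000["mbp_double"]
mbp_new = new_ns["mbp_adx"] / new_ns["mbp_double"]
mbp_new_via = new_ns["mbp_adx_via"] / new_ns["mbp_double_via"]

# Convert the old MBP double timing to nanoseconds per call for comparison.
old_mbp_double_ns = old_ms_per_1000["mbp_double"] * 1000.0

print(f"Acrobot autodiff penalty: old {acrobot_old:.1f}x, new {acrobot_new:.1f}x")
print(f"MBP autodiff penalty: old {mbp_old:.0f}x, new {mbp_new:.0f}x "
      f"(via inverse dynamics: {mbp_new_via:.0f}x)")
print(f"MBP double, old vs. new Via: {old_mbp_double_ns:.0f} ns vs. "
      f"{new_ns['mbp_double_via']} ns")
```

The new MBP penalty works out to roughly 29x for both calculations, which is the "about 30x" above.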

Some early casual profiling suggests that heap thrashing (especially free()) is a problem for the MBP calculations. Since this problem is small, it is possible that some version of SBO (small buffer optimization) would help. We have a draft; I'm not sure how much effort it would take to make that actually viable in master.

I think my plan for this is to sharpen up my new benchmark a bit, try to commit it, and then fold further work on this into ongoing AutoDiff work. I have my doubts that MBP autodiff will ever rival (theoretical) numerical differentiation, owing to the long sad history of Eigen's unsupported autodiff scalar. However, it is useful to have a small-problem benchmark to complement the existing cassie benchmark.

Relevant to: RobotLocomotion#8482 This patch rewrites Hongkai's original program (from an old branch) to use google benchmark, removes some obsolete measurements (RigidBodyTree, AutoDiffUpTo73d), and adds some new ones (MBP vanilla CalcMassMatrix()). This benchmark set is nice because it captures the small-problem (only four derivatives!) end of the autodiff problem space. A possible plan would be to wrap this program with controlled-experiment scripts, similar to those in examples/multibody/cassie_benchmark, and use it to help drive further autodiff optimization work.

amcastro-tri · 2020-12-08T23:35:00Z

Wow, those numbers actually look very good @rpoyner-tri, great work!
The old RBT comparisons with fixed-size autodiff seem to indicate that it would still be worth at least measuring performance in a dev branch where we brute-force replace AutoDiffXd with AutoDiffUpTo73 everywhere. But I believe you did this already?

rpoyner-tri · 2020-12-08T23:46:16Z

@amcastro-tri I did something similar to the UpTo73 case in earlier work, perhaps this: #13902 (comment)

The fixed vs. heap tradeoff may be very different for these very small problems; hence my renewed interest in SBO similar to Nimmer's #12583.

sherm1 · 2020-12-08T23:58:33Z

My thought is that it is not worth putting a lot of effort into optimizing for small problems -- typically they run fast enough for whatever toy or pedagogical purpose they serve. I would like to focus our efforts on the more-difficult problems encountered by our target users. OTOH if this little benchmark can teach us something about performance on big systems that could be useful.

jwnimmer-tri · 2020-12-09T02:10:21Z

FTR my original purpose for SBO was not for toy problems, but to use it for chunking (#2619) without increasing the number of scalar types we compile to. If we changed AutoDiffXd to use SBO, users who wanted to stripe their compute in chunks (maybe even with openmp) could do so without touching the heap, and without having more compile-time types. (It's convenient to assume that there's only ever one autodiff C++ type within Drake.)
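
A rough sketch of the chunking idea, for readers who have not seen it: instead of carrying all n partial derivatives through one pass, run several passes that each seed only a few directions. The `Jet` class, `f`, and `gradient_in_chunks` below are hypothetical and show only the arithmetic. They say nothing about heap use, which is the part SBO would address in C++.

```python
class Jet:
    """A value plus a short list of partials, one per seeded direction."""

    def __init__(self, value, partials):
        self.value = value
        self.partials = list(partials)

    def __add__(self, other):
        return Jet(self.value + other.value,
                   [a + b for a, b in zip(self.partials, other.partials)])

    def __mul__(self, other):
        # Product rule applied to each seeded direction.
        return Jet(self.value * other.value,
                   [a * other.value + self.value * b
                    for a, b in zip(self.partials, other.partials)])


def f(x):
    # Hypothetical test function with four inputs: x0*x1 + x2*x3.
    return x[0] * x[1] + x[2] * x[3]


def gradient_in_chunks(func, point, chunk_size):
    """Compute the full gradient using passes of at most chunk_size partials."""
    n = len(point)
    grad = []
    for start in range(0, n, chunk_size):
        width = min(chunk_size, n - start)
        # Seed only the variables in [start, start + width) for this pass.
        seeded = [
            Jet(v, [1.0 if i == start + k else 0.0 for k in range(width)])
            for i, v in enumerate(point)
        ]
        grad.extend(func(seeded).partials)
    return grad


point = [1.0, 2.0, 3.0, 4.0]
full = gradient_in_chunks(f, point, chunk_size=4)     # one pass, 4 partials
chunked = gradient_in_chunks(f, point, chunk_size=2)  # two passes, 2 partials each
print(full, chunked)
```

Each pass is independent of the others, which is what would allow the passes to be striped across threads.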

amcastro-tri · 2020-12-09T12:32:34Z

Excellent point @jwnimmer-tri, chunking + SBO could still perform better.

rpoyner-tri · 2020-12-09T16:26:16Z

Good discussion; thanks! Rounding back to "what is this ticket about?"

It showed a limited benchmark and a set of numbers that are mostly obsolete.
It complained that autodiff is slow; this is not particularly novel among issues at this point.
There is not much more in the way of answerable/resolvable questions.

In the follow-on discussion a lot of work is proposed. I think that is beyond the scope/coherence of this issue as written. Here is what I think should happen:

merge or reject my rewrite of the original benchmark
link this ticket to the ongoing measurements ticket tracking measurement data from //examples/multibody/cassie_benchmark #13902
plan a more flexible benchmark that can express robots/problems of arbitrary size -- perhaps based on PlanarNLink or similar?
plan new work on autodiff (SBO? chunking? threads?) in Speed up AutoDiffXd substantially #10991 or new related tickets
close this ticket

rpoyner-tri · 2020-12-11T22:37:30Z

Merged my benchmark code, linked some tickets, and filed a new one: #14449. Closing this one.

jwnimmer-tri assigned hongkai-dai Mar 31, 2018

jwnimmer-tri added the unused team: dynamics label Sep 27, 2018

hongkai-dai mentioned this issue Dec 29, 2018

MultiBody dynamics, inertia, and autodiff benchmarking #10322

Closed

rpoyner-tri self-assigned this Oct 15, 2020

rpoyner-tri added this to the rico 2020q4 grab bag milestone Oct 15, 2020

rpoyner-tri mentioned this issue Dec 8, 2020

examples/acrobot: Add an autodiff benchmark #14432

Merged

rpoyner-tri mentioned this issue Dec 11, 2020

More flexible autodiff benchmark #14449

Closed

rpoyner-tri closed this as completed Dec 11, 2020

rpoyner-tri mentioned this issue Dec 16, 2020

reusable infrastructure for benchmark measurements #14464

Closed
