
Speed up AutoDiffXd substantially #10991

Open
sherm1 opened this issue Mar 21, 2019 · 10 comments

@sherm1 (Member) commented Mar 21, 2019

We believe the primary performance problem with AutoDiffXd is the absurd amount of heap malloc/free it requires during computation. Why is that expensive?

  • locking for thread safety
  • searching for a suitably-sized block on malloc(); combining blocks on free()

Ideas:

  • write a fixed-block-size memory manager (a.k.a. memory pool), with optional thread safety. Modify our version of AutoDiffScalar to use it instead of new/delete.
  • add missing move semantics to AutoDiffScalar operators to avoid unnecessary reallocations.

The bespoke memory manager would work because all the derivative blocks we need are the same size during a computation. That means:

  • no searching for the right size
  • no block combining on free
  • the first free block is always acceptable.

The last point also means good hardware cache behavior, because the free list is essentially a stack -- the most recently freed block is the first one reused and is likely still in the cache.

With careful implementation both allocation and freeing could be inline operations using only a few assembly instructions.

Also see discussion in issue #7039.
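The free-list behavior described above can be sketched in a few lines. This is a minimal illustration, not Drake code; the name `FixedPool` and its interface are hypothetical. Because every block is the same size, allocation is just popping the head of an intrusive free list, and freeing is a LIFO push:

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical fixed-block-size pool: all blocks share one size, so there is
// no size search on alloc and no block coalescing on free. The free list is
// intrusive (stored inside the free blocks themselves) and LIFO, so the last
// block freed is the first one reused -- likely still hot in cache.
class FixedPool {
 public:
  FixedPool(std::size_t block_size, std::size_t capacity)
      : block_size_(block_size < sizeof(Node) ? sizeof(Node) : block_size) {
    storage_ = static_cast<char*>(std::malloc(block_size_ * capacity));
    for (std::size_t i = 0; i < capacity; ++i)
      Free(storage_ + i * block_size_);
  }
  ~FixedPool() { std::free(storage_); }

  // Pop the head of the free list: a handful of instructions, no searching.
  void* Alloc() {
    Node* node = head_;
    if (node == nullptr) return nullptr;  // Pool exhausted.
    head_ = node->next;
    return node;
  }

  // LIFO push: the block just freed becomes the next one handed out.
  void Free(void* p) {
    Node* node = static_cast<Node*>(p);
    node->next = head_;
    head_ = node;
  }

 private:
  struct Node { Node* next; };
  std::size_t block_size_;
  char* storage_{};
  Node* head_{};
};
```

With everything inlined, both `Alloc` and `Free` compile down to a couple of pointer moves, which is the "few assembly instructions" claim above. Thread safety could be layered on (or avoided entirely via thread-local pools, as discussed later in this thread).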

@jwnimmer-tri (Collaborator)

Alejandro asked me about this on Slack earlier. I think you both underestimate how well current off-the-shelf allocators perform. How about a benchmark to show how bad the current one is? And if glibc malloc really is that bad, the first thing to do is not to write your own allocator! It's to try jemalloc, tcmalloc, or whatever other off-the-shelf allocator already exists.

@sherm1 (Member, Author) commented Mar 21, 2019

The underlying assumption here is that AutoDiffXd(100) is much slower than AutoDiff100d. I have not measured that myself, so it has the status of rumor at the moment! But assuming it's true, heap allocation must be a lot slower than stack memory allocation.

Replacing heap allocation for all of Drake is a much more drastic step than just making AutoDiffXd work better by using a fixed-size pool. It could be worth a try though.

In any case this is a quantitative question and what we are really lacking at the moment is a repeatable benchmark problem showing that stack-allocated AutoDiff is much faster than heap-allocated. If we had that we could play with the heap allocation to see what it takes to get it close to stack-allocated performance.
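As a starting point, the stack-vs-heap question can be isolated in a tiny micro-benchmark skeleton, well short of a full AutoDiff benchmark. This sketch is illustrative only (the function names and loop counts are invented here): both variants do the same derivative-sized work, differing only in whether the per-iteration storage is a fixed-size stack array or a heap-allocated vector:

```cpp
#include <array>
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical micro-benchmark skeleton: same arithmetic, two storage
// schemes. kDerivs stands in for the number of partial derivatives.
constexpr int kDerivs = 100;
constexpr int kReps = 10000;

// Fixed-size (stack) storage, as in AutoDiff100d.
double RunFixed() {
  double sum = 0.0;
  for (int r = 0; r < kReps; ++r) {
    std::array<double, kDerivs> d{};  // No heap traffic.
    for (int i = 0; i < kDerivs; ++i) d[i] = i * 0.5;
    for (double x : d) sum += x;
  }
  return sum;
}

// Dynamically-sized (heap) storage, as in AutoDiffXd.
double RunDynamic() {
  double sum = 0.0;
  for (int r = 0; r < kReps; ++r) {
    std::vector<double> d(kDerivs);  // One malloc/free per iteration.
    for (int i = 0; i < kDerivs; ++i) d[i] = i * 0.5;
    for (double x : d) sum += x;
  }
  return sum;
}

// Crude timing helper; a real benchmark would use a framework such as
// Google Benchmark to handle warmup and statistical repetition.
template <typename F>
double TimeMs(F&& f, double* result) {
  auto t0 = std::chrono::steady_clock::now();
  *result = f();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Both functions compute identical results, so any timing gap is attributable to allocation alone -- which is exactly the quantitative comparison being asked for here.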

@amcastro-tri (Contributor)

I will add heap/stack-allocated AutoDiff tests to the benchmark I'm writing. I think @jwnimmer-tri has a good point that we should just measure it.

@jwnimmer-tri (Collaborator) commented Mar 22, 2019

Actually the first thing I would do (in terms of fixing the code) is teach AutoDiffXd about value categories. At the moment, every operation makes a copy (sometimes more than one!) of the derivatives vector. If it were move-enabled, we would be hitting the heap much less often, no matter the allocator.

(I still agree that having a benchmark is the first step.)
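The value-category idea can be shown with a toy stand-in for AutoDiffXd (this `Adx` struct is invented for illustration and is not Drake's class). The lvalue overload must copy the derivatives vector; the rvalue overload steals the temporary's heap storage instead, so an expression chain allocates once rather than once per operator:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an AutoDiffXd-like scalar: a value plus a
// dynamically-sized derivatives vector (assumes both operands carry the
// same number of derivatives).
struct Adx {
  double value{};
  std::vector<double> derivs;
};

// Lvalue overload: forced to copy, which allocates a fresh heap vector.
Adx operator+(const Adx& a, const Adx& b) {
  Adx result{a.value + b.value, a.derivs};  // Copy: one heap allocation.
  for (std::size_t i = 0; i < result.derivs.size(); ++i)
    result.derivs[i] += b.derivs[i];
  return result;
}

// Rvalue overload: the temporary's storage is about to die anyway, so
// update it in place and move it out -- no allocation at all.
Adx operator+(Adx&& a, const Adx& b) {
  a.value += b.value;
  for (std::size_t i = 0; i < a.derivs.size(); ++i)
    a.derivs[i] += b.derivs[i];
  return std::move(a);
}
```

In `a + b + c`, the first `+` allocates once (both operands are lvalues), but the second `+` binds the intermediate temporary to the `Adx&&` overload and reuses its vector -- which is the "hitting the heap much less often, no matter the allocator" point.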

@sherm1 (Member, Author) commented Mar 22, 2019

If it were move-enabled ...

I agree. That's actually one of my biggest complaints about Eigen: it uses horribly complicated expression templates to achieve what could mostly have been done with move semantics. I believe that is because it predates rvalue references (&&). We are overdue for a modern-C++ matrix library.

@jwnimmer-tri (Collaborator)

It has move semantics on the Matrix classes, just not on the AutoDiffScalar (ADS) class.

@sherm1 (Member, Author) commented May 29, 2020

TIL, with some relief, that thread-local storage is (allegedly) as fast as local storage. That bodes well for a possible AutoDiffScalarX implementation that maintains its own pool of temporaries.

FYI @rpoyner-tri please see the above discussion.
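The per-thread pool idea combines the two threads of this discussion: a fixed-size free list (no searching, LIFO reuse) held in `thread_local` storage (no locking). A minimal sketch, with invented names (`Block`, `AcquireBlock`, `ReleaseBlock`) and an assumed fixed derivative count:

```cpp
// Hypothetical per-thread pool of derivative temporaries. Each thread owns
// its own free list, so no synchronization is needed on the hot path; the
// heap is only touched when the pool runs dry.
struct Block {
  Block* next{};        // Intrusive free-list link.
  double derivs[100];   // Assumed fixed derivative size for this sketch.
};

// One free list per thread; allegedly as cheap to access as a local.
thread_local Block* tl_free_list = nullptr;

Block* AcquireBlock() {
  if (tl_free_list != nullptr) {
    Block* b = tl_free_list;   // Pop: most recently released block,
    tl_free_list = b->next;    // likely still in this core's cache.
    return b;
  }
  return new Block{};  // Cold path: grow the pool from the heap.
}

void ReleaseBlock(Block* b) {
  b->next = tl_free_list;  // LIFO push back onto this thread's list.
  tl_free_list = b;
}
```

One design caveat worth noting: with a purely thread-local list, a block acquired on one thread must be released on the same thread, which fits the scoped-temporary usage pattern described above.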

sherm1 changed the title from "An idea for speeding up AutoDiffXd" to "Speed up AutoDiffXd substantially" Jun 14, 2020
sherm1 added the labels "priority: high" and "component: system framework" (System, Context, and supporting code) Jun 23, 2020
@EricCousineau-TRI (Contributor) commented Jul 13, 2020

@rpoyner-tri rpoyner-tri added this to the autodiff speedup milestone Jul 17, 2020
rpoyner-tri added a commit to rpoyner-tri/drake that referenced this issue Sep 30, 2020
Relevant to: RobotLocomotion#10991, RobotLocomotion#13902

I finally realized that LimitMalloc counting was contributing significant
overhead to autodiff benchmark timings, owing to necessary synchronization
primitives in that module. This patch separates the two measurements, to
clarify the things we want to focus on.

Notice that this change of measurement will require some revision of our
timings for older versions, to keep comparisons sensible. Reviewers should look
for updates to the tracking issue RobotLocomotion#13902.
rpoyner-tri added a commit that referenced this issue Oct 1, 2020
…14146)

* cassie_bench: Separate autodiff malloc counts from benchmark timing

Relevant to: #10991, #13902

(Commit message as above.)
rpoyner-tri added a commit to rpoyner-tri/drake that referenced this issue Oct 5, 2020
Relevant to: RobotLocomotion#10991, RobotLocomotion#13902

It turns out that relying on Eigen's Matrix::operator*= too heavily results in
slower code. Rewrite AutoDiffXd::operator*= for autodiff inputs so that it gets
better optimization and inlining from Eigen.

Supporting benchmark measurements will be provided in RobotLocomotion#13902.
rpoyner-tri added a commit to rpoyner-tri/drake that referenced this issue Oct 6, 2020
(Commit message as above.)
sammy-tri pushed a commit that referenced this issue Oct 6, 2020
(Commit message as above.)
@jwnimmer-tri (Collaborator)

Would it be fair to say that this is not "priority: high" anymore?

@sherm1 (Member, Author) commented May 9, 2022

Sigh. It's going to be a much longer slog than we hoped. Still highly desirable and we should keep at it. Lowering to Medium.

5 participants