
Speed up AutoDiffXd substantially #10991

Open
sherm1 opened this issue Mar 21, 2019 · 10 comments

@sherm1 (Member) commented Mar 21, 2019

We believe the primary performance problem with AutoDiffXd is the absurd amount of heap malloc/free it requires during computation. Why is that expensive?

  • locking for thread safety
  • searching for a suitably-sized block on malloc(); combining blocks on free()

Ideas:

  • write a fixed-block-size memory manager (a.k.a. memory pool), with optional thread safety. Modify our version of AutoDiffScalar to use it instead of new/delete.
  • add missing move semantics to AutoDiffScalar operators to avoid unnecessary reallocations.

The bespoke memory manager would work because all the derivative blocks we need are the same size during a computation. That means:

  • no searching for the right size
  • no block combining on free
  • the first free block is always acceptable.

The last point also means good hardware cache behavior, because the free list is essentially a stack -- the most recently freed block is the first one reused and is likely still in the cache.

With careful implementation both allocation and freeing could be inline operations using only a few assembly instructions.

Also see discussion in issue #7039.
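The free-list behavior described above can be sketched in a few lines. This is a minimal illustration, not Drake code; the name `FixedPool` and its interface are hypothetical. Because every block is the same size, allocation is just popping the head of an intrusive free list, and freeing is a LIFO push:

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical fixed-block-size pool: all blocks share one size, so there is
// no size search on alloc and no block coalescing on free. The free list is
// intrusive (stored inside the free blocks themselves) and LIFO, so the last
// block freed is the first one reused -- likely still hot in cache.
class FixedPool {
 public:
  FixedPool(std::size_t block_size, std::size_t capacity)
      : block_size_(block_size < sizeof(Node) ? sizeof(Node) : block_size) {
    storage_ = static_cast<char*>(std::malloc(block_size_ * capacity));
    for (std::size_t i = 0; i < capacity; ++i)
      Free(storage_ + i * block_size_);
  }
  ~FixedPool() { std::free(storage_); }

  // Pop the head of the free list: a handful of instructions, no searching.
  void* Alloc() {
    Node* node = head_;
    if (node == nullptr) return nullptr;  // Pool exhausted.
    head_ = node->next;
    return node;
  }

  // LIFO push: the block just freed becomes the next one handed out.
  void Free(void* p) {
    Node* node = static_cast<Node*>(p);
    node->next = head_;
    head_ = node;
  }

 private:
  struct Node { Node* next; };
  std::size_t block_size_;
  char* storage_{};
  Node* head_{};
};
```

With everything inlined, both `Alloc` and `Free` compile down to a couple of pointer moves, which is the "few assembly instructions" claim above. Thread safety could be layered on (or avoided entirely via thread-local pools, as discussed later in this thread).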

@jwnimmer-tri (Collaborator)

Alejandro asked me about this on Slack earlier. I think you both underestimate how well current off-the-shelf allocators perform. How about a benchmark to show how bad the current one is? And if glibc malloc really is that bad, the first thing to do is not to write your own allocator! It's to try jemalloc, tcmalloc, or whatever other off-the-shelf allocator already exists.

@sherm1 (Member, Author) commented Mar 21, 2019

The underlying assumption here is that AutoDiffXd(100) is much slower than AutoDiff100d. I have not measured that myself, so it has the status of rumor at the moment! But assuming it's true, heap allocation must be a lot slower than stack memory allocation.

Replacing heap allocation for all of Drake is a much more drastic step than just making AutoDiffXd work better by using a fixed-size pool. It could be worth a try though.

In any case this is a quantitative question and what we are really lacking at the moment is a repeatable benchmark problem showing that stack-allocated AutoDiff is much faster than heap-allocated. If we had that we could play with the heap allocation to see what it takes to get it close to stack-allocated performance.
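As a starting point, the stack-vs-heap question can be isolated in a tiny micro-benchmark skeleton, well short of a full AutoDiff benchmark. This sketch is illustrative only (the function names and loop counts are invented here): both variants do the same derivative-sized work, differing only in whether the per-iteration storage is a fixed-size stack array or a heap-allocated vector:

```cpp
#include <array>
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical micro-benchmark skeleton: same arithmetic, two storage
// schemes. kDerivs stands in for the number of partial derivatives.
constexpr int kDerivs = 100;
constexpr int kReps = 10000;

// Fixed-size (stack) storage, as in AutoDiff100d.
double RunFixed() {
  double sum = 0.0;
  for (int r = 0; r < kReps; ++r) {
    std::array<double, kDerivs> d{};  // No heap traffic.
    for (int i = 0; i < kDerivs; ++i) d[i] = i * 0.5;
    for (double x : d) sum += x;
  }
  return sum;
}

// Dynamically-sized (heap) storage, as in AutoDiffXd.
double RunDynamic() {
  double sum = 0.0;
  for (int r = 0; r < kReps; ++r) {
    std::vector<double> d(kDerivs);  // One malloc/free per iteration.
    for (int i = 0; i < kDerivs; ++i) d[i] = i * 0.5;
    for (double x : d) sum += x;
  }
  return sum;
}

// Crude timing helper; a real benchmark would use a framework such as
// Google Benchmark to handle warmup and statistical repetition.
template <typename F>
double TimeMs(F&& f, double* result) {
  auto t0 = std::chrono::steady_clock::now();
  *result = f();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Both functions compute identical results, so any timing gap is attributable to allocation alone -- which is exactly the quantitative comparison being asked for here.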

@amcastro-tri (Contributor)

I will add heap/stack-allocated AutoDiff tests to the benchmark I'm writing. I think @jwnimmer-tri has a good point that we should just measure it.

@jwnimmer-tri (Collaborator) commented Mar 22, 2019

Actually the first thing I would do (in terms of fixing the code) is teach AutoDiffXd about value categories. At the moment, every operation makes a copy (sometimes more than one!) of the derivatives vector. If it were move-enabled, we would be hitting the heap much less often, no matter the allocator.

(I still agree that having a benchmark is the first step.)
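The value-category idea can be shown with a toy stand-in for AutoDiffXd (this `Adx` struct is invented for illustration and is not Drake's class). The lvalue overload must copy the derivatives vector; the rvalue overload steals the temporary's heap storage instead, so an expression chain allocates once rather than once per operator:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an AutoDiffXd-like scalar: a value plus a
// dynamically-sized derivatives vector (assumes both operands carry the
// same number of derivatives).
struct Adx {
  double value{};
  std::vector<double> derivs;
};

// Lvalue overload: forced to copy, which allocates a fresh heap vector.
Adx operator+(const Adx& a, const Adx& b) {
  Adx result{a.value + b.value, a.derivs};  // Copy: one heap allocation.
  for (std::size_t i = 0; i < result.derivs.size(); ++i)
    result.derivs[i] += b.derivs[i];
  return result;
}

// Rvalue overload: the temporary's storage is about to die anyway, so
// update it in place and move it out -- no allocation at all.
Adx operator+(Adx&& a, const Adx& b) {
  a.value += b.value;
  for (std::size_t i = 0; i < a.derivs.size(); ++i)
    a.derivs[i] += b.derivs[i];
  return std::move(a);
}
```

In `a + b + c`, the first `+` allocates once (both operands are lvalues), but the second `+` binds the intermediate temporary to the `Adx&&` overload and reuses its vector -- which is the "hitting the heap much less often, no matter the allocator" point.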

@sherm1 (Member, Author) commented Mar 22, 2019

If it were move-enabled ...

I agree. That's actually one of my biggest complaints about Eigen: it uses horribly complicated expression templates to achieve what could mostly have been done with move semantics. I believe that is because it predates rvalue references (&&). We are overdue for a modern-C++ matrix library.

@jwnimmer-tri (Collaborator)

It has move semantics on the Matrix classes, just not on the AutoDiffScalar (ADS) class.

@sherm1 (Member, Author) commented May 29, 2020

TIL, with some relief, that thread-local storage is (allegedly) as fast as local storage. That bodes well for a possible AutoDiffScalarX implementation that maintains its own pool of temporaries.

FYI @rpoyner-tri please see the above discussion.
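The per-thread pool idea combines the two threads of this discussion: a fixed-size free list (no searching, LIFO reuse) held in `thread_local` storage (no locking). A minimal sketch, with invented names (`Block`, `AcquireBlock`, `ReleaseBlock`) and an assumed fixed derivative count:

```cpp
// Hypothetical per-thread pool of derivative temporaries. Each thread owns
// its own free list, so no synchronization is needed on the hot path; the
// heap is only touched when the pool runs dry.
struct Block {
  Block* next{};        // Intrusive free-list link.
  double derivs[100];   // Assumed fixed derivative size for this sketch.
};

// One free list per thread; allegedly as cheap to access as a local.
thread_local Block* tl_free_list = nullptr;

Block* AcquireBlock() {
  if (tl_free_list != nullptr) {
    Block* b = tl_free_list;   // Pop: most recently released block,
    tl_free_list = b->next;    // likely still in this core's cache.
    return b;
  }
  return new Block{};  // Cold path: grow the pool from the heap.
}

void ReleaseBlock(Block* b) {
  b->next = tl_free_list;  // LIFO push back onto this thread's list.
  tl_free_list = b;
}
```

One design caveat worth noting: with a purely thread-local list, a block acquired on one thread must be released on the same thread, which fits the scoped-temporary usage pattern described above.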

sherm1 changed the title from "An idea for speeding up AutoDiffXd" to "Speed up AutoDiffXd substantially" Jun 14, 2020
sherm1 added the labels "priority: high" and "component: system framework" (System, Context, and supporting code) Jun 23, 2020
@EricCousineau-TRI (Contributor) commented Jul 13, 2020

@rpoyner-tri rpoyner-tri added this to the autodiff speedup milestone Jul 17, 2020
rpoyner-tri added a commit to rpoyner-tri/drake that referenced this issue Sep 30, 2020
Relevant to: RobotLocomotion#10991, RobotLocomotion#13902

I finally realized that LimitMalloc counting was contributing significant
overhead to autodiff benchmark timings, owing to necessary synchronization
primitives in that module. This patch separates the two measurements, to
clarify the things we want to focus on.

Notice that this change of measurement will require some revision of our
timings for older versions, to keep comparisons sensible. Reviewers should look
for updates to the tracking issue RobotLocomotion#13902.
rpoyner-tri added a commit that referenced this issue Oct 1, 2020
…14146)

* cassie_bench: Separate autodiff malloc counts from benchmark timing

Relevant to: #10991, #13902

(Commit message as above.)
rpoyner-tri added a commit to rpoyner-tri/drake that referenced this issue Oct 5, 2020
Relevant to: RobotLocomotion#10991, RobotLocomotion#13902

It turns out that relying on Eigen's Matrix::operator*= too heavily results in
slower code. Rewrite AutoDiffXd::operator*= for autodiff inputs so that it gets
better optimization and inlining from Eigen.

Supporting benchmark measurements will be provided in RobotLocomotion#13902.
rpoyner-tri added a commit to rpoyner-tri/drake that referenced this issue Oct 6, 2020
(Commit message as above.)
sammy-tri pushed a commit that referenced this issue Oct 6, 2020
(Commit message as above.)
@jwnimmer-tri (Collaborator)

Would it be fair to say that this is not "priority: high" anymore?

@sherm1 (Member, Author) commented May 9, 2022

Sigh. It's going to be a much longer slog than we hoped. Still highly desirable and we should keep at it. Lowering to Medium.

5 participants