Speed up AutoDiffXd substantially #10991
Alejandro asked me about this on Slack earlier. I think you both underestimate how well current off-the-shelf allocators perform. How about a benchmark to show how bad the current one is? Or if glibc malloc really is bad, the first thing to do is not to write your own allocator! It's to try jemalloc or tcmalloc or whatever other off-the-shelf one already exists.
The underlying assumption here is that replacing heap allocation for all of Drake is a much more drastic step than just making AutoDiffXd work better by using a fixed-size pool. It could be worth a try, though. In any case this is a quantitative question, and what we are really lacking at the moment is a repeatable benchmark problem showing that stack-allocated AutoDiff is much faster than heap-allocated. If we had that, we could play with the heap allocation to see what it takes to get it close to stack-allocated performance.
I will add heap/stack-allocated AutoDiff tests to the benchmark I'm writing. I think @jwnimmer-tri has a good point that we should just measure it.
Actually the first thing I would do (in terms of fixing the code) is teach AutoDiffXd about value categories. At the moment, every operation makes a copy (sometimes more than one!) of the derivatives vector. If it were move-enabled, we would be hitting the heap much less often, no matter the allocator. (I still agree that having a benchmark is the first step.)
I agree. That's actually one of my biggest complaints about Eigen. It uses horribly complicated expression templates to achieve what could mostly have been done with move semantics. I believe that is due to it predating C++11 move semantics.
It has move on the Matrix classes, just not on the AutoDiffScalar class.
TIL, with some relief, that thread-local storage is (allegedly) as fast as local storage. That bodes well for a possible AutoDiffScalarX implementation that maintains its own pool of temporaries. FYI @rpoyner-tri, please see the discussion above.
From #13675:
cassie_bench: Separate autodiff malloc counts from benchmark timing (#14146)

Relevant to: #10991, #13902. I finally realized that LimitMalloc counting was contributing significant overhead to autodiff benchmark timings, owing to necessary synchronization primitives in that module. This patch separates the two measurements, to clarify the things we want to focus on. Notice that this change of measurement will require some revision of our timings for older versions, to keep comparisons sensible. Reviewers should look for updates to the tracking issue #13902.
Relevant to: RobotLocomotion#10991, RobotLocomotion#13902. It turns out that relying on Eigen's Matrix::operator*= too heavily results in slower code. Rewrite AutoDiffXd::operator*= for autodiff inputs so that it gets better optimization and inlining from Eigen. Supporting benchmark measurements will be provided in RobotLocomotion#13902.
Would it be fair to say that this is not "priority: high" anymore?
Sigh. It's going to be a much longer slog than we hoped. Still highly desirable, and we should keep at it. Lowering to Medium.
We believe the primary performance problem with AutoDiffXd is the absurd amount of heap malloc/free it requires during computation. Why is that expensive? A general-purpose allocator must handle arbitrary sizes and threads, so every malloc/free pays for bookkeeping and synchronization, and the returned blocks end up scattered in memory, which hurts cache locality.
Ideas:
The bespoke memory manager would work because all the derivative blocks we need are the same size during a computation. That means allocation never has to search by size, and freed blocks can simply be pushed onto a free list and reused in LIFO order.
The latter point means good (hardware) cache behavior because the free list is essentially a stack: the last one freed is the first one reused and is likely still in the cache.
With careful implementation both allocation and freeing could be inline operations using only a few assembly instructions.
Also see discussion in issue #7039.