AutoDiffScalar: move-aware rewrites of more math #14045
Conversation
+@amcastro-tri for feature review; this is the bit you requested in some prior PR.
FYI @sherm1 @jwnimmer-tri
Reviewable status: 1 unresolved discussion, LGTM missing from assignee amcastro-tri, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @rpoyner-tri)
a discussion (no related file):
My expectation is that speedups to cassie_bench cases will be small, since only a few of the functions involved are used much. I'll do some experiments and post some numbers.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee amcastro-tri, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @rpoyner-tri)
a discussion (no related file):
Previously, rpoyner-tri (Rick Poyner (rico)) wrote…
My expectation is that speedups to cassie_bench cases will be small, since only a few of the functions involved are used much. I'll do some experiments and post some numbers.
Initial results are showing me that this PR does nothing to speed up the cassie_bench cases, and also has no effect on allocation counts in the autodiff cassie_bench cases. The rewrites may still be worth landing, since they allow the optimizer to avoid allocations when these functions are invoked.
👍
Reviewed 2 of 2 files at r1.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee amcastro-tri, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @rpoyner-tri)
Excellent! Just two non-blocking comments.
Reviewed 1 of 2 files at r1.
Reviewable status: 3 unresolved discussions, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @rpoyner-tri)
common/autodiffxd.h, line 440 at r1 (raw file):
DRAKE_EIGEN_AUTODIFFXD_DECLARE_GLOBAL_UNARY(
    atan,
    using std::atan;
    x.derivatives() *= Scalar(1) / (1 + x.value() * x.value());
Interesting. Do you know why above we use numext::abs2(x.value()) while here we use x.value() * x.value()?
common/autodiffxd.h, line 488 at r1 (raw file):
a.derivatives() *= b * pow(a.value(), b - 1);
a.value() = pow(a.value(), b);
return a;
I'm not sure how much a pow costs, but I assume quite a few flops, based on this cool article. We could write this as:
a.value() = pow(a.value(), b);
a.derivatives() *= b / a * a.value();
since aᵇ⁻¹ = aᵇ/a.
Probably premature optimization, but I thought you'd like to hear it. Also, I love the benchmark in that article I found; I wonder if it could somehow be useful for Drake in some shape or form.
Reviewed 1 of 2 files at r1.
Reviewable status: 3 unresolved discussions, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @rpoyner-tri)
a discussion (no related file):
Previously, rpoyner-tri (Rick Poyner (rico)) wrote…
Initial results are showing me that this PR does nothing to speed up the cassie_bench cases, and also has no effect on allocation counts in autodiff cassie_bench cases. They still may be worth landing, since they allow the optimizer to avoid allocations when these functions are invoked.
On the speed, I am not that surprised. On the number of allocations? That I am. Is that reasonable?
Reviewable status: 3 unresolved discussions, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @rpoyner-tri)
a discussion (no related file):
Previously, amcastro-tri (Alejandro Castro) wrote…
On the speed, I am not that surprised. On the number of allocations? that I am. Is that reasonable?
I haven't been tracking allocation counts that closely since my last big PR. It's entirely possible that these functions get called in places where the inputs are not moveable (or only the right-hand one is, for some of the binary ops). Also, we are dependent on the optimizer noticing and capitalizing on these opportunities. I'm no optimizer expert, but I'm reasonably certain that some optimizations need more coaxing than others. Back in the bad old days I would see similar runs of code optimized better or worse based on the memory pressure during compilation. I hope that's no longer a thing, but I haven't studied it in detail here.
FWIW, the most recent gain in allocation counts (not yet matched by a lowering of caps in cassie_bench) was from @sherm1's #13962. That's a good example of a situation where fewer operations generate fewer copies/temporaries and therefore fewer allocations. It could be that the next set of substantial gains will come from changes more like that.
common/autodiffxd.h, line 440 at r1 (raw file):
Previously, amcastro-tri (Alejandro Castro) wrote…
interesting, do you know why above we use numext::abs2(x.value()) while here we use x.value() * x.value()?
I don't actually know -- I've tried to copy the existing math with as few changes as possible. I suppose git knows where these implementations came from, but I didn't look. My main goal was to keep the math correctness tests passing while making allocation-count gains. :)
common/autodiffxd.h, line 488 at r1 (raw file):
Previously, amcastro-tri (Alejandro Castro) wrote…
not sure how much a pow costs, but I assume quite a few flops from this cool article. We could write this as:
a.value() = pow(a.value(), b);
a.derivatives() *= b / a * a.value();
since aᵇ⁻¹ = aᵇ/a. Probably premature optimization but I thought you'd like to hear it. Also I love the benchmark in that article I found. I wonder if somehow useful for Drake in some shape.
Nice article. I have some questions about his benchmark technique, though. Also, I'm not sure how much pow is even invoked outside of tests. At some point I'd like to go hunting for more macro/real-world benchmarks, since they give a clearer idea of the relative importance of different low-level operations.
That said, your rewrite could be a win. Well-known algorithms for double-double pow obey an O(1) bound by using logarithms, so we're probably then counting speeds of multiplies, divides, etc. Only a carefully crafted benchmark will tell for sure.
Reviewable status: 2 unresolved discussions, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @rpoyner-tri)
common/autodiffxd.h, line 440 at r1 (raw file):
Previously, rpoyner-tri (Rick Poyner (rico)) wrote…
I don't actually know -- I've tried to copy the existing math with as few changes as possible. I suppose git knows where these implementations came from, but I didn't look. My main goal was to keep the math correctness tests passing, while making allocation count gains. :)
I wondered that too. They wouldn't be identical for complex x. I would have expected consistency or a comment explaining why this is different. Maybe a bug? Probably worth at least a TODO to check this, or maybe an Eigen issue?
common/autodiffxd.h, line 488 at r1 (raw file):
Previously, rpoyner-tri (Rick Poyner (rico)) wrote…
Nice article. I have some questions about his benchmark technique. Also, I'm not sure how much pow is even invoked, outside of tests. At some point I'd like to go hunting for more macro/real-world benchmarks, since they give a clearer idea of the relative importance of different low-level operations. That said, your rewrite could be a win. Well-known algorithms for double-double pow obey an O(1) bound by using logarithms, so we're probably then counting speeds of multiplies, divides, etc. Only a carefully crafted benchmark will tell for sure.
pow() is definitely expensive in general -- better to avoid it when possible. But along the same lines, multiply is much cheaper than divide. So I would do it like this:
const Scalar ab1 = pow(a.value(), b - 1);
a.value() *= ab1;            // a^b
a.derivatives() *= b * ab1;  // b * a^(b-1)
Reviewable status: 2 unresolved discussions, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @sherm1)
common/autodiffxd.h, line 440 at r1 (raw file):
Previously, sherm1 (Michael Sherman) wrote…
I wondered that too. They wouldn't be identical for complex x. I would have expected consistency or a comment explaining why this is different. Maybe a bug? Probably worth at least a TODO to check this, or maybe an Eigen issue?
I'll file a ticket.
common/autodiffxd.h, line 488 at r1 (raw file):
Previously, sherm1 (Michael Sherman) wrote…
pow() is definitely expensive in general -- better to avoid it when possible. But along the same lines multiply is much cheaper than divide. So I would do it like this:
const Scalar ab1 = pow(a.value(), b - 1);
a.value() *= ab1;            // a^b
a.derivatives() *= b * ab1;  // b * a^(b-1)
I'm sure that's fine, but my guideline for PRs is to try to do one thing well. Do we have any data to indicate that it even matters? I'm reasonably certain cassie_bench doesn't use this at all -- do we have any examples anywhere that do (not counting unit tests)?
“In God we trust, all others must bring data.” – W. Edwards Deming
Reviewable status: 2 unresolved discussions, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @sherm1)
common/autodiffxd.h, line 440 at r1 (raw file):
Previously, rpoyner-tri (Rick Poyner (rico)) wrote…
I'll file a ticket.
Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @rpoyner-tri)
common/autodiffxd.h, line 488 at r1 (raw file):
Previously, rpoyner-tri (Rick Poyner (rico)) wrote…
I'm sure that's fine, but my guideline for PRs is to try to do one thing well. Do we have any data to indicate that it even matters? I'm reasonably certain cassie_bench doesn't use this at all -- do we have any examples anywhere that do (not counting unit tests)?
“In God we trust, all others must bring data.” – W. Edwards Deming
Likely not an issue in Drake. However, we would not be able to tell if it were, since tiny inlined things will typically escape cachegrind's watchful eye. The usual approach with low-level arithmetic functions is to make each one as fast as possible. There is also a reasonable argument for not changing from the Eigen implementation, but it is at least worth a TODO here noting that this implementation is unnecessarily slow.
Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @sherm1)
common/autodiffxd.h, line 488 at r1 (raw file):
Previously, sherm1 (Michael Sherman) wrote…
Likely not an issue in Drake. However we would not be able to tell if it were since tiny inlined things will typically escape cachegrind's watchful eye. The usual approach with low level arithmetic functions is to make each one as fast as possible. There is also a reasonable argument for not changing from the Eigen implementation, but it is at least worth a TODO here noting that this implementation is unnecessarily slow.
Relevant to: RobotLocomotion#13985, RobotLocomotion#14039
Rewrite various global-scope math functions on AutoDiffScalars to
exploit pass-by-value optimizations; demonstrate resulting improvements
via lowered allocation counts in the heap test. The issues noted above
give more context.
Force-pushed from 51c23bc to 603346a.
+@sherm1 for platform review, since you've come this far :)
Reviewable status: LGTM missing from assignee sherm1(platform) (waiting on @amcastro-tri and @sherm1)
Reviewed 1 of 1 files at r2.
Reviewable status: LGTM missing from assignee sherm1(platform)
Reviewable status: complete! all discussions resolved, LGTM from assignees amcastro-tri,sherm1(platform)
Relevant to: #13985, #14039
Rewrite various global-scope math functions on AutoDiffScalars to
exploit pass-by-value optimizations; demonstrate resulting improvements
via lowered allocation counts in the heap test. The issues noted above
give more context.