AutoDiffScalar: move-aware rewrites of more math #14045
Conversation
+@amcastro-tri for feature review; this is the bit you requested in some prior PR.
FYI @sherm1 @jwnimmer-tri
Reviewable status: 1 unresolved discussion, LGTM missing from assignee amcastro-tri, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @rpoyner-tri)
a discussion (no related file):
My expectation is that speedups to cassie_bench cases will be small, since only a few of the functions involved are used much. I'll do some experiments and post some numbers.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee amcastro-tri, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @rpoyner-tri)
a discussion (no related file):
Previously, rpoyner-tri (Rick Poyner (rico)) wrote…
My expectation is that speedups to cassie_bench cases will be small, since only a few of the functions involved are used much. I'll do some experiments and post some numbers.
Initial results are showing me that this PR does nothing to speed up the cassie_bench cases, and also has no effect on allocation counts in the autodiff cassie_bench cases. The rewrites may still be worth landing, since they allow the optimizer to avoid allocations when these functions are invoked.
👍
Reviewed 2 of 2 files at r1.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee amcastro-tri, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @rpoyner-tri)
Excellent! Just two non-blocking comments.
Reviewed 1 of 2 files at r1.
Reviewable status: 3 unresolved discussions, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @rpoyner-tri)
common/autodiffxd.h, line 440 at r1 (raw file):
DRAKE_EIGEN_AUTODIFFXD_DECLARE_GLOBAL_UNARY(
    atan,
    using std::atan;
    x.derivatives() *= Scalar(1) / (1 + x.value() * x.value());
Interesting. Do you know why above we use numext::abs2(x.value()) while here we use x.value() * x.value()?
common/autodiffxd.h, line 488 at r1 (raw file):
a.derivatives() *= b * pow(a.value(), b - 1);
a.value() = pow(a.value(), b);
return a;
I'm not sure how much a pow costs, but I assume quite a few flops, based on this cool article. We could write this as:
a.value() = pow(a.value(), b);
a.derivatives() *= b / a * a.value();
since aᵇ⁻¹ = aᵇ/a.
Probably premature optimization, but I thought you'd like to hear it. Also, I love the benchmark in that article I found; I wonder if it could somehow be useful for Drake in some shape or form.
Reviewed 1 of 2 files at r1.
Reviewable status: 3 unresolved discussions, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @rpoyner-tri)
a discussion (no related file):
Previously, rpoyner-tri (Rick Poyner (rico)) wrote…
Initial results are showing me that this PR does nothing to speed up the cassie_bench cases, and also has no effect on allocation counts in autodiff cassie_bench cases. They still may be worth landing, since they allow the optimizer to avoid allocations when these functions are invoked.
On the speed, I am not that surprised. On the number of allocations? That I am. Is that reasonable?
Reviewable status: 3 unresolved discussions, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @rpoyner-tri)
a discussion (no related file):
Previously, amcastro-tri (Alejandro Castro) wrote…
On the speed, I am not that surprised. On the number of allocations? that I am. Is that reasonable?
I haven't been tracking allocation counts that closely since my last big PR. It's entirely possible that these functions get called in places where the inputs are not moveable (or only the right-hand one is, for some of the binary ops). Also, we are dependent on the optimizer noticing and capitalizing on these opportunities. I'm no optimizer expert, but I'm reasonably certain that some optimizations need more coaxing than others. Back in the bad old days I would see similar runs of code optimized better or worse based on the memory pressure during compilation. I hope that's no longer a thing, but I haven't studied it in detail here.
FWIW, the most recent gain in allocation counts (not yet matched by a lowering of caps in cassie_bench) was from @sherm1's #13962. That's a good example of a situation where fewer operations generate fewer copies/temporaries and therefore fewer allocations. It could be that the next set of substantial gains will come from changes more like that.
common/autodiffxd.h, line 440 at r1 (raw file):
Previously, amcastro-tri (Alejandro Castro) wrote…
interesting, do you know why above we use numext::abs2(x.value()) while here we use x.value() * x.value()?
I don't actually know -- I've tried to copy the existing math with as few changes as possible. I suppose git knows where these implementations came from, but I didn't look. My main goal was to keep the math correctness tests passing while making allocation-count gains. :)
common/autodiffxd.h, line 488 at r1 (raw file):
Previously, amcastro-tri (Alejandro Castro) wrote…
not sure how much a pow costs, but I assume quite a few flops from this cool article. We could write this as:
a.value() = pow(a.value(), b);
a.derivatives() *= b / a * a.value();
since aᵇ⁻¹ = aᵇ/a. Probably premature optimization but I thought you'd like to hear it. Also I love the benchmark in that article I found. I wonder if somehow useful for Drake in some shape.
Nice article. I have some questions about his benchmark technique, though. Also, I'm not sure how much pow is even invoked outside of tests. At some point I'd like to go hunting for more macro/real-world benchmarks, since they give a clearer idea of the relative importance of different low-level operations.
That said, your rewrite could be a win. Well-known algorithms for double-double pow obey an O(1) bound by using logarithms, so we're probably then counting speeds of multiplies, divides, etc. Only a carefully crafted benchmark will tell for sure.
Reviewable status: 2 unresolved discussions, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @rpoyner-tri)
common/autodiffxd.h, line 440 at r1 (raw file):
Previously, rpoyner-tri (Rick Poyner (rico)) wrote…
I don't actually know -- I've tried to copy the existing math with as few changes as possible. I suppose git knows where these implementations came from, but I didn't look. My main goal was to keep the math correctness tests passing, while making allocation count gains. :)
I wondered that too. They wouldn't be identical for complex x. I would have expected consistency or a comment explaining why this is different. Maybe a bug? Probably worth at least a TODO to check this, or maybe an Eigen issue?
common/autodiffxd.h, line 488 at r1 (raw file):
Previously, rpoyner-tri (Rick Poyner (rico)) wrote…
Nice article. I have some questions about his benchmark technique. Also, I'm not sure how much pow is even invoked, outside of tests. At some point I'd like to go hunting for more macro/real-world benchmarks, since they give a clearer idea of the relative importance of different low-level operations. That said, your rewrite could be a win. Well-known algorithms for double-double pow obey an O(1) bound by using logarithms, so we're probably then counting speeds of multiplies, divides, etc. Only a carefully crafted benchmark will tell for sure.
pow() is definitely expensive in general -- better to avoid it when possible. But along the same lines, multiply is much cheaper than divide. So I would do it like this:
const Scalar ab1 = pow(a.value(), b - 1);
a.value() *= ab1;            // a^b
a.derivatives() *= b * ab1;  // b * a^(b-1)
Reviewable status: 2 unresolved discussions, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @sherm1)
common/autodiffxd.h, line 440 at r1 (raw file):
Previously, sherm1 (Michael Sherman) wrote…
I wondered that too. They wouldn't be identical for complex x. I would have expected consistency or a comment explaining why this is different. Maybe a bug? Probably worth at least a TODO to check this, or maybe an Eigen issue?
I'll file a ticket.
common/autodiffxd.h, line 488 at r1 (raw file):
Previously, sherm1 (Michael Sherman) wrote…
pow() is definitely expensive in general -- better to avoid it when possible. But along the same lines multiply is much cheaper than divide. So I would do it like this:
const Scalar ab1 = pow(a.value(), b - 1);
a.value() *= ab1;            // a^b
a.derivatives() *= b * ab1;  // b * a^(b-1)
I'm sure that's fine, but my guideline for PRs is to try to do one thing well. Do we have any data to indicate that it even matters? I'm reasonably certain cassie_bench doesn't use this at all -- do we have any examples anywhere that do (not counting unit tests)?
“In God we trust, all others must bring data.” – W. Edwards Deming
Reviewable status: 2 unresolved discussions, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @sherm1)
common/autodiffxd.h, line 440 at r1 (raw file):
Previously, rpoyner-tri (Rick Poyner (rico)) wrote…
I'll file a ticket.
Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @rpoyner-tri)
common/autodiffxd.h, line 488 at r1 (raw file):
Previously, rpoyner-tri (Rick Poyner (rico)) wrote…
I'm sure that's fine, but my guideline for PRs is to try to do one thing well. Do we have any data to indicate that it even matters? I'm reasonably certain cassie_bench doesn't use this at all -- do we have any examples anywhere that do (not counting unit tests)?
“In God we trust, all others must bring data.” – W. Edwards Deming
Likely not an issue in Drake. However, we would not be able to tell if it were, since tiny inlined things will typically escape cachegrind's watchful eye. The usual approach with low-level arithmetic functions is to make each one as fast as possible. There is also a reasonable argument for not changing from the Eigen implementation, but it is at least worth a TODO here noting that this implementation is unnecessarily slow.
Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers (waiting on @amcastro-tri and @sherm1)
common/autodiffxd.h, line 488 at r1 (raw file):
Previously, sherm1 (Michael Sherman) wrote…
Likely not an issue in Drake. However we would not be able to tell if it were since tiny inlined things will typically escape cachegrind's watchful eye. The usual approach with low level arithmetic functions is to make each one as fast as possible. There is also a reasonable argument for not changing from the Eigen implementation, but it is at least worth a TODO here noting that this implementation is unnecessarily slow.
Relevant to: RobotLocomotion#13985, RobotLocomotion#14039
Rewrite various global-scope math functions on AutoDiffScalars to
exploit pass-by-value optimizations; demonstrate resulting improvements
via lowered allocation counts in the heap test. The issues noted above
give more context.
Force-pushed from 51c23bc to 603346a.
+@sherm1 for platform review, since you've come this far :)
Reviewable status: LGTM missing from assignee sherm1(platform) (waiting on @amcastro-tri and @sherm1)
Reviewed 1 of 1 files at r2.
Reviewable status: LGTM missing from assignee sherm1(platform)
Reviewable status: complete! all discussions resolved, LGTM from assignees amcastro-tri,sherm1(platform)
Relevant to: #13985, #14039
Rewrite various global-scope math functions on AutoDiffScalars to
exploit pass-by-value optimizations; demonstrate resulting improvements
via lowered allocation counts in the heap test. The issues noted above
give more context.