-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
comba mul/sqr: remove W array #441
Conversation
718dbc8
to
ca8cb27
Compare
* Output dp must not alias the input dps. * We avoid the unnecessary copying in the end. * No huge W array is allocated on the stack. * If the digit representation is reworked such that the full bitwidth is used, comba can be used for more digits. In that case however allocating the W array on the stack is not feasible. * One additional advantage is that the code gets simpler * The disadvantage is that the the user must ensure that no aliasing happens if the fast multiplier is to be used. Do we want that? I would argue that using aliasing (which impedes optimization in this case) should be discouraged. However this is a deviation from how things have been until now. We should document the change, such that code which suddenly performs worse can be updated. ALTERNATIVE 1: The comba functions could allocate a temporary if the inputs alias the output. ALTERNATIVE 2: We could move the allocation of a temporary outside of both s_mp_(sqr|mul) and s_mp_(sqr|mul)_comba into mp_mul/mp_sqr. Then the s_mp functions wouldn't accept aliases anymore. But we would also speed up the plain old s_mp_mul/s_mp_sqr functions in the non-aliasing case (no allocations!). Now that I think about it, I think ALTERNATIVE 2 sounds like a good solution! We also have to think about mul_high_comba/reduce/montgomery_reduce etc, since the reduce functions mutate the input and therefore alias.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put it all on the heap instead?
Yes, not a bad idea.
Alternative 2? I don't see much of a difference but alt. 2 seems to be cleaner.
We also have to think about mul_high_comba/reduce/montgomery_reduce
etc, since the reduce functions mutate the input and therefore alias.
Do you think about changing that behaviour?
FYI RSA in libtomcrypt is ~10% slower with this version of mul&sqr! |
A bit was expected but 10% are a bit much. |
@sjaeckel ok but this is not the final version. In the end we would change the usage pattern of the library a bit, discouraging aliasing. And then your ltc rsa version would have to be changed. What you are observing is that comba is just not used, that's giving the slowdown. It is all expected. But this is just an experiment. So nothing to worry. If we make such intrusive changes, we definitely have to benchmark. |
I close this for now, to not confuse you about things I am trying :) |
This is just an experiment to gather some feedback and preliminary work needed if we want full width digits. It is based on #434.
Output dp must not alias the input dps.
We avoid the unnecessary copying in the end.
No huge W array is allocated on the stack.
If the digit representation is reworked such that the full bitwidth is
used, comba can be used for more digits.
In that case however allocating the W array on the stack is not
feasible.
One additional advantage is that the code gets simpler
The disadvantage is that the the user must ensure that no aliasing
happens if the fast multiplier is to be used. Do we want that?
I would argue that using aliasing (which impedes optimization in
this case) should be discouraged.
However this is a deviation from how things have been until now.
We should document the change, such that code which suddenly
performs worse can be updated.
ALTERNATIVE 1:
The comba functions could allocate a temporary if
the inputs alias the output.
ALTERNATIVE 2:
We could move the allocation of a temporary outside
of both s_mp_(sqr|mul) and s_mp_(sqr|mul)_comba
into mp_mul/mp_sqr. Then the s_mp functions wouldn't
accept aliases anymore. But we would also speed up the
plain old s_mp_mul/s_mp_sqr functions in the non-aliasing
case (no allocations!).
Now that I think about it, I think ALTERNATIVE 2 sounds like a good
solution!
We also have to think about mul_high_comba/reduce/montgomery_reduce
etc, since the reduce functions mutate the input and therefore alias.
Ping @czurnieden