-
Notifications
You must be signed in to change notification settings - Fork 763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BigInt speedup #3379
BigInt speedup #3379
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution!
So I've been reading through this and trying to get a sense of what might be the cause of the slowdown.
I noticed that internally BigInt
seems to use either u32
or u64
according to the platform pointer size.
It looks to me that on 64-bit systems using Vec<u32>
as an intermediary isn't beneficial, because this gets copied into a new u64
representation. So the Vec
gets dropped on the floor.
Given that, I'm a little surprised that we've got a speedup here. Have you got an intuition for which bits of the change yielded the performance boost? I see we now skip zero-initialization, I don't see any other clear wins...
im again puzzled, the patch should now have 100% coverage shouldnt it? |
Looking at the PR on codecov it's the error branches which have no coverage, which is often the case so not a huge surprise. @iliya-malecki are you ok if I push some additional changes to this PR? I'm tempted to do a little further refactoring and see if I can get the small bigint performance back on par with the original implementation. (And if you're interested in doing the upstream PR to |
I would absolutely love that, I did this pr mainly to learn Rust so any input is very welcome (and yes i have great plans to attempt something with num-bigint) |
Also, I'm looking at the codecov page but I'm not sure what I'm seeing, where does the value 15% (of uncovered lines) come from? From two lines of returning errors? Or is it a quirk that arises from those errors themselves not being covered anywhere? |
Ok, I've pushed a refactoring which cut the duplication between signed & unsigned parts out into common helpers. For abi3 we now just use the bytes slice and let num_bigint do the work. I had a quick look with miri on play.rust-lang.org using I also changed the negative integer case to do the "subtract 1" bit as part of our code rather than calling This is now as fast as I can get the code. I did some profiling and unsurprisingly the conversion from u32 -> u64 form comes out as a large chunk of the runtime on my machine. @iliya-malecki if you're interested in taking on a follow-up with |
)?; | ||
buffer.set_len(n_digits) | ||
}; | ||
buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I wonder considering that the majority of users are probably on 64-bit architectures and even if we fix num-bigint
, being able to rely on that fix will take a while, is whether using Vec<u8>
with _PyLong_AsByteArray
and BigUint::from_bytes_le
might actually be faster in the 64-bit case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(And might significantly simplify the signed case.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the original solution was that, wasn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the original solution was that, wasn't it?
Well, with additional zero-initialization and one additional copy in the limited API case, right? But yes, I would be interesting to see if there is any speed up left in the 64-bit case when Vec::with_capacity(n_digits)
is replaced by vec![0; n_digits]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main:
extract_bigint_huge_negative
time: [2.0350 µs 2.0353 µs 2.0356 µs]
extract_bigint_huge_positive
time: [1.5174 µs 1.5177 µs 1.5181 µs]
this branch + zero initialization
extract_bigint_huge_negative
time: [1.1691 µs 1.1694 µs 1.1697 µs]
extract_bigint_huge_positive
time: [1.1229 µs 1.1233 µs 1.1239 µs]
So it's not just skipping zero-initialisation giving the speedup. My intuition for what's going on (having peeked in the num_bigint
code) is that the implementation we've written here to convert from u8 to u32 and also to do two's complement is better optimised than the upstream versions in these cases too.
For example, BigInt::from_signed_bytes_le
takes a copy of the bytes to do two's complement on that, before then throwing that away to do a loop which uses a lot of shifting to merge u8 to u64 (where we effectively just transmute a Vec modulo endianness to achieve that conversion).
https://github.com/rust-num/num-bigint/blob/65f62a8b1484448bfb9789ef4123b50556254905/src/bigint/convert.rs#L408C44-L408C44
https://github.com/rust-num/num-bigint/blob/65f62a8b1484448bfb9789ef4123b50556254905/src/biguint/convert.rs#L43
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I'm concerned, this is as good as we can do here. There looks like there's still more to be won with improvements upstream, however I think that's out of scope for me to worry about for now.
My contributions to pyo3 are available under the terms of either the Apache 2.0 OR the MIT license :)
This PR is a continuation of #3365
I am very much unsure about some parts, feel free to comment on how it might be improved or ditched. The balance i was trying to strike is between speed and maintainability - however, whether i was successful isnt very clear-cut. I succeeded in not using pointer offsets and other ridiculous methods of reaching into pyobjects memory, though