Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigInt speedup #3379

Merged
merged 3 commits into from
Aug 19, 2023
Merged

BigInt speedup #3379

merged 3 commits into from
Aug 19, 2023

Conversation

iliya-malecki
Copy link
Contributor

My contributions to pyo3 are available under the terms of either the Apache 2.0 OR the MIT license :)
This PR is a continuation of #3365
I am very much unsure about some parts, feel free to comment on how it might be improved or ditched. The balance i was trying to strike is between speed and maintainability - however, whether i was successful isnt very clear-cut. I succeeded in not using pointer offsets and other ridiculous methods of reaching into pyobjects memory, though

@iliya-malecki
Copy link
Contributor Author

This is the speedup i achieved (note the slowdown for small numbers)
image

Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

So I've been reading through this and trying to get a sense of what might be the cause of the slowdown.

I noticed that internally BigInt seems to use either u32 or u64 according to the platform pointer size.

https://github.com/rust-num/num-bigint/blob/65f62a8b1484448bfb9789ef4123b50556254905/build.rs#L9-L12

https://github.com/rust-num/num-bigint/blob/65f62a8b1484448bfb9789ef4123b50556254905/src/biguint.rs#L527-L538

It looks to me that on 64-bit systems using Vec<u32> as an intermediary isn't beneficial, because this gets copied into a new u64 representation. So the Vec gets dropped on the floor.

Given that, I'm a little surprised that we've got a speedup here. Have you got an intuition for which bits of the change yielded the performance boost? I see we now skip zero-initialization, I don't see any other clear wins...

src/conversions/num_bigint.rs Outdated Show resolved Hide resolved
@iliya-malecki
Copy link
Contributor Author

im again puzzled, the patch should now have 100% coverage shouldnt it?

@davidhewitt
Copy link
Member

Looking at the PR on codecov it's the error branches which have no coverage, which is often the case so not a huge surprise.

@iliya-malecki are you ok if I push some additional changes to this PR? I'm tempted to do a little further refactoring and see if I can get the small bigint performance back on par with the original implementation.

(And if you're interested in doing the upstream PR to num-bigint so that we could directly build the integer from the "native" u32 or u64 representation and avoid another Vec allocation then that would be a win for the whole ecosystem!)

@iliya-malecki
Copy link
Contributor Author

iliya-malecki commented Aug 16, 2023

I would absolutely love that, I did this pr mainly to learn Rust so any input is very welcome (and yes i have great plans to attempt something with num-bigint)

@iliya-malecki
Copy link
Contributor Author

Also, I'm looking at the codecov page but I'm not sure what I'm seeing, where does the value 15% (of uncovered lines) come from? From two lines of returning errors? Or is it a quirk that arises from those errors themselves not being covered anywhere?

@davidhewitt
Copy link
Member

Ok, I've pushed a refactoring which cut the duplication between signed & unsigned parts out into common helpers. For abi3 we now just use the bytes slice and let num_bigint do the work.

I had a quick look with miri on play.rust-lang.org using std::ptr::write_bytes as a substitute for _PyLong_AsByteArray and it seems happy with the idea of just reserving capacity and then using set_len after calling a method which operates on the vec pointer.

I also changed the negative integer case to do the "subtract 1" bit as part of our code rather than calling - 1 on the resulting BigInt object. For good measure I changed our tests to check the first 2000 fibonacci numbers, and the Rust and Python sequences agree so I'm pretty confident in our implementation.

This is now as fast as I can get the code. I did some profiling and unsurprisingly the conversion from u32 -> u64 form comes out as a large chunk of the runtime on my machine. @iliya-malecki if you're interested in taking on a follow-up with num-bigint to make it possible for us to pass u64 "digits" directly then I'm sure that will yield another significant speedup (maybe 2x?)

)?;
buffer.set_len(n_digits)
};
buffer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I wonder considering that the majority of users are probably on 64-bit architectures and even if we fix num-bigint, being able to rely on that fix will take a while, is whether using Vec<u8> with _PyLong_AsByteArray and BigUint::from_bytes_le might actually be faster in the 64-bit case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And might significantly simplify the signed case.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the original solution was that, wasn't it?

Copy link
Member

@adamreichold adamreichold Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the original solution was that, wasn't it?

Well, with additional zero-initialization and one additional copy in the limited API case, right? But yes, I would be interesting to see if there is any speed up left in the 64-bit case when Vec::with_capacity(n_digits) is replaced by vec![0; n_digits].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main:

extract_bigint_huge_negative
                        time:   [2.0350 µs 2.0353 µs 2.0356 µs]
extract_bigint_huge_positive
                        time:   [1.5174 µs 1.5177 µs 1.5181 µs]

this branch + zero initialization

extract_bigint_huge_negative
                        time:   [1.1691 µs 1.1694 µs 1.1697 µs]
extract_bigint_huge_positive
                        time:   [1.1229 µs 1.1233 µs 1.1239 µs]

So it's not just skipping zero-initialisation giving the speedup. My intuition for what's going on (having peeked in the num_bigint code) is that the implementation we've written here to convert from u8 to u32 and also to do two's complement is better optimised than the upstream versions in these cases too.

For example, BigInt::from_signed_bytes_le takes a copy of the bytes to do two's complement on that, before then throwing that away to do a loop which uses a lot of shifting to merge u8 to u64 (where we effectively just transmute a Vec modulo endianness to achieve that conversion).

https://github.com/rust-num/num-bigint/blob/65f62a8b1484448bfb9789ef4123b50556254905/src/bigint/convert.rs#L408C44-L408C44
https://github.com/rust-num/num-bigint/blob/65f62a8b1484448bfb9789ef4123b50556254905/src/biguint/convert.rs#L43

Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I'm concerned, this is as good as we can do here. There looks like there's still more to be won with improvements upstream, however I think that's out of scope for me to worry about for now.

@adamreichold adamreichold added this pull request to the merge queue Aug 19, 2023
Merged via the queue into PyO3:main with commit 12183ad Aug 19, 2023
30 of 31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants