Unroll biginteger loops and reduce copies #205

Merged
merged 13 commits into from
Feb 6, 2021

Member

@Pratyush Pratyush commented Feb 5, 2021

Description

Replaces #199, motivation from #198


Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

  • Targeted PR against correct branch (master)
  • Linked to Github issue with discussion and accepted design OR have an explanation in the PR that describes this work.
  • Wrote unit tests
  • Updated relevant documentation in the code
  • Added a relevant changelog entry to the Pending section in CHANGELOG.md
  • Re-reviewed Files changed in the Github PR explorer

weikengchen previously approved these changes Feb 5, 2021
Member

@weikengchen weikengchen left a comment

I skimmed the changes. Most of them look reasonable. I will wait for CI to do the careful check.

@jon-chuang
Contributor

I still don't understand the purpose of this PR. Why were the changes to the non-assigning ops made? I was trying to establish that these changes have no worth and are in fact detrimental; however, these complaints have not been addressed.

One can already achieve everything one would like in the new versions of the non-assigning ops with the assigning versions. Previously, non-assigning ops acted as syntactic sugar. Now, I am unconvinced one should ever use them.

@jon-chuang
Contributor

Why would you need to assign the result of sub to a new variable, when you have the old variable denoted by self that holds that very value? That makes absolutely no sense. And yet this got an approving review, whereas a PR that I made that did not have such absurd changes and had been thoroughly benched did not get any review.

@weikengchen
Member

I apologize; I was focusing more on the changes that clean up nonnative, poly-commit, ... I will surely take a look at the min const generics PR soon.

It is also related to the number of changes---small PRs are easier to reconcile. Our PRs in poly-commit and marlin, which add the constraints for Marlin verifier, are still there---because we want to avoid conflicts with other PRs and projects.

@jon-chuang
Contributor

jon-chuang commented Feb 5, 2021

Hi ok sorry, anw I wasn't referring to min-const-generics, I think that one will be binned, and it's low priority. I'm referring to #204 which is a very small PR.

It's just there seems no reason why some people have more authority than others to push to branches, approve reviews, assign reviewers/request reviews, put labels. Makes little sense.

For me, it adds salt to the wound that the entire arkworks migration was carried out without asking for much input from the community, and also that numerous weird changes were made without any input. To me I'm just wondering how much you'd like to involve others or just want to do everything in your own circle.

@weikengchen
Member

It has to do with the challenging balance between an open-source platform and an academic prototype---sometimes we need to wrap up something within a time limit---and that is probably why breaking changes are happening more often than they should in a stable library. It is a slightly challenging balance that we are maintaining.

I want to assure you that when I approved this one, I had just been requested for a review, and I just did a sanity check. You may also request a review or @ us.

We are sort of overwhelmed with many projects---and for me, beyond arkworks, I am also working on other non-zero-knowledge stuff, so pardon me for being only able to maintain a slight balance.

@weikengchen
Member

I think there is something that we can improve---for example, as you point out, the ability to request a review should be open.

Again please pardon that it is also a preliminary step, and many things can be changed to improve it!

@jon-chuang
Contributor

jon-chuang commented Feb 5, 2021

Hi yes, I definitely understand that, anw I think I've not been too communicative, I will try to communicate more.

Actually I do appreciate the work from the arkworks team of course.

Anw I should chat more and figure out some solutions. Anw I think this is a case of just pratyush and I developing parallel work, which is confusing, I should chat more with pratyush in real time on telegram let's say since exchanging comments/PRs on issues can be a bit stilted. Since there are some fundamental differences in the scopes of the successive PRs.

@weikengchen
Member

Yes. I feel it is related to the situation that Pratyush is unaware that you will be working on it.

On the one hand, we are trying to get more people involved via good first issue because we still have to do many things ourselves. On the other hand, we are very fortunate to have you.

It is fortunate that we are working on zero-knowledge proofs that have a larger community.


I do receive emails for all the PRs but I usually focus on those that I am more familiar with (or bug reports). So, I am not perfect.

@jon-chuang
Contributor

jon-chuang commented Feb 5, 2021

Again please pardon that it is also a preliminary step, and many things can be changed to improve it!

Hmm ok, but if it hasn't been reviewed carefully, I don't think it should be approved, no?

@weikengchen
Member

I was just doing a sanity check---I am not an expert in asm, and the review process is mainly to prevent errors, add comments on things that are unclear.

I agree that we should have let you review this and reconcile with #204.

Member

@weikengchen weikengchen left a comment

TODO: Reconcile with #204 as there seem to be some subtle improvements.

@weikengchen weikengchen dismissed their stale review February 5, 2021 11:17

See the ongoing discussion for more tuning.

@ValarDragon
Member

ValarDragon commented Feb 5, 2021

Why would you need to assign the result of sub to a new variable, when you have the old variable denoted by self that holds that very value? That makes absolutely no sense. And yet this got an approving review, whereas a PR that I made that did not have such absurd changes and had been thoroughly benched did not get any review.
Hi ok sorry, anw I wasn't referring to min-const-generics, I think that one will be binned, and it's low priority. I'm referring to #204 which is a very small PR.

FWIW, both this PR and 204 were made while I was asleep lol. I don't think sub-24-hour review turnarounds are a reasonable expectation to hold as people are strewn across many timezones and have multiple projects in parallel... (Nor do other open source projects give any such guarantee. Projects I've worked on in the past such as https://github.com/tendermint/tendermint/ definitely don't)

@ValarDragon
Member

Why would you need to assign the result of sub to a new variable, when you have the old variable denoted by self that holds that very value? That makes absolutely no sense.

Are you talking about

    fn sub(mut self, other: &'a Self) -> Self {
        self -= other;
        self
    }

This seems like a great optimization given that the base is already copied when this function is called?
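
To make the semantics under discussion concrete, here is a minimal sketch, assuming a hypothetical `Copy` type `Elem` standing in for the crate's field and biginteger types (the name and the wrapping `u64` arithmetic are illustrative only). It shows that a by-value `sub(mut self, ...)` mutates only the callee's copy, so the caller's variable is left untouched:

```rust
use std::ops::{Sub, SubAssign};

// Hypothetical stand-in for a field element; not a type from the PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Elem(u64);

impl<'a> SubAssign<&'a Elem> for Elem {
    fn sub_assign(&mut self, other: &'a Elem) {
        // Wrapping subtraction keeps the sketch free of overflow panics.
        self.0 = self.0.wrapping_sub(other.0);
    }
}

impl<'a> Sub<&'a Elem> for Elem {
    type Output = Elem;

    // `self` is taken by value: for a `Copy` type the caller's value is
    // copied into the call, and only that copy is mutated and returned.
    fn sub(mut self, other: &'a Elem) -> Elem {
        self -= other;
        self
    }
}
```

With this shape, `let y = p - &q;` (or `p.sub(&q)`) yields the difference in `y` while `p` keeps its original value, because `p` is copied at the call site.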

@jon-chuang
Contributor

Why would you need to assign the result of sub to a new variable, when you have the old variable denoted by self that holds that very value? That makes absolutely no sense. And yet this got an approving review, whereas a PR that I made that did not have such absurd changes and had been thoroughly benched did not get any review.
Hi ok sorry, anw I wasn't referring to min-const-generics, I think that one will be binned, and it's low priority. I'm referring to #204 which is a very small PR.

FWIW, both this PR and 204 were made while I was asleep lol. I don't think sub-24-hour review turnarounds are a reasonable expectation to hold as people are strewn across many timezones and have multiple projects in parallel... (Nor do other open source projects give any such guarantee. Projects I've worked on in the past such as https://github.com/tendermint/tendermint/ definitely don't)

Well that's definitely not what I expected. What really irked me is that this PR had an immediate approving review, even though my overlapping PR was posted earlier; further, there were many unexplained choices like the ones I am disputing.

So it seemed to me like a bit of group-think and disregard for evaluating PRs on their merits, and also neglect for others' work in favour of those of the maintainers.

@jon-chuang
Contributor

jon-chuang commented Feb 5, 2021

Why would you need to assign the result of sub to a new variable, when you have the old variable denoted by self that holds that very value? That makes absolutely no sense.

Are you talking about

    fn sub(mut self, other: &'a Self) -> Self {
        self -= other;
        self
    }

This seems like a great optimization given that the base is already copied when this function is called?

Well, in that case, one would simply do:

p.sub_assign(&q);
let y = p;

Alternatively, if it is to copy to a new variable as you said, one does let mut y = p; p.sub(&q). But then one is as happy doing let mut y = p; y.sub_assign(&q).
The syntactic sugar introduced by the pr would result in let y = p.sub(&q).
However, now we have modified p. So it's confusing.
No one expects p to be modified in the process.

The current way of writing, which has been fine until now, and no one has brought up any reason to change, is let y = p.sub(&q);

y is now a new variable containing that result, and p is a variable we have yet to modify. So there is a clear and meaningful separation of duties between the assign and non assigning ops.

None of the above affects performance as they are just syntactic sugar, the copies must happen anyway, so they are not by any means "optimisations" as you call them.

Further, pratyush already confirmed there is no performance reason to change the way the group addition formulas are written, and in fact I show with the changes to the biginteger ops that, for groups over Fp, they are near the performance of gurvy. The remaining discrepancies, I believe, lie in things like unnecessary data movement at function/block boundaries (whereas gurvy generates much more optimal assembly for these via scripts). This suggests that there are few things apart from writing assembly to improve things further.

Hence, my conclusion is that there has been 0 argument for why these changes make any sense at all.

Rather, the motivation for these changes seems like an artifact from pratyush's earlier attempts to optimize the way the formulas were written.

@jon-chuang
Contributor

jon-chuang commented Feb 5, 2021

If one truly wanted to extend the current syntactic sugar, the appropriate function to modify is to append self to every _assign function and have that function return type Self. Think about it and tell me I'm right.

@ValarDragon
Member

ValarDragon commented Feb 5, 2021

Well that's definitely not what I expected. What really irked me is that this PR had an immediate approving review, even though my overlapping PR was posted earlier; further, there were many unexplained choices like the ones I am disputing.

So it seemed to me like a bit of group-think and disregard for evaluating PRs on their merits, and also neglect for others' work in favour of those of the maintainers.

Can you assume some more positive intent on our part? We're talking about a time difference in review of 5 hours between two PRs, and where for one of the PRs the reviewer was explicitly requested for review. Otherwise, the flow is that for a PR /issue not made by a maintainer, a maintainer has to add tags & request relevant people for review.

Thank you for disputing the unexplained choice! Calling it unexplained is a bit untrue though, as there were two explanations in the prior PR: #199 (comment), #199 (comment). My understanding of your last comment is that you think there is an error in these explanations; I have not as of yet looked into the docs / read the latest edits of the last two comments to understand this.

@jon-chuang
Contributor

jon-chuang commented Feb 5, 2021

Well that's definitely not what I expected. What really irked me is that this PR had an immediate approving review, even though my overlapping PR was posted earlier; further, there were many unexplained choices like the ones I am disputing.
So it seemed to me like a bit of group-think and disregard for evaluating PRs on their merits, and also neglect for others' work in favour of those of the maintainers.

Can you assume some more positive intent on our part? We're talking about a time difference in review of 5 hours between two PRs, and where for one of the PRs the reviewer was explicitly requested for review. Otherwise, the flow is that for a PR /issue not made by a maintainer, a maintainer has to add tags & request relevant people for review.

Thank you for disputing the unexplained choice, but there were two explanations in the prior PR: #199 (comment), #199 (comment). My understanding of your last comment is that you think there is an error in these explanations; I have not as of yet looked into the docs / read the latest edits of the last two comments to understand this. But saying it was approved with unexplained changes is not true given there were two explanations that both seemed sound to me.

I see, I guess I am wrong about the semantics, and I missed Pratyush's point as there was no explaining comment, so much so that you had to ask for further clarification, and at the time a bigger overarching issue in discussion. The explanation in your reviewing comment is sound but easy to miss.

So I would say that efforts at exposition, both in this PR and #204, should be improved by summarising contentious points of the PR. I think it is bad to rely on voluminous earlier discussion, and courteous instead to provide a succinct description.

I also stand behind my complaint that the approving reviewer did not do due process for the PR by asking the same clarifying questions.

@jon-chuang
Contributor

jon-chuang commented Feb 5, 2021

@Pratyush, while these changes only help to reduce a simple impl by one line, firstly, it's potentially confusing (requiring myself and dev to ask clarifying questions), and second, it doesn't appear to actually make any performance difference, hence I am not in favour of them.

If the self term were not to be used further, I believe either rustc or llvm will optimize the copy away in the current impl anyway, so there's no improvement from move v.s. copy semantics.

@Pratyush
Member Author

Pratyush commented Feb 5, 2021

It's just there seems no reason why some people have more authority than others to push to branches, approve reviews, assign reviewers/request reviews, put labels. Makes little sense.

For the authority to push to branches etc, I think that's a standard practice where maintainers have more permissions, no? I can look into the issue about reviews and labels though, I didn't know only certain users can add labels/request reviews.

it's potentially confusing (requiring myself and dev to ask clarifying questions), and second, it doesn't appear to actually make any performance difference, hence I am not in favour of them.

IMO the code is clearer and shorter, and also the compiler has less work to do in optimizing the code; move semantics are a core and inherent part of Rust, so I don't think it's unreasonable to expect future readers to know that. I can add some comments in line to clarify that.

As for this PR vs #204, I wanted to separate out the unrolling changes from the intrinsics changes, just to make the PRs and performance effects smaller. Maybe we can repurpose #204 to only have the intrinsics changes? That would involve rebasing #204 on this. Does that seem reasonable @jon-chuang?

@Pratyush
Member Author

Pratyush commented Feb 5, 2021

Regarding the reduce copies commits, I'll keep them around for now, but it should be easy to drop those commits before merging the PR.

@@ -12,40 +12,71 @@ macro_rules! bigint_impl {
impl BigInteger for $name {
const NUM_LIMBS: usize = $num_limbs;

#[inline]
#[ark_ff_asm::unroll_for_loops]
Member

Can there be a before/after benchmark for field arithmetic on a 256, 384 and 768 bit fields?

Member Author

Adding one now.

Member

@ValarDragon ValarDragon left a comment

This LGTM, holding off on an approval until all changes are benchmarked on the varying field sizes.

I don't share the reservations around sub(mut self, ... and the like. Per godbolt, it seems that it's just a question of what looks more like idiomatic Rust, not a performance question. I don't see how the one-line-shorter variant isn't idiomatic, but I'm not particularly familiar with this.

@Pratyush
Member Author

Pratyush commented Feb 5, 2021

Benchmarks:

BN254
 name                          bn254_fq_master ns/iter  bn254_fq ns/iter  diff ns/iter   diff %  speedup 
 bn254::fq::add_assign         5                        4                           -1  -20.00%   x 1.25 
 bn254::fq::deser              24                       30                           6   25.00%   x 0.80 
 bn254::fq::deser_unchecked    25                       31                           6   24.00%   x 0.81 
 bn254::fq::double             4                        3                           -1  -25.00%   x 1.33 
 bn254::fq::from_repr          24                       24                           0    0.00%   x 1.00 
 bn254::fq::into_repr          4                        4                            0    0.00%   x 1.00 
 bn254::fq::inverse            5,703                    4,945                     -758  -13.29%   x 1.15 
 bn254::fq::mul_assign         20                       21                           1    5.00%   x 0.95 
 bn254::fq::negate             6                        3                           -3  -50.00%   x 2.00 
 bn254::fq::repr_add_nocarry   4                        4                            0    0.00%   x 1.00 
 bn254::fq::repr_div2          3                        2                           -1  -33.33%   x 1.50 
 bn254::fq::repr_mul2          2                        2                            0    0.00%   x 1.00 
 bn254::fq::repr_num_bits      3                        2                           -1  -33.33%   x 1.50 
 bn254::fq::repr_sub_noborrow  5                        3                           -2  -40.00%   x 1.67 
 bn254::fq::ser                19                       18                          -1   -5.26%   x 1.06 
 bn254::fq::ser_unchecked      19                       19                           0    0.00%   x 1.00 
 bn254::fq::sqrt               6,527                    6,539                       12    0.18%   x 1.00 
 bn254::fq::square             20                       20                           0    0.00%   x 1.00 
 bn254::fq::sub_assign         6                        4                           -2  -33.33%   x 1.50 
BLS12-381
 name                                    bls12_381_fq_master ns/iter  bls12_381_fq ns/iter  diff ns/iter   diff %  speedup 
 bls12_381::fq::add_assign               8                            5                               -3  -37.50%   x 1.60 
 bls12_381::fq::deser                    54                           69                              15   27.78%   x 0.78 
 bls12_381::fq::deser_unchecked          54                           69                              15   27.78%   x 0.78 
 bls12_381::fq::double                   6                            5                               -1  -16.67%   x 1.20 
 bls12_381::fq::from_repr                40                           40                               0    0.00%   x 1.00 
 bls12_381::fq::into_repr                30                           30                               0    0.00%   x 1.00 
 bls12_381::fq::inverse                  11,642                       9,313                       -2,329  -20.01%   x 1.25 
 bls12_381::fq::mul_assign               38                           37                              -1   -2.63%   x 1.03 
 bls12_381::fq::negate                   9                            5                               -4  -44.44%   x 1.80 
 bls12_381::fq::repr_add_nocarry         6                            5                               -1  -16.67%   x 1.20 
 bls12_381::fq::repr_div2                2                            2                                0    0.00%   x 1.00 
 bls12_381::fq::repr_mul2                2                            2                                0    0.00%   x 1.00 
 bls12_381::fq::repr_num_bits            3                            2                               -1  -33.33%   x 1.50 
 bls12_381::fq::repr_sub_noborrow        7                            3                               -4  -57.14%   x 2.33 
 bls12_381::fq::ser                      43                           43                               0    0.00%   x 1.00 
 bls12_381::fq::ser_unchecked            44                           43                              -1   -2.27%   x 1.02 
 bls12_381::fq::sqrt                     19,697                       19,106                        -591   -3.00%   x 1.03 
 bls12_381::fq::square                   37                           37                               0    0.00%   x 1.00 
 bls12_381::fq::sub_assign               9                            6                               -3  -33.33%   x 1.50
BW6-761
 name                            bw6_761_fq_master ns/iter  bw6_761_fq ns/iter  diff ns/iter   diff %  speedup
 bw6_761::fq::add_assign         16                         10                            -6  -37.50%   x 1.60 
 bw6_761::fq::deser              213                        212                           -1   -0.47%   x 1.00 
 bw6_761::fq::deser_unchecked    213                        212                           -1   -0.47%   x 1.00 
 bw6_761::fq::double             12                         9                             -3  -25.00%   x 1.33 
 bw6_761::fq::from_repr          182                        184                            2    1.10%   x 0.99 
 bw6_761::fq::into_repr          100                        97                            -3   -3.00%   x 1.03 
 bw6_761::fq::inverse            35,976                     26,473                    -9,503  -26.41%   x 1.36 
 bw6_761::fq::mul_assign         179                        178                           -1   -0.56%   x 1.01 
 bw6_761::fq::negate             22                         10                           -12  -54.55%   x 2.20 
 bw6_761::fq::repr_add_nocarry   11                         8                             -3  -27.27%   x 1.38 
 bw6_761::fq::repr_div2          4                          2                             -2  -50.00%   x 2.00 
 bw6_761::fq::repr_mul2          2                          3                              1   50.00%   x 0.67 
 bw6_761::fq::repr_num_bits      3                          3                              0    0.00%   x 1.00 
 bw6_761::fq::repr_sub_noborrow  13                         3                            -10  -76.92%   x 4.33 
 bw6_761::fq::ser                127                        131                            4    3.15%   x 0.97 
 bw6_761::fq::ser_unchecked      127                        131                            4    3.15%   x 0.97 
 bw6_761::fq::sqrt               172,372                    175,702                    3,330    1.93%   x 0.98 
 bw6_761::fq::square             161                        160                           -1   -0.62%   x 1.01 
 bw6_761::fq::sub_assign         16                         13                            -3  -18.75%   x 1.23 

The only slow down is in deser, and I can't figure out why (I tried various tricks to speed it up). But it's not on the critical path for any code, so I'm okay with merging as is.
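
For readers checking the tables: the `diff %` and `speedup` columns appear to follow from the two `ns/iter` columns in the usual way. A small sketch of that arithmetic, with hypothetical helper names (these are not part of any benchmarking tool's API):

```rust
// Relative change of the new timing versus the baseline, in percent.
// Negative values mean the new code is faster.
fn diff_percent(before_ns: f64, after_ns: f64) -> f64 {
    (after_ns - before_ns) / before_ns * 100.0
}

// Speedup factor: values above 1.0 mean the new code is faster.
fn speedup(before_ns: f64, after_ns: f64) -> f64 {
    before_ns / after_ns
}
```

For example, the `bn254::fq::inverse` row (5,703 to 4,945 ns/iter) gives roughly -13.29% and a factor of about 1.15, matching the table.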

Member

@ValarDragon ValarDragon left a comment

This LGTM. Base field Inverse speeding up seems like a solid reason to assume this is not just noise.

@Pratyush
Member Author

Pratyush commented Feb 5, 2021

ok @jon-chuang it seems the last thing that we need to account for is the speed of doubling; can we just implement it as self + self there (i.e., no need to change mul2)? (also, at least on our benchmark machine, the speed of addition and doubling is the same right now). Basically, the following:

fn double_in_place(&mut self) {
	#[cfg(all(target_arch = "x86_64", feature = "asm"))]
	{ *self += &*self; }
	#[cfg(not(feature = "asm"))]
	{
		//existing code
	}
}

@ValarDragon
Member

@Pratyush
Member Author

Pratyush commented Feb 5, 2021

I added the assert; the primary thing left is to decide whether to switch doubling to use addition as well.

@jon-chuang
Contributor

ok @jon-chuang it seems the last thing that we need to account for is the speed of doubling; can we just implement it as self + self there (i.e., no need to change mul2)? (also, at least on our benchmark machine, the speed of addition and doubling is the same right now). Basically, the following:

fn double_in_place(&mut self) {
	#[cfg(all(target_arch = "x86_64", feature = "asm"))]
	{ *self += &*self; }
	#[cfg(not(feature = "asm"))]
	{
		//existing code
	}
}

Well, I know it's faster when using the intrinsics, potentially it's slower when not.

Btw does that code compile? I had to do some unsafe magic to try to get the mut and non-mut reference to work together?

Well I decided not to trust the Fq benchmarks on this, when I looked at the g1 benchmarks there was about a 3-4% improvement overall.
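
On the question of whether `*self += &*self` compiles: the borrow checker is expected to reject it, since the shared borrow passed as the argument is still alive while `self` is mutably borrowed. For a `Copy` type, one way to avoid `unsafe` is to copy the value first. A minimal sketch, assuming a hypothetical two-limb type `Limbs` (illustrative only; it performs plain integer doubling with no modular reduction, unlike the real field code):

```rust
use std::ops::AddAssign;

// Hypothetical two-limb little-endian integer; not a type from the PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Limbs([u64; 2]);

impl<'a> AddAssign<&'a Limbs> for Limbs {
    fn add_assign(&mut self, other: &'a Limbs) {
        let mut carry = 0u64;
        for i in 0..2 {
            // Widen to u128 so the limb sum plus carry cannot overflow.
            let t = self.0[i] as u128 + other.0[i] as u128 + carry as u128;
            self.0[i] = t as u64;
            carry = (t >> 64) as u64;
        }
        // A carry out of the top limb is discarded in this sketch.
    }
}

impl Limbs {
    fn double_in_place(&mut self) {
        // Copy first so the shared borrow refers to a separate value
        // rather than aliasing the mutable borrow of `self`.
        let tmp = *self;
        *self += &tmp;
    }
}
```

Whether the extra copy costs anything after optimization would need to be checked in the generated assembly or in benchmarks; this sketch only addresses the borrow-checking question.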

@@ -12,40 +12,71 @@ macro_rules! bigint_impl {
impl BigInteger for $name {
const NUM_LIMBS: usize = $num_limbs;

#[inline]
Contributor

Why are we removing inline? I think this should be given an inline hint no?

*a = sbb!(*a, *b, &mut borrow);
for i in 0..$num_limbs {
#[cfg(all(target_arch = "x86_64", feature = "asm"))]
#[cfg_attr(all(target_arch = "x86_64", feature = "asm"), allow(unsafe_code))]
Contributor

I think this is unnecessarily verbose since it only compiles if the cfg is true. So previous version is fine.
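
For readers without the diff at hand, the loop under review propagates a borrow across limbs through `sbb!`. A minimal portable sketch of that pattern, assuming four limbs and a plain function `sbb` in place of the macro (both are illustrative and are not the crate's actual definitions, which may use intrinsics under the `asm` feature):

```rust
// Subtract with borrow: returns a - b - borrow (mod 2^64) and stores the
// outgoing borrow (0 or 1) back into `borrow`.
fn sbb(a: u64, b: u64, borrow: &mut u64) -> u64 {
    let (d1, b1) = a.overflowing_sub(b);
    let (d2, b2) = d1.overflowing_sub(*borrow);
    // At most one of the two subtractions can underflow.
    *borrow = (b1 | b2) as u64;
    d2
}

// Limb-wise subtraction over little-endian limbs. Returns true if the
// subtraction underflowed, that is, if `b` was larger than `a`.
fn sub_noborrow(a: &mut [u64; 4], b: &[u64; 4]) -> bool {
    let mut borrow = 0u64;
    // The loop bound is a compile-time constant, which is what lets an
    // unrolling attribute or the optimizer expand it into straight-line code.
    for i in 0..4 {
        a[i] = sbb(a[i], b[i], &mut borrow);
    }
    borrow != 0
}
```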

@jon-chuang
Contributor

I've pushed some changes, which help the PR reach the previous baseline perf. and address some of the review comments I made.

@jon-chuang
Contributor

jon-chuang commented Feb 6, 2021

Here are the final results from before and after this PR:
[Screenshot from 2021-02-06 10-50-09: benchmark results, image not included]

We achieve a 17-32% speedup across the board.

This is certainly more than I expected and I'm quite happy with these results.

@Pratyush
Member Author

Pratyush commented Feb 6, 2021

Ok I think this is ready to merge!

@jon-chuang jon-chuang merged commit 87e25cb into master Feb 6, 2021
@jon-chuang jon-chuang deleted the faster-arithmetic-2 branch February 6, 2021 03:26
4 participants