Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: multiexp_serial skips doubling when all bits are zero #202

Conversation

jonathanpwang
Copy link

@jonathanpwang jonathanpwang commented Sep 1, 2023

For each c bit "segment", we get the bits from each coefficient in coeffs ahead of time and determine what the largest segment is. We use this to determine the largest nonzero segment and skip higher segments.

Before this, we always looked at all 256 bits and did a doubling regardless. Now, if all the coefficients in the MSM are 64 bits, we will detect this first and only do pippenger up to 64 bits. This only saves doublings because the pippenger buckets for addition were already dynamically created.

I was previously unsure how much this saved because I benchmarked on zkevm-keccak and there was almost no difference because almost all scalars used are 254 bits due to the way words are packed (bits are packed in base 256). However we found that in https://github.com/shuklaayush/halo2-keccak256, where most numbers are 64 bits, there was a 40% speedup in proving time! So I figured that justified this PR.

Also not sure if current choice of c using natural logarithm is optimal? Why not log_2? Edit: empirically on my laptop ln is indeed faster.

* feat: `multiexp_serial` skips doubling when all bits are zero

For each `c` bit "segment", we get the bits from each coefficient in
`coeffs` ahead of time and determine what the largest segment is. We use
this to determine the largest nonzero segment and skip higher segments.

Before this, we always looked at all 256 bits and did a doubling
regardless. Now, if all the coefficients in the MSM are 64 bits, we will
detect this first and only do pippenger up to 64 bits. This only saves
doublings because the pippenger buckets for addition were already
dynamically created.

* feat: handle case where all bits are 0
@jonathanpwang jonathanpwang changed the title feat: multiexp_serial skips doubling when all bits are zero (#10) feat: multiexp_serial skips doubling when all bits are zero Sep 1, 2023
The summation by parts at the end of each pippenger bucket is doing an
addition with identity unnecessarily for unused bits
Copy link
Member

@CPerezz CPerezz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds really cool!! I like it! Thanks for upstreaming it!

Is there any source/benchmark from where we can see not only the performance increase in 0-bit buckets but also how the performance is when we are in non highly 0bit populated circuits??

.into_iter()
.take(max_nonzero_segment.unwrap() + 1)
.rev()
{
for _ in 0..c {
*acc = acc.double();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be faster and easier to maintain to add a boolean guard if is_init here
that defaults to false and is set to true on first initialization of the accumulator.

It would avoid an extra pass over the data.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, nice suggestion.

But now I am tempted to just do a full scan ahead of time to do this bit column optimization...

@@ -108,7 +139,7 @@ fn multiexp_serial<C: CurveAffine>(coeffs: &[C::Scalar], bases: &[C], acc: &mut
// (a) + b +
// ((a) + b) + c
let mut running_sum = C::Curve::identity();
for exp in buckets.into_iter().rev() {
for exp in buckets.into_iter().take(max_bits).rev() {
running_sum = exp.add(running_sum);
*acc = *acc + &running_sum;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need a local accumulator initialized with the first non-zero running sum because we have no info on the global accumulator here.

@mratsim
Copy link

mratsim commented Sep 6, 2023

Also not sure if current choice of c using natural logarithm is optimal? Why not log_2? Edit: empirically on my laptop ln is indeed faster.

I've looked into many implementations, the first one to do extensive tests was libff:

then arkworks:

Barretenberg uses empiric formulas:

Dalek has precise cost, but doesn't use it:

BLST uses empiric formula that takes into account memory bandwidth limitations:

Gnark uses precise cost formulas

I've tested the various outputs of libraries in https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=f7c1706993e534d19ebf444ee2e807ca

In Constantine I use a precise cost formula with penalties when hitting memory bandwidth bottlenecks and tuned them based on empirical benchmarks:

@Brechtpd
Copy link

Brechtpd commented Sep 6, 2023

Looks like https://github.com/shuklaayush/halo2-keccak256 uses a very similar approach to the "bit based" keccak implementation: https://github.com/privacy-scaling-explorations/zkevm-circuits/blob/86f2c5e16d1d29acd562b1119fca81cc940ec7e2/zkevm-circuits/src/keccak_circuit/keccak_bit.rs, where all operations are done on bits and so can be done using only custom gates.

If so, it may be possible to optimize even further because the MSM on binary values is just conditional additions, so I think probably even faster to not even use pippenger. I never tested it so don't know how big of a difference it would make, and you do have to know that the columns are just binary to be able to do it though, so less of a general solution than the one in this PR. But if you're trying to optimize as much as possible, it could be worth it.

The bit based implementation one was removed from the PSE branch because it uses a lot more cells because each bit requires its own cell, compared to the lookup approach where more data is packed in a single cell. The bit based approach on its own was ~2x faster though, but at the cost of higher memory requirements and slower recursion, so was not ideal because of the super circuit approach for the zkevm-cirucits. At the time it also looked like lookups were getting much faster, so that version was kept. It looks like that will eventually be true, though we're not there yet of course in practice.

Some more info comparing the approaches: Some more info comparing the two approaches can be found here: https://hackmd.io/NaTuIvmaQCybaOYgd-DG1Q,

@jonathanpwang
Copy link
Author

jonathanpwang commented Sep 7, 2023

Benchmarks (on my M2Max mac): in https://github.com/shuklaayush/halo2-keccak256, run

cargo t -r --no-default-features --features halo2-pse -- --nocapture bench_keccak

This benchmark does 85 keccak_f permutations in with k = ceil(log2(85 * 25)) = 12. (Hmm just realized 85*25 is just above 2^11 so this is quite wasteful.)
Proving time: 18.802s

Now I patch halo2_proofs_pse with this fork and re-run the command.
Proving time: 14.948s

Also FYI if I run it with --features halo2-axiom:
Proving time: 8.686s
This is because we use @Brechtpd 's FFT.

@Brechtpd Yup I saw your comments about the keccak_bits, and had wanted to run it directly but didn't have time to try to update the versioning with zkevm and ff 0.13 etc, so the above repo was an easier approximation.

Just to say, this PR is just trying to pick low hanging fruit. It is by no means an endgame for Pippenger's optimization.

It seems there is a lot more surface area if we do a scan of all columns to determine bit counts per column. (Or user needs to specify bit counts per column, but that's a bigger change.) Just for bit columns, indeed just adding is faster. I guess that could be special cased by scanning the column first?

More generally, @mratsim I wonder if you know the best parallel MSM for multiple bit columns (or byte columns). We MSM per column, but the bases are the same (it's ParamsKZG) for each column. When the scalars are 254 bits, I did not think that there is any optimization from parallel pippenger opposed to the current implementation, since the buckets are dynamically computed. However for example if you have multiple columns where scalars are all bits, surely there is a Pippenger's algorithm that is better than just adding each column the dumb way? On the other hand, having the backend auto handle these optimizations seems a little annoying... But maybe we really should just take the entire arithmetization, Montgomery reduce and get bit counts of all columns, and then do very specific Pippenger optimizations to reap all the benefits.

@jonathanpwang
Copy link
Author

Looks like https://github.com/shuklaayush/halo2-keccak256 uses a very similar approach to the "bit based" keccak implementation: https://github.com/privacy-scaling-explorations/zkevm-circuits/blob/86f2c5e16d1d29acd562b1119fca81cc940ec7e2/zkevm-circuits/src/keccak_circuit/keccak_bit.rs, where all operations are done on bits and so can be done using only custom gates.

If so, it may be possible to optimize even further because the MSM on binary values is just conditional additions, so I think probably even faster to not even use pippenger. I never tested it so don't know how big of a difference it would make, and you do have to know that the columns are just binary to be able to do it though, so less of a general solution than the one in this PR. But if you're trying to optimize as much as possible, it could be worth it.

The bit based implementation one was removed from the PSE branch because it uses a lot more cells because each bit requires its own cell, compared to the lookup approach where more data is packed in a single cell. The bit based approach on its own was ~2x faster though, but at the cost of higher memory requirements and slower recursion, so was not ideal because of the super circuit approach for the zkevm-cirucits. At the time it also looked like lookups were getting much faster, so that version was kept. It looks like that will eventually be true, though we're not there yet of course in practice.

Some more info comparing the approaches: Some more info comparing the two approaches can be found here: https://hackmd.io/NaTuIvmaQCybaOYgd-DG1Q,

Also yes, the above circuit uses ~2400-2500 columns, so the recursion cost is quite high.

@davidnevadoc
Copy link

After merging #282, is this still relevant? @CPerezz

@CPerezz
Copy link
Member

CPerezz commented Feb 23, 2024

No. I think this is already fixed within halo2curves.

Can you confirm @kilic ? If so, then we can close this.

@mratsim
Copy link

mratsim commented Mar 4, 2024

Do you mean in privacy-scaling-explorations/halo2curves#130 ?

I don't remember seeing the doubling skipping in that PR.

@ed255
Copy link
Member

ed255 commented Apr 16, 2024

@jonathanpwang In this PR #282 we removed the msm implementation from halo2 in favor of using the implementation from halo2curves (we just had the same code repeated in both places). Could you port this PR to halo2curves, and possibly extend it to the cycloneMSM that we recently merged?

@jonathanpwang
Copy link
Author

Sounds good, nice to hear cyclone was merged. I will need to study it a little.

I have been interested in implementations of MSM that reduce peak memory usage - do you know if cyclone is any better than the original for that? We were also using par_iter across columns for the MSM, which might have made the memory usage higher.

@ed255
Copy link
Member

ed255 commented Apr 19, 2024

I have been interested in implementations of MSM that reduce peak memory usage - do you know if cyclone is any better than the original for that? We were also using par_iter across columns for the MSM, which might have made the memory usage higher.

I don't know about the memory usage of cyclone VS the previous MSM implementation; the only benchmarks that Onur shared where showing time; but it would be great to include memory benchmarks in future MSM iterations.

BTW, for memory usage I think there's a low hanging fruit improvement which is avoiding this step https://github.com/privacy-scaling-explorations/halo2curves/blob/8af4f1ebab640405c799e65d9873847a4acf04f8/src/msm.rs#L290 which clones the coefficients, and instead directly iterate over the coeffs: &[C::Scalar].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants