feat: `multiexp_serial` skips doubling when all bits are zero #202

jonathanpwang · 2023-09-01T21:55:24Z

For each c bit "segment", we get the bits from each coefficient in coeffs ahead of time and determine what the largest segment is. We use this to determine the largest nonzero segment and skip higher segments.

Before this, we always looked at all 256 bits and did a doubling regardless. Now, if all the coefficients in the MSM are 64 bits, we will detect this first and only do pippenger up to 64 bits. This only saves doublings because the pippenger buckets for addition were already dynamically created.

I was previously unsure how much this saved because I benchmarked on zkevm-keccak and there was almost no difference because almost all scalars used are 254 bits due to the way words are packed (bits are packed in base 256). However we found that in https://github.com/shuklaayush/halo2-keccak256, where most numbers are 64 bits, there was a 40% speedup in proving time! So I figured that justified this PR.

Also not sure if current choice of c using natural logarithm is optimal? Why not log_2? Edit: empirically on my laptop ln is indeed faster.

* feat: `multiexp_serial` skips doubling when all bits are zero For each `c` bit "segment", we get the bits from each coefficient in `coeffs` ahead of time and determine what the largest segment is. We use this to determine the largest nonzero segment and skip higher segments. Before this, we always looked at all 256 bits and did a doubling regardless. Now, if all the coefficients in the MSM are 64 bits, we will detect this first and only do pippenger up to 64 bits. This only saves doublings because the pippenger buckets for addition were already dynamically created. * feat: handle case where all bits are 0

The summation by parts at the end of each pippenger bucket is doing an addition with identity unnecessarily for unused bits

CPerezz

This sounds really cool!! I like it! Thanks for upstreaming it!

Is there any source/benchmark from where we can see not only the performance increase in 0-bit buckets but also how the performance is when we are in non highly 0bit populated circuits??

mratsim · 2023-09-06T09:38:02Z

halo2_proofs/src/arithmetic.rs

+        .into_iter()
+        .take(max_nonzero_segment.unwrap() + 1)
+        .rev()
+    {
        for _ in 0..c {
            *acc = acc.double();


I think it would be faster and easier to maintain to add a boolean guard if is_init here
that defaults to false and is set to true on first initialization of the accumulator.

It would avoid an extra pass over the data.

Ah yes, nice suggestion.

But now I am tempted to just do a full scan ahead of time to do this bit column optimization...

mratsim · 2023-09-06T09:41:33Z

halo2_proofs/src/arithmetic.rs

@@ -108,7 +139,7 @@ fn multiexp_serial<C: CurveAffine>(coeffs: &[C::Scalar], bases: &[C], acc: &mut
        //                    (a) + b +
        //                    ((a) + b) + c
        let mut running_sum = C::Curve::identity();
-        for exp in buckets.into_iter().rev() {
+        for exp in buckets.into_iter().take(max_bits).rev() {
            running_sum = exp.add(running_sum);
            *acc = *acc + &running_sum;


We'll need a local accumulator initialized with the first non-zero running sum because we have no info on the global accumulator here.

mratsim · 2023-09-06T10:06:36Z

Also not sure if current choice of c using natural logarithm is optimal? Why not log_2? Edit: empirically on my laptop ln is indeed faster.

I've looked into many implementations, the first one to do extensive tests was libff:

https://github.com/scipr-lab/libff/blame/674e437/libff/algebra/scalar_multiplication/multiexp.tcc#L174-L176

  // empirically, this seems to be a decent estimate of the optimal value of c
  size_t log2_length = log2(length);
  size_t c = log2_length - (log2_length / 3 - 2);

then arkworks:

Barretenberg uses empiric formulas:

https://github.com/AztecProtocol/barretenberg/blob/4bb7559/cpp/src/barretenberg/ecc/scalar_multiplication/runtime_states.hpp#L8-L58

Dalek has precise cost, but doesn't use it:

https://github.com/dalek-cryptography/curve25519-dalek/blob/e94a5fe/curve25519-dalek/src/backend/serial/scalar_mul/pippenger.rs#L46-L87

BLST uses empiric formula that takes into account memory bandwidth limitations:

https://github.com/supranational/blst/blob/9c87d4a/src/multi_scalar.c#L264-L271

Gnark uses precise cost formulas

https://github.com/Consensys/gnark-crypto/blob/3cb0eb3/ecc/bn254/multiexp.go#L384-L404

I've tested the various outputs of libraries in https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=f7c1706993e534d19ebf444ee2e807ca

In Constantine I use a precise cost formula with penalties when hitting memory bandwidth bottlenecks and tuned them based on empirical benchmarks:

https://github.com/mratsim/constantine/blob/4e0ca43/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim#L164-L213

Brechtpd · 2023-09-06T16:33:01Z

Looks like https://github.com/shuklaayush/halo2-keccak256 uses a very similar approach to the "bit based" keccak implementation: https://github.com/privacy-scaling-explorations/zkevm-circuits/blob/86f2c5e16d1d29acd562b1119fca81cc940ec7e2/zkevm-circuits/src/keccak_circuit/keccak_bit.rs, where all operations are done on bits and so can be done using only custom gates.

If so, it may be possible to optimize even further because the MSM on binary values is just conditional additions, so I think probably even faster to not even use pippenger. I never tested it so don't know how big of a difference it would make, and you do have to know that the columns are just binary to be able to do it though, so less of a general solution than the one in this PR. But if you're trying to optimize as much as possible, it could be worth it.

The bit based implementation one was removed from the PSE branch because it uses a lot more cells because each bit requires its own cell, compared to the lookup approach where more data is packed in a single cell. The bit based approach on its own was ~2x faster though, but at the cost of higher memory requirements and slower recursion, so was not ideal because of the super circuit approach for the zkevm-cirucits. At the time it also looked like lookups were getting much faster, so that version was kept. It looks like that will eventually be true, though we're not there yet of course in practice.

Some more info comparing the approaches: Some more info comparing the two approaches can be found here: https://hackmd.io/NaTuIvmaQCybaOYgd-DG1Q,

jonathanpwang · 2023-09-07T22:41:55Z

Benchmarks (on my M2Max mac): in https://github.com/shuklaayush/halo2-keccak256, run

cargo t -r --no-default-features --features halo2-pse -- --nocapture bench_keccak

This benchmark does 85 keccak_f permutations in with k = ceil(log2(85 * 25)) = 12. (Hmm just realized 85*25 is just above 2^11 so this is quite wasteful.)
Proving time: 18.802s

Now I patch halo2_proofs_pse with this fork and re-run the command.
Proving time: 14.948s

Also FYI if I run it with --features halo2-axiom:
Proving time: 8.686s
This is because we use @Brechtpd 's FFT.

@Brechtpd Yup I saw your comments about the keccak_bits, and had wanted to run it directly but didn't have time to try to update the versioning with zkevm and ff 0.13 etc, so the above repo was an easier approximation.

Just to say, this PR is just trying to pick low hanging fruit. It is by no means an endgame for Pippenger's optimization.

It seems there is a lot more surface area if we do a scan of all columns to determine bit counts per column. (Or user needs to specify bit counts per column, but that's a bigger change.) Just for bit columns, indeed just adding is faster. I guess that could be special cased by scanning the column first?

More generally, @mratsim I wonder if you know the best parallel MSM for multiple bit columns (or byte columns). We MSM per column, but the bases are the same (it's ParamsKZG) for each column. When the scalars are 254 bits, I did not think that there is any optimization from parallel pippenger opposed to the current implementation, since the buckets are dynamically computed. However for example if you have multiple columns where scalars are all bits, surely there is a Pippenger's algorithm that is better than just adding each column the dumb way? On the other hand, having the backend auto handle these optimizations seems a little annoying... But maybe we really should just take the entire arithmetization, Montgomery reduce and get bit counts of all columns, and then do very specific Pippenger optimizations to reap all the benefits.

jonathanpwang · 2023-09-07T22:47:33Z

Looks like https://github.com/shuklaayush/halo2-keccak256 uses a very similar approach to the "bit based" keccak implementation: https://github.com/privacy-scaling-explorations/zkevm-circuits/blob/86f2c5e16d1d29acd562b1119fca81cc940ec7e2/zkevm-circuits/src/keccak_circuit/keccak_bit.rs, where all operations are done on bits and so can be done using only custom gates.

If so, it may be possible to optimize even further because the MSM on binary values is just conditional additions, so I think probably even faster to not even use pippenger. I never tested it so don't know how big of a difference it would make, and you do have to know that the columns are just binary to be able to do it though, so less of a general solution than the one in this PR. But if you're trying to optimize as much as possible, it could be worth it.

The bit based implementation one was removed from the PSE branch because it uses a lot more cells because each bit requires its own cell, compared to the lookup approach where more data is packed in a single cell. The bit based approach on its own was ~2x faster though, but at the cost of higher memory requirements and slower recursion, so was not ideal because of the super circuit approach for the zkevm-cirucits. At the time it also looked like lookups were getting much faster, so that version was kept. It looks like that will eventually be true, though we're not there yet of course in practice.

Some more info comparing the approaches: Some more info comparing the two approaches can be found here: https://hackmd.io/NaTuIvmaQCybaOYgd-DG1Q,

Also yes, the above circuit uses ~2400-2500 columns, so the recursion cost is quite high.

davidnevadoc · 2024-02-23T08:57:10Z

After merging #282, is this still relevant? @CPerezz

CPerezz · 2024-02-23T08:58:50Z

No. I think this is already fixed within halo2curves.

Can you confirm @kilic ? If so, then we can close this.

mratsim · 2024-03-04T16:26:48Z

Do you mean in privacy-scaling-explorations/halo2curves#130 ?

I don't remember seeing the doubling skipping in that PR.

ed255 · 2024-04-16T16:02:44Z

@jonathanpwang In this PR #282 we removed the msm implementation from halo2 in favor of using the implementation from halo2curves (we just had the same code repeated in both places). Could you port this PR to halo2curves, and possibly extend it to the cycloneMSM that we recently merged?

jonathanpwang · 2024-04-16T16:19:15Z

Sounds good, nice to hear cyclone was merged. I will need to study it a little.

I have been interested in implementations of MSM that reduce peak memory usage - do you know if cyclone is any better than the original for that? We were also using par_iter across columns for the MSM, which might have made the memory usage higher.

ed255 · 2024-04-19T09:32:06Z

I have been interested in implementations of MSM that reduce peak memory usage - do you know if cyclone is any better than the original for that? We were also using par_iter across columns for the MSM, which might have made the memory usage higher.

I don't know about the memory usage of cyclone VS the previous MSM implementation; the only benchmarks that Onur shared where showing time; but it would be great to include memory benchmarks in future MSM iterations.

BTW, for memory usage I think there's a low hanging fruit improvement which is avoiding this step https://github.com/privacy-scaling-explorations/halo2curves/blob/8af4f1ebab640405c799e65d9873847a4acf04f8/src/msm.rs#L290 which clones the coefficients, and instead directly iterate over the coeffs: &[C::Scalar].

jonathanpwang changed the title ~~feat: multiexp_serial skips doubling when all bits are zero (#10)~~ feat: multiexp_serial skips doubling when all bits are zero Sep 1, 2023

jonathanpwang added 2 commits September 1, 2023 15:39

feat(msm): start running sum from max_bits

c085798

The summation by parts at the end of each pippenger bucket is doing an addition with identity unnecessarily for unused bits

chore: add comment

d3e3ba3

CPerezz reviewed Sep 4, 2023

View reviewed changes

mratsim reviewed Sep 6, 2023

View reviewed changes

einar-taiko mentioned this pull request Sep 8, 2023

Insert MSM and FFT code and their benchmarks. privacy-scaling-explorations/halo2curves#86

Merged

mratsim mentioned this pull request Sep 28, 2023

[RFC] Blackboxing MSM and FFT - Hardware Accel API #216

Closed

mratsim mentioned this pull request Nov 29, 2023

Booth encoding privacy-scaling-explorations/halo2curves#106

Merged

jonathanpwang closed this Apr 16, 2024

davidnevadoc mentioned this pull request Apr 16, 2024

Continue multiexp_serial skips doubling when all bits are zero. privacy-scaling-explorations/halo2curves#150

Closed

ed255 mentioned this pull request Apr 19, 2024

Proposal: Specialized MSM and Column vectors for small values #315

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: `multiexp_serial` skips doubling when all bits are zero #202

feat: `multiexp_serial` skips doubling when all bits are zero #202

jonathanpwang commented Sep 1, 2023 •

edited

Loading

CPerezz left a comment

mratsim Sep 6, 2023

jonathanpwang Sep 7, 2023

mratsim Sep 6, 2023

mratsim commented Sep 6, 2023 •

edited

Loading

Brechtpd commented Sep 6, 2023

jonathanpwang commented Sep 7, 2023 •

edited

Loading

jonathanpwang commented Sep 7, 2023

davidnevadoc commented Feb 23, 2024

CPerezz commented Feb 23, 2024

mratsim commented Mar 4, 2024

ed255 commented Apr 16, 2024

jonathanpwang commented Apr 16, 2024

ed255 commented Apr 19, 2024

feat: multiexp_serial skips doubling when all bits are zero #202

feat: multiexp_serial skips doubling when all bits are zero #202

Conversation

jonathanpwang commented Sep 1, 2023 • edited Loading

CPerezz left a comment

Choose a reason for hiding this comment

mratsim Sep 6, 2023

Choose a reason for hiding this comment

jonathanpwang Sep 7, 2023

Choose a reason for hiding this comment

mratsim Sep 6, 2023

Choose a reason for hiding this comment

mratsim commented Sep 6, 2023 • edited Loading

Brechtpd commented Sep 6, 2023

jonathanpwang commented Sep 7, 2023 • edited Loading

jonathanpwang commented Sep 7, 2023

davidnevadoc commented Feb 23, 2024

CPerezz commented Feb 23, 2024

mratsim commented Mar 4, 2024

ed255 commented Apr 16, 2024

jonathanpwang commented Apr 16, 2024

ed255 commented Apr 19, 2024

feat: `multiexp_serial` skips doubling when all bits are zero #202

feat: `multiexp_serial` skips doubling when all bits are zero #202

jonathanpwang commented Sep 1, 2023 •

edited

Loading

mratsim commented Sep 6, 2023 •

edited

Loading

jonathanpwang commented Sep 7, 2023 •

edited

Loading