Instruction Maybe-Wishlist for Recursion and Consensus #259

aszepieniec opened this issue Mar 26, 2024 · 13 comments

Comments

@aszepieniec (Collaborator) commented Mar 26, 2024

This is a tracking issue. It collects imagined instructions that could make recursion (or consensus programs) faster.

read_mem_forward

  • BEFORE: _ *ptr
  • AFTER: _ (*ptr+n) [element]
  • With this instruction, the 7 instructions inside the inner loop of (a static variant of) HashVarlen become 3.
  • Deprecated:
    • hash_var_len is only performance critical when the number of to-be-hashed elements is known a priori, making other approaches feasible
    • dot_step (see below) eliminates the next most important use for read_mem_forward

dot_step

  • BEFORE: _ acc2 acc1 acc0 *lhs *rhs
  • AFTER: _ acc2' acc1' acc0' (*lhs+3) (*rhs+3)
  • With this instruction, the 8 instructions inside the inner loop of InnerProductOfThreeRowsWithWeights become 1. Stands to reduce ~1M cycles to ~65000 (see the loop sketch below).
  • Dot-Step #272
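
For flavor, a hedged sketch of the resulting inner-product loop, using the xx_dot_step spelling this instruction acquires further down the thread, combined with the recurse_or_return proposed below; it assumes the loop's terminal value sits at the stack positions that instruction compares:

inner_product_loop:
    xx_dot_step          // _ acc2' acc1' acc0' (*lhs+3) (*rhs+3)
    recurse_or_return    // loop until a pointer reaches its terminal value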

merkle_step

  • BEFORE: _ merkle_node_idx [Digest; 5]
  • AFTER: _ (merkle_node_idx // 2) [Digest'; 5]
  • Moves up one layer in a Merkle tree. Combines divine_sibling and hash into one instruction.
  • Replaces instruction divine_sibling. Instruction hash remains available as-is.
  • Does not change the height of the op stack, in contrast to divine_sibling and hash, which change the height of the stack by 5 elements each.
  • Done: 01c04e5

get_colinear_y

  • BEFORE: _ ax [ay] bx [by] [cx] (possibly different order)
  • AFTER: _ [cy]
  • Where $c_y = \frac{a_y-b_y}{a_x-b_x} \cdot (c_x - a_x) + a_y$
  • This instruction can reduce the inner loop of compute_c_values from 74 instructions to 49. Total cycle count reduction: ~26000.
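
For reference, this is just the unique line through $(a_x, a_y)$ and $(b_x, b_y)$ evaluated at $x = c_x$: slope times offset, plus the $y$-value at the anchor point $a$.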

recurse_or_return

  • BEFORE: _ a b
  • AFTER: _ a b
  • Acts like recurse if $a \neq b$; otherwise, acts like return.
  • Replaces checks of the form dup <m> dup <n> eq skiz return <loop_body> recurse, reducing the op stack delta of loop maintenance (see the sketch below).
  • feat: Introduce instruction recurse_or_return #288
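
For illustration, a minimal before/after sketch, assuming the loop index and terminal value sit at the stack positions the instruction compares:

// current pattern:
loop_label:
    dup <m>
    dup <n>
    eq
    skiz return
    <loop body>
    recurse

// with recurse_or_return:
loop_label:
    <loop body>
    recurse_or_return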

absorb_from_mem

  • BEFORE: _ mem_ptr [garbage; 3]
  • AFTER: _ (mem_ptr - 10) [garbage; 3]
  • Reads 10 elements from RAM, adds them to the Sponge state, and applies the Tip5 permutation to the Sponge state.
  • Can be seen as a replacement for read_mem 5 read_mem 5 sponge_absorb, albeit not a drop-in replacement due to the [garbage; 3], which is needed for arithmetization reasons (see the sketch below).
  • Reduces the op stack delta from 10 to 0 compared to the instruction sequence it is intended to replace.
  • Done: 6dd9b54
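
For illustration, the sequence it is intended to replace versus the new instruction; stack comments follow the BEFORE/AFTER above, and pointer handling in the replaced sequence is glossed over:

// replaced sequence (op stack delta 10):
read_mem 5
read_mem 5
sponge_absorb

// replacement (op stack delta 0):
// _ mem_ptr [garbage; 3]
absorb_from_mem
// _ (mem_ptr - 10) [garbage; 3]
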
@Sword-Smith (Collaborator) commented Mar 28, 2024

It turns out that the op stack table height is the bottleneck in the verifier, not the clock cycle count (which is recorded by the processor table). So counting the number of clock cycles that a new instruction would shave off might not be the relevant metric.

I was considering a recurse instruction that acts like the current recurse if the top two stack elements are different and like return if they are the same. That would probably shave off a good number of both op stack table rows and clock cycles, since the start of almost all our loops looks like this:

loop_label:
    dup <m>
    dup <n>
    eq
    skiz return
    
    <loop body>
    recurse

This new recurse instruction would save 4 op stack table rows and 4 clock cycles for each loop iteration.

@Sword-Smith (Collaborator) commented May 2, 2024

I guess my view has changed a bit.

The variant of recurse_or_return that I think would work best is one that returns if ST[5] == 1 and recurses otherwise. That could be used in combination with merkle_step to reduce the loop body to 2 instructions.

A loop that walks up a Merkle tree would then be:

auth_path_loop:
    merkle_step
    recurse_or_return

I understand that this creates a problem for Merkle trees of height 0, with only one node. But I think that's not a practical problem.

edit: but let's see where we stand after having added the dot_steps and the merkle_step.

@aszepieniec (Collaborator, Author)

I am observing that the following pattern occurs quite often:

                push {DIGEST_LENGTH - 1}
                add
                read_mem {DIGEST_LENGTH}
                pop 1

Maybe this can be shrunk to one instruction, load_digest?
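
A sketch of the contraction; the semantics of the hypothetical load_digest are my guess at the natural reading:

// current (4 instructions):    _ *digest
push {DIGEST_LENGTH - 1}
add
read_mem {DIGEST_LENGTH}
pop 1
// _ [digest; 5]

// proposed (1 instruction):
load_digest    // _ *digest --> _ [digest; 5]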

@aszepieniec (Collaborator, Author)

Also the following:

                dup 8
                push 1 add

How about a dupai, short for "duplicate and add immediate argument". I realize that now there are two immediate arguments, but it is acceptable for both of them to be constrained to a limited range, for instance $[0:16)$.
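
A sketch of the contraction, with dupai hypothetical and both immediate arguments range-constrained as described:

// current (3 instructions):
dup 8
push 1
add

// proposed (1 instruction):
dupai 8 1    // duplicate ST8 and add the immediate 1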

@aszepieniec (Collaborator, Author)

Another frequent pattern, this one at the end of a loop iteration, doing index bookkeeping:

                swap 1 read_mem 1 push 2 add add swap 1
                // _ *witness *all_aocl_indices *removal_records[i]_si num_utxos i *utxos[i]_si *msmp[i+1] *aocl
                swap 2 read_mem 1 push 2 add add swap 2
                // _ *witness *all_aocl_indices *removal_records[i]_si num_utxos i *utxos[i+1]_si *msmp[i+1] *aocl
                swap 5 read_mem 1 push 2 add add swap 5
                // _ *witness *all_aocl_indices *removal_records[i+1]_si num_utxos i *utxos[i+1]_si *msmp[i+1] *aocl

The general pattern is swap X read_mem 1 push 2 add add swap X, where X indicates a stack position. It could be reduced to iter_next X.
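
A sketch of the contraction, with iter_next hypothetical; the pattern advances a pointer to a size-prefixed item (held at stack position X) past that item's size indicator and payload:

// current (6 instructions, here X = 5):
swap 5 read_mem 1 push 2 add add swap 5

// proposed (1 instruction):
iter_next 5    // *item_si --> *item_si + size + 1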

@aszepieniec (Collaborator, Author)

And of course:

                swap 3 push 1 add swap 3
                // _ *witness *all_aocl_indices *removal_records[i+1]_si num_utxos (i+1) *utxos[i+1]_si *msmp[i+1] *aocl

could be reduced with inc X.
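
Analogously, with inc hypothetical:

// current (4 instructions):
swap 3 push 1 add swap 3

// proposed (1 instruction):
inc 3    // increment the element at stack position 3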

@aszepieniec (Collaborator, Author)

Also, at the start of a loop to test the termination condition:

dup 2 dup 2 eq

I realize this is addressed by recurse_or_return, but that instruction constrains where the loop index and terminal value live on the stack and is thus less convenient for the programmer. That said, maybe convenience should take a hike in favor of performance.

@Sword-Smith (Collaborator)

Also, at the start of a loop to test the termination condition: dup 2 dup 2 eq ...

There is a cost in proving time for each new instruction that you add or make more general. The instruction recurse_or_return is only really relevant when the loop body is very short, since that is where you save a big relative number of table rows by cutting down on the cost of checking the loop end-condition, which is typically four clock cycles: dup n push m eq skiz (plus return on exit).

In general, I would say that if the loop body is more than 40 clock cycles, then using an efficient check for the loop end-condition, like recurse_or_return, is irrelevant.

@aszepieniec (Collaborator, Author)

This suggestion anticipates the miner's task of updating the mutator set's SWBFI MMR accumulator. He must use the same authentication path twice: once to prove that the old leaf was correct, and once to prove that the new root is correct. It would be a shame to read this authentication path from memory twice, especially since the authenticity of the paths provided in the transaction can be established efficiently by the transaction initiator.

To this end I propose dual_merkle_step, which like merkle_step divines in the digest and modifies the index, but which additionally keeps track of a second running digest on the stack.

(I'm sure one of you proposed this instruction before I did. I'm just writing it here for the record.)
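
A hedged sketch of the resulting update loop, with dual_merkle_step and the stack layout both hypothetical; the idea is that each divined sibling updates both running digests:

// _ node_idx [old_digest; 5] [new_digest; 5]    (hypothetical layout)
swbfi_update_loop:
    dual_merkle_step
    recurse_or_return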

@jan-ferdinand (Member)

I did some analysis on which instruction patterns appear the most often when executing the recursive verifier. Unsurprisingly, the sequences xb_dot_step …, xx_dot_step …, merkle_step recurse_or_return …, and sponge_absorb_mem … are highest on the list. I don't immediately see how we can shrink the number of executed instructions for these patterns. Given that all the instructions in these patterns were already introduced in order to speed up the recursive verifier, I doubt there is even medium-hanging fruit here.

The patterns occurring the next most often are: [1]

num    sequence len   sequence
8082   2              read_mem_3 pop_1
5459   2              push_0 push_0
4015   2              swap_6 pop_1
3916   2              push_ 1
3206   2              swap_8 pop_1
3187   3              read_mem_3 pop_1 xx_mul
2724   2              swap_2 swap_4
2612   2              pop_1 xx_add
2566   2              xx_mul xx_add
2173   2              dup_6 dup_6
1772   3              swap_1 swap_2 swap_3
1694   4              read_mem_3 pop_1 xx_mul xx_add
1674   3              push_0 push_0 push_0
1605   4              xx_mul swap_1 swap_2 swap_3
1361   3              swap_2 swap_4 pop_1
1360   8              swap_2 swap_4 swap_6 pop_1 swap_2 swap_4 pop_1 assert_vector
1120   17             push_1 dup_6 read_mem_1 swap_8 pop_1 dup_11 and dup_10 xor push_0 push_0 dup_11 read_mem_3 swap_15 pop_1 call 12130 merkle_step

The total number of instructions executed is around 285 000. Therefore, even the most common sequence of 2 instructions hardly seems worth optimizing for.

To be honest, this data does not show me a clear path as to which instruction patterns to prioritize. I also can't confirm any of the patterns @aszepieniec identified last week. (This does not include dual_merkle_step, which addresses a different problem.)

Footnotes

  [1] In order to simplify tokenizing, I break with the convention of writing, e.g., push 3, and write push_3 instead.

@Sword-Smith (Collaborator)

I did some analysis ...

This confirms my sense that we've more or less hit a wall when it comes to verifier optimizations. It would probably be possible to get the verifier for expansion factor 4 below the next power-of-two threshold, but it would require a great deal of work, including the trick that halves the number of memory table rows at the cost of more columns. Cf. the latest verifier benchmark for alpha-6:

{
    "name": "tasmlib_verifier_stark_verify_inner_padded_height_524288_fri_exp_4",
    "benchmark_result": {
      "clock_cycle_count": 285183,
      "hash_table_height": 234025,
      "u32_table_height": 211197,
      "op_stack_table_height": 235124,
      "ram_table_height": 281599
    },
    "case": "CommonCase"
  }

Getting clock_cycle_count and ram_table_height below $2^{18} = 262144$ is probably possible with the introduction of a few new instructions from @jan-ferdinand's comment above and a trick to halve the ram_table_height. But I think a better way forward is to focus on neptune-core's consensus logic and see which new bottlenecks show up there. That process has already led to one new required instruction.

@Sword-Smith (Collaborator) commented Jul 25, 2024

I did some analysis ...

Although this is a great analysis, it will not find all candidates for new instructions, since the current code is optimized for the current instruction set. An alternative approach would be to analyze the mathematical primitives of the verifier and add new instructions tailor-made for them. I guess get_colinear_y is an example of this process.

@aszepieniec (Collaborator, Author)

This proposal targets a more efficient inner-product-and-row-hashing step, which takes place in the recursive verifier:

  1. once for every revealed main table row, the elements of which are BFieldElements;
  2. once for every revealed auxiliary table row, the elements of which are XFieldElements;
  3. once for every revealed quotient table row, the width of which is much smaller and the elements of which are XFieldElements;
  4. once for the out-of-domain row, which includes main, auxiliary, and quotient portions, all as XFieldElements but without hashing.

Currently, the hashing part of these steps accounts for 17% of the RAM Table and 8% of the Processor Table; the inner product part accounts for 56% of the RAM Table and 16% of the Processor Table. While the Processor Table dominates at 285185 rows (according to @Sword-Smith's most recent profile for version alpha-6 with expansion factor 4), the RAM Table is not far behind with 281599.

This proposal is chiefly intended to shrink the RAM Table row count. Without matching drops in the Processor Table row count, it is of little use; in other words, it should be considered in combination with another proposal that shrinks the Processor Table row count.

The proposal involves modifying the STARK architecture and introducing some new instructions.

STARK architecture.

Use univariate batching instead of linear batching of execution trace columns. Basically this means using powers of the same single weight instead of independent weights for every column. In fact, this modification alone already generates a speedup for the prover and, with a Horner counterpart to dot_step, cuts into the RAM Table row count significantly (because you don't have to read the same weights from RAM over and over again). It does this while affecting soundness only marginally, in accordance with the Schwartz-Zippel lemma.
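
As a sketch of the idea: with column values $c_0, \ldots, c_{n-1}$ and a single verifier challenge $w$, the batched combination becomes the polynomial evaluation

$$\sum_{i=0}^{n-1} w^{n-1-i} c_i = (\cdots(c_0 \cdot w + c_1) \cdot w + \cdots) \cdot w + c_{n-1},$$

which Horner's rule accumulates with one multiplication and one addition per column, and with no weights read from RAM. By Schwartz-Zippel, the soundness error grows only by a factor of roughly the number of batched columns.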

Instruction shift_divine i.

Sends stack

_ a b c D E F q r s t u v w x y z

to

_ a b c D E F * * * * * q r s t u

Of the top 10 elements, the i topmost are deleted (in the example, i = 5); the remainder are shifted upwards, with i nondeterministic elements (*) inserted below them. The bottom-most 6 elements remain intact.

This instruction's only effect is to modify the stack as described. As this map is linear, it generates constraints of degree 1. (And, if I'm not mistaken, only one constraint.)

Instruction sponge_absorb_inplace.

Does the same as sponge_absorb but leaves the stack as-is. We could even consider modifying the behavior of the existing instruction sponge_absorb to conform to this description.

Instruction xxxhorner.

Sends stack

_ a b c d e f z y x w v u t s r q

to

_ a b c D E F z y x w v u t s r q

where, stretching notation, [F E D] = ((([f e d] * [c b a]) + [q r s]) * [c b a] + [t u v]) * [c b a] + [w x y] as XFieldElements.
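
Equivalently, in XFieldElement notation with accumulator $\mathit{acc} = [f\,e\,d]$, weight $z = [c\,b\,a]$, and top-of-stack coefficients $x_0 = [q\,r\,s]$, $x_1 = [t\,u\,v]$, $x_2 = [w\,x\,y]$, the instruction performs three Horner steps at once: $\mathit{acc}' = ((\mathit{acc} \cdot z + x_0) \cdot z + x_1) \cdot z + x_2$.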

Instruction bbbhorneri.

Note that i denotes one of [0, 3, 6]. So really this covers 3 instructions.

Instruction bbbhorner0 sends stack

_ a b c d e f z y x w v u t s r q

to

_ a b c D E F z y x w v u t s r q

where, stretching notation, [F E D] = ((([f e d] * [c b a]) + [0 0 q]) * [c b a] + [0 0 r]) * [c b a] + [0 0 s] as XFieldElements. When i=3 (or i=6), the next (or next again) triple of coefficients is used.

Analysis.

This proposal drops all RAM accesses involved in the inner-product-and-row-hashing steps, and promises to reduce the RAM Table row count by 67% to some 93000.

Furthermore, it stands to reduce the Processor Table's row count as well, but less dramatically. Currently:

  • For every 10 XFieldElements, use sponge_absorb_mem 3 times for hashing and xx_dot_step 10 times for the inner product, making for a total of 13 cycles.
  • For every 10 BFieldElements, use sponge_absorb_mem 1 time for hashing and xb_dot_step 10 times for the inner product, making for a total of 11 cycles.

With the proposed changes:

  • For every 100 XFieldElements, use the following tuple 3 times, making for a total of 37 cycles.
    • xxxhorner --> * # # # # # # # # #
    • sponge_absorb_inplace --> & $ $ $ $ $ $ $ $ $
    • shift_divine 9 --> * * * * * * * * * &
    • xxxhorner --> * # # # # # # # # @
    • shift_divine 1 --> * * # # # # # # # #
    • sponge_absorb_inplace --> & & $ $ $ $ $ $ $ $
    • shift_divine 8 --> * * * * * * * * & &
    • xxxhorner --> * # # # # # # # @ @
    • shift_divine 2 --> * * * # # # # # # #
    • sponge_absorb_inplace --> & & & $ $ $ $ $ $ $
    • shift_divine 7 --> * * * * * * * & & &
    • xxxhorner --> * # # # # # # @ @ @
    • shift_divine 3 --> * * * * # # # # # #
    • sponge_absorb_inplace --> & & & & $ $ $ $ $ $
    • shift_divine 6 --> * * * * * * & & & &
    • xxxhorner --> * # # # # # @ @ @ @
    • shift_divine 4 --> * * * * * # # # # #
    • sponge_absorb_inplace --> & & & & & $ $ $ $ $
    • shift_divine 5 --> * * * * * & & & & &
    • xxxhorner --> * # # # # @ @ @ @ @
    • shift_divine 5 --> * * * * * * # # # #
    • sponge_absorb_inplace --> & & & & & & $ $ $ $
    • shift_divine 4 --> * * * * & & & & & &
    • xxxhorner --> * # # # @ @ @ @ @ @
    • shift_divine 6 --> * * * * * * * # # #
    • sponge_absorb_inplace --> & & & & & & & $ $ $
    • shift_divine 3 --> * * * & & & & & & &
    • xxxhorner --> * # # @ @ @ @ @ @ @
    • shift_divine 7 --> * * * * * * * * # #
    • sponge_absorb_inplace --> & & & & & & & & $ $
    • shift_divine 2 --> * * & & & & & & & &
    • xxxhorner --> * # @ @ @ @ @ @ @ @
    • shift_divine 8 --> * * * * * * * * * #
    • sponge_absorb_inplace --> & & & & & & & & & $
    • shift_divine 1 --> * & & & & & & & & &
    • xxxhorner --> * @ @ @ @ @ @ @ @ @
    • shift_divine 9 --> * * * * * * * * * *
  • For every 100 BFieldElements, use the above tuple but replace every occurrence of xxxhorner with bbbhorner0 bbbhorner3 bbbhorner6, making for a total of 57 cycles.

We have roughly 350 base field columns and 78 extension field columns, so, rounding upwards to compute a performance factor, one gets:

  • 35 * 11 + 8 * 13 = 385 + 104 = 489 (current cost for one row)
  • 4 * 57 + 1 * 37 = 228 + 37 = 265 (projected cost for one row with the current proposal)
  • improvement: ~46% reduction

Applying that improvement to the current total row count across both hashing and inner product steps for both main and auxiliary tables, one gets 0.54 * ( 20000 (hashing) + 47000 (inner product) ) = 36180. This difference would put the Processor Table row count (currently at 285185) below the next power of two.

A lot of complexity comes from sponge_absorb(_inplace) absorbing 10 elements whereas xxxhorner (and bbbhorner) accumulate 9 (and 3). It is worth considering the option to hash only 9 elements per permutation as we hash rows. This decreases the throughput for this step, but simplifies the tasm code significantly.
