Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelize Filling of Degree-Lowering Table #284

Merged
merged 12 commits into from
May 22, 2024

Conversation

aszepieniec
Copy link
Collaborator

This PR modifies the auto-generated code for filling the degree-lowering table, essentially upgrading it from sequential to parallel iteration. To achieve this, the master base and extension tables are split into left and right parts at the column index separating original columns from degree-lowering columns. Then parallel iteration over the rows allows filling in the degree-lowering rows left to right. No value in any degree-lowering row depends on values to its right.

A complication arises when filling the degree-lowering columns for transition constraints as the degree-lowering values depend on two rows, current and next. The solution is already implied by the AIR constraints, which ensures that all degree-lowering values live in the current row. Therefore, by parallel iteration over single rows of the degree-lowering part, and overlapping row pairs of the original part, one can fill the former left to right. Note that the overlapping row pairs do not interfere with parallel execution because these are immutable references. Rust disallows multiple mutable references to the same data, but in this case only the right half of the table is mutable, and there we select one row per iteration.

Much of the kudos goes to @jan-ferdinand who came up with the blueprint for the solution.

On my desktop machine, benchmarking prove_halt --

sequential:
image

parallel:
image

Copy link
Member

@jan-ferdinand jan-ferdinand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff! I've left a few nits & questions inline.

constraint-evaluation-generator/src/substitution.rs Outdated Show resolved Hide resolved
constraint-evaluation-generator/src/substitution.rs Outdated Show resolved Hide resolved
constraint-evaluation-generator/src/substitution.rs Outdated Show resolved Hide resolved
constraint-evaluation-generator/src/substitution.rs Outdated Show resolved Hide resolved
constraint-evaluation-generator/src/substitution.rs Outdated Show resolved Hide resolved
@jan-ferdinand
Copy link
Member

I'd like to point out that in the screenshots, the number of iterations is different: 292 in the sequential case, 347 in the parallel. This gives the following mean times & improvements:

mean [ms] old new
pad 1.62 0.29 -82%
fill 1.49 0.16 -89%

Excellent work.

@Sword-Smith
Copy link
Collaborator

I benchmarked this on our reference machine, and it gives an additional 2kHz on top of #283 :)

After the #283 branch was rebased against this branch:

### Prove Fibonacci 00000000000000102400       70.35s    #Reps   Share  Category
├─Fiat-Shamir: claim                            4.26µs       1   0.00%  (hash –  0.00%)
├─derive additional parameters                 11.96µs       1   0.00%  
├─base tables                                  30.95s        1  43.99%  
│ ├─create                                      2.36s        1   3.36%  (gen  – 28.76%)
│ ├─pad                                       351.50ms       1   0.50%  (gen  –  4.28%)
│ │ ├─pad original tables                     178.15ms       1   0.25%  
│ │ └─fill degree-lowering table              173.28ms       1   0.25%  
│ ├─randomize trace                            83.18ms       1   0.12%  (gen  –  1.01%)
│ ├─LDE                                        17.56s        1  24.96%  (LDE  – 49.88%)
│ │ ├─polynomial zero-initialization            3.45µs       1   0.00%  
│ │ ├─interpolation                             2.38s        1   3.38%  
│ │ ├─resize                                  772.00ns       1   0.00%  
│ │ ├─memory warmup                           866.28ms       1   1.23%  
│ │ ├─evaluation                               14.32s        1  20.35%  
│ │ └─memoize                                 772.00ns       1   0.00%  
│ ├─Merkle tree                                 5.23s        1   7.44%  (hash – 48.16%)
│ │ ├─leafs                                     4.80s        1   6.83%  
│ │ └─Merkle tree                             406.91ms       1   0.58%  
│ ├─Fiat-Shamir                                24.68µs       1   0.00%  (hash –  0.00%)
│ └─extend                                      5.36s        1   7.62%  (gen  – 65.25%)
├─ext tables                                   15.92s        1  22.64%  
│ ├─randomize trace                            57.53ms       1   0.08%  (gen  –  0.70%)
│ ├─LDE                                        10.98s        1  15.60%  (LDE  – 31.18%)
│ │ ├─polynomial zero-initialization            1.83µs       1   0.00%  
│ │ ├─interpolation                             1.93s        1   2.75%  
│ │ ├─resize                                    1.16µs       1   0.00%  
│ │ ├─memory warmup                           613.01ms       1   0.87%  
│ │ ├─evaluation                                8.36s        1  11.88%  
│ │ └─memoize                                   1.53µs       1   0.00%  
│ ├─Merkle tree                                 4.89s        1   6.95%  (hash – 45.04%)
│ │ ├─leafs                                     4.01s        1   5.71%  
│ │ └─Merkle tree                             756.02ms       1   1.07%  
│ └─Fiat-Shamir                               127.55µs       1   0.00%  (hash –  0.00%)
├─quotient calculation (cached)                 4.74s        1   6.73%  (CC   – 62.59%)
│ ├─zerofier inverse                          864.25ms       1   1.23%  
│ └─evaluate AIR, compute quotient codeword     3.73s        1   5.31%  
├─quotient LDE                                  6.67s        1   9.48%  (LDE  – 18.94%)
├─hash rows of quotient segments              342.65ms       1   0.49%  (hash –  3.15%)
├─Merkle tree                                 396.10ms       1   0.56%  (hash –  3.65%)
├─out-of-domain rows                          605.21ms       1   0.86%  
├─Fiat-Shamir                                  85.84µs       1   0.00%  (hash –  0.00%)
├─linear combination                            2.98s        1   4.23%  
│ ├─base                                      515.97ms       1   0.73%  (CC   –  6.82%)
│ ├─ext                                        75.43ms       1   0.11%  (CC   –  1.00%)
│ └─quotient                                    1.13s        1   1.61%  (CC   – 14.93%)
├─DEEP                                          1.16s        1   1.66%  
│ ├─base&ext curr row                         365.85ms       1   0.52%  
│ ├─base&ext next row                         393.85ms       1   0.56%  
│ └─segmented quotient                        404.62ms       1   0.58%  
├─combined DEEP polynomial                      1.19s        1   1.69%  
│ └─sum                                         1.11s        1   1.58%  (CC   – 14.66%)
├─FRI                                           1.46s        1   2.07%  
└─open trace leafs                              1.56ms       1   0.00%  

### Categories
LDE     35.20s  50.04%
hash    10.86s  15.44%
gen      8.22s  11.68%
CC       7.57s  10.75%

Clock frequency is 14555 Hz (1024009 clock cycles / 70353 ms / 1 iterations)
Optimal clock frequency is 14904 Hz (1048576 padded height / 70353 ms / 1 iterations)
FRI domain length is 2^23

@aszepieniec aszepieniec merged commit 9c02c64 into master May 22, 2024
5 checks passed
@jan-ferdinand jan-ferdinand deleted the asz/fast-fill-degree-lowering-table branch May 22, 2024 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants