fix(tee-prover): mitigate panic on redeployments #2764

Merged on Sep 2, 2024 (5 commits)

pbeza
Collaborator

@pbeza pbeza commented Aug 28, 2024

What ❔

We experienced a tee-prover panic, likely caused by the automatic redeployment of the proof-data-handler in the staging environment. For an extended period, requests to http://server-v2-proof-data-handler-internal.stage.matterlabs.corp/tee/proof_input returned 503 Service Unavailable, and the tee-prover panicked once it reached the retry limit.

Relevant code causing the panic:

    if !err.is_retriable() || retries > self.config.max_retries {
        return Err(err.into());
    }

Relevant logs.
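
For context, here is a minimal sketch of how a capped exponential-backoff retry loop of this kind can end in an error once the retry budget is spent. This is not the actual tee-prover code: `FetchError`, `retry_with_backoff`, the injected `sleep` callback, and the 1-second initial delay are illustrative assumptions, and the real counter check (`retries > self.config.max_retries`) may differ by one from the check used here.

```rust
use std::time::Duration;

// Hypothetical error type standing in for the real client error.
#[derive(Debug, PartialEq)]
enum FetchError {
    Unavailable, // e.g. a 503 response, worth retrying
    Fatal,       // anything that retrying cannot fix
}

impl FetchError {
    fn is_retriable(&self) -> bool {
        matches!(self, FetchError::Unavailable)
    }
}

// Calls `op` until it succeeds, fails with a non-retriable error,
// or `max_retries` retries have been used up.
// `sleep` is injected so the loop can be exercised without real waiting.
fn retry_with_backoff<T>(
    max_retries: usize,
    max_backoff: Duration,
    mut op: impl FnMut() -> Result<T, FetchError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, FetchError> {
    let mut retries = 0usize;
    let mut backoff = Duration::from_secs(1); // assumed initial delay
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Give up on fatal errors or when the budget is exhausted.
                if !err.is_retriable() || retries >= max_retries {
                    return Err(err);
                }
                retries += 1;
                sleep(backoff);
                // Double the delay, but never exceed the cap.
                backoff = (backoff * 2).min(max_backoff);
            }
        }
    }
}
```

In this sketch, with `max_retries = 10` and a 128-second cap, an endpoint that keeps returning a retriable error causes ten sleeps and then the last error is returned; a caller that unwraps that result would panic.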

Why ❔

To mitigate panics on proof-data-handler redeployments.

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Tests for the changes have been added / updated.
  • Documentation comments have been added / updated.
  • Code has been formatted via zk fmt and zk lint.

@pbeza
Collaborator Author

pbeza commented Aug 28, 2024

With this code change, tee-prover will retry connecting to http://server-v2-proof-data-handler-internal.stage.matterlabs.corp/tee/proof_input for a maximum of:

$$\begin{aligned} \sum_{i=0}^{9} \min(2^i, 128) &=\sum_{i=0}^{7} 2^i + \min(2^8, 128) + \min(2^9, 128) \\ &= 2^{8} - 1 + 128 + 128 \\ &= 255 + 256 \\ &= 511 \end{aligned} $$

seconds, which is 8 minutes and 31 seconds if my math is correct. :P
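
The arithmetic above can be checked with a few lines of Rust. This is only a sketch of the calculation, assuming 10 waits that start at 1 second, double each time, and are capped at 128 seconds; `total_backoff_secs` and `format_min_sec` are hypothetical helper names, not functions from the codebase.

```rust
/// Worst-case total wait in seconds: the sum of min(2^i, cap_secs)
/// for i = 0 .. retries - 1.
fn total_backoff_secs(retries: u32, cap_secs: u64) -> u64 {
    (0..retries)
        .map(|i| 2u64.saturating_pow(i).min(cap_secs))
        .sum()
}

/// Renders a number of seconds as minutes and seconds.
fn format_min_sec(total_secs: u64) -> String {
    format!("{} min {} s", total_secs / 60, total_secs % 60)
}
```

Calling `total_backoff_secs(10, 128)` gives 511, and `format_min_sec(511)` renders it as `8 min 31 s`, which agrees with the figure in the comment.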

Not sure if just increasing the number of retries is good enough, though. I welcome any other suggestions. Perhaps we can allow panics during redeployments?

@pbeza pbeza marked this pull request as ready for review August 28, 2024 16:16
@pbeza pbeza force-pushed the tee/fix/mitigate-panic-on-tee-prover-redeployment branch from 6cb1a66 to 37af465 Compare August 29, 2024 12:11
@pbeza
Collaborator Author

pbeza commented Aug 30, 2024

@popzxc, PTAL.

I’ve addressed all your comments. I think the failing tests are just being flaky.

@pbeza pbeza requested a review from popzxc August 30, 2024 14:29
@haraldh haraldh added this pull request to the merge queue Sep 2, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 2, 2024
@haraldh haraldh added this pull request to the merge queue Sep 2, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 2, 2024
@haraldh haraldh added this pull request to the merge queue Sep 2, 2024
Merged via the queue into main with commit 178b386 Sep 2, 2024
54 checks passed
@haraldh haraldh deleted the tee/fix/mitigate-panic-on-tee-prover-redeployment branch September 2, 2024 16:34
haraldh added a commit that referenced this pull request Sep 3, 2024
With [teepot PR 196](matter-labs/teepot#196) merged,
update the `flake.lock` for `teepot`
to use the `--env-prefix` argument for `tee-key-preexec`.

This aligns the environment variable names, which were changed in
#2764

Signed-off-by: Harald Hoyer <[email protected]>
slowli added a commit that referenced this pull request Sep 3, 2024
github-merge-queue bot pushed a commit that referenced this pull request Sep 3, 2024
…2789)

## What ❔

With matter-labs/teepot#196 merged, update the
`flake.lock` for `teepot` to use the `--env-prefix` argument for
`tee-key-preexec`.

## Why ❔

This aligns the environment variable names, which were changed in
#2764

## Checklist

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [ ] Documentation comments have been added / updated.
- [ ] Code has been formatted via `zk fmt` and `zk lint`.

Signed-off-by: Harald Hoyer <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Sep 5, 2024
🤖 I have created a release *beep* *boop*
---


## [24.24.0](core-v24.23.0...core-v24.24.0) (2024-09-05)


### Features

* conditional cbt l1 updates
([#2748](#2748))
([6d18061](6d18061))
* **eth-watch:** do not query events from earliest block
([#2810](#2810))
([1da3f7e](1da3f7e))
* **genesis:** Validate genesis config against L1
([#2786](#2786))
([b2dd9a5](b2dd9a5))
* Integrate tracers and implement circuits tracer in vm2
([#2653](#2653))
([87b02e3](87b02e3))
* Move prover data to
/home/popzxc/workspace/current/zksync-era/prover/data
([#2778](#2778))
([62e4d46](62e4d46))
* Remove prover db from house keeper
([#2795](#2795))
([85b7346](85b7346))
* **vm-runner:** Implement batch data prefetching
([#2724](#2724))
([d01840d](d01840d))
* **vm:** Extract batch executor to separate crate
([#2702](#2702))
([b82dfa4](b82dfa4))
* **vm:** Simplify VM interface
([#2760](#2760))
([c3bde47](c3bde47))
* **zk_toolbox:** add multi-chain CI integration test
([#2594](#2594))
([05c940e](05c940e))


### Bug Fixes

* **config:** Do not panic for observability config
([#2639](#2639))
([1e768d4](1e768d4))
* **core:** Batched event processing support for Reth
([#2623](#2623))
([958dfdc](958dfdc))
* return correct witness inputs
([#2770](#2770))
([2516e2e](2516e2e))
* **tee-prover:** increase retries to reduce spurious alerts
([#2776](#2776))
([4fdc806](4fdc806))
* **tee-prover:** mitigate panic on redeployments
([#2764](#2764))
([178b386](178b386))
* **tee:** lowercase enum TEE types
([#2798](#2798))
([0f2f9bd](0f2f9bd))
* **vm-runner:** Fix statement timeouts in VM playground
([#2772](#2772))
([d3cd553](d3cd553))


### Performance Improvements

* **vm:** Fix VM performance regression on CI loadtest
([#2782](#2782))
([bc0d7d5](bc0d7d5))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: zksync-era-bot <[email protected]>