feat(proof-data-handler): exclude batches without object file in GCS #2980

Open

pbeza wants to merge 1 commit into main from tee/flag-old-batches-as-permanently-ignored-automatically

Conversation

@pbeza (Collaborator) commented Sep 27, 2024

What ❔

The `/tee/proof_inputs` endpoint no longer returns batches whose corresponding object file has been missing from Google Cloud Storage for an extended period.

Why ❔

TEE's `proof-data-handler` on `mainnet` was flooded with warnings.

Since the recent `24.25.0` redeployment on `mainnet`, we've been flooded with warnings from the `proof-data-handler` (the warnings are actually not fatal in this context):

```
Failed request with a fatal error

(...)

Blobs for batch numbers 490520 to 490555 not found in the object store. Marked as unpicked.
```

The issue is caused by the code behind the `/tee/proof_inputs` endpoint (which is equivalent to the `/proof_generation_data` endpoint): it finds the next batch to send to the requesting `tee-prover` by looking for the first batch that has a corresponding object in the Google object store. As it skips over batches that don't have one, it logs `Failed request with a fatal error` for each of them (unless the skipped batch was already successfully proven, in which case it doesn't log the error). This happens on every request the `tee-prover` sends, which is why we're getting so much noise in the logs.

One possible solution is to manually flag the problematic batches as `permanently_ignored`, like Thomas did before on `mainnet`; this PR automates that instead.
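Roughly, the intended behavior can be sketched like this (a self-contained illustration; the types and helper names are stand-ins, not the actual zksync-era code): a batch whose blob has been missing from the object store for longer than a cutoff gets flagged as permanently ignored, so it stops being retried and logged on every request, while younger blob-less batches are simply left for a later attempt.

```rust
use std::time::{Duration, SystemTime};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobStatus {
    Unpicked,
    PermanentlyIgnored,
}

struct Batch {
    number: u32,
    created_at: SystemTime,
    blob_in_object_store: bool,
}

/// Returns the next batch that has its blob in the object store. Batches with
/// a missing blob are requeued while they are "young" and permanently ignored
/// once they exceed `cutoff`, so they stop generating warnings on every poll.
fn next_provable_batch(batches: &mut Vec<(Batch, JobStatus)>, cutoff: Duration) -> Option<u32> {
    for (batch, status) in batches.iter_mut() {
        if *status == JobStatus::PermanentlyIgnored {
            continue;
        }
        if batch.blob_in_object_store {
            return Some(batch.number); // job found
        }
        let age = SystemTime::now()
            .duration_since(batch.created_at)
            .unwrap_or_default();
        *status = if age > cutoff {
            JobStatus::PermanentlyIgnored // too old: stop retrying, stop logging
        } else {
            JobStatus::Unpicked // still young: leave it for a later attempt
        };
    }
    None
}

fn main() {
    let now = SystemTime::now();
    let ten_days = Duration::from_secs(10 * 24 * 3600);
    let old_without_blob = Batch {
        number: 490_520,
        created_at: now - Duration::from_secs(20 * 24 * 3600),
        blob_in_object_store: false,
    };
    let fresh_with_blob = Batch {
        number: 490_556,
        created_at: now,
        blob_in_object_store: true,
    };
    let mut batches = vec![
        (old_without_blob, JobStatus::Unpicked),
        (fresh_with_blob, JobStatus::Unpicked),
    ];

    assert_eq!(next_provable_batch(&mut batches, ten_days), Some(490_556));
    // The stale batch is now permanently ignored and won't be retried again.
    assert_eq!(batches[0].1, JobStatus::PermanentlyIgnored);
}
```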

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Tests for the changes have been added / updated.
  • Documentation comments have been added / updated.
  • Code has been formatted via zk fmt and zk lint.

@pbeza pbeza force-pushed the tee/flag-old-batches-as-permanently-ignored-automatically branch 3 times, most recently from f1b8ad3 to 65cc26e on September 30, 2024 11:22
@pbeza pbeza marked this pull request as ready for review September 30, 2024 12:02
@pbeza (Collaborator, Author) commented Sep 30, 2024

@popzxc, I remember you mentioned not to ask for code reviews this wave, but you're probably the most familiar with this code (along with @slowli). So, if you could make an exception this time, I’d really appreciate it. If you're busy, no worries – feel free to ignore, and I’ll ask @RomanBrodetski to find someone else. Thanks!

@pbeza (Collaborator, Author) commented Oct 1, 2024

Kindly ping @slowli @RomanBrodetski. I need a reviewer.

core/lib/object_store/src/retries.rs (outdated review thread, resolved)
core/lib/types/src/tee_types.rs (outdated review thread, resolved)
core/lib/types/src/tee_types.rs (outdated review thread, resolved)
core/node/proof_data_handler/src/tee_request_processor.rs (outdated review thread, resolved)
core/node/proof_data_handler/src/tee_request_processor.rs (outdated review thread, resolved)
@RomanBrodetski (Collaborator) left a comment:

@pbeza to be honest I don't fully follow this solution. I understand what we are trying to do (mark older unresolved jobs as skipped), but I'm not sure I understand the Why here. We can discuss over a huddle or async

core/lib/types/src/tee_types.rs (outdated review thread, resolved)
@pbeza (Collaborator, Author) commented Oct 8, 2024

JFYI: This PR is on hold because the code it is based on was recently radically redesigned/refactored here: #3017. This PR may be cherry-picked/revisited once #3017 is merged into main.

@pbeza pbeza force-pushed the tee/flag-old-batches-as-permanently-ignored-automatically branch 17 times, most recently from 4ee505b to bfeddc9 on October 31, 2024 18:29
@pbeza pbeza force-pushed the tee/flag-old-batches-as-permanently-ignored-automatically branch 2 times, most recently from cf2cf1d to 6c7d879 on November 1, 2024 00:58
@pbeza (Collaborator, Author) commented Nov 1, 2024

I rebased this PR on the latest origin/main, added a few more fixes, and did a careful manual retest. Trust me – it works. Please take another look, @slowli!

BTW, sorry for the force-push instead of merging origin/main into this branch.

/tee/proof_inputs endpoint no longer returns batches that have no
corresponding object file in Google Cloud Storage for an extended
period.

Since the recent `mainnet`'s `24.25.0` redeployment, we've been
[flooded with warnings][warnings] for the `proof-data-handler` on
`mainnet` (the warnings are actually _not_ fatal in this context):

```
Failed request with a fatal error

(...)

Blobs for batch numbers 490520 to 490555 not found in the object store.
Marked as unpicked.
```

The issue was caused [by the code][code] behind the `/tee/proof_inputs`
[endpoint][endpoint_proof_inputs] (which is equivalent to the
`/proof_generation_data` [endpoint][endpoint_proof_generation_data]) –
it finds the next batch to send to the [requesting][requesting]
`tee-prover` by looking for the first batch that has a corresponding
object in the Google object store. As it skips over batches that don’t
have the objects, [it logs][logging] `Failed request with a fatal error`
for each one (unless the skipped batch was successfully proven, in which
case it doesn’t log the error). This happens with every
[request][request] the `tee-prover` sends, which is why we were getting
so much noise in the logs.

One possible solution was to manually flag the problematic batches as
`permanently_ignored`, like Thomas [did before][Thomas] on `mainnet`.
It was a quick and dirty workaround, but now we have a more automated
solution.

[warnings]: https://grafana.matterlabs.dev/goto/TjlaXQgHg?orgId=1
[code]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/node/proof_data_handler/src/tee_request_processor.rs#L35-L79
[endpoint_proof_inputs]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/node/proof_data_handler/src/lib.rs#L96
[endpoint_proof_generation_data]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/node/proof_data_handler/src/lib.rs#L67
[requesting]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/bin/zksync_tee_prover/src/tee_prover.rs#L93
[logging]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/lib/object_store/src/retries.rs#L56
[Thomas]: https://matter-labs-workspace.slack.com/archives/C05ANUCGCKV/p1725284962312929
@pbeza pbeza force-pushed the tee/flag-old-batches-as-permanently-ignored-automatically branch from 6c7d879 to 9188f8a on November 8, 2024 03:01
@pbeza (Collaborator, Author) commented Nov 8, 2024

@slowli, kindly ping for a code review.

```diff
@@ -47,49 +51,52 @@ impl TeeRequestProcessor {
     ) -> Result<Option<Json<TeeProofGenerationDataResponse>>, RequestProcessorError> {
         tracing::info!("Received request for proof generation data: {:?}", request);

         let mut min_batch_number = self.config.tee_config.first_tee_processed_batch;
         let mut missing_range: Option<(L1BatchNumber, L1BatchNumber)> = None;
         let batch_ignored_timeout = ChronoDuration::days(10);
```
Collaborator comment:

Hardcode or config parameter?
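For what it's worth, a minimal sketch of the config-parameter option (the struct and field names are hypothetical, not the real `TeeConfig` layout):

```rust
use std::time::Duration;

/// Hypothetical configuration section; in practice this would be wired into
/// the existing proof-data-handler config rather than defined standalone.
#[derive(Debug, Clone)]
struct TeeConfig {
    /// Batches whose blob has been missing from the object store for longer
    /// than this are flagged as permanently ignored.
    batch_permanently_ignored_timeout: Duration,
}

impl Default for TeeConfig {
    fn default() -> Self {
        Self {
            // Same value as the hardcoded `ChronoDuration::days(10)` above.
            batch_permanently_ignored_timeout: Duration::from_secs(10 * 24 * 60 * 60),
        }
    }
}

fn main() {
    let config = TeeConfig::default();
    println!("cutoff: {:?}", config.batch_permanently_ignored_timeout);
}
```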

```rust
.lock_batch_for_proving(request.tee_type, min_batch_number)
.await?
else {
    // No job available
    return Ok(None);
    return Ok(None); // no job available
```
Collaborator comment:
nit: this can now be `break`, too... Either change all `break` to `return` or the other way round
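A small standalone illustration of the nit (all names are made up): exiting the search loop with `break` in both the "no job" and "job found" cases keeps the control flow uniform.

```rust
/// Stand-in for `lock_batch_for_proving`: hand out candidates one at a time.
fn lock_next_batch(candidates: &mut Vec<u32>) -> Option<u32> {
    candidates.pop()
}

/// Stand-in for the object store lookup.
fn blob_exists(batch: u32) -> bool {
    batch % 2 == 0
}

fn find_provable_batch(mut candidates: Vec<u32>) -> Option<u32> {
    loop {
        let Some(batch) = lock_next_batch(&mut candidates) else {
            break None; // no job available
        };
        if blob_exists(batch) {
            break Some(batch); // job found
        }
        // Blob missing: loop around and try the next candidate.
    }
}

fn main() {
    assert_eq!(find_provable_batch(vec![3, 1, 4]), Some(4));
    assert_eq!(find_provable_batch(vec![3, 1]), None);
    println!("ok");
}
```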

Comment on lines +31 to +42
```rust
        f: F,
    ) -> Result<T, ObjectStoreError>
    where
        Fut: Future<Output = Result<T, ObjectStoreError>>,
        F: FnMut() -> Fut,
    {
        self.retry_internal(max_retries, f).await
    }

    async fn retry_internal<T, Fut, F>(
        &self,
        max_retries: u16,
```
Collaborator (Author) comment:
JFYI: this is an artifact that I'm gonna revert.

```toml
zksync_contracts.workspace = true
zksync_basic_types.workspace = true
```
Contributor comment:
Nit: You don't usually need zksync_basic_types as a direct dep if you depend on zksync_types; the latter re-exports a substantial part of basic types.
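To illustrate the nit (assuming the crate already depends on `zksync_types`): the basic types can be used through the re-export, which makes a direct `zksync_basic_types` dependency redundant.

```rust
// `L1BatchNumber` is defined in zksync_basic_types but re-exported by
// zksync_types, so a crate that already pulls in zksync_types does not need
// the extra dependency just for this.
use zksync_types::L1BatchNumber;

fn next_batch(current: L1BatchNumber) -> L1BatchNumber {
    current + 1
}
```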

```rust
};
self.unlock_batch(l1_batch_number, request.tee_type).await?;
min_batch_number = l1_batch_number + 1;
self.unlock_batch(batch_number, request.tee_type, status)
```
Contributor comment:
I feel we've already had this conversation: This looks like backend driven by frontend anti-pattern; batches are only unlocked in response to client requests. I'd imagine that batches should be unlocked on a timeout (currently hard-coded as 10 days) with PermanentlyIgnored status, right? Or is there no harm if a batch is unlocked later?

Collaborator comment:
There is also an unlock part in the SQL query for new jobs ...

```sql
OR ( tee.status = 'picked_by_prover' AND tee.prover_taken_at < NOW() - processing_timeout::INTERVAL )
```
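For context, a rough sketch of how such a clause fits into a job-selection query (this is not the actual zksync-era query: the table and column names are approximations, and it assumes `sqlx` with the Postgres feature). Batches locked longer than the processing timeout are treated as available again, which is the unlock the comment above refers to.

```rust
use sqlx::PgPool;

/// Hypothetical helper mirroring the shape of the quoted query: pick the
/// oldest batch that is either unpicked or has been locked for longer than the
/// processing timeout (i.e. the stale lock is ignored and the batch is
/// effectively unlocked by selection).
async fn lock_next_batch(pool: &PgPool, processing_timeout: &str) -> sqlx::Result<Option<i64>> {
    let row: Option<(i64,)> = sqlx::query_as(
        r#"
        SELECT tee.l1_batch_number
        FROM tee_proof_generation_details AS tee
        WHERE tee.status = 'unpicked'
           OR (tee.status = 'picked_by_prover'
               AND tee.prover_taken_at < NOW() - $1::INTERVAL)
        ORDER BY tee.l1_batch_number
        LIMIT 1
        "#,
    )
    .bind(processing_timeout) // e.g. "600 seconds"
    .fetch_optional(pool)
    .await?;
    Ok(row.map(|(n,)| n))
}
```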

```rust
}
Err(err) => {
    self.unlock_batch(l1_batch_number, request.tee_type).await?;
    self.unlock_batch(
```
Contributor comment:
Dumb question: Why does this unlocking work like that? Suppose this server loses connection to Postgres for a moment, resulting in a RequestProcessorError::Dal error. IIUC, the batch will be marked as unlocked here, but there's seemingly no reason to do so.
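A minimal sketch of the distinction the question raises (the error variants below are simplified stand-ins, not the real `RequestProcessorError`): unlock only when the failure is specific to the batch itself, and keep the lock on what looks like a transient infrastructure error so the batch isn't released for no reason.

```rust
#[derive(Debug)]
enum RequestProcessorError {
    Dal(String),          // e.g. Postgres briefly unreachable: likely transient
    ObjectStore(String),  // blob for this batch missing or unreadable
}

/// Decide whether a failed attempt should release the batch lock.
fn should_unlock(err: &RequestProcessorError) -> bool {
    match err {
        RequestProcessorError::Dal(_) => false,        // keep the lock; retry later
        RequestProcessorError::ObjectStore(_) => true, // give the batch back
    }
}

fn main() {
    let transient = RequestProcessorError::Dal("connection reset by peer".into());
    let missing = RequestProcessorError::ObjectStore("blob not found".into());
    assert!(!should_unlock(&transient));
    assert!(should_unlock(&missing));
}
```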
