
Notify lookup sync of gossip processing results #5722

Merged · 6 commits · May 13, 2024

Conversation

dapplion (Collaborator) commented May 6, 2024

General problem

Lookup sync and gossip interact in very racy ways. They interact at all because we want to avoid downloading via ReqResp blocks that we are already processing. Note that this optimization carries complexity, and we may yet decide it isn't worth it at all.

[Screenshot 2024-05-06 at 23 11 48]

Consider the image above: we can receive an attestation referencing a block not yet imported into fork-choice at these moments in time:

  • A before processing: Easy, download and process the block
  • B during processing: Block is already processing, should skip download. However, we must consider:
    • What to do if processing fails?
    • Should we create a lookup at all? If not, what to do if we receive a second attestation referencing the child of this block?
  • C while waiting blobs: Block is fully processed but not yet in fork-choice. In this case we don't have to worry about processing failures, but the child lookup issue remains.
  • D after import: This trigger should never happen. If it does, we will download all block components and get a BlockAlreadyKnown error.
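The four trigger points above boil down to a small decision table. Here is a minimal Rust sketch of that decision; all names (`BlockState`, `LookupAction`, `on_unknown_block_attestation`) are illustrative assumptions, not Lighthouse's actual types:

```rust
// Hypothetical sketch: how lookup sync could react to an attestation
// referencing an unknown block root, given where that block sits in the
// import pipeline. Names are illustrative, not Lighthouse's actual API.

#[derive(Debug, PartialEq)]
enum BlockState {
    Unknown,       // A: not seen yet
    Processing,    // B: gossip block is being processed
    AwaitingBlobs, // C: block fully processed, waiting on blobs
    Imported,      // D: already in fork-choice
}

#[derive(Debug, PartialEq)]
enum LookupAction {
    DownloadAndProcess,    // A: fetch via ReqResp and process
    AwaitProcessingResult, // B/C: skip the download, wait for the gossip result
    Ignore,                // D: nothing to do; this trigger should not happen
}

fn on_unknown_block_attestation(state: &BlockState) -> LookupAction {
    match state {
        BlockState::Unknown => LookupAction::DownloadAndProcess,
        BlockState::Processing | BlockState::AwaitingBlobs => {
            LookupAction::AwaitProcessingResult
        }
        BlockState::Imported => LookupAction::Ignore,
    }
}
```

The open questions in B and C are exactly about what `AwaitProcessingResult` must do when processing later fails, or when a child lookup arrives while the parent is still in that state.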

Issues with current unstable da9d386

The commit above (da9d386) introduces the optimization to prevent downloads for triggers B and C. However, it fails to explicitly handle both pitfalls listed above: if processing fails, a lookup may get stuck.

It also creates a spammy situation where a lookup gets dropped immediately, once per attestation, while the block is still processing.

The spammy loop has since been fixed, but that makes the processing-failure case worse: completed chains remain in a hashset for 60 seconds, so a failing lookup is blocked from being retried for that period.

Proposed Changes

Sync lookup must be aware of gossip processing results. I think there's no way around that.

The least invasive way to achieve that IMO is:

  • If the block is in the processing cache, mark the lookup request as processing
  • When gossip processing completes, send the sync message BlockComponentProcessed

With the above we tackle:

  1. Immediately retry lookups for blocks that fail processing, instead of waiting 60 seconds
  2. Immediately continue or drop child lookups of blocks received via gossip
  3. No spammy loop, lookups are not marked completed until imported into fork-choice
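The state transitions described above can be sketched as follows; `ProcessResult` and `LookupState` are hypothetical names standing in for Lighthouse's actual types, and the point is only the shape of the handler for a `BlockComponentProcessed` notification:

```rust
// Illustrative sketch, not actual Lighthouse code: when the beacon processor
// finishes a gossip block, it notifies sync, which either retries a failed
// lookup immediately or marks it complete once the block is in fork-choice.

#[derive(Debug, PartialEq)]
enum ProcessResult {
    Imported, // the block made it into fork-choice
    Failed,   // processing error: the lookup must retry the download
}

#[derive(Debug, PartialEq)]
enum LookupState {
    Downloading, // retry the download ourselves, with no 60-second penalty
    Complete,    // imported: child lookups may now continue
}

fn on_block_component_processed(result: ProcessResult) -> LookupState {
    match result {
        // Point 1 above: retry immediately instead of waiting 60 seconds.
        ProcessResult::Failed => LookupState::Downloading,
        // Point 3 above: only an actual import completes the lookup.
        ProcessResult::Imported => LookupState::Complete,
    }
}
```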

Pitfalls

This PR relies on these assumptions:

  • If availability_checker.has_block(block) returns true, the block is permanently valid and is only missing blobs
  • If reqresp_pre_import_cache.contains_key(block) returns true, a BlockComponentProcessed event MUST be emitted some time in the future

We don't have e2e tests to assert those conditions at the moment. I think code comments are good enough for now, but it would be great to codify them somehow in the future.

Todo

  • Update lookup tests, as the behaviour has slightly changed

Closes #5693

@dapplion dapplion mentioned this pull request May 6, 2024
@realbigsean realbigsean added ready-for-review The code is ready for review v5.2.0 Q2 2024 labels May 6, 2024
@dapplion dapplion force-pushed the notify-lookup-sync-process-result branch from 90c30b3 to e6c8955 Compare May 7, 2024 02:40
michaelsproul added a commit that referenced this pull request May 7, 2024
Squashed commit of the following:

commit e6c8955
Author: dapplion <[email protected]>
Date:   Mon May 6 22:52:48 2024 +0900

    Notify lookup sync of gossip processing results
@michaelsproul michaelsproul mentioned this pull request May 7, 2024
@AgeManning (Member) commented:

I was thinking about this more.

In order to avoid sync having to know anything (explicitly) about block processing results, can't we do the following:

  • If a block request comes in and we are processing, then ignore the request
  • If a request for a child comes in and the parent is processing, ignore (or as an optimization, cache).

Eventually the block will either fail or be processed. We will eventually get more requests for the chain of blocks, and they will make progress, because the beacon processor will either fail the block or process it successfully.

The only issue is around restarting the chains. To prevent loops, we could wait a slot or so.

Just thinking this would be a way to avoid explicit communication from the beacon processor.
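The restart-throttling idea above could look roughly like this; everything here (the names, and 12 000 ms as the mainnet slot time) is an illustrative assumption rather than actual Lighthouse code:

```rust
// Sketch: ignore lookup triggers while a block is processing, and rate-limit
// chain restarts to roughly one slot so retries cannot tight-loop.

const SLOT_MILLIS: u64 = 12_000; // assumed mainnet slot duration

struct ChainRestarter {
    last_restart_ms: Option<u64>,
}

impl ChainRestarter {
    fn should_restart(&mut self, now_ms: u64) -> bool {
        match self.last_restart_ms {
            // Within a slot of the last restart: keep ignoring triggers.
            Some(last) if now_ms - last < SLOT_MILLIS => false,
            // Otherwise allow one restart and record its time.
            _ => {
                self.last_restart_ms = Some(now_ms);
                true
            }
        }
    }
}
```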

@dapplion (Collaborator, Author) commented May 7, 2024

If a block request comes in and we are processing, then ignore the request
If a request for a child comes in and the parent is processing, ignore (or as an optimization, cache).

For child lookups it's a bit tricky. Assume you are a bit behind, such that, given this chain of blocks:

Block slot 10 <- peers attesting here
Block slot 9
Block slot 8
Block slot 7
Block slot 6 <- you are here

You have to create a lookup for slot 10, fetch the block, attempt to process it, get an unknown-parent error, fetch 9, and so on until 6. If 6 is processing, the lookup for 7 needs some event to make progress. So the beacon processor must tell sync that block 6 has been processed. Dropping all lookups that are children of 6 while 6 is processing sounds silly; you must leave the child lookups of 6 paused somehow, awaiting their ancestor's processing.
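The chain walk described above can be sketched as follows; `walk_parent_chain` and its slot arithmetic are purely illustrative (real lookups key on block roots, not consecutive slots):

```rust
// Sketch of the parent-chain walk: starting from the attested block at
// slot 10, each processing attempt that returns "unknown parent" extends
// the chain one block further back, until we hit the block gossip is
// already processing; the lookup just above it then has to pause.

fn walk_parent_chain(attested_slot: u64, processing_slot: u64) -> (Vec<u64>, u64) {
    let mut downloaded = Vec::new();
    let mut slot = attested_slot;
    // Fetch and attempt to process each block; the unknown-parent error
    // sends us one slot further back each time.
    while slot > processing_slot {
        downloaded.push(slot);
        slot -= 1;
    }
    // `slot` is now the block gossip is processing; the lookup for
    // `slot + 1` must wait for a BlockComponentProcessed event before it
    // can make progress.
    (downloaded, slot + 1)
}
```

For the example chain (attestation at slot 10, slot 6 processing), the walk downloads 10, 9, 8, 7 and then the lookup for slot 7 is stuck until sync hears about block 6.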

Sync lookup technically doesn't need to know the processing result; it would be enough to have a signal to continue the children of 6. But then we have to add a new SyncMessage and a dedicated handler in block lookups. While that reduces how much information the processor leaks, I don't see how it simplifies things. We already have a BlockComponentProcessed message and its handler, so why not re-use it?

Note: the moment that sync lookup checks the chain's processing cache or da_checker, we break the separation of concerns between sync lookup and gossip processing. If we are crossing that line, let's implement a proper, complete fix.

Notice that these ignores will happen thousands of times, once per attestation, unless we have something like #5706.

@@ -170,21 +172,20 @@ impl<T: BeaconChainTypes> NetworkBeaconProcessor<T> {
if reprocess_tx.try_send(reprocess_msg).is_err() {
error!(self.log, "Failed to inform block import"; "source" => "rpc", "block_root" => %hash)
};
if matches!(process_type, BlockProcessType::SingleBlock { .. }) {
Reviewer (Member):

Why are we removing this match?

Is it redundant because we now only process single blocks here?

dapplion (Collaborator, Author):

Yes, blobs are processed in process_rpc_blobs

@realbigsean (Member):

@mergify queue

mergify bot commented May 13, 2024

queue

🛑 The pull request has been removed from the queue default

The queue conditions cannot be satisfied due to failing checks.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@dapplion (Collaborator, Author):
@mergify queue


@realbigsean (Member):

@mergify refresh

mergify bot commented May 13, 2024

refresh

✅ Pull request refreshed

@realbigsean (Member):

@mergify queue

mergify bot commented May 13, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 93e0649

@mergify mergify bot merged commit 93e0649 into sigp:unstable May 13, 2024
27 checks passed
@dapplion dapplion deleted the notify-lookup-sync-process-result branch May 13, 2024 14:33