Optimistic Relay #380

Merged
merged 13 commits into main from mikeneuder-may1-prod-opt-relay
May 12, 2023

michaelneuder
Collaborator

NOTE: Description copied from: #285
ultra sound relay has been running this version of the relay in prod for ~1 month with no major issues

📝 Summary

tl;dr: This PR implements an "optimistic" version of Flashbots' mev-boost-relay. The core idea is that we can improve the throughput and latency of block submissions by using asynchronous processing. Instead of validating each block that comes in immediately, we return early and mark the block as eligible to win the auction before checking its validity. This allows builders to submit more blocks and, critically, to submit blocks later in the slot. Later blocks will have more MEV and thus a higher probability of winning the block auction. All blocks sent to the relay are still validated, and only a single slot at a time is being processed optimistically. In practice we expect that the first few milliseconds of Slot n will handle the remaining Slot n-1 optimistic blocks, which may have been received very close to the slot boundary. Proposers still use the Proposer API to select the highest bid, but there is no longer a guarantee that this block is valid (because we may not have had time to asynchronously validate it). To submit blocks in the optimistic mode, a builder must post collateral that is greater than the value of the blocks they are submitting. If a proposer ends up signing an invalid block, collateral from the builder of that block will be used to refund the proposer for the missed slot.

📚 References

This joint work from Justin Drake, AlphaMonad, and me is a revision of michaelneuder#2. Many thanks to Chris Hager, Alex Stokes, Mateusz Morusiewicz, and Builder0x69 for feedback and support thus far!!

⛱ Motivation and Context

The changes can be described through 3 sequences:

  1. Submitting optimistic blocks.
  2. Validating optimistic blocks.
  3. Proposing optimistic blocks.

1. Submitting optimistic blocks.

  1. Block builders submit blocks to the Builder API endpoint of the relay.
  2. Based on the collateral and status of the builder:
    a. if the builder collateral is greater than the value of the block and the builder is optimistic, run the block simulation optimistically in a different goroutine.
    b. the optimistic block processor adds 1 to the optBlock waitgroup, which is used to synchronize all the optimistic block validations happening concurrently during the slot.
    c. otherwise if the builder is highPrio send the block to the highPrio queue of the prio-load-balancer.
    d. else send the block to the lowPrio queue of the prio-load-balancer.
  3. For non-optimistic blocks, wait for the validation result.
  4. Update the builder's current bid in redis.

Notice that for builders with sufficient collateral, we update the bid without validating the incoming block (though we queue it for async processing). This is where the improved performance can be achieved.
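
As a rough illustration of the routing decision in step 2, here is a minimal sketch. The `builderStatus` struct, the `route` type, and `chooseRoute` are hypothetical names invented for this example and are not taken from the PR; collateral and block value are simplified to `uint64`, whereas the relay presumably handles larger integer values.

```go
package main

import "fmt"

// builderStatus is a hypothetical per-builder record; the field names are
// assumptions made for this sketch.
type builderStatus struct {
	IsOptimistic bool
	IsHighPrio   bool
	Collateral   uint64 // simplified; real values are unlikely to fit in uint64
}

// route names the path a submitted block takes through the relay.
type route string

const (
	routeOptimistic route = "optimistic"
	routeHighPrio   route = "high-prio"
	routeLowPrio    route = "low-prio"
)

// chooseRoute mirrors step 2 of the submission flow: optimistic processing
// requires both the optimistic flag and collateral strictly greater than the
// block value; otherwise the block goes to the high-prio or low-prio queue.
func chooseRoute(b builderStatus, blockValue uint64) route {
	switch {
	case b.IsOptimistic && b.Collateral > blockValue:
		return routeOptimistic
	case b.IsHighPrio:
		return routeHighPrio
	default:
		return routeLowPrio
	}
}

func main() {
	b := builderStatus{IsOptimistic: true, IsHighPrio: true, Collateral: 100}
	fmt.Println(chooseRoute(b, 50))  // optimistic
	fmt.Println(chooseRoute(b, 100)) // high-prio: collateral does not exceed the value
	b.IsHighPrio = false
	fmt.Println(chooseRoute(b, 150)) // low-prio
}
```

Only the optimistic route skips waiting for the validation result; the other two block until the simulation returns.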


2. Validating optimistic blocks.

  1. The optimistic block processor sends the block as low-prio to the prio-load-balancer for simulation.
  2. The block is simulated on the validating nodes, and the status is returned to the optimistic block processor.
  3. If the simulation failed, the builder is demoted and the details of the failure are written to the database.
  4. The optBlock waitgroup is decremented by one, indicating that this goroutine has completed its tasks.

This flow handles the simulation of all the blocks that were optimistically skipped. An invalid block here results in a demotion, but not necessarily a refund for the proposer because we don't yet know whether this block was the winning bid and thus signed + proposed.
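
The validation flow above can be sketched with a `sync.WaitGroup` and one goroutine per optimistic block. This is an illustration under stated assumptions, not the code in this PR: `demotions`, `submission`, `simulateStub`, and `validateSlot` are hypothetical, the demotions table is replaced by an in-memory map, and the simulation is a stub that only rejects the hash `0xbad`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// demotions is a hypothetical in-memory stand-in for the demotions table.
type demotions struct {
	mu      sync.Mutex
	reasons map[string]string // builder pubkey -> simulation error
}

func (d *demotions) record(pubkey, reason string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.reasons[pubkey] = reason
}

// submission pairs a block with the builder that sent it.
type submission struct {
	builderPubkey string
	blockHash     string
}

// simulateStub stands in for the round trip through the prio-load-balancer
// to the validating nodes. In this sketch only the hash "0xbad" fails.
func simulateStub(blockHash string) error {
	if blockHash == "0xbad" {
		return errors.New("invalid block")
	}
	return nil
}

// processOptimisticBlock simulates one block, records a demotion if the
// simulation fails, and decrements the waitgroup when it is finished.
func processOptimisticBlock(optBlocks *sync.WaitGroup, d *demotions, s submission) {
	defer optBlocks.Done()
	if err := simulateStub(s.blockHash); err != nil {
		d.record(s.builderPubkey, err.Error())
	}
}

// validateSlot starts one goroutine per submission and waits for all of them.
func validateSlot(subs []submission) map[string]string {
	d := &demotions{reasons: make(map[string]string)}
	var optBlocks sync.WaitGroup
	for _, s := range subs {
		optBlocks.Add(1) // incremented before the goroutine starts
		go processOptimisticBlock(&optBlocks, d, s)
	}
	optBlocks.Wait() // all simulations for the slot are done after this
	return d.reasons
}

func main() {
	reasons := validateSlot([]submission{
		{builderPubkey: "builderA", blockHash: "0xgood"},
		{builderPubkey: "builderB", blockHash: "0xbad"},
	})
	fmt.Println(len(reasons), reasons["builderB"]) // 1 invalid block
}
```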


3. Proposing optimistic blocks

  1. mev-boost calls getHeader on the Proposer API of the relay. This is part of the MEV-Boost Block Proposal as documented in https://docs.flashbots.net/flashbots-mev-boost/architecture-overview/block-proposal.
  2. mev-boost calls getPayload on the Proposer API of the relay. This triggers the publication of a SignedBeaconBlock.
  3. The optBlock waitgroup is waited on. This ensures that there are no more optimistic blocks to be simulated for that slot.
  4. The proposer API checks the database for a demotion that matches the header of the winning block. If it is present, then the simulation of the winning block must have failed, and thus a refund is necessary.
  5. If a demotion is found, the proposer API updates the demotion table with the refund justification, which is the SignedBeaconBlock and SignedValidatorRegistration.

This flow represents the process of checking if our optimistic block processing ever results in a validator submitting an invalid block. Since block builders will post collateral, this will be used to reimburse the validator in that case. Since refunds should be a relatively rare event, we plan on handling them manually.
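
Steps 3 to 5 could look roughly like the following. Everything here is a sketch under assumptions: `slotState`, `demotionRow`, `recordDemotion`, and `checkRefund` are hypothetical names, the database is replaced by a map keyed by block hash, and the signed objects are plain strings.

```go
package main

import (
	"fmt"
	"sync"
)

// demotionRow is a hypothetical row of the demotions table.
type demotionRow struct {
	SimError                    string
	SignedBeaconBlock           string // placeholder for the serialized block
	SignedValidatorRegistration string // placeholder for the serialized registration
}

// slotState bundles the waitgroup for the slot with an in-memory
// stand-in for the demotions table, keyed by block hash.
type slotState struct {
	optBlocks sync.WaitGroup
	mu        sync.Mutex
	demotions map[string]*demotionRow
}

func newSlotState() *slotState {
	return &slotState{demotions: make(map[string]*demotionRow)}
}

// recordDemotion is what a failed optimistic simulation would call.
func (s *slotState) recordDemotion(blockHash, simError string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.demotions[blockHash] = &demotionRow{SimError: simError}
}

// checkRefund waits for all outstanding optimistic simulations, then looks
// for a demotion matching the winning block. If one exists it attaches the
// refund justification and reports that a refund is owed.
func (s *slotState) checkRefund(winningHash, signedBlock, signedReg string) bool {
	s.optBlocks.Wait()
	s.mu.Lock()
	defer s.mu.Unlock()
	row, demoted := s.demotions[winningHash]
	if !demoted {
		return false
	}
	row.SignedBeaconBlock = signedBlock
	row.SignedValidatorRegistration = signedReg
	return true
}

func main() {
	s := newSlotState()

	// A late optimistic simulation that fails for the winning block.
	s.optBlocks.Add(1)
	go func() {
		defer s.optBlocks.Done()
		s.recordDemotion("0xwinner", "invalid block")
	}()

	fmt.Println(s.checkRefund("0xwinner", "signed-block", "signed-registration")) // true
	fmt.Println(s.checkRefund("0xother", "signed-block", "signed-registration"))  // false
}
```

Waiting before the lookup is what guarantees that a still-running simulation of the winning block cannot be missed.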


Misc notes

We only allow optimistic processing of one slot at a time. We use a waitgroup to ensure that before we update the optimistic slot, all the previous slot blocks have been processed. This may bleed into the subsequent slot, but that is OK because we are actually shifting that processing time from the end of the previous slot which is where the timing is much more critical for winning bids.

At the beginning of each slot we also cache the status and collateral of each builder. We access this cache during the block submission flow to (1) avoid repeatedly reading the same status + collateral for a builder and (2) avoid race conditions where the status + collateral of a builder changes over the course of a slot due to either a demotion or an internal API call.
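
A minimal sketch of the snapshot idea, assuming the database can be modelled as a map from pubkey to a small value struct. `builderEntry` and `snapshotBuilders` are hypothetical names, and the sketch only shows why a per-slot copy is insulated from mid-slot changes to the source.

```go
package main

import "fmt"

// builderEntry is a hypothetical cached record for one builder.
type builderEntry struct {
	IsOptimistic bool
	Collateral   uint64 // simplified for the sketch
}

// snapshotBuilders copies the current builder records so that lookups during
// the slot read a stable view instead of going back to the database.
func snapshotBuilders(db map[string]builderEntry) map[string]builderEntry {
	cache := make(map[string]builderEntry, len(db))
	for pubkey, entry := range db {
		cache[pubkey] = entry
	}
	return cache
}

func main() {
	db := map[string]builderEntry{
		"builderA": {IsOptimistic: true, Collateral: 100},
	}

	cache := snapshotBuilders(db) // taken at the start of the slot

	// A change lands in the database in the middle of the slot.
	db["builderA"] = builderEntry{IsOptimistic: false, Collateral: 0}
	fmt.Println(cache["builderA"].IsOptimistic) // true: the slot view is unchanged

	cache = snapshotBuilders(db) // next slot picks up the change
	fmt.Println(cache["builderA"].IsOptimistic) // false
}
```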

Since builders may use many different public keys to submit blocks, we allow all of those keys to be backed by a single collateral through the use of a "Collateral ID". This is aimed at simplifying the process of posting collateral, but if any public key using that Collateral ID submits an invalid block, all of the public keys are demoted.
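
To make the shared-collateral demotion concrete, here is a small sketch. The `builder` struct and `demoteByCollateralID` are hypothetical, and the treatment of a builder with an empty Collateral ID (only that key is demoted) is an assumption of this example rather than something stated above.

```go
package main

import (
	"fmt"
	"sort"
)

// builder is a hypothetical record for one builder public key.
type builder struct {
	CollateralID string
	IsOptimistic bool
}

// demoteByCollateralID marks the offending key and every other key backed by
// the same Collateral ID as non-optimistic. It returns the demoted pubkeys in
// sorted order. Assumption: an empty Collateral ID is not shared, so only the
// offending key is demoted in that case.
func demoteByCollateralID(builders map[string]*builder, offender string) []string {
	b, ok := builders[offender]
	if !ok {
		return nil
	}
	if b.CollateralID == "" {
		b.IsOptimistic = false
		return []string{offender}
	}
	var demoted []string
	for pubkey, other := range builders {
		if other.CollateralID == b.CollateralID {
			other.IsOptimistic = false
			demoted = append(demoted, pubkey)
		}
	}
	sort.Strings(demoted)
	return demoted
}

func main() {
	builders := map[string]*builder{
		"pk1": {CollateralID: "c1", IsOptimistic: true},
		"pk2": {CollateralID: "c1", IsOptimistic: true},
		"pk3": {CollateralID: "c2", IsOptimistic: true},
	}
	fmt.Println(demoteByCollateralID(builders, "pk1")) // [pk1 pk2]
	fmt.Println(builders["pk3"].IsOptimistic)          // true
}
```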

We introduce a database migration by adding is_optimistic, collateral, builder_id as columns to the block builder table and a number of timing columns to the block builder submission table for profiling the performance of the relay. We also introduce one new table for the demotions. Additionally, we introduce an internal API on the path /internal/v1/builder/collateral/{pubkey} to provide a convenient way of updating the collateral of a builder in the DB.

We are currently running this version of the relay on Goerli testnet and have confirmed that the flows described above work e2e. Additionally, we added a number of unit tests to exercise the new logic. See https://github.com/michaelneuder/opt-relay-docs/blob/main/proposal.md#learnings-from-goerli for more details on what we learned running this relay on Goerli.


✅ I have run these commands

  • make lint
  • make test-race (this gave a few errors, but i think it is likely because we are doing a lot of asynchronous processing now, not quite sure how to deal with this)
  • go mod tidy
  • I have seen and agree to CONTRIBUTING.md

@codecov-commenter

codecov-commenter commented May 1, 2023

Codecov Report

Attention: Patch coverage is 50.19231% with 259 lines in your changes missing coverage. Please review.

Project coverage is 30.68%. Comparing base (75f6c16) to head (342f761).
Report is 136 commits behind head on main.

Files with missing lines                    Patch %   Lines
services/api/service.go                     61.05%    97 Missing and 14 partials
database/mockdb.go                           0.00%    60 Missing
beaconclient/mock_multi_beacon_client.go     0.00%    42 Missing
common/test_utils.go                        30.76%    18 Missing
database/database.go                        85.22%    9 Missing and 4 partials
services/api/blocksim_ratelimiter.go         0.00%    8 Missing
database/typesconv.go                        0.00%    3 Missing
common/common.go                             0.00%    2 Missing
common/types.go                              0.00%    2 Missing

Additional details and impacted files
@@             Coverage Diff             @@
##             main     #380       +/-   ##
===========================================
+ Coverage   20.35%   30.68%   +10.32%     
===========================================
  Files          21       24        +3     
  Lines        4082     4507      +425     
===========================================
+ Hits          831     1383      +552     
+ Misses       3149     2960      -189     
- Partials      102      164       +62     
Flag Coverage Δ
unittests 30.68% <50.19%> (+10.32%) ⬆️


@metachris
Collaborator

As discussed previously (i.e. in #285), we want to merge this PR next. Going through a final round of review and testing before.

@JustinDrake

Heads up: we're chasing some sort of issue where is_high_prio changes unexpectedly for the ultra sound relay. This is seemingly correlated with optimistic relaying and demotion events.

services/api/service.go
ignoreError := simErr.Error() == ErrBlockAlreadyKnown || simErr.Error() == ErrBlockRequiresReorg || strings.Contains(simErr.Error(), ErrMissingTrieNode)
if !ignoreError {
	// Mark builder as non-optimistic.
	opts.builder.status.IsOptimistic = false
Collaborator

what's the effect of this? opts doesn't seem to be used further down 🤔

Collaborator Author

opts.builder has the type *blockBuilderCacheEntry, so we are modifying the status of the entry so the builder's blocks are immediately rejected as non-optimistic (on this instance of the builder api).

Collaborator

hmmm i see, this is a bit non-obvious, opts is also passed in as value.

i'm not feeling great about this function having side-effects.

would it maybe be cleaner if the handling of the errors happens outside of simulateBlock?

Collaborator Author

sure, that's fine! we can easily move it to processOptimisticBlock. it is probably cleaner there anyway, because we only need to flip the bit when we are in optimistic mode.
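
For readers following this thread, a small sketch of the point being discussed: a struct passed by value still shares any pointer fields with the caller, so a write through `opts.builder` is visible outside the function. The type `blockBuilderCacheEntry` and the function names `simulateBlock` and `processOptimisticBlock` come from the thread; the struct name `blockSimOptions`, the `valid` field, and the function bodies are assumptions made for illustration, and the error filtering from the diff above is left out.

```go
package main

import (
	"errors"
	"fmt"
)

type builderStatus struct{ IsOptimistic bool }

type blockBuilderCacheEntry struct{ status builderStatus }

// blockSimOptions is an assumed shape for the opts argument.
type blockSimOptions struct {
	builder *blockBuilderCacheEntry
	valid   bool // stand-in for the outcome of the real simulation
}

var errSimFailed = errors.New("simulation failed")

// simulateBlock has no side effects in this sketch: it only reports the result.
func simulateBlock(opts blockSimOptions) error {
	if !opts.valid {
		return errSimFailed
	}
	return nil
}

// processOptimisticBlock owns the demotion. opts is a copy, but opts.builder
// still points at the shared cache entry, so the write is visible to callers.
func processOptimisticBlock(opts blockSimOptions) error {
	err := simulateBlock(opts)
	if err != nil {
		opts.builder.status.IsOptimistic = false
	}
	return err
}

func main() {
	entry := &blockBuilderCacheEntry{status: builderStatus{IsOptimistic: true}}

	_ = simulateBlock(blockSimOptions{builder: entry, valid: false})
	fmt.Println(entry.status.IsOptimistic) // true: simulateBlock does not mutate

	_ = processOptimisticBlock(blockSimOptions{builder: entry, valid: false})
	fmt.Println(entry.status.IsOptimistic) // false: the caller-side handler demoted
}
```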

@metachris metachris merged commit 15c14de into main May 12, 2023
@metachris metachris deleted the mikeneuder-may1-prod-opt-relay branch May 12, 2023 09:00