
Fix: Use lock properly in filter_order_by_nonces to do it concurrently #38

Merged
6 commits merged into flashbots:develop on Jul 15, 2024

Conversation


@ghost ghost commented Jul 7, 2024

📝 Summary

The current use of Mutex on nonce_cache in filter_order_by_nonces is incorrect: it makes the function run sequentially, because every task waits for the previous one to finish (both reads and writes) before doing its own work.
I changed it to use RwLock instead. Before the change, fetching a block with 2,400 orders took 18 minutes; now it takes 18 seconds with 400 concurrent requests (a sketch of the resulting access pattern follows below).

  • Also lint the file
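A minimal sketch of the access pattern this change aims for, with made-up names (Address, fetch_nonce_onchain, nonce_for) rather than the actual rbuilder code: the cache is locked only for the quick lookup and the quick insert, never across the network await, so misses for different addresses can be fetched concurrently.

use std::collections::HashMap;
use std::sync::RwLock;

// Placeholder types/functions; only the locking pattern matters here.
type Address = [u8; 20];

async fn fetch_nonce_onchain(_address: Address) -> u64 {
    // stand-in for a get_transaction_count(..) call pinned to the parent block
    0
}

async fn nonce_for(cache: &RwLock<HashMap<Address, u64>>, address: Address) -> u64 {
    // 1. Short read lock: many tasks can check the cache at the same time.
    if let Some(n) = cache.read().expect("nonce cache lock poisoned").get(&address) {
        return *n;
    }
    // 2. No lock is held here, so concurrent misses are fetched in parallel.
    let n = fetch_nonce_onchain(address).await;
    // 3. Short write lock: or_insert keeps a value another task may have
    //    stored in the meantime.
    *cache
        .write()
        .expect("nonce cache lock poisoned")
        .entry(address)
        .or_insert(n)
}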

💡 Motivation and Context


✅ I have completed the following steps:

  • [x] Run make lint
  • [x] Run make test
  • [ ] Added tests (if applicable)

@ghost ghost requested review from dvush and ZanCorDX as code owners July 7, 2024 17:26
@ghost ghost changed the title Fix: change from Mutex to RwLock to filter_order_by_nonces concurrently Fix: Use lock properly in filter_order_by_nonces to do it concurrently Jul 7, 2024

github-actions bot commented Jul 7, 2024

Benchmark results for 465eadb

Report: https://flashbots-rbuilder-ci-stats.s3.us-east-2.amazonaws.com/benchmark/465eadb-db634fc/report/index.html

Date (UTC): 2024-07-13T16:06:03+00:00
Commit: 465eadbd19a1cb2d62f8fe829e819e7db6e1a902
Base SHA: db634fcb90a6b2fc484e19f02a59a5b74ffeefa2

Significant changes

None

Member

@bertmiller bertmiller left a comment


Thanks for this PR! Really appreciate it and good catch :)

I think there is a potential race condition in the nonce cache update logic. Currently, the code checks if a value exists, fetches it if not, then writes it back. This could lead to overwriting more recent values in concurrent scenarios: imagine if there were two transactions from account X in different threads.

Here's a scenario that illustrates the problem:

  • Thread A checks the cache for address X, doesn't find it
  • Thread B checks the cache for address X, doesn't find it
  • Thread A fetches nonce N1 from the network for address X
  • Thread B fetches nonce N2 from the network for address X (N2 > N1 because it's more recent)
  • Thread B writes N2 to the cache
  • Thread A writes N1 to the cache, overwriting the more recent N2

What do you think about using a single write lock operation that checks and updates atomically? Something like this:

// Take the write lock once and hold it for the whole check-fetch-insert
// sequence, so no other task can interleave between the check and the insert.
// Note that the guard then stays held across the network await, so concurrent
// cache misses are serviced one at a time.
let mut nonce_cache = nonce_cache
    .write()
    .map_err(|e| eyre::eyre!("Failed to acquire write lock: {}", e))?;
if !nonce_cache.contains_key(&nonce.address) {
    let address = nonce.address;
    let onchain_nonce = self
        .eth_provider
        .get_transaction_count(address)
        .block_id(BlockId::Number(parent_block.into()))
        .await
        .wrap_err("Failed to fetch onchain tx count")?;
    nonce_cache.insert(nonce.address, onchain_nonce);
}
let res_onchain_nonce = *nonce_cache.get(&nonce.address).unwrap();

Review comment on crates/rbuilder/src/backtest/fetch/mod.rs (outdated, resolved)
@ghost
Author

ghost commented Jul 8, 2024

@bertmiller spot-on analysis that there is an overwrite of the nonce value for the same address X. However, the nonce being fetched is the nonce of an address at the parent block, so it should be the same value regardless of the order in which the onchain requests return.
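A toy illustration of that argument, not rbuilder code (nonce_at_block stands in for a get_transaction_count(address).block_id(parent_block) call): because every fetch queries state at the same fixed parent block, concurrent fetches for one address return equal values, so whichever cache write lands last stores the same number.

use std::collections::HashMap;

// The answer depends only on (address, block), not on when it is asked.
fn nonce_at_block(state_at_parent_block: &HashMap<&str, u64>, address: &str) -> u64 {
    *state_at_parent_block.get(address).unwrap_or(&0)
}

fn main() {
    let state_at_parent_block = HashMap::from([("0xabc", 7u64)]);
    let mut cache: HashMap<&str, u64> = HashMap::new();

    // "Thread A" and "Thread B" both miss the cache and fetch independently.
    let n1 = nonce_at_block(&state_at_parent_block, "0xabc");
    let n2 = nonce_at_block(&state_at_parent_block, "0xabc");
    assert_eq!(n1, n2); // same block, same address => same nonce

    // Whichever insert happens last, the cached value is identical;
    // the race only wastes one duplicate fetch, it cannot store a stale value.
    cache.insert("0xabc", n1);
    cache.insert("0xabc", n2);
    assert_eq!(cache["0xabc"], 7);
}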

@ZanCorDX
Contributor

Thanks!!!!!
I believe the change from Mutex to RwLock is unimportant, but the great catch is not keeping the lock while calling get_transaction_count.
Can you remove the is_none and the 2 unwraps in favor of an if let?

@ghost ghost force-pushed the fix_concurrent_filter_by_nonces branch from b5ff532 to b671c02 Compare July 11, 2024 17:08
@ghost ghost force-pushed the fix_concurrent_filter_by_nonces branch from b671c02 to 99fde4a Compare July 11, 2024 22:27
@ghost
Author

ghost commented Jul 12, 2024

Thanks!!!!! I believe the change from Mutex to RwLock is unimportant, but the great catch is not keeping the lock while calling get_transaction_count. Can you remove the is_none and the 2 unwraps in favor of an if let?

Thanks for your feedback. For the is_none() check, we don't intend to read the value, only to write it if it's missing; and for the 2 unwrap() calls, the value is guaranteed to be there, so I don't think we need an if let here.

@ZanCorDX
Contributor

Thanks!!!!! I believe the change from Mutex to RwLock is unimportant, but the great catch is not keeping the lock while calling get_transaction_count. Can you remove the is_none and the 2 unwraps in favor of an if let?

Thanks for your feedback. For the is_none() check, we don't intend to read the value, only to write it if it's missing; and for the 2 unwrap() calls, the value is guaranteed to be there, so I don't think we need an if let here.

It's just to make the code more robust: in most cases Rust lets us avoid unwrap(), which might fail if something changes in the future.
It would be like this:

let res_onchain_nonce = if let Some(res_onchain_nonce) = res_onchain_nonce {
    // Cache hit: reuse the previously looked-up value.
    res_onchain_nonce
} else {
    // Cache miss: fetch from the network without holding the lock.
    let address = nonce.address;
    let onchain_nonce = self
        .eth_provider
        .get_transaction_count(address)
        .block_id(BlockId::Number(parent_block.into()))
        .await
        .wrap_err("Failed to fetch onchain tx count")?;

    // Take the write lock only for the insert; or_insert keeps whatever a
    // concurrent task may already have stored (same parent block, same value).
    if let Ok(mut nonce_cache) = nonce_cache.write() {
        nonce_cache.entry(address).or_insert(onchain_nonce);
    }
    onchain_nonce
};
if res_onchain_nonce > nonce.nonce && !nonce.optional {

This way the code enforces correctness at compile time rather than at runtime, just by the shape of the logic.
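A standalone toy contrast of the two styles (not the PR's code): with is_none() plus unwrap(), correctness rests on two separate calls agreeing, which only the author's reasoning guarantees; with if let, the binding exists only in the branch where the value exists, so the compiler enforces it.

use std::collections::HashMap;

// Runtime-checked: the unwrap() is only safe because of the is_none() check
// a few lines above; nothing stops a future edit from breaking that pairing.
fn lookup_unwrap(cache: &HashMap<&str, u64>, key: &str) -> u64 {
    if cache.get(key).is_none() {
        return 0; // pretend we fetched and inserted here
    }
    *cache.get(key).unwrap()
}

// Compile-time-checked: the value is bound exactly when it is present.
fn lookup_if_let(cache: &HashMap<&str, u64>, key: &str) -> u64 {
    if let Some(v) = cache.get(key) {
        *v
    } else {
        0 // pretend we fetched and inserted here
    }
}

fn main() {
    let cache = HashMap::from([("0xabc", 7u64)]);
    assert_eq!(lookup_unwrap(&cache, "0xabc"), lookup_if_let(&cache, "0xabc"));
}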

@ghost
Author

ghost commented Jul 13, 2024

It makes sense.

@ghost ghost force-pushed the fix_concurrent_filter_by_nonces branch from e6f0c9b to 465eadb Compare July 13, 2024 16:05
@ZanCorDX ZanCorDX merged commit 824cdab into flashbots:develop Jul 15, 2024
2 checks passed
@ghost ghost deleted the fix_concurrent_filter_by_nonces branch July 15, 2024 16:11
MoeMahhouk pushed a commit that referenced this pull request Jul 29, 2024
Fix: Use lock properly in filter_order_by_nonces to do it concurrently (#38)
