
# [backport] [object_buffer_pool] reduce mutex lock scope in WriteChunk (#43150) #1

## Conversation

**iamyangchen** commented on Mar 25, 2024

…ct#43434)

Signed-off-by: Saurabh Vishwas Joshi <[email protected]>

## Why are these changes needed?
Object store network transfer performance is slow, and we observe periodic bursts of network usage separated by gaps.

A burst of inbound network traffic occurs at the beginning of each `ray.get(obj_refs)` call, followed by a long stretch in which the network sits idle, and then another burst at the start of the next `ray.get(obj_refs)` call.

This points to a processing bottleneck on the pull side when payloads are received over the network.

We dug into the code in [object_manager](https://github.com/ray-project/ray/tree/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager) and found the following:

- Objects are transferred in chunks of size 5 MiB.
- When a `PushRequest` is received for a chunk, it is processed by [ObjectManager::HandlePush](https://github.com/ray-project/ray/blob/7ff3969159d3aeac00415ac26bf96a63f782db86/src/ray/object_manager/object_manager.cc#L562).
- `HandlePush` internally calls [ObjectManager::ReceiveObjectChunk](https://github.com/ray-project/ray/blob/7ff3969159d3aeac00415ac26bf96a63f782db86/src/ray/object_manager/object_manager.cc#L623), which results in a call to [ObjectBufferPool::WriteChunk](https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.h#L128).
- The `WriteChunk` function is [mutex-guarded](https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.cc#L122) throughout its execution.
- This includes the [std::memcpy](https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.cc#L139) call for a 5 MiB payload.
- The [pool_mutex_](https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.h#L215) lock is shared by all object IDs being received over the network.

This leads us to believe that even if chunks for different `ObjectID`s are received in parallel over the network, they are written sequentially, which would explain the bursts of network usage followed by idle gaps.
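
To make the serialization point concrete, here is a minimal sketch of the pre-change pattern. It is not the actual Ray source: the class name `BufferPoolBefore` and the simplified `WriteChunk` signature are stand-ins, and chunk lookup and state bookkeeping are elided.

```cpp
#include <cstring>
#include <mutex>
#include <string>

// Simplified stand-in for ObjectBufferPool: a single mutex is shared by all
// objects, and it is held across the entire chunk write.
class BufferPoolBefore {
 public:
  void WriteChunk(char *chunk_dst, const std::string &data) {
    std::lock_guard<std::mutex> guard(pool_mutex_);    // held for the whole call
    std::memcpy(chunk_dst, data.data(), data.size());  // ~5 MiB copy under the lock
    // ... chunk state transition and Seal / Release decisions, still under the lock ...
  }

 private:
  std::mutex pool_mutex_;  // shared across every object_id being received
};
```

With this structure, two chunks arriving concurrently for different objects are still copied one after the other.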

### Changes

**WriteChunk**
- Transition the chunk from `REFERENCED` to `SEALED` before releasing the lock
- Increment / decrement `num_inflight_copies` before / after the copy
- Perform an unguarded `memcpy` of the chunk into the buffer
- Reacquire the mutex lock and perform the object_id-level `Seal` and `Release` decisions (see the combined sketch after the AbortCreate notes below)

**AbortCreate**
- Wait to ensure `num_inflight_copies == 0` before allowing the `Release` & `Abort` calls for the `object_id`
- This check ensures that we do not release the underlying buffer while an unguarded copy is ongoing
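
Below is a minimal, self-contained sketch of the locking pattern described by the two lists above. It is not the actual Ray implementation: the class name, the simplified signatures, and the use of `std::condition_variable` are illustrative assumptions, and the real change tracks `num_inflight_copies` per object buffer rather than pool-wide as shown here.

```cpp
#include <condition_variable>
#include <cstring>
#include <mutex>
#include <string>

// Simplified stand-in for ObjectBufferPool. Chunk lookup, the REFERENCED ->
// SEALED state machine, and the plasma Seal / Release calls are reduced to comments.
class BufferPoolSketch {
 public:
  void WriteChunk(char *chunk_dst, const std::string &data) {
    {
      std::lock_guard<std::mutex> guard(pool_mutex_);
      // Mark the chunk SEALED and account for the pending copy while the
      // lock is held, so concurrent writers and AbortCreate can see it.
      ++num_inflight_copies_;
    }

    // Unguarded copy: chunks for different object_ids no longer serialize
    // on the pool-wide mutex for the ~5 MiB memcpy.
    std::memcpy(chunk_dst, data.data(), data.size());

    {
      std::lock_guard<std::mutex> guard(pool_mutex_);
      --num_inflight_copies_;
      // Object-level Seal / Release decisions happen here, back under the lock.
    }
    copies_done_.notify_all();  // wake any AbortCreate waiting on in-flight copies
  }

  void AbortCreate() {
    std::unique_lock<std::mutex> guard(pool_mutex_);
    // Wait until no unguarded copy is touching the buffer before releasing it.
    copies_done_.wait(guard, [this] { return num_inflight_copies_ == 0; });
    // ... Release and Abort the object's buffer here ...
  }

 private:
  std::mutex pool_mutex_;
  std::condition_variable copies_done_;
  int num_inflight_copies_ = 0;  // per-object in the real change; pool-wide here
};
```

The pool mutex now guards only the bookkeeping; the large copy runs outside it, so chunks for different objects can be written in parallel, and `AbortCreate` cannot release a buffer while an unguarded copy is still writing into it.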


### Tests

#### Before

- Sampled network at 1 second frequency

`speedometer.py -i 1 -m 64424509440 -n 1073741824 -rx eth0`
<img width="590" alt="Screenshot 2024-02-13 at 10 26 33 AM" src="https://github.com/ray-project/ray/assets/8691593/7a5497dd-b87d-4bec-a51c-62d629c06c58">

- Sampled network at 100 millisecond frequency

`speedometer.py -i 0.1 -m 64424509440 -n 1073741824 -rx eth0 `
<img width="656" alt="Screenshot 2024-02-13 at 10 26 37 AM" src="https://github.com/ray-project/ray/assets/8691593/5ac229e6-4777-4820-b9e7-ebbd63cafcc9">


#### After

- Sampled network at 1 second frequency
`speedometer.py -i 1 -m 64424509440 -n 1073741824 -rx eth0`
<img width="620" alt="Screenshot 2024-02-21 at 1 22 08 PM" src="https://github.com/ray-project/ray/assets/8691593/f15196d6-01f7-4d3e-b56c-ba09c58be455">


- Sampled network at 100ms frequency
`speedometer.py -i 0.1 -m 64424509440 -n 1073741824 -rx eth0` 
<img width="1674" alt="Screenshot 2024-02-21 at 1 19 10 PM" src="https://github.com/ray-project/ray/assets/8691593/584f64b9-a39f-4441-b034-86234aff4283">

```
Finished benchmark for total_size_MiB: 102400, block_size_MiB: 1024, parallel_block: None
	ray.wait(fetch_local=True) Gbps: 54.67178658153866
	ray.wait(fetch_local=True) total time s: 15.711823463439941
	ray.get() Gbps: 186301.24111362518
	ray.get() total time s: 0.004610776901245117
```

## Related issue number
@iamyangchen force-pushed the cyang/pinterest-main-2.9.3-backport-mutex-fix branch from da4113b to 7692d63 on Mar 25, 2024 at 21:47.

@iamyangchen merged commit dd40e65 into pinterest/main-2.9.3 on Apr 12, 2024.