
# [backport] [object_buffer_pool] reduce mutex lock scope in WriteChunk (#43150) #1

## Conversation

**iamyangchen** commented on Mar 25, 2024

…ct#43434)

Signed-off-by: Saurabh Vishwas Joshi <[email protected]>

## Why are these changes needed?
Object store network transfer performance is slow, and we observe periodic bursts of network usage separated by gaps.

A burst of inbound network traffic occurs at the beginning of each `ray.get(obj_refs)` call, followed by a long stretch in which the network sits idle, and then another burst at the start of the next `ray.get(obj_refs)` call.

This points to a processing bottleneck on the pull side when payloads are received over the network.

We dug into the code in [object_manager](https://github.com/ray-project/ray/tree/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager) and found the following:

- Objects are transferred in chunks of size 5 MiB.
- When a `PushRequest` is received for a chunk, it is processed by [ObjectManager::HandlePush](https://github.com/ray-project/ray/blob/7ff3969159d3aeac00415ac26bf96a63f782db86/src/ray/object_manager/object_manager.cc#L562).
- `HandlePush` internally calls [ObjectManager::ReceiveObjectChunk](https://github.com/ray-project/ray/blob/7ff3969159d3aeac00415ac26bf96a63f782db86/src/ray/object_manager/object_manager.cc#L623), which results in a call to [ObjectBufferPool::WriteChunk](https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.h#L128).
- The `WriteChunk` function is [mutex-guarded](https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.cc#L122) throughout its execution.
- This includes the [std::memcpy](https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.cc#L139) call for a 5 MiB payload.
- The [pool_mutex_](https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.h#L215) lock is shared by all object IDs being received over the network.

This leads us to believe that even if chunks for different `ObjectID`s are received in parallel over the network, they are written sequentially, which would explain the bursts of network usage followed by idle gaps.
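
To make the serialization point concrete, here is a minimal sketch of the pre-change pattern. It is not the actual Ray source: the class name `BufferPoolBefore` and the simplified `WriteChunk` signature are stand-ins, and chunk lookup and state bookkeeping are elided.

```cpp
#include <cstring>
#include <mutex>
#include <string>

// Simplified stand-in for ObjectBufferPool: a single mutex is shared by all
// objects, and it is held across the entire chunk write.
class BufferPoolBefore {
 public:
  void WriteChunk(char *chunk_dst, const std::string &data) {
    std::lock_guard<std::mutex> guard(pool_mutex_);    // held for the whole call
    std::memcpy(chunk_dst, data.data(), data.size());  // ~5 MiB copy under the lock
    // ... chunk state transition and Seal / Release decisions, still under the lock ...
  }

 private:
  std::mutex pool_mutex_;  // shared across every object_id being received
};
```

With this structure, two chunks arriving concurrently for different objects are still copied one after the other.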

### Changes

**WriteChunk**
- Transition the chunk from `REFERENCED` to `SEALED` before releasing the lock
- Increment / decrement `num_inflight_copies` before / after the copy
- Perform an unguarded `memcpy` of the chunk into the buffer
- Reacquire the mutex lock and perform the object_id-level `Seal` and `Release` decisions (see the combined sketch after the AbortCreate notes below)

**AbortCreate**
- Wait to ensure `num_inflight_copies == 0` before allowing the `Release` & `Abort` calls for the `object_id`
- This check ensures that we do not release the underlying buffer while an unguarded copy is ongoing
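
Below is a minimal, self-contained sketch of the locking pattern described by the two lists above. It is not the actual Ray implementation: the class name, the simplified signatures, and the use of `std::condition_variable` are illustrative assumptions, and the real change tracks `num_inflight_copies` per object buffer rather than pool-wide as shown here.

```cpp
#include <condition_variable>
#include <cstring>
#include <mutex>
#include <string>

// Simplified stand-in for ObjectBufferPool. Chunk lookup, the REFERENCED ->
// SEALED state machine, and the plasma Seal / Release calls are reduced to comments.
class BufferPoolSketch {
 public:
  void WriteChunk(char *chunk_dst, const std::string &data) {
    {
      std::lock_guard<std::mutex> guard(pool_mutex_);
      // Mark the chunk SEALED and account for the pending copy while the
      // lock is held, so concurrent writers and AbortCreate can see it.
      ++num_inflight_copies_;
    }

    // Unguarded copy: chunks for different object_ids no longer serialize
    // on the pool-wide mutex for the ~5 MiB memcpy.
    std::memcpy(chunk_dst, data.data(), data.size());

    {
      std::lock_guard<std::mutex> guard(pool_mutex_);
      --num_inflight_copies_;
      // Object-level Seal / Release decisions happen here, back under the lock.
    }
    copies_done_.notify_all();  // wake any AbortCreate waiting on in-flight copies
  }

  void AbortCreate() {
    std::unique_lock<std::mutex> guard(pool_mutex_);
    // Wait until no unguarded copy is touching the buffer before releasing it.
    copies_done_.wait(guard, [this] { return num_inflight_copies_ == 0; });
    // ... Release and Abort the object's buffer here ...
  }

 private:
  std::mutex pool_mutex_;
  std::condition_variable copies_done_;
  int num_inflight_copies_ = 0;  // per-object in the real change; pool-wide here
};
```

The pool mutex now guards only the bookkeeping; the large copy runs outside it, so chunks for different objects can be written in parallel, and `AbortCreate` cannot release a buffer while an unguarded copy is still writing into it.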


### Tests

#### Before

- Sampled network at 1 second frequency

`speedometer.py -i 1 -m 64424509440 -n 1073741824 -rx eth0`
<img width="590" alt="Screenshot 2024-02-13 at 10 26 33 AM" src="https://github.com/ray-project/ray/assets/8691593/7a5497dd-b87d-4bec-a51c-62d629c06c58">

- Sampled network at 100 millisecond frequency

`speedometer.py -i 0.1 -m 64424509440 -n 1073741824 -rx eth0 `
<img width="656" alt="Screenshot 2024-02-13 at 10 26 37 AM" src="https://github.com/ray-project/ray/assets/8691593/5ac229e6-4777-4820-b9e7-ebbd63cafcc9">


#### After

- Sampled network at 1 second frequency
`speedometer.py -i 1 -m 64424509440 -n 1073741824 -rx eth0`
<img width="620" alt="Screenshot 2024-02-21 at 1 22 08 PM" src="https://github.com/ray-project/ray/assets/8691593/f15196d6-01f7-4d3e-b56c-ba09c58be455">


- Sampled network at 100ms frequency
`speedometer.py -i 0.1 -m 64424509440 -n 1073741824 -rx eth0` 
<img width="1674" alt="Screenshot 2024-02-21 at 1 19 10 PM" src="https://github.com/ray-project/ray/assets/8691593/584f64b9-a39f-4441-b034-86234aff4283">

```
Finished benchmark for total_size_MiB: 102400, block_size_MiB: 1024, parallel_block: None
	ray.wait(fetch_local=True) Gbps: 54.67178658153866
	ray.wait(fetch_local=True) total time s: 15.711823463439941
	ray.get() Gbps: 186301.24111362518
	ray.get() total time s: 0.004610776901245117
```

## Related issue number
@iamyangchen force-pushed the cyang/pinterest-main-2.9.3-backport-mutex-fix branch from da4113b to 7692d63 on Mar 25, 2024 at 21:47.

@iamyangchen merged commit dd40e65 into pinterest/main-2.9.3 on Apr 12, 2024.