[data] fix a race condition issue in OpBufferQueue #43015

Merged

@raulchen raulchen commented Feb 6, 2024

Why are these changes needed?

Fix the following error.

            # TODO(hchen): Index the queue by output_split_idx to
            # avoid linear scan.
            for i in range(len(self._queue)):
>               ref = self._queue[i]
E               IndexError: deque index out of range

This is a race condition introduced by #42601.

This PR fixes the bug and also adds indexing to avoid inefficient linear scans.
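
The failure mode in the traceback above can be reproduced without threads. The following is a minimal sketch, not the Ray code: `scan_with_stale_length` and `concurrent_pop` are hypothetical names, and the callback only simulates what another thread might do between the `len()` call and the index access.

```python
from collections import deque


def scan_with_stale_length(queue, on_iteration):
    """Mimic the old linear scan: the range is computed once, up front."""
    for i in range(len(queue)):
        on_iteration(i)  # stand-in for whatever another thread does meanwhile
        _ = queue[i]     # raises IndexError if the deque shrank below i + 1


queue = deque(["a", "b", "c"])


def concurrent_pop(i):
    # Hypothetical interleaving: another consumer pops one item
    # during the first iteration of the scan.
    if i == 0:
        queue.popleft()


try:
    scan_with_stale_length(queue, concurrent_pop)
    raised = False
except IndexError:
    # The scan still expects index 2, but only two items remain.
    raised = True
```

The scan fails on its last iteration because the loop bound was fixed before the deque shrank. Consuming items with `popleft()` instead of indexing avoids depending on a stale length.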

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Hao Chen <[email protected]>
@raulchen raulchen changed the title [data] fix race condition in OpBufferQueue [data] fix a race condition issue in OpBufferQueue Feb 6, 2024
ref = self._queue.popleft()
self._outputs_by_split[ref.output_split_idx].append(ref)
try:
    ret = split_queue.popleft()
Contributor

Just one question: why don't we need a lock here?

Contributor Author

1) deque is thread-safe. 2) At most one thread accesses this queue.

Previously, the issue was that the length of self._queue could change between evaluating range(len(self._queue)) and accessing self._queue[i].
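
As a rough illustration of the indexing approach discussed in this thread, here is a sketch under stated assumptions. `Bundle` and `SplitIndexedQueue` are hypothetical names made up for this example, and the class is not the merged Ray implementation; only `_queue`, `_outputs_by_split`, and `output_split_idx` are taken from the snippet under review. Items are moved with `popleft()` and an `IndexError` guard, so no step relies on a previously computed length.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bundle:
    """Hypothetical stand-in for the queued item."""
    payload: str
    output_split_idx: int


class SplitIndexedQueue:
    """Hypothetical sketch of a queue indexed by output split."""

    def __init__(self) -> None:
        self._queue = deque()
        self._outputs_by_split = defaultdict(deque)

    def append(self, ref: Bundle) -> None:
        self._queue.append(ref)

    def pop(self, output_split_idx: int) -> Optional[Bundle]:
        split_queue = self._outputs_by_split[output_split_idx]
        if not split_queue:
            # Drain the shared queue into the per-split queues.
            # popleft() either returns an item or raises IndexError,
            # so a concurrent change in length cannot cause a bad index.
            while True:
                try:
                    ref = self._queue.popleft()
                except IndexError:
                    break
                self._outputs_by_split[ref.output_split_idx].append(ref)
        try:
            return split_queue.popleft()
        except IndexError:
            return None


q = SplitIndexedQueue()
q.append(Bundle("x", 1))
q.append(Bundle("y", 0))
q.append(Bundle("z", 1))

first_split_0 = q.pop(0)   # drains the shared queue, returns the split-0 item
first_split_1 = q.pop(1)   # served from the per-split queue, no scan
second_split_1 = q.pop(1)
empty_split_0 = q.pop(0)   # nothing left for split 0
```

Each per-split lookup is a dictionary access followed by a `popleft()`, which replaces the linear scan over the shared queue.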

Contributor

Got it. Can you add a comment explaining that?

@raulchen raulchen merged commit f47a816 into ray-project:master Feb 6, 2024
9 checks passed
@raulchen raulchen deleted the fix-op-buffer-queue-multi-threading branch February 6, 2024 23:43
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Apr 4, 2024
lee1258561 added a commit to pinterest/ray that referenced this pull request Apr 10, 2024
…ject#43015) (#2)

---------
Backporting ray-project#43015 to fix:

 IndexError: deque index out of range

Build PR against OSS: ray-project#44469

Signed-off-by: Hao Chen <[email protected]>
Co-authored-by: Hao Chen <[email protected]>