Fix pull manager deadlock due to object reconstruction #24791

Merged
merged 11 commits into from
May 18, 2022

Collaborator

@jjyao jjyao commented May 13, 2022

Why are these changes needed?

When an object is under reconstruction, the pull manager keeps its bundle request active with no timeout. That can block the next bundle request, which is the one needed for the reconstruction itself, and the result is a deadlock.

For example, task 1 takes object A as an argument and returns object B, and task 2 takes object B as an argument. When we run task 2, the pull manager adds B to the queue, and then B is lost. Task 1 is then resubmitted, and A is added to the pull manager queue after B (assuming both tasks are scheduled to the same node). Because available object store memory is limited, A cannot be activated until B is pulled, but B cannot be pulled until A is pulled and B is reconstructed.

The solution: if an active pull request has pending-creation objects, the pull manager deactivates it until creation is done. This frees the object store memory held by the currently active pull request, so later requests can proceed and potentially unblock the object creation.
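
The A/B scenario above can be replayed with a small toy model. This is only a sketch under stated assumptions, not the code in this PR: `ToyPullManager`, `ToyBundleRequest`, `RequestForAIsActivated`, the 100-byte capacity, and the 80-byte object sizes are all hypothetical names and numbers chosen for illustration. The model activates requests oldest first within a fixed memory budget, and optionally skips requests that contain pending-creation objects.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy model only: names and structure are hypothetical, not Ray's PullManager.
struct ToyBundleRequest {
  std::vector<std::string> objects;  // object ids the bundle needs
  int64_t num_bytes;                 // memory reserved while the request is active
};

class ToyPullManager {
 public:
  ToyPullManager(int64_t capacity_bytes, bool deactivate_pending_creation)
      : capacity_bytes_(capacity_bytes),
        deactivate_pending_creation_(deactivate_pending_creation) {}

  // Queues a bundle request and returns its id; ids grow in arrival order.
  uint64_t Pull(std::vector<std::string> objects, int64_t num_bytes) {
    assert(num_bytes >= 0);
    const uint64_t id = next_id_++;
    requests_[id] = ToyBundleRequest{std::move(objects), num_bytes};
    UpdateActiveRequests();
    return id;
  }

  // Marks an object as under reconstruction (true) or created again (false).
  void SetPendingCreation(const std::string &object, bool pending) {
    if (pending) {
      pending_creation_.insert(object);
    } else {
      pending_creation_.erase(object);
    }
    UpdateActiveRequests();
  }

  bool IsActive(uint64_t id) const { return active_.count(id) > 0; }

 private:
  bool HasPendingCreation(const ToyBundleRequest &request) const {
    for (const std::string &object : request.objects) {
      if (pending_creation_.count(object) > 0) return true;
    }
    return false;
  }

  // Re-derives the active set, oldest request first, within the memory budget.
  void UpdateActiveRequests() {
    active_.clear();
    int64_t used = 0;
    for (const auto &entry : requests_) {
      const ToyBundleRequest &request = entry.second;
      if (deactivate_pending_creation_ && HasPendingCreation(request)) {
        continue;  // the fix: hold no memory for objects still being created
      }
      if (used + request.num_bytes > capacity_bytes_) break;  // keep FIFO order
      used += request.num_bytes;
      active_.insert(entry.first);
    }
  }

  int64_t capacity_bytes_;
  bool deactivate_pending_creation_;
  uint64_t next_id_ = 0;
  std::map<uint64_t, ToyBundleRequest> requests_;
  std::set<std::string> pending_creation_;
  std::set<uint64_t> active_;
};

// Replays the A/B scenario; returns whether the request for A gets activated.
bool RequestForAIsActivated(bool deactivate_pending_creation) {
  ToyPullManager manager(/*capacity_bytes=*/100, deactivate_pending_creation);
  manager.Pull({"B"}, 80);                     // task 2 needs B
  manager.SetPendingCreation("B", true);       // B is lost; task 1 is resubmitted
  const uint64_t a = manager.Pull({"A"}, 80);  // task 1 needs A, queued after B
  return manager.IsActive(a);
}
```

In this model, `RequestForAIsActivated(false)` returns `false`: the request for B stays active, holds 80 of the 100 bytes, and A never fits, which is the deadlock. `RequestForAIsActivated(true)` returns `true`, because the request for B is skipped while B is pending creation and A can be activated.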

Related issue number

Closes #13689

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@stephanie-wang stephanie-wang left a comment

I like the high-level change! One thing I'm not sure about is if it will matter that we're no longer preserving order between requests as they go between active and inactive. To be safe, I think we should use map instead of set to store active/inactive requests. One scenario where it could matter, for example, is if we have a large request that has been active for a long time, and we end up deactivating that over another later request.

@stephanie-wang
Contributor

^ Ignore that, I forgot std::set is ordered :)
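
A quick way to see why ordering is preserved: `std::set` iterates its keys in sorted order regardless of insertion order. The sketch below assumes requests are keyed by monotonically increasing integer ids (an assumption for illustration; `ReactivateOldest` and `NewestRequest` are hypothetical helpers, not functions from this PR).

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper: moves the oldest request to the inactive set and back,
// then returns the iteration order of the active set.
std::vector<uint64_t> ReactivateOldest() {
  std::set<uint64_t> active = {1, 2, 3};
  std::set<uint64_t> inactive;

  // Deactivate request 1, the oldest.
  active.erase(1);
  inactive.insert(1);

  // Reactivate it later; it is inserted last but still sorts first.
  inactive.erase(1);
  active.insert(1);

  return std::vector<uint64_t>(active.begin(), active.end());
}

// Hypothetical helper: the largest id is the most recently queued request,
// so it is the natural candidate to deactivate first.
uint64_t NewestRequest(const std::set<uint64_t> &active) {
  assert(!active.empty());
  return *active.rbegin();
}
```

`ReactivateOldest()` yields the ids in the order 1, 2, 3, so an old request that was deactivated and reactivated still sorts ahead of later requests.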

@rkooo567 rkooo567 self-assigned this May 16, 2022
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 16, 2022
@jjyao jjyao changed the title [WIP] Fix pull manager deadlock due to object reconstruction Fix pull manager deadlock due to object reconstruction May 16, 2022
@jjyao jjyao removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 16, 2022
@@ -141,19 +139,16 @@ class PullManager {
   void ResetRetryTimer(const ObjectID &object_id);

   /// The number of ongoing object pulls.
-  int NumActiveRequests() const;
+  int NumObjectPullRequests() const;
Contributor

Was this just a badly named method before?

Collaborator Author

Yes, just bad naming. We are counting object_pull_requests_, not active_object_pull_requests_.

@jjyao jjyao requested a review from stephanie-wang May 16, 2022 23:18
Contributor

@stephanie-wang stephanie-wang left a comment

Nice, this is great! Let's also check if it's working on the chaos test in parallel?

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 17, 2022
@jjyao jjyao removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 17, 2022
@jjyao jjyao added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label May 18, 2022
@jjyao jjyao merged commit 5128029 into ray-project:master May 18, 2022
@jjyao jjyao deleted the jjyao/deadlock branch May 18, 2022 20:43
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Development

Successfully merging this pull request may close these issues.

[core] Lineage reconstruction fails due to deadlock in object pulling
3 participants