[core] Lineage reconstruction fails due to deadlock in object pulling #13689

Closed
stephanie-wang opened this issue Jan 25, 2021 · 2 comments · Fixed by #24791
Assignees
Labels
bug Something that is supposed to be working; but isn't P2 Important issue, but not time-critical
Milestone

Comments

@stephanie-wang
Contributor

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS): 1.2dev

Object reconstruction can fail because of a deadlock introduced in #13514. Failed objects may not have a known size, so tasks that depend on a failed object may block later tasks that need to execute in order to recreate that object.

Reproduction (REQUIRED)

Skipped tests in test_reconstruction.py.

@stephanie-wang stephanie-wang added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 25, 2021
@stephanie-wang stephanie-wang added this to the Core Bugs milestone Jan 25, 2021
@stephanie-wang stephanie-wang self-assigned this Jan 25, 2021
@stephanie-wang stephanie-wang modified the milestones: Core Bugs, IO Bugs Feb 14, 2021
@ericl ericl added P0 Issues that should be fixed in short order and removed P2 Important issue, but not time-critical labels Jul 20, 2021
@ericl
Contributor

ericl commented Jul 20, 2021

Seems like this can affect non-reconstruction workloads in general, if task arg bundles are queued up in the wrong order.

@ericl ericl assigned ericl and unassigned stephanie-wang Jul 20, 2021
@ericl ericl added P1 Issue that should be fixed within a few weeks and removed P0 Issues that should be fixed in short order labels Jul 20, 2021
@ericl ericl removed their assignment Jul 29, 2021
@stephanie-wang stephanie-wang added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Oct 20, 2021
@jjyao jjyao self-assigned this May 6, 2022
@jjyao
Collaborator

jjyao commented May 16, 2022

When an object is under reconstruction, the pull manager keeps its bundle request active with no timeout, which may block the next bundle request that's needed for the object reconstruction. The result is a deadlock.

For example, task 1 takes object A as an argument and returns object B, and task 2 takes object B as an argument. When we run task 2, the pull manager adds B to the queue, and then B is lost. Task 1 is then re-submitted and A is added to the pull manager queue after B (assuming both tasks are scheduled to the same node). Because available object store memory is limited, A cannot be activated until B is pulled, but B cannot be pulled until A is pulled and B is reconstructed.
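
To make the ordering problem concrete, here is a toy model of the scenario above. It is only a sketch and not Ray's actual pull manager: `Bundle`, `run_pull_queue`, the byte sizes, and the strict FIFO activation rule are all assumptions made up for illustration. It shows that with room for only one bundle, queueing B's bundle ahead of A's bundle leaves both tasks stuck, while the reverse order completes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Bundle:
    """Hypothetical pull request covering all arguments of one task."""
    task: str
    objects: tuple
    size: int  # assumed object store bytes reserved while the bundle is active


def run_pull_queue(bundles, capacity, available, produces, max_steps=10):
    """Activate bundles in strict FIFO order, never skipping the queue head.

    available: ids of objects that currently exist somewhere in the cluster.
    produces:  maps a task name to the object id it creates when it runs.
    Returns (tasks that ran, tasks left stuck).
    """
    available = set(available)  # copy so the caller's set is not mutated
    queue = deque(bundles)
    active, finished = [], []
    for _ in range(max_steps):
        # Activate from the head of the queue while memory allows.
        used = sum(b.size for b in active)
        while queue and used + queue[0].size <= capacity:
            bundle = queue.popleft()
            active.append(bundle)
            used += bundle.size
        # An active bundle can be pulled only if all of its objects exist.
        runnable = [b for b in active if all(o in available for o in b.objects)]
        if not runnable:
            break  # nothing can make progress
        for b in runnable:
            active.remove(b)  # the task runs and releases its reservation
            finished.append(b.task)
            available.add(produces[b.task])
    stuck = [b.task for b in active] + [b.task for b in queue]
    return finished, stuck


produces = {"task1": "B", "task2": "C"}
pull_b = Bundle("task2", ("B",), 100)  # queued first, then B is lost
pull_a = Bundle("task1", ("A",), 100)  # queued when task 1 is re-submitted

# B's bundle holds all the memory with no timeout, so A is never activated.
deadlocked = run_pull_queue([pull_b, pull_a], 100, {"A"}, produces)
# With A's bundle first, task 1 recreates B and task 2 can then run.
ok = run_pull_queue([pull_a, pull_b], 100, {"A"}, produces)
print(deadlocked, ok)
```

In this model the deadlock goes away if the active bundle for the lost object can be deactivated or bypassed; which of those the real fix does is not stated in this thread.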

3 participants