Ray version and other system information (Python version, TensorFlow version, OS): 1.2dev
Object reconstruction can fail due to a deadlock introduced in #13514. The problem is that failed objects may not have a size, so tasks that depend on a failed object may block later tasks that need to execute in order to recreate the object.
Reproduction (REQUIRED)
Skipped tests in test_reconstruction.py.
stephanie-wang added the bug, triage, and P2 labels and removed the triage label on Jan 25, 2021.
When an object is under reconstruction, the pull manager keeps its bundle request active with no timeout, which may block the next bundle request that is needed for the object reconstruction. The result is a deadlock.
For example, task 1 takes object A as its argument and returns object B, and task 2 takes object B as its argument. When we run task 2, the pull manager adds B to the queue, and then B is lost. Task 1 is then re-submitted, and A is added to the pull manager queue after B (assuming both tasks are scheduled to the same node). Because available object store memory is limited, A cannot be activated until B is pulled, but B cannot be pulled until A is pulled and B is reconstructed.
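The circular wait in this example can be reproduced with a small toy model. This is only a sketch and not Ray's pull manager: `run_pull_queue`, its parameters, and the 60-byte sizes are hypothetical, and it assumes strict FIFO activation, a memory reservation held without timeout until the pull finishes, and a finished pull releasing its reservation.

```python
from collections import deque


def run_pull_queue(requests, sizes, available, recreate, capacity):
    """Toy FIFO pull queue (hypothetical, not Ray's implementation).

    requests:  object IDs in queue order
    sizes:     object ID -> bytes reserved once the request is active
    available: object IDs that currently exist somewhere in the cluster
    recreate:  lost object ID -> argument its producing task needs locally
    capacity:  object store bytes available for active requests
    Returns the set of objects pulled before progress stops.
    """
    pending = deque(requests)
    active = []  # activated but not yet pulled; holds memory with no timeout
    pulled = set()
    available = set(available)
    used = 0

    progress = True
    while progress:
        progress = False
        # Activate requests strictly in queue order while memory allows.
        while pending and used + sizes[pending[0]] <= capacity:
            obj = pending.popleft()
            active.append(obj)
            used += sizes[obj]
            progress = True
        # Re-run a producing task once its argument has been pulled.
        for lost, arg in recreate.items():
            if lost not in available and arg in pulled:
                available.add(lost)
                progress = True
        # Complete every active pull whose object exists.
        for obj in list(active):
            if obj in available:
                active.remove(obj)
                pulled.add(obj)
                used -= sizes[obj]  # simplification: reservation is released
                progress = True
    return pulled


sizes = {"A": 60, "B": 60}
# B was queued before it was lost, so the re-submitted task's argument A sits behind it.
stuck = run_pull_queue(["B", "A"], sizes, {"A"}, {"B": "A"}, capacity=100)
# With A ahead of B, task 1 can re-run and B is reconstructed.
ok = run_pull_queue(["A", "B"], sizes, {"A"}, {"B": "A"}, capacity=100)
print(sorted(stuck))  # []
print(sorted(ok))     # ['A', 'B']
```

In the first call B holds 60 of the 100 bytes and never completes, so A is never activated and nothing is pulled. Reordering the queue, as in the second call, is enough to break the cycle in this model.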