DEADLOCK: fetch->forgotten->fetch #6480

Closed · Tracked by #252 · Fixed by #6481
crusaderky opened this issue May 31, 2022 · 5 comments
Assignees: crusaderky
Labels: deadlock (The cluster appears to not make any progress)

Comments

@crusaderky (Collaborator) commented on May 31, 2022

  1. A task transitions to fetch and is added to data_needed
  2. _ensure_communicating runs, but the network is saturated so the task is not popped from data_needed
  3. Task is forgotten
  4. Task is recreated from scratch and transitioned to fetch again
  5. BUG: data_needed.push silently does nothing, because it still contains the forgotten task, which is a different TaskState instance (see the sketch after this list)
  6. _ensure_communicating runs. It pops the forgotten task and discards it.
  7. We now have a task stuck forever in fetch state. If validate=True, validate_state() fails.
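
To make steps 4-7 concrete, below is a minimal, self-contained sketch of the failure mode. KeyedTaskHeap and the stripped-down TaskState are toy stand-ins, not the real distributed classes, and assume, as step 5 implies, that deduplication happens on the key string rather than on the TaskState instance:

import heapq


class TaskState:
    """Toy stand-in for the worker's TaskState"""

    def __init__(self, key, priority=(0,)):
        self.key = key
        self.priority = priority
        self.state = "fetch"


class KeyedTaskHeap:
    """Toy stand-in for UniqueTaskHeap, deduplicating on the key string"""

    def __init__(self):
        self._known = set()  # set of key strings
        self._heap = []

    def push(self, ts):
        if ts.key not in self._known:  # dedup on ts.key, not on the instance
            self._known.add(ts.key)
            heapq.heappush(self._heap, (ts.priority, ts.key, ts))

    def pop(self):
        _, _, ts = heapq.heappop(self._heap)
        self._known.discard(ts.key)
        return ts


data_needed = KeyedTaskHeap()
old = TaskState("x")             # step 1: first incarnation, in fetch
data_needed.push(old)
old.state = "forgotten"          # step 3: forgotten, but never removed from the heap
new = TaskState("x")             # step 4: recreated from scratch, same key
data_needed.push(new)            # step 5: silently a no-op, "x" is already known
assert data_needed.pop() is old  # step 6: the forgotten instance is popped and discarded
assert new.state == "fetch"      # step 7: the new instance is stuck in fetch forever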

Log

2022-05-30 16:35:50,293 - distributed.core - ERROR - Invalid TaskState encountered for <TaskState 'inc-60d23875ec5497d5404aad1ce8fcd252' fetch>.
Story:
[
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'ensure-task-exists', 'released', 'compute-task-1653924950.2773008', 1653924950.279128),
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'released', 'fetch', 'fetch', {}, 'compute-task-1653924950.2773008', 1653924950.2794898),
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'release-key', 'worker-connect-1653924950.127911', 1653924950.2856364), 
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'fetch', 'released', 'released', {'inc-60d23875ec5497d5404aad1ce8fcd252': 'forgotten'}, 'worker-connect-1653924950.127911', 1653924950.2856448),
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'released', 'forgotten', 'forgotten', {}, 'worker-connect-1653924950.127911', 1653924950.2856493),
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'ensure-task-exists', 'released', 'compute-task-1653924950.2900314', 1653924950.292284),
	('inc-60d23875ec5497d5404aad1ce8fcd252', 'released', 'fetch', 'fetch', {}, 'compute-task-1653924950.2900314', 1653924950.2924364),
]
Traceback (most recent call last):
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4214, in validate_task
    self.validate_task_fetch(ts)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4155, in validate_task_fetch
    assert ts in self.data_needed
AssertionError

Reproducer

@gen_cluster(client=True)
async def test_forget_data_needed(c, s, a, b):
    """
    1. A task transitions to fetch and is added to data_needed
    2. _ensure_communicating runs, but the network is saturated so the task is not
       popped from data_needed
    3. Task is forgotten
    4. Task is recreated from scratch and transitioned to fetch again
    5. BUG: data_needed.push silently does nothing, because it still contains the
       forgotten task, which is a different TaskState instance
    6. _ensure_communicating runs. It pops the forgotten task and discards it.
    7. We now have a task stuck in fetch state.
    """
    x = c.submit(inc, 1, key="x", workers=[a.address])
    with freeze_data_fetching(b):
        y = c.submit(inc, x, key="y", workers=[b.address])
        await wait_for_state("x", "fetch", b)
        x.release()
        y.release()
        while s.tasks or a.tasks or b.tasks:
            await asyncio.sleep(0.01)

    x = c.submit(inc, 2, key="x", workers=[a.address])
    y = c.submit(inc, x, key="y", workers=[b.address])
    assert await y == 4

The test deadlocks on the last line.
freeze_data_fetching is from #6342.

Proposed design

  • Replace UniqueTaskHeap with a generic HeapSet, which is a MutableSet where pop() returns the first element according to an arbitrary key function. You can discard keys from the middle of the heap, like in any other set. Internally it's implemented using weakrefs (see the sketch after this list).
  • Proactively discard tasks from data_needed whenever they transition away from fetch.
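
Below is a rough, self-contained sketch of what such a HeapSet could look like; names and details are illustrative and may differ from the actual implementation in the linked PR. The only requirements on the element type are that it is hashable and weak-referenceable:

import heapq
import weakref
from collections.abc import Callable, Hashable, Iterator, MutableSet
from typing import Any, TypeVar

T = TypeVar("T", bound=Hashable)


class HeapSet(MutableSet[T]):
    """MutableSet whose pop() returns the smallest element according to ``key``.

    discard() only removes the element from the backing set; the matching heap
    entry is left behind as a weakref and skipped lazily by pop(), so the
    discarded object itself can be garbage collected right away.
    """

    def __init__(self, *, key: Callable[[T], Any]):
        self._key = key
        self._data: set[T] = set()  # source of truth for membership
        self._heap: list[tuple[Any, int, weakref.ref]] = []
        self._inc = 0  # tie-breaker so weakrefs are never compared

    def add(self, value: T) -> None:
        if value in self._data:
            return
        self._data.add(value)
        self._inc += 1
        heapq.heappush(self._heap, (self._key(value), self._inc, weakref.ref(value)))

    def discard(self, value: T) -> None:
        # O(1); the stale heap entry is dealt with lazily in pop()
        self._data.discard(value)

    def pop(self) -> T:
        while self._heap:
            _, _, ref = heapq.heappop(self._heap)
            value = ref()
            if value is not None and value in self._data:
                self._data.discard(value)
                return value
        raise KeyError("pop from an empty HeapSet")

    def __contains__(self, value: object) -> bool:
        return value in self._data

    def __iter__(self) -> Iterator[T]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

With data_needed as a HeapSet keyed on ts.priority, the transition away from fetch can call data_needed.discard(ts) unconditionally, and pushing a brand-new TaskState with a recycled key becomes an ordinary add() rather than a silent no-op.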
@crusaderky added the deadlock label on May 31, 2022
@crusaderky self-assigned this on May 31, 2022
@crusaderky (Collaborator, author) commented

CC @fjetter @gjoseph92 @graingert

@fjetter (Member) commented on May 31, 2022

Replace UniqueTaskHeap with a generic HeapSet, which is a MutableSet where pop() returns the first element according to an arbitrary key function. You can discard keys from the middle of the heap, like in any other set. Internally it's implemented using weakrefs.

I wouldn't want to rely too much on weakref logic, since we currently do not guarantee that TaskState references are actually released properly; see #6250

Can you outline what this would look like?

Another solution would be to track even more state using some unique identifier (e.g. id() or a hash). Below is some inelegant pseudo-code:

import heapq
from collections.abc import Collection


class UniqueTaskHeap(Collection[TaskState]):
    # Sketch only: __init__, push, __contains__, __iter__ and __len__ are omitted.
    # Deduplicating on id(ts) rather than ts.key means that a recreated TaskState
    # with a recycled key is not silently rejected.
    __slots__ = ("_known", "_heap")
    _known: dict[int, str]  # id(ts) -> ts.key
    _heap: list[tuple[tuple[int, ...], int, str, TaskState]]  # (priority, id(ts), key, ts)

    def discard(self, ts: TaskState) -> None:
        # Use a default so that discarding an absent task does not raise; the
        # matching heap entry is left in place and skipped lazily by pop().
        self._known.pop(id(ts), None)

    def pop(self) -> TaskState:
        while True:
            _, _id, key, ts = heapq.heappop(self._heap)
            try:
                self._known.pop(_id)
            except KeyError:
                continue  # entry was discarded earlier; skip it
            return ts

@crusaderky (Collaborator, author) commented

No worries, the weakref is just for memory efficiency. See the linked PR.

@jrbourbeau (Member) commented

Thanks for reporting and working on a fix, @crusaderky. Do you know when this was introduced? I'd like to know whether it's been around for a while or is a regression in the latest release.

@crusaderky (Collaborator, author) commented

It was introduced together with UniqueTaskHeap in 2022.2.0 (#5653)
