who_has not set for task in state fetch #5751

Closed
fjetter opened this issue Feb 3, 2022 · 4 comments
Labels: flaky test (Intermittent failures on CI)

fjetter commented Feb 3, 2022

Known tests to be affected

In production environments validation is disabled, which will likely cause other errors and/or deadlocks instead.
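For context, here is a hedged sketch (simplified, not the actual distributed source) of how such invariant checks are typically guarded behind the worker's validate flag, which is on in the test suite and off in production deployments:

# Hedged sketch: when the flag is off (production), a violated invariant does
# not raise here; the state machine carries on and fails later in harder-to-
# diagnose ways (errors or deadlocks).
def maybe_validate(worker, ts):
    if worker.validate:           # assumption: validation flag enabled on CI only
        worker.validate_task(ts)  # raises AssertionError on an inconsistent TaskState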

Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3067, in gather_dep
    response = await get_data_from_worker(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4353, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4330, in _get_data
    comm = await rpc.connect(worker)
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 1067, in connect
    comm = await fut
  File "/home/runner/work/distributed/distributed/distributed/comm/core.py", line 329, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:36081 after 30 s
distributed.worker - ERROR - 
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3986, in validate_task
    self.validate_task_fetch(ts)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3928, in validate_task_fetch
    assert ts.who_has
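The check that fires is simple; a minimal sketch of the invariant, mirroring the assertion in the traceback rather than the full validate_task_fetch implementation:

# A task in state "fetch" must know at least one peer worker holding its data.
def validate_task_fetch(ts):
    assert ts.state == "fetch"
    assert ts.who_has  # fails here: the set of source workers is empty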

Task story

    (
        "inc-aa226ae3a9f799819e1e685fba467442",
        "compute-task",
        "compute-task-1643835890.882272",
        1643835890.8875856,
    ),
    (
        "inc-aa226ae3a9f799819e1e685fba467442",
        "released",
        "waiting",
        "waiting",
        {"inc-aa226ae3a9f799819e1e685fba467442": "ready"},
        "compute-task-1643835890.882272",
        1643835890.887607,
    ),
    (
        "inc-aa226ae3a9f799819e1e685fba467442",
        "waiting",
        "ready",
        "ready",
        {},
        "compute-task-1643835890.882272",
        1643835890.8876154,
    ),
    (
        "inc-aa226ae3a9f799819e1e685fba467442",
        "ensure-task-exists",
        "ready",
        "compute-task-1643835890.8866765",
        1643835890.8877435,
    ),
    (
        "inc-aa226ae3a9f799819e1e685fba467442",
        "release-key",
        "compute-task-1643835890.8866765",
        1643835890.8879642,
    ),
    (
        "inc-aa226ae3a9f799819e1e685fba467442",
        "ready",
        "released",
        "released",
        {},
        "compute-task-1643835890.8866765",
        1643835890.8879726,
    ),
    (
        "inc-aa226ae3a9f799819e1e685fba467442",
        "released",
        "fetch",
        "fetch",
        {},
        "compute-task-1643835890.8866765",
        1643835890.8879821,
    ),
    (
        "inc-aa226ae3a9f799819e1e685fba467442",
        "ready",
        "fetch",
        "fetch",
        {},
        "compute-task-1643835890.8866765",
        1643835890.8879848,
    ),
Full story
AssertionError: Invalid TaskState encountered for <TaskState 'inc-aa226ae3a9f799819e1e685fba467442' fetch>.
Story:
[('inc-aa226ae3a9f799819e1e685fba467442', 'ensure-task-exists', 'released', 'compute-task-1643835890.875481', 1643835890.8762364), ('inc-aa226ae3a9f799819e1e685fba467442', 'released', 'fetch', 'fetch', {}, 'compute-task-1643835890.875481', 1643835890.876392), ('gather-dependencies', 'tcp://127.0.0.1:36081', {'inc-0c3f7739638b70e403ab21f80b09970b', 'inc-74fa2a3ce29de41facb1b01e75154df2', 'inc-bc330a1ea35cf0e28ef6270b770981cb', 'inc-aa226ae3a9f799819e1e685fba467442', 'inc-1bc6cda651ec56203f196053394876a1'}, 'ensure-communicating-1643835890.8765168', 1643835890.8765872), ('inc-aa226ae3a9f799819e1e685fba467442', 'fetch', 'flight', 'flight', {}, 'ensure-communicating-1643835890.8765168', 1643835890.876613), ('request-dep', 'tcp://127.0.0.1:36081', {'inc-0c3f7739638b70e403ab21f80b09970b', 'inc-74fa2a3ce29de41facb1b01e75154df2', 'inc-bc330a1ea35cf0e28ef6270b770981cb', 'inc-aa226ae3a9f799819e1e685fba467442', 'inc-1bc6cda651ec56203f196053394876a1'}, 'ensure-communicating-1643835890.8765168', 1643835890.876833), ('receive-dep-failed', 'tcp://127.0.0.1:36081', {'inc-0c3f7739638b70e403ab21f80b09970b', 'inc-74fa2a3ce29de41facb1b01e75154df2', 'inc-bc330a1ea35cf0e28ef6270b770981cb', 'inc-aa226ae3a9f799819e1e685fba467442', 'inc-1bc6cda651ec56203f196053394876a1'}, 'ensure-communicating-1643835890.8765168', 1643835890.8813467), ('inc-aa226ae3a9f799819e1e685fba467442', 'missing-dep', 'ensure-communicating-1643835890.8765168', 1643835890.8813949), ('inc-aa226ae3a9f799819e1e685fba467442', 'flight', 'missing', 'missing', {}, 'ensure-communicating-1643835890.8765168', 1643835890.881428), ('inc-aa226ae3a9f799819e1e685fba467442', 'release-key', 'processing-released-1643835890.880434', 1643835890.8822312), ('inc-aa226ae3a9f799819e1e685fba467442', 'missing', 'released', 'released', {'inc-aa226ae3a9f799819e1e685fba467442': 'forgotten'}, 'processing-released-1643835890.880434', 1643835890.8822412), ('inc-aa226ae3a9f799819e1e685fba467442', 'released', 'forgotten', 'forgotten', {}, 'processing-released-1643835890.880434', 1643835890.882249), ('inc-aa226ae3a9f799819e1e685fba467442', 'compute-task', 'compute-task-1643835890.880547', 1643835890.8824306), ('inc-aa226ae3a9f799819e1e685fba467442', 'released', 'waiting', 'waiting', {'inc-aa226ae3a9f799819e1e685fba467442': 'ready'}, 'compute-task-1643835890.880547', 1643835890.882451), ('inc-aa226ae3a9f799819e1e685fba467442', 'waiting', 'ready', 'ready', {}, 'compute-task-1643835890.880547', 1643835890.8824592), ('inc-aa226ae3a9f799819e1e685fba467442', 'ready', 'executing', 'executing', {}, 'ensure-computing-1643835890.8831396', 1643835890.8831706), ('inc-aa226ae3a9f799819e1e685fba467442', 'put-in-memory', 'ensure-computing-1643835890.8831396', 1643835890.8835356), ('inc-aa226ae3a9f799819e1e685fba467442', 'executing', 'memory', 'memory', {}, 'ensure-computing-1643835890.8831396', 1643835890.883553), ('free-keys', ('inc-aa226ae3a9f799819e1e685fba467442',), 'processing-released-1643835890.8822534', 1643835890.8875349), ('inc-aa226ae3a9f799819e1e685fba467442', 'release-key', 'processing-released-1643835890.8822534', 1643835890.8875427), ('inc-aa226ae3a9f799819e1e685fba467442', 'memory', 'released', 'released', {'inc-aa226ae3a9f799819e1e685fba467442': 'forgotten'}, 'processing-released-1643835890.8822534', 1643835890.8875651), ('inc-aa226ae3a9f799819e1e685fba467442', 'released', 'forgotten', 'forgotten', {}, 'processing-released-1643835890.8822534', 1643835890.8875735), ('inc-aa226ae3a9f799819e1e685fba467442', 'compute-task', 'compute-task-1643835890.882272', 1643835890.8875856), 
('inc-aa226ae3a9f799819e1e685fba467442', 'released', 'waiting', 'waiting', {'inc-aa226ae3a9f799819e1e685fba467442': 'ready'}, 'compute-task-1643835890.882272', 1643835890.887607), ('inc-aa226ae3a9f799819e1e685fba467442', 'waiting', 'ready', 'ready', {}, 'compute-task-1643835890.882272', 1643835890.8876154), ('inc-aa226ae3a9f799819e1e685fba467442', 'ensure-task-exists', 'ready', 'compute-task-1643835890.8866765', 1643835890.8877435), ('inc-aa226ae3a9f799819e1e685fba467442', 'release-key', 'compute-task-1643835890.8866765', 1643835890.8879642), ('inc-aa226ae3a9f799819e1e685fba467442', 'ready', 'released', 'released', {}, 'compute-task-1643835890.8866765', 1643835890.8879726), ('inc-aa226ae3a9f799819e1e685fba467442', 'released', 'fetch', 'fetch', {}, 'compute-task-1643835890.8866765', 1643835890.8879821), ('inc-aa226ae3a9f799819e1e685fba467442', 'ready', 'fetch', 'fetch', {}, 'compute-task-1643835890.8866765', 1643835890.8879848)]
@fjetter fjetter added the "flaky test" (Intermittent failures on CI) label on Feb 3, 2022

fjetter commented Feb 3, 2022

Looks like this is a regression introduced by #5653


fjetter commented Feb 3, 2022

This is a weird one, again.

{
    "op": "handle_compute_task",
   "key": "A",
    # Dependencies are encoded in `who_has`
    # The scheduler claims the worker itself is holding this dependency but this is not true
    "who_has": {"B": [self]}
}

The assertion is raised during a handle_compute_task for a task A that has a dependency B. The scheduler claims that the worker already holds the data for B, but this is not true: the worker has merely queued it up, and B is in state ready at the time.

I encountered such a condition before but couldn't make sense of it at the time. This led to us filtering out `self.address` in

if self.address in workers and self.tasks[dep].state != "memory":
    logger.debug(
        "Scheduler claims worker %s holds data for task %s which is not true.",
        self.name,
        dep,
    )
    # Do not mutate the input dict. That's rude
    workers = set(workers) - {self.address}

since otherwise gather_dep would run in an endless loop trying to fetch data from itself (yes, that's possible 🤦), see #4784.
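To make the failure mode concrete, a hedged, standalone illustration (illustrative helper, not the worker's actual method) of what that filtering does to the bogus message above: removing the worker's own address leaves the dependency with no valid source, so it sits in state fetch with an empty who_has, which is exactly what the validation trips over.

# Simplified stand-in for the guard above: drop our own address from who_has.
def filter_self(who_has: dict, own_address: str) -> dict:
    return {key: set(workers) - {own_address} for key, workers in who_has.items()}

# The scheduler message from above, with the worker's own address standing in for `self`:
own = "tcp://127.0.0.1:36081"
filtered = filter_self({"B": [own]}, own)
assert filtered == {"B": set()}  # B is wanted, but no worker is left to fetch it from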

Will likely need to dig a bit deeper into what the scheduler thinks it's doing.

@fjetter fjetter self-assigned this Feb 3, 2022
@fjetter fjetter changed the title from "test_submit_after_failed_worker_sync failing" to "who_has not set for task in state fetch" on Feb 9, 2022

fjetter commented Feb 9, 2022

There are a few things happening.

While worker B is trying to fetch data from worker A, worker A is killed and the tasks are reassigned to worker B, i.e. this is a
fetch -> waiting (-> ready) transition which passes through the intermediate cancelled -> resumed states, since the worker is still waiting for the connection to fail.
However, the failed connection pings the scheduler with a missing-dep signal, which transitions the (now processing on B) task to released. This causes a bit of a vicious release -> compute -> release -> compute cycle, which is not great but is not the reason for this error.

The AssertionError/deadlock appears iff the compute -> release cycles overlap with a task-finished signal from a previous compute, delayed by the network. This causes the dependent to be scheduled on worker B even though worker B is still working off the initial task. Trying to work on the dependent triggers its dependency to be transitioned to fetch, but since the worker itself is the one "supposed to have it", who_has ends up empty.

This is difficult to write down; below is a timeline of events with a few annotations.

Annotated timeline of events

1644412377.3091562: ({'action': 'add-worker', 'worker': 'WorkerA'},)
1644412377.3091671: ({'action': 'add-worker', 'worker': 'WorkerA'},)
1644412377.487405: ({'action': 'add-worker', 'worker': 'WorkerB'},)
1644412377.4874198: ({'action': 'add-worker', 'worker': 'WorkerB'},)
1644412377.489585: ({'action': 'worker-status-change', 'prev-status': 'undefined', 'status': 'running', 'worker': 'WorkerA'},)
1644412377.4912362: ({'action': 'worker-status-change', 'prev-status': 'undefined', 'status': 'running', 'worker': 'WorkerB'},)
1644412377.534137: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'released', 'waiting', {'inc-1bc6cda651ec56203f196053394876a1': 'processing'})
1644412377.534146: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'waiting', 'processing', {})
1644412377.540587: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'processing', 'memory', {})
1644412377.543991: ('scheduler', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'released', 'waiting', {'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'processing'})
1644412377.544043: ('scheduler', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'waiting', 'processing', {})
# Worker B is instructed to compute `sum` and tries to fetch `inc`
1644412377.544758: ('WorkerB', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'compute-task', 'compute-task-1644412377.544033')
1644412377.5447848: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'ensure-task-exists', 'released', 'compute-task-1644412377.544033')
1644412377.544823: ('WorkerB', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'released', 'waiting', 'waiting', {'inc-0c3f7739638b70e403ab21f80b09970b': 'fetch', 'inc-1bc6cda651ec56203f196053394876a1': 'fetch', 'inc-74fa2a3ce29de41facb1b01e75154df2': 'fetch', 'inc-aa226ae3a9f799819e1e685fba467442': 'fetch', 'inc-bc330a1ea35cf0e28ef6270b770981cb': 'fetch'}, 'compute-task-1644412377.544033')
1644412377.544836: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'released', 'fetch', 'fetch', {}, 'compute-task-1644412377.544033')
1644412377.544929: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'fetch', 'flight', 'flight', {}, 'ensure-communicating-1644412377.544886')

# Worker A dies. Scheduler knows about it first. Reschedules all tasks
1644412377.546768: ({'action': 'remove-worker', 'worker': 'WorkerA'},)
1644412377.546768: ({'action': 'remove-worker', 'worker': 'WorkerA'},)
1644412377.5469272: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'memory', 'released', {'inc-1bc6cda651ec56203f196053394876a1': 'waiting', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'waiting'})
1644412377.546936: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'released', 'waiting', {'inc-1bc6cda651ec56203f196053394876a1': 'processing'})
1644412377.5469701: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'waiting', 'processing', {})
1644412377.546987: ('scheduler', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'processing', 'released', {'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'waiting'})
1644412377.546998: ('scheduler', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'released', 'waiting', {'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'waiting'})

# Worker B notices dead worker second and initiates a missing-dep request; transitions task to missing
1644412377.549644: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'missing-dep', 'ensure-communicating-1644412377.544886')
1644412377.5496802: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'flight', 'missing', 'missing', {}, 'ensure-communicating-1644412377.544886')

# Worker B now gets assigned the task to compute it.
# XXX: it should've released `sum` before the new assignment comes through.
# This could be explained if the batched send does not batch the two requests and the ordering
# of the messages lists the compute-task before the release-task, and it *does*! This is even
# possible in practice since we're running with multiple processes!!!!!
1644412377.550276: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'compute-task', 'compute-task-1644412377.546967')
1644412377.5503109: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'release-key', 'compute-task-1644412377.546967')
1644412377.550329: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'missing', 'released', 'released', {}, 'compute-task-1644412377.546967')
1644412377.5503402: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'released', 'waiting', 'waiting', {'inc-1bc6cda651ec56203f196053394876a1': 'ready'}, 'compute-task-1644412377.546967')
1644412377.5503428: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'missing', 'waiting', 'waiting', {'inc-1bc6cda651ec56203f196053394876a1': 'ready'}, 'compute-task-1644412377.546967')
1644412377.5503511: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'waiting', 'ready', 'ready', {}, 'compute-task-1644412377.546967')

# The scheduler receives the missing-data request. It should ignore it because WorkerA is long gone. TODO: Verify this
1644412377.5503948: ({'action': 'missing-data', 'key': 'inc-1bc6cda651ec56203f196053394876a1', 'worker': 'WorkerA'},)

# It appears to reschedule something: XXX: This is due to our missing-data bug
1644412377.550459: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'processing', 'released', {'inc-1bc6cda651ec56203f196053394876a1': 'waiting'})
1644412377.5504901: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'released', 'waiting', {'inc-1bc6cda651ec56203f196053394876a1': 'processing'})

# Only here, `sum` is released. This should've happened sooner. However, it is
# forgotten so this is fine? This _appears_ to be the stimulus originating from
# remove-worker but no way to correlate right now. XXX:
1644412377.550514: ('WorkerB', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'release-key', 'processing-released-1644412377.546985')
1644412377.550543: ('WorkerB', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'waiting', 'released', 'released', {
    'inc-0c3f7739638b70e403ab21f80b09970b': 'released', 
    'inc-74fa2a3ce29de41facb1b01e75154df2': 'released', 
    'inc-aa226ae3a9f799819e1e685fba467442': 'released', 
    'inc-bc330a1ea35cf0e28ef6270b770981cb': 'released', 
    'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'forgotten'
    }, 'processing-released-1644412377.546985')
1644412377.550557: ('WorkerB', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'released', 'forgotten', 'forgotten', {}, 'processing-released-1644412377.546985')

# XXX: #2 Due to the missing-data request, it is again transitioned to processing which _should_ issue another compute
1644412377.5505621: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'waiting', 'processing', {})

# WorkerB computes the task ordinarily
1644412377.553717: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'ready', 'executing', 'executing', {}, 'ensure-computing-1644412377.553691')
1644412377.5540302: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'put-in-memory', 'ensure-computing-1644412377.553691')
# XXX: #3 send "task-finished" to scheduler
1644412377.55404: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'executing', 'memory', 'memory', {}, 'ensure-computing-1644412377.553691')

# XXX: #2
# I guess this is the result of us rescheduling the task due to `missing-data`
1644412377.5550568: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'release-key', 'processing-released-1644412377.550446')
1644412377.555079: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'memory', 'released', 'released', {'inc-1bc6cda651ec56203f196053394876a1': 'forgotten'}, 'processing-released-1644412377.550446')
1644412377.555084: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'released', 'forgotten', 'forgotten', {}, 'processing-released-1644412377.550446')

# XXX: #2 compute-task
1644412377.555091: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'compute-task', 'compute-task-1644412377.550556')
1644412377.555103: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'released', 'waiting', 'waiting', {'inc-1bc6cda651ec56203f196053394876a1': 'ready'}, 'compute-task-1644412377.550556')
1644412377.5551062: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'waiting', 'ready', 'ready', {}, 'compute-task-1644412377.550556')

# XXX: #3 Receive task-done but from a previous, stale compute-task. The worker lost the data again at this point
1644412377.5556521: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'processing', 'memory', {'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'processing'})

# The scheduler thinks that everything is in memory but this is false since the worker already forgot its data again
1644412377.555695: ('scheduler', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'waiting', 'processing', {})

# This is probably when the worker's "forgot the key" / `release-worker-data` response comes in
# For the third? time now we're rescheduling the entire chain
1644412377.557708: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'memory', 'released', {'inc-1bc6cda651ec56203f196053394876a1': 'waiting', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'waiting'})
1644412377.5577338: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'released', 'waiting', {'inc-1bc6cda651ec56203f196053394876a1': 'processing'})
1644412377.5578141: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'waiting', 'processing', {})
1644412377.557857: ('scheduler', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'processing', 'released', {'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'waiting'})
1644412377.557884: ('scheduler', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'released', 'waiting', {'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'waiting'})

1644412377.561773: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'ready', 'executing', 'executing', {}, 'ensure-computing-1644412377.5617359')

1644412377.562877: ('WorkerB', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'compute-task', 'compute-task-1644412377.555685')
1644412377.562925: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'ensure-task-exists', 'executing', 'compute-task-1644412377.555685')
1644412377.563197: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'self-who-has-error')

# An unhandled exception in a handler will bubble up to handle_worker and will
# remove that worker. In fact, since remove_worker is a coroutine, it may
# entirely be possible that the remote has already reconnected, reestablished the
# stream, and will receive a close signal from the scheduler! wtf
# On the worker side, this will trigger a reconnect. No funny business as far as I can tell.
1644412377.566411: ({'action': 'remove-worker', 'worker': 'WorkerB'},)
1644412377.566411: ({'action': 'remove-worker', 'worker': 'WorkerB'},)
1644412377.566884: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'processing', 'released', {'inc-1bc6cda651ec56203f196053394876a1': 'waiting'})
1644412377.566889: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'released', 'no-worker', {})
1644412377.567368: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'put-in-memory', 'ensure-computing-1644412377.5617359')
1644412377.5673778: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'executing', 'memory', 'memory', {}, 'ensure-computing-1644412377.5617359')
1644412378.079115: ({'action': 'add-worker', 'worker': 'WorkerB'},)
1644412378.079124: ({'action': 'add-worker', 'worker': 'WorkerB'},)
1644412378.079268: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'no-worker', 'memory', {})
1644412378.079295: ('scheduler', 'inc-0c3f7739638b70e403ab21f80b09970b', 'no-worker', 'memory', {'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'processing'})
1644412378.079337: ('scheduler', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'waiting', 'processing', {})
1644412378.0804229: ('WorkerB', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'compute-task', 'compute-task-1644412378.0793269')
1644412378.080439: ('WorkerB', 'inc-1bc6cda651ec56203f196053394876a1', 'ensure-task-exists', 'memory', 'compute-task-1644412378.0793269')
1644412378.080469: ('WorkerB', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'released', 'waiting', 'waiting', {'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'ready'}, 'compute-task-1644412378.0793269')
1644412378.080478: ('WorkerB', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'waiting', 'ready', 'ready', {}, 'compute-task-1644412378.0793269')
1644412378.080606: ('WorkerB', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'ready', 'executing', 'executing', {}, 'ensure-computing-1644412378.080589')
1644412378.080926: ('WorkerB', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'put-in-memory', 'ensure-computing-1644412378.080589')
1644412378.080933: ('WorkerB', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'executing', 'memory', 'memory', {}, 'ensure-computing-1644412378.080589')

Apart from this, I noticed two more things behaving weirdly.

  1. Given a scheduler state T1 (memory) -> T2 (processing): if T1 is released (e.g. due to a lost worker), T1 is rescheduled before T2 is released. This can cause weird situations on the worker side, assuming we rely on ordering (so far I assumed ordering is reliable).

See below; this ordering depends on the insertion order into the recommendations dict (see also the sketch after this list):

1644412377.5469272: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'memory', 'released', {'inc-1bc6cda651ec56203f196053394876a1': 'waiting', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'waiting'})
1644412377.546936: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'released', 'waiting', {'inc-1bc6cda651ec56203f196053394876a1': 'processing'})
1644412377.5469701: ('scheduler', 'inc-1bc6cda651ec56203f196053394876a1', 'waiting', 'processing', {})
1644412377.546987: ('scheduler', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'processing', 'released', {'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'waiting'})
1644412377.546998: ('scheduler', 'sum-6e43962b5fde8e7d28875ebbdb8e7bc7', 'released', 'waiting', {'sum-6e43962b5fde8e7d28875ebbdb8e7bc7': 'waiting'})
  2. I just noticed again that any exception in a handler will close the connection between scheduler and worker, resulting in a worker reconnect, which screws a bit with the state machine. We're resilient to this but it is very confusing when debugging these things (xref "Is it intended that any error from a handler makes Server.handle_stream close the comm?" #5483)
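A hedged sketch of point 1 (illustrative names, not the scheduler's actual transition code): recommendations live in a plain dict and are processed in insertion order, so releasing T1 enqueues its re-scheduling before the release of its dependent T2 is even recommended.

# Python dicts preserve insertion order, so the order in which recommendations
# are added is the order in which the transitions are applied.
recommendations = {}
recommendations["T1"] = "waiting"  # the lost dependency is recommended for re-scheduling first ...
recommendations["T2"] = "waiting"  # ... its dependent only afterwards

for key, finish in recommendations.items():
    print(f"transition {key} -> {finish}")
# The worker can therefore see a new compute-task for T1 before it has been
# told to release T2, i.e. the surprising ordering in the log excerpt above.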

crusaderky commented

Closed by #5786
