Fix `Scheduler.restart` logic #6504
Conversation
`Scheduler.restart` used to remove every worker without closing it. This was bad practice (dask#6390), as well as incorrect: it certainly seemed the intent was only to remove non-Nanny workers; Nanny workers are then restarted via the `restart` RPC to the Nanny, not to the worker.
BTW, both added tests fail on `main`, so they would have caught the problem.
```python
*(
    self.remove_worker(address=addr, stimulus_id=stimulus_id)
    for addr in self.workers
    if addr not in nanny_workers
```
Key change: before, `nannies` contained all workers, so we were removing all workers immediately. Now we only remove non-Nanny workers, and leave Nanny workers around to be restarted via RPC to the Nanny a few lines below.
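To make the flow concrete, here is a rough sketch of the intended shape of this part of `Scheduler.restart` after the change. It is a paraphrase for illustration only, not the actual method body; the `nanny_workers` name follows the diff above, and the exact attribute access is assumed.

```python
import asyncio

# Workers managed by a Nanny keep their WorkerState; only their processes get
# restarted. Everything else is removed (and closed), because nothing would
# ever bring those workers back.
nanny_workers = {
    addr: ws.nanny for addr, ws in self.workers.items() if ws.nanny
}

await asyncio.gather(
    *(
        self.remove_worker(address=addr, stimulus_id=stimulus_id)
        for addr in self.workers
        if addr not in nanny_workers
    )
)

# Nanny-managed workers are restarted a few lines further down via the
# `restart` RPC to each Nanny, not to the worker itself.
```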
distributed/scheduler.py (Outdated)

```python
)
for r in close_results:
    if isinstance(r, Exception):
        # TODO this is probably not, in fact, normal.
```
I can't see why this should happen. Any errors here are probably real (especially since `remove_worker` doesn't even do much communication, and it try/excepts the places where you'd expect errors might happen). I'd like to remove it if others are okay, but it's not necessary and I'm not certain how safe it is, so I'm leaving it for now.
```python
]

resps = All(
```
Refactor for style, using `asyncio.gather` instead of `All`.
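For illustration, a minimal sketch of that style change; `restart_nanny` here is a hypothetical stand-in for the actual RPC call, not the PR's code:

```python
import asyncio

# Before (sketch): fan-out with distributed's All() helper.
# resps = await All([restart_nanny(addr) for addr in nanny_workers.values()])

# After (sketch): the same fan-out with stdlib asyncio.gather.
resps = await asyncio.gather(
    *(restart_nanny(addr) for addr in nanny_workers.values())
)
```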
```python
async with Worker(s.address, nthreads=1) as w:
    await c.wait_for_workers(3)

    # Halfway through `Scheduler.restart`, only the non-Nanny workers should be removed.
```
For reference, the plugin is triggered here:
distributed/scheduler.py, lines 5115 to 5119 in bd74d2b:

```python
for plugin in list(self.plugins.values()):
    try:
        plugin.restart(self)
    except Exception as e:
        logger.exception(e)
```
Hm, the only failure is … and that's not a test that shows up as having failed before on https://dask.org/distributed/test_report.html. But since this PR is only changing the logic in …
I've confirmed that test fails for me locally on `main`.
distributed/scheduler.py (Outdated)

```python
    return_exceptions=True,
)
for r in close_results:
    if isinstance(r, Exception):
```
I think like this?
```diff
- if isinstance(r, Exception):
+ if isinstance(r, BaseException) and not isinstance(r, asyncio.CancelledError):
```
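In other words, treat anything derived from `BaseException` as a real failure, except cancellations, which are expected when the surrounding operation is being torn down. A minimal illustrative helper (not code from the PR):

```python
import asyncio

def real_failures(results):
    """Filter the output of gather(..., return_exceptions=True) down to true errors."""
    return [
        r
        for r in results
        if isinstance(r, BaseException) and not isinstance(r, asyncio.CancelledError)
    ]
```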
Mostly minor comments
distributed/scheduler.py (Outdated)

```python
# TODO this is probably not, in fact, normal.
logger.info("Exception while restarting. This is normal.", exc_info=r)
```
```diff
- # TODO this is probably not, in fact, normal.
- logger.info("Exception while restarting. This is normal.", exc_info=r)
+ logger.error("Exception while restarting worker", exc_info=r)
```
After reading `remove_worker`, I agree that I can't see anything that could go wrong. IMHO we should just let the exception be logged by `@log_errors` and be reraised on the client.

@fjetter do you have an opinion on this?
Yes, I suggest removing this. `Scheduler.remove_worker` does not intentionally raise any exceptions, nor are there any obvious transitive exceptions that should be raised and could be handled safely. Therefore, in this local context I don't see the point of implementing any exception handling and would prefer to just reraise (ideally to the client).

I think it's even unnecessary to do the `gather(..., return_exceptions=True)` dance. If there is an unexpected exception at this point, the only sane thing to do is to restart the entire cluster (e.g. if there is a transition error), but this needs to be done externally (obviously, since we already tried restarting :)).
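A minimal sketch of the simplification being argued for, with `remove_coros` standing in for the `remove_worker` calls shown earlier: drop `return_exceptions=True` and the per-result logging, and let any unexpected exception propagate so that `@log_errors` logs it and the caller sees it.

```python
import asyncio

# Sketch only, not the actual Scheduler.restart body: no return_exceptions=True
# and no isinstance(r, Exception) loop afterwards. An unexpected error simply
# raises out of restart() and is surfaced to the caller.
await asyncio.gather(*remove_coros)
```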
Okay, I didn't like it either. I just left it in to keep the changes in this PR to a minimum.

> If there is an unexpected exception at this point, the only sane thing to do is to restart the entire cluster

I think that would imply adding `@fail_hard` to `Scheduler.restart`; should I do that?
NVM, we're not using `fail_hard` anywhere on the scheduler right now, so I won't add that. I've removed the error suppression and `return_exceptions=True`.
```python
await s.restart()

if plugin.error:
    raise plugin.error
```
I don't understand the purpose of this complication: can't you just put the assertions you wrote in the plugin here instead?
See the exception handling here:
distributed/scheduler.py, lines 5107 to 5111 in ea2c80f:

```python
for plugin in list(self.plugins.values()):
    try:
        plugin.restart(self)
    except Exception as e:
        logger.exception(e)
```
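In other words, an assertion that fails inside a plugin's `restart` hook is caught and only logged by the scheduler, so the test has to stash the exception and re-raise it itself. A hypothetical sketch of that pattern (class and attribute names are illustrative, not the actual test code):

```python
from distributed.diagnostics.plugin import SchedulerPlugin

class CheckWorkersMidRestart(SchedulerPlugin):  # hypothetical test helper
    error = None

    def restart(self, scheduler):
        try:
            # Halfway through Scheduler.restart: non-Nanny workers should be
            # gone, Nanny-managed workers should still be registered.
            assert all(ws.nanny for ws in scheduler.workers.values())
        except Exception as e:
            # The scheduler would swallow this, so stash it for the test body
            # to re-raise after `await s.restart()` returns.
            self.error = e
```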
distributed/tests/test_scheduler.py (Outdated)

```python
assert len(s.workers) == 2
# Confirm they restarted
new_pids = set((await c.run(os.getpid)).values())
```
```diff
- new_pids = set((await c.run(os.getpid)).values())
+ new_pids = {a.process.process.pid, b.process.process.pid}
+ assert all(new_pids)
```
distributed/tests/test_scheduler.py
Outdated
assert len(s.workers) == 2 | ||
# Confirm they restarted | ||
new_pids = set((await c.run(os.getpid)).values()) | ||
assert new_pids != original_pids |
```diff
- assert new_pids != original_pids
+ assert new_pids.isdisjoint(original_pids)
```
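The distinction matters because `!=` only asserts that the two sets are not identical, while `isdisjoint` asserts that no worker kept its old PID. A tiny illustration:

```python
original_pids = {100, 101}
new_pids = {100, 202}  # suppose one worker was not actually restarted

assert new_pids != original_pids           # still passes
assert new_pids.isdisjoint(original_pids)  # fails, catching the survivor
```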
distributed/tests/test_scheduler.py (Outdated)

```python
assert len(s.workers) == 2
# Confirm they restarted
new_pids = set((await c.run(os.getpid)).values())
```
Is this possibly flaky, as the PIDs could be reused? Also, `psutil.Process.create_time()` might be more useful here.
It should be safe to assume that the chance of the box running the tests spawning and destroying 65536 processes over the duration of the test is nil.
@graingert thanks, it looks like just doing equality between the `Process` objects does this for us: https://github.com/giampaolo/psutil/blob/5ca68709c44885f6902820e8dcb9fcff1cc1e33b/psutil/__init__.py#L408-L413
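For context, psutil compares processes by PID and creation time, so a freshly spawned process that happens to reuse an old PID does not compare equal to the stale `Process` object. A small illustrative check:

```python
import psutil

p = psutil.Process()          # handle to the current process
same = psutil.Process(p.pid)  # a second handle to the same PID

# Equal because both the PID and the creation time match; a different process
# that merely recycled this PID would compare unequal.
assert p == same
```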
Ready for final review, I believe.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0, 15 suites ±0, 6h 8m 49s ⏱️ -2m 7s

For more details on these failures, see this check.

Results for commit d037f37. ± Comparison against base commit 6d85a85.
TODO make this a separate PR and add tests. Just want to see if it helps CI.
This reverts commit 7d90e2a.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0, 15 suites ±0, 6h 39m 5s ⏱️ +28m 9s

Results for commit 5cccba6. ± Comparison against base commit 6d85a85.
`Scheduler.restart` used to remove every worker without closing it. This was bad practice (#6390), as well as incorrect: it certainly seemed the intent was only to remove non-Nanny workers. See the detailed explanation in #6455 (comment).

Closes #6455, closes #6452, closes #6494. I will also make a separate PR with #6494 (comment), which isn't necessary to fix the restart issue, just a good cleanup task to do.
cc @fjetter @hendrikmakait @jrbourbeau
`pre-commit run --all-files`