Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DNM] Queue by default #7191

Closed
wants to merge 4 commits into from
Closed

Conversation

gjoseph92
Copy link
Collaborator

@gjoseph92 gjoseph92 commented Oct 25, 2022

Enables queuing by default (with worker-saturation: 1.1). See #7128 for some background.

Closes #7213.

  • Tests added / passed
  • Passes pre-commit run --all-files

@gjoseph92
Copy link
Collaborator Author

gjoseph92 commented Oct 25, 2022

@fjetter @crusaderky @hendrikmakait I realize that the implementation of the adaptive target is very simplistic for queued tasks:

# TODO consider any user-specified default task durations for queued tasks
queued_occupancy = len(self.queued) * self.UNKNOWN_TASK_DURATION

@pytest.mark.flaky(condition=LINUX, reruns=10, reruns_delay=5)
@pytest.mark.xfail(condition=MACOS or WINDOWS, reason="extremely flaky")
@gen_test()
async def test_target_duration():
with dask.config.set(
{
"distributed.scheduler.default-task-durations": {"slowinc": 1},
# adaptive target for queued tasks doesn't yet consider default or learned task durations
"distributed.scheduler.worker-saturation": float("inf"),
}
):
async with LocalCluster(
n_workers=0,
asynchronous=True,
processes=False,
silence_logs=False,
dashboard_address=":0",
) as cluster:
adapt = cluster.adapt(interval="20ms", minimum=2, target_duration="5s")
async with Client(cluster, asynchronous=True) as client:
await client.wait_for_workers(2)
futures = client.map(slowinc, range(100), delay=0.3)
await wait(futures)
assert adapt.log[0][1] == {"status": "up", "n": 2}
assert adapt.log[1][1] == {"status": "up", "n": 20}

People using adaptive clusters might see rather different behavior when the default changes, particularly if they're setting default task durations manually, or relying significantly on learned task durations.

@gjoseph92
Copy link
Collaborator Author

Note that there are a number of tests failing here that weren't failing before in the queue CI jobs. That's because here, worker-saturation is set to 1.1 instead of 1.0. That can lead to different selection of workers, which breaks assumptions of more tests.

Most of these have to do with the fact that small TaskGroups aren't considered 'root-ish', so they don't follow the queuing code path.

As it happens, I think most of those tests have already been fixed in #7221.

@github-actions
Copy link
Contributor

github-actions bot commented Nov 3, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       16 files  +       1         16 suites  +1   6h 49m 37s ⏱️ + 20m 48s
  3 170 tests +       2    3 079 ✔️  -        4    85 💤 +  2    4 +2  2 🔥 +2 
22 196 runs   - 1 244  21 235 ✔️  - 1 301  947 💤 +47  11 +7  3 🔥 +3 

For more details on these failures and errors, see this check.

Results for commit 23135da. ± Comparison against base commit 02b9430.

@gjoseph92
Copy link
Collaborator Author

@fjetter
Copy link
Member

fjetter commented Nov 9, 2022

I would like us to fix #7197 w/out changing the root-ish definition for now

@gjoseph92
Copy link
Collaborator Author

gjoseph92 commented Nov 9, 2022

Sure. We could also do that with a small fix to bring the round-robin code path in line with how we select workers elsewhere: #7197 (comment). That way, regardless of the root-ish decision, we use equivalent criteria for picking workers.

Edit: see #7280.

@gjoseph92
Copy link
Collaborator Author

#7279 instead

@gjoseph92 gjoseph92 closed this Nov 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Turn on queuing by default
2 participants