Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cythonic SchedulerState (WIP) #5176

Closed
wants to merge 14 commits into from

Conversation

martindurant
Copy link
Member

  • Closes #xxxx
  • Tests added / passed
  • Passes black distributed / flake8 distributed / isort distributed

Status: client.submit(lambda x: x + 1).result() ran successfully.

@martindurant
Copy link
Member Author

cc @jakirkham

@martindurant
Copy link
Member Author

For test_scheduler.py I get: 24 failed, 398 passed

@martindurant
Copy link
Member Author

It's worth pointing out here, that certain attributes moved to SchedulerState. Things like transition_* certainly feel like internal details. However, some tests and other low-level poking of the scheduler would need to change their call from, e.g., scheduler.get_task_duration to scheduler.state.get_task_duration, or we could have a bunch of attributes that just redirect, for the purposes of making them reachable in the previous location.

@mrocklin
Copy link
Member

@jakirkham if you have time my sense is that you're probably the main expert here

@jrbourbeau
Copy link
Member

Just a heads up that @jakirkham is out of the office this week, but should be back next week

@jakirkham
Copy link
Member

jakirkham commented Feb 1, 2022

Apologies for being slow here Martin. Would it be possible to resolve the conflicts here?

Edit: Should add I like this change generally and this was kind of in the back of my head while pulling out SchedulerState.

@martindurant
Copy link
Member Author

I haven't looked in detail, but I suppose with some effort, yes. Do you think the work is going in the right direction, though? There has been far less enthusiasm for cython this last year.

@jakirkham
Copy link
Member

Yeah I think so.

When adding SchedulerState, was hoping to pull a lot of the brains of the scheduler out separate from communication and other management the scheduler does. Ideally it would be a separate object (like you have here) that the Scheduler instantiates and operates on. Something that can make operating on the graph easier. Would hope this accomplishes a few things:

  1. Cleaner break between Cython code and Python code
  2. Separation of code for maintaining the graph from other routine scheduler operations
  3. Better potential optimizations on the Cython part
  4. Easier maintenance for people working on the Scheduler
  5. More opportunity for exploring graph optimizers in other languages (C++, Rust, etc.)

As to enthusiasm, there were two other things we found to be slow:

  1. Communication
  2. Serialization

On the communication front we wanted to move to an asyncio-based communication framework with the option to use uvloop. This has recently been completed. We may want to do similar things with other comms (UCX, MPI, etc.).

There have been some improvements on the serialization front ( #4541 ) ( #4531 ) ( #5208 ).
Also there have been a few attempts to further improve serialization ( #4699 ) ( #4897 ) ( #4923 ). More work is probably still needed here. For example ( #5581 ) and other things ( #5450 (review) ).

That said, the big thing that sticks out to me is more Dask operations work with HLGs now. The Scheduler is increasingly involved in the work to execute them. Also the bottlenecks that were there are getting cleared out. All of this will make the Scheduler more of a painpoint going forward. Additionally it sounds like we can benefit from cleaner separation of code that this PR would move us towards to improve general maintainability.

@fjetter
Copy link
Member

fjetter commented Feb 2, 2022

I think separating the scheduler state from the server code makes sense regardless of whether we move on with cythonization. I intend to do the same on worker side but rather from the POV of a cleaner architecture, easier testability, etc; see #5736

In a recent sync (scheduler performance / Nvidia & Coiled) we discussed dropping Cythonization support. During this sync everybody appeared to agree which led to #5685 but only few engaged on the discussion over there, so far.

I would suggest to move forward with the separation of state from server code but would suggest to wait with changing all the signatures until we have a solid decision in #5685

FWIW, I would be much more excited about the prospect of cythonization if the state is pure cython and this PR is a great step in this direction. IFF we want to move forward with cythonization, I think this separation is crucial

@martindurant
Copy link
Member Author

For those that do do benchmarking, is it possible to run a set, cythonized and not for each:

  • this branch
  • the state of the scheduler at the time this PR was written
  • the current scheduler

I would add "this branch after resolving conflicts", but doing that resolution is probably a decent amount of work in itself.

@jakirkham
Copy link
Member

In a recent sync (scheduler performance / Nvidia & Coiled) we discussed dropping Cythonization support. During this sync everybody appeared to agree which led to #5685 but only few engaged on the discussion over there, so far.

Not sure what this means, but it sounds like Ben missed this and I wasn't present. Don't know who else was present. So I'm not sure what everybody agreed means.

Anyways would suggest keeping that discussion in that issue.

@jakirkham
Copy link
Member

Following up on our discussion in the meeting earlier today, it seems like this PR is a good first step with the next step moving SchedulerState into a separate module. Am looking into what this will take and will try to draft a PR building off of Martin's work in that direction. Hopefully that gives us something we can look at concretely and discuss next steps 🙂

cc @eriknw @quasiben (for awareness)

@jakirkham
Copy link
Member

Have pushed conflict resolutions here. Though think I'm going to squash the history, rebase, and start a new PR for simplicity. Will reference once it is is open

@martindurant
Copy link
Member Author

Think I'm going to squash the history, rebase, and start a new PR

Fine with me. Nice to see things moving!

@jakirkham
Copy link
Member

Thanks for all the work and support Martin! 😄

Continuing in PR ( #5839 )

@github-actions
Copy link
Contributor

Unit Test Results

0 files   -        12  0 suites   - 12   0s ⏱️ - 7h 25m 58s
0 tests  -   2 607  0 ✔️  -   2 528  0 💤  -      79  0 ±0 
0 runs   - 15 566  0 ✔️  - 14 375  0 💤  - 1 191  0 ±0 

Results for commit e2e96d8. ± Comparison against base commit d2d76c0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants