Use annotations machinery for workers/priority/... #4347
Conversation
Previously we had two systems to send per-task metadata like retries or workers or priorities to the scheduler:

1. An older system with explicit `workers=` keywords and `expand_foo` functions
2. A newer system with annotations

The annotations system is nicer for a few reasons:

1. It's more generic
2. It's more consistent (there were some bugs in the `expand_foo` functions, especially when dealing with collections)
3. We ship values up on a per-layer basis rather than a per-key basis

This work-in-progress commit rips out the old system and uses the new system, but it is still missing a lot:

1. It only implements this for the `Client.compute` method. We need to repeat this for `persist`, `submit`, and `map`
2. It doesn't handle the `allow_other_workers` -> `loose_restrictions` conversion anywhere yet (this will need to be added to the scheduler)

cc @sjperkins if you have time I'd love your thoughts on this approach. Also, if you agree with this approach and want to take over this PR that would also be welcome :)
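To make the contrast between the two systems concrete, here is a hedged sketch (assuming a running distributed `Client`; `retries=` and `priority=` are existing `Client.compute` keywords, and `dask.annotate` is the newer annotations API):

```python
import dask
import dask.array as da
from distributed import Client

client = Client()  # spins up a local cluster

# Older system: per-call keywords, expanded per-key on the client side
x = da.ones((1000, 1000), chunks=(100, 100)).sum()
future = client.compute(x, retries=3, priority=10)

# Newer system: annotations recorded on the graph layers as they are
# built, shipped to the scheduler per-layer rather than per-key
with dask.annotate(retries=3, priority=10):
    y = da.ones((1000, 1000), chunks=(100, 100)).sum()
future2 = client.compute(y)
```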
Firstly, I think placing the existing taxonomy into an annotation dictionary for scheduler transmission is the right way forward, as it's a cleaner mechanism than having separate arguments to `update-graph-hlg`.

My initial reaction to global annotations (as described in #4306 (comment)):

```python
with dask.annotate(foo="baz"):
    x.compute()
```

is that they shouldn't be supported, as I've always thought of annotations as a fine-grained mechanism for describing a task. I think that "global" annotations should be discouraged. Perhaps my view is too narrow? If really desired, the following pattern would also work:

```python
with dask.annotate(foo="baz"):
    x = construct_graph()
    x.compute()
```

However, the existing API should probably be supported for backwards compatibility (as is started in this PR):

```python
client.compute(x, retries=3)
```

I may have time to look into this in more detail next week.
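As a side note on why the annotate block must wrap graph *construction*: `dask.annotate` records its values on the `HighLevelGraph` layers created inside the block, so annotating only around `.compute()` has no effect on a graph built earlier. A minimal sketch of inspecting this, assuming a dask version where `Layer.annotations` is exposed (layer names are internal and version-dependent):

```python
import dask
import dask.array as da

# Layers built inside the block carry the annotation;
# layers built outside it would have annotations == None.
with dask.annotate(retries=3):
    x = da.ones(10, chunks=5) + 1

hlg = x.__dask_graph__()  # a HighLevelGraph
for name, layer in hlg.layers.items():
    print(name, layer.annotations)  # e.g. {'retries': 3}
```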
This PR doesn't have an opinion on global vs. local annotations in the compute call. I suggested this in a previous issue, but so far all we're doing here is replacing the internal machinery.
@ian-r-rose, if you have time to carry on this PR, that would be welcome.
Sounds good
One use case I'd like to enable is to keep track of different jobs running on the cluster:

```python
with dask.annotate(job_name="weather-etl", job_id=uuid4().hex):
    x.compute()
```

I think that would require global annotations?
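Under the pattern suggested earlier in the thread (annotations are picked up at graph-construction time), a hedged sketch of this use case would wrap the graph-building step as well. `load_weather_data` is a hypothetical stand-in for whatever builds the collection:

```python
from uuid import uuid4

import dask
import dask.array as da

# Hypothetical helper; stands in for the real ETL graph construction.
def load_weather_data():
    return da.random.random((1000, 1000), chunks=(100, 100)).mean()

# Annotate around construction *and* compute, so the job metadata is
# recorded on every layer of the graph shipped to the scheduler.
with dask.annotate(job_name="weather-etl", job_id=uuid4().hex):
    x = load_weather_data()
    result = x.compute()
```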
Closing as this work was superseded by #4406