Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Yaml spec expressiveness #90

Merged
merged 186 commits into from
Jul 31, 2024
Merged

Yaml spec expressiveness #90

merged 186 commits into from
Jul 31, 2024

Conversation

cisaacstern
Copy link
Collaborator

@cisaacstern cisaacstern commented Jul 19, 2024

WIP closes #50 when complete

Opening as a draft just to get ideas out there

@cisaacstern cisaacstern changed the title Yaml spec verboseness Yaml spec expressiveness Jul 20, 2024
@cisaacstern cisaacstern mentioned this pull request Jul 31, 2024
@cisaacstern
Copy link
Collaborator Author

cisaacstern commented Jul 31, 2024

Thank you in advance to all reviewers for bearing with the scope here. To recap above comments, this PR:

Closes #50
Closes #105
Closes #106
Closes #107

The first of which was an early brainstorming issue for this work, and the latter three of which are essential for getting an end-to-end patrols workflow to run.

As suggested by #50, at least some of the syntax here should be familiar from GitHub Workflows (with for keyword arguments, ${{ }} for variable references, namespacing of variable contexts), but I haven't followed GitHub exactly, in the interest of simplification where appropriate, and also because certain features we need (i.e. "map") don't have 1:1 analogs in GitHub Workflows.

Note that while this PR does implement mode: "map" for tasks, it does not:

In terms of suggestions for reviewing, some approachable starting places might be:

  • README diff
  • examples/compilation-specs/* YAML spec diffs
  • Docstrings and field descriptions in compiler.py (especially for TaskInstance)
  • Compiler tests

Of lesser consequence (unless you're specifically interested) might be the fact that I've refactored some of the jinja logic into macros to make it more developer-friendly & readable, and also the details of the compiled workflows, which are all run end-to-end in test examples.

I'll note also that I don't feel especially dogmatic about much here, my intention is not necessarily to set this spec in stone as it is shown here, and I fully expect we want to iterate further on this. Rather, I am attempting here to just put in place the "minimal" requirements for defining a spec that can handle and end-to-end patrols workflow. (Minimal in quotes because it's kind of a lot, but that's because our earlier draft was pretty sparse!)

Please let me know of any questions, comments, or concerns!

# `observations`, and we want the value passed to this argument to be the return
# value of the task instance with id `obs`, which is the root task of this workflow.
# note that (a) variables which resolve based on the outputs of other tasks are wrapped
# in `${{ }}`; and (b) return values of other task instances are referenced by strings
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so (b) means a task has no previous dependencies?

Copy link
Collaborator Author

@cisaacstern cisaacstern Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that I totally understand your question but I think the answer is no.

(b) reads in full:

# ... (b) return values of other task instances are referenced by strings
# having the structure `workflow.<id>.return`.

which is meant to indicate, for example, that we can reference the return value of a task instance defined like this:

- name: Get SubjectGroup Observations from EarthRanger
  id: obs
  task: get_subjectgroup_observations

using a with block like this:

with:
  observations: ${{ workflow.obs.return }}


The inline comments in this example explain what each line means:
```yaml
- name: # [required] a human readable name for the step
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where do we use name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now this is used in a few ways:

  1. To provide descriptive error messages pointing the user to a particular task instance in their workflow, e.g.
    expected_error_text = re.escape(
    "All task instance `id`s must be unique in the workflow. Found duplicate ids: "
    "id='obs' is shared by ['Get Subjectgroup Observations', 'Process Relocations']"
    )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. To provide descriptive comments identifying which task instance a block of code or configuration is for, if that code or config is expected to be human-readable. So in the fillable yaml, e.g.:
    # Parameters for 'Get Patrol Observations from EarthRanger' using task `get_patrol_observations`.
    patrol_obs:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your question made me realize two things I needed to fix:

  • We should also be using these names for the cell headers in jupytext, fixed that just now in 4ce2ab5
  • In the task instances, these names are the human-readable identifiers, whereas the ids are the programmatic identifiers. Therefore I realized it was confusing that the top-level name of the compilation spec was (as used in this PR) in fact not supposed to be a human-readable name, but rather a programmatic identifier. I therefore renamed this field to id in fc97e10

with:
# FIXME: i had this typo and it was not caught by validator, i.e.
# we cannot be allow texts to reference their own ids as dependencies
# data: ${{ patrol_events_map_widget.return }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a validator of this spec?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixed that just now in eb0e967

)
return self

# TODO: on __init__ (or in cached_property), sort tasks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the current order of execution?

Copy link
Collaborator Author

@cisaacstern cisaacstern Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current order of execution follows the order of the task instance list in the spec, so for a spec like

id: calculate_time_density
workflow:
  - name: Get Subjectgroup Observations
    id: obs
    task: get_subjectgroup_observations
  - name: Process Relocations
    id: relocs
    task: process_relocations
    with:
      observations: ${{ workflow.obs.return }}
  - name: Transform Relocations to Trajectories
    id: traj
    task: relocations_to_trajectory
    with:
      relocations: ${{ workflow.relocs.return }}

the execution order would be:

  1. Get Subjectgroup Observations
  2. Process Relocations
  3. Transform Relocations to Trajectories

This also applies if we are combining DAG branches (via either reduction or mapping) so the order of execution for a reduction like

id: create_dashboard
workflow:
  - name: Create Map Widget Single View
    id: map_widget
    task: draw_ecomap
  - name: Create Plot Widget Single View
    id: plot_widget
    task: draw_ecoplot
  - name: Gather Dashboard
    id: dashboard
    task: gather_dashboard
    with:
      widgets:
        - ${{ workflow.map_widget.return }}
        - ${{ workflow.plot_widget.return }}

would be:

  1. Create Map Widget Single View
  2. Create Plot Widget Single View
  3. Gather Dashboard

And for a mapping like

id: calculate_time_density
workflow:
  - name: Get Subjectgroup Observations A
    id: obs_a
    task: get_subjectgroup_observations
  - name: Get Subjectgroup Observations B
    id: obs_b
    task: get_subjectgroup_observations
  - name: Draw Ecomaps
    id: ecomaps
    task: draw_ecomap
    mode: "map"
    iter:
      geodataframe:
        - ${{ workflow.obs_a.return }}
        - ${{ workflow.obs_b.return }}

would be:

  1. Get Subjectgroup Observations A
  2. Get Subjectgroup Observations B
  3. Draw Ecomaps

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this will obviously create problems if all dependencies of a task are not already executed by the time it is executed (because then some of its specified arguments will be undefined).

While I was writing this response I realized that it's actually not that hard to raise a ValidationError if the list is not correctly sorted, so I added that check in a2cc7da + e30c013

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check doesn't actually sort the dag (which the comment you were asking about suggested we do) ... but the more I think about it, I realized I am not sure if we actually want to do that or not. I opened #118 for us to consider that.

@cisaacstern
Copy link
Collaborator Author

Thanks all for the reviews!

Particularly @Yun-Wu whose questions prompted a quick series of last-minute fixes + tweaks + enhancements which I would not have otherwise thought of.

Going to merge to keep moving forward.

@cisaacstern cisaacstern merged commit 2189405 into main Jul 31, 2024
4 checks passed
@cisaacstern cisaacstern deleted the more-verbose-yaml-spec branch July 31, 2024 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rework compilation spec schema to look more like GitHub Workflows
3 participants