RFC: The WDL Extended Library #488

jdidion (Maintainer) · 2021-12-29T20:43:14Z

Problem

There has always been (and probably will always be) a tension in the WDL community between those who emphasize simplicity and those who emphasize ease of use. This tension is most commonly observed in discussions around the standard library: team simplicity prefers to avoid adding functions that aren't absolutely necessary while team usability wants to add functions that implement common operations, even if those could otherwise be implemented as tasks. There have been many proposals that attempt to address this, such as various approaches to user-defined functions, but all of these have met with at least some opposition from among the governance team.

Arguments for simplicity

Reduces the cognitive load of learning and using WDL.
Minimizes the functionality the runtime author needs to implement.
Reduces the need to introduce new WDL versions just to add new functions to the standard library.

Arguments for usability

Many operations would be cumbersome to implement as tasks. For example, there is currently a proposal for a values function that would take a Map and return the values of the Map as an Array. To implement this as a task, one would need to:
1. Write the Map to JSON.
2. Use a programming language like Python to read the JSON, extract the values, and write them out to a file.
3. Read the file in as an array.
Some operations may be challenging to implement correctly - it would be safer for them to be implemented once in the extended library (and perhaps optimized by the runtime).
These operations are typically computationally simple, meaning there would be a relatively large amount of overhead to launch a separate task that would complete very quickly.
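
As a rough illustration of step 2 above, here is a minimal Python sketch of the script such a task might run. The function name `extract_values` and the file layout are assumptions made for this example; nothing here is part of the proposal itself.

```python
import json
from pathlib import Path


def extract_values(map_json: Path, out_json: Path) -> None:
    """Read a JSON object from map_json and write its values to out_json as a JSON array.

    Hypothetical helper: the name and signature are illustrative only.
    """
    with map_json.open() as f:
        mapping = json.load(f)
    # json.load returns a dict that keeps the key order found in the file,
    # so the values are written in that same order.
    with out_json.open("w") as f:
        json.dump(list(mapping.values()), f)
```

Even in this small form, the task still needs a container with Python available plus the file round trip in steps 1 and 3, which is the overhead the points above describe.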

Proposal

I believe that both sides can be satisfied, and the benefits of both approaches can largely be achieved, by providing an "official" library of WDL tasks independently of the WDL specification.

In practice this would be a GitHub repository (perhaps this one, but probably better if it's separate) that contains a set of WDL tasks. I propose that all tasks (and any related structs) be contained in a single WDL file, but there could also be an argument made for each task being self-contained in its own WDL file.

Tasks would be added to the repository by the community. There would be development practices and standards that would have to be followed.

Ideally, the tasks will be importable using a short, easy-to-remember URL. I propose that we alias https://openwdl.org/<version> to the repository, so a user could write import "https://openwdl.org/1.1/lib.wdl" as lib.

Importantly, once a task is added to the official library, it cannot be renamed or removed (though it can be deprecated). This enables a runtime to optionally provide a built-in version of the task as an optimization. For example, when the runtime sees the import above, it can choose to replace any call to a task in the lib namespace with a call to a built-in function.
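
To make the substitution idea concrete, here is a hedged Python sketch of how a runtime might dispatch calls. Everything in it is hypothetical: `BUILTINS`, `execute_call`, `run_as_task`, and the `input_map` parameter name are invented for illustration, and no existing engine is claimed to work this way.

```python
from typing import Any, Callable, Dict

# The import URL suggested in the proposal. Treating it as the marker for the
# official library is an assumption of this sketch.
OFFICIAL_LIB_URL = "https://openwdl.org/1.1/lib.wdl"


def _builtin_values(input_map: Dict[str, Any]) -> list:
    """In-process replacement for a hypothetical `values` task."""
    return list(input_map.values())


# Hypothetical registry of built-in implementations, keyed by task name.
BUILTINS: Dict[str, Callable[..., Any]] = {"values": _builtin_values}


def run_as_task(call: str, inputs: Dict[str, Any]) -> Any:
    """Stand-in for the engine's normal task execution path."""
    raise NotImplementedError(f"would launch {call} as a regular task")


def execute_call(imports: Dict[str, str], call: str, inputs: Dict[str, Any]) -> Any:
    """Use a built-in only when the call targets the official library namespace."""
    namespace, _, task_name = call.partition(".")
    if imports.get(namespace) == OFFICIAL_LIB_URL and task_name in BUILTINS:
        return BUILTINS[task_name](**inputs)
    # Anything else, including library tasks without a built-in, runs normally.
    return run_as_task(call, inputs)
```

Because tasks cannot be renamed or removed, a lookup table like this would stay valid across library releases, and an engine with no built-ins at all would simply take the fallback path every time.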

rhpvorderman (Collaborator) · 2021-12-31T07:56:46Z

As a member of "team simplicity", I think the cognitive load of an extended official library is much higher than adding a few functions.

As for the functions: I see that I indeed mentioned values as a programming function that I opposed, but you make a pretty good case here. It is also quite easy to understand and to write: my_array = values(my_map). I shouldn't have posted such a knee-jerk reaction towards any programming feature. However, these should still be considered carefully. A function like map, for instance, by definition applies a function and is therefore quite a lot more complex.

0 replies

jdidion (Maintainer, Author) · 2021-12-31T18:30:28Z

I disagree on the cognitive load of an extended library. Since it is completely optional, a user can choose to ignore it and simply code their own tasks if they wish to do so.

2 replies

rhpvorderman Jan 3, 2022
Collaborator

That is not entirely true. You have no control over what code others write, and when you use their pipelines you have to read them to understand what is happening. Also, when you read back your own pipelines 6 months later, you will have forgotten most of the details. Readability and cognitive load count when you are not writing throwaway pipelines.
I think this is one thing that sets WDL apart from other workflow languages. It is easy to read.

jdidion Jan 3, 2022
Maintainer Author

Sorry, I don't understand how this relates to the topic at hand.

The sources of the tasks in the extended library are all public so anyone can read them the same way they'd read the task you'd write yourself. If anything they will be better implemented because they are subject to development standards and are code reviewed by the community.

If a runtime chooses to provide an optimized implementation of a task, the expectation is that it adheres to the same contract (inputs and outputs) as the task in the extended library. We can have the requirement that contributions to the extended library are accompanied by test cases so that runtimes can demonstrate their optimized implementations are conformant.
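
One way to picture such test cases, as a sketch only: each case pairs task inputs with the outputs the library task is expected to produce, and a runtime checks its optimized implementation against them. The `ConformanceCase` and `failing_cases` names below are assumptions for illustration, not an agreed format.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ConformanceCase:
    """Hypothetical test case shipped alongside a library task."""
    inputs: Dict[str, Any]
    expected_outputs: Dict[str, Any]


def failing_cases(
    impl: Callable[..., Dict[str, Any]], cases: List[ConformanceCase]
) -> List[ConformanceCase]:
    """Return the cases where the implementation's outputs differ from the contract."""
    return [case for case in cases if impl(**case.inputs) != case.expected_outputs]
```

An implementation would count as conformant when `failing_cases` returns an empty list for every case published with the task.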

markjschreiber (Collaborator) · 2022-01-03T15:45:33Z

If the tasks are executed by the server, this could have a negative impact on scalability and separation of concerns. Ideally the server/head process should only be interested in interpreting a workflow and distributing work to task executors. Executing UDFs on the server or head process is going to place unpredictable requirements on the size and capabilities of that node, which isn't going to be able to resize or provision resources on demand like a task can. This would also reduce portability.

9 replies

illusional Jan 3, 2022
Collaborator

Is a workaround to add a runtime hint for executing through the Cromwell parent process / head node? Or is there too much overhead for doing that, including localising a container, spinning up an environment etc. Unless there was ONE WDL engine function container, that you could spin up and exec functions against?

markjschreiber Jan 3, 2022
Collaborator

Could these extension functions not just be single-task sub-workflows that are imported into a user's workflow and then executed via the normal task execution process?

geoffjentry Jan 3, 2022
Maintainer

@markjschreiber I believe that's exactly what @jdidion is proposing.

The only difference is that the normal task execution process might be different for, say, MiniWDL vs Cromwell, and one of them might be choosing to apply an optimization in some way.

Edit: A key difference would be that there's no implication of a single "normal task execution process" under the hood, other than that the task would be processed at the appropriate time and produce the correct outputs.

vortexing Jan 4, 2022
Collaborator

Sometimes "normal task execution process" does not use docker. In that case, this would be unavailable to users, right?

geoffjentry Jan 4, 2022
Maintainer

I think people are viewing this as being a much more freewheeling prospect than is being proposed.

It's just something a bit more formal than a hint, combined with something like BioWDL that provides a set of tasks that are pretty ubiquitous.

geoffjentry (Maintainer) · 2022-01-03T16:14:10Z

I think this is a nice compromise on the tension of how to optimize common tasks.

Provided the standard library sticks to, well, common tasks - this would provide a set of WDL tasks that stay static enough that an implementation could choose to do something clever with them, while other implementations could treat them as standard tasks. I picture this as an evolution on hints.

6 replies

geoffjentry Jan 3, 2022
Maintainer

The hard part will be agreeing on what's common enough to be included in stdlib (and this is where the cognitive load that @rhpvorderman brings up will come in). But that's no worse a problem (and I'd argue a much easier one) than the one that currently exists, where the options are to have embedded language features or nothing.

jdidion Jan 3, 2022
Maintainer Author

Yes, the "community process" part of the proposal is carrying a lot of the weight. My suggestion is to start out very conservative - e.g. require N approvals before a PR with a new task can be merged, where N is reasonably large, say in the 3-5 range. Hopefully we can draft in enough WDL users to crowdsource a good classifier for the common/not-common decision.

geoffjentry Jan 3, 2022
Maintainer

Yep. I'm sure it's clear from context, but to be explicit, I'm in favor of this.

vortexing Jan 3, 2022
Collaborator

Yeah, figuring out what is "common" may play out like this: user X wants task X included, then users Y and Z want a Venn diagram of how that process occurs, and the result is four smaller tasks that each user ends up combining in slightly different ways, so you have to be really nitpicky about differentiating between all of them. I think this is where an example would come in handy, so people have a shared view of the scale of the tasks you're envisioning for the extended library.

At this point I'm assuming these are really really small things that aren't computationally intensive but are common processes that need to happen in WDL workflows... which then leads me back to, despite being on Team Simplicity, wanting more of these super basic tasks to just be functions in WDL. If these two parts of my brain could be happy with one solution, I'm in.

Right now I don't see providing a useful WDL task resource library (which should happen) as a replacement for improving the WDL spec itself. In fact, I see us creating an extended library of tasks in WDL, publicizing it, and then, when a release is coming up, voting on whether any of the tasks in the extended library have become so ubiquitous (and are computationally inexpensive) that they should be incorporated into the spec. Then folks would be REAL CLEAR on what exactly people are proposing to add to the spec and how it behaves.

jdidion Jan 3, 2022
Maintainer Author

I like that idea, and it's similar to how things work in other languages. E.g. in Python, many things that are in the standard library started out as separate packages and were added later, after their popularity had been demonstrated.

illusional (Collaborator) · 2022-01-03T20:27:15Z

I feel this is a slippery slope to what BioWDL is. I don't know what the user experience for referencing tasks from a repo is, but maybe we could document that explicitly and then make BioWDL an official "bioinformatics" channel of common tools. This would then be an explicit library of generic common operations, and you could have one for other domains too.

2 replies

jdidion Jan 3, 2022
Maintainer Author

In my opinion, the dividing line is whether a task requires bioinformatics-specific dependencies or is written for a bioinformatics-specific purpose.

Maybe it makes sense to say that tasks in the extended library must be written in pure bash with no external dependencies outside of what is available in a specific base Docker image (e.g. debian slim).

vortexing Jan 3, 2022
Collaborator

I think this caveat makes sense from the perspective of WHO is providing said library items. If it's provided by the WDL community then it should be stuff that doesn't require anything more complex than what might be installed in a base operating system type Docker image or maaaaaybe Python.

I'm thinking we're talking about things like values, where it's more of a file-parsing issue, or perhaps listing files in a directory to parse into a struct, or something wiring-related rather than anything bioinformatic.
