feat: add alternative schema for array items in RunListResponse #207

uniqueg · 2023-04-02T18:48:26Z

Alternative to #172
Closes #165

This PR differs from #172 in the following ways:

- A separate schema, RunSummary, is defined, to be used specifically for the items of the runs array in RunListResponse.
- RunSummary extends RunStatus (not modified!) with required additional fields.
- For backward compatibility, either RunSummary or RunStatus (anyOf) is an acceptable item type in responses to GET /runs.
- In addition to start_time and end_time, the workflow name and tags are included in the extended response; workflow_url was not included, as it can point to a local object, and supporting it would therefore require WES to serve objects (up for debate of course, following #172 "feat: Add extra fields to RunStatus to provide additional information" and #165 "Proposal: Add timestamps to RunStatus").

Taken together, this proposal avoids or partly addresses the following concerns raised against #172:

- "Proposal has side effects": By using a separate schema to accommodate the additional metadata fields to be returned in run list responses, the RunStatus schema remains untouched, thus retaining the behavior of GET /runs/{run_id}/status (which also uses RunStatus).
- "Implementers may choose to ignore added fields": While implementers can still return RunStatus instead of RunSummary (or indeed any other response that has the run_id and state properties) in run list responses for backward compatibility (anyOf), this PR adds a couple of measures to more strongly encourage the use of the new extended schema. In particular:
  - A DEPRECATION WARNING is added to the description of RunListResponse, informing implementers that the use of RunStatus in RunListResponse will be discontinued in favor of RunSummary only.
  - The new schema requires the additional fields in RunSummary, which, together with the deprecation warning, should tell every implementer clearly where the specs will be going in this respect; indeed, as soon as support for RunStatus in RunListResponse is removed (a simple change), any compliant implementation MUST provide these additional fields in their run list responses.

It does NOT address the concern that "clients may not agree with selection of fields".
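
To illustrate what the anyOf means for clients, here is a minimal Python sketch that accepts either shape for an item of the runs array. It assumes the field names discussed in this PR (run_id, state, name, start_time, end_time, tags) and that run_id is the only field a client can count on; the RunListItem class, the EXTENDED_FIELDS tuple and the parse_run_item helper are hypothetical and not part of the spec.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Fields that (in this sketch) distinguish a RunSummary from a plain RunStatus.
EXTENDED_FIELDS = ("name", "start_time", "end_time", "tags")


@dataclass
class RunListItem:
    """Hypothetical client-side view of one entry in the `runs` array."""

    run_id: str
    state: Optional[str] = None
    # The fields below are only populated for the extended RunSummary shape.
    name: Optional[str] = None
    start_time: Optional[str] = None
    end_time: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)
    # True when every extended field was present in the response item.
    extended: bool = False


def parse_run_item(item: Dict[str, Any]) -> RunListItem:
    """Accept either a RunStatus-like or a RunSummary-like dictionary."""
    if "run_id" not in item:
        raise ValueError("run list item is missing 'run_id'")
    return RunListItem(
        run_id=item["run_id"],
        state=item.get("state"),
        name=item.get("name"),
        start_time=item.get("start_time"),
        end_time=item.get("end_time"),
        tags=dict(item.get("tags") or {}),
        extended=all(key in item for key in EXTENDED_FIELDS),
    )
```

A client written this way keeps working against servers that still return RunStatus, and can check the extended flag to decide whether the additional columns can be shown.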

patmagee

Thank you @uniqueg for putting this together! I think it summarizes the entirety of our discussion in the other thread in a very clean and concise way.

Overall I am in favour of this change! It addresses my original concerns and resolves a few complexities with the breaking changes my PR introduced.

patmagee · 2023-04-05T20:27:52Z

openapi/workflow_execution_service.openapi.yaml

+            end_time:
+              type: string
+              description: When the run stopped executing (completed, failed, or cancelled), in ISO 8601 format "%Y-%m-%dT%H:%M:%SZ"
+            tags:


I am in favour of this, but I would want to check with other implementors of WES (i.e., @wleepang) to see if this is a non-starter for them.

I'm in favor of this as well

patmagee · 2023-04-05T20:28:21Z

openapi/workflow_execution_service.openapi.yaml

+        - $ref: '#/components/schemas/RunStatus'
+        - type: object
+          properties:
+            name:


Similar to tags, I am largely in favour of this attribute, but I would probably want to ensure this is accessible by implementors before adding it

Where would this come from? The current RunRequest doesn't have an explicit attribute for name. IIRC, the way we've handled it is with a specific tag. That said, if name is important enough to be used by other parts of the API, it might be worth elevating it to a top level property of RunRequest.

It's a property of the Log model:

Log:
  title: Log
  type: object
  properties:
    name:
      type: string
      description: The task or workflow name

patmagee · 2023-04-05T20:29:13Z

openapi/workflow_execution_service.openapi.yaml

+            anyOf:
+              - $ref: '#/components/schemas/RunStatus'
+              - $ref: '#/components/schemas/RunSummary'


I like this approach, since it does not break the API, but strongly suggests to implementors to use the new approach

uniqueg · 2023-04-06T10:35:50Z

Thanks @patmagee :) I would still like to prepare a PR for the solution proposed here, before reaching out to implementers to vote. But I'm afraid it's gonna take me another two weeks or so.

patmagee · 2023-04-10T18:08:32Z

@uniqueg can you please rebase this on develop

patmagee · 2023-04-10T18:09:10Z

@wleepang @cjllanwarne I wonder what your thoughts are on this modified PR?

patmagee · 2023-04-12T12:13:11Z

openapi/workflow_execution_service.openapi.yaml

+          required:
+            - name
+            - start_time
+            - end_time
+            - tags


As much as I like the idea of these being all required, I realized that this information is not always going to be available.

- name: this requires parsing of a workflow, which will not be immediately available in systems that lazily evaluate submissions, like Cromwell. Also, I am not sure what the name would be for a Snakemake workflow.
- start_time and end_time: before the run has started or finished, these values may not be available.
- tags: so long as everyone is happy with an empty map, this is fine to be required.

I am still in favour of extending the RunStatus object like you are doing here, and I think RunSummary is a better approach overall. But thinking about this more, we probably want most of these to be optional, simply out of necessity.

You are probably right.

But then we could set defaults for these values (e.g., empty strings). That way we would perhaps encourage implementers a little more to give their best at producing these values, wherever possible. And on the client side we could rely on these fields being available.

In general I prefer null to an empty value like "" (empty maps are fine). Empty strings end up requiring a bunch of edge case handling on the client side, i.e., in typed systems you now need to ensure that non-null values can actually be deserialized as the expected type.

For example, I can represent a RunSummary in java like the following:

public record RunSummary(String id, String name, Instant start_time, Instant end_time, Map<String,String> tags){ }

Now, if I actually wanted to use this record when receiving responses, I would not be able to deserialize the value directly, but would need a "holding" class, because "" is not a valid timestamp. Trying to deserialize the following using a library like Jackson would result in a JsonProcessingException or a DateTimeParseException.

{ "id": "foo", "name": "bar", "start_time": "", "end_time": "", "tags": {} }

Whereas the following would work:

{ "id": "foo", "name": "bar", "start_time": null, "end_time": null, "tags": {} }
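
The same contrast can be reproduced outside Java. Below is a minimal Python sketch using only the standard library and the timestamp format quoted in the end_time description; parse_time is a hypothetical helper, not something defined by WES.

```python
import json
from datetime import datetime
from typing import Optional

# Format string taken from the end_time description in the schema diff.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_time(value: Optional[str]) -> Optional[datetime]:
    """Map JSON null to None; any other value must be a valid timestamp."""
    if value is None:
        return None
    # datetime.strptime raises ValueError for "" because it does not match.
    return datetime.strptime(value, TIME_FORMAT)


with_null = json.loads(
    '{"id": "foo", "name": "bar", "start_time": null, "end_time": null, "tags": {}}'
)
with_empty = json.loads(
    '{"id": "foo", "name": "bar", "start_time": "", "end_time": "", "tags": {}}'
)

# null deserializes cleanly to "no value".
assert parse_time(with_null["start_time"]) is None

# The empty string needs explicit edge-case handling on the client.
try:
    parse_time(with_empty["start_time"])
    empty_string_parsed = True
except ValueError:
    empty_string_parsed = False
```

With null the client needs a single check; with "" every consumer has to remember the special case before parsing.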

Overall, I think this is sensible and is what I'd expect from JSON. That said, I'd have to see if our implementation can handle this. I know in the past we've had issues handling null values when the client library uses types that can't be nullable.

I don't think name requires parsing of a workflow definition. As I mentioned in a separate comment, name is effectively a property of RunRequest (possibly defined as a tag); that is, it is the name of the run and not the name of the workflow. For that I think I'll backtrack a bit and say that tags is the better place to hold this information and that both a run name and workflow name are a good idea.

For example:

workflow names can be:

- nf-core-sarek-3.1.2
- gatk4-variant-discovery

run names can be:

- sample-id-12345
- lims-id-{{uuid}}

Having these as tags presents a way to filter the runs list response along the lines of:

- "how many runs of a specific workflow are there?"
- "how many workflows or runs are processing data for a specific sample?"

We would already have unique identifiers for runs (the run_id). Having an optional human-readable alternative to these is useful, I think, but would indeed go well in tags, as it's really just an alternative.

Workflow names, on the other hand, are extremely useful. I would venture to say that these would be among the features which users filter their runs by most often, if not the most commonly used of all. If I run a lot of analyses, next to inputs, I think that the workflow "name" is really what I would look for first to narrow things down. And with that, I think we could indeed add an optional workflow_name to RunRequest (I think TES has something similar), and then encourage implementers to provide a reasonable default (e.g., repo name, TRS resource name, workflow URL suffix) if a workflow name is not provided by the client. For this, tags feels a little too optional for me personally.

However, I wouldn't strongly object to having both in tags.
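
As a rough idea of what a "reasonable default" could look like on the server side, here is a Python sketch that falls back to the workflow URL suffix when the client supplies no name. Everything here is an assumption for illustration: default_workflow_name, its parameters and the KNOWN_EXTENSIONS list are hypothetical, and the example URL is made up.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical list of workflow file extensions to strip from the suffix.
KNOWN_EXTENSIONS = (".wdl", ".cwl", ".nf", ".smk")


def default_workflow_name(workflow_url: str, requested_name: Optional[str] = None) -> str:
    """Prefer the client-supplied name; otherwise use the URL suffix."""
    if requested_name:
        return requested_name
    # Take the last path segment of the URL (also works for relative paths).
    path = urlparse(workflow_url).path.rstrip("/")
    basename = posixpath.basename(path)
    # Only strip known extensions, so version suffixes like "3.1.2" survive.
    for ext in KNOWN_EXTENSIONS:
        if basename.endswith(ext):
            basename = basename[: -len(ext)]
            break
    return basename or workflow_url


name = default_workflow_name("https://example.org/workflows/gatk4-variant-discovery.wdl")
```

Deriving the name from a repository or a TRS resource would follow the same pattern, with the lookup replaced accordingly.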

If name is a property of RunSummary, I'd prefer the spec to be clear about what item is being named. If part of RunSummary, my expectation is that it is the name of the run and not the name of the workflow the run invoked.

With just workflow name, and assuming tags doesn't provide additional human readable information to help disambiguate, a list response would look like:

[
  { "id": "e81da9ff-f5a8-4f20-9d5f-772efc4a93b4", "start_time": "YYYY-MM-DD HH:MM:SS Z", "end_time": "YYYY-MM-DD HH:MM:SS Z", "name": "nf-core-rnaseq", "tags": { ... } },
  { "id": "1ae61546-7cfa-4959-bf28-247ce1390508", "start_time": "YYYY-MM-DD HH:MM:SS Z", "end_time": "YYYY-MM-DD HH:MM:SS Z", "name": "nf-core-rnaseq", "tags": { ... } },
  { "id": "c9be874b-5b9c-48b3-879e-40986fe9dbf0", "start_time": "YYYY-MM-DD HH:MM:SS Z", "end_time": "YYYY-MM-DD HH:MM:SS Z", "name": "nf-core-rnaseq", "tags": { ... } },
  { "id": "86aaf850-b2c4-4c9d-b898-16ed3ce1daba", "start_time": "YYYY-MM-DD HH:MM:SS Z", "end_time": "YYYY-MM-DD HH:MM:SS Z", "name": "nf-core-rnaseq", "tags": { ... } }
]

This would make it challenging for an end user to find the specific run that, say, processed sample "LIMSJVLO6YWS". Omitting name from the top level of RunSummary and having explicit tags for workflow_name and run_name looks like:

[
  { "id": "e81da9ff-f5a8-4f20-9d5f-772efc4a93b4", "start_time": "YYYY-MM-DD HH:MM:SS Z", "end_time": "YYYY-MM-DD HH:MM:SS Z", "tags": { "workflow_name": "nf-core-rnaseq", "run_name": "LIMS2ZR5ZS29" } },
  { "id": "1ae61546-7cfa-4959-bf28-247ce1390508", "start_time": "YYYY-MM-DD HH:MM:SS Z", "end_time": "YYYY-MM-DD HH:MM:SS Z", "tags": { "workflow_name": "nf-core-rnaseq", "run_name": "LIMSXXOFXDMW" } },
  { "id": "c9be874b-5b9c-48b3-879e-40986fe9dbf0", "start_time": "YYYY-MM-DD HH:MM:SS Z", "end_time": "YYYY-MM-DD HH:MM:SS Z", "tags": { "workflow_name": "nf-core-rnaseq", "run_name": "LIMSJVLO6YWS" } },
  { "id": "86aaf850-b2c4-4c9d-b898-16ed3ce1daba", "start_time": "YYYY-MM-DD HH:MM:SS Z", "end_time": "YYYY-MM-DD HH:MM:SS Z", "tags": { "workflow_name": "nf-core-rnaseq", "run_name": "LIMS42IXEIFO" } }
]

IMO this provides a more useful response and maintains flexibility for clients to display or filter information how they need.
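
To make the filtering use case concrete, here is a small Python sketch of client-side filtering on tags. It is only an illustration: filter_by_tags is a hypothetical helper, the filtering happens on the client after fetching the list (no server-side tag filter is assumed), and the sample data reuses the ids and tag values from the tag-based response layout, with the placeholder timestamps left out.

```python
from typing import Any, Dict, Iterable, List


def filter_by_tags(runs: Iterable[Dict[str, Any]], **wanted: str) -> List[Dict[str, Any]]:
    """Return the runs whose tags contain every requested key/value pair."""
    return [
        run
        for run in runs
        if all((run.get("tags") or {}).get(key) == value for key, value in wanted.items())
    ]


runs = [
    {"id": "e81da9ff-f5a8-4f20-9d5f-772efc4a93b4",
     "tags": {"workflow_name": "nf-core-rnaseq", "run_name": "LIMS2ZR5ZS29"}},
    {"id": "1ae61546-7cfa-4959-bf28-247ce1390508",
     "tags": {"workflow_name": "nf-core-rnaseq", "run_name": "LIMSXXOFXDMW"}},
    {"id": "c9be874b-5b9c-48b3-879e-40986fe9dbf0",
     "tags": {"workflow_name": "nf-core-rnaseq", "run_name": "LIMSJVLO6YWS"}},
]

# "Which run processed sample LIMSJVLO6YWS?"
sample_runs = filter_by_tags(runs, run_name="LIMSJVLO6YWS")

# "How many runs of a specific workflow are there?"
rnaseq_count = len(filter_by_tags(runs, workflow_name="nf-core-rnaseq"))
```

Runs without a tags map are simply skipped by the filter, so the same helper also tolerates servers that return the plain RunStatus shape.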

A compromise would be to maintain semantic consistency where the name property in RunSummary is the run_name and workflow_name is additional metadata provided in tags:

[
  { "id": "e81da9ff-f5a8-4f20-9d5f-772efc4a93b4", "start_time": "YYYY-MM-DD HH:MM:SS Z", "end_time": "YYYY-MM-DD HH:MM:SS Z", "name": "LIMS2ZR5ZS29", "tags": { "workflow_name": "nf-core-rnaseq" } },
  { "id": "1ae61546-7cfa-4959-bf28-247ce1390508", "start_time": "YYYY-MM-DD HH:MM:SS Z", "end_time": "YYYY-MM-DD HH:MM:SS Z", "name": "LIMSXXOFXDMW", "tags": { "workflow_name": "nf-core-rnaseq" } },
  { "id": "c9be874b-5b9c-48b3-879e-40986fe9dbf0", "start_time": "YYYY-MM-DD HH:MM:SS Z", "end_time": "YYYY-MM-DD HH:MM:SS Z", "name": "LIMSJVLO6YWS", "tags": { "workflow_name": "nf-core-rnaseq" } },
  { "id": "86aaf850-b2c4-4c9d-b898-16ed3ce1daba", "start_time": "YYYY-MM-DD HH:MM:SS Z", "end_time": "YYYY-MM-DD HH:MM:SS Z", "name": "LIMS42IXEIFO", "tags": { "workflow_name": "nf-core-rnaseq" } }
]

wleepang · 2023-04-13T16:11:27Z

I think tags is more important than name for this response. (see my comment above).

From a client perspective, runid, (start|end)_time, and status are certainly must haves. I'd say tags is required and it is strongly recommended that "run_name" and "workflow_name" tags are included.

uniqueg · 2023-04-13T23:11:11Z

> I think tags is more important than name for this response. (see my comment above).
>
> From a client perspective, runid, (start|end)_time, and status are certainly must haves. I'd say tags is required and it is strongly recommended that "run_name" and "workflow_name" tags are included.

I'm with you. It seems to me like we all agree and it's mostly a semantic argument. The main point is that we want to encourage implementers as much as possible to make this info available, without making their life unduly hard if, for whatever reason, they can't. So let it be in tags :)

patmagee · 2023-04-14T10:41:51Z

@uniqueg once you update this PR to remove name and make fields like start_time and end_time optional, I will go ahead and merge it!

uniqueg · 2023-04-17T14:20:58Z

> @uniqueg once you update this PR to remove name and make fields like start_time and end_time optional, I will go ahead and merge it!

Done

patmagee approved these changes Apr 5, 2023


feat: add alternative schema for array items in RunListResponse

9e2f986

uniqueg force-pushed the run-list-response-alternative-schema branch from 96cc1b8 to 9e2f986 Compare April 12, 2023 01:42

patmagee reviewed Apr 12, 2023


remove 'name', make 'start_time'/'end_time' optional

4107db9

patmagee merged commit 88d0b8d into develop Apr 18, 2023

This was referenced Apr 18, 2023

feat: Add extra fields to RunStatus to provide additional information #172

Closed

RC-v1.2.0 #208

Closed

patmagee mentioned this pull request Jun 8, 2023

RC-v1.1.0 #210

Merged
