Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add alternative schema for array items in RunListResponse #207

Merged
merged 2 commits into from
Apr 18, 2023

Conversation

uniqueg
Copy link
Contributor

@uniqueg uniqueg commented Apr 2, 2023

Alternative to #172
Closes #165

This PR differs from #172 in the following ways:

  • A separate schema RunSummary is defined, to be used specifically for the runs array items in RunListResponse
  • RunSummary extends RunStatus (not modified!) with required additional fields
  • For backward compatibility, either RunSummary or RunStatus (anyOf) are acceptable responses for GET /runs
  • Next to start_time and end_time, also workflow name and tags are included in the extended response; workflow_url was not included, as it can point to a local object and its support would therefore require WES to serve objects (up for debate of course, following feat: Add extra fields to RunStatus to provide additional information #172 and Proposal: Add timestamps to RunStatus #165)

Taken together, this proposal avoids or partly addresses the following concerns raised against #172:

  • "Proposal has side effects:" By using a separate schema to accommodate the additional metadata fields to be returned in run list responses, the RunStatus schema remains untouched, thus retaining the behavior of GET /runs/{run_id}/status (which also uses RunStatus).
  • "Implementers may choose to ignore added fields": While implementers can still return RunStatus instead of RunSummary (or indeed any other response that has the run_id and state properties) in run list responses for backward compatibility (anyOf), this PR adds a couple of measures to more strongly encourage the use of the new extended schema. In particular,
    • A DEPRECIATION WARNING is added to the description of RunListResponse, informing implementers that the use of RunStatus in RunListResponse will be discontinued in favor of RunSchema only.
    • The new schema requires the additional fields in RunSummary, which - together with the depreciation warning - should tell every implementer clearly where the specs will be going in this respect; indeed, as soon as support for RunStatus in RunListResponse is removed (a simple change), any compliant implementation MUST provide these additional fields in their run list responses.

It does NOT address the concern that "clients may not agree with selection of fields".

Copy link
Contributor

@patmagee patmagee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @uniqueg For putting this together! I think it has succinctly summarized the entirety of our discussion in the other thread in a very clean and concise way.

Overall I am in favour of this change! I addressed my original concerns and solves a few complexities with the breaking changes my PR introduced

end_time:
type: string
description: When the run stopped executing (completed, failed, or cancelled), in ISO 8601 format "%Y-%m-%dT%H:%M:%SZ"
tags:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favour of this , but I would want to check with other implementors of WES (ie @wleepang) to see if this is a non-starter for them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of this as well

- $ref: '#/components/schemas/RunStatus'
- type: object
properties:
name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to tags, I am largely in favour of this attribute, but I would probably want to ensure this is accessible by implementors before adding it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where would this come from? The current RunRequest doesn't have an explicit attribute for name. IIRC, the way we've handled it is with a specific tag. That said, if name is important enough to be used by other parts of the API, it might be worth elevating it to a top level property of RunRequest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a property of the Log model:

    Log:
      title: Log
      type: object
      properties:
        name:
          type: string
          description: The task or workflow name

Comment on lines +847 to +849
anyOf:
- $ref: '#/components/schemas/RunStatus'
- $ref: '#/components/schemas/RunSummary'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach, since it does not break the API, but strongly suggests to implementors to use the new approach

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

@uniqueg
Copy link
Contributor Author

uniqueg commented Apr 6, 2023

Thanks @patmagee :) I would still like to prepare a PR for the solution proposed here, before reaching out to implementers to vote. But I'm afraid it's gonna take me another two weeks or so.

@patmagee
Copy link
Contributor

@uniqueg can you please rebase this on develop

@patmagee
Copy link
Contributor

@wleepang @cjllanwarne I wonder what your thoughts are on this modified PR?

@uniqueg uniqueg force-pushed the run-list-response-alternative-schema branch from 96cc1b8 to 9e2f986 Compare April 12, 2023 01:42
Comment on lines 668 to 672
required:
- name
- start_time
- end_time
- tags
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As much as I like the idea of these being all required, I realized that this information is not always going to be available.

  • name: this requires parsing of a workflow, which in systems that lazily evaluate submissions like Cromwell will not be immediately available. Also I am not sure what would be the name for a snake make workflow
  • start_time and end_time before the run has started, or has yet to finish these values may not be available
  • tags: so long as everyone is happy with an empty map, this is fine to be required."

I am still in favour of extending the RunStatus object like you are doing here, I think the RunSummary is a better approch overall. But, Thinking about this more we probably want most of these as optional, simply out of necessity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are probably right.

But then we could set defaults for these values (e.g., empty strings). That way we would perhaps encourage implementers a little more to give their best at producing these values, wherever possible. And on the client side we could rely on these fields being available.

Copy link
Contributor

@patmagee patmagee Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I prefer null to an empty value like "" (empty maps are fine). Empty strings end up requiring a bunch of edge case handling on the client side, ie in typed systems you now need to ensure that non null values can actually be deserialized as the expected type.

For example, I can represent a RunSummary in java like the following:

public record RunSummary(String id, String name, Instant start_time, Instant end_time, Map<String,String> tags){
}

Now, if I actually wanted to use this record when receiving responses, I would not be able to directly deserialize the value directly, but would now need to have a "holding" class because "" is not a valid Date. Trying to deserialize the following using a library like Jackson would result in a JsonProcessingException, OR a DateTimeParseException.

{
  "id": "foo",
  "name": "bar",
  "start_time": "",
  "end_time": ""
  "tags": {}
} 

Whereas the following would work:

{
  "id": "foo",
  "name": "bar",
  "start_time": null,
  "end_time": null
  "tags": {}
} 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think this is sensible and is what I'd expect from JSON. That said, I'd have to see if our implementation can handle this. I know in the past we've had issues handling null values when the client library uses types that can't be Nullible.

Copy link
Contributor

@wleepang wleepang Apr 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think name requires parsing of a workflow definition. As I mentioned on a separate comment, name is effectively a property of RunRequest (possibly defined as a tag) - that is it is the name of the run and not the name of the workflow. For that I think I'll backtrack a bit and say that tags is the better place to hold this information and that both a run name and workflow name are a good idea.

For example:

  • workflow names can be:
    • nf-core-sarek-3.1.2
    • gatk4-variant-discovery
  • run names can be:
    • sample-id-12345
    • lims-id-{{uuid}}

Having these as tags presents a way to filter the runs list response along the lines of:

  • "how many runs of a specific workflow are there?"
  • "how many workflows or runs are processing data for a specific sample?"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would already have unique identifiers for runs (the run_id). Having an optional human-readable alternative to these is useful, I think, but would indeed go well in tags, as it's really just an alternative.

Workflow names, on the other hand, are extremely useful. I would venture to say that these would be among the features which users filter there reads by most often, if not the most commonly used of all. If I run a lot of analyses, next to inputs, I think that the workflow "name" is really what I would look for first to narrow things down. And with that, I think we could indeed add an optional workflow_name to RunRequest (I think TES has something similar) - and then encourage implementers to provide a reasonable default (e.g., repo name, TRS resource name, workflow URL suffix) if a workflow name is not provided by the client. For this, tags feels a little too optional for me personally.

However, wouldn't strongly object to having both in tags.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If name is a property of RunSummary I'd prefer the spect to be clear about what item is being named. If part of RunSummary, my expectation is that it is the name of the run and not the name of the workflow the run invoked.

With just workflow name, and assuming tags doesn't provide additional human readable information to help disambiguate, a list response would look like:

[
    {
        "id": "e81da9ff-f5a8-4f20-9d5f-772efc4a93b4",
        "start_time": "YYYY-MM-DD HH:MM:SS Z",
        "end_time": "YYYY-MM-DD HH:MM:SS Z",
        "name": "nf-core-rnaseq",
        "tags": { ... }
    },
    {
        "id": "1ae61546-7cfa-4959-bf28-247ce1390508",
        "start_time": "YYYY-MM-DD HH:MM:SS Z",
        "end_time": "YYYY-MM-DD HH:MM:SS Z",
        "name": "nf-core-rnaseq",
        "tags": { ... }
    },
    {
        "id": "c9be874b-5b9c-48b3-879e-40986fe9dbf0",
        "start_time": "YYYY-MM-DD HH:MM:SS Z",
        "end_time": "YYYY-MM-DD HH:MM:SS Z",
        "name": "nf-core-rnaseq",
        "tags": { ... }
    },
    {
        "id": "86aaf850-b2c4-4c9d-b898-16ed3ce1daba",
        "start_time": "YYYY-MM-DD HH:MM:SS Z",
        "end_time": "YYYY-MM-DD HH:MM:SS Z",
        "name": "nf-core-rnaseq",
        "tags": { ... }
    }
]

This would be challenging for an end-user to find the specific run that say processed sample "LIMSJVLO6YWS". Omitting name from the top level of RunSummary and having tags explicitly for workflow_name and run_name looks like:

[
    {
        "id": "e81da9ff-f5a8-4f20-9d5f-772efc4a93b4",
        "start_time": "YYYY-MM-DD HH:MM:SS Z",
        "end_time": "YYYY-MM-DD HH:MM:SS Z",
        "tags": {
            "workflow_name": "nf-core-rnaseq",
            "run_name": "LIMS2ZR5ZS29"
        }
    },
    {
        "id": "1ae61546-7cfa-4959-bf28-247ce1390508",
        "start_time": "YYYY-MM-DD HH:MM:SS Z",
        "end_time": "YYYY-MM-DD HH:MM:SS Z",
        "tags": {
            "workflow_name": "nf-core-rnaseq",
            "run_name": "LIMSXXOFXDMW"
        }
    },
    {
        "id": "c9be874b-5b9c-48b3-879e-40986fe9dbf0",
        "start_time": "YYYY-MM-DD HH:MM:SS Z",
        "end_time": "YYYY-MM-DD HH:MM:SS Z",
        "tags": {
            "workflow_name": "nf-core-rnaseq",
            "run_name": "LIMSJVLO6YWS"
        }
    },
    {
        "id": "86aaf850-b2c4-4c9d-b898-16ed3ce1daba",
        "start_time": "YYYY-MM-DD HH:MM:SS Z",
        "end_time": "YYYY-MM-DD HH:MM:SS Z",
        "tags": {
            "workflow_name": "nf-core-rnaseq",
            "run_name": "LIMS42IXEIFO"
        }
    }
]

IMO this provides a more useful response and maintains flexibility for clients to display or filter information how they need.

A compromise would be to maintain semantic consistency where the name property in RunSummary is the run_name and workflow_name is additional metadata provided in tags:

[
    {
        "id": "e81da9ff-f5a8-4f20-9d5f-772efc4a93b4",
        "start_time": "YYYY-MM-DD HH:MM:SS Z",
        "end_time": "YYYY-MM-DD HH:MM:SS Z",
        "name": "LIMS2ZR5ZS29",
        "tags": {
            "workflow_name": "nf-core-rnaseq"
        }
    },
    {
        "id": "1ae61546-7cfa-4959-bf28-247ce1390508",
        "start_time": "YYYY-MM-DD HH:MM:SS Z",
        "end_time": "YYYY-MM-DD HH:MM:SS Z",
        "name": "LIMSXXOFXDMW",
        "tags": {
            "workflow_name": "nf-core-rnaseq"
        }
    },
    {
        "id": "c9be874b-5b9c-48b3-879e-40986fe9dbf0",
        "start_time": "YYYY-MM-DD HH:MM:SS Z",
        "end_time": "YYYY-MM-DD HH:MM:SS Z",
        "name": "LIMSJVLO6YWS",
        "tags": {
            "workflow_name": "nf-core-rnaseq"
        }
    },
    {
        "id": "86aaf850-b2c4-4c9d-b898-16ed3ce1daba",
        "start_time": "YYYY-MM-DD HH:MM:SS Z",
        "end_time": "YYYY-MM-DD HH:MM:SS Z",
        "name": "LIMS42IXEIFO",
        "tags": {
            "workflow_name": "nf-core-rnaseq"
        }
    }
]

@wleepang
Copy link
Contributor

I think tags is more important than name for this response. (see my comment above).

From a client perspective, runid, (start|end)_time, and status are certainly must haves. I'd say tags is required and it is strongly recommended that "run_name" and "workflow_name" tags are included.

@uniqueg
Copy link
Contributor Author

uniqueg commented Apr 13, 2023

I think tags is more important than name for this response. (see my comment above).

From a client perspective, runid, (start|end)_time, and status are certainly must haves. I'd say tags is required and it is strongly recommended that "run_name" and "workflow_name" tags are included.

I'm with you. It seems to me like we all agree and it's mostly a semantic argument. The main point is that we want to encourage implementers as much as possible to make this info available, without making their life unduly hard if, for whatever reason, they can't. So let it be in tags :)

@patmagee
Copy link
Contributor

@uniqueg once you update this PR to remove name and make fields like start_time and end_time optional, I will go ahead and merge it!

@uniqueg
Copy link
Contributor Author

uniqueg commented Apr 17, 2023

@uniqueg once you update this PR to remove name and make fields like start_time and end_time optional, I will go ahead and merge it!

Done

@patmagee patmagee merged commit 88d0b8d into develop Apr 18, 2023
@patmagee patmagee mentioned this pull request Jun 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: Add timestamps to RunStatus
3 participants