Getting results back from DNAStack WES #111
@ianfore the WES server simply mirrors the workflow execution engine's output. The MD5 sum being empty is likely an issue with the signed URLs: the task is completing successfully, but my guess is the file is never actually being localized to the VM (similar to the GWAS workflow).

With WDL, returns are typed (according to the WDL typing specification) and can be any valid JSON value (string, int, boolean, float, etc.). Files are represented by strings; however, no information is communicated that they are actually files, other than the fact that the WDL itself declares them as files. WES/WDL is not actually concerned with how a user identifies or even accesses these files. For the DNAstack WES server you need access to the bucket in GCP in order to access the file-based outputs.

The lack of information on how outputs (and inputs, for that matter) are structured is one of the things I take issue with in the WES specification. For proper interoperability between workflow platforms, this needs to be defined in the WES specification and not left up to the individual language. At the moment, there's no consistent way to represent this information.
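To make the typing point concrete, here is a minimal sketch of an outputs block as it might appear in a run log. The workflow name, output names, and gs:// path are hypothetical; the point is that the File output arrives as a plain JSON string, indistinguishable from any other string:

```python
import json

# Hypothetical "outputs" section of a run log for an md5 workflow.
# Nothing in the payload marks md5Sum.value as a file -- only the
# WDL source declares its type as File.
outputs_json = """
{
  "md5Sum.value": "gs://example-bucket/md5Sum/some-run-id/output.md5",
  "md5Sum.attempts": 1
}
"""

outputs = json.loads(outputs_json)
for name, value in outputs.items():
    print(f"{name} -> {value!r} (JSON type: {type(value).__name__})")
```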
Unfortunately this is a bit more challenging than it sounds. We did not previously do this, but it looks like we could probably figure it out.
The WDL approach to returning values seems reasonable behavior for WES too. The following is from a working MD5 example on your WES server and seems fit for purpose. No need to store files at the WES server or retrieve them.
It's a difficult issue, because WES is yet one more abstraction over other, very different abstractions (CWL, WDL, Nextflow, etc.), each of which is itself an abstraction over the underlying execution engine. There is definitely a line between being too heavy-handed in specifying how inputs/outputs are defined and not being assertive enough.

In the WDL community, this was a hotly debated topic for the longest time. Originally we did not mandate a minimal implementation requirement for inputs or outputs, but left it up to the specific execution engine. Our thinking was that being too prescriptive would actually restrict adoption of the specification by new engine implementations. Additionally, it seemed to us to step outside the bounds of the WDL specification, since WDL is not concerned with HOW it is run, only what specifically is run. Over time, though, it became clear that this stance introduced a lot of problems.
We recently voted on a specific "Cromwell-style" inputs/outputs format (openwdl/wdl#357) as the required format that all engines must minimally support. We do not restrict additional input/output formats.
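As a hedged illustration of that style (the workflow and parameter names below are invented, not from the linked PR): a flat JSON object mapping fully qualified names to values, with files appearing as strings.

```python
import json

# Sketch of a "Cromwell-style" flat inputs object: keys are fully
# qualified <workflow>.<name> strings, values are plain JSON values.
inputs = {
    "md5Sum.inputFile": "gs://example-bucket/inputs/data.txt",  # File, as a string
    "md5Sum.preemptible": True,                                 # Boolean
}

encoded = json.dumps(inputs, indent=2)
print(encoded)
```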
Picking up on how files from a WES run might be accessed.
For the most part direct bucket access isn't being considered, and I can see why not. You had asked elsewhere whether DRS would be an option. Certainly worth looking at; in fact we can test out aspects of this now. Seven Bridges DRS makes any file in a workspace available through DRS. One of my FASPScripts uses the SB API directly to submit a task rather than WES. I did it as a placeholder until an SB WES server is available. The resulting file is stored in my SB project. I added a script to check for task completion, get the file (DRS) ids, and use DRS to download the resulting files. Again, it has to use the SB API to query the task and get the ids, but it wouldn't need any WES changes to do this. It would work just as you outlined for WDL and return the DRS id as a string.
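The pattern that script follows can be sketched roughly as below. This is not the real Seven Bridges API: `get_state`, `get_drs_ids`, the host, and the token are placeholders, with only the DRS object path following the GA4GH DRS v1 layout.

```python
import time
import urllib.parse
import urllib.request

DRS_BASE = "https://drs.example.org"  # placeholder DRS service

def drs_object_url(base: str, object_id: str) -> str:
    """Build the GA4GH DRS v1 metadata URL for a DRS id."""
    return f"{base.rstrip('/')}/ga4gh/drs/v1/objects/{urllib.parse.quote(object_id)}"

def collect_outputs(get_state, get_drs_ids, token: str, poll_seconds: int = 10):
    """Poll until the task finishes, then fetch each output's DRS record.
    get_state() and get_drs_ids() stand in for platform API calls."""
    while (state := get_state()) not in ("COMPLETED", "FAILED"):
        time.sleep(poll_seconds)
    if state == "FAILED":
        raise RuntimeError("task failed")
    results = {}
    for drs_id in get_drs_ids():
        req = urllib.request.Request(
            drs_object_url(DRS_BASE, drs_id),
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(req) as resp:
            results[drs_id] = resp.read()  # DRS object metadata (JSON bytes)
    return results
```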
The case for stricter typing you've made above could well apply here too. This does lead to revisiting some assumptions about DRS, but I'll hold that for elsewhere.
To me it sounds like we're talking about a GET /outputs for a given workflow ID, where the response is a key/value mapping of name to output value (be it a file ref, string, int, etc.)?
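Client-side handling of such a response could look like the sketch below. The endpoint shape is a proposal, not part of the current WES spec, and the output names are invented; the body is assumed to be a flat JSON object with file refs as plain strings.

```python
import json
from typing import Any, Dict

def parse_outputs(body: str) -> Dict[str, Any]:
    """Parse the hypothetical GET .../outputs response body: a JSON
    object mapping output names to values (file refs as strings,
    plus ints, booleans, floats...)."""
    outputs = json.loads(body)
    if not isinstance(outputs, dict):
        raise ValueError("outputs response must be a JSON object")
    return outputs

example = parse_outputs('{"md5Sum.value": "drs://example.org/1234", "md5Sum.exitOk": true}')
print(example)
```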
Some progress made on this in this notebook, albeit on a different WES implementation. A difference from how it was being discussed above is that you don't explicitly say "put the result in this DRS location"; it just happens. You can then pick up the results from the DRS service run by the same provider. You have to do a separate authentication for the DRS service, though, the irony being that it's the same set of credentials as for the WES. The other part is that the script did have to do the download rather than having it pushed. Other than that it seems very convenient to use. This raises the question of whether a WES server would also have a companion DRS for results. I suspect that for the most part, if you have compute privileges on a system you will have storage privileges too. This might be a common pattern and therefore worth supporting, e.g. that the WES authentication serves for the DRS retrieval too.
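That "one credential, two services" pattern could be sketched as a single client holding one token for both endpoints. The hosts below are placeholders; the paths follow the standard GA4GH WES v1 and DRS v1 layouts.

```python
class PlatformClient:
    """Sketch: one token shared between a WES server and its companion DRS."""

    def __init__(self, wes_base: str, drs_base: str, token: str):
        self.wes_base = wes_base.rstrip("/")
        self.drs_base = drs_base.rstrip("/")
        self.headers = {"Authorization": f"Bearer {token}"}  # shared credential

    def run_log_url(self, run_id: str) -> str:
        return f"{self.wes_base}/ga4gh/wes/v1/runs/{run_id}"

    def drs_object_url(self, object_id: str) -> str:
        return f"{self.drs_base}/ga4gh/drs/v1/objects/{object_id}"

client = PlatformClient("https://wes.example.org", "https://drs.example.org", "TOKEN")
```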
How do you get back the results of a workflow passed to the DNAstack WES server? Can this be done through the WES API?
There are a couple of clues.
The run log gives a working directory, and the command line lists an output file name. So we might guess, for the MD5 example, that the output goes to:
gs://workflow-bucket/md5Sum/run_id/output.md5
Another clue is in the log from the completed run
But the value there is unspecified.
The only thing I found in the WES spec on outputs is in the documentation of workflow_params, which is described as:
The workflow run parameterizations (JSON encoded), including input and output file locations
However, there are no details on how to provide the location, or how to give the workflow authorization to write to that location.
Maybe that last complication is necessary. From the above it seems that the output goes to workflow_root, in which case the authorization problem is mine: can I access workflow-bucket?
stdout also goes to workflow_root.
gs://workflow-bucket/md5Sum/run_id/call-calculateMd5Sum/stdout
The WES documentation states that the stdout value is:
A URL to retrieve standard output logs of the workflow run or task. ... Should be available using the same credentials used to access the WES endpoint.
Can the access_token used for the DNAstack WES be used to access the stdout file via the Google Storage API? I don't expect so, but I tried the following URL:
http://storage.googleapis.com/workflow-bucket/md5Sum/run_id/call-calculateMd5Sum/stdout
I passed the access token used for the workflow as a Bearer token. It returned
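The attempt above can be reconstructed as the following sketch. The token is a placeholder, and the likely outcome is a 401/403, since the WES access token is presumably not a Google OAuth token with a storage scope:

```python
import urllib.request

def build_request(url: str, token: str) -> urllib.request.Request:
    """Attach a Bearer token to a GET request for a storage URL."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})

req = build_request(
    "http://storage.googleapis.com/workflow-bucket/md5Sum/run_id/call-calculateMd5Sum/stdout",
    "ACCESS_TOKEN",  # placeholder: the token obtained for the WES endpoint
)
# urllib.request.urlopen(req) would then perform the GET.
```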
Note also that the value for stdout (and stderr) returned by the implementation is a URI, not a URL as the specification requires.