Reading structured data from JSON and TSV #389
Comments
I think this is a really great proposal. This would really solve several problems with one stone. I have always had a bit of an issue with the unstructured nature of My one concern with doing something like:
is that it puts all typing on the LHS of the equation, meaning I wonder if a "better", more verbose approach would be a general formula of
struct Sample {
  String name
  Int age
}
Array[Sample] samples = read_tsv(Sample, "samples.tsv")
with some more complete examples:
I guess I should be more explicit in my proposal: the engine won't try to guess the right coercions for each column of the tsv. The mixed return type is about the fact that it can be one of two things:
When the return type is Object, all the values are strings. It is then up to the engine to cast that Object to the LHS struct type, or to throw an error if there is any key incompatibility between the Object and the struct, or if any of the values of the Object cannot be coerced to the expected types.
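The casting rule described above can be modeled outside of WDL. The following is a minimal Python sketch, not engine code: the `Sample` dataclass stands in for a WDL struct, and `coerce_object` is a hypothetical helper name. Treating any missing or extra key as an error is one reading of "key incompatibility" and is an assumption here.

```python
from dataclasses import dataclass, fields


@dataclass
class Sample:
    # Hypothetical stand-in for the WDL struct `Sample` used in this thread.
    name: str
    age: int


def coerce_object(obj, struct_type):
    """Cast a dict of strings to struct_type, raising ValueError on any mismatch."""
    expected = {f.name: f.type for f in fields(struct_type)}
    # Assumption: the keys must match the struct members exactly.
    if set(obj) != set(expected):
        raise ValueError(
            f"key mismatch: got {sorted(obj)}, expected {sorted(expected)}"
        )
    values = {}
    for key, target in expected.items():
        try:
            # Every incoming value is a string; convert it to the member type.
            values[key] = target(obj[key])
        except (TypeError, ValueError) as exc:
            raise ValueError(
                f"cannot coerce {key}={obj[key]!r} to {target.__name__}"
            ) from exc
    return struct_type(**values)
```

For example, `coerce_object({"name": "alice", "age": "30"}, Sample)` yields a `Sample` whose `age` is the integer 30, while a non-numeric `age` or a missing `name` key raises `ValueError`.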
I would not like to complicate read_tsv overly much by enabling coercion to nested types. If one really wants to use nested data structures, one should use JSON.
I do like the
This will be better for backward compatibility as well, vs introducing a new parameter to all the |
I really want users to be able to look at a TSV, make their list of samples to run workflow X on, make sure the header matches the members defined in the Struct for the workflow's inputs, and then provide the TSV File as an input. Reading over this now, I have lost the thread of how this could be possible without breaking old stuff. It is still a huge need for newbie-type folks and for folks who are running LOTS of smaller workflows (as in, their workflow always scatters over a long list of jobs that fit nicely into rows of a TSV, rather than a human trying to generate a JSON and then troubleshoot its formatting).
To read a table with a header, it is possible to use the following code:
as explained in #194. Though it would be nice to be able to do this without using Notice that it is then also possible to check which columns the table contains, using a little hack that works around the lack of a
Duplicate of #194
read_json function returns type mixed (an object, array, or primitive)
read_tsv function returns type Array[Array[String]]
Both of these functions are to support reading structured data.
The return type of read_json is structured, i.e. it may be coerced to a Struct (or a collection of Structs). On the other hand, the return type of read_tsv is unstructured, i.e. you must know the content of the TSV file that you're reading in, and WDL will do nothing to validate that the assumed structure is correct.
A better approach would be to have the return type of read_tsv be (optionally) an object that may be coerced to a specific structure, which is defined by a Struct type.
(Note that, even though I am proposing to support implicit coercion, I am also strongly in favor of explicit coercions, as described in #373.)
For read_tsv there would need to be a second parameter indicating the header. This could be either a Boolean (indicating whether the first line is a header) or an Array[String] providing the actual header. I could see the spec for read_tsv looking like this:
If the second parameter is not specified, then the return type is Array[Array[String]]. This may be coerced to Array[Struct] as follows:
If the second parameter is of type Boolean and the value is false, the return type is as above (Array[Array[String]]). If the value is true, then the return type is Array[Object], with the keys being taken from the column headers in the first line. The values would all be of type String.
If the second parameter is an Array[String], then it is assumed the file does not have a header line and the given array is used as the header. The return type is as above (Array[Object]).
To coerce each Object to Struct:
Array[String]
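The three call forms proposed for read_tsv (no second parameter, a Boolean, or an Array[String]) can be modeled in a few lines. This is a Python sketch of the proposed behavior, not WDL and not any engine's implementation. The function takes the file contents as a string for simplicity, and raising an error on rows whose width differs from the header is an assumption that the proposal does not spell out.

```python
def read_tsv(text, header=None):
    """Model of the proposed read_tsv; `text` stands in for the file contents."""
    rows = [line.split("\t") for line in text.splitlines() if line]

    # No second parameter, or false: unstructured Array[Array[String]].
    if header is None or header is False:
        return rows

    if header is True:
        # The first line supplies the keys.
        if not rows:
            raise ValueError("header requested but the file is empty")
        keys, rows = rows[0], rows[1:]
    else:
        # Caller-supplied header; the file is assumed to have no header line.
        keys = list(header)

    objects = []
    for row in rows:
        # Assumption: a row that does not match the header width is an error.
        if len(row) != len(keys):
            raise ValueError(f"expected {len(keys)} columns, got {len(row)}")
        # Array[Object]: every value remains a string.
        objects.append(dict(zip(keys, row)))
    return objects
```

With the contents `"name\tage\nalice\t30\n"`, `read_tsv(text)` returns two lists of strings, while `read_tsv(text, True)` returns one dictionary with keys `name` and `age`, both mapped to string values.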
A related but separate proposal is to add an optional second parameter to read_json, which is the string key of the element to select from the JSON object (because, even though it is valid to have an array at the top level of JSON, it is still much more common to see the structure shown in the example).
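The optional key parameter for read_json could behave roughly as in the Python sketch below. It models the idea only. The key name `samples` in the usage note is hypothetical, and raising an error when the top level is not an object or the key is absent is an assumption rather than part of the proposal.

```python
import json


def read_json(text, key=None):
    """Model of the proposed read_json; `text` stands in for the file contents."""
    value = json.loads(text)
    if key is None:
        # Current behavior: return whatever is at the top level.
        return value
    # Assumption: selecting by key only makes sense for a top-level object.
    if not isinstance(value, dict):
        raise ValueError("key selection requires a top-level JSON object")
    if key not in value:
        raise ValueError(f"key {key!r} not found in JSON object")
    return value[key]
```

Given the document `{"samples": [{"name": "alice", "age": 30}]}`, calling `read_json(text, "samples")` returns the inner array directly, which could then be coerced to a collection of structs.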