
Reading structured data from JSON and TSV #389

Closed
jdidion opened this issue Jul 31, 2020 · 7 comments

Comments

@jdidion
Collaborator

jdidion commented Jul 31, 2020

  • The read_json function returns type mixed (an object, array, or primitive)
  • The read_tsv function returns type Array[Array[String]]

Both of these functions exist to support reading structured data.

The return type of read_json is structured - i.e. it may be coerced to a Struct (or collection of Structs). On the other hand, the return type of read_tsv is unstructured, i.e. you must know the content of the TSV file that you're reading, and WDL will do nothing to validate that the assumed structure is correct.

A better approach would be to have the return type of read_tsv be (optionally) an object that may be coerced to a specific structure, which is defined by a Struct type.

```
struct Sample {
  String name
  Int age
}

Array[Sample] samples = read_tsv("samples.tsv")
```
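For comparison, this kind of coercion already makes sense for read_json, since JSON carries its own structure. A minimal sketch (the file name and contents here are hypothetical):

```
# sample.json contains: { "name": "joe", "age": 42 }
Sample sample = read_json("sample.json")
```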

(Note that, even though I am proposing to support implicit coercion, I am also strongly in favor of explicit coercions, as described in #373.)

For read_tsv there would need to be a second parameter indicating the header. This could be either a Boolean (indicating whether the first line is a header) or an Array[String] providing the actual header. I could see the spec for read_tsv looking like this:

mixed read_tsv(String|File, Optional[Boolean|Array[String]])

If the second parameter is not specified, then the return type is Array[Array[String]]. This may be coerced to Array[Struct] as follows (see the sketch after this list):

  • The array must be square (i.e. every row must be the same length).
  • Each column left-to-right corresponds to a member of the struct in the top-to-bottom order in which the members are declared.
  • For each row, each field is coerced to the type of the corresponding struct member - it is an error if any of these coercions fail. This requires explicitly allowing String-to-Int, String-to-Float, and String-to-Boolean coercions.
  • Nested structures are not supported - i.e. every element of the struct must be of a primitive type.
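As an example of these rules, the Sample struct above could be populated from a header-less TSV roughly like this (a sketch; the file contents shown in comments are hypothetical):

```
# samples.tsv (no header line), two columns per row:
#   joe    42
#   jane   37
#
# Raw return value: [["joe", "42"], ["jane", "37"]]
# Coercion: column 1 -> name (String), column 2 -> age (String-to-Int)
Array[Sample] samples = read_tsv("samples.tsv")
```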

If the second parameter is of type Boolean and the value is false, the return type is as above (Array[Array[String]]). If the value is true, then the return type is Array[Object], with the keys being taken from the column headers in the first line. The values would all be of type String.

If the second parameter is an Array[String], then it is assumed the file does not have a header line and the given array is used as the header. The return type is as above (Array[Object]).
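Under this proposal, the two header forms would look roughly like this (a sketch; file names are hypothetical):

```
# header.tsv begins with a header line: name<TAB>age
Array[Object] with_header = read_tsv("header.tsv", true)

# no_header.tsv has no header line; the column names are supplied explicitly
Array[Object] named_columns = read_tsv("no_header.tsv", ["name", "age"])
```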

To coerce each Object to a Struct (sketched after this list):

  • The object keys and struct member ids must be identical.
  • Coercions are performed the same as when coercing from Array[String].
  • Nested structures are not supported.
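Putting those rules together, coercing the header form to the Sample struct would look roughly like this (again a sketch of the proposed behavior, not current WDL):

```
# Each row is first read as an Object with all-String values,
# e.g. { "name": "joe", "age": "42" }; the keys must match the
# Sample member names exactly, and "42" is coerced String-to-Int.
Array[Sample] samples = read_tsv("header.tsv", true)
```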

A related but separate proposal is to add an optional second parameter to read_json which is the string key of the element to select from the JSON object (because, even though it is valid to have an array at the top level of JSON, it is still much more common to see the structure shown in the example).

# samples.json is of the form { "samples": [ { "name": "joe", "age": 42 }, ... ] }
Array[Sample] samples = read_json("samples.json", "samples")
@patmagee
Member

patmagee commented Aug 4, 2020

I think this is a really great proposal. It would solve several problems at once. I have always had a bit of an issue with the unstructured nature of read_tsv and its reliance on type casting by the engine. Realistically, there is no way for the engine to know the proper type the user intended without being explicitly told, so it's unavoidable that we will end up with issues.

My one concern with doing something like:

Array[Sample] samples = read_tsv("samples.json")

is that it puts all of the typing on the LHS of the equation, meaning read_tsv is still returned as mixed along with the engine's best guess of what the underlying types are. It's possible then that we get into an incompatible coercion scenario simply because the engine cast to the wrong thing. Also, not allowing nested structures would definitely cause a few people some problems.

I wonder if a "better" more verbose approach would be a general formula of read_X(Type, <name>), with a concrete example being:

struct Sample {
  String name
  Int age
}

Array[Sample] samples = read_tsv(Sample, "samples.tsv")

with some more complete examples:

Array[Array[String]] sample_names = read_tsv(Array[String],"names.tsv")
SomeStruct something = read_json(SomeStruct,"something.json")

@jdidion
Collaborator Author

jdidion commented Aug 4, 2020

I guess I should be more explicit in my proposal: the engine won't try to guess the right coercions for each column of the TSV. The mixed return type reflects the fact that the result can be one of two things: Array[Array[String]] or Array[Object]. It might be clearer to say:

Array[Array[String]] read_tsv(String|File)
Array[Array[String]] read_tsv(String|File, false)
Array[Object] read_tsv(String|File, true)
Array[Object] read_tsv(String|File, Array[String])

When the return type is Array[Object], all the values are Strings. It is up to the engine to then cast each Object to the LHS struct type, or throw an error if there is any key incompatibility between the Object and the struct, or if any of the values of the Object cannot be coerced to the expected types.
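For example, a key mismatch or an uncoercible value would be an error under this scheme (a sketch; the file is hypothetical):

```
# bad.tsv header line: name<TAB>years
# "years" does not match the Sample member "age", so the engine
# would fail the cast from Object to Sample at this assignment
Array[Sample] bad = read_tsv("bad.tsv", true)
```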

@jdidion
Collaborator Author

jdidion commented Aug 4, 2020

I would not like to complicate read_tsv overly much by enabling coercion to nested types. If one really wants to use nested data structures, one should use JSON.

@jdidion
Collaborator Author

jdidion commented Aug 4, 2020

I do like read_json(SomeStruct, "something.json"), but I really think this should be done with a general-purpose cast function and an explicit set of rules for type casting. So instead:

cast(SomeStruct, read_json("something.json"))

This would also be better for backward compatibility than introducing a new parameter to all of the read_* functions. At the very least, new parameters should be optional and added at the end of the signature.
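For example, the explicit form could compose with either reader (a sketch; cast here is the proposed function, not an existing WDL builtin):

```
SomeStruct something = cast(SomeStruct, read_json("something.json"))
Array[Sample] samples = cast(Array[Sample], read_tsv("samples.tsv", true))
```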

@jdidion jdidion added this to the v2.0 milestone Aug 4, 2020
@vortexing

I really want to be able to have users look at a TSV, make their list of samples to run workflow X on, make sure the header matches the values defined in the Struct for the workflow's inputs, and then provide the TSV File as an input. Reading over this now, I have lost the thread of how this could be possible without breaking old behavior.

This is still a huge need for newer users and for people running LOTS of smaller workflows (i.e. workflows that always scatter over a long list of jobs that fit nicely into the rows of a TSV, rather than having a human generate a JSON file and then troubleshoot its formatting).

@freeseek
Contributor

freeseek commented Sep 6, 2022

To read a table with a header, it is possible to use the following code:

Array[Array[String]] tsv = read_tsv(file_tsv)
# collect the data rows (everything after the header line)
scatter (idx in range(length(tsv) - 1)) { Array[String] tsv_rows = tsv[(idx + 1)] }
# map each column header to the values in that column
Map[String, Array[String]] tbl = as_map(zip(tsv[0], transpose(tsv_rows)))

as explained in #194. It would be nice, though, to be able to do this without using scatter, via a function that could extract a slice from an array, as suggested in #120.

Note that it is then also possible to check which columns are contained in the table, using a little hack that works around the lack of a contains_key() function, as explained in #305.

@jdidion
Collaborator Author

jdidion commented Feb 7, 2024

Duplicate of #194

@jdidion jdidion marked this as a duplicate of #194 Feb 7, 2024
@jdidion jdidion closed this as completed Feb 7, 2024