-
Notifications
You must be signed in to change notification settings - Fork 110
Rich type representation in Attributes #700
Conversation
91ff4f1
to
aa5156f
Compare
+1 I really like this idea. First, since one of the languages of the genetics community (python) is untyped anything that brings some degree of type integrity is welcome. Second, standardizing this method of creating space for secondary (and unforeseen) attributes that people need to store is a big plus as well. One shortcoming that was discussed was the storage of arbitrary object types, but this isn't supported any better in the current implementation. In either case, users would need to serialize their object to text and deserialize it later to work on it. I think this is not a primary use case for the |
Thanks @andrewjesaitis, that's right! You'll still be able to store arbitrary JSON strings in |
On the MTT side, we have almost exactly the same structure: // characteristics object uses lists of BioCharacteristic objects to describe // BioCharacteristic is a prototype wrapper object for single instances // Characteristics We're happy updating the names of the attributes if this is helpful. @mbaudis created a new branch, see https://github.com/ga4gh/schemas/blob/metadata-bio-characteristics/src/main/proto/ga4gh/bio_metadata.proto. It would be good to coordinate, use it across and merge the branch. |
To clarify, we propose to add an object of type |
d1f9e26
to
8ca5c45
Compare
How often are typed values needed? Is this overkill? |
My question gets at documenting the motivation for this change. The discussion is scattered in several PRs and issues. I will try to summarize: The motivation is to be able to precisely represent the data and to be able to recreate the original state of the data (roundtrip), e.g. if passing variants from a vcf file, then be able to recreate the vcf file from the data in the message. Therefore, there is a need to have lists of typed values that can be scalar or array values. Also, the data can be hierarchical. The drawback of this approach is that it is complicated and will require more complex processing just to get a simple value. Will this impede acceptance of GA4GH by consumers of the data? Is the goal of roundtrip even attainable? The formats are not even precisely defined (see #505). My impression of the info field was that it is for metadata, most often a string value. What if we keep info as well as adding attributes where it is necessary? |
ef15ba1
to
3b7910d
Compare
3b7910d
to
a4a8bcc
Compare
Thanks @ejacox, great questions! The basic problem is that folks want to be able to tag with Ontology Terms, when they're available, and define loosely structured hierachical metadata. The other messages made available as The attributes message as it as used in sequence annotations provides the highest fidelity of interchange, and this PR brings that feature to the rest of the API. It answers the question, "what do I do with my JSON" in as final of a manner as we can do at this time. In terms of keeping the info field, the attributes message is the same for the use case of the
You can compare the proto that backs the info field with the contents of this PR (https://github.com/google/protobuf/blob/master/src/google/protobuf/struct.proto#L93). Hopefully this will be largely code neutral and adds features that are needed to properly model loosely structured metadata. It also provides a path to better define the types represented in the metadata, but like the The feature that results in the reference server is being able to provide an interface for interchanging JSON metadata, or any other hierarchically structured message of a mixture of GA4GH and internal protobuf types. Inspirations for nesting keys was developed to interchange structured tags in tagstorm format http://hgwdev.cse.ucsc.edu/~kent/tagStorm/testTagStorm.html . A common use case request has been modeling loosely structured data, especially metadata, for example when interchanging phenotype data. The sequence annotations API's designers (@calbach @diekhans ) offered this message, this PR hopes to make the API consistent by supplying a single robust tagging method. I hope others will offer their perspective. At first I found these data structures (info field) extremely overwrought for interchanging simplistic data. I have experience working with tag-bags that map strings to strings, and not being able to do this in just a few characters of code was frustrating and added a steep learning curve. I was able to finally convince myself that type security and hierarchical mapping was important when working with variant set metadata in the variant annotations API. Often, analyses provide annotations in an unstructured manner, and these annotations were often nested arrays that needed to be split. It was my hope that no one would ever have to guess about whether a field is comma separated, contains booleans or strings, ontology terms or external identifiers, when using this API. Hopefully, clients will obscure the type reflection that one would like when working with these attributes. Thanks for your considerations @ejacox. @bwalsh had stated something similar about the info field in the past, which I understand. However, I believe that type security, hierarchy, and using our own type system will enable a lot of common interchange use cases. |
Following discussion with @ejacox, we will be modifying this PR to allow other metadata entries to be gathered together. #690 (comment) The new signature will be |
Hmmm. Don't think I really understand the purpose of this and think it may obscure the original intention of the info field. If I remember correctly, the purpose of the info field was to store data defined in source that didn't explicitly have a place in the schema. Moving it to be under a per record metadata field seems to change or at least obscure this purpose, since it suggests that this is data about the record rather than data "of" the record but that didn't have an explicit field for it. As an example, I don't really view the data in a VCF's INFO field as metadata, it is data about a variant. I wouldn't expect Sift or Polyphen scores to be metadata, but under this change it is where they would be stored. |
We propose to keep the timestamp info explicit (not in attributes/info). Rather, any info (metadata) that applies to data in general will be put into a metadata message. Explicit fields will still be explicit. Our motivation is to gather all purpose fields in one place. With timestamps, we were beginning to put them all over the place. When we wanted to change them, we needed to touch many messages. This way, any future changes (type change, description change, etc.) to timestamps will occur in one place. Does that address your concerns and still align with original intent @andrewjesaitis ? |
An alternative would be to add the Given the Variant Set Metadata message uses the word "metadata", I can't mount much of an argument against the suggestion to use it as an accessor for the attributes. Ideally only developers have to see this part of the model, and if you look at how applications have to reconcile the meaning of tags in the Variant protocol, it might make more sense to call it I would identify one problem as mixing the term "metadata" here, because the variant set metadata serves a number of purposes. It has ID resolution, descriptions, and type definitions (to some extent). So a variant set provides metadata about a variant's attributes, and a server provides metadata about a variant's record. As pointed out by @ejacox, we need to better specify the use cases around the timestamp. It's not clear if it is meant to be provided by the file. If it is left up to interpretation, it is a server specific message about a record, and maybe should go in a different named field: |
@ejacox I do not get this. Obviously, in GA4GH we use "metadata" roughly as "everything but the sequence", not just as describing housekeeping data associated with a message. And this is a view shared broadly outside GA4GH, when talking about data sharing concepts etc. So, please, try not to mix this into "named attributes and then accompanying metadata" or such. For time stamps and ids, I personally prefer the use of "housekeeping" attributes. The current So general agreement with @david4096, though IMO created/updated (please read up all the scattered discussions & history, including previous better naming |
I think I'm in agreement with @david4096 suggestion of keeping If we want to move timestamps to a separate metadata message that should be a separate PR. I'll throw my timestamp thoughts on the other issue to keep this PR focused. |
Attributes message is moved to metadata.proto info field was replaced for messages that had it with the field 'attributes' renamed the 'vals' field to 'attr' added to the available types in an attribute value message ('int32', 'int64', 'bool', 'double', 'null') Added attributes to the Variant Annotation Set message and RNA expression messages.
Remove unused package
Remove unused import
Merge assay_metadata in order to support as attributes.
Add tests that show basic attributes examples
Remove assay metadata from tests
a4a8bcc
to
167c5a0
Compare
In the DWG call today the question came up about how to perform message typing in a more dynamic manner. You can see one way of performing this task here by using the protobuf
oneof
field and reflection. This PR attempts to close #607 by adding richer type representation, and also close #445 by adding integers as one of the available types. Big thanks to @calbach and @diekhans for providing the basis for this in Sequence Annotations, this PR extends that work to the rest of the data model.The info field is used on many GA4GH messages and acts as a way to "tag" a message with data that are have not been represented by the named fields. The field items are lists of internal protobuf types, and their type information is retained with the message itself. This allows code to inspect the info field and perform tasks depending on the type.
With Sequence Annotations we saw the addition of a similar spirited effort that uses GA4GH types in a protobuf
oneof
field to allow typed information to be stored in aFeature
message. This improves data interchange by allowingExternalIdentifier
orOntologyTerm
messages to be used as values in the attributes map.This PR attempts to unify the approach by including the strengths of the
ListValue
method used by theinfo
field with the extensibility of theAttributes
approach from the Features protocol.To carry this out, the
Attributes
message is moved to metadata.proto so that it can be used amongst the other protocols. Theinfo
field was replaced for messages that had it with the fieldattributes
. I've renamed thevals
field toattr
, since I found the repetition of vals...values confusing in code.I've also expanded the available types in an attribute value message (
int32
,int64
,bool
,double
,null
). This list can be extended with other GA4GH messages as needed.Added
attributes
to the Variant Annotation Set message and RNA expression messages.Please view the iPython notebook example that shows how to work with the
attributes
field. More information on theoneof
field can be viewed here.I've included an example below of how JSON serialization of the
attributes
field as it appears for the variant in json_intro.rst. I hope this addresses any concerns that fidelity of the data is lost when using JSON serialization: