Rich type representation in Attributes #700

david4096 · 2016-08-25T00:25:21Z

In the DWG call today the question came up about how to perform message typing in a more dynamic manner. You can see one way of performing this task here by using the protobuf oneof field and reflection. This PR attempts to close #607 by adding richer type representation, and also close #445 by adding integers as one of the available types. Big thanks to @calbach and @diekhans for providing the basis for this in Sequence Annotations, this PR extends that work to the rest of the data model.

The info field is used on many GA4GH messages and acts as a way to "tag" a message with data that are have not been represented by the named fields. The field items are lists of internal protobuf types, and their type information is retained with the message itself. This allows code to inspect the info field and perform tasks depending on the type.

With Sequence Annotations we saw the addition of a similar spirited effort that uses GA4GH types in a protobuf oneof field to allow typed information to be stored in a Feature message. This improves data interchange by allowing ExternalIdentifier or OntologyTerm messages to be used as values in the attributes map.

This PR attempts to unify the approach by including the strengths of the ListValue method used by the info field with the extensibility of the Attributes approach from the Features protocol.

To carry this out, the Attributes message is moved to metadata.proto so that it can be used amongst the other protocols. The info field was replaced for messages that had it with the field attributes. I've renamed the vals field to attr, since I found the repetition of vals...values confusing in code.

I've also expanded the available types in an attribute value message (int32, int64, bool, double, null). This list can be extended with other GA4GH messages as needed.

Added attributes to the Variant Annotation Set message and RNA expression messages.

Please view the iPython notebook example that shows how to work with the attributes field. More information on the oneoffield can be viewed here.

I've included an example below of how JSON serialization of the attributes field as it appears for the variant in json_intro.rst. I hope this addresses any concerns that fidelity of the data is lost when using JSON serialization:

      "attributes": {
          "attr": {
              "numberOfPapers": {
                  "values": [
                      {"int32Value": 11}
                  ]
              },
              "variantFacts": {
                  "values": [
                      {"stringValue": "is_long"},
                      {"stringValue": "is_interesting"}
                  ]
              }
          }
      }

andrewjesaitis · 2016-09-24T16:35:31Z

+1 I really like this idea.

First, since one of the languages of the genetics community (python) is untyped anything that brings some degree of type integrity is welcome.

Second, standardizing this method of creating space for secondary (and unforeseen) attributes that people need to store is a big plus as well.

One shortcoming that was discussed was the storage of arbitrary object types, but this isn't supported any better in the current implementation. In either case, users would need to serialize their object to text and deserialize it later to work on it. I think this is not a primary use case for the info and the vast majority of time users will be storing primitive types.

david4096 · 2016-09-30T00:53:19Z

Thanks @andrewjesaitis, that's right! You'll still be able to store arbitrary JSON strings in string_value messages as is done now.

mcourtot · 2016-10-07T08:49:07Z

On the MTT side, we have almost exactly the same structure:

// characteristics object uses lists of BioCharacteristic objects to describe
// diseases and phenotypes associated with this Individual.
Characteristics characteristics = 5;

// BioCharacteristic is a prototype wrapper object for single instances
// of phenotypes, diseases ... which may be described through one or several
// ontology terms
message BioCharacteristic {
string description = 1;
repeated OntologyTerm ontologyTerms = 2;
}

// Characteristics
message Characteristics {
repeated BioCharacteristic diseases = 1;
repeated BioCharacteristic phenotypes = 2;
}

We're happy updating the names of the attributes if this is helpful.

@mbaudis created a new branch, see https://github.com/ga4gh/schemas/blob/metadata-bio-characteristics/src/main/proto/ga4gh/bio_metadata.proto. It would be good to coordinate, use it across and merge the branch.

mcourtot · 2016-10-10T12:01:05Z

To clarify, we propose to add an object of type biocharacteristic to the values allowed for attributes.

ejacox · 2017-01-08T08:05:51Z

How often are typed values needed? Is this overkill?

ejacox · 2017-01-08T10:00:42Z

My question gets at documenting the motivation for this change. The discussion is scattered in several PRs and issues. I will try to summarize: The motivation is to be able to precisely represent the data and to be able to recreate the original state of the data (roundtrip), e.g. if passing variants from a vcf file, then be able to recreate the vcf file from the data in the message. Therefore, there is a need to have lists of typed values that can be scalar or array values. Also, the data can be hierarchical.

The drawback of this approach is that it is complicated and will require more complex processing just to get a simple value. Will this impede acceptance of GA4GH by consumers of the data? Is the goal of roundtrip even attainable? The formats are not even precisely defined (see #505).

My impression of the info field was that it is for metadata, most often a string value. What if we keep info as well as adding attributes where it is necessary?

david4096 · 2017-01-10T04:59:37Z

Thanks @ejacox, great questions! The basic problem is that folks want to be able to tag with Ontology Terms, when they're available, and define loosely structured hierachical metadata. The other messages made available as AttributeValues (experiment, program, analysis) are aspirational and provided for completeness and don't interfere with the normal use cases.

The attributes message as it as used in sequence annotations provides the highest fidelity of interchange, and this PR brings that feature to the rest of the API. It answers the question, "what do I do with my JSON" in as final of a manner as we can do at this time.

In terms of keeping the info field, the attributes message is the same for the use case of the info field. I chose to rename it because I believe it's semantically more clear. The differences from the current info field are being able to nest keys and use some GA4GH messages as tags. a4a8bcc#diff-4fd17066615aa04304c49ac00877970bR104

dataset.info['version'].values.add().string_value = "version"
###
dataset.attributes.attr['version'].values.add().string_value = "version"
dataset.attributes.attr['species'].values.add().ontology_term.MergeFrom(myOntologyTerm)

You can compare the proto that backs the info field with the contents of this PR (https://github.com/google/protobuf/blob/master/src/google/protobuf/struct.proto#L93). Hopefully this will be largely code neutral and adds features that are needed to properly model loosely structured metadata. It also provides a path to better define the types represented in the metadata, but like the info field, will likely be used for string: [string] mappings much of the time.

The feature that results in the reference server is being able to provide an interface for interchanging JSON metadata, or any other hierarchically structured message of a mixture of GA4GH and internal protobuf types. Inspirations for nesting keys was developed to interchange structured tags in tagstorm format http://hgwdev.cse.ucsc.edu/~kent/tagStorm/testTagStorm.html .

A common use case request has been modeling loosely structured data, especially metadata, for example when interchanging phenotype data. The sequence annotations API's designers (@calbach @diekhans ) offered this message, this PR hopes to make the API consistent by supplying a single robust tagging method. I hope others will offer their perspective.

At first I found these data structures (info field) extremely overwrought for interchanging simplistic data. I have experience working with tag-bags that map strings to strings, and not being able to do this in just a few characters of code was frustrating and added a steep learning curve.

I was able to finally convince myself that type security and hierarchical mapping was important when working with variant set metadata in the variant annotations API. Often, analyses provide annotations in an unstructured manner, and these annotations were often nested arrays that needed to be split. It was my hope that no one would ever have to guess about whether a field is comma separated, contains booleans or strings, ontology terms or external identifiers, when using this API.

Hopefully, clients will obscure the type reflection that one would like when working with these attributes. Thanks for your considerations @ejacox. @bwalsh had stated something similar about the info field in the past, which I understand. However, I believe that type security, hierarchy, and using our own type system will enable a lot of common interchange use cases.

david4096 · 2017-01-25T20:04:58Z

Following discussion with @ejacox, we will be modifying this PR to allow other metadata entries to be gathered together. #690 (comment)

The new signature will be variant.metadata.attributes, which will allow us to store all shared metadata elements on a single message in the proto.

andrewjesaitis · 2017-01-25T21:01:00Z

Hmmm. Don't think I really understand the purpose of this and think it may obscure the original intention of the info field.

If I remember correctly, the purpose of the info field was to store data defined in source that didn't explicitly have a place in the schema. Moving it to be under a per record metadata field seems to change or at least obscure this purpose, since it suggests that this is data about the record rather than data "of" the record but that didn't have an explicit field for it.

As an example, I don't really view the data in a VCF's INFO field as metadata, it is data about a variant. I wouldn't expect Sift or Polyphen scores to be metadata, but under this change it is where they would be stored.

ejacox · 2017-01-26T17:15:11Z

We propose to keep the timestamp info explicit (not in attributes/info). Rather, any info (metadata) that applies to data in general will be put into a metadata message. Explicit fields will still be explicit.

Our motivation is to gather all purpose fields in one place. With timestamps, we were beginning to put them all over the place. When we wanted to change them, we needed to touch many messages. This way, any future changes (type change, description change, etc.) to timestamps will occur in one place.

Does that address your concerns and still align with original intent @andrewjesaitis ?

david4096 · 2017-01-26T18:17:00Z

An alternative would be to add the metadata message that has named metadata fields like created and updated and then maintain the attributes at the level they are.

Given the Variant Set Metadata message uses the word "metadata", I can't mount much of an argument against the suggestion to use it as an accessor for the attributes.

Ideally only developers have to see this part of the model, and if you look at how applications have to reconcile the meaning of tags in the Variant protocol, it might make more sense to call it metadata as opposed to info, I believe. It would be great to provide a similar self-description protocol in other parts of the API. We need to try to be VCF compliant, but not rely on it.

I would identify one problem as mixing the term "metadata" here, because the variant set metadata serves a number of purposes. It has ID resolution, descriptions, and type definitions (to some extent). So a variant set provides metadata about a variant's attributes, and a server provides metadata about a variant's record.

As pointed out by @ejacox, we need to better specify the use cases around the timestamp. It's not clear if it is meant to be provided by the file. If it is left up to interpretation, it is a server specific message about a record, and maybe should go in a different named field: record_metadata.

mbaudis · 2017-01-26T19:57:34Z

@ejacox I do not get this. Obviously, in GA4GH we use "metadata" roughly as "everything but the sequence", not just as describing housekeeping data associated with a message.

And this is a view shared broadly outside GA4GH, when talking about data sharing concepts etc.

So, please, try not to mix this into "named attributes and then accompanying metadata" or such.

For time stamps and ids, I personally prefer the use of "housekeeping" attributes. The current info field just is for unassigned data.

So general agreement with @david4096, though IMO created/updated (please read up all the scattered discussions & history, including previous better naming record_created ...) are part of the record (and technically different from server time stamps; e.g you may send a record to another host without changing the values).

andrewjesaitis · 2017-01-26T23:43:51Z

I think I'm in agreement with @david4096 suggestion of keeping attributes as a typed key-value store to hold data that doesn't explicitly have a field in the model. So I think this PR currently satisfies that.

If we want to move timestamps to a separate metadata message that should be a separate PR. I'll throw my timestamp thoughts on the other issue to keep this PR focused.

Attributes message is moved to metadata.proto info field was replaced for messages that had it with the field 'attributes' renamed the 'vals' field to 'attr' added to the available types in an attribute value message ('int32', 'int64', 'bool', 'double', 'null') Added attributes to the Variant Annotation Set message and RNA expression messages.

Remove unused package

Close ga4gh#537

Remove unused import

Merge assay_metadata in order to support as attributes.

Add tests that show basic attributes examples

Remove assay metadata from tests

david4096 added the in progress label Aug 25, 2016

david4096 force-pushed the 607_attributes branch from 91ff4f1 to aa5156f Compare August 25, 2016 00:37

david4096 mentioned this pull request Sep 9, 2016

Biosample sample characteristics #710

Closed

david4096 mentioned this pull request Oct 4, 2016

Biosample sample characteristics #711

Closed

david4096 added the ready label Oct 26, 2016

kozbo added this to the Schemas 1.0 milestone Nov 14, 2016

david4096 mentioned this pull request Dec 2, 2016

Variant annotation update (transcript effects endpoint) #706

Closed

david4096 mentioned this pull request Dec 14, 2016

Modeling Single Cell RNA Quantifications #758

Open

david4096 force-pushed the 607_attributes branch from d1f9e26 to 8ca5c45 Compare December 22, 2016 21:03

david4096 mentioned this pull request Jan 8, 2017

Basic peer service specification david4096/ga4gh-schemas#4

Closed

This was referenced Jan 9, 2017

Continuous Data #769

Closed

Implement improved attributes ga4gh/ga4gh-server#1520

Closed

david4096 force-pushed the 607_attributes branch from ef15ba1 to 3b7910d Compare January 10, 2017 03:50

david4096 mentioned this pull request Jan 10, 2017

Implement attributes field ga4gh/ga4gh-client#44

Closed

david4096 force-pushed the 607_attributes branch from 3b7910d to a4a8bcc Compare January 10, 2017 04:27

This was referenced Jan 12, 2017

Implement attributes ga4gh/compliance#240

Closed

Change info to attributes ga4gh/ga4gh-client#46

Merged

david4096 added 10 commits February 2, 2017 18:14

Remove references to info and update JSON example

a0e71ee

Remove unused package

Add attributes to reference/sets

8930a66

Close ga4gh#537

Bump reused tag number

97f139a

Remove unused import

Merge for genotypes list

26ae5f9

Add maps and lists as possible attributes

8512b22

Add analysis and program, reorder

d5accac

Move messages to common

b1f8479

Merge assay_metadata in order to support as attributes.

Clean up imports

b7b267a

Add tests that show basic attributes examples

Update compiled proto

167c5a0

Remove assay metadata from tests

david4096 force-pushed the 607_attributes branch from a4a8bcc to 167c5a0 Compare February 3, 2017 02:26

david4096 merged commit 396784a into ga4gh:master Feb 3, 2017

david4096 removed in progress ready labels Feb 3, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rich type representation in Attributes #700

Rich type representation in Attributes #700

david4096 commented Aug 25, 2016 •

edited

Loading

andrewjesaitis commented Sep 24, 2016

david4096 commented Sep 30, 2016 •

edited

Loading

mcourtot commented Oct 7, 2016

mcourtot commented Oct 10, 2016

ejacox commented Jan 8, 2017

ejacox commented Jan 8, 2017

david4096 commented Jan 10, 2017 •

edited

Loading

david4096 commented Jan 25, 2017

andrewjesaitis commented Jan 25, 2017

ejacox commented Jan 26, 2017 •

edited by david4096

Loading

david4096 commented Jan 26, 2017

mbaudis commented Jan 26, 2017

andrewjesaitis commented Jan 26, 2017

Rich type representation in Attributes #700

Rich type representation in Attributes #700

Conversation

david4096 commented Aug 25, 2016 • edited Loading

andrewjesaitis commented Sep 24, 2016

david4096 commented Sep 30, 2016 • edited Loading

mcourtot commented Oct 7, 2016

mcourtot commented Oct 10, 2016

ejacox commented Jan 8, 2017

ejacox commented Jan 8, 2017

david4096 commented Jan 10, 2017 • edited Loading

david4096 commented Jan 25, 2017

andrewjesaitis commented Jan 25, 2017

ejacox commented Jan 26, 2017 • edited by david4096 Loading

david4096 commented Jan 26, 2017

mbaudis commented Jan 26, 2017

andrewjesaitis commented Jan 26, 2017

david4096 commented Aug 25, 2016 •

edited

Loading

david4096 commented Sep 30, 2016 •

edited

Loading

david4096 commented Jan 10, 2017 •

edited

Loading

ejacox commented Jan 26, 2017 •

edited by david4096

Loading