Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Variant annotation update (transcript effects endpoint) #706

Closed
wants to merge 3 commits into from

Conversation

david4096
Copy link
Member

@david4096 david4096 commented Sep 1, 2016

The current variant annotation message structure faces an issue seen in other parts of the API (readgroupsets, variants). Messages returned from a search request may be modified during a query, causing the same identifier to point to two unique documents. This can cause problems in interchange. It also creates a complex nested document that can be difficult to work with in practice.

To solve this, another endpoint has been added for filtering transcript effects. This allows transcript effects to be filtered and operated on directly, as opposed to through the modification of Variant Annotation messages at query time. It also provides a more natural query interface by separating concerns from the SearchVariantAnnotationsRequest.

Another problem facing the Variant Annotation protocol was the difficulty in finding an annotation for a given variant. It isn't possible to directly get an annotation for a given variant, which has caused clients to have to search for similar ranges for a given variant in an annotation set. The addition of the variant_id to the SearchVariantAnnotationsRequest makes it possible to find annotations directly, when a variant_id is provided the range is ignored.

By request from the VATT, filtering annotations by allele frequency has been added. This was done by adding an allele_frequency field to the VariantAnnotation message and a frequency_threshold to the SearchVariantAnnotationsRequest. @reece @sarahhunt @Kusdhill

  • Search transcript effects directly (/transcripteffects/search)
  • Filter variant annotations by allele_frequency
  • Search variant annotations by variant ID

@sarahhunt
Copy link
Contributor

@david4096 - thanks for putting back the search by variant and feature options!

Directly querying TranscriptEffects makes sense for feature id and location, but I’m not sure about searching by variant_annotation_id. The only way to extract a variant_annotation_id is by looking up the VariantAnnotation record which contains the TranscriptEffects anyway. Searching by variant id would be more useful, though we don’t have variant id in TranscriptEffect.

Similarly, if you find all TranscriptEffects for a gene, you then need to look up all VariantAnnotation records to find the variant id, picking up all the TranscriptEffects again.

The desire to not modify the variant annotation record may be better met by not having the record to group results by variant and leaving the client to handle this. What do you think?

@david4096 david4096 force-pushed the transcript_effects branch 2 times, most recently from f37847e to bb2e69b Compare September 7, 2016 20:34
@david4096
Copy link
Member Author

Thanks @sarahhunt excellent points! I have removed the transcript_effects field from the VariantAnnotation message and added a variant_id to the TranscriptEffect message. This will allow one to easily move back and forth between the protocols.

@david4096
Copy link
Member Author

Helps with ga4gh/ga4gh-server#308


// Only return variants with an allele frequency annotation over this value.
// Return all variants by default.
double frequency_threshold = 10;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale for a one-sided test? What if someone wants variants below a threshold? minimum_frequency and maximum_frequency would be more generally useful.

Are we planning notion of population specific frequencies?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell that is a difficult domain to model well, as different data preparers will annotate their subpopulations differently. If this is a common enough use case we should consider it, as there is work on modeling groups of individuals here. We can make a nice access pattern through that endpoint.

This frequency, as I understand it, would have been calculated for the specific variant set. Adding a minimum threshold makes sense.

// less than reference length. Requests spanning the join of circular
// genomes are represented as two requests one on each side of the join
// (position 0).
int64 start = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint64? (Have we had this discussion already? Someone must have picked up on this)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a sensible change to me! It would imply a change to how position is specified throughout the API, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaning towards keeping the sign based on the response to #737

Add ability to filter by allele frequency
Provide ability to get annotations by variant id
Remove transcript effects from variant annotation message
@sarahhunt
Copy link
Contributor

@david4096 - I've been meaning to get back to this, sorry for the delay. Looking at the proposed changes, I am doubting the value of the variantAnnotation record and thinking it would be cleaner to just query for TranscriptEffects directly based on variant id, transcript id or location. What do you think?

@david4096
Copy link
Member Author

Thanks @sarahhunt !

The main reason for keeping the VariantAnnotation record are for "variant level" annotations, which for the time being is Allele Frequency. Perhaps we could move those features to the Variants API instead?

Can you tell me if I'm mistaken, but in practice if you have a Features API that allows you to get transcript effect IDs, can't you use that to construct a range request? I have been thinking that feature ID filtering of an endpoint might be fulfilled already by that access pattern. Like, if I want to find transcript effects for some ID that I have, I request that from the features API with a GetFeatureRequest, then use the returned location data to construct a range request against /transcripteffects.

Thanks for taking a look! I'm open to deprecating the Variant Annotation message. I think that with something like #700 it might be possible to think about how to construct requests against typed annotations. For now, I think the most important point thing is to remove the modification at query time of transcript effect messages.

@sarahhunt
Copy link
Contributor

Hi @david4096!

We are currently using the VariantAnnotation record for allele frequency and ClinVar clinical significance info, but the first is a property of the variant and the second fits better as a seperate search for phenotype associations. It would be interesting to know how anyone else is using it.

Searching by featureIds rather than just ranges is useful because many features overlap. If you are interested in a specific transcript, there could be a dozen in the same range, so only supporting range queries could mean more records than are required are returned and force the filtering onto the API user. There is also a nice consistency if you can throw a set of featureIds at the various endpoints and retrieve different types of data.

Tightening up the info structure as in #700 makes a lot of sense.

@david4096
Copy link
Member Author

Thanks for the input regarding feature search. I'll leave it as is then. If there is a way I can push this PR forward without having to make modifications to the variants model, I'll take it. I believe we are going to need to switch to a variant model with one alternate base entry per variant and my hope is that by separating out the messages now we will be better prepared.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants