Group of Individuals/Cohorts #724

david4096 · 2016-10-17T00:31:19Z

The data model should provide a way to group individuals. It currently groups Samples and Individuals, but the ability to make statements about groups should be a first-class part of the data model.

@mbaudis @mcourtot

david4096 · 2017-01-12T16:50:41Z

Should individuals be able to be assigned to multiple cohorts? Proposal: add a Cohort message and a list of cohort_ids to the Individual message.

mbaudis · 2017-01-12T16:53:50Z

Yes; but shouldn't the cohort reference its members?

david4096 · 2017-01-12T16:55:12Z

Yes, and in the data it will. In terms of interchange, though, we can probably satisfy membership through queries of individuals by cohort ID, in a way similar to the other APIs.

mcourtot · 2017-01-16T16:03:33Z

Hi @david4096 - could you please expand on what you mean by

"we can probably satisfy membership through queries of individuals by cohort ID in a method similar to the other APIs."

Thanks!

david4096 · 2017-01-17T20:00:17Z

Hi @mcourtot, of course, thank you for asking! The idea is that a cohort message doesn't list which individuals are members, since that message might become very large. For example, if a cohort has 65k individuals and each individual ID were added to the cohort message, that message would be very difficult to work with. Instead, we can add a "cohort_ids" field to the individual message, so an individual can be placed in multiple cohorts in a dataset.

Then, when you want to know which individuals are in a given cohort, you request only those individuals by sending a SearchIndividualsRequest with that cohort ID set.

In this way, the access pattern would be: "search for cohorts matching some criteria", then "search for individuals matching cohort ID". From the results we can reconstruct which individuals are part of a cohort.
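A minimal sketch of that two-step access pattern, using in-memory stand-ins rather than the actual API. The `Cohort` and `Individual` classes, their fields other than `cohort_ids`, and the `search_cohorts` / `search_individuals` helpers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cohort:
    # Hypothetical cohort message: metadata only, no member list.
    id: str
    name: str


@dataclass
class Individual:
    # Hypothetical individual message carrying the proposed cohort_ids field.
    id: str
    cohort_ids: List[str] = field(default_factory=list)


def search_cohorts(cohorts: List[Cohort], name_contains: str) -> List[Cohort]:
    """Step 1: search for cohorts matching some criteria (here, a name substring)."""
    needle = name_contains.lower()
    return [c for c in cohorts if needle in c.name.lower()]


def search_individuals(individuals: List[Individual], cohort_id: str) -> List[Individual]:
    """Step 2: search for individuals that have the given cohort ID set."""
    return [i for i in individuals if cohort_id in i.cohort_ids]


def members_by_cohort(
    cohorts: List[Cohort], individuals: List[Individual], name_contains: str
) -> Dict[str, List[str]]:
    """Reconstruct cohort membership by chaining the two searches."""
    return {
        c.id: [i.id for i in search_individuals(individuals, c.id)]
        for c in search_cohorts(cohorts, name_contains)
    }
```

In this shape the cohort message stays small regardless of cohort size, and an individual can appear in several cohorts simply by listing several IDs.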

mbaudis · 2017-01-17T20:13:40Z

@david4096 But you are solving a technical obstacle by introducing a conceptual xxx.

I have always made a point (well early MTT etc.) that groupings of records should in principle be treated dynamically; i.e. you go via a stored set, or via a query output.

A "cohort" therefore may be all callsets from breast cancer samples sequenced using some WES technique, based on biosamples from stage II tumors of pre-menopausal (proxy age ...) female smokers.

This could be a curated cohort, or a query with changing content (remember: GA4GH is about "federated" data).

Now, technically one could do it both ways: even a query-based cohort could first associate each matching record with a cohort identifier, and proceed from there.
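As a rough illustration of the two kinds of grouping described here (a stored set versus a query output), the sketch below resolves a cohort either way. The `Cohort` class, its `member_ids` and `query` fields, and `resolve_members` are assumptions made for this example and are not part of any existing schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set

Record = Dict[str, str]


@dataclass
class Cohort:
    # Hypothetical cohort: either a curated (stored) set of record IDs,
    # or a query that is re-evaluated against whatever records exist now.
    id: str
    member_ids: Optional[Set[str]] = None
    query: Optional[Callable[[Record], bool]] = None


def resolve_members(cohort: Cohort, records: List[Record]) -> List[Record]:
    """Return the records that currently belong to the cohort."""
    if cohort.member_ids is not None:
        # Curated cohort: membership is fixed by the stored set.
        return [r for r in records if r["id"] in cohort.member_ids]
    if cohort.query is not None:
        # Dynamic cohort: membership changes as the underlying data changes.
        return [r for r in records if cohort.query(r)]
    return []
```

A dynamic cohort defined as "stage II breast cancer samples" would pick up newly added matching records on the next call, while a curated cohort would not.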

Generally, IMO the preference for a schema solution based on "the message could get large" is flawed; workarounds can happen at the implementation stage.

But I may be wrong, of course ...

david4096 · 2017-01-20T02:30:29Z

Thanks @mbaudis ! I am happy to revoke my premature optimization of message size. The Cohort message itself I consider to be the metadata like "samples sequenced using this technique." Some combination of queries and fields can handle the referential integrity, but if we can move ahead with a simpler change I am all for it.

I think if we're ok with what might become a large Cohort message (if there are 10k samples it might take a moment to download), we can avoid any other API changes. The membership of a biosample or individual in a given cohort will be recorded in the cohort message, which can then be used to construct directed queries against the existing biosamples or individuals endpoints. We then just need to provide easy ways of generating valid cohort messages for a given dataset.
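A sketch of what this single-document approach might look like, assuming the cohort message simply lists member identifiers. `CohortMessage`, `build_cohort`, and `fetch_members` are hypothetical names, and the lookup callable stands in for a request to the individuals or biosamples endpoint.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class CohortMessage:
    # Hypothetical cohort message that references its members directly.
    id: str
    description: str
    individual_ids: List[str] = field(default_factory=list)
    biosample_ids: List[str] = field(default_factory=list)


def build_cohort(
    cohort_id: str,
    description: str,
    individual_ids: Iterable[str],
    biosample_ids: Iterable[str],
    known_individuals: Dict[str, dict],
    known_biosamples: Dict[str, dict],
) -> CohortMessage:
    """Generate a cohort message, rejecting IDs that are not in the dataset."""
    individual_ids = list(individual_ids)
    biosample_ids = list(biosample_ids)
    unknown = [i for i in individual_ids if i not in known_individuals]
    unknown += [b for b in biosample_ids if b not in known_biosamples]
    if unknown:
        raise ValueError(f"IDs not present in dataset: {unknown}")
    return CohortMessage(cohort_id, description, individual_ids, biosample_ids)


def fetch_members(ids: Iterable[str], get_by_id: Callable[[str], dict]) -> List[dict]:
    """Issue one directed lookup per member ID listed in the cohort message."""
    return [get_by_id(member_id) for member_id in ids]
```

Validating IDs at construction time is one way to keep the references in a cohort message consistent with the dataset it describes.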

I would like to consider the aspirations of federating queries, but would also like to constrain the discussion and have placed a stub for federation here. You can read some of my thoughts about getting there in this issue about searching by external identifiers.

The alternative is to follow the idiom of the references API and make the connection "loose" between cohort messages and biosample or individual messages. Implementing over this type of schema is challenging, as a lot of room for interpretation is left to the implementor. I think the simplicity of using a single document will make it easy to implement. If we face issues with very large documents, we can provide an API for listing the underlying individual and biosample identifiers.

mcourtot added the MetadataTaskTeam label Oct 18, 2016

mcourtot self-assigned this Oct 18, 2016

david4096 mentioned this issue Oct 28, 2016

Variant annotation update (transcript effects endpoint) #706

Closed

kozbo added this to the Schemas 1.0 milestone Nov 14, 2016
