This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Group of Individuals/Cohorts #724

Open
david4096 opened this issue Oct 17, 2016 · 7 comments

@david4096
Member

david4096 commented Oct 17, 2016

The data model should provide a way to group individuals. It currently groups Samples and Individuals, but the ability to make statements about groups should be a first-class part of the data model.

@mbaudis @mcourtot

@david4096
Member Author

Should individuals be able to be assigned to multiple cohorts? Proposal: add a Cohort message and a list of cohort_ids to the Individual message.

@mbaudis
Member

mbaudis commented Jan 12, 2017

Yes; but shouldn't the cohort reference its members?

@david4096
Member Author

Yes, and in the data it will. In terms of interchange, though, we can probably satisfy membership through queries of individuals by cohort ID, in a manner similar to the other APIs.

@mcourtot

mcourtot commented Jan 16, 2017

Hi @david4096 - could you please expand on what you mean by

we can probably satisfy membership through queries of individuals by cohort ID in a method similar to the other APIs.

Thanks!

@david4096
Member Author

david4096 commented Jan 17, 2017

Hi @mcourtot, of course, thank you for asking! The idea is that a cohort message doesn't list which individuals are members, since that message might become very large. For example, with 65k individuals in a cohort, a message carrying every individual ID would be very difficult to work with. Instead, we can add a "cohort_ids" field to the individual message, so an individual can belong to multiple cohorts in a dataset.

Then, when you want to know which individuals are in a given cohort, you request only those individuals via a SearchIndividualsRequest with that cohort ID set.

In this way, the access pattern would go, "search for cohorts matching some criteria", "search for individuals matching cohort ID". Then we can reconstruct the individuals that were part of a cohort.
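A minimal sketch of that two-step access pattern, in plain Python rather than the actual schema. The `Cohort` and `Individual` classes, their fields other than `cohort_ids`, and the two search helpers are hypothetical stand-ins for whatever messages and endpoints would eventually be defined:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cohort:
    # Hypothetical cohort message: metadata only, no member list.
    id: str
    name: str


@dataclass
class Individual:
    # Hypothetical individual message with the proposed cohort_ids field.
    id: str
    cohort_ids: List[str] = field(default_factory=list)


def search_cohorts(cohorts: List[Cohort], name_contains: str) -> List[Cohort]:
    """Step 1: search for cohorts matching some criteria (here, a name substring)."""
    needle = name_contains.lower()
    return [c for c in cohorts if needle in c.name.lower()]


def search_individuals(individuals: List[Individual], cohort_id: str) -> List[Individual]:
    """Step 2: search for individuals whose cohort_ids include the cohort ID."""
    return [i for i in individuals if cohort_id in i.cohort_ids]


cohorts = [Cohort("c1", "Breast cancer WES"), Cohort("c2", "Healthy controls")]
individuals = [
    Individual("i1", ["c1"]),
    Individual("i2", ["c1", "c2"]),  # one individual in two cohorts
    Individual("i3", ["c2"]),
]

# Reconstruct cohort membership from the two searches.
matches = search_cohorts(cohorts, "breast")
members = {c.id: [i.id for i in search_individuals(individuals, c.id)] for c in matches}
```

Here membership lives only on the individual side, so the cohort message stays small regardless of cohort size.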

@mbaudis
Member

mbaudis commented Jan 17, 2017

@david4096 But you are solving a technical obstacle by introducing a conceptual xxx.

I have always made a point (well early MTT etc.) that groupings of records should in principle be treated dynamically; i.e. you go via a stored set, or via a query output.

A "cohort" therefore may be all callsets from breast cancer samples sequenced using some WES technique, based on biosamples from stage II tumors of pre-menopausal (proxy age ...) female smokers.

This could be a curated cohort, or a query with changing content (remember: GA4GH is about "federated" data).

Now, technically one could do it both ways: Even a query based cohort could first associate a record with a cohort identifier, and proceed from there.
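To make the distinction concrete, here is a rough sketch in Python of a cohort that resolves either from a stored set or from a query. Everything in it is an assumption for illustration: the `Cohort` class, its `member_ids` and `predicate` fields, and the record keys are hypothetical and not part of any existing schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional

Record = Dict[str, str]  # hypothetical flat biosample record


@dataclass
class Cohort:
    id: str
    # Curated cohort: a stored, fixed set of record IDs.
    member_ids: Optional[FrozenSet[str]] = None
    # Dynamic cohort: a query whose result may change as data changes.
    predicate: Optional[Callable[[Record], bool]] = None

    def resolve(self, records: List[Record]) -> List[str]:
        """Return the IDs of the records currently in this cohort."""
        if self.member_ids is not None:
            return [r["id"] for r in records if r["id"] in self.member_ids]
        if self.predicate is not None:
            return [r["id"] for r in records if self.predicate(r)]
        return []


records = [
    {"id": "b1", "diagnosis": "breast cancer", "stage": "II"},
    {"id": "b2", "diagnosis": "breast cancer", "stage": "III"},
]

curated = Cohort("curated", member_ids=frozenset({"b1", "b2"}))
dynamic = Cohort(
    "stage-2",
    predicate=lambda r: r["diagnosis"] == "breast cancer" and r["stage"] == "II",
)

before = dynamic.resolve(records)

# New data arrives: the query-based cohort picks it up, the curated one does not.
records.append({"id": "b3", "diagnosis": "breast cancer", "stage": "II"})
after = dynamic.resolve(records)
curated_after = curated.resolve(records)
```

A query-based cohort could also be materialised by writing its resolved IDs into a curated one, which is the "associate a record with a cohort identifier" route.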

Generally, IMO the preference for a schema solution based on "the message could get large" is flawed; workarounds can happen at the implementation stage.

But I may be wrong, of course ...

@david4096
Member Author

Thanks @mbaudis ! I am happy to revoke my premature optimization of message size. The Cohort message itself I consider to be the metadata like "samples sequenced using this technique." Some combination of queries and fields can handle the referential integrity, but if we can move ahead with a simpler change I am all for it.

I think if we're ok with what might become a large Cohort message (if there are 10k samples it might take a moment to download), we can avoid any other API changes. The membership of a biosample or individual in a given cohort will be recorded in the cohort message, which can then be used to construct directed queries against the biosamples or individuals endpoints. We then just need to provide easy ways of generating valid cohort messages for a given dataset.

I would like to consider the aspirations of federating queries, but would also like to constrain the discussion and have placed a stub for federation here. You can read some of my thoughts about getting there in this issue about searching by external identifiers.

The alternative is to follow the idiom of the references API and make the connection "loose" between cohort messages and biosample or individual messages. Implementing over this type of schema is challenging, as a lot of room for interpretation is left to the implementor. I think the simplicity of using a single document will make it easy to implement. If we face issues with very large documents, we can provide an API for listing the underlying individual and biosample identifiers.
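As a rough illustration of the single-document approach with a fallback for large cohorts, the sketch below keeps member IDs in the cohort and adds a paged listing helper. The `Cohort` fields and the `list_individual_ids` function are hypothetical; the page-token style is only assumed to resemble the existing search endpoints.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Cohort:
    # Hypothetical single-document cohort: metadata plus member identifiers.
    id: str
    description: str
    individual_ids: List[str] = field(default_factory=list)
    biosample_ids: List[str] = field(default_factory=list)


def list_individual_ids(
    cohort: Cohort, page_size: int, page_token: Optional[str] = None
) -> Tuple[List[str], Optional[str]]:
    """Return one page of individual IDs and the token for the next page (or None)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    start = int(page_token) if page_token else 0
    end = start + page_size
    page = cohort.individual_ids[start:end]
    next_token = str(end) if end < len(cohort.individual_ids) else None
    return page, next_token


cohort = Cohort(
    id="c1",
    description="samples sequenced using this technique",
    individual_ids=[f"ind-{n}" for n in range(5)],
)

# Walk every page; each ID could then drive a directed query against
# the individuals endpoint.
collected: List[str] = []
token: Optional[str] = None
while True:
    page, token = list_individual_ids(cohort, page_size=2, page_token=token)
    collected.extend(page)
    if token is None:
        break
```

Small cohorts can be read in one go from `individual_ids`; the paged helper only matters once the document becomes unwieldy.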
