Group of Individuals/Cohorts #724
Comments
Should individuals be able to be assigned to multiple cohorts? Proposal: add a cohort message and a list of
Yes; but shouldn't the cohort reference its members?
Yes, and in the data it will, but in terms of interchange we can probably satisfy membership through queries of individuals by cohort ID, in a method similar to the other APIs.
Hi @david4096 - could you please expand on what you mean by
Thanks!
Hi @mcourtot, of course, thank you for asking! The idea is that a cohort message doesn't say which individuals are members, since that message might be very large. For example, if there are 65k individuals in a cohort and you had added each individual ID to the cohort message, that message would be very difficult to work with. Because of that, we can add a "cohort_ids" field to an individual message so an individual can be put in multiple cohorts in a dataset. Then, when you want to know which individuals are in a given cohort, you request only those individuals via a SearchIndividualsRequest that has that cohort ID set. In this way, the access pattern would go: "search for cohorts matching some criteria", then "search for individuals matching cohort ID". Then we can reconstruct the individuals that were part of a cohort.
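The two-step access pattern described in the comment above can be sketched as follows. This is only an in-memory illustration, not the GA4GH schema or server code: the `Cohort` and `Individual` classes, the `search_cohorts` and `search_individuals` helpers, and the integer page token are all assumptions made for the example. Only the `cohort_ids` field and the idea of searching individuals by cohort ID come from the discussion.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Cohort:
    # The cohort message carries no member list (hypothetical shape).
    id: str
    name: str


@dataclass
class Individual:
    id: str
    # An individual may belong to several cohorts.
    cohort_ids: List[str] = field(default_factory=list)


def search_cohorts(cohorts: List[Cohort], name_contains: str) -> List[Cohort]:
    """Step 1: find cohorts matching some criteria (here, a name substring)."""
    needle = name_contains.lower()
    return [c for c in cohorts if needle in c.name.lower()]


def search_individuals(
    individuals: List[Individual],
    cohort_id: str,
    page_size: int = 2,
    page_token: int = 0,
) -> Tuple[List[Individual], Optional[int]]:
    """Step 2: return one page of individuals whose cohort_ids include cohort_id."""
    matches = [i for i in individuals if cohort_id in i.cohort_ids]
    end = page_token + page_size
    next_token = end if end < len(matches) else None
    return matches[page_token:end], next_token


def cohort_members(individuals: List[Individual], cohort_id: str) -> List[Individual]:
    """Reconstruct the full membership by following page tokens."""
    members: List[Individual] = []
    token: Optional[int] = 0
    while token is not None:
        page, token = search_individuals(individuals, cohort_id, page_token=token)
        members.extend(page)
    return members


# Made-up example data.
cohorts = [Cohort("c1", "Breast cancer WES"), Cohort("c2", "Controls")]
individuals = [
    Individual("i1", ["c1"]),
    Individual("i2", ["c1", "c2"]),
    Individual("i3", ["c2"]),
    Individual("i4", ["c1"]),
]

matched = search_cohorts(cohorts, "breast")
member_ids = [i.id for i in cohort_members(individuals, matched[0].id)]
```

Because membership is resolved page by page, the cohort message itself stays small however many individuals reference it; the cost is one extra search per cohort.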
@david4096 But you are solving a technical obstacle by introducing a conceptual xxx. I have always made a point (well early MTT etc.) that groupings of records should in principle be treated dynamically; i.e. you go via a stored set, or via a query output. A "cohort" therefore may be all callsets from breast cancer samples sequenced using some WES technique, based on biosamples from stage II tumors of pre-menopausal (proxy age ...) female smokers. This could be a curated cohort, or a query with changing content (remember: GA4GH is about "federated" data). Now, technically one could do it both ways: even a query-based cohort could first associate a record with a cohort identifier, and proceed from there. Generally, IMO the preference for a schema solution based on "the message could get large" is flawed; workarounds can happen at the implementation stage. But I may be wrong, of course ...
Thanks @mbaudis! I am happy to revoke my premature optimization of message size. The I think if we're ok with what might become a large I would like to consider the aspirations of federating queries, but would also like to constrain the discussion and have placed a stub for federation here. You can read some of my thoughts about getting there in this issue about searching by external identifiers. The alternative is to follow the idiom of the references API and make the connection "loose" between cohort messages and biosample or individual messages. Implementing over this type of schema is challenging, as there is a lot of room for interpretation left to the implementor. I think the simplicity of using a single document will make it easy to implement. If we face issues with very large documents we can provide an API for listing the underlying individual and biosample identifiers.
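The single-document option, with a fallback listing API for very large cohorts, might look roughly like this. It is a sketch under stated assumptions: `CohortDocument`, its `individual_ids` and `biosample_ids` fields, and the `list_member_ids` helper are hypothetical names chosen for illustration and are not taken from the schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CohortDocument:
    # One document that references its members directly (hypothetical shape).
    id: str
    description: str
    individual_ids: List[str] = field(default_factory=list)
    biosample_ids: List[str] = field(default_factory=list)


def list_member_ids(
    cohort: CohortDocument,
    kind: str,
    page_size: int = 1000,
    page_token: int = 0,
) -> Tuple[List[str], Optional[int]]:
    """Return one page of member identifiers instead of the whole document.

    kind is "individual" or "biosample"; the integer page token is an
    assumption made for this example.
    """
    if kind == "individual":
        ids = cohort.individual_ids
    elif kind == "biosample":
        ids = cohort.biosample_ids
    else:
        raise ValueError(f"unknown member kind: {kind!r}")
    end = page_token + page_size
    next_token = end if end < len(ids) else None
    return ids[page_token:end], next_token


# Made-up example data.
cohort = CohortDocument(
    id="cohort-1",
    description="example cohort",
    individual_ids=[f"ind-{n}" for n in range(5)],
    biosample_ids=["bs-0", "bs-1"],
)

first_page, token = list_member_ids(cohort, "individual", page_size=2)
```

Small cohorts can be exchanged as a single document, while clients of a large cohort would page through identifiers with something like `list_member_ids` rather than fetching every member at once.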
The data model should provide a way to group individuals. It currently groups Samples and Individuals, but being able to make statements about groups should be included as a first-class part of the data model.
@mbaudis @mcourtot