Requirements for UKB GWAS #67

Open
8 of 11 tasks
eric-czech opened this issue Jul 24, 2020 · 9 comments
Labels
core operations: Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.

Comments

@eric-czech
Collaborator

eric-czech commented Jul 24, 2020

To run a basic GWAS on UKB data, here are some of the operations we'll need support for:

There may be a few more beyond that, but I think anything remaining should be reasonable with Xarray/Dask alone.

@tomwhite
Collaborator

I'd be happy to work on variant/sample stats (#29) if no one else is working on them.

@hammer
Contributor

hammer commented Jul 27, 2020

@eric-czech how are you thinking about LD estimation/pruning, population structure estimation/pruning, and relatedness estimation/pruning? Does the REGENIE implementation include some means of estimating these things as covariates for the regression, or are you just thinking of those operations as optimizations that can be implemented later?

@hammer added the "core operations" label Jul 27, 2020
@hammer
Contributor

hammer commented Jul 27, 2020

To answer my own question, there are 3 stages to our work with UK Biobank:

  • Stage 1: per-variant linear regression with provided population structure and kinship estimates. No LD pruning needed.
  • Stage 2: whole-genome regression with provided population structure and kinship estimates. LD pruning needed.
  • Stage 3: do our own population structure and kinship estimation.
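For illustration only, Stage 1 could be sketched roughly as below with plain NumPy. This is not sgkit's API: the function name `per_variant_linear_regression` and its arguments are hypothetical, the `covariates` matrix stands in for the provided population structure estimates (e.g. principal components), and kinship is assumed to have been handled by filtering related samples beforehand. The sketch projects the covariates out of both the genotypes and the phenotype, then fits one single-predictor regression per variant, which gives the same effect estimate as fitting the full model for each variant separately.

```python
import numpy as np


def per_variant_linear_regression(dosage, covariates, phenotype):
    """Hypothetical sketch: one linear regression per variant, adjusted for covariates.

    dosage:     (n_samples, n_variants) alternate allele dosages
    covariates: (n_samples, n_covariates), without an intercept column
    phenotype:  (n_samples,) continuous trait
    Returns per-variant effect estimates and t-statistics.
    """
    n_samples = dosage.shape[0]

    # Covariate design matrix with an intercept column.
    X = np.column_stack([np.ones(n_samples), covariates])

    # Orthonormal basis for the covariate column space.
    Q, _ = np.linalg.qr(X)

    # Residualize genotypes and phenotype against the covariates.
    G = dosage - Q @ (Q.T @ dosage)
    y = phenotype - Q @ (Q.T @ phenotype)

    # Per-variant sums of squares of the residualized genotypes.
    gg = np.einsum("ij,ij->j", G, G)

    # Effect size for each variant from a single-predictor fit.
    beta = (G.T @ y) / gg

    # Residual sum of squares and standard errors; the degrees of freedom
    # account for the intercept, the covariates, and the variant itself.
    dof = n_samples - X.shape[1] - 1
    rss = y @ y - beta**2 * gg
    se = np.sqrt(rss / dof / gg)

    return beta, beta / se
```

In a real UKB-scale run the arrays would not fit in memory, so the same arithmetic would need to be applied block-wise over variants (for example with Dask arrays); that part is left out of the sketch.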

@hammer
Contributor

hammer commented Jul 27, 2020

> A phenotype normalization pipeline.

I've been thinking about this one too. I think we're going to feel Dask's poor handling of nested data when working with phenotypes. I'd prefer to keep Spark out of this project as a dependency, so if we find we do need Spark, I think we should put that code into a separate repo.

@hammer
Contributor

hammer commented Aug 12, 2020

> A variant annotation function like vep. There are plenty of other ways to get this but an internal function would be great.

File an issue to track?

@eric-czech
Collaborator Author

> File an issue to track?

https://github.com/pystatgen/sgkit/issues/112

@hammer
Contributor

hammer commented Sep 4, 2020

> A phenotype normalization pipeline.

@eric-czech I just had a nice chat with @zietzm and @ntatonetti who are at Columbia and are experts in handling complex phenotypes and running many GWAS against them.

They're interested in using sgkit and possibly contributing back, particularly on the phenotype side.

Would you be open to making https://github.com/related-sciences/ukb-gwas-pipeline-nealelab public soon and potentially working with @zietzm to factor the phenotype handling code into its own repo, maybe something like sgkit-pheno or phenokit?

@eric-czech
Collaborator Author

> Would you be open to making https://github.com/related-sciences/ukb-gwas-pipeline-nealelab public soon and potentially working with @zietzm to factor the phenotype handling code into its own repo, maybe something like sgkit-pheno or phenokit?

For sure! Looking forward to seeing how we can better integrate phenotypes.

@eric-czech
Collaborator Author

FYI @zietzm / @ntatonetti (cc: @hammer) the phenotype prep code we're currently using (via PHESANT) is here: ukb-gwas-pipeline-nealelab#phenotype_prep.smk.

There is little to it yet other than running some messy, very inefficient R code to produce ~75 phenotypes that I wanted to attempt to validate against first. It would be great to hear your thoughts on how we might better define these as well as improve the mechanics of how we're creating them. I'm particularly interested in ICD code management since this pipeline doesn't address that.


3 participants