GREGoR Processing #92

Open
6 of 16 tasks
bwalsh opened this issue Sep 11, 2024 · 3 comments

Comments

@bwalsh
Member

bwalsh commented Sep 11, 2024

GREGoR next steps

Use Case:

As a GREGoR analyst, in order to discover genotype-to-phenotype associations, I would like to compare CAF objects of cohorts from the GREGoR Consortium with CAF objects from the gnomAD consortium.

Test Driven Development:

Fixtures:

  • CAF objects from gnomad gks
  • VRS annotated VCFs generated from existing gregor VCFs. CAF objects generated from those annotated VCFs.
  • Authorization to read GREGoR data
  • Subset of GREGoR VCFs (chromosome TBD)
  • cat vrs objects

Methods:

  • Existing: Annotate GREGoR VCF - vrs-python, vrs_anvil_toolkit
  • Resolve GREGoR VCF specimen annotation to phenotype: vcf2phenotype
  • Generate CAF objects from the VRS annotated VCF using the GREGoR specimen matrix: vcf2caf
  • Search and match CAF.GNOMAD and CAF.GREGoR objects: caf-search
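
To make the vcf2caf step concrete, here is a minimal sketch of how allele counts might be rolled up into a CAF-like record. It is only an illustration: the function name `genotypes_to_caf`, the input shape, and the VRS identifier are hypothetical, and the field names are loosely modeled on the VA-Spec cohort allele frequency draft, so they should be checked against the schema once it is settled.

```python
def genotypes_to_caf(focus_allele_id, genotypes, cohort_id):
    """Build a CAF-like dict from per-sample genotype calls.

    genotypes maps sample_id -> tuple of allele indices, where
    0 = reference, 1 = the focus allele, None = missing call.
    (Assumed input shape; a real vcf2caf would read these from the VCF.)
    """
    focus_count = 0
    locus_count = 0
    for calls in genotypes.values():
        for allele in calls:
            if allele is None:
                continue  # no-calls do not contribute to either count
            locus_count += 1
            if allele == 1:
                focus_count += 1
    frequency = focus_count / locus_count if locus_count else None
    return {
        "type": "CohortAlleleFrequency",  # assumed type label
        "focusAllele": focus_allele_id,
        "focusAlleleCount": focus_count,
        "locusAlleleCount": locus_count,
        "focusAlleleFrequency": frequency,
        "cohort": {"id": cohort_id},
    }


# Hypothetical VRS id and cohort name, for illustration only.
example_caf = genotypes_to_caf(
    "ga4gh:VA.hypothetical-digest",
    {"S1": (0, 1), "S2": (1, 1), "S3": (None, None), "S4": (0, 0)},
    "GREGoR-subset",
)
```

Phenotype stratification (the vcf2phenotype output) could be layered on by calling the same function once per phenotype-defined subset of samples.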

Method Construction

  • For each new method [vcf2phenotype, vcf2caf, caf-search]:
    • consensus algorithm
    • test cases
    • implementation
    • review
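
As one possible starting test case for caf-search, the sketch below pairs GREGoR and gnomAD CAF records on a shared focus allele identifier. The function name `match_cafs`, the dict-based record shape, and the identifiers are assumptions made for illustration; the consensus algorithm may well match on something richer than an exact VRS id.

```python
def match_cafs(gregor_cafs, gnomad_cafs):
    """Pair each GREGoR CAF with the gnomAD CAF for the same focus allele.

    Returns one row per GREGoR record; gnomad_af is None when gnomAD
    has no record for that allele.
    """
    gnomad_af = {
        caf["focusAllele"]: caf["focusAlleleFrequency"] for caf in gnomad_cafs
    }
    return [
        {
            "focusAllele": caf["focusAllele"],
            "gregor_af": caf["focusAlleleFrequency"],
            "gnomad_af": gnomad_af.get(caf["focusAllele"]),
        }
        for caf in gregor_cafs
    ]


# Tiny hand-built fixtures with hypothetical VRS ids.
gregor_fixture = [
    {"focusAllele": "ga4gh:VA.aaa", "focusAlleleFrequency": 0.25},
    {"focusAllele": "ga4gh:VA.bbb", "focusAlleleFrequency": 0.5},
]
gnomad_fixture = [
    {"focusAllele": "ga4gh:VA.aaa", "focusAlleleFrequency": 0.001},
]
matches = match_cafs(gregor_fixture, gnomad_fixture)
```

Alleles seen in GREGoR but absent from gnomAD are kept rather than dropped, since those may be the most interesting candidates for follow-up.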

Acceptance:

  • Notebook(s)
  • journal article
@jsstevenson
Contributor

jsstevenson commented Sep 12, 2024

👍

I suspect that searches run against GREGoR will require different software than searches against gnomAD, so we can probably break that up conceptually. The latter could even be more of a stretch goal if necessary -- if we're running out of time and just need a demo, we could always just manually construct them and leave it as a proof of concept -- but could also be generalizable beyond this project (I am not sure how much additional work we'd need to do on top of the existing gnomad utils).

From Tuesday's discussion, I think a parquet file/flat file encompassing just the patient ID/VRS data and maybe some quality parameters for filtering would be the fixture against which a gregor search variation method would run. At least, this is what I've been working on since we spoke, so someone can speak up if I'm running off in the wrong direction.
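
A rough sketch of what a search against such a flat-file fixture might look like is below. The column names (`patient_id`, `vrs_id`, `qual`), the identifiers, and the function name are all assumptions, and an in-memory CSV stands in for the parquet file so the example needs only the standard library.

```python
import csv
import io

# Stand-in for the parquet/flat file; columns and values are hypothetical.
FIXTURE = """patient_id,vrs_id,qual
P1,ga4gh:VA.aaa,50
P2,ga4gh:VA.aaa,12
P3,ga4gh:VA.bbb,99
"""


def patients_with_allele(rows, vrs_id, min_qual=0.0):
    """Return sorted patient ids carrying vrs_id with qual >= min_qual."""
    return sorted(
        {
            row["patient_id"]
            for row in rows
            if row["vrs_id"] == vrs_id and float(row["qual"]) >= min_qual
        }
    )


rows = list(csv.DictReader(io.StringIO(FIXTURE)))
all_carriers = patients_with_allele(rows, "ga4gh:VA.aaa")
high_quality_carriers = patients_with_allele(rows, "ga4gh:VA.aaa", min_qual=30)
```

With a real parquet file the same filter would presumably be expressed through a dataframe or query engine instead of a list of dicts.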

@bwalsh
Member Author

bwalsh commented Sep 17, 2024

Notes 9/17:
@jsstevenson - can you provide the gs:// path to the vcf(s) you are testing with?

https://github.com/ga4gh/va-spec/

https://github.com/broadinstitute/gnomad_methods - a search exists here
https://github.com/genomicmedlab/gregor - james' work (ignore for now - experimental)
@bwalsh TODO google storage api + tabix: skip to offset

Assumption:
schema clarifications for CAF and others to be forthcoming from AlexW and GA4GH discussion

@bwalsh
Member Author

bwalsh commented Sep 18, 2024

Re. remote indexing, the following works in the AnVIL env

# set this to the remote vcf you have access to
export MY_OBJECT=gs://xxxxxx.vcf.gz

# get the auth token, tabix reads from GCS_OAUTH_TOKEN
export GCS_OAUTH_TOKEN=`gcloud auth application-default print-access-token`

# read the remote object, validate we can list headers
tabix -H $MY_OBJECT | grep -q '#CHROM' && echo 'remote access worked' || echo 'remote access failed'
# >> remote access worked

# assuming MY_OBJECT points at chrY, let's get alleles in the SRY gene
# see https://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000184895;r=Y:2786855-2787682;t=ENST00000383070

tabix -p vcf   $MY_OBJECT chrY:2,786,855-2,787,682 | wc -l
# >> 2

# assuming MY_OBJECT points at chr17, let's get alleles in the BRCA1 gene
# see https://useast.ensembl.org/Homo_sapiens/Location/View?db=core;g=ENSG00000012048;r=17:43044295-43170245

tabix -p vcf   $MY_OBJECT chr17:43,044,295-43,170,245  | wc -l
# >> 2592
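
For use from a notebook, the same region queries could be driven from Python. The sketch below only builds the region string and argument list, then shells out to the `tabix` binary; the helper names are hypothetical, and it assumes `tabix` is on the PATH and that `GCS_OAUTH_TOKEN` is already exported as in the shell steps above.

```python
import subprocess


def tabix_region(chrom, start, end):
    """Format a 1-based inclusive region string such as chr17:100-200."""
    if start < 1 or end < start:
        raise ValueError(f"invalid interval: {start}-{end}")
    return f"{chrom}:{start}-{end}"


def tabix_command(vcf_uri, chrom, start, end):
    """Build the argument list for a tabix region query."""
    return ["tabix", vcf_uri, tabix_region(chrom, start, end)]


def count_records(vcf_uri, chrom, start, end):
    """Run tabix and count returned VCF records (requires tabix and auth)."""
    result = subprocess.run(
        tabix_command(vcf_uri, chrom, start, end),
        check=True,
        capture_output=True,
        text=True,
    )
    return len(result.stdout.splitlines())


# BRCA1 coordinates taken from the Ensembl link above.
brca1_region = tabix_region("chr17", 43044295, 43170245)
```

Passing the arguments as a list avoids shell quoting issues with the `gs://` path; `count_records` is left uncalled here because it needs real credentials and a real object.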
