Updating GWAS Catalog data ingestion #21
base: dev
Conversation
@@ -108,7 +233,7 @@ def parse_associations(association_df: DataFrame) -> DataFrame:
    .withColumn('association_id', f.monotonically_increasing_id())

    # Processing variant related columns:
    # - Sorting out current rsID field:

Comment on `# - Sorting out current rsID field:`: why do we need this? rs identifiers should always come from the GnomAD dataset.
I agree that once you have chromosome, position, ref and alt, you can get the rs ID from GnomAD. In that case, can you remove lines 247 and 248?
So this is complicated... if we join the tables by chromosome and position, there might be multiple variants with different alleles. In those cases the rsID could help disambiguate the mapping. This needs to be resolved.
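To illustrate the ambiguity being discussed, here is a hedged, pure-Python sketch (not code from the PR; the record fields and function name are hypothetical) of how the rsID could act as a tiebreaker when a chromosome/position join returns several variants with different alleles:

```python
# Hypothetical sketch: using the GWAS Catalog rsID to disambiguate a
# chromosome/position join against GnomAD. All names are illustrative.
from typing import Optional


def match_variant(assoc: dict, gnomad_variants: list) -> Optional[dict]:
    """Pick the GnomAD variant matching an association row.

    Joining on chromosome + position alone can return several variants
    that differ only in their alleles; the rsID (when present) can
    break the tie.
    """
    candidates = [
        v for v in gnomad_variants
        if v['chrom'] == assoc['chrom'] and v['pos'] == assoc['pos']
    ]
    if len(candidates) == 1:
        return candidates[0]
    # Multiple variants at the same locus: fall back to the rsID.
    for v in candidates:
        if assoc.get('rsid') and v['rsid'] == assoc['rsid']:
            return v
    # Ambiguous and no rsID to resolve it: leave unmapped.
    return None
```

Whether an unmapped row should be dropped or flagged is a separate design decision for the pipeline.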
.withColumn('sample_size', F.regexp_extract(F.regexp_replace(F.col('samples'), ',', ''), r'[0-9,]+', 0).cast(T.IntegerType()))

# Extracting number of cases:
.withColumn('n_cases', F.when(F.col('samples').contains('cases'), F.col('sample_size')).otherwise(F.lit(0)))
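For clarity, a pure-Python sketch of what the two Spark expressions above compute (the function names are hypothetical, introduced only for illustration):

```python
import re


def parse_sample_size(samples: str) -> int:
    """Sketch of the Spark expression: strip thousands separators,
    then take the first run of digits and cast it to an integer."""
    cleaned = samples.replace(',', '')      # F.regexp_replace(..., ',', '')
    match = re.search(r'[0-9]+', cleaned)   # F.regexp_extract(..., r'[0-9,]+', 0)
    return int(match.group(0)) if match else 0


def parse_n_cases(samples: str) -> int:
    # Mirrors F.when(F.col('samples').contains('cases'), ...).otherwise(F.lit(0))
    return parse_sample_size(samples) if 'cases' in samples else 0
```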
If you set this to 0 you might need to update the lines attached below as well.
Basically, if n_cases is not null, the code computes the beta as log(beta), assuming that it is an odds ratio.
study['case_prop'] = study['n_cases'] / study['n_initial']
'oddsr_ci_upper']] = top_loci.apply(extract_effect_sizes, axis=1).apply(pd.Series)
is_beta = pd.isnull(row['case_prop'])
You are right. So you mean, when the cases are non-zero but the controls are zero, I should resolve the cases to zero? (I think conceptually that would be correct.) However, the code that processes these values should be ready to handle edge cases.
Yes I think when one of them is 0 the other should be too.
After this we should probably also change the `is_beta = pd.isnull(row['case_prop'])` logic as well?
I can think of a way to handle that
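One possible way to handle it, sketched under the assumptions agreed above (if either count is zero, both are zeroed; a zero case proportion is then treated like a null one). The function names are hypothetical, not from the PR:

```python
import math
from typing import Optional, Tuple


def resolve_counts(n_cases: int, n_controls: int) -> Tuple[int, int]:
    # As discussed: if either count is zero, zero out both, so the
    # study is treated as having no case/control split at all.
    if n_cases == 0 or n_controls == 0:
        return 0, 0
    return n_cases, n_controls


def is_beta(case_prop: Optional[float]) -> bool:
    # Replacement for `pd.isnull(row['case_prop'])`: with n_cases now
    # defaulting to 0 instead of null, a zero case proportion must
    # also mark the effect size as a beta rather than an odds ratio.
    if case_prop is None or case_prop == 0:
        return True
    return math.isnan(case_prop)
```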
This is a prototype of the revised ingestion of GWAS Catalog data. At this stage it reads the GWAS Catalog association data, then selects and parses the relevant columns. Some of the logic from the original v2d pipeline is propagated, but not all of it.
TODOs:
! Warning: as the PR is a prototype, all parameters are hardcoded. The script runs stand-alone on Dataproc as a PySpark job as follows:

gcloud dataproc jobs submit pyspark \
    --cluster=ds-single \
    --project=open-targets-eu-dev \
    --region=europe-west1 \
    ingest_GWAS_Catalog.py