Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating GWAS Catalog data ingestion #21

Open
wants to merge 12 commits into
base: dev
Choose a base branch
from
Open

Updating GWAS Catalog data ingestion #21

wants to merge 12 commits into from

Conversation

DSuveges
Copy link
Contributor

@DSuveges DSuveges commented Jun 28, 2022

This is a prototype of the revisited ingestion of the GWAS Catalog data. At this stage it reads the GWAS Catalog association data, selects and parses relevant columns. Some of the logic from the original v2d pipeline is propagated but not all.

TODOs:

  • Parse GWAS Catalog associations.
  • Apply p-value and other filters
  • Parsing snp identifiers
  • Parsing p-values
  • Join with variant annotation
  • Check allele concordance
  • Deduplicate variant data
  • Clustering etc.

! Warning: as the PR is a prototype, all parameters are hardcoded. The script runs stand-alone on dataproc as a pyspark job as follows:

gcloud dataproc jobs submit pyspark   \
    --cluster=ds-single \
    --project=open-targets-eu-dev \
    --region=europe-west1 ingest_GWAS_Catalog.py 

@@ -108,7 +233,7 @@ def parse_associations(association_df: DataFrame) -> DataFrame:
.withColumn('association_id', f.monotonically_increasing_id())

# Processing variant related columns:
# - Sorting out current rsID field:
# - Sorting out current rsID field: <- why do we need this? rs identifiers should always come from the GnomAD dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that once you have chromosome, position, ref and alt you can get the rs id from GnomAD. In that case you can remove the lines 247 and 248?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is complicated... if we join the tables by chromosome and position, there might be multiple variants with different alleles. In those cases it could help the rsID to disambiguate the mapping. This needs to be resolved.

.withColumn('sample_size', F.regexp_extract(F.regexp_replace(F.col('samples'), ',', ''), r'[0-9,]+', 0).cast(T.IntegerType()))

# Extracting number of cases:
.withColumn('n_cases', F.when(F.col('samples').contains('cases'), F.col('sample_size')).otherwise(F.lit(0)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you set this to 0 you might need to update also the lines attached below.

Basically if the n_cases is not null it compute the beta as log(beta) assuming that is odd ratio.

study['case_prop'] = study['n_cases'] / study['n_initial']

'oddsr_ci_upper']] = top_loci.apply(extract_effect_sizes, axis=1).apply(pd.Series)

is_beta = pd.isnull(row['case_prop'])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. So you mean, when the cases are non-zero, but the controls are zero, I should resolve the cases to zero? (I think conceptually that would be correct) However the code, which processes these values should be ready to handle edge cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think when one of them is 0 the other should be too.

After this we should probably also change the "is_beta = pd.isnull(row['case_prop'])" concept ?!

I can think of a way to handle that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants