Updating GWAS Catalog data ingestion #21
base: dev
Conversation
@@ -108,7 +233,7 @@ def parse_associations(association_df: DataFrame) -> DataFrame:
    .withColumn('association_id', f.monotonically_increasing_id())

    # Processing variant related columns:
    # - Sorting out current rsID field:

Comment on `# - Sorting out current rsID field:`: why do we need this? rs identifiers should always come from the GnomAD dataset.
I agree that once you have chromosome, position, ref and alt, you can get the rs ID from GnomAD. In that case, can you remove lines 247 and 248?
So this is complicated... if we join the tables by chromosome and position, there might be multiple variants with different alleles. In those cases the rsID could help disambiguate the mapping. This needs to be resolved.
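To illustrate the ambiguity being discussed, here is a hedged, pure-Python sketch (not code from the PR; the record fields and function name are hypothetical) of how the rsID could act as a tiebreaker when a chromosome/position join returns several variants with different alleles:

```python
# Hypothetical sketch: using the GWAS Catalog rsID to disambiguate a
# chromosome/position join against GnomAD. All names are illustrative.
from typing import Optional


def match_variant(assoc: dict, gnomad_variants: list) -> Optional[dict]:
    """Pick the GnomAD variant matching an association row.

    Joining on chromosome + position alone can return several variants
    that differ only in their alleles; the rsID (when present) can
    break the tie.
    """
    candidates = [
        v for v in gnomad_variants
        if v['chrom'] == assoc['chrom'] and v['pos'] == assoc['pos']
    ]
    if len(candidates) == 1:
        return candidates[0]
    # Multiple variants at the same locus: fall back to the rsID.
    for v in candidates:
        if assoc.get('rsid') and v['rsid'] == assoc['rsid']:
            return v
    # Ambiguous and no rsID to resolve it: leave unmapped.
    return None
```

Whether an unmapped row should be dropped or flagged is a separate design decision for the pipeline.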
.withColumn('sample_size', F.regexp_extract(F.regexp_replace(F.col('samples'), ',', ''), r'[0-9,]+', 0).cast(T.IntegerType()))

# Extracting number of cases:
.withColumn('n_cases', F.when(F.col('samples').contains('cases'), F.col('sample_size')).otherwise(F.lit(0)))
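For clarity, a pure-Python sketch of what the two Spark expressions above compute (the function names are hypothetical, introduced only for illustration):

```python
import re


def parse_sample_size(samples: str) -> int:
    """Sketch of the Spark expression: strip thousands separators,
    then take the first run of digits and cast it to an integer."""
    cleaned = samples.replace(',', '')      # F.regexp_replace(..., ',', '')
    match = re.search(r'[0-9]+', cleaned)   # F.regexp_extract(..., r'[0-9,]+', 0)
    return int(match.group(0)) if match else 0


def parse_n_cases(samples: str) -> int:
    # Mirrors F.when(F.col('samples').contains('cases'), ...).otherwise(F.lit(0))
    return parse_sample_size(samples) if 'cases' in samples else 0
```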
If you set this to 0 you might need to update the lines attached below as well.
Basically, if n_cases is not null, the code computes the beta as log(beta), assuming that it is an odds ratio.
study['case_prop'] = study['n_cases'] / study['n_initial']
'oddsr_ci_upper']] = top_loci.apply(extract_effect_sizes, axis=1).apply(pd.Series)
is_beta = pd.isnull(row['case_prop'])
You are right. So you mean, when the cases are non-zero but the controls are zero, I should resolve the cases to zero? (I think conceptually that would be correct.) However, the code that processes these values should be ready to handle edge cases.
Yes I think when one of them is 0 the other should be too.
After this we should probably also change the `is_beta = pd.isnull(row['case_prop'])` logic as well?
I can think of a way to handle that
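One possible way to handle it, sketched under the assumptions agreed above (if either count is zero, both are zeroed; a zero case proportion is then treated like a null one). The function names are hypothetical, not from the PR:

```python
import math
from typing import Optional, Tuple


def resolve_counts(n_cases: int, n_controls: int) -> Tuple[int, int]:
    # As discussed: if either count is zero, zero out both, so the
    # study is treated as having no case/control split at all.
    if n_cases == 0 or n_controls == 0:
        return 0, 0
    return n_cases, n_controls


def is_beta(case_prop: Optional[float]) -> bool:
    # Replacement for `pd.isnull(row['case_prop'])`: with n_cases now
    # defaulting to 0 instead of null, a zero case proportion must
    # also mark the effect size as a beta rather than an odds ratio.
    if case_prop is None or case_prop == 0:
        return True
    return math.isnan(case_prop)
```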
This is a prototype of the revised ingestion of GWAS Catalog data. At this stage it reads the GWAS Catalog association data, then selects and parses the relevant columns. Some of the logic from the original v2d pipeline is propagated, but not all of it.
TODOs:
! Warning: as the PR is a prototype, all parameters are hardcoded. The script runs stand-alone on Dataproc as a PySpark job as follows:

gcloud dataproc jobs submit pyspark \
    --cluster=ds-single \
    --project=open-targets-eu-dev \
    --region=europe-west1 \
    ingest_GWAS_Catalog.py