Expected vs Observed z-scores are not the same #236

Open
tamil-acog opened this issue Aug 12, 2024 · 3 comments

Comments

@tamil-acog

Hi team,

I am trying to build a fine-mapping pipeline with Susie, and my results are sometimes off. Below I describe my input data and the issues I am facing; please help me resolve them if possible.

I use UKBB data for the fine-mapping. Both my sumstats and my LD matrix are from UKBB data.

After going through some of the discussions in the issues, I found that I have to do the following:

  1. ESTIMATE_RESIDUAL_VARIANCE = False
  2. Calculate the LD matrix with built-in R "cor()" function rather than plink.

After adjusting my pipeline with the above changes, I face the following issues:

  1. The expected vs observed z-scores plot still doesn't match exactly, even though I have an in-sample LD matrix.
  2. It takes a very long time to calculate the correlation matrix using built-in R. Is there a better way to do it?
  3. Sometimes I don't get any credible sets. What would be an ideal "coverage" parameter?
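
On the second issue, one common way to speed up an in-sample correlation matrix is to standardize each variant's genotype column once and then take a single cross-product, instead of correlating variant pairs one at a time. The sketch below shows that idea in plain Python on a toy genotype matrix. The function names and the toy data are hypothetical, it assumes no missing genotypes and no monomorphic variants, and in a real pipeline the cross-product step would be handed to an optimized linear algebra library rather than Python loops.

```python
import math

def standardize_columns(genotypes):
    """Center each variant (column) and scale it to unit length."""
    n_rows = len(genotypes)
    n_cols = len(genotypes[0])
    columns = []
    for j in range(n_cols):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n_rows
        centered = [x - mean for x in col]
        norm = math.sqrt(sum(x * x for x in centered))
        if norm == 0.0:
            raise ValueError(f"column {j} has zero variance")
        columns.append([x / norm for x in centered])
    return columns

def correlation_matrix(genotypes):
    """Pearson correlation between columns via one cross-product."""
    cols = standardize_columns(genotypes)
    p = len(cols)
    ld = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            # Dot product of two unit-length centered columns = correlation.
            r = sum(a * b for a, b in zip(cols[i], cols[j]))
            ld[i][j] = ld[j][i] = r
    return ld

# Toy data: 4 samples (rows) x 3 variants (columns), genotypes coded 0/1/2.
toy = [
    [0, 0, 2],
    [1, 1, 1],
    [2, 2, 0],
    [1, 0, 1],
]
ld = correlation_matrix(toy)
```

Because every column is standardized only once, the remaining work is a single matrix product, which is the part that parallelizes well in optimized libraries.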

Expected vs Observed plot:

[Image: Screenshot 2024-08-12 at 1 06 13 PM]

Z-scores distribution:

[Image: Screenshot 2024-08-12 at 1 07 07 PM]

@pcarbo
Member

pcarbo commented Aug 12, 2024

@tamil-acog The first thing that jumps out at me is that your association results don't seem very strong. I presume you first ran a basic association analysis (in PLINK, for example)? What were the smallest p-values from this association analysis? If the association results are not strong enough, it may not make sense to perform fine-mapping in this region. (Typically we look for p-values smaller than approximately 1e-8, although this may be different in UK Biobank depending on how the association analysis is conducted.)
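
A quick way to run this check from the z-scores themselves is to convert them to two-sided p-values under a standard normal null and look at the smallest one. This is a minimal sketch: the function names and the toy z-scores are hypothetical, and the 1e-8 default simply mirrors the rough threshold mentioned above rather than a fixed rule.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def has_strong_signal(z_scores, threshold=1e-8):
    """True if the smallest p-value in the region is below the threshold."""
    return min(two_sided_p_value(z) for z in z_scores) < threshold

# Hypothetical regions: the first has only modest z-scores.
weak_region = [1.2, -2.8, 3.9]
strong_region = [1.2, -6.4, 3.9]

print(has_strong_signal(weak_region))
print(has_strong_signal(strong_region))
```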

@tamil-acog
Author

Hi, thank you very much for the timely response. I got your point; I checked the p-values and you were right. Thanks.

But my concerns are mainly about the "Expected vs Observed Z-scores":
I checked other traits and got some hits in the credible sets there. But the "expected vs observed" plot is still the same as above, even though my LD matrix is in-sample.

Some info:

  • My reference panel for the LD matrix is the UKBB 450k data, and my GWAS comes from the same data, so the LD matrix is in-sample.
  • I use plink to calculate the LD matrix ("plink --bfile mydata --extract variants.txt --keep-allele-order --r --matrix --out ld_matrix").
  • For the latest run, where I got some credible sets, my Lambda was 0.0442.

My questions:

  • Why is my "expected vs observed" plot still off compared with the straight line shown in the susie examples? I went through some GitHub issues and read that with an in-sample LD matrix we are supposed to get a straight line. What am I missing here? And is it OK that my plot is off even though the LD is in-sample?
  • What method other than plink is recommended for calculating the LD matrix? I tried the built-in R "cor" function, but I am unable to fully parallelize the operation and it takes a very long time. (I ask because I read in a GitHub post that the built-in R cor has smaller round-off errors than plink, and I suspect this is the reason my exp vs obs plot is off.)
  • What is an acceptable range of Lambda?
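
For context on the first question, the idea behind an expected-vs-observed comparison can be illustrated with a standard multivariate normal fact: if the z-scores follow N(0, R) for an LD matrix R, the expected value of one z-score given all the others is minus the sum of the off-diagonal precision entries in its row times the other z-scores, divided by its diagonal precision entry, where the precision matrix is the inverse of R. The sketch below is a simplified illustration of that fact, not susieR's implementation. The function names and toy numbers are hypothetical, and it assumes R is invertible. It also shows how a single sign-flipped z-score (as an allele mismatch between the sumstats and the LD matrix would produce) pushes a point far from the diagonal.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def expected_z_scores(z, ld):
    """Conditional mean of each z-score given all the others, z ~ N(0, ld)."""
    omega = invert(ld)  # precision matrix
    n = len(z)
    return [-sum(omega[j][k] * z[k] for k in range(n) if k != j) / omega[j][j]
            for j in range(n)]

# Toy LD: variants 0 and 1 are strongly correlated, variant 2 is independent.
ld = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
observed = [5.0, 4.6, 0.3]
expected = expected_z_scores(observed, ld)

# Same data with the sign of the second z-score flipped.
flipped = [5.0, -4.6, 0.3]
expected_flipped = expected_z_scores(flipped, ld)
```

In the consistent case the expected values for the two correlated variants stay close to the observed ones, while in the flipped case the first variant's expected z-score has the opposite sign from its observed value.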

@pcarbo
Member

pcarbo commented Aug 16, 2024

Hi @tamil-acog, I'm not super familiar with PLINK, but this does look like the right approach. Did you also run your association analysis in PLINK?

I will note that others have encountered challenges in making the z-scores and LD consistent, so you are far from the only one. See for example Issue 207; I recommend searching the Issues on GitHub for other discussion.

It might also be helpful to review the steps we took to generate the association statistics and LD matrices for our PLoS Genetics paper. The scripts can be found here.

Hope this helps.
