When running the lmm model to detect lineage effects, I sometimes get a statsmodels error because the endog variable contains nan values:
ValueError: endog must be in the unit interval.
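It seems to happen when the Rtab observations include missing data for certain variants ("."). I can fix the error by changing the Logit call at line 181 of pyseer/model.py (lineage_mod = smf.Logit(k, X) in the traceback below) so that it passes missing='drop'.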
I'm running pyseer v1.3.12 from conda. I'm using a subset of 15 genomes from the S. pneumoniae GWAS tutorial, and I'm looking for both locus effects and lineage effects. Here is the output and traceback:
Read 15 phenotypes
Detected binary phenotype
Writing lineage effects to penicillin.lineage_effects.tsv
Setting up LMM
Similarity matrix has dimension (15, 15)
Analysing 15 samples found in both phenotype and similarity matrix
h^2 = 0.67
variant af filter-pvalue lrt-pvalue beta beta-std-err variant_h2 lineage notes
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/pyseer-1.3.12/bin/pyseer", line 10, in <module>
    sys.exit(main())
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/pyseer/__main__.py", line 567, in main
    ret = fit_lmm(*data)
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/pyseer/lmm.py", line 210, in fit_lmm
    max_lineage = fit_lineage_effect(lineage_clusters,
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/pyseer/model.py", line 181, in fit_lineage_effect
    lineage_mod = smf.Logit(k, X)
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/statsmodels/discrete/discrete_model.py", line 479, in __init__
    raise ValueError("endog must be in the unit interval.")
ValueError: endog must be in the unit interval.
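The message can look misleading, since none of the values are actually outside [0, 1]. The likely explanation is that NaN fails every comparison, so a range check on endog rejects it. Below is a minimal plain-Python sketch of that behaviour. It assumes the check is a simple range test (the actual statsmodels code may differ in detail) and that a "." in an Rtab row ends up as NaN; parse_rtab_calls and in_unit_interval are hypothetical names used only for illustration.

```python
import math

def parse_rtab_calls(fields):
    """Convert Rtab presence/absence fields to floats, reading "." as NaN."""
    return [math.nan if f == "." else float(f) for f in fields]

def in_unit_interval(values):
    """Range test: True only if every value satisfies 0 <= v <= 1."""
    return all(0.0 <= v <= 1.0 for v in values)

complete = parse_rtab_calls(["1", "0", "1", "1"])
with_gap = parse_rtab_calls(["1", ".", "0", "1"])

print(in_unit_interval(complete))  # True
print(in_unit_interval(with_gap))  # False: NaN fails both comparisons
```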
Once I add the missing='drop' parameter to the Logit model, it finishes successfully without errors.
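As far as I understand it, missing='drop' tells statsmodels to discard any observation that has a NaN in endog or exog before fitting. The sketch below shows that listwise deletion in plain Python rather than the statsmodels call itself; drop_missing and the toy data are hypothetical and only meant to show how the sample count changes.

```python
import math

def drop_missing(k, X):
    """Listwise deletion: keep samples with no NaN in k or in their row of X."""
    kept = [
        (ki, row)
        for ki, row in zip(k, X)
        if not math.isnan(ki) and not any(math.isnan(x) for x in row)
    ]
    k_kept = [ki for ki, _ in kept]
    X_kept = [row for _, row in kept]
    return k_kept, X_kept

# Variant calls for 5 samples (NaN stands for "." in the Rtab) and a toy design matrix
k = [1.0, math.nan, 0.0, 1.0, 0.0]
X = [[1.0, 0.2], [1.0, -0.1], [1.0, 0.4], [1.0, 0.0], [1.0, -0.3]]

k_kept, X_kept = drop_missing(k, X)
print(len(k), "->", len(k_kept))  # 5 -> 4
```

Note that the number of samples entering the model then depends on how many calls are missing for each variant.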
Thanks @ktmeaton for the detailed explanation, reproducible example, and likely fix -- this is all really really helpful!
Strictly, p-values have different interpretations with different N, so dropping some values could be misleading. But this is commonly done in GWAS so I think your solution is sensible. Alternatives are setting all missing data to:
missing
present
major
ancestral
imputed
Major (i.e. 1 if AF > 0.5, 0 otherwise) is my usual preference due to simplicity and accuracy over the first two.
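For illustration, the major-allele rule could look roughly like this. It is only a sketch: impute_major is a hypothetical helper, not a pyseer function, missing calls are represented as None, and AF is assumed to be computed from the non-missing samples only.

```python
def impute_major(calls):
    """Fill missing calls (None) with the major allele: 1 if AF > 0.5, else 0."""
    observed = [c for c in calls if c is not None]
    if not observed:
        raise ValueError("cannot estimate allele frequency: all calls are missing")
    af = sum(observed) / len(observed)  # assumption: AF from non-missing samples only
    major = 1 if af > 0.5 else 0
    return [major if c is None else c for c in calls]

print(impute_major([1, 1, None, 0]))  # [1, 1, 1, 0]  (AF = 2/3)
print(impute_major([0, None, 0, 1]))  # [0, 0, 0, 1]  (AF = 1/3)
```

Unlike dropping, this keeps N the same for every variant.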
@mgalardini what do you think? We could also provide an option for this behaviour. But my feeling is we should just choose between dropping missing values as suggested, or imputing the major allele (which I think is what the fixed effects model does?)
For reference, the line discussed in this issue is pyseer/pyseer/model.py, line 181 in 4b8d22f.
Is this error reproducible for you, and does the suggested fix make sense?
Thanks,
Katherine