Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValueError: endog must be in the unit interval. #280

Open
ktmeaton opened this issue Oct 22, 2024 · 1 comment
Open

ValueError: endog must be in the unit interval. #280

ktmeaton opened this issue Oct 22, 2024 · 1 comment

Comments

@ktmeaton
Copy link

When running the lmm model to detect lineage effects, I sometimes get a statsmodels error because the endog variable contains nan values:

ValueError: endog must be in the unit interval.

It seems to happen when the Rtab observations include missing data for certain variants ("."). I can fix the error by changing the following line from:

lineage_mod = smf.Logit(k, X)

to:

 lineage_mod = smf.Logit(k, X, missing='drop') 

To Reproduce

I'm running pyseer v1.3.12 from conda. I'm using a subset of 15 genomes from the S. pneumoniae GWAS tutorial. And I'm looking for both locus effects and lineage effects.

pyseer \
  --lmm \
  --pres variants.txt \
  --phenotypes phenotypes.txt \
  --phenotype-column penicillin \
  --cpu 1 \
  --similarity similarities.txt \
  --output-patterns penicillin.locus_effects.patterns.txt \
  --lineage \
  --lineage-clusters lineages.txt \
  --lineage-file penicillin.lineage_effects.tsv \
  --distances distances.txt

And here is the output and traceback:

Read 15 phenotypes
Detected binary phenotype
Writing lineage effects to penicillin.lineage_effects.tsv
Setting up LMM
Similarity matrix has dimension (15, 15)
Analysing 15 samples found in both phenotype and similarity matrix
h^2 = 0.67
variant af      filter-pvalue   lrt-pvalue      beta    beta-std-err    variant_h2      lineage notes
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/pyseer-1.3.12/bin/pyseer", line 10, in <module>
    sys.exit(main())
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/pyseer/__main__.py", line 567, in main
    ret = fit_lmm(*data)
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/pyseer/lmm.py", line 210, in fit_lmm
    max_lineage = fit_lineage_effect(lineage_clusters,
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/pyseer/model.py", line 181, in fit_lineage_effect
    lineage_mod = smf.Logit(k, X)
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/statsmodels/discrete/discrete_model.py", line 479, in __init__
    raise ValueError("endog must be in the unit interval.")
ValueError: endog must be in the unit interval.

Once I add the missing='drop' parameter to the Logit model, it finishes successfully without errors:

Read 15 phenotypes
Detected binary phenotype
Writing lineage effects to penicillin.lineage_effects.tsv
Setting up LMM
Similarity matrix has dimension (15, 15)
Analysing 15 samples found in both phenotype and similarity matrix
h^2 = 0.67
variant af      filter-pvalue   lrt-pvalue      beta    beta-std-err    variant_h2      lineage notes
variant_0       6.67E-02        2.05E-01        3.30E-01        3.50E-01        3.46E-01        2.70E-01        3      bad-chisq
variant_1       2.00E-01        1.77E-02        5.86E-02        5.01E-01        2.42E-01        4.98E-01        3      bad-chisq
variant_2       3.33E-01        2.53E-02        3.93E-01        3.46E-01        3.92E-01        2.38E-01        3      bad-chisq
variant_3       4.00E-01        5.16E-03        7.87E-02        6.20E-01        3.25E-01        4.68E-01        3      bad-chisq
variant_4       1.33E-01        2.15E-01        5.57E-01        -1.67E-01       2.77E-01        1.65E-01        3      bad-chisq
variant_5       1.33E-01        2.15E-01        8.33E-01        -5.80E-02       2.69E-01        5.96E-02        3      bad-chisq
variant_6       2.00E-01        1.14E-01        7.23E-01        -9.05E-02       2.50E-01        1.00E-01        3      bad-chisq
variant_8       8.67E-01        6.28E-02        3.37E-01        -5.06E-01       5.07E-01        2.67E-01        3      bad-chisq
10 loaded variants
2 pre-filtered variants
8 tested variants
8 printed variants

Is this error reproducible for you, and does the suggested fix make sense?

Thanks,
Katherine

@johnlees
Copy link
Collaborator

johnlees commented Oct 25, 2024

Thanks @ktmeaton for the detailed explanation, reproducible example, and likely fix -- this is all really really helpful!

Strictly, p-values have different interpretations with different N, so dropping some values could be misleading. But this is commonly done in GWAS so I think your solution is sensible. Alternatives are setting all missing data to:

  • missing
  • present
  • major
  • ancestral
  • imputed

Major (i.e. 1 if AF > 0.5, 0 otherwise) is my usual preference due to simplicity and accuracy over the first two.

@mgalardini what do you think? We could also provide an option for this behaviour. But my feeling is we should just choose between dropping missing values as suggested, or imputing the major allele (which I think is what the fixed effects model does?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants