ValueError: endog must be in the unit interval. #280

ktmeaton · 2024-10-22T21:52:58Z

When running the lmm model to detect lineage effects, I sometimes get a statsmodels error because the endog variable contains nan values:

ValueError: endog must be in the unit interval.

It seems to happen when the Rtab observations include missing data for certain variants ("."). I can fix the error by changing the following line from:

pyseer/pyseer/model.py

Line 181 in 4b8d22f

lineage_mod = smf.Logit(k, X)

to:

 lineage_mod = smf.Logit(k, X, missing='drop')

To Reproduce

I'm running pyseer v1.3.12 from conda. I'm using a subset of 15 genomes from the S. pneumoniae GWAS tutorial. And I'm looking for both locus effects and lineage effects.

Lineages: lineages.txt
Phenotypes: phenotypes.txt
Similarities: similarities.txt
Distances: distances.txt
Variants Presence/Absence Rtab: variants.txt
- Just the 10 variants, to be a minimal working example.

pyseer \
  --lmm \
  --pres variants.txt \
  --phenotypes phenotypes.txt \
  --phenotype-column penicillin \
  --cpu 1 \
  --similarity similarities.txt \
  --output-patterns penicillin.locus_effects.patterns.txt \
  --lineage \
  --lineage-clusters lineages.txt \
  --lineage-file penicillin.lineage_effects.tsv \
  --distances distances.txt

And here is the output and traceback:

Read 15 phenotypes
Detected binary phenotype
Writing lineage effects to penicillin.lineage_effects.tsv
Setting up LMM
Similarity matrix has dimension (15, 15)
Analysing 15 samples found in both phenotype and similarity matrix
h^2 = 0.67
variant af      filter-pvalue   lrt-pvalue      beta    beta-std-err    variant_h2      lineage notes
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/pyseer-1.3.12/bin/pyseer", line 10, in <module>
    sys.exit(main())
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/pyseer/__main__.py", line 567, in main
    ret = fit_lmm(*data)
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/pyseer/lmm.py", line 210, in fit_lmm
    max_lineage = fit_lineage_effect(lineage_clusters,
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/pyseer/model.py", line 181, in fit_lineage_effect
    lineage_mod = smf.Logit(k, X)
  File "/home/username/miniconda3/envs/pyseer-1.3.12/lib/python3.10/site-packages/statsmodels/discrete/discrete_model.py", line 479, in __init__
    raise ValueError("endog must be in the unit interval.")
ValueError: endog must be in the unit interval.

Once I add the missing='drop' parameter to the Logit model, it finishes successfully without errors:

Read 15 phenotypes
Detected binary phenotype
Writing lineage effects to penicillin.lineage_effects.tsv
Setting up LMM
Similarity matrix has dimension (15, 15)
Analysing 15 samples found in both phenotype and similarity matrix
h^2 = 0.67
variant af      filter-pvalue   lrt-pvalue      beta    beta-std-err    variant_h2      lineage notes
variant_0       6.67E-02        2.05E-01        3.30E-01        3.50E-01        3.46E-01        2.70E-01        3      bad-chisq
variant_1       2.00E-01        1.77E-02        5.86E-02        5.01E-01        2.42E-01        4.98E-01        3      bad-chisq
variant_2       3.33E-01        2.53E-02        3.93E-01        3.46E-01        3.92E-01        2.38E-01        3      bad-chisq
variant_3       4.00E-01        5.16E-03        7.87E-02        6.20E-01        3.25E-01        4.68E-01        3      bad-chisq
variant_4       1.33E-01        2.15E-01        5.57E-01        -1.67E-01       2.77E-01        1.65E-01        3      bad-chisq
variant_5       1.33E-01        2.15E-01        8.33E-01        -5.80E-02       2.69E-01        5.96E-02        3      bad-chisq
variant_6       2.00E-01        1.14E-01        7.23E-01        -9.05E-02       2.50E-01        1.00E-01        3      bad-chisq
variant_8       8.67E-01        6.28E-02        3.37E-01        -5.06E-01       5.07E-01        2.67E-01        3      bad-chisq
10 loaded variants
2 pre-filtered variants
8 tested variants
8 printed variants

Is this error reproducible for you, and does the suggested fix make sense?

Thanks,
Katherine

The text was updated successfully, but these errors were encountered:

johnlees · 2024-10-25T07:29:18Z

Thanks @ktmeaton for the detailed explanation, reproducible example, and likely fix -- this is all really really helpful!

Strictly, p-values have different interpretations with different N, so dropping some values could be misleading. But this is commonly done in GWAS so I think your solution is sensible. Alternatives are setting all missing data to:

missing
present
major
ancestral
imputed

Major (i.e. 1 if AF > 0.5, 0 otherwise) is my usual preference due to simplicity and accuracy over the first two.

@mgalardini what do you think? We could also provide an option for this behaviour. But my feeling is we should just choose between dropping missing values as suggested, or imputing the major allele (which I think is what the fixed effects model does?)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ValueError: endog must be in the unit interval. #280

ValueError: endog must be in the unit interval. #280

ktmeaton commented Oct 22, 2024

johnlees commented Oct 25, 2024 •

edited

Loading

ValueError: endog must be in the unit interval. #280

ValueError: endog must be in the unit interval. #280

Comments

ktmeaton commented Oct 22, 2024

To Reproduce

johnlees commented Oct 25, 2024 • edited Loading

johnlees commented Oct 25, 2024 •

edited

Loading