Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in nonparametric.py when calling IPCRidge #322

Open
fbarfi opened this issue Nov 8, 2022 · 4 comments
Open

Bug in nonparametric.py when calling IPCRidge #322

fbarfi opened this issue Nov 8, 2022 · 4 comments
Labels

Comments

@fbarfi
Copy link

fbarfi commented Nov 8, 2022

Describe the bug

Running IPCRidge hangs with the following message

assert (Ghat > 0).all()

and nothing after. I found that changing the option 'reverse = False' as shown down below in kaplan_meier_estimator in the function ipc_weights in the file nonparametric.py corrects the mistake.
Error message:

AssertionError                            Traceback (most recent call last)
Input In [74], in <cell line: 5>()
      2 set_config(display="text")  # displays text representation of estimators
      4 estimator = IPCRidge(alpha = 0.5,fit_intercept=True)
----> 5 estimator.fit(data_x,data_y)

File /opt/homebrew/Caskroom/miniforge/base/envs/teaching_env/lib/python3.10/site-packages/sksurv/linear_model/aft.py:90, in IPCRidge.fit(self, X, y)
     72 """Build an accelerated failure time model.
     73 
     74 Parameters
   (...)
     86 self
     87 """
     88 event, time = check_array_survival(X, y)
---> 90 weights = ipc_weights(event, time)
     91 super().fit(X, numpy.log(time), sample_weight=weights)
     93 return self

File /opt/homebrew/Caskroom/miniforge/base/envs/teaching_env/lib/python3.10/site-packages/sksurv/nonparametric.py:323, in ipc_weights(event, time)
    320 idx = numpy.searchsorted(unique_time, time[event])
    321 Ghat = p[idx]
--> 323 assert (Ghat > 0).all()
    325 weights = numpy.zeros(time.shape[0])
    326 weights[event] = 1.0 / Ghat

AssertionError: 

Code Sample to Reproduce the Bug

used code:

estimator = IPCRidge(alpha = 0.5,fit_intercept=True)
estimator.fit(data_x,data_y)

Here is what I changed in the nonparametric.py in the line
unique_time, p = kaplan_meier_estimator(event, time, reverse=False) -- changed True to False

def ipc_weights(event, time):
    """Compute inverse probability of censoring weights

    Parameters
    ----------
    event : array, shape = (n_samples,)
        Boolean event indicator.

    time : array, shape = (n_samples,)
        Time when a subject experienced an event or was censored.

    Returns
    -------
    weights : array, shape = (n_samples,)
        inverse probability of censoring weights

    See also
    --------
    CensoringDistributionEstimator
        An estimator interface for estimating inverse probability
        of censoring weights for unseen time points.
    """
    if event.all():
        return np.ones(time.shape[0])

    unique_time, p = kaplan_meier_estimator(event, time, reverse=False)

    idx = np.searchsorted(unique_time, time[event])
    Ghat = p[idx]

    assert (Ghat > 0).all()

    weights = np.zeros(time.shape[0])
    weights[event] = 1.0 / Ghat

    return weights

Machine and packages versions used:

Last updated: 2022-11-08T08:59:04.111247-05:00

Python implementation: CPython
Python version       : 3.10.5
IPython version      : 8.4.0

Compiler    : Clang 13.0.1 
OS          : Darwin
Release     : 21.6.0
Machine     : arm64
Processor   : arm
CPU cores   : 10
Architecture: 64bit

matplotlib: 3.5.2
numpy     : 1.22.4
pandas    : 1.4.4
json      : 2.0.9
@sebp
Copy link
Owner

sebp commented Nov 9, 2022

Could you please provide a self-contained minimal example, so I could re-produce the error?

@fbarfi
Copy link
Author

fbarfi commented Nov 9, 2022

Here is the snippet and I am attaching the data file too.

from sksurv.linear_model import IPCRidge
estimator_R = IPCRidge(alpha = 0.5,fit_intercept=True)
df_y=XY.iloc[:,:2].to_records(index=False)
df_x = XY.iloc[:,2:]
estimator_R.fit(data_x_numeric,data_y)

error message:

----------------------------------------------------------------
[XY.csv](https://github.com/sebp/scikit-survival/files/9975602/XY.csv)
-----------
AssertionError                            Traceback (most recent call last)
Input In [113], in <cell line: 5>()
      3 df_y=XY.iloc[:,:2].to_records(index=False)
      4 df_x = XY.iloc[:,2:]
----> 5 estimator_R.fit(data_x_numeric,data_y)

File /opt/homebrew/Caskroom/miniforge/base/envs/teaching_env/lib/python3.10/site-packages/sksurv/linear_model/aft.py:90, in IPCRidge.fit(self, X, y)
     72 """Build an accelerated failure time model.
     73 
     74 Parameters
   (...)
     86 self
     87 """
     88 event, time = check_array_survival(X, y)
---> 90 weights = ipc_weights(event, time)
     91 super().fit(X, np.log(time), sample_weight=weights)
     93 return self

File /opt/homebrew/Caskroom/miniforge/base/envs/teaching_env/lib/python3.10/site-packages/sksurv/nonparametric.py:325, in ipc_weights(event, time)
    322 idx = np.searchsorted(unique_time, time[event])
    323 Ghat = p[idx]
--> 325 assert (Ghat > 0).all()
    327 weights = np.zeros(time.shape[0])
    328 weights[event] = 1.0 / Ghat

AssertionError: 

@Finesim97
Copy link
Contributor

I experience a similar problem, but the fix is not to to flip the reverse parameter. Then the IPCW is no longer using the censoring distribution, but the survival distribution (see doi:10.1016/j.jbi.2016.03.009), therefore not working correctly. I have found that Ghat becomes zero for example, when there are ties at the last time point between event and censors.

Example with data:

from sksurv.nonparametric import kaplan_meier_estimator, ipc_weights
import numpy as np

time = np.array([1,1,1,2,2,2,3,3,3, 4,4])
event = np.array([1,1,1, 0,0,0, 1,1,0, 1,1]).astype('bool')
# works
ipc_weights(event, time)

event[-2:] = [0,0]
# works
ipc_weights(event, time)


event[-2:] = [0,1]
# crashes and burns
ipc_weights(event, time)

The problem is, that the estimated censoring estimator is 0 at this time. Technically this makes sense and the paper linked doesn't cover this case. My fix currently is to say that the events in this case are censored (which of course is wrong, especially if there is time between the event and event registration), but I am not sure, what the right solution would be and if IPCW handles this case at all.

@sebp
Copy link
Owner

sebp commented Nov 12, 2022

Thanks for the explanation!

Note that setting reverse = False is not the correct solution, because ties are handled different when estimating the censoring distribution.

I checked the R package prodlim, which returns the same estimates for Ghat:

1    1.0000000
2    0.6250000
3    0.4166667
4    0.0000000

From the theory, I would say that it makes sense that Ghat is zero, because if there is censoring at the last time point, the distribution drops to zero. This is equivalent to what happens when estimating the survival function and there is an event at the last time point.

That said, there should be a better error message pointing to this fact.

I haven't looked that the paper @Finesim97 linked to yet, they might offer a suggestion for this special case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants