
Make concordance_index_censored more robust to failure with datasets that have few uncensored samples #117

Closed · hermidalc opened this issue May 28, 2020 · 5 comments · Fixed by #125

hermidalc (Contributor) commented May 28, 2020

Certain datasets naturally have few uncensored samples. With these, concordance_index_censored often raises ZeroDivisionError: float division by zero, which stops model selection, even when I use a custom CV iterator that stratifies folds so that each fold has at least one uncensored sample.

This makes things difficult on the user side when using GridSearchCV, for example: the error is raised during CV scoring (not fitting), GridSearchCV doesn't catch it, and everything stops. A possible solution would be to return a c-index score of 0 and emit a warning instead.

Here is an example with y:

array([(False,  849.), (False,   28.), (False,   55.), (False,  727.),
       (False,  505.), (False, 1558.), (False, 1292.), (False, 1737.),
       (False,  944.), (False,  750.), (False, 2513.), (False,  472.),
       (False, 2417.), (False,  538.), (False,   49.), (False,  723.),
       ( True, 3563.), (False, 1090.), (False, 1167.), (False,  587.),
       (False, 1354.), (False,  910.), (False,  398.), (False,  854.),
       (False, 3534.), (False,  280.), (False,  183.), (False,  883.),
       (False,   32.), (False,  144.)])

and self.predict(X_test):

array([-0.01186444,  0.00423783,  0.02100755,  0.00138889,  0.01444719,
        0.01984668,  0.00424899,  0.00237929,  0.00340747, -0.00683663,
        0.01669027, -0.00386332,  0.01449622, -0.00329621, -0.00623105,
       -0.00568365,  0.00928448, -0.00652131,  0.00634896,  0.00737944,
        0.00181175, -0.00117537,  0.00321519,  0.00595648, -0.0044239 ,
       -0.00820228,  0.00416655,  0.03181141, -0.02795738,  0.00642553])
File "/home/hermidalc/soft/miniconda3/envs/sksurv-workflows/lib/python3.8/site-packages/sksurv/metrics.py", line 117, in _estimate_concordance_index
    cindex = numerator / denominator
ZeroDivisionError: float division by zero
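
For reference, a condensed, self-contained reproduction of the failure (the field names event/time and the shortened arrays are illustrative, not the exact data above):

import numpy as np
from sksurv.metrics import concordance_index_censored

# Structured survival array: a single event (True), and it has the
# largest observed time of all samples.
y = np.array(
    [(False, 849.0), (False, 28.0), (True, 3563.0), (False, 1090.0)],
    dtype=[("event", "?"), ("time", "<f8")],
)
risk = np.array([-0.012, 0.004, 0.009, -0.007])

# No sample outlives the only event, so there are no comparable pairs
# and the denominator of the c-index is zero -> ZeroDivisionError here.
concordance_index_censored(y["event"], y["time"], risk)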
sebp (Owner) commented Jun 14, 2020

In this example, the only event has the maximum observed time, thus there are no comparable pairs and the concordance index is unspecified. I believe throwing an error is the right thing to do in this case, but ZeroDivisionError might not be the best one.
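
For intuition, a rough sketch of the comparability rule (ignoring ties, and not the exact implementation in sksurv): a pair is comparable only if the sample with the shorter observed time experienced an event.

import numpy as np

def n_comparable_pairs(event, time):
    # A pair (i, j) is comparable when sample i had an event and
    # sample j was observed strictly longer: time[i] < time[j].
    n = 0
    for i in range(len(time)):
        if event[i]:
            n += int(np.sum(time > time[i]))
    return n

With the y above, the single event carries the maximum observed time, so this returns 0 and the c-index denominator vanishes.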

If you are doing CV, you can use stratified CV on the event indicator to ensure each split has enough uncensored event times.
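
A minimal sketch of that idea with toy data (StratifiedKFold accepts the boolean event indicator directly as the class label):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for a heavily censored dataset: 30 samples, 3 events.
X = np.random.default_rng(0).normal(size=(30, 5))
event = np.zeros(30, dtype=bool)
event[:3] = True

# Stratifying on the event indicator puts one event in each test fold;
# it cannot guarantee that the pairs within a fold are comparable, though.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, event):
    print(int(event[test_idx].sum()), "event(s) in this test fold")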

hermidalc (Contributor, Author) commented:

> In this example, the only event has the maximum observed time, thus there are no comparable pairs and the concordance index is unspecified. I believe throwing an error is the right thing to do in this case, but ZeroDivisionError might not be the best one.
>
> If you are doing CV, you can use stratified CV on the event indicator to ensure each split has enough uncensored event times.

Hi - I am doing stratified CV on the event indicator, and it's still not enough in many real-world cases I've worked with (e.g. TCGA). The problem with concordance_index_censored throwing an error is that it completely stops CV and model selection in scikit-learn, when it should only fail for the particular CV split, with a warning, and set that split's score to zero.

There is no workaround possible from the user side; the only fix is to change scikit-learn's BaseSearchCV _fit_and_score code to move scoring into the try/except. I would hazard a guess that the scikit-learn team will say your scorer should return a score of zero instead of raising an error.

hermidalc (Contributor, Author) commented:
I've opened an issue with scikit-learn; we'll see what they say.

sebp (Owner) commented Jun 14, 2020 via email

hermidalc (Contributor, Author) commented:
Once the error is changed to something more reasonable than ZeroDivisionError, user code can just catch it and return zero, much like sklearn's GridSearchCV does when its error_score parameter is set.
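
A user-side sketch of that idea (safe_cindex is a hypothetical wrapper, not part of sksurv; it mirrors the error_score semantics):

import warnings
from sksurv.metrics import concordance_index_censored

def safe_cindex(event, time, estimate, error_score=0.0):
    # Return error_score instead of raising when a split has no
    # comparable pairs. Catch ZeroDivisionError here; a release with
    # the proposed fix would raise a more descriptive exception instead.
    try:
        return concordance_index_censored(event, time, estimate)[0]
    except ZeroDivisionError:
        warnings.warn("No comparable pairs in this split; "
                      "returning error_score instead of a c-index.")
        return error_score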

I've submitted a feature request to scikit-learn to move the SearchCV scoring code into the same try/except as the fit, so that a scoring failure is handled like a fit failure: raise a warning, set the score for that train/test split and parameter combination to zero (or whatever error_score is set to), and continue with model selection.
