
Make concordance_index_censored more robust to failure with datasets that have few uncensored samples #117

Closed · hermidalc opened this issue May 28, 2020 · 5 comments · Fixed by #125

hermidalc (Contributor) commented May 28, 2020

Certain datasets naturally have few uncensored samples. With these, concordance_index_censored often raises ZeroDivisionError: float division by zero, which stops model selection, even when I use a custom CV iterator that stratifies folds so that each fold has at least one uncensored sample.

This makes things difficult on the user side when using GridSearchCV, for example: the error is raised during CV scoring (not fitting), GridSearchCV doesn't catch it, and everything stops. A possible solution would be to return a c-index score of 0 and emit a warning instead.

Here is an example with y:

array([(False,  849.), (False,   28.), (False,   55.), (False,  727.),
       (False,  505.), (False, 1558.), (False, 1292.), (False, 1737.),
       (False,  944.), (False,  750.), (False, 2513.), (False,  472.),
       (False, 2417.), (False,  538.), (False,   49.), (False,  723.),
       ( True, 3563.), (False, 1090.), (False, 1167.), (False,  587.),
       (False, 1354.), (False,  910.), (False,  398.), (False,  854.),
       (False, 3534.), (False,  280.), (False,  183.), (False,  883.),
       (False,   32.), (False,  144.)])

and self.predict(X_test):

array([-0.01186444,  0.00423783,  0.02100755,  0.00138889,  0.01444719,
        0.01984668,  0.00424899,  0.00237929,  0.00340747, -0.00683663,
        0.01669027, -0.00386332,  0.01449622, -0.00329621, -0.00623105,
       -0.00568365,  0.00928448, -0.00652131,  0.00634896,  0.00737944,
        0.00181175, -0.00117537,  0.00321519,  0.00595648, -0.0044239 ,
       -0.00820228,  0.00416655,  0.03181141, -0.02795738,  0.00642553])
File "/home/hermidalc/soft/miniconda3/envs/sksurv-workflows/lib/python3.8/site-packages/sksurv/metrics.py", line 117, in _estimate_concordance_index
    cindex = numerator / denominator
ZeroDivisionError: float division by zero
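
For reference, a condensed, self-contained reproduction of the failure (the field names event/time and the shortened arrays are illustrative, not the exact data above):

import numpy as np
from sksurv.metrics import concordance_index_censored

# Structured survival array: a single event (True), and it has the
# largest observed time of all samples.
y = np.array(
    [(False, 849.0), (False, 28.0), (True, 3563.0), (False, 1090.0)],
    dtype=[("event", "?"), ("time", "<f8")],
)
risk = np.array([-0.012, 0.004, 0.009, -0.007])

# No sample outlives the only event, so there are no comparable pairs
# and the denominator of the c-index is zero -> ZeroDivisionError here.
concordance_index_censored(y["event"], y["time"], risk)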
sebp (Owner) commented Jun 14, 2020

In this example, the only event has the maximum observed time, thus there are no comparable pairs and the concordance index is unspecified. I believe throwing an error is the right thing to do in this case, but ZeroDivisionError might not be the best one.
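
For intuition, a rough sketch of the comparability rule (ignoring ties, and not the exact implementation in sksurv): a pair is comparable only if the sample with the shorter observed time experienced an event.

import numpy as np

def n_comparable_pairs(event, time):
    # A pair (i, j) is comparable when sample i had an event and
    # sample j was observed strictly longer: time[i] < time[j].
    n = 0
    for i in range(len(time)):
        if event[i]:
            n += int(np.sum(time > time[i]))
    return n

With the y above, the single event carries the maximum observed time, so this returns 0 and the c-index denominator vanishes.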

If you are doing CV, you can use stratified CV on the event indicator to ensure each split has enough uncensored event times.
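
A minimal sketch of that idea with toy data (StratifiedKFold accepts the boolean event indicator directly as the class label):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for a heavily censored dataset: 30 samples, 3 events.
X = np.random.default_rng(0).normal(size=(30, 5))
event = np.zeros(30, dtype=bool)
event[:3] = True

# Stratifying on the event indicator puts one event in each test fold;
# it cannot guarantee that the pairs within a fold are comparable, though.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, event):
    print(int(event[test_idx].sum()), "event(s) in this test fold")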

hermidalc (Contributor, Author) commented:

> In this example, the only event has the maximum observed time, thus there are no comparable pairs and the concordance index is unspecified. I believe throwing an error is the right thing to do in this case, but ZeroDivisionError might not be the best one.
>
> If you are doing CV, you can use stratified CV on the event indicator to ensure each split has enough uncensored event times.

Hi - I am doing stratified CV on the event indicator, and it's still not enough in many real-world cases I've worked with (e.g. TCGA). The problem with concordance_index_censored throwing an error is that it completely stops CV and model selection in scikit-learn, when it should only fail for the particular CV split, with a warning, and set that split's score to zero.

There is no workaround possible from the user side; the only fix is to change scikit-learn's BaseSearchCV _fit_and_score code to move scoring into the try/except. I would hazard a guess that the scikit-learn team will say your scorer should return a score of zero instead of raising an error.

hermidalc (Contributor, Author) commented:
I've opened an issue with scikit-learn; we'll see what they say.

sebp (Owner) commented Jun 14, 2020 via email

hermidalc (Contributor, Author) commented:
Once the error is changed to something more reasonable than ZeroDivisionError, user code can just catch it and return zero, much like sklearn's GridSearchCV does when its error_score parameter is set.
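
A user-side sketch of that idea (safe_cindex is a hypothetical wrapper, not part of sksurv; it mirrors the error_score semantics):

import warnings
from sksurv.metrics import concordance_index_censored

def safe_cindex(event, time, estimate, error_score=0.0):
    # Return error_score instead of raising when a split has no
    # comparable pairs. Catch ZeroDivisionError here; a release with
    # the proposed fix would raise a more descriptive exception instead.
    try:
        return concordance_index_censored(event, time, estimate)[0]
    except ZeroDivisionError:
        warnings.warn("No comparable pairs in this split; "
                      "returning error_score instead of a c-index.")
        return error_score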

I've submitted a feature request to scikit-learn to move the SearchCV scoring code into the same try/except as the fit, so that a scoring failure is handled like a fit failure: raise a warning, set the score for that train/test split and parameter combination to zero (or whatever error_score is set to), and continue with model selection.
