integrated_brier_score() - ValueError: expected estimate with ... columns, but got ... #317

Closed
ramm777 opened this issue Oct 28, 2022 · 5 comments · Fixed by #349

@ramm777

ramm777 commented Oct 28, 2022

data.csv

Describe the bug

integrated_brier_score() does not handle float time points correctly; it looks like the inputs must be integers. This is not described in the documentation. Could you please add that?

If you pass times as floats, the module converts them to integers. The resulting integer array can be shorter than the original float array, because several floats can map to the same integer.
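To illustrate the mechanism described above, here is a minimal NumPy-only sketch. The time values are made up (they are not taken from data.csv), and the de-duplication step after the cast is an assumption that matches the behaviour reported in this thread, not a copy of the library code.

```python
import numpy as np

# Hypothetical float time points (not taken from data.csv).
times = np.array([10.2, 10.7, 11.1, 11.9, 12.5])

# One prediction column per float time point, as in rsf_surv_prob.
n_estimate_columns = times.shape[0]

# Casting to an integer dtype drops the fractional part.
times_as_int = times.astype(np.int64)

# Assumption: duplicate time points are dropped after the cast.
time_points = np.unique(times_as_int)

print(f"expected estimate with {time_points.shape[0]} columns, "
      f"but got {n_estimate_columns}")
```

The five float time points collapse to three distinct integers, so the number of time points no longer matches the number of prediction columns.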

Code Sample to Reproduce the Bug

# The attached file 'data.csv' is read below.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import integrated_brier_score

data = pd.read_csv('data.csv')

y = data[['1', '0']].copy()
y = y.to_records(index=False)

x = data.loc[:, '2':].copy()


x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=20)

rsf = RandomSurvivalForest()
rsf.fit(x_train, y_train)

rsf_surv_funcs = rsf.predict_survival_function(x_test)
times = np.percentile(rsf_surv_funcs[0].x, np.linspace(10, 80, 2*len(rsf_surv_funcs[0].x)))
rsf_surv_prob = np.row_stack([fn(times) for fn in rsf_surv_funcs])

# This call raises the ValueError shown under "Actual Results"
integrated_brier_score(y_train, y_test, rsf_surv_prob, times)


# However, if times is set to its unique rounded values, there is no error.
# I guess the issue is the conversion of floats into integers in metrics.py.
# To check, you can print your array in line 4 of the _check_estimate_2d() function.

times = np.unique(np.round(times))
rsf_surv_prob = np.row_stack([fn(times) for fn in rsf_surv_funcs])
integrated_brier_score(y_train, y_test, rsf_surv_prob, times)

Expected Results
The integrated Brier score (ibs) is returned without an error.

Actual Results

    time_points.shape[0], estimate.shape[1]))
ValueError: expected estimate with 142 columns, but got 144

Versions
Please execute the following snippet and paste the output below.

System:
    python: 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)]
executable: D:\concr_health_oucomes\projects\outcomes\venv\Scripts\python.exe
   machine: Windows-10-10.0.19041-SP0
Python dependencies:
          pip: 21.1.2
   setuptools: 57.0.0
      sklearn: 1.0.2
        numpy: 1.19.5
        scipy: 1.7.1
       Cython: None
       pandas: 1.3.2
   matplotlib: 3.4.3
       joblib: 1.0.1
threadpoolctl: 2.2.0
Built with OpenMP: True
sksurv: 0.17.2
@sebp
Owner

sebp commented Nov 1, 2022

The error indicates that the number of time points does not match the number of predictions (there needs to be one for each time point). I don't think it has anything to do with the data type.

@ramm777
Author

ramm777 commented Nov 1, 2022

I checked metrics.py, and it looks like the error is related to the data type.

If your time points are floats rather than integers, printing time_points in metrics.py shows a shorter array, so the dimensions no longer match.

I worked around this by passing integer time points. Floats also work, as long as converting them to integers yields an array of the same length.
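The condition in the last sentence can be checked before calling the metric. The sketch below uses a hypothetical helper, `survives_int_cast`, which is not part of scikit-survival; it assumes the cast truncates toward zero and that duplicates are dropped afterwards.

```python
import numpy as np


def survives_int_cast(times):
    """Return True if casting `times` to int keeps every time point distinct.

    Hypothetical helper for illustration; not part of scikit-survival.
    """
    times = np.atleast_1d(np.asarray(times, dtype=float))
    # Count the distinct values that remain after the integer cast.
    n_after_cast = np.unique(times.astype(np.int64)).shape[0]
    return n_after_cast == times.shape[0]


# Each float maps to a different integer, so the length is preserved.
print(survives_int_cast([10.0, 11.5, 12.9]))

# 10.2 and 10.7 both become 10, so one time point is lost.
print(survives_int_cast([10.2, 10.7, 11.1]))
```

If the helper returns False for a given grid, the workaround of rounding and de-duplicating the time points before evaluating the survival functions applies.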

@sebp
Owner

sebp commented Nov 2, 2022

Could you please provide a minimal working example to reproduce the problem (e.g. using randomly generated predictions)?

@ramm777
Author

ramm777 commented Nov 7, 2022

I have just updated the issue with a minimal working example and attached a data file. Thank you.

@sebp
Owner

sebp commented Nov 12, 2022

You are correct. times gets converted to the same dtype as time in y, which is int, therefore creating duplicates.

times = check_array(np.atleast_1d(times), ensure_2d=False, dtype=test_time.dtype, input_name="times")

This is not intended.
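A rough sketch of the difference between casting times to the dtype of the survival times and keeping the original dtype. The helper `normalize_times` and all values are hypothetical; this uses plain NumPy instead of `check_array` and is not the actual scikit-survival code or the patch that fixed the issue.

```python
import numpy as np


def normalize_times(times, target_dtype=None):
    """Hypothetical stand-in for the validation step; not scikit-survival code.

    If `target_dtype` is given, `times` is cast to it before duplicates
    are dropped. Otherwise the original dtype is kept.
    """
    arr = np.atleast_1d(np.asarray(times))
    if target_dtype is not None:
        arr = arr.astype(target_dtype)
    return np.unique(arr)


test_time = np.array([5, 20, 30, 40])        # integer survival times
times = np.array([10.2, 10.7, 11.1, 11.9])   # float evaluation grid

# Casting to the integer dtype of test_time merges neighbouring points.
cast = normalize_times(times, target_dtype=test_time.dtype)

# Keeping the float dtype preserves all four time points.
kept = normalize_times(times)

print(cast.shape[0], kept.shape[0])
```

With the cast, only two distinct time points remain; without it, all four survive and the number of time points matches a four-column estimate.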

sebp added the bug label Nov 12, 2022
sebp self-assigned this Nov 12, 2022
sebp added a commit that referenced this issue Apr 2, 2023
If `times` is a float array and survival times are ints,
a downcast of float to int can result in loss of information.
Keep the original dtype instead.

Closes #317
sebp closed this as completed in #349 Apr 2, 2023