
reduce SurvivalTree.predict's memory use #369

Merged
merged 24 commits into from
Jun 17, 2023
Conversation

cpoerschke
Contributor

@cpoerschke cpoerschke commented Jun 8, 2023

Checklist

  • pytest passes
  • tests are included
  • code is well formatted
  • documentation renders correctly

What does this implement/fix? Explain your changes

issue encountered: Passing a relatively large number of data samples to SurvivalTree.predict results in a "kernel died" error in a notebook environment.

analysis

The pred = self.tree_.predict(X) call in SurvivalTree.predict_cumulative_hazard_function returns an array of shape (n_samples, n_event_times, 2), which is reduced to shape (n_samples, n_event_times) before being returned; SurvivalTree.predict then applies .sum(1) to it, producing a result of shape (n_samples,). The intermediate n_samples × n_event_times arrays (where n_event_times = len(self.event_times_)) use a lot of memory for larger inputs, especially when the number of event times grows in proportion to the number of samples.
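As a rough illustration of why these intermediates hurt, here is a back-of-envelope estimate of the array sizes involved, assuming float64 entries and the worst case where every sample has a distinct event time (the numbers are illustrative, not taken from the PR):

```python
# Back-of-envelope estimate of the intermediate array sizes described above.
n_samples = 12_345       # as in the demo snippet below
n_event_times = 12_345   # worst case: every sample has a unique event time

bytes_per_float = 8
full = n_samples * n_event_times * 2 * bytes_per_float   # (n_samples, n_event_times, 2)
reduced = n_samples * n_event_times * bytes_per_float    # (n_samples, n_event_times)
per_row = 1 * n_event_times * 2 * bytes_per_float        # (1, n_event_times, 2)

print(f"full intermediate : {full / 1e9:.2f} GB")
print(f"reduced           : {reduced / 1e9:.2f} GB")
print(f"one row at a time : {per_row / 1e6:.2f} MB")
```

At this scale the full intermediate alone is on the order of gigabytes, which is consistent with a notebook kernel running out of memory.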

proposed solution

  • Do the self.tree_.predict call one row at a time, so that only a (1, n_event_times, 2) shaped array is needed per iteration.
  • Have SurvivalTree.predict_cumulative_hazard_function do the summing per row, removing the need for an (n_samples, n_event_times) shaped array.
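A minimal numpy sketch of the row-at-a-time idea (toy data; in the actual code each row would come from self.tree_.predict rather than from a pre-built matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_event_times = 1000, 500
chf = rng.random((n_samples, n_event_times))  # stand-in for per-sample CHF values

# Before: materialise the full (n_samples, n_event_times) array, then sum.
risk_full = chf.sum(axis=1)

# After: visit one row at a time, keeping only one row's worth of data alive.
risk_rowwise = np.empty(n_samples)
for i in range(n_samples):
    row = chf[i]  # in the PR this would be produced per-row by self.tree_.predict
    risk_rowwise[i] = row.sum()

assert np.allclose(risk_full, risk_rowwise)
```

The results are identical; only the peak size of the live intermediate changes.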

testing

Peak memory usage statistics were inspected before and after the change by profiling with memray run ./pr6-demo.py.

code snippet:

#!/usr/bin/env python

import numpy as np
import sksurv.tree
import sksurv.util

n = 12345  # number of samples; every sample gets a distinct event time

# Only print full arrays for tiny n.
verbose = (n <= 10)

# Two complementary synthetic features.
feature1 = np.arange(0, n, 1)
feature2 = n - feature1

# Distinct times 1..n; roughly the first half are events, the rest censored.
times = np.arange(0, n) + 1
events = times < times[len(times) // 2]

X = np.vstack((feature1, feature2)).T
y = sksurv.util.Surv.from_arrays(time=times, event=events)

if verbose:
    print(f"X={X}\ny={y}")
else:
    print(f"X.shape={X.shape}\ny.shape={y.shape}")

st = sksurv.tree.SurvivalTree(max_leaf_nodes=100)
print(st.fit(X, y))

risk_scores = st.predict(X)

if verbose:
    print(f"risk_scores={risk_scores}")
else:
    print(f"risk_scores.shape={risk_scores.shape}")
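One possible memray workflow for profiling the snippet above (the capture file name is illustrative; run, stats, and flamegraph are standard memray subcommands):

```shell
# Record allocations while running the demo script.
memray run -o pr6-demo.bin ./pr6-demo.py

# Summarise allocation statistics, including peak memory, from the capture.
memray stats pr6-demo.bin

# Optionally render an interactive flamegraph for closer inspection.
memray flamegraph pr6-demo.bin
```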

@sebp
Owner

sebp commented Jun 8, 2023

Thanks for your PR.

I agree that predict is quite heavy on memory usage. Part of it has to do with sklearn's Tree-related code, where predictions are returned via the split criterion's node_value method. It is currently implemented to return the full survival function and cumulative hazard function for each sample, disregarding whether predict, predict_survival_function, or predict_cumulative_hazard_function has been called. I could imagine adding a low-memory option that disables computing survival and cumulative hazard function (CHF) and just returns the event counts (sum over CHF, i.e. what predict returns).
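To make the "sum over CHF" remark concrete, here is a toy numpy example (hypothetical CHF values; in sksurv these would come from the fitted tree's leaves) showing the single scalar per sample that predict ultimately returns:

```python
import numpy as np

# Toy cumulative hazard function evaluated at the model's event times.
event_times = np.array([1.0, 2.0, 3.0, 4.0])
chf = np.array([
    [0.1, 0.3, 0.6, 1.0],   # sample 0
    [0.2, 0.5, 0.9, 1.4],   # sample 1
])

# What predict returns: the CHF summed over all event times, per sample.
risk_scores = chf.sum(axis=1)
print(risk_scores)   # prints [2. 3.]
```

A low-memory mode along the lines sketched above would compute only these scalars, never materialising the full per-sample survival and CHF arrays.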

@cpoerschke
Contributor Author

... Part of it has to do with ... I could imagine adding a low-memory option that disables computing survival and cumulative hazard function (CHF) and just returns the event counts (sum over CHF, i.e. what predict returns).

Thanks for the context and quick feedback!

I've added a low_memory=False option in the latest commit, though I guess its current use does not disable the computations as such. And of course for any new option there should be test coverage too.

@cpoerschke cpoerschke marked this pull request as ready for review June 8, 2023 16:34
@codecov

codecov bot commented Jun 8, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (bd2e240) 97.94% compared to head (9e1b344) 97.95%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #369   +/-   ##
=======================================
  Coverage   97.94%   97.95%           
=======================================
  Files          37       37           
  Lines        3361     3376   +15     
  Branches      509      511    +2     
=======================================
+ Hits         3292     3307   +15     
  Misses         33       33           
  Partials       36       36           
Impacted Files Coverage Δ
sksurv/ensemble/forest.py 100.00% <100.00%> (ø)
sksurv/tree/tree.py 95.71% <100.00%> (+0.43%) ⬆️


@cpoerschke
Contributor Author

... Part of it has to do with sklearn's Tree-related code, where predictions are returned via the split criterion's node_value method. It is currently implemented to return the full survival function and cumulative hazard function for each sample, disregarding whether predict, predict_survival_function, or predict_cumulative_hazard_function has been called. I could imagine adding a low-memory option that disables computing survival and cumulative hazard function (CHF) and just returns the event counts (sum over CHF, i.e. what predict returns).

So I haven't worked with .pxd and .pyx code before, but from a little bit of code reading just now ... is the idea conceptually that, in low-memory mode:

@cpoerschke cpoerschke marked this pull request as draft June 9, 2023 11:18

@cpoerschke cpoerschke left a comment


Added work-in-progress notes inline.

Comment on lines 788 to 790
# Duplicate values in whas500 lead to assert errors because of
# tie resolution during tree fitting.
# Using a synthetic dataset resolves this issue.
sebp (Owner) replied:

Actually, ties should not cause any problems.


@cpoerschke cpoerschke left a comment


Thanks @sebp for the feedback! I think I've addressed all points and tests pass locally. Please let me know if there are any additional suggestions. Thank you.

@cpoerschke cpoerschke marked this pull request as ready for review June 13, 2023 18:15
@cpoerschke cpoerschke requested a review from sebp June 13, 2023 18:16
@sebp sebp self-requested a review June 17, 2023 21:05
@sebp sebp merged commit 53d6261 into sebp:master Jun 17, 2023