Support for H2O models #91
Comments
This package is the reason I am lagging a bit in working on the {shapviz} ideas. The next thing I'd like to add is a function that shows average observed, average predicted, and partial dependence over binned x (with BY). I am not sure about the API yet, so it will take a while. Thanks a lot for your idea and the beautiful example!
Thanks for sharing your thoughts! H2O can indeed be finicky and is not always easy to work with, so I understand your decision to put it on the back burner until you have figured out the broader API setup. The current flexibility using
Regarding 5, I just discovered that subtlety in your
library(hstats)
set.seed(123)
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
noise <- counts / 3 + rnorm(length(counts), 0, 4)
df <- data.frame(treatment, outcome, counts, noise) # showing data
glm <- glm(counts ~ outcome + treatment + noise, family = poisson())
coef(glm)
#> (Intercept) outcome2 outcome3 treatment2 treatment3 noise
#> 2.71952786 -0.24815511 -0.37382803 -0.07882199 0.07174410 0.04519307
# Stratified PD
pd <- partial_dep(glm, v = "outcome", X = df, BY = "treatment")
# treatment 1 on top, even though coef for treatment 3 is larger
plot(pd)
Created on 2023-10-28 with reprex v2.0.2
Regarding 6, indeed, thus far the implementations of H statistics I've explored are not fast enough in practice, whereas
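A minimal base R sketch of why the stratified curves above can be ordered differently from the treatment coefficients. It assumes that the stratified PD averages predictions over only the rows belonging to each BY group, which is my reading of the behavior and not taken from the package source. The model is refit as `fit`, and the manual loop is purely illustrative, not the hstats implementation.

```r
# Manual stratified partial dependence on the link scale (base R only).
set.seed(123)
counts <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome <- gl(3, 1, 9)
treatment <- gl(3, 3)
noise <- counts / 3 + rnorm(length(counts), 0, 4)
df <- data.frame(treatment, outcome, counts, noise)
fit <- glm(counts ~ outcome + treatment + noise, family = poisson(), data = df)

# For each treatment group, fix `outcome` at each level and average the
# predictions over that group's own rows only.
stratified_pd <- sapply(levels(df$outcome), function(o) {
  sapply(levels(df$treatment), function(trt) {
    X_trt <- df[df$treatment == trt, ]
    X_trt$outcome <- factor(o, levels = levels(df$outcome))
    mean(predict(fit, newdata = X_trt))  # type = "link" is the glm default
  })
})

# Rows: treatment levels; columns: outcome levels
stratified_pd

# The group means of the background feature `noise` enter each curve,
# so they can outweigh small differences between treatment coefficients.
noise_means <- tapply(df$noise, df$treatment, mean)
noise_means
```

On the link scale, each curve equals the intercept plus the outcome and treatment effects plus the noise coefficient times the group's mean noise, so a group with a higher mean of `noise` can end up on top even with a smaller treatment coefficient.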
This is true. The reason for the current implementation is its focus on interactions, as in: "Oh, there seems to be a strong interaction between x and z, let's study a stratified PDP". Maybe we can add an option like
You have one argument already named
And your method would switch the order of the first two steps?
No, not really. I usually look at:
Ah, like a PDP for two features, but visualized as a line plot rather than as a heatmap? If that is the case, we simply need to add an option to
Like:
Yeah, that sounds great and is an elegant solution.
I like this visualization much better than the heatmap style! And it needs only a single predict() call. I am already thinking of using it as the default geom...
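To make the "two-feature PDP as lines, with a single predict() call" idea concrete, here is a rough base R sketch. It is not how hstats computes it. The data, the model, and all variable names are made up for illustration.

```r
# Two-feature partial dependence with one predict() call (base R only).
set.seed(1)
n <- 200
X <- data.frame(
  x = runif(n),
  z = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  w = rnorm(n)
)
X$y <- with(X, x * (z == "b") + 0.5 * w + rnorm(n, sd = 0.1))
fit <- lm(y ~ x * z + w, data = X)

# Grid over both features; x varies fastest
grid <- expand.grid(x = seq(0, 1, length.out = 11), z = levels(X$z))

# Stack one copy of the background data per grid row ...
idx <- rep(seq_len(nrow(grid)), each = n)
stacked <- X[rep(seq_len(n), times = nrow(grid)), ]
stacked$x <- grid$x[idx]
stacked$z <- grid$z[idx]

# ... then predict once and average within each grid row
pred <- predict(fit, newdata = stacked)
grid$pd <- colMeans(matrix(pred, nrow = n))

# One line per level of z instead of a heatmap
pd_mat <- matrix(grid$pd, nrow = 11, dimnames = list(NULL, levels(X$z)))
if (interactive()) {
  matplot(unique(grid$x), pd_mat, type = "l", lty = 1, xlab = "x", ylab = "PD")
  legend("topleft", legend = colnames(pd_mat), col = 1:3, lty = 1, title = "z")
}
```

The stacked data frame has `n * nrow(grid)` rows, so memory rather than the number of predict() calls becomes the limiting factor for large grids.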
Awesome package! Thanks for all the work, @mayer79. Great to see such a solid implementation of H statistics in R. It is well documented and compatible with many modelling packages.
Would it be feasible to add out-of-the-box support for H2O models? Because of the way you've set up the code, it is already possible to use the pred_fun argument for H2O models in the various functions. See the code example below, where I illustrate this for binomial, regression, and multinomial H2O models.
I believe overwriting the generics for hstats(), partial_dep(), ice(), and perm_importance() would require two main things:
pred_fun argument to:
as.data.frame()
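For readers who want to try the pred_fun route, a wrapper along the following lines might work. This is only a sketch under assumptions: `pred_fun_h2o` is a hypothetical helper (not part of hstats or h2o), it assumes a running H2O cluster and a fitted H2O model, and it assumes that classification predictions come back with a label column named "predict" next to the probability columns.

```r
# Hypothetical helper, not part of hstats or h2o.
# Nothing from h2o is evaluated until the function is actually called.
pred_fun_h2o <- function(object, X, ...) {
  # Convert the R data frame to an H2OFrame, predict, and convert back
  pred <- as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(X)))
  # Assumption: with more than one column, "predict" holds the class label
  # and the remaining columns hold class probabilities.
  if (ncol(pred) > 1L) {
    pred <- pred[, setdiff(colnames(pred), "predict"), drop = FALSE]
  }
  as.matrix(pred)
}

# Possible usage (assumes `fit` is an H2O model and `df` a data.frame):
# hstats::partial_dep(fit, v = "x", X = df, pred_fun = pred_fun_h2o)
```

Each call converts the data with as.h2o() again, so this wrapper is slow when the calling function invokes pred_fun many times.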
Further suggestions / potential improvements:
- pred_fun itself? Then you'd only have to overwrite that one for the ranger, Learner, explainer, H2OBinomialModel, H2ORegressionModel, and H2OMulticlassModel classes instead of all of hstats(), partial_dep(), ice(), and perm_importance().
- as.h2o() and then calling h2o.predict(). Doing this only once is much faster than calling pred_fun multiple times. Hence, particularly for H2O models, further speed improvements are possible by first combining the data and then calling pred_fun instead of the other way around. [Try running the below examples using h2o.show_progress() to see a progress bar whenever as.h2o() and h2o.predict() get called]
- partial_dep(..., BY = ...) could be faster by avoiding the for loop over the BY argument.
- hstats() could be faster by avoiding the for loop over the one-way, two-way and three-way effects. [This would be hard to refactor though]
- perm_importance() could be faster by avoiding the for loop over the v argument. [This might lead to memory issues though when stacking too many data frames when v, m_rep and/or n_max is large, so potentially having an optional argument for this would be better]
- v = NULL in hstats() and perm_importance() could be set to object@allparameters$x. That way no unnecessary computations are performed for columns in X not used as features.
- y in perm_importance() can be set to object@allparameters$y.
Code example for H2O models:
Created on 2023-10-28 with reprex v2.0.2
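The suggestion of turning the prediction function itself into a generic could look roughly like the base R sketch below. The generic name `get_pred` and its methods are hypothetical and only illustrate S3 dispatch. They are not the package's API, and an H2O method is deliberately left out because it would need a running cluster.

```r
# Hypothetical S3 generic: one prediction entry point, one method per class.
get_pred <- function(object, X, ...) UseMethod("get_pred")

# Fallback: plain predict()
get_pred.default <- function(object, X, ...) {
  stats::predict(object, X, ...)
}

# Class-specific method: GLMs predict on the response scale
get_pred.glm <- function(object, X, ...) {
  stats::predict(object, X, type = "response", ...)
}

# Downstream functions would then call get_pred() only
avg_prediction <- function(object, X, ...) {
  mean(get_pred(object, X, ...))
}

fit_lm <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
fit_glm <- glm(Sepal.Length ~ Sepal.Width + Species, data = iris,
               family = Gamma(link = "log"))

avg_prediction(fit_lm, iris)   # dispatches to get_pred.default
avg_prediction(fit_glm, iris)  # dispatches to get_pred.glm
```

With this layout, supporting a new model class means adding a single method rather than overwriting each of the higher-level functions.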