
Support for H2O models #91

Closed
RoelVerbelen opened this issue Oct 28, 2023 · 9 comments
@RoelVerbelen
Awesome package! Thanks for all the work, @mayer79. Great to see such a solid implementation of H statistics in R, well documented and compatible with many modelling packages.

Would it be feasible to add out-of-the-box support for H2O models? Thanks to the way you've set up the code, it's already possible to use the pred_fun argument for H2O models in the various functions. See the code example below, where I illustrate this for binomial, regression, and multinomial H2O models.

I believe overwriting the generics for hstats(), partial_dep(), ice(), and perm_importance() would require two main things:

  1. Changing the pred_fun argument to:
# Class "H2OBinomialModel" 
pred_fun <- function(object, newdata) {
  as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[[3]]
}

# Class "H2ORegressionModel"
pred_fun <- function(object, newdata) {
  as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[[1]]
}

# Class "H2OMultinomialModel"
pred_fun <- function(object, newdata) {
  as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[, -1]
}
  2. Ensuring the provided data argument (potentially an H2OFrame) gets converted to a data.frame using as.data.frame()
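The H2OFrame conversion in point 2 could be a small guard at the top of each function. A sketch (prepare_X() is a hypothetical helper name; inherits() avoids a hard dependency on the H2O class):

```r
# Sketch: coerce a possible H2OFrame input to a plain data.frame up front.
prepare_X <- function(X) {
  if (inherits(X, "H2OFrame")) {
    X <- as.data.frame(X)
  }
  X
}
```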

Further suggestions / potential improvements:

  • Would it be possible (or better) to set up a generic for pred_fun itself? Then you'd only have to overwrite that one for the ranger, Learner, explainer, H2OBinomialModel, H2ORegressionModel, and H2OMultinomialModel classes instead of all of hstats(), partial_dep(), ice(), and perm_importance().
  • The prediction function for H2O models requires converting the data to an H2OFrame using as.h2o() and then calling h2o.predict(). Doing this only once is much faster than calling pred_fun many times. Hence, particularly for H2O models, further speed-ups are possible by first combining the data and then calling pred_fun once, instead of the other way around. [Try running the examples below with h2o.show_progress() to see a progress bar whenever as.h2o() and h2o.predict() get called.]
    • partial_dep(..., BY = ...) could be faster by avoiding the for loop over the BY argument.
    • hstats() could be faster by avoiding the for loop over the one-way, two-way and three-way effects. [This would be hard to refactor though]
    • perm_importance() could be faster by avoiding the for loop over the v argument. [This might lead to memory issues though when stacking too many data frames when v, m_rep and/or n_max is large, so potentially having an optional argument for this would be better]
  • For H2O models, the default argument when v = NULL in hstats() and perm_importance() could be set to object@allparameters$x. That way no unnecessary computations are performed for columns in X not used as features.
  • For H2O models, the default argument for y in perm_importance() can be set to object@allparameters$y.
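The generic idea in the first bullet could look roughly like this. A sketch only: hstats_pred_fun() is a hypothetical name, not an existing hstats function.

```r
# Hypothetical S3 generic: dispatch the prediction wrapper on the model class,
# so hstats(), partial_dep(), ice() and perm_importance() can share it.
hstats_pred_fun <- function(object, ...) {
  UseMethod("hstats_pred_fun")
}

# Default: plain predict()
hstats_pred_fun.default <- function(object, ...) {
  function(object, newdata) predict(object, newdata)
}

# H2O binomial models: third column of the prediction frame holds P(class = 1)
hstats_pred_fun.H2OBinomialModel <- function(object, ...) {
  function(object, newdata) {
    as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[[3]]
  }
}

# The package functions could then call pred_fun <- hstats_pred_fun(object)
# internally, with analogous methods for ranger, mlr3 Learner, DALEX explainer,
# H2ORegressionModel and H2OMultinomialModel.
```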

Code example for H2O models:

library(hstats)
library(h2o)
h2o.init()
h2o.no_progress()

# H2O Binomial ----

# Import the prostate dataset into H2O:
prostate <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
prostate_df <- as.data.frame(prostate)

# Set the predictors and response; set the factors:
prostate$CAPSULE <- as.factor(prostate$CAPSULE)
prostate$RACE <- as.factor(prostate$RACE)
prostate$DPROS <- as.factor(prostate$DPROS)
prostate$DCAPS <- as.factor(prostate$DCAPS)
prostate$GLEASON <- as.factor(prostate$GLEASON)
predictors <- c("AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON")
response <- "CAPSULE"

# Build and train the model:
pros_gbm <- h2o.gbm(x = predictors,
                    y = response,
                    nfolds = 5,
                    seed = 1111,
                    keep_cross_validation_predictions = TRUE,
                    training_frame = prostate)

# Class "H2OBinomialModel" 
pred_fun <- function(object, newdata) {
  as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[[3]]
}

# H stats
s <- hstats(pros_gbm, X = prostate_df, pred_fun = pred_fun)
s
#> 'hstats' object. Use plot() or summary() for details.
#> 
#> H^2 (normalized)
#> [1] 0.2818805

# H statistics: overall level of interaction & pairwise
plot(s)

# Unnormalized pairwise statistics
plot(h2_pairwise(s, normalize = FALSE, squared = FALSE))

# Stratified PD
plot(partial_dep(pros_gbm, v = "PSA", X = prostate_df, BY = "GLEASON", pred_fun = pred_fun), show_points = FALSE)

# Two-dimensional PDP
pd <- partial_dep(pros_gbm, v = c("PSA", "GLEASON"), X = prostate_df, pred_fun = pred_fun, grid_size = 1000)
plot(pd)

# Centered ICE plot with colors
ic <- ice(pros_gbm, v = "PSA", X = prostate_df, BY = "GLEASON", pred_fun = pred_fun)
plot(ic, center = TRUE)

# Variable importance based on permutation
plot(perm_importance(pros_gbm, X = prostate_df, y = pros_gbm@allparameters$y, pred_fun = pred_fun, v = pros_gbm@allparameters$x))

# Variable importance based on PD function
plot(pd_importance(s))

# H2O Regression ----

# Run GLM of VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON
prostate_path = system.file("extdata", "prostate.csv", package = "h2o")
prostate = h2o.importFile(path = prostate_path)
predictors <- setdiff(colnames(prostate), c("ID", "DPROS", "DCAPS", "VOL"))
pros_glm <- h2o.glm(y = "VOL", x = predictors, training_frame = prostate, family = "tweedie",
                    nfolds = 0, alpha = 0.1, lambda_search = FALSE,
                    interactions = c("AGE", "GLEASON"))

# Class "H2ORegressionModel"
pred_fun <- function(object, newdata) {
  as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[[1]]
}

# H stats
s <- hstats(pros_glm, X = as.data.frame(prostate), pred_fun = pred_fun)
s
#> 'hstats' object. Use plot() or summary() for details.
#> 
#> H^2 (normalized)
#> [1] 0.0007039798

# Pairwise statistics 
plot(h2_pairwise(s, normalize = FALSE, squared = FALSE))

# PD plot 
pd <- partial_dep(pros_glm, v = "AGE", X = as.data.frame(prostate), BY = "GLEASON", pred_fun = pred_fun, grid_size = 500)
plot(pd)

# H2O Multinomial ----

# import the cars dataset
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# set the predictor names and the response column name
predictors <- c("displacement", "power", "weight", "acceleration", "year")
response <- "cylinders"
cars[, response] <- as.factor(cars[, response])

# split into train and validation sets
cars_splits <- h2o.splitFrame(data =  cars, ratios = 0.8, seed = 1234)
train <- cars_splits[[1]]
valid <- cars_splits[[2]]

# build and train the model:
cars_gbm <- h2o.gbm(x = predictors,
                    y = response,
                    training_frame = train,
                    validation_frame = valid,
                    distribution = "multinomial",
                    seed = 1234)

# Class "H2OMultinomialModel"
pred_fun <- function(object, newdata) {
  as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[, -1]
} 

# H stats
s <- hstats(cars_gbm, X = as.data.frame(cars), pred_fun = pred_fun)
s
#> 'hstats' object. Use plot() or summary() for details.
#> 
#> H^2 (normalized)
#>         p3         p4         p5         p6         p8 
#> 0.74177545 0.06181569 0.95886855 0.07111337 0.00996367

# Pairwise statistics 
plot(h2_pairwise(s, normalize = FALSE, squared = FALSE))

# PD plot 
pd <- partial_dep(cars_gbm, v = "displacement", X = as.data.frame(cars), pred_fun = pred_fun, grid_size = 500)
plot(pd)

Created on 2023-10-28 with reprex v2.0.2

@mayer79
Collaborator

mayer79 commented Oct 28, 2023

This package is the reason I'm lagging a bit on the {shapviz} ideas. The next thing I'd like to add is a function showing average observed, average predicted, and partial dependence over a binned x (with BY). I am not sure about the API yet, so it will take a while.

Thanks a lot for your idea and the beautiful example!

  1. When I wrote {kernelshap}, I tested the possibility to support H2O models in a fast way. Fast in the sense that the data would be passed and converted just once. Then I failed to find a good alternative to rowsum() (and similar) in H2O. There is a function for grouped averaging, but it is too slow to be useful.
  2. Now, regarding your idea to use H2O: I'd need to add it as a dependency (unlike DALEX etc.), at a minimum under "enhances" as in {shapviz}. The current policy of {hstats} is to keep the dependency footprint as small as possible. I am even considering moving {ggplot2} to "suggests".
  3. The answer is thus: "not at the moment". But I'd be very happy to put your code into an R script in the backlog folder with all H2O-specific methods (including deriving x, y, w and probabilistic classification). I think it will be a bit too long for the README.
  4. I am considering writing an "hstats_explainer" that does the wrapping of ranger, mlr3 etc. This idea is also in the backlog.
  5. partial_dep(..., BY = ) is written a bit funny, I know. But it is not trivial to move the BY into the crunching (to avoid background rows leaking across BY groups).
  6. I made a naive benchmark here https://lorentzen.ch/index.php/2023/10/16/interactions-where-are-you/ to show that even in the univariate case, {hstats} is much faster than any other XAI tool in R.

@RoelVerbelen
Author

Thanks for sharing your thoughts!

H2O can indeed be finicky and isn't always easy to work with, so I understand putting this on the back burner until you've figured out the broader API setup. The current flexibility via pred_fun means it's already possible to use H2O models today, so thanks for that.

Regarding 5, I just discovered that subtlety in your partial_dep(..., BY = ) implementation of (stratified) PDs. Indeed, that makes things harder when the background grid is not fixed. I personally prefer looking at counterfactual PDs instead of stratified PDs for understanding a model; those make it easier to reason about the model, IMO. The level differences caused by differing background rows per group are something I'd rather assess via one-way observed or predicted averages. Here's an example of a stratified PD that I find confusing:

library(hstats)

set.seed(123)
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
noise <- counts / 3 + rnorm(length(counts), 0, 4)
df <- data.frame(treatment, outcome, counts, noise)
glm <- glm(counts ~ outcome + treatment + noise, family = poisson(), data = df)
coef(glm)
#> (Intercept)    outcome2    outcome3  treatment2  treatment3       noise 
#>  2.71952786 -0.24815511 -0.37382803 -0.07882199  0.07174410  0.04519307

# Stratified PD
pd <- partial_dep(glm, v = "outcome", X = df, BY = "treatment")

# treatment 1 on top, even though coef for treatment 3 is larger 
plot(pd)

Created on 2023-10-28 with reprex v2.0.2

Regarding 6, indeed, thus far the implementations of H statistics I've explored are not fast enough in practice, whereas hstats addresses that!

@mayer79
Collaborator

mayer79 commented Oct 28, 2023

This is true. The reason for the current implementation is the focus on interactions, along the lines of: "Oh, there seems to be a strong interaction between x and z; let's study the stratified PDP."

Maybe we can add an option like grid_per_BY = FALSE or common_grid_BY = TRUE to partial_dep() to cover your situation?

@RoelVerbelen
Author

You already have one argument named by_size, so maybe a by_method with options "stratified" or "counterfactual"?

@mayer79
Collaborator

mayer79 commented Oct 28, 2023

by_method sounds great. Just to get it right: The current implementation

  1. calculates a grid g for the x variable, then
  2. splits the data per BY group, then
  3. subsamples n_max rows per BY group,
  4. then calculates partial dependence within BY group, using the common grid g

And your method would switch the order of the first two steps?

@RoelVerbelen
Author

No, not really. I usually look at:

  1. calculate a grid for the x variable
  2. calculate a grid for the BY variable
  3. expand.grid() on both (so this can include combinations not seen in the data)
  4. subsample n_max rows from the data (don't filter for BY group)
  5. full join the (x, BY) grid and the subsample
  6. predict
  7. compute means by (x, BY) combination
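The seven steps above can be sketched in base R roughly as follows (counterfactual_pd() is a hypothetical helper, not part of {hstats}; it assumes a model with a standard predict() method):

```r
# Sketch of the counterfactual PD recipe: cross the (x, BY) grid with a
# background sample, predict once, and average per (x, BY) cell.
counterfactual_pd <- function(fit, df, v, by, n_max = 500,
                              pred_fun = function(m, d) predict(m, d)) {
  # 1.-3. grids for x and BY, crossed (may contain combinations not in the data)
  grid <- expand.grid(unique(df[[v]]), unique(df[[by]]), KEEP.OUT.ATTRS = FALSE)
  names(grid) <- c(v, by)

  # 4. subsample background rows from the full data (no filtering per BY group)
  keep <- setdiff(names(df), c(v, by))
  bg <- df[sample(nrow(df), min(n_max, nrow(df))), keep, drop = FALSE]

  # 5. full cross join of the (x, BY) grid with the background sample
  X <- merge(grid, bg, by = NULL)

  # 6.-7. a single predict() call, then means per (x, BY) combination
  X$pred <- pred_fun(fit, X)
  aggregate(pred ~ ., data = X[c(v, by, "pred")], FUN = mean)
}
```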

@mayer79
Collaborator

mayer79 commented Oct 28, 2023

Ah, like a PDP for two features, but visualized not as a heatmap but rather as a line plot? If that's the case, we simply need to add an option to plot.partial_dep()!

Like:

pd <- partial_dep(pros_gbm, v = c("PSA", "GLEASON"), X = prostate_df, pred_fun = pred_fun, grid_size = 1000)
plot(pd, d2_geom = "line")
plot(pd, d2_geom = "line", swap_dim = TRUE)

@RoelVerbelen
Author

Yeah, that sounds great and like an elegant solution.

@mayer79
Collaborator

mayer79 commented Oct 29, 2023

I like this visualization much better than the heatmap style! And it needs only a single predict() call. I am already thinking of using it as the default geom...

mayer79 closed this as not planned on Nov 17, 2023.