Add ptype names and types to the metadata #116
Thank you for this contribution @bjfletcher! This is a really interesting question. We had originally put both the
The other consideration is that the ptype is more than column names and types; it includes other information needed for prediction like factor levels:
library(vetiver)
data(penguins, package = "modeldata")
penguins_lm <- lm(bill_length_mm ~ species + bill_depth_mm, data = penguins)
penguin_ptype <- vetiver_ptype(penguins_lm)
attributes(penguin_ptype$species)
#> $levels
#> [1] "Adelie" "Chinstrap" "Gentoo"
#>
#> $class
#> [1] "factor"
Created on 2022-07-01 by the reprex package (v2.0.1)
It's not super straightforward to capture all input data prototype info you might need to make a prediction outside of a serialized binary object. I wonder if it would be better to use the OpenAPI specification for this. This contains the JSON types, rather than the R types:
library(vetiver)
library(plumber)
data(penguins, package = "modeldata")
penguins_lm <- lm(bill_length_mm ~ species + bill_depth_mm, data = penguins)
v <- vetiver_model(penguins_lm, "palmer_penguins")
penguins_api <- pr() %>%
  vetiver_api(v)
api_spec <- penguins_api$getApiSpec()
super_nested_info <- api_spec$paths$`/predict`$post$requestBody$content$`application/json`$schema$items$properties
purrr::map_dfr(super_nested_info, "type")
#> # A tibble: 1 × 2
#> species bill_depth_mm
#> <chr> <chr>
#> 1 string number
Created on 2022-07-01 by the reprex package (v2.0.1)
You can get this info without loading the model into memory once the model is deployed. For example, if you look at this model:
api_spec <- jsonlite::fromJSON("https://colorado.rstudio.com/rsc/seattle-housing/openapi.json")
super_nested_info <- api_spec$paths$`/predict`$post$requestBody$content$`application/json`$schema$items$properties
purrr::map_dfr(super_nested_info, "type")
#> # A tibble: 1 × 4
#> bedrooms bathrooms sqft_living yr_built
#> <chr> <chr> <chr> <chr>
#> 1 integer number integer integer
Created on 2022-07-01 by the reprex package (v2.0.1)
We could make a little function to handle that deeply nested mess.
This would be better, actually:
api_spec <- jsonlite::fromJSON("https://colorado.rstudio.com/rsc/seattle-housing/openapi.json")
super_nested_info <- api_spec$paths$`/predict`$post$requestBody$content$`application/json`$schema$items$properties
purrr::map_chr(super_nested_info, "type") |> tibble::enframe("feature", "json_type")
#> # A tibble: 4 × 2
#> feature json_type
#> <chr> <chr>
#> 1 bedrooms integer
#> 2 bathrooms number
#> 3 sqft_living integer
#> 4 yr_built integer
Created on 2022-07-01 by the reprex package (v2.0.1)
In your particular use case @bjfletcher are you interested in the R types or the JSON types, for making predictions?
With the publication of the new cereal package and the changes in #220, this is now doable:
url <- "https://colorado.posit.co/rsc/chicago-ridership/prototype"
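For reference, the "little function" idea for the deeply nested lookup could be sketched roughly like this. This is only an illustration: `extract_json_types()` and `toy_spec` are hypothetical names, not part of vetiver, and the sketch assumes the spec has the same nesting as the `/predict` request body shown above. It uses base R only, and a hand-built spec so it runs offline.

```r
# Hypothetical helper (not part of vetiver): pull the JSON type of each
# predictor out of an OpenAPI spec that has already been parsed into a list,
# e.g. by jsonlite::fromJSON() or penguins_api$getApiSpec().
extract_json_types <- function(api_spec, endpoint = "/predict") {
  body <- api_spec$paths[[endpoint]]$post$requestBody
  props <- body$content[["application/json"]]$schema$items$properties
  if (is.null(props)) {
    stop("No request-body properties found for endpoint: ", endpoint)
  }
  data.frame(
    feature = names(props),
    json_type = unname(vapply(props, function(p) p[["type"]], character(1))),
    stringsAsFactors = FALSE
  )
}

# Hand-built stand-in with the same nesting as a real spec (assumption)
toy_props <- list(
  bedrooms = list(type = "integer"),
  bathrooms = list(type = "number")
)
toy_spec <- list(paths = list("/predict" = list(post = list(requestBody = list(
  content = list("application/json" = list(schema = list(items = list(properties = toy_props))))
)))))

extract_json_types(toy_spec)
```

With a deployed model, the same function could be given the result of `jsonlite::fromJSON()` on the API's `openapi.json` instead of `toy_spec`.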
r <- httr::GET(url)
prototype <- httr::content(r, as = "text", encoding = "UTF-8")
p <- cereal::cereal_from_json(prototype)
str(p)
#> tibble [0 × 49] (S3: tbl_df/tbl/data.frame)
#> $ Austin : num(0)
#> $ Quincy_Wells : num(0)
#> $ Belmont : num(0)
#> $ Archer_35th : num(0)
#> $ Oak_Park : num(0)
#> $ Western : num(0)
#> $ Clark_Lake : num(0)
#> $ Clinton : num(0)
#> $ Merchandise_Mart: num(0)
#> $ Irving_Park : num(0)
#> $ Washington_Wells: num(0)
#> $ Harlem : num(0)
#> $ Monroe : num(0)
#> $ Polk : num(0)
#> $ Ashland : num(0)
#> $ Kedzie : num(0)
#> $ Addison : num(0)
#> $ Jefferson_Park : num(0)
#> $ Montrose : num(0)
#> $ California : num(0)
#> $ temp_min : num(0)
#> $ temp : num(0)
#> $ temp_max : num(0)
#> $ temp_change : num(0)
#> $ dew : num(0)
#> $ humidity : num(0)
#> $ pressure : num(0)
#> $ pressure_change : num(0)
#> $ wind : num(0)
#> $ wind_max : num(0)
#> $ gust : num(0)
#> $ gust_max : num(0)
#> $ percip : num(0)
#> $ percip_max : num(0)
#> $ weather_rain : num(0)
#> $ weather_snow : num(0)
#> $ weather_cloud : num(0)
#> $ weather_storm : num(0)
#> $ Blackhawks_Away : num(0)
#> $ Blackhawks_Home : num(0)
#> $ Bulls_Away : num(0)
#> $ Bulls_Home : num(0)
#> $ Bears_Away : num(0)
#> $ Bears_Home : num(0)
#> $ WhiteSox_Away : num(0)
#> $ WhiteSox_Home : num(0)
#> $ Cubs_Away : num(0)
#> $ Cubs_Home : num(0)
#> $ date : 'Date' num(0)
Created on 2023-07-03 with reprex v2.0.2
Thanks so much for your ideas here and your feedback! 🙌
With the model registry (board of pinned models), the number one most requested feature from my data science team is metadata about model input parameters.
Workaround 1
With current vetiver, it is possible by running the below code with every pinned model's rds file in the board:
but having to load the .rds file for every model (some very large due to large models) in order to get the ptype object is expensive and slow.
Workaround 2
We can ask data scientists to replace:
with:
but the usability/inconsistency of it is a bit confusing - "isn't ptype already included? do we need to specify the ptype? why is description already in metadata but not ptype?" etc. The lapply code and use of class is also only understood by those with a more advanced understanding of R typing. Would it be better to have it included in the vetiver library?
Proposal
This PR includes metadata about ptype when the ptype is included. This means that with the following code:
it will include ptype along with their metadata, by default.
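As a rough illustration of the kind of per-column metadata discussed here (names, R classes and, for factors, levels), something along these lines could be derived from a zero-row prototype. This is a sketch under assumptions, not the implementation in this PR: `describe_ptype()` and `toy_ptype` are hypothetical names, and the toy prototype only mimics the penguins example from this thread.

```r
# Hypothetical helper, not the PR's implementation: describe each column of a
# zero-row prototype by its first class and, for factors, its levels.
describe_ptype <- function(ptype) {
  lapply(ptype, function(col) {
    info <- list(type = class(col)[1])
    if (is.factor(col)) {
      info$levels <- levels(col)
    }
    info
  })
}

# Toy zero-row prototype (assumption) mirroring the penguins example
toy_ptype <- data.frame(
  species = factor(character(), levels = c("Adelie", "Chinstrap", "Gentoo")),
  bill_depth_mm = numeric()
)

str(describe_ptype(toy_ptype))
```

The result is a plain named list, so it could be stored next to other metadata without having to load the model binary to read it back.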