Add ptype names and types to the metadata #116

bjfletcher · 2022-06-30T20:18:20Z

With the model registry (a board of pinned models), the most requested feature from my data science team is metadata about model input parameters.

Workaround 1

With current vetiver, this is possible by running the code below against every pinned model's .rds file in the board:

lapply(ptype, class)

but loading the .rds file for every model (some of them very large) just to get the ptype object is expensive and slow.

Workaround 2

We can ask data scientists to replace:

v <- vetiver_model(model)

with:

v <- vetiver_model(model)
v$metadata$user$ptype <- lapply(v$ptype, class)

but the inconsistency hurts usability and is a bit confusing: "Isn't ptype already included? Do we need to specify the ptype? Why is description already in metadata but not ptype?" and so on. The lapply code and its use of class are also only understood by those with a more advanced grasp of R's type system. Would it be better to have this included in the vetiver library?
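To make concrete what this workaround stores, here is a minimal base R sketch. The zero-row data frame and its column names are hypothetical stand-ins for `v$ptype`; nothing here calls vetiver itself.

```r
# Hypothetical stand-in for v$ptype: a zero-row data frame that records
# only the column names and types a model expects as input.
ptype <- data.frame(
  bedrooms = integer(),
  bathrooms = numeric(),
  city = character(),
  stringsAsFactors = FALSE
)

# The workaround's summary: a named list mapping each column to its class vector.
col_types <- lapply(ptype, class)
str(col_types)

# Collapse each class vector to one string so multi-class columns
# (for example ordered factors) still yield a single value per column.
flat_types <- vapply(col_types, paste, character(1), collapse = "/")
flat_types
```

The result is a small named list (or character vector) of plain strings, which is why it fits comfortably in pin metadata without loading the model object.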

Proposal

This PR adds metadata about the ptype (column names and types) whenever a ptype is included. This means that with the following code:

v <- vetiver_model(model)

the metadata will include the ptype names and types by default.

juliasilge · 2022-07-01T17:34:36Z

Thank you for this contribution @bjfletcher!

This is a really interesting question. We had originally put both the ptype and required packages into the serialized object in the board rather than the metadata because our current strategy doesn't involve using something like the ptype unless the model needs to be loaded, too.

The other consideration is that the ptype is more than column names and types; it includes other information needed for prediction like factor levels:

library(vetiver)
data(penguins, package = "modeldata")

penguins_lm <- lm(bill_length_mm ~ species + bill_depth_mm, data = penguins)
penguin_ptype <- vetiver_ptype(penguins_lm)

attributes(penguin_ptype$species)
#> $levels
#> [1] "Adelie"    "Chinstrap" "Gentoo"   
#> 
#> $class
#> [1] "factor"

Created on 2022-07-01 by the reprex package (v2.0.1)

It's not super straightforward to capture all input data prototype info you might need to make a prediction outside of a serialized binary object.
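A small base R sketch of the gap, using a hypothetical zero-row data frame in place of a real ptype: a class-only summary drops the factor levels, so any metadata representation would have to record them explicitly. The `describe_column()` helper is illustrative only and is not part of vetiver.

```r
# Hypothetical ptype with one factor column and one numeric column.
ptype <- data.frame(
  species = factor(character(), levels = c("Adelie", "Chinstrap", "Gentoo")),
  bill_depth_mm = numeric()
)

# Class-only summary: records "factor" but loses the levels.
class_only <- lapply(ptype, class)

# Illustrative helper: keep the class and, for factors, the levels as well.
describe_column <- function(x) {
  out <- list(class = class(x))
  if (is.factor(x)) {
    out$levels <- levels(x)
  }
  out
}

richer <- lapply(ptype, describe_column)
str(richer)
```

Even this richer summary covers only factors; other column types (dates with time zones, ordered factors, and so on) would each need their own handling, which is the difficulty described above.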

I wonder if it would be better to use the OpenAPI specification for this. This contains the JSON types, rather than the R types:

library(vetiver)
library(plumber)
data(penguins, package = "modeldata")

penguins_lm <- lm(bill_length_mm ~ species + bill_depth_mm, data = penguins)
v <- vetiver_model(penguins_lm, "palmer_penguins")

penguins_api <- pr() %>%
    vetiver_api(v)

api_spec <- penguins_api$getApiSpec()
super_nested_info <- api_spec$paths$`/predict`$post$requestBody$content$`application/json`$schema$items$properties
purrr::map_dfr(super_nested_info, "type")
#> # A tibble: 1 × 2
#>   species bill_depth_mm
#>   <chr>   <chr>        
#> 1 string  number

Created on 2022-07-01 by the reprex package (v2.0.1)

You can get this info without loading the model into memory once the model is deployed. For example, if you look at this model:
https://colorado.rstudio.com/rsc/seattle-housing/
its OpenAPI specification is at:
https://colorado.rstudio.com/rsc/seattle-housing/openapi.json

api_spec <- jsonlite::fromJSON("https://colorado.rstudio.com/rsc/seattle-housing/openapi.json")
super_nested_info <- api_spec$paths$`/predict`$post$requestBody$content$`application/json`$schema$items$properties
purrr::map_dfr(super_nested_info, "type")
#> # A tibble: 1 × 4
#>   bedrooms bathrooms sqft_living yr_built
#>   <chr>    <chr>     <chr>       <chr>   
#> 1 integer  number    integer     integer

Created on 2022-07-01 by the reprex package (v2.0.1)

We could make a little function to handle that deeply nested mess.

juliasilge · 2022-07-01T17:51:26Z

This would be better, actually:

api_spec <- jsonlite::fromJSON("https://colorado.rstudio.com/rsc/seattle-housing/openapi.json")
super_nested_info <- api_spec$paths$`/predict`$post$requestBody$content$`application/json`$schema$items$properties
purrr::map_chr(super_nested_info, "type") |> tibble::enframe("feature", "json_type")
#> # A tibble: 4 × 2
#>   feature     json_type
#>   <chr>       <chr>    
#> 1 bedrooms    integer  
#> 2 bathrooms   number   
#> 3 sqft_living integer  
#> 4 yr_built    integer

Created on 2022-07-01 by the reprex package (v2.0.1)
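The deeply nested lookup could be wrapped in a small helper along these lines. This is only a sketch: `extract_json_types()` is a hypothetical name, not a vetiver function, and it assumes the spec has already been parsed into a nested list (as `jsonlite::fromJSON()` returns) and that every property carries a single `type` string. The toy spec below stands in for a real `openapi.json`.

```r
# Hypothetical helper: pull feature names and JSON types out of a parsed
# OpenAPI spec for one POST endpoint.
extract_json_types <- function(api_spec, path = "/predict") {
  props <- api_spec$paths[[path]]$post$requestBody$content[["application/json"]]$schema$items$properties
  if (is.null(props)) {
    stop("No request body properties found for path: ", path)
  }
  # Assumes each property has exactly one `type` string.
  types <- vapply(props, function(p) p[["type"]], character(1))
  data.frame(
    feature = names(types),
    json_type = unname(types),
    stringsAsFactors = FALSE
  )
}

# Toy spec with the same nesting as the examples above (made-up features).
toy_properties <- list(
  bedrooms = list(type = "integer"),
  bathrooms = list(type = "number")
)
toy_spec <- list(paths = list("/predict" = list(post = list(requestBody = list(
  content = list("application/json" = list(
    schema = list(items = list(properties = toy_properties))
  ))
)))))

res <- extract_json_types(toy_spec)
res
```

Using base R here keeps the sketch free of extra dependencies; the `purrr::map_chr()` and `tibble::enframe()` version above does the same job.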

In your particular use case @bjfletcher are you interested in the R types or the JSON types, for making predictions?

juliasilge · 2023-07-03T16:15:11Z

With the publication of the new cereal package and the changes in #220, this is now doable:

url <- "https://colorado.posit.co/rsc/chicago-ridership/prototype"
r <- httr::GET(url)
prototype <- httr::content(r, as = "text", encoding = "UTF-8")
p <- cereal::cereal_from_json(prototype)
str(p)
#> tibble [0 × 49] (S3: tbl_df/tbl/data.frame)
#>  $ Austin          : num(0) 
#>  $ Quincy_Wells    : num(0) 
#>  $ Belmont         : num(0) 
#>  $ Archer_35th     : num(0) 
#>  $ Oak_Park        : num(0) 
#>  $ Western         : num(0) 
#>  $ Clark_Lake      : num(0) 
#>  $ Clinton         : num(0) 
#>  $ Merchandise_Mart: num(0) 
#>  $ Irving_Park     : num(0) 
#>  $ Washington_Wells: num(0) 
#>  $ Harlem          : num(0) 
#>  $ Monroe          : num(0) 
#>  $ Polk            : num(0) 
#>  $ Ashland         : num(0) 
#>  $ Kedzie          : num(0) 
#>  $ Addison         : num(0) 
#>  $ Jefferson_Park  : num(0) 
#>  $ Montrose        : num(0) 
#>  $ California      : num(0) 
#>  $ temp_min        : num(0) 
#>  $ temp            : num(0) 
#>  $ temp_max        : num(0) 
#>  $ temp_change     : num(0) 
#>  $ dew             : num(0) 
#>  $ humidity        : num(0) 
#>  $ pressure        : num(0) 
#>  $ pressure_change : num(0) 
#>  $ wind            : num(0) 
#>  $ wind_max        : num(0) 
#>  $ gust            : num(0) 
#>  $ gust_max        : num(0) 
#>  $ percip          : num(0) 
#>  $ percip_max      : num(0) 
#>  $ weather_rain    : num(0) 
#>  $ weather_snow    : num(0) 
#>  $ weather_cloud   : num(0) 
#>  $ weather_storm   : num(0) 
#>  $ Blackhawks_Away : num(0) 
#>  $ Blackhawks_Home : num(0) 
#>  $ Bulls_Away      : num(0) 
#>  $ Bulls_Home      : num(0) 
#>  $ Bears_Away      : num(0) 
#>  $ Bears_Home      : num(0) 
#>  $ WhiteSox_Away   : num(0) 
#>  $ WhiteSox_Home   : num(0) 
#>  $ Cubs_Away       : num(0) 
#>  $ Cubs_Home       : num(0) 
#>  $ date            : 'Date' num(0)

Created on 2023-07-03 with reprex v2.0.2

Thanks so much for your ideas here and your feedback! 🙌

add ptype names and types to the metadata

54f7ca2

bjfletcher changed the title ~~add ptype names and types to the metadata~~ Add ptype names and types to the metadata Jun 30, 2022

isabelizimm mentioned this pull request Nov 30, 2022

Refactor metadata rstudio/vetiver-python#126

Merged

juliasilge mentioned this pull request Dec 8, 2022

Refactor metadata, specifically move required_pkgs #168

Closed

juliasilge mentioned this pull request Apr 18, 2023

Add new GET endpoint for /prototype #197

Closed

juliasilge mentioned this pull request Jun 9, 2023

Add new /prototype endpoint #220

Merged

juliasilge closed this Jul 3, 2023
