Add ptype names and types to the metadata #116

Closed
wants to merge 1 commit into from

bjfletcher

With the model registry (board of pinned models), the number one most requested feature from my data science team is metadata about model input parameters.

Workaround 1

With current vetiver, this is possible by loading each pinned model's .rds file from the board and running the code below on its ptype:

lapply(ptype, class)

but having to load the .rds file for every model (some very large due to large models) in order to get the ptype object is expensive and slow.
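For readers less familiar with R typing, here is a minimal sketch of what `lapply(ptype, class)` returns. The zero-row data frame is a hypothetical stand-in for a model's `ptype` (its column names are made up for illustration), so nothing needs to be loaded from a board to try it:

```r
# Hypothetical prototype: a zero-row data frame standing in for v$ptype.
# The column names are illustrative assumptions, not taken from a real model.
ptype <- data.frame(
  species = factor(character(), levels = c("Adelie", "Chinstrap", "Gentoo")),
  bill_depth_mm = numeric(),
  measured_at = .POSIXct(numeric(), tz = "UTC")
)

# One entry per input column, each holding that column's class vector.
col_classes <- lapply(ptype, class)

# A column can carry more than one class string: measured_at yields
# c("POSIXct", "POSIXt"), so a consumer of this metadata should not
# assume exactly one type per column.
print(col_classes)
```

Note that the factor column only reports `"factor"` here; its levels are not part of the class.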

Workaround 2

We can ask data scientists to replace:

v <- vetiver_model(model)

with:

v <- vetiver_model(model)
v$metadata$user$ptype <- lapply(v$ptype, class)

but this is inconsistent and a bit confusing for users: "Isn't ptype already included? Do we need to specify the ptype? Why is description already in metadata but not ptype?" etc. The lapply call and the use of class are also only clear to those with a more advanced understanding of R typing. Would it be better to have this included in the vetiver library?

Proposal

This PR includes metadata about ptype when the ptype is included. This means that with the following code:

v <- vetiver_model(model)

it will include the ptype column names and types in the metadata, by default.

@bjfletcher bjfletcher changed the title add ptype names and types to the metadata Add ptype names and types to the metadata Jun 30, 2022
@juliasilge
Member

Thank you for this contribution @bjfletcher!

This is a really interesting question. We had originally put both the ptype and required packages into the serialized object in the board rather than the metadata because our current strategy doesn't involve using something like the ptype unless the model needs to be loaded too.

The other consideration is that the ptype is more than column names and types; it includes other information needed for prediction like factor levels:

library(vetiver)
data(penguins, package = "modeldata")

penguins_lm <- lm(bill_length_mm ~ species + bill_depth_mm, data = penguins)
penguin_ptype <- vetiver_ptype(penguins_lm)

attributes(penguin_ptype$species)
#> $levels
#> [1] "Adelie"    "Chinstrap" "Gentoo"   
#> 
#> $class
#> [1] "factor"

Created on 2022-07-01 by the reprex package (v2.0.1)

It's not super straightforward to capture all input data prototype info you might need to make a prediction outside of a serialized binary object.
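To make the gap concrete, here is a rough sketch of a plain-list summary that records factor levels alongside the class of each column. `describe_ptype()` is a hypothetical helper, not part of vetiver, and the zero-row data frame is an assumed stand-in for a real prototype. Even this only covers factors; other attributes (time zones, units, nested columns) would need their own handling.

```r
# Hypothetical helper (not a vetiver function): summarise each prototype
# column as a plain list that could be stored as text metadata.
describe_ptype <- function(ptype) {
  lapply(ptype, function(col) {
    info <- list(class = class(col))
    # Factor levels are needed at prediction time but are lost if only
    # the class is recorded, so keep them explicitly.
    if (is.factor(col)) {
      info$levels <- levels(col)
    }
    info
  })
}

# Assumed stand-in for the penguins prototype shown above.
ptype <- data.frame(
  species = factor(character(), levels = c("Adelie", "Chinstrap", "Gentoo")),
  bill_depth_mm = numeric()
)

desc <- describe_ptype(ptype)
str(desc)
```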

I wonder if it would be better to use the OpenAPI specification for this. This contains the JSON types, rather than the R types:

library(vetiver)
library(plumber)
data(penguins, package = "modeldata")

penguins_lm <- lm(bill_length_mm ~ species + bill_depth_mm, data = penguins)
v <- vetiver_model(penguins_lm, "palmer_penguins")

penguins_api <- pr() %>%
    vetiver_api(v)

api_spec <- penguins_api$getApiSpec()
super_nested_info <- api_spec$paths$`/predict`$post$requestBody$content$`application/json`$schema$items$properties
purrr::map_dfr(super_nested_info, "type")
#> # A tibble: 1 × 2
#>   species bill_depth_mm
#>   <chr>   <chr>        
#> 1 string  number

Created on 2022-07-01 by the reprex package (v2.0.1)

You can get this info without loading the model into memory once the model is deployed. For example, if you look at this model:
https://colorado.rstudio.com/rsc/seattle-housing/
Its OpenAPI specification is at:
https://colorado.rstudio.com/rsc/seattle-housing/openapi.json

api_spec <- jsonlite::fromJSON("https://colorado.rstudio.com/rsc/seattle-housing/openapi.json")
super_nested_info <- api_spec$paths$`/predict`$post$requestBody$content$`application/json`$schema$items$properties
purrr::map_dfr(super_nested_info, "type")
#> # A tibble: 1 × 4
#>   bedrooms bathrooms sqft_living yr_built
#>   <chr>    <chr>     <chr>       <chr>   
#> 1 integer  number    integer     integer

Created on 2022-07-01 by the reprex package (v2.0.1)

We could make a little function to handle that deeply nested mess.
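One possible shape for such a function, as a sketch only: `openapi_feature_types()` is a hypothetical name, not an existing vetiver function, and it assumes the spec has already been parsed into a nested list (for example with `jsonlite::fromJSON()`) and that the prediction endpoint sits at `/predict` as in the examples above. The mock spec below is hand-built so the example runs without a deployed model.

```r
# Hypothetical helper: pull feature names and JSON types out of a parsed
# OpenAPI spec. Assumes the request body schema is an array of objects,
# as in the specs shown in this thread.
openapi_feature_types <- function(api_spec, path = "/predict") {
  content <- api_spec$paths[[path]]$post$requestBody$content
  props <- content[["application/json"]]$schema$items$properties
  if (is.null(props)) {
    stop("No request body properties found for path: ", path)
  }
  data.frame(
    feature = names(props),
    json_type = vapply(
      props,
      # Fall back to NA when a property has no declared type.
      function(p) if (is.null(p[["type"]])) NA_character_ else p[["type"]],
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

# Hand-built mock of the relevant part of a spec (an assumption for
# illustration, not the output of a real deployment).
props <- list(
  bedrooms = list(type = "integer"),
  bathrooms = list(type = "number")
)
mock_spec <- list(
  paths = list(
    `/predict` = list(
      post = list(
        requestBody = list(
          content = list(
            `application/json` = list(
              schema = list(items = list(properties = props))
            )
          )
        )
      )
    )
  )
)

res <- openapi_feature_types(mock_spec)
print(res)
```

Returning one row per feature keeps the result easy to join against other model metadata.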

@juliasilge
Member

This would be better, actually:

api_spec <- jsonlite::fromJSON("https://colorado.rstudio.com/rsc/seattle-housing/openapi.json")
super_nested_info <- api_spec$paths$`/predict`$post$requestBody$content$`application/json`$schema$items$properties
purrr::map_chr(super_nested_info, "type") |> tibble::enframe("feature", "json_type")
#> # A tibble: 4 × 2
#>   feature     json_type
#>   <chr>       <chr>    
#> 1 bedrooms    integer  
#> 2 bathrooms   number   
#> 3 sqft_living integer  
#> 4 yr_built    integer

Created on 2022-07-01 by the reprex package (v2.0.1)

In your particular use case @bjfletcher are you interested in the R types or the JSON types, for making predictions?

@juliasilge
Member

With the publication of the new cereal package and the changes in #220, this is now doable:

url <- "https://colorado.posit.co/rsc/chicago-ridership/prototype"
r <- httr::GET(url)
prototype <- httr::content(r, as = "text", encoding = "UTF-8")
p <- cereal::cereal_from_json(prototype)
str(p)
#> tibble [0 × 49] (S3: tbl_df/tbl/data.frame)
#>  $ Austin          : num(0) 
#>  $ Quincy_Wells    : num(0) 
#>  $ Belmont         : num(0) 
#>  $ Archer_35th     : num(0) 
#>  $ Oak_Park        : num(0) 
#>  $ Western         : num(0) 
#>  $ Clark_Lake      : num(0) 
#>  $ Clinton         : num(0) 
#>  $ Merchandise_Mart: num(0) 
#>  $ Irving_Park     : num(0) 
#>  $ Washington_Wells: num(0) 
#>  $ Harlem          : num(0) 
#>  $ Monroe          : num(0) 
#>  $ Polk            : num(0) 
#>  $ Ashland         : num(0) 
#>  $ Kedzie          : num(0) 
#>  $ Addison         : num(0) 
#>  $ Jefferson_Park  : num(0) 
#>  $ Montrose        : num(0) 
#>  $ California      : num(0) 
#>  $ temp_min        : num(0) 
#>  $ temp            : num(0) 
#>  $ temp_max        : num(0) 
#>  $ temp_change     : num(0) 
#>  $ dew             : num(0) 
#>  $ humidity        : num(0) 
#>  $ pressure        : num(0) 
#>  $ pressure_change : num(0) 
#>  $ wind            : num(0) 
#>  $ wind_max        : num(0) 
#>  $ gust            : num(0) 
#>  $ gust_max        : num(0) 
#>  $ percip          : num(0) 
#>  $ percip_max      : num(0) 
#>  $ weather_rain    : num(0) 
#>  $ weather_snow    : num(0) 
#>  $ weather_cloud   : num(0) 
#>  $ weather_storm   : num(0) 
#>  $ Blackhawks_Away : num(0) 
#>  $ Blackhawks_Home : num(0) 
#>  $ Bulls_Away      : num(0) 
#>  $ Bulls_Home      : num(0) 
#>  $ Bears_Away      : num(0) 
#>  $ Bears_Home      : num(0) 
#>  $ WhiteSox_Away   : num(0) 
#>  $ WhiteSox_Home   : num(0) 
#>  $ Cubs_Away       : num(0) 
#>  $ Cubs_Home       : num(0) 
#>  $ date            : 'Date' num(0)

Created on 2023-07-03 with reprex v2.0.2

Thanks so much for your ideas here and your feedback! 🙌

@juliasilge juliasilge closed this Jul 3, 2023
2 participants