Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add output type conversion functions #150

Open
wants to merge 46 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
46 commits
Select commit Hold shift + click to select a range
49eab2b
start transformation functions
eahowerton Jun 6, 2024
bca3fab
Add documentation
eahowerton Jun 6, 2024
8dd2ad8
switch from summarize() to reframe()
eahowerton Jun 6, 2024
48dfffb
add preliminary validations
eahowerton Jun 6, 2024
71dacb9
add validations of new_output_type_id
eahowerton Jun 6, 2024
9b81d7a
update documentation, lint
eahowerton Jun 6, 2024
d8f9e4b
Update DESCRIPTION
eahowerton Jun 6, 2024
5b9e311
Create get_task_id_cols.R
eahowerton Jun 6, 2024
2a6c595
use get_task_id_cols()
eahowerton Jun 6, 2024
10ff653
Create get_task_id_cols.R (#149)
eahowerton Jun 6, 2024
7243570
use get_task_id_cols()
eahowerton Jun 6, 2024
243cdb5
Merge branch 'output-type-conversion' of https://github.com/Infectiou…
eahowerton Jun 6, 2024
cc4e002
switch model_outputs to model_out_tbl
eahowerton Jun 6, 2024
4db0992
add functionality for multiple new output_types
eahowerton Jun 7, 2024
bee1c61
lint
eahowerton Jun 7, 2024
94e1b40
Merge branch 'main' into output-type-conversion2
lshandross Sep 11, 2024
28a6be6
Update DESCRIPTION
lshandross Sep 11, 2024
cd82801
Rename `convert_output_types()` title
lshandross Sep 11, 2024
8bfee20
Punctuation fixes
lshandross Sep 11, 2024
727d33d
Remove within package `hubUtils` namespacing
lshandross Sep 11, 2024
d07d985
Update DESCRIPTION
lshandross Sep 11, 2024
2d7d19c
Update NAMESPACE
lshandross Sep 11, 2024
0f73f84
Update convert_output_types.R
lshandross Sep 11, 2024
944404d
Fix linting issues
lshandross Sep 11, 2024
94fa108
Document `convert_output_type()`
lshandross Sep 11, 2024
ac6333a
Update DESCRIPTION
lshandross Sep 11, 2024
e8595e1
Fix failing tests
lshandross Sep 12, 2024
781016d
Improve output type conversion validation messages
lshandross Sep 12, 2024
1a9789e
Define global functions
lshandross Sep 12, 2024
84ff4aa
Clean up `convert_output_types` functions
lshandross Sep 12, 2024
9fa2e35
Don't export smaller conversion functions
lshandross Sep 12, 2024
db08c45
Document `get_task_id_cols()`
lshandross Sep 12, 2024
75df987
Fix `convert_output_type()` example
lshandross Sep 12, 2024
ee6b098
Clarify `convert_output_type()` description and return value
lshandross Sep 12, 2024
de84f22
Fix linting issues
lshandross Sep 12, 2024
d4119e2
Add single starting output type validation and test
lshandross Sep 12, 2024
1e48584
Remove duplicate test
lshandross Sep 12, 2024
66ad690
Remove seed and check for `distfromq`
lshandross Sep 12, 2024
4b8ed54
Fix failing test
lshandross Sep 12, 2024
e23e047
Refactor new `validate_output_type_id()` function
lshandross Sep 13, 2024
6dd2926
Update documentation
lshandross Sep 13, 2024
95d6f28
Testing fixes
lshandross Sep 13, 2024
254b737
Merge branch 'main' into output-type-conversion2
lshandross Oct 1, 2024
522e88f
model_out_tbl output type col is character in `convert_output_type()`…
lshandross Oct 3, 2024
054bffb
Make `new_output_type_id` arg usually a named list; simplify its vali…
lshandross Oct 3, 2024
cbe2acf
Fix quantile -> median, mean/cdf transformation (use intermediary sam…
lshandross Oct 4, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ Imports:
checkmate,
cli,
curl,
distfromq,
fs,
gh,
glue,
Expand All @@ -40,9 +41,14 @@ Imports:
Suggests:
arrow (>= 17.0.0),
dplyr,
hubData,
knitr,
rmarkdown,
testthat (>= 3.2.0)
Remotes:
hubverse-org/hubData,
reichlab/distfromq
Additional_repositories: https://hubverse-org.r-universe.dev/
Config/Needs/website: hubverse-org/hubStyle
Config/testthat/edition: 3
Encoding: UTF-8
Expand Down
3 changes: 3 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ S3method(read_config,default)
export("%>%")
export(as_model_out_tbl)
export(check_deprecated_schema)
export(convert_output_type)
export(extract_schema_version)
export(get_config_tid)
export(get_round_ids)
Expand All @@ -15,6 +16,7 @@ export(get_schema)
export(get_schema_url)
export(get_schema_valid_versions)
export(get_schema_version_latest)
export(get_task_id_cols)
export(get_task_id_names)
export(is_v3_config)
export(is_v3_config_file)
Expand All @@ -30,3 +32,4 @@ importFrom(gh,gh)
importFrom(jsonlite,fromJSON)
importFrom(jsonlite,read_json)
importFrom(magrittr,"%>%")
importFrom(rlang,.data)
308 changes: 308 additions & 0 deletions R/convert_output_types.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,308 @@
#' Transform between output types
#'
#' Transform between output types for each unique combination of task IDs for
#' each model. Conversion must be from a single starting output type to one or more
#' new output types, and the resulting output will only contain the new output types.
#' See details for supported conversions.
#'
#' @param model_out_tbl an object of class `model_out_tbl` containing predictions
#' with a single, unique value in the `output_type` column.
#' @param new_output_type character vector of the desired output type(s) after
#' transformation. May contain any of the following output types:
#' `"mean"`, `"median"`, `"quantile"`, `"cdf"`.
#' @param new_output_type_id A named list indicating the desired output type IDs
#' for each new output type, in which each element is a vector of output type IDs.
#' If only one new output type is requested, then it may be a single numeric vector
#' (for `"quantile"` or `"cdf"`) or not required (`"mean"` or `"median"`). See
#' the examples for an illustration of both cases.
#' @param n_samples `numeric` that specifies the number of samples to use when
#' calculating output_types from an estimated quantile function. Defaults to `1e4`.
#' @param ... parameters that are passed to `distfromq::make_q_fn`, specifying
#' details of how to estimate a quantile function from provided quantile levels
#' and quantile values for `"quantile"` or `"cdf"` output types.
#'
#' @details
#' The following transformations are supported: (i) `"sample"` can be
#' transformed to `"mean"`, `"median"`, `"quantile"`, or `"cdf"`; (ii)
#' `"quantile"` can be transformed to `"mean"`, `"median"`, or `"cdf"`; and
#' (iii) `"cdf"` can be transformed to `"mean"`, `"median"`, or `"quantile"`.
#'
#' For `"quantile"` and `"cdf"` starting output types, we use the following approach:
#' 1. Interpolate and extrapolate from the provided quantiles or probabilities
#' for each component model to obtain an estimate of the cdf of that distribution.
#' 2. Draw samples from the distribution for each component model. To reduce
#' Monte Carlo variability, we use quasi-random samples corresponding to
#' quantiles of the estimated distribution.
#' 3. Calculate the desired quantity (e.g., mean).
#' If the median quantile is provided in the `model_out_tbl` object (i.e., the
#' original output_type is `"median"` and 0.5 is contained in the original
#' output_type_id), the median value is extracted and returned directly.
#'
#' If both `"quantile"` and `"cdf"` output_types are desired, `new_output_type_id`
#' should be a named list, where each element specifies the corresponding
#' `new_output_type_id`. See examples for an illustration.
#'
#' @examples
#' # We illustrate the conversion between output types using normal distributions
#' ex_quantiles <- c(0.25, 0.5, 0.75)
#' model_out_tbl <- expand.grid(
#' stringsAsFactors = FALSE,
#' group1 = c(1,2),
#' model_id = "A",
#' output_type = "quantile",
#' output_type_id = ex_quantiles
#' ) %>%
#' dplyr::mutate(value = qnorm(p = output_type_id, mean = group1))
#'
#' convert_output_type(model_out_tbl, new_output_type = "median", new_output_type_id = NA)
#'
#' # Next, we illustrate conversion from samples to quantile and cdf
#' ex_bins <- seq(-2,2,1)
#' ex_quantiles <- c(0.25, 0.5, 0.75)
#' model_out_tbl <- expand.grid(
#' stringsAsFactors = FALSE,
#' group1 = c(1,2),
#' model_id = "A",
#' output_type = "sample",
#' output_type_id = 1:100
#' ) %>%
#' dplyr::mutate(value = rnorm(200, mean = group1))
#'
#' convert_output_type(model_out_tbl, new_output_type = c("quantile", "cdf"),
#' new_output_type_id = list("quantile" = ex_quantiles, "cdf" = ex_bins))
#'
#' @return object of class `model_out_tbl` containing (only) the new output_type(s)
#' for each unique combination of task IDs for each model
#' @export
#' @importFrom rlang .data
convert_output_type <- function(model_out_tbl, new_output_type,
new_output_type_id = NA, n_samples = 1e4, ...) {
Comment on lines +78 to +79
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic of this function, where everything is converted to samples and from there to the target output type, has a certain kind of elegance and at the same time is kind of inefficient. For example, here's what we do when the starting_output_type is "quantile" and the new_output_type is "cdf":

  1. In get_samples_from_quantiles, we call distfromq::make_q_fn to estimate the quantile function and draw samples from it
  2. Then we compute the empirical cdf of those samples and evaluate that ecdf in

Restating, our goal was to get to cdf values, and in order to do that we (1) estimate the cdf; (2) invert that to get to a quantile function estimate; (3) draw samples from the quantile function; (4) get the empirical cdf of those samples; (5) evaluate that empirical cdf. Although it is conceptually clean to work through the "universal representation" of samples, steps 2-5 here are extra work and in particular the sample-based stuff in steps 3-5 will introduce more layers of approximation to the result.

Another alternative organization to this code might provide a set of "direct" transformations that don't use samples as an intermediate step. For example, the quantile to cdf transform might define a helper function along the lines of transform_quantile_to_cdf_one_group along these lines:

transform_quantile_to_cdf_one_group <- function(starting_ps, starting_qs, to_qs, ...) {
  p_fn <- distfromq::make_p_fn(
    ps = as.numeric(starting_ps),
    qs = starting_qs,
    ...
  )
  return(p_fn(to_qs))
}

Which could then be called on a grouped data frame somewhere, like

  samples <- model_out_tbl %>%
    dplyr::group_by(model_id, dplyr::across(dplyr::all_of(task_id_cols))) %>%
    dplyr::reframe(
      value = transform_quantile_to_cdf_one_group(
        starting_ps=as.numeric(.data$output_type_id),
        starting_qs=.data$value,
        to_qs=new_output_type_id,
        ...
      )

I think this approach would likely involve more code in the end, but could be nice as it would be more direct and involve less approximation. I'm open to your ideas about this!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, I had conceptualized a set of transformations like transform_quantile_to_cdf_one_group() that you describe @elray1. When most required transforming to samples, I decided for the elegant route. But I agree, there is additional approximation that is not necessary, and would ideally be avoided.

Question if we opt for the individual transformation functions: consider, for example, we are transforming from quantile to mean. The steps are (1) estimate the cdf, (2) invert to quantile function, (3) draw samples from quantile function, (4) calculate mean. Would this function call the transform_quantile_to_samples() function? Or would it replicate steps 1-3 within the transform_quantile_to mean() function?

@lshandross, I'm happy for you to make the final decision on this one.

Copy link
Contributor

@elray1 elray1 Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was a good prompt for me to think about which transformations require samples, estimates of the QF, and/or estimates of the CDF. Here's my thinking, does this seem corret?

  • starting type = quantile
    • new type = quantile: do we support this? should we? if so, would not require samples, use estimated QF
    • new type = cdf: samples not needed, use estimated CDF
    • new type = pmf: samples not needed, use estimated CDF. Some data structure would be required specifying bin endpoints, or some convention for the names of the output_type_ids for bins, e.g. (0,0.1].
    • new type = sample: samples needed
    • new type = median: samples not needed, use estimated QF
    • new type = mean: samples needed
  • starting type = cdf
    • new type = quantile: samples not needed, use estimated QF
    • new type = cdf: do we support this? should we? if so, would not require samples, use estimated CDF
    • new type = pmf: samples not needed, use estimated CDF
    • new type = sample: samples needed
    • new type = median: samples not needed, use estimated QF
    • new type = mean: samples needed
  • starting type = pmf.
    • overall note, i think that for the most part, transforms here only make sense if the variable is a discretization of an underlying continuous variable and we get some information about the bin endpoints used for the discretization. but in that setting, i think the transformation options match what was said above for starting type = cdf.
    • though if the variable is nominal, we could go to the CDF type naturally.
  • starting type = sample
    • you already have samples!
  • starting type = median
    • can't do any transformations other than the trivial transform to quantile at level 0.5
  • starting type = mean
    • can't do any transformations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this with Li just now, and arrived at a plan to merge this PR more or less as-is, then do the following in separate issues:

  • Allow new_type = sample for existing probabilistic starting types where they are not supported (quantile and cdf)
  • Allow new_type = pmf for existing probabilistic starting types (quantile, cdf, and sample).
    • Maybe at the same time, move away from using samples as intermediary for new type = cdf. This would get us to a place where the same methodology is used when the new output type is pmf and cdf.
    • Maybe at the same time, support cdf -> cdf, e.g. extending the set of points where the cdf is evaluated. This feels like a low priority extension, not sure there is really a use case?
  • Allow starting_type = pmf
  • Lower priority issue, support quantile --> quantile transformations, e.g. extending the set of quantile levels predicted. This could be useful to modelers who are doing something like quantile regression, to allow them to get predictions at a small number of quantile levels and then fill in the others.
    • At this time, could switch to using QF as an intermediary when new_type = quantile or median rather than samples.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds great, thank you both for pushing this across the finish line!

# validations
starting_output_type <- unique(model_out_tbl$output_type)
starting_output_type_ids <- unique(model_out_tbl$output_type_id)
task_id_cols <- get_task_id_cols(model_out_tbl)
lshandross marked this conversation as resolved.
Show resolved Hide resolved
validate_new_output_type(
starting_output_type, new_output_type,
new_output_type_id
)
# for cdf and quantile functions, get samples
if (starting_output_type == "cdf") {
# estimate from samples
model_out_tbl <- get_samples_from_cdf(model_out_tbl, task_id_cols, n_samples)
} else if (starting_output_type == "quantile") {
# if median output desired, and Q50 provided return exact value, otherwise
# estimate from samples
if (any(new_output_type != "median") || !(0.5 %in% starting_output_type_ids)) {
model_out_tbl <- model_out_tbl %>%
get_samples_from_quantiles(task_id_cols, n_samples) %>%
rbind(model_out_tbl)
}
}
# transform based on new_output_type
model_out_tbl_transform <- vector("list", length = length(new_output_type))
for (i in seq_along(new_output_type)) {
# first find new_output_type_id
new_output_type_id_tmp <- new_output_type_id
if (new_output_type[i] %in% c("mean", "median")) {
new_output_type_id_tmp <- NA
} else if (is.list(new_output_type_id)) {
new_output_type_id_tmp <- new_output_type_id[[new_output_type[i]]]
}
# if median output desired, and Q50 provided return exact value
if (new_output_type[i] == "median" && 0.5 %in% starting_output_type_ids) {
model_out_tbl_transform[[i]] <- model_out_tbl %>%
dplyr::filter(
.data[["output_type"]] != "sample",
.data[["output_type_id"]] == 0.5
) %>%
dplyr::mutate(
output_type = new_output_type[i],
output_type_id = NA
) %>%
as_model_out_tbl()
} else { # otherwise calculate new values
grouped_model_out_tbl <- model_out_tbl %>%
dplyr::filter(.data[["output_type"]] == "sample") %>%
dplyr::group_by(dplyr::across(dplyr::all_of(c("model_id", task_id_cols))))
model_out_tbl_transform[[i]] <- grouped_model_out_tbl %>%
convert_from_sample(new_output_type[i], new_output_type_id_tmp) %>%
dplyr::ungroup()
}
}
return(dplyr::bind_rows(model_out_tbl_transform))
}

#' @noRd
validate_new_output_type <- function(starting_output_type, new_output_type,
new_output_type_id) {
# check only one starting_output_type is provided
if (length(starting_output_type) != 1) {
cli::cli_abort(c("Only one {.var starting_output_type} may be provided"))
}
valid_conversions <- list(
"sample" = c("mean", "median", "quantile", "cdf"),
"quantile" = c("mean", "median", "cdf"),
"cdf" = c("mean", "median", "quantile")
)
# check starting_output_type is supported
lshandross marked this conversation as resolved.
Show resolved Hide resolved
valid_starting_output_type <- starting_output_type %in% names(valid_conversions)
if (!valid_starting_output_type) {
cli::cli_abort(c(
"{.var output_type} {.val {starting_output_type}} provided cannot be transformed",
i = "must be of type {.val sample}, {.val quantile} or {.val cdf}."
))
}
# check new_output_type is supported
invalid_new_output_type <- which(!(new_output_type %in% valid_conversions[[starting_output_type]]))
if (length(invalid_new_output_type) > 0) {
cli::cli_abort(c(
"Output type {.val {starting_output_type}} cannot be converted to the specified
{.val {new_output_type[invalid_new_output_type]}}",
i = "{.var new_output_type} values must be one of {.val {valid_conversions[[starting_output_type]]}}"
))
}
# check new_output_type_id
if (length(new_output_type) == 1) {
validate_new_output_type_id(new_output_type, new_output_type_id)
} else if (length(new_output_type > 1)) {
purrr::map(.x = new_output_type,
~ validate_new_output_type_id(new_output_type = .x,
new_output_type_id = new_output_type_id[[.x]]))
}
}

#' @noRd
validate_new_output_type_id <- function(new_output_type, new_output_type_id) {
if (new_output_type %in% c("mean", "median") && !all(is.na(new_output_type_id))) {
cli::cli_abort(c(
"{.var new_output_type_id} is incompatible with {.var new_output_type}",
i = "{.var new_output_type_id} should be {.var NA}"
))
lshandross marked this conversation as resolved.
Show resolved Hide resolved
} else if (new_output_type == "quantile") {
new_output_type_id_quantile <- new_output_type_id
if (!is.numeric(new_output_type_id_quantile)) {
cli::cli_abort(c(
"elements of {.var new_output_type_id} should be numeric",
i = "elements of {.var new_output_type_id} represent quantiles
of the predictive distribution"
))
}
if (any(new_output_type_id_quantile < 0) || any(new_output_type_id_quantile > 1)) {
cli::cli_abort(c(
"elements of {.var new_output_type_id} should be between 0 and 1",
i = "elements of {.var new_output_type_id} represent quantiles
of the predictive distribution"
))
}
} else if (new_output_type == "cdf") {
new_output_type_id_cdf <- new_output_type_id
if (!is.numeric(new_output_type_id_cdf)) {
cli::cli_abort(c(
"elements of {.var new_output_type_id} should be numeric",
i = "elements of {.var new_output_type_id} represent possible
values of the target variable"
))
}
}
}

#' @noRd
get_samples_from_quantiles <- function(model_out_tbl, task_id_cols, n_samples, ...) {
if (!requireNamespace("distfromq")) {
cli::cli_abort(
c("x" = "{.pkg distfromq} must be installed to convert {.val cdf}
or {.val quantile} output types.",
"i" = "Use {.code remotes::install_github('reichlab/distfromq')} to install."
)
)
}

samples <- model_out_tbl %>%
dplyr::group_by(dplyr::across(dplyr::all_of(c("model_id", task_id_cols)))) %>%
dplyr::reframe(
value = distfromq::make_q_fn(
ps = as.numeric(.data[["output_type_id"]]),
qs = .data[["value"]], ...
)(stats::runif(n_samples, 0, 1))
) %>%
dplyr::ungroup()
split_samples <- split(samples, f = samples[[task_id_cols]])
formatted_samples <- split_samples %>%
purrr::map(.f = function(split_samples) {
dplyr::mutate(split_samples,
output_type = "sample",
output_type_id = as.numeric(dplyr::row_number()),
.before = "value")
}) %>%
purrr::list_rbind() %>%
as_model_out_tbl()
return(formatted_samples)
}

#' @noRd
get_samples_from_cdf <- function(model_out_tbl, task_id_cols, n_samples, ...) {
if (!requireNamespace("distfromq")) {
cli::cli_abort(
c("x" = "{.pkg distfromq} must be installed to convert {.val cdf}
or {.val quantile} output types.",
"i" = "Use {.code remotes::install_github('reichlab/distfromq')} to install."
)
)
}

samples <- model_out_tbl %>%
dplyr::group_by(dplyr::across(dplyr::all_of(c("model_id", task_id_cols)))) %>%
dplyr::reframe(
value = distfromq::make_q_fn(
ps = .data[["value"]],
qs = as.numeric(.data[["output_type_id"]]), ...
)(stats::runif(n_samples, 0, 1))
) %>%
dplyr::ungroup()
split_samples <- split(samples, f = samples[[task_id_cols]])
formatted_samples <- split_samples %>%
purrr::map(.f = function(split_samples) {
dplyr::mutate(split_samples,
output_type = "sample",
output_type_id = as.numeric(dplyr::row_number()),
.before = "value")
}) %>%
purrr::list_rbind() %>%
as_model_out_tbl()
return(formatted_samples)
}

#' @noRd
convert_from_sample <- function(grouped_model_out_tbl, new_output_type,
new_output_type_id) {
if (new_output_type == "mean") {
model_out_tbl_transform <- grouped_model_out_tbl %>%
dplyr::reframe(
value = mean(.data[["value"]]),
output_type_id = new_output_type_id
)
} else if (new_output_type == "median") {
model_out_tbl_transform <- grouped_model_out_tbl %>%
dplyr::reframe(
value = mean(.data[["value"]]),
output_type_id = new_output_type_id
)
} else if (new_output_type == "quantile") {
model_out_tbl_transform <- grouped_model_out_tbl %>%
dplyr::reframe(
value = stats::quantile(.data[["value"]], as.numeric(new_output_type_id), names = FALSE),
output_type_id = new_output_type_id
)
} else if (new_output_type == "cdf") {
model_out_tbl_transform <- grouped_model_out_tbl %>%
dplyr::reframe(
value = stats::ecdf(.data[["value"]])(as.numeric(new_output_type_id)),
output_type_id = new_output_type_id
)
}
# update output_type and output_type_id columns
model_out_tbl_transform <- model_out_tbl_transform %>%
dplyr::mutate(output_type = new_output_type) %>%
as_model_out_tbl()
return(model_out_tbl_transform)
}
11 changes: 11 additions & 0 deletions R/get_task_id_cols.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
#' Get task ID column names from `model_out_tbl` object
#'
#' @param model_out_tbl an object of class `model_out_tbl`
#'
#' @return a character vector of task ID column names
#' @export
get_task_id_cols <- function(model_out_tbl) {
model_out_cols <- colnames(model_out_tbl)
task_id_cols <- model_out_cols[!model_out_cols %in% std_colnames]
return(task_id_cols)
}
Loading
Loading