-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add output type conversion functions #150
Open
eahowerton
wants to merge
46
commits into
main
Choose a base branch
from
output-type-conversion2
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
46 commits
Select commit
Hold shift + click to select a range
49eab2b
start transformation functions
eahowerton bca3fab
Add documentation
eahowerton 8dd2ad8
switch from summarize() to reframe()
eahowerton 48dfffb
add preliminary validations
eahowerton 71dacb9
add validations of new_output_type_id
eahowerton 9b81d7a
update documentation, lint
eahowerton d8f9e4b
Update DESCRIPTION
eahowerton 5b9e311
Create get_task_id_cols.R
eahowerton 2a6c595
use get_task_id_cols()
eahowerton 10ff653
Create get_task_id_cols.R (#149)
eahowerton 7243570
use get_task_id_cols()
eahowerton 243cdb5
Merge branch 'output-type-conversion' of https://github.com/Infectiou…
eahowerton cc4e002
switch model_outputs to model_out_tbl
eahowerton 4db0992
add functionality for multiple new output_types
eahowerton bee1c61
lint
eahowerton 94e1b40
Merge branch 'main' into output-type-conversion2
lshandross 28a6be6
Update DESCRIPTION
lshandross cd82801
Rename `convert_output_types()` title
lshandross 8bfee20
Punctuation fixes
lshandross 727d33d
Remove within package `hubUtils` namespacing
lshandross d07d985
Update DESCRIPTION
lshandross 2d7d19c
Update NAMESPACE
lshandross 0f73f84
Update convert_output_types.R
lshandross 944404d
Fix linting issues
lshandross 94fa108
Document `convert_output_type()`
lshandross ac6333a
Update DESCRIPTION
lshandross e8595e1
Fix failing tests
lshandross 781016d
Improve output type conversion validation messages
lshandross 1a9789e
Define global functions
lshandross 84ff4aa
Clean up `convert_output_types` functions
lshandross 9fa2e35
Don't export smaller conversion functions
lshandross db08c45
Document `get_task_id_cols()`
lshandross 75df987
Fix `convert_output_type()` example
lshandross ee6b098
Clarify `convert_output_type()` description and return value
lshandross de84f22
Fix linting issues
lshandross d4119e2
Add single starting output type validation and test
lshandross 1e48584
Remove duplicate test
lshandross 66ad690
Remove seed and check for `distfromq`
lshandross 4b8ed54
Fix failing test
lshandross e23e047
Refactor new `validate_output_type_id()` function
lshandross 6dd2926
Update documentation
lshandross 95d6f28
Testing fixes
lshandross 254b737
Merge branch 'main' into output-type-conversion2
lshandross 522e88f
model_out_tbl output type col is character in `convert_output_type()`…
lshandross 054bffb
Make `new_output_type_id` arg usually a named list; simplify its vali…
lshandross cbe2acf
Fix quantile -> median, mean/cdf transformation (use intermediary sam…
lshandross File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,308 @@ | ||
#' Transform between output types | ||
#' | ||
#' Transform between output types for each unique combination of task IDs for | ||
#' each model. Conversion must be from a single starting output type to one or more | ||
#' new output types, and the resulting output will only contain the new output types. | ||
#' See details for supported conversions. | ||
#' | ||
#' @param model_out_tbl an object of class `model_out_tbl` containing predictions | ||
#' with a single, unique value in the `output_type` column. | ||
#' @param new_output_type character vector of the desired output type(s) after | ||
#' transformation. May contain any of the following output types: | ||
#' `"mean"`, `"median"`, `"quantile"`, `"cdf"`. | ||
#' @param new_output_type_id A named list indicating the desired output type IDs | ||
#' for each new output type, in which each element is a vector of output type IDs. | ||
#' If only one new output type is requested, then it may be a single numeric vector | ||
#' (for `"quantile"` or `"cdf"`) or not required (`"mean"` or `"median"`). See | ||
#' the examples for an illustration of both cases. | ||
#' @param n_samples `numeric` that specifies the number of samples to use when | ||
#' calculating output_types from an estimated quantile function. Defaults to `1e4`. | ||
#' @param ... parameters that are passed to `distfromq::make_q_fn`, specifying | ||
#' details of how to estimate a quantile function from provided quantile levels | ||
#' and quantile values for `"quantile"` or `"cdf"` output types. | ||
#' | ||
#' @details | ||
#' The following transformations are supported: (i) `"sample"` can be | ||
#' transformed to `"mean"`, `"median"`, `"quantile"`, or `"cdf"`; (ii) | ||
#' `"quantile"` can be transformed to `"mean"`, `"median"`, or `"cdf"`; and | ||
#' (iii) `"cdf"` can be transformed to `"mean"`, `"median"`, or `"quantile"`. | ||
#' | ||
#' For `"quantile"` and `"cdf"` starting output types, we use the following approach: | ||
#' 1. Interpolate and extrapolate from the provided quantiles or probabilities | ||
#' for each component model to obtain an estimate of the cdf of that distribution. | ||
#' 2. Draw samples from the distribution for each component model. To reduce | ||
#' Monte Carlo variability, we use quasi-random samples corresponding to | ||
#' quantiles of the estimated distribution. | ||
#' 3. Calculate the desired quantity (e.g., mean). | ||
#' If the median quantile is provided in the `model_out_tbl` object (i.e., the | ||
#' original output_type is `"median"` and 0.5 is contained in the original | ||
#' output_type_id), the median value is extracted and returned directly. | ||
#' | ||
#' If both `"quantile"` and `"cdf"` output_types are desired, `new_output_type_id` | ||
#' should be a named list, where each element specifies the corresponding | ||
#' `new_output_type_id`. See examples for an illustration. | ||
#' | ||
#' @examples | ||
#' # We illustrate the conversion between output types using normal distributions | ||
#' ex_quantiles <- c(0.25, 0.5, 0.75) | ||
#' model_out_tbl <- expand.grid( | ||
#' stringsAsFactors = FALSE, | ||
#' group1 = c(1,2), | ||
#' model_id = "A", | ||
#' output_type = "quantile", | ||
#' output_type_id = ex_quantiles | ||
#' ) %>% | ||
#' dplyr::mutate(value = qnorm(p = output_type_id, mean = group1)) | ||
#' | ||
#' convert_output_type(model_out_tbl, new_output_type = "median", new_output_type_id = NA) | ||
#' | ||
#' # Next, we illustrate conversion from samples to quantile and cdf | ||
#' ex_bins <- seq(-2,2,1) | ||
#' ex_quantiles <- c(0.25, 0.5, 0.75) | ||
#' model_out_tbl <- expand.grid( | ||
#' stringsAsFactors = FALSE, | ||
#' group1 = c(1,2), | ||
#' model_id = "A", | ||
#' output_type = "sample", | ||
#' output_type_id = 1:100 | ||
#' ) %>% | ||
#' dplyr::mutate(value = rnorm(200, mean = group1)) | ||
#' | ||
#' convert_output_type(model_out_tbl, new_output_type = c("quantile", "cdf"), | ||
#' new_output_type_id = list("quantile" = ex_quantiles, "cdf" = ex_bins)) | ||
#' | ||
#' @return object of class `model_out_tbl` containing (only) the new output_type(s) | ||
#' for each unique combination of task IDs for each model | ||
#' @export | ||
#' @importFrom rlang .data | ||
convert_output_type <- function(model_out_tbl, new_output_type, | ||
new_output_type_id = NA, n_samples = 1e4, ...) { | ||
# validations | ||
starting_output_type <- unique(model_out_tbl$output_type) | ||
starting_output_type_ids <- unique(model_out_tbl$output_type_id) | ||
task_id_cols <- get_task_id_cols(model_out_tbl) | ||
lshandross marked this conversation as resolved.
Show resolved
Hide resolved
|
||
validate_new_output_type( | ||
starting_output_type, new_output_type, | ||
new_output_type_id | ||
) | ||
# for cdf and quantile functions, get samples | ||
if (starting_output_type == "cdf") { | ||
# estimate from samples | ||
model_out_tbl <- get_samples_from_cdf(model_out_tbl, task_id_cols, n_samples) | ||
} else if (starting_output_type == "quantile") { | ||
# if median output desired, and Q50 provided return exact value, otherwise | ||
# estimate from samples | ||
if (any(new_output_type != "median") || !(0.5 %in% starting_output_type_ids)) { | ||
model_out_tbl <- model_out_tbl %>% | ||
get_samples_from_quantiles(task_id_cols, n_samples) %>% | ||
rbind(model_out_tbl) | ||
} | ||
} | ||
# transform based on new_output_type | ||
model_out_tbl_transform <- vector("list", length = length(new_output_type)) | ||
for (i in seq_along(new_output_type)) { | ||
# first find new_output_type_id | ||
new_output_type_id_tmp <- new_output_type_id | ||
if (new_output_type[i] %in% c("mean", "median")) { | ||
new_output_type_id_tmp <- NA | ||
} else if (is.list(new_output_type_id)) { | ||
new_output_type_id_tmp <- new_output_type_id[[new_output_type[i]]] | ||
} | ||
# if median output desired, and Q50 provided return exact value | ||
if (new_output_type[i] == "median" && 0.5 %in% starting_output_type_ids) { | ||
model_out_tbl_transform[[i]] <- model_out_tbl %>% | ||
dplyr::filter( | ||
.data[["output_type"]] != "sample", | ||
.data[["output_type_id"]] == 0.5 | ||
) %>% | ||
dplyr::mutate( | ||
output_type = new_output_type[i], | ||
output_type_id = NA | ||
) %>% | ||
as_model_out_tbl() | ||
} else { # otherwise calculate new values | ||
grouped_model_out_tbl <- model_out_tbl %>% | ||
dplyr::filter(.data[["output_type"]] == "sample") %>% | ||
dplyr::group_by(dplyr::across(dplyr::all_of(c("model_id", task_id_cols)))) | ||
model_out_tbl_transform[[i]] <- grouped_model_out_tbl %>% | ||
convert_from_sample(new_output_type[i], new_output_type_id_tmp) %>% | ||
dplyr::ungroup() | ||
} | ||
} | ||
return(dplyr::bind_rows(model_out_tbl_transform)) | ||
} | ||
|
||
#' @noRd | ||
validate_new_output_type <- function(starting_output_type, new_output_type, | ||
new_output_type_id) { | ||
# check only one starting_output_type is provided | ||
if (length(starting_output_type) != 1) { | ||
cli::cli_abort(c("Only one {.var starting_output_type} may be provided")) | ||
} | ||
valid_conversions <- list( | ||
"sample" = c("mean", "median", "quantile", "cdf"), | ||
"quantile" = c("mean", "median", "cdf"), | ||
"cdf" = c("mean", "median", "quantile") | ||
) | ||
# check starting_output_type is supported | ||
lshandross marked this conversation as resolved.
Show resolved
Hide resolved
|
||
valid_starting_output_type <- starting_output_type %in% names(valid_conversions) | ||
if (!valid_starting_output_type) { | ||
cli::cli_abort(c( | ||
"{.var output_type} {.val {starting_output_type}} provided cannot be transformed", | ||
i = "must be of type {.val sample}, {.val quantile} or {.val cdf}." | ||
)) | ||
} | ||
# check new_output_type is supported | ||
invalid_new_output_type <- which(!(new_output_type %in% valid_conversions[[starting_output_type]])) | ||
if (length(invalid_new_output_type) > 0) { | ||
cli::cli_abort(c( | ||
"Output type {.val {starting_output_type}} cannot be converted to the specified | ||
{.val {new_output_type[invalid_new_output_type]}}", | ||
i = "{.var new_output_type} values must be one of {.val {valid_conversions[[starting_output_type]]}}" | ||
)) | ||
} | ||
# check new_output_type_id | ||
if (length(new_output_type) == 1) { | ||
validate_new_output_type_id(new_output_type, new_output_type_id) | ||
} else if (length(new_output_type > 1)) { | ||
purrr::map(.x = new_output_type, | ||
~ validate_new_output_type_id(new_output_type = .x, | ||
new_output_type_id = new_output_type_id[[.x]])) | ||
} | ||
} | ||
|
||
#' @noRd | ||
validate_new_output_type_id <- function(new_output_type, new_output_type_id) { | ||
if (new_output_type %in% c("mean", "median") && !all(is.na(new_output_type_id))) { | ||
cli::cli_abort(c( | ||
"{.var new_output_type_id} is incompatible with {.var new_output_type}", | ||
i = "{.var new_output_type_id} should be {.var NA}" | ||
)) | ||
lshandross marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} else if (new_output_type == "quantile") { | ||
new_output_type_id_quantile <- new_output_type_id | ||
if (!is.numeric(new_output_type_id_quantile)) { | ||
cli::cli_abort(c( | ||
"elements of {.var new_output_type_id} should be numeric", | ||
i = "elements of {.var new_output_type_id} represent quantiles | ||
of the predictive distribution" | ||
)) | ||
} | ||
if (any(new_output_type_id_quantile < 0) || any(new_output_type_id_quantile > 1)) { | ||
cli::cli_abort(c( | ||
"elements of {.var new_output_type_id} should be between 0 and 1", | ||
i = "elements of {.var new_output_type_id} represent quantiles | ||
of the predictive distribution" | ||
)) | ||
} | ||
} else if (new_output_type == "cdf") { | ||
new_output_type_id_cdf <- new_output_type_id | ||
if (!is.numeric(new_output_type_id_cdf)) { | ||
cli::cli_abort(c( | ||
"elements of {.var new_output_type_id} should be numeric", | ||
i = "elements of {.var new_output_type_id} represent possible | ||
values of the target variable" | ||
)) | ||
} | ||
} | ||
} | ||
|
||
#' @noRd | ||
get_samples_from_quantiles <- function(model_out_tbl, task_id_cols, n_samples, ...) { | ||
if (!requireNamespace("distfromq")) { | ||
cli::cli_abort( | ||
c("x" = "{.pkg distfromq} must be installed to convert {.val cdf} | ||
or {.val quantile} output types.", | ||
"i" = "Use {.code remotes::install_github('reichlab/distfromq')} to install." | ||
) | ||
) | ||
} | ||
|
||
samples <- model_out_tbl %>% | ||
dplyr::group_by(dplyr::across(dplyr::all_of(c("model_id", task_id_cols)))) %>% | ||
dplyr::reframe( | ||
value = distfromq::make_q_fn( | ||
ps = as.numeric(.data[["output_type_id"]]), | ||
qs = .data[["value"]], ... | ||
)(stats::runif(n_samples, 0, 1)) | ||
) %>% | ||
dplyr::ungroup() | ||
split_samples <- split(samples, f = samples[[task_id_cols]]) | ||
formatted_samples <- split_samples %>% | ||
purrr::map(.f = function(split_samples) { | ||
dplyr::mutate(split_samples, | ||
output_type = "sample", | ||
output_type_id = as.numeric(dplyr::row_number()), | ||
.before = "value") | ||
}) %>% | ||
purrr::list_rbind() %>% | ||
as_model_out_tbl() | ||
return(formatted_samples) | ||
} | ||
|
||
#' @noRd | ||
get_samples_from_cdf <- function(model_out_tbl, task_id_cols, n_samples, ...) { | ||
if (!requireNamespace("distfromq")) { | ||
cli::cli_abort( | ||
c("x" = "{.pkg distfromq} must be installed to convert {.val cdf} | ||
or {.val quantile} output types.", | ||
"i" = "Use {.code remotes::install_github('reichlab/distfromq')} to install." | ||
) | ||
) | ||
} | ||
|
||
samples <- model_out_tbl %>% | ||
dplyr::group_by(dplyr::across(dplyr::all_of(c("model_id", task_id_cols)))) %>% | ||
dplyr::reframe( | ||
value = distfromq::make_q_fn( | ||
ps = .data[["value"]], | ||
qs = as.numeric(.data[["output_type_id"]]), ... | ||
)(stats::runif(n_samples, 0, 1)) | ||
) %>% | ||
dplyr::ungroup() | ||
split_samples <- split(samples, f = samples[[task_id_cols]]) | ||
formatted_samples <- split_samples %>% | ||
purrr::map(.f = function(split_samples) { | ||
dplyr::mutate(split_samples, | ||
output_type = "sample", | ||
output_type_id = as.numeric(dplyr::row_number()), | ||
.before = "value") | ||
}) %>% | ||
purrr::list_rbind() %>% | ||
as_model_out_tbl() | ||
return(formatted_samples) | ||
} | ||
|
||
#' @noRd | ||
convert_from_sample <- function(grouped_model_out_tbl, new_output_type, | ||
new_output_type_id) { | ||
if (new_output_type == "mean") { | ||
model_out_tbl_transform <- grouped_model_out_tbl %>% | ||
dplyr::reframe( | ||
value = mean(.data[["value"]]), | ||
output_type_id = new_output_type_id | ||
) | ||
} else if (new_output_type == "median") { | ||
model_out_tbl_transform <- grouped_model_out_tbl %>% | ||
dplyr::reframe( | ||
value = mean(.data[["value"]]), | ||
output_type_id = new_output_type_id | ||
) | ||
} else if (new_output_type == "quantile") { | ||
model_out_tbl_transform <- grouped_model_out_tbl %>% | ||
dplyr::reframe( | ||
value = stats::quantile(.data[["value"]], as.numeric(new_output_type_id), names = FALSE), | ||
output_type_id = new_output_type_id | ||
) | ||
} else if (new_output_type == "cdf") { | ||
model_out_tbl_transform <- grouped_model_out_tbl %>% | ||
dplyr::reframe( | ||
value = stats::ecdf(.data[["value"]])(as.numeric(new_output_type_id)), | ||
output_type_id = new_output_type_id | ||
) | ||
} | ||
# update output_type and output_type_id columns | ||
model_out_tbl_transform <- model_out_tbl_transform %>% | ||
dplyr::mutate(output_type = new_output_type) %>% | ||
as_model_out_tbl() | ||
return(model_out_tbl_transform) | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
#' Get task ID column names from `model_out_tbl` object | ||
#' | ||
#' @param model_out_tbl an object of class `model_out_tbl` | ||
#' | ||
#' @return a character vector of task ID column names | ||
#' @export | ||
get_task_id_cols <- function(model_out_tbl) { | ||
model_out_cols <- colnames(model_out_tbl) | ||
task_id_cols <- model_out_cols[!model_out_cols %in% std_colnames] | ||
return(task_id_cols) | ||
} |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic of this function, where everything is converted to samples and from there to the target output type, has a certain kind of elegance and at the same time is kind of inefficient. For example, here's what we do when the
starting_output_type
is "quantile" and thenew_output_type
is "cdf":get_samples_from_quantiles
, we calldistfromq::make_q_fn
to estimate the quantile function and draw samples from itdistfromq::make_q_fn
estimates the cdf and inverts thatRestating, our goal was to get to cdf values, and in order to do that we (1) estimate the cdf; (2) invert that to get to a quantile function estimate; (3) draw samples from the quantile function; (4) get the empirical cdf of those samples; (5) evaluate that empirical cdf. Although it is conceptually clean to work through the "universal representation" of samples, steps 2-5 here are extra work and in particular the sample-based stuff in steps 3-5 will introduce more layers of approximation to the result.
Another alternative organization to this code might provide a set of "direct" transformations that don't use samples as an intermediate step. For example, the quantile to cdf transform might define a helper function along the lines of
transform_quantile_to_cdf_one_group
along these lines:Which could then be called on a grouped data frame somewhere, like
I think this approach would likely involve more code in the end, but could be nice as it would be more direct and involve less approximation. I'm open to your ideas about this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally, I had conceptualized a set of transformations like
transform_quantile_to_cdf_one_group()
that you describe @elray1. When most required transforming to samples, I decided for the elegant route. But I agree, there is additional approximation that is not necessary, and would ideally be avoided.Question if we opt for the individual transformation functions: consider, for example, we are transforming from quantile to mean. The steps are (1) estimate the cdf, (2) invert to quantile function, (3) draw samples from quantile function, (4) calculate mean. Would this function call the
transform_quantile_to_samples()
function? Or would it replicate steps 1-3 within thetransform_quantile_to mean()
function?@lshandross, I'm happy for you to make the final decision on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was a good prompt for me to think about which transformations require samples, estimates of the QF, and/or estimates of the CDF. Here's my thinking, does this seem corret?
output_type_id
s for bins, e.g.(0,0.1]
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed this with Li just now, and arrived at a plan to merge this PR more or less as-is, then do the following in separate issues:
starting_type
= pmfThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds great, thank you both for pushing this across the finish line!