
Add output type conversion functions #150

Open

eahowerton wants to merge 46 commits into main
Conversation

eahowerton

This PR introduces two main functions:

  1. Convert between output types: convert_output_types() takes a model_out_tbl with a single output type and returns a new model_out_tbl with the desired output type(s). The idea is to create a user-facing wrapper function that lives in hubData and can process multiple starting output types.
  2. Get the names of task ID columns: this will replace some functionality of validate_ensemble_inputs() in hubEnsembles (related to "Refactor to use get_task_ids_tbl function for task id validation", hubEnsembles#111), and is potentially useful in other packages (e.g., hubUtils).

@annakrystalli (Collaborator) left a comment

Overall looking good!

I've made a lot of minor/stylistic comments, which you should feel free to ignore!

I've also made some suggestions for tackling a couple of more salient points.

(Several resolved review threads on DESCRIPTION and R/convert_output_types.R, now outdated.)
value = distfromq::make_q_fn(
ps = .data$value,
qs = as.numeric(.data$output_type_id), ...
)(runif(n_samples, 0, 1))
Collaborator

What is )(runif(n_samples, 0, 1)) doing?

(More resolved review threads on R/convert_output_types.R.)
@elray1 (Contributor) left a comment

This is a really nice start at this functionality! I had one big question about whether it might make sense to avoid using samples as an intermediate representation of the distribution in all settings, and some other smaller comments.

(Resolved review thread on R/convert_output_types.R.)
} else if (starting_output_type == "quantile") {
# if median output desired, and Q50 provided return exact value, otherwise
# estimate from samples
if (!("median" %in% new_output_type && 0.5 %in% starting_output_type_ids)) {
Contributor

It seems like we might still want this calculation if, for example, new_output_type = c("median", "cdf")

@lshandross (Contributor) commented Sep 13, 2024

From the placement of the parentheses, this calculation does occur in that case, because the condition simplifies to !("median" %in% new_output_type) || !(0.5 %in% starting_output_type_ids)

Contributor

Clarifying/restating: what if we have a situation where:

  • starting_output_type = "quantile", with starting_output_type_ids = c(0.025, 0.25, 0.5, 0.75, 0.975)
  • new_output_type = c("median", "cdf")

In this case, !("median" %in% new_output_type) and !(0.5 %in% starting_output_type_ids) both evaluate to FALSE and so the condition does not run, right? But in that case we would still need the representation in terms of samples in order to get to the cdf outputs (if we stick with a workflow that uses samples as an intermediate).
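For concreteness, the guard condition can be checked directly in R with the values from this example:

```r
new_output_type <- c("median", "cdf")
starting_output_type_ids <- c(0.025, 0.25, 0.5, 0.75, 0.975)

# The guard around the sample-generation branch:
!("median" %in% new_output_type && 0.5 %in% starting_output_type_ids)
# FALSE: the branch is skipped, even though cdf still needs samples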

@lshandross (Contributor) commented Oct 4, 2024

Further investigation revealed that your example with median and cdf new output types indeed led to a case where no samples were estimated; the cdf values were instead estimated from the provided quantiles using stats::ecdf().

I went in and made some changes to the function so that samples are used as an intermediary for consistency, since the plan is to merge this pull request first before switching how we perform the transformation, as detailed in your other comment below (commit cbe2acf).

Comment on lines 98 to 100
if (new_output_type[i] == "median" && 0.5 %in% starting_output_type_ids) {
model_out_tbl_transform[[i]] <- model_out_tbl %>%
dplyr::filter(output_type_id == 0.5) %>%
Contributor

I think this is probably fine, so may not be worth updating -- but it seems like there may be an edge case where one model does not provide a predictive quantile at the level 0.5, or they provide some output_type_id like 0.49999999999999999999999 that doesn't evaluate as equal to 0.5, in which case we might end up without a row with a median where that model did provide forecasts.

I think the conditions of that edge case would be pretty rare, so again, not sure it's worth coding for them...

Contributor

Maybe we could file a low priority issue?

Author

Agree, maybe not the most important case to handle. And I might argue that the user should address this (rather than us making assumptions about what is "sufficiently" close to 0.5 to be considered a "median"), but I don't feel strongly about this!

(Resolved review thread on R/convert_output_types.R.)
Comment on lines 193 to 196
value = distfromq::make_q_fn(
ps = as.numeric(.data$output_type_id),
qs = .data$value, ...
)(runif(n_samples, 0, 1))
Contributor

A couple of notes here:

  • This is actually exactly the calculation that's done in the function that's returned by distfromq::make_r_fn. So you could do this in a slightly easier-to-read way by calling
value = distfromq::make_r_fn(
        ps = as.numeric(.data$output_type_id),
        qs = .data$value, ...
      )(n_samples)
  • But these are pseudo-random samples rather than quasi-random samples. I actually think the distinction here is probably of low importance. In case someone is converting to output type sample and they want the randomness they'd expect from that, maybe it's best to keep this behavior rather than updating to quasi-random samples. Then we should just update the documentation for the main function above to remove the line "To reduce Monte Carlo variability, we use quasi-random samples corresponding to quantiles of the estimated distribution."
  • Otherwise, if we do want to stick with quasi-random samples, we could use a line like the following here:
value = distfromq::make_q_fn(
        ps = as.numeric(.data$output_type_id),
        qs = .data$value, ...
      )(seq(from = 0, to = 1, length = n_samples + 2)[1:(n_samples+1)])
  • To address a comment I saw somewhere else from Anna about what's going on here, it might be helpful to refactor this into a separate function that takes .data$output_type_id and .data$value as arguments and has two lines: one calling distfromq::make_*_fn and storing that result in q_fn or r_fn (depending on which way you prefer to go), and a second calling that function. (this refactoring may be less necessary if you switch to just using make_r_fn since that one is easier to read)
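A sketch of the refactoring suggested in the last bullet, using make_r_fn for readability (the helper name is illustrative, not from the PR):

```r
# Illustrative helper: estimate a sampler from a set of quantiles, then
# draw pseudo-random samples from it. The function name is hypothetical;
# distfromq::make_r_fn is the call suggested in the first bullet above.
sample_from_quantiles <- function(output_type_id, value, n_samples, ...) {
  r_fn <- distfromq::make_r_fn(
    ps = as.numeric(output_type_id),
    qs = value,
    ...
  )
  r_fn(n_samples)
}
```

Splitting the estimation step (make_r_fn) from the sampling step (calling r_fn) is what makes the `)(...)` pattern in the original code easier to read.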

Author

I support changing to make_r_fn(), and agree that, following the intuition of samples, we probably want to default to pseudo-random samples. We could also file a low-priority issue to make the sampling scheme a user-specified argument? I could imagine use cases of the function where each type of sampling may be desired.

Comment on lines +70 to +71
convert_output_type <- function(model_out_tbl, new_output_type,
new_output_type_id = NA, n_samples = 1e4, ...) {
Contributor

The logic of this function, where everything is converted to samples and from there to the target output type, has a certain kind of elegance and at the same time is kind of inefficient. For example, here's what we do when the starting_output_type is "quantile" and the new_output_type is "cdf":

  1. In get_samples_from_quantiles, we call distfromq::make_q_fn to estimate the quantile function and draw samples from it.
  2. Then we compute the empirical cdf of those samples and evaluate that ecdf at the requested points.

Restating, our goal was to get to cdf values, and in order to do that we (1) estimate the cdf; (2) invert that to get to a quantile function estimate; (3) draw samples from the quantile function; (4) get the empirical cdf of those samples; (5) evaluate that empirical cdf. Although it is conceptually clean to work through the "universal representation" of samples, steps 2-5 here are extra work and in particular the sample-based stuff in steps 3-5 will introduce more layers of approximation to the result.

Another alternative organization of this code might provide a set of "direct" transformations that don't use samples as an intermediate step. For example, the quantile-to-cdf transform might define a helper function like transform_quantile_to_cdf_one_group:

transform_quantile_to_cdf_one_group <- function(starting_ps, starting_qs, to_qs, ...) {
  p_fn <- distfromq::make_p_fn(
    ps = as.numeric(starting_ps),
    qs = starting_qs,
    ...
  )
  return(p_fn(to_qs))
}

Which could then be called on a grouped data frame somewhere, like

  samples <- model_out_tbl %>%
    dplyr::group_by(model_id, dplyr::across(dplyr::all_of(task_id_cols))) %>%
    dplyr::reframe(
      value = transform_quantile_to_cdf_one_group(
        starting_ps = as.numeric(.data$output_type_id),
        starting_qs = .data$value,
        to_qs = new_output_type_id,
        ...
      )
    )

I think this approach would likely involve more code in the end, but could be nice as it would be more direct and involve less approximation. I'm open to your ideas about this!

Author

Originally, I had conceptualized a set of transformations like the transform_quantile_to_cdf_one_group() you describe, @elray1. When most of them required transforming to samples anyway, I opted for the elegant route. But I agree, there is additional approximation that is not necessary and would ideally be avoided.

Question if we opt for the individual transformation functions: consider, for example, transforming from quantile to mean. The steps are (1) estimate the cdf, (2) invert it to get the quantile function, (3) draw samples from the quantile function, (4) calculate the mean. Would this function call the transform_quantile_to_samples() function? Or would it replicate steps 1-3 within transform_quantile_to_mean()?
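To make the question concrete, the reuse option might look like this (both helper names here are hypothetical, sketched only to illustrate the design choice):

```r
# Hypothetical sketch: reuse a shared quantile-to-samples step (steps 1-3)
# rather than repeating it inside each downstream transformation.
transform_quantile_to_mean <- function(ps, qs, n_samples = 1e4, ...) {
  samples <- transform_quantile_to_samples(ps, qs, n_samples, ...)
  mean(samples)
}
```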

@lshandross, I'm happy for you to make the final decision on this one.

@elray1 (Contributor) commented Sep 17, 2024

This comment was a good prompt for me to think about which transformations require samples, estimates of the QF, and/or estimates of the CDF. Here's my thinking; does this seem correct?

  • starting type = quantile
    • new type = quantile: do we support this? should we? if so, would not require samples, use estimated QF
    • new type = cdf: samples not needed, use estimated CDF
    • new type = pmf: samples not needed, use estimated CDF. Some data structure would be required specifying bin endpoints, or some convention for the names of the output_type_ids for bins, e.g. (0,0.1].
    • new type = sample: samples needed
    • new type = median: samples not needed, use estimated QF
    • new type = mean: samples needed
  • starting type = cdf
    • new type = quantile: samples not needed, use estimated QF
    • new type = cdf: do we support this? should we? if so, would not require samples, use estimated CDF
    • new type = pmf: samples not needed, use estimated CDF
    • new type = sample: samples needed
    • new type = median: samples not needed, use estimated QF
    • new type = mean: samples needed
  • starting type = pmf.
    • overall note, i think that for the most part, transforms here only make sense if the variable is a discretization of an underlying continuous variable and we get some information about the bin endpoints used for the discretization. but in that setting, i think the transformation options match what was said above for starting type = cdf.
    • though if the variable is nominal, we could go to the CDF type naturally.
  • starting type = sample
    • you already have samples!
  • starting type = median
    • can't do any transformations other than the trivial transform to quantile at level 0.5
  • starting type = mean
    • can't do any transformations

Contributor

Discussed this with Li just now, and arrived at a plan to merge this PR more or less as-is, then do the following in separate issues:

  • Allow new_type = sample for existing probabilistic starting types where they are not supported (quantile and cdf)
  • Allow new_type = pmf for existing probabilistic starting types (quantile, cdf, and sample).
    • Maybe at the same time, move away from using samples as intermediary for new type = cdf. This would get us to a place where the same methodology is used when the new output type is pmf and cdf.
    • Maybe at the same time, support cdf -> cdf, e.g. extending the set of points where the cdf is evaluated. This feels like a low priority extension, not sure there is really a use case?
  • Allow starting_type = pmf
  • Lower priority issue, support quantile --> quantile transformations, e.g. extending the set of quantile levels predicted. This could be useful to modelers who are doing something like quantile regression, to allow them to get predictions at a small number of quantile levels and then fill in the others.
    • At this time, could switch to using QF as an intermediary when new_type = quantile or median rather than samples.

Author

This sounds great, thank you both for pushing this across the finish line!

@saraloo saraloo removed their assignment Sep 11, 2024
@annakrystalli annakrystalli changed the base branch from main to output-type-conversion September 25, 2024 15:16
@annakrystalli annakrystalli changed the base branch from output-type-conversion to main September 25, 2024 15:19
Status: In Progress · 5 participants