Rolling Origin & Initial Time Split: Lag parameter #136

Merged
merged 4 commits into from
Mar 31, 2020

mdancho84
Contributor

mdancho84 commented Mar 26, 2020

As described in #135, this PR proposes adding an overlap parameter to rolling_origin() and initial_time_split(). The parameter is needed to perform resampling and train/test validation with lagged predictors when sample sizes are very low.

# - Add overlap parameter to initial_time_split() & rolling_origin()

# Libraries
library(recipes)
library(rsample)
library(tidyverse)

# 3 years of data
drinks_subset <- drinks %>% tail(36)

splits_no_overlap <- initial_time_split(drinks_subset, prop = 2/3, overlap = 0)

training(splits_no_overlap)
#> # A tibble: 24 x 2
#>    date       S4248SM144NCEN
#>    <date>              <dbl>
#>  1 2014-10-01          11817
#>  2 2014-11-01          10470
#>  3 2014-12-01          13310
#>  4 2015-01-01           8400
#>  5 2015-02-01           9062
#>  6 2015-03-01          10722
#>  7 2015-04-01          11107
#>  8 2015-05-01          11508
#>  9 2015-06-01          12904
#> 10 2015-07-01          11869
#> # … with 14 more rows

testing(splits_no_overlap)
#> # A tibble: 12 x 2
#>    date       S4248SM144NCEN
#>    <date>              <dbl>
#>  1 2016-10-01          11914
#>  2 2016-11-01          13025
#>  3 2016-12-01          14431
#>  4 2017-01-01           9049
#>  5 2017-02-01          10458
#>  6 2017-03-01          12489
#>  7 2017-04-01          11499
#>  8 2017-05-01          13553
#>  9 2017-06-01          14740
#> 10 2017-07-01          11424
#> 11 2017-08-01          13412
#> 12 2017-09-01          11917

# Without overlap - Get missing values in testing dataset 
rec_lag <- recipe(~ ., data = training(splits_no_overlap)) %>%
  step_lag(S4248SM144NCEN, lag = 12) 

bake(prep(rec_lag), testing(splits_no_overlap)) %>% tail(12) # Missing values
#> # A tibble: 12 x 3
#>    date       S4248SM144NCEN lag_12_S4248SM144NCEN
#>    <date>              <dbl>                 <dbl>
#>  1 2016-10-01          11914                    NA
#>  2 2016-11-01          13025                    NA
#>  3 2016-12-01          14431                    NA
#>  4 2017-01-01           9049                    NA
#>  5 2017-02-01          10458                    NA
#>  6 2017-03-01          12489                    NA
#>  7 2017-04-01          11499                    NA
#>  8 2017-05-01          13553                    NA
#>  9 2017-06-01          14740                    NA
#> 10 2017-07-01          11424                    NA
#> 11 2017-08-01          13412                    NA
#> 12 2017-09-01          11917                    NA

# With overlap - No missing values
splits_with_overlap <- initial_time_split(
  drinks_subset, 
  prop    = 2/3, 
  overlap = 12 # New parameter
)

rec_lag <- recipe(~ ., data = training(splits_with_overlap)) %>%
  step_lag(S4248SM144NCEN, lag = 12) 

bake(prep(rec_lag), testing(splits_with_overlap)) %>% tail(12) # Get the right values
#> # A tibble: 12 x 3
#>    date       S4248SM144NCEN lag_12_S4248SM144NCEN
#>    <date>              <dbl>                 <dbl>
#>  1 2016-10-01          11914                 11983
#>  2 2016-11-01          13025                 11506
#>  3 2016-12-01          14431                 14183
#>  4 2017-01-01           9049                  8650
#>  5 2017-02-01          10458                 10323
#>  6 2017-03-01          12489                 12110
#>  7 2017-04-01          11499                 11424
#>  8 2017-05-01          13553                 12243
#>  9 2017-06-01          14740                 13686
#> 10 2017-07-01          11424                 10956
#> 11 2017-08-01          13412                 12706
#> 12 2017-09-01          11917                 12279

Created on 2020-03-26 by the reprex package (v0.3.0)
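
A minimal base R sketch of why the overlap removes the missing values. It does not use rsample internals: `split_indices()` and `lag_vec()` are hypothetical helpers introduced only for illustration, and the sketch assumes the testing set is simply extended backwards by `overlap` rows.

```r
# Hypothetical helper (not rsample code): row indices for a time-ordered split.
# Assumption: the test set is extended backwards by `overlap` rows.
split_indices <- function(n, n_train, overlap = 0) {
  list(
    train = seq_len(n_train),
    test  = seq(n_train + 1 - overlap, n)
  )
}

# Hypothetical helper: shift a vector forward by k positions, padding with NA.
lag_vec <- function(x, k) {
  c(rep(NA, k), head(x, -k))
}

no_overlap   <- split_indices(n = 36, n_train = 24, overlap = 0)
with_overlap <- split_indices(n = 36, n_train = 24, overlap = 12)

# Lagging the row positions shows which earlier row feeds each test row.
lag_vec(no_overlap$test, 12)              # all NA: no earlier rows to draw from
tail(lag_vec(with_overlap$test, 12), 12)  # rows 13 to 24, supplied by the overlap
```

With 12 test rows and a lag of 12, every lagged value falls before the start of the testing data, which matches the all-NA column in the reprex above. Extending the testing data back by 12 rows gives each of the final 12 rows a source value.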

@topepo
Member

topepo commented Mar 30, 2020

Looks good but can you make these changes:

  • Change the argument name to lags instead of overlap. I think that's more literal (but make suggestions if you have another idea).

  • Add some input validation for overlap/lags.

@mdancho84
Contributor Author

@topepo Sounds good to me. I will:

  1. Change argument to lags.
  2. Add input validation: check that the value is a whole number and is not greater than the training data size.

@mdancho84
Contributor Author

@topepo I've implemented the change to lag (singular rather than plural, since only one lag should be added, not multiple), and I've added 2 checks to make sure it is a whole number that does not exceed the number of training observations. Should be good to go.
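
The two checks described above could look roughly like the sketch below. This is not the merged rsample code: `check_lag()` is a hypothetical name, the error messages are invented, and the non-negativity check is an added assumption that the thread does not mention.

```r
# Hypothetical validator (not the merged rsample code): `lag` must be a single
# whole number no larger than the number of training rows.
check_lag <- function(lag, n_train) {
  if (!is.numeric(lag) || length(lag) != 1 || is.na(lag)) {
    stop("`lag` must be a single number.", call. = FALSE)
  }
  # Assumption: negative lags are rejected as well.
  if (lag < 0 || lag != round(lag)) {
    stop("`lag` must be a non-negative whole number.", call. = FALSE)
  }
  if (lag > n_train) {
    stop("`lag` must not exceed the number of training observations.",
         call. = FALSE)
  }
  invisible(lag)
}

check_lag(12, n_train = 24)        # passes silently
try(check_lag(1.5, n_train = 24))  # error: not a whole number
try(check_lag(30, n_train = 24))   # error: exceeds the training size
```

Failing early with a clear message avoids a split whose testing set would reach back past the first training row.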

@mdancho84 mdancho84 changed the title from "Overlap parameter" to "Rolling Origin & Initial Time Split: Lag parameter" Mar 31, 2020
@topepo
Member

topepo commented Mar 31, 2020

Thanks!

@topepo topepo merged commit 474b340 into tidymodels:master Mar 31, 2020
@github-actions

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 21, 2021