Validation

Lucie Contamin edited this page Dec 13, 2023 · 4 revisions

File Check List

File format (test_column())

The names and number of the columns correspond to the expected format:

origin_date scenario_id target horizon location age_group output_type output_type_id run_grouping stochastic_run value

The order of the columns is not important, but the file should contain the expected number of columns, each with its name correctly spelled.

Each column should be in the expected format (no "factor" columns are accepted).

*Remarks:* If a column is missing, the submission test stops immediately and returns an error message without running the other tests.
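The column checks described above can be approximated locally before running the full validation. The following is a minimal sketch in base R, not the code used by test_column(): the helper name check_columns and the toy data frame are hypothetical, and only the column names come from the list above.

```r
# Expected column names, copied from the list above
expected_cols <- c("origin_date", "scenario_id", "target", "horizon",
                   "location", "age_group", "output_type", "output_type_id",
                   "run_grouping", "stochastic_run", "value")

# Hypothetical helper: report column problems in a submission data frame
check_columns <- function(df) {
  list(
    missing = setdiff(expected_cols, names(df)),
    unexpected = setdiff(names(df), expected_cols),
    factor_cols = names(df)[vapply(df, is.factor, logical(1))]
  )
}

# Toy submission (made-up values): "value" is missing, "target" is a factor
toy <- data.frame(origin_date = "2023-11-12", scenario_id = "A",
                  target = factor("inc hosp"), horizon = 1L, location = "US",
                  age_group = "0-130", output_type = "sample",
                  output_type_id = NA, run_grouping = 1L, stochastic_run = 1L)
res <- check_columns(toy)
```

Here res$missing lists the absent columns and res$factor_cols lists the columns that would need converting to character before submission.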

Scenario Information (test_scenario())

  • The scenario IDs correspond to the expected IDs for the expected round, without any typos.

Column "origin_date" (test_origindate())

The origin_date is the start date for scenarios (first date of simulated transmission/outcomes).

  • The origin_date column must satisfy the following:
    • it contains one unique date value in the YYYY-MM-DD format (character or date format accepted; datetime will return a warning).
    • the date in the submission file matches the date in the name of the file.
    • the date in the submission file matches the projection starting date.
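The first two origin_date checks can be sketched as follows. This is an illustrative base R version, not the test_origindate() implementation: the helper name check_origin_date and the example file name are hypothetical, and the comparison with the projection starting date is left out because that date comes from the round configuration.

```r
# Hypothetical helper: TRUE when origin_date holds a single valid
# YYYY-MM-DD date that also appears in the file name
check_origin_date <- function(origin_date, file_name) {
  dates <- unique(as.character(origin_date))
  one_date <- length(dates) == 1
  iso_format <- one_date &&
    grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", dates) &&
    !is.na(as.Date(dates, format = "%Y-%m-%d"))
  in_file_name <- one_date &&
    grepl(dates, basename(file_name), fixed = TRUE)
  one_date && iso_format && in_file_name
}

# Made-up file name, used only for illustration
f <- "2023-11-12-team-model.parquet"
check_origin_date(c("2023-11-12", "2023-11-12"), f)  # single matching date
check_origin_date(c("2023-11-12", "2023-11-19"), f)  # two different dates
```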

"sample" information (test_sample())

  • The output_type_id column should only contain NA.

  • The run_grouping and stochastic_run columns must only contain integer values.

  • The submission file must contain the expected number of repetitions (number of samples or trajectories) for each scenario/target/location/horizon/(age_group) group.

  • At a minimum, sample identifiers must be unique by "horizon" and "age_group" group. That is, each unique sample identifier (calculated by concatenating the run_grouping and stochastic_run columns) should cover every possible horizon value and age_group value at least once, and may optionally span one or several values of the other task ID columns (origin_date, scenario, location, target, horizon, (age_group, etc.)).
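To make the sample identifier rule concrete, the sketch below builds the identifier by concatenating run_grouping and stochastic_run and checks that each identifier covers every horizon/age_group pair. It is a base R illustration under assumptions: the toy values (including the age groups) are made up, and the "_" separator is a choice made here, not necessarily the one used by test_sample().

```r
# Toy data: 2 samples, each covering 3 horizons x 2 age groups (made-up values)
toy <- expand.grid(horizon = 1:3, age_group = c("0-4", "5-130"),
                   run_grouping = 1L, stochastic_run = 1:2,
                   stringsAsFactors = FALSE)

# Both identifier columns must hold whole numbers
ids_are_integer <- all(toy$run_grouping %% 1 == 0) &&
  all(toy$stochastic_run %% 1 == 0)

# Sample identifier: concatenation of run_grouping and stochastic_run
toy$sample_id <- paste(toy$run_grouping, toy$stochastic_run, sep = "_")

# For every sample identifier, count the distinct horizon/age_group pairs
pairs_per_sample <- tapply(paste(toy$horizon, toy$age_group), toy$sample_id,
                           function(x) length(unique(x)))
expected_pairs <- length(unique(toy$horizon)) * length(unique(toy$age_group))
complete <- all(pairs_per_sample == expected_pairs)
```

A sample identifier that skips a horizon or an age group would have fewer pairs than expected_pairs, so complete would be FALSE.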

"cdf" information

If the submission contains "cdf" values only,

  • The output_type_id column contains the expected Epiweek values, noted in the EWYYYYWW format.

  • The submission should contain a unique sample identifier for each scenario/target/location (age_group) combination.
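A quick way to spot malformed Epiweek labels is a pattern check on the EWYYYYWW format. The sketch below is illustrative only: the helper name is_epiweek is hypothetical, and the accepted week range of 01 to 53 is an assumption made here. The actual set of expected values is defined by the round configuration.

```r
# Hypothetical helper: TRUE for labels such as "EW202345"
# (assumes valid week numbers run from 01 to 53)
is_epiweek <- function(x) {
  well_formed <- grepl("^EW[0-9]{6}$", x)
  week <- suppressWarnings(as.integer(substr(x, 7, 8)))
  well_formed & !is.na(week) & week >= 1 & week <= 53
}

labels <- c("EW202345", "EW202300", "2023-45")
valid <- is_epiweek(labels)
```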

Quantiles information and value (test_quantile())

If the submission contains quantile values only,

  • The submission file should contain quantiles matching the expected quantile values:
0.010 0.025 0.050 0.100 0.150 0.200 0.250 0.300 0.350 0.400 0.450 0.500
0.550 0.600 0.650 0.700 0.750 0.800 0.850 0.900 0.950 0.975 0.990
  • Two additional optional quantiles have been added to the list: 0 and 1. These 2 quantiles are not required.

  • For each target/scenario/location/age_group group, the value increases with the quantiles. For example, for the 1st week ahead of target X, for location Y and scenario A, if quantile 0.01 = "5", then quantile 0.5 should be equal to or greater than "5".
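The two quantile rules (expected quantile levels, values non-decreasing with the quantile) can be sketched in base R as follows. This is an illustration rather than the test_quantile() code: the helper name is_monotone and the toy values are made up, and the grouping is reduced to location for brevity.

```r
# The 23 required quantile levels listed above
expected_q <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)

# Hypothetical helper: TRUE when values never decrease as quantiles increase
is_monotone <- function(quantile, value) {
  !is.unsorted(value[order(quantile)])
}

# Toy data (made-up values): location "06" breaks the rule at quantile 0.5
toy <- data.frame(
  location = c("US", "US", "US", "06", "06", "06"),
  output_type_id = c(0.01, 0.5, 0.99, 0.01, 0.5, 0.99),
  value = c(5, 8, 12, 5, 4, 9)
)

# Quantile levels in the toy data that are not part of the expected list
# (rounded to avoid floating point mismatches)
extra_q <- setdiff(round(toy$output_type_id, 3), round(expected_q, 3))

# Monotonicity check per group
ok_by_location <- vapply(split(toy, toy$location),
                         function(g) is_monotone(g$output_type_id, g$value),
                         logical(1))
```

In a real submission the split would be done on every target/scenario/location/age_group/horizon combination rather than on location alone.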

Value and Types information (test_val())

  • Each type/type_id/target/scenario/location/(age_group, etc.) group combination has one unique projected value. For example: only 1 value for sample 1, location US, target inc hosp, horizon 1, age group 0-130, and scenario A.

  • The projection contains only values greater than or equal to 0

  • For each target name/scenario/location/age_group group (except locations 66 (Guam), 69 (Northern Mariana Islands), 60 (American Samoa), 74 (U.S. Minor Outlying Islands)), the whole projection does not contain only 1 unique value. For example, the projection for the incidence cases for one location and for one scenario does not contain only one unique value for the whole time series. **As it is possible that 0 deaths or cases are projected, the submission will still be accepted if the test fails, but it will return a warning message asking to verify the projection.**

  • Each projected value cannot be greater than the population size of the corresponding geographical entity. As an individual can be reinfected, the submission will still be accepted if the test fails, but it will return a warning message asking to verify the projection.
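The value checks above amount to three simple comparisons, sketched below in base R. The toy values and population sizes are made up for illustration, and this is not the test_val() implementation. In practice the population table would be read from the file passed as pop_path.

```r
# Toy projections and made-up population sizes keyed by location
toy <- data.frame(location = c("US", "US", "06", "06"),
                  value = c(10, 25, 3, 3))
pop <- c(US = 1000, "06" = 100)

# Values must be greater than or equal to 0
negative <- toy$value < 0

# Values above the population size only trigger a warning
above_pop <- toy$value > pop[toy$location]

# A group holding a single unique value only triggers a warning
constant <- tapply(toy$value, toy$location,
                   function(v) length(unique(v)) == 1)
```

Here constant flags location "06", whose projection holds the same value throughout.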

Target information and value (test_target())

  • The targets correspond to the target names as given in the SMH GitHub README and wiki files: "inc hosp", "cum hosp", "peak time hosp", "peak size hosp".

  • The submission file contains projections for all the required targets. The submission file will be accepted if some targets are missing, but will return a warning message and the submission might not be included in the Ensembles.

  • The submission file contains projections for the expected number of weeks. If the file contains more projected weeks than expected, the submission will still be accepted, but will return a warning message and the additional weeks will not be included in the visualization on the SMH website. If the file contains fewer projected weeks than expected, the submission might still be accepted, but will return an error message and might not be included in the Ensembles.

Round | Minimal number of weeks | Maximal number of weeks
1     | 29                      | 29

Column "location" (test_location())

  • The submission should contain projections by location. The 'location' column contains the location as a FIPS number, as available in the location table in the SMH GitHub Repository. If a FIPS number is missing its leading zero (for example, 6 instead of 06), the submission will be accepted but a warning message will be returned.

  • The submission contains only the expected locations, here the locations contained in the RSV-NET target data.

*Remarks:* If a submission file contains only state-level projections (one or multiple), the location column might be automatically identified as numeric even if it was submitted in a character format. In this case, a warning message will automatically be printed by the validation; please feel free to ignore it.
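A two-digit state FIPS code such as 06 becomes 6 when the column is read as a number, which is what triggers the warning described above. A minimal way to restore the codes before writing the file is sketched below. The helper name pad_fips is hypothetical and only handles single-digit state codes.

```r
# Hypothetical helper: restore two-digit state FIPS codes after numeric parsing
pad_fips <- function(location) {
  location <- as.character(location)
  short <- grepl("^[0-9]$", location)  # single-digit codes only
  location[short] <- paste0("0", location[short])
  location
}

fixed <- pad_fips(c("6", "36", "US"))
```

Non-numeric locations such as "US" and codes that already have two digits are left unchanged.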

Column "age_group" (test_agegroup())

  • The submission should contain a column age_group with values defined as <AGEMIN>-<AGEMAX>; <AGEMIN> cannot be equal to or greater than <AGEMAX>.

  • For targets requiring only specific age group(s), no additional age group is provided in the submission file. If additional age groups are provided, a warning will be returned and the additional information might not be integrated in the analysis and visualization.

Remarks: These tests are only run if the submission contains an age_group column. If an age_group value is not in the expected format (<AGEMIN>-<AGEMAX>), some tests are skipped (802, 303, 805).
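The age_group format can be checked with a pattern match followed by a comparison of the two bounds. The sketch below is a base R illustration, not the test_agegroup() code: the helper name check_age_group is hypothetical, and it assumes the rule is that <AGEMIN> must be strictly less than <AGEMAX>.

```r
# Hypothetical helper: TRUE for values written as <AGEMIN>-<AGEMAX>
# with AGEMIN strictly below AGEMAX (assumed rule)
check_age_group <- function(age_group) {
  num <- "[0-9]+(\\.[0-9]+)?"
  well_formed <- grepl(paste0("^", num, "-", num, "$"), age_group)
  bounds <- strsplit(age_group[well_formed], "-", fixed = TRUE)
  ordered <- vapply(bounds,
                    function(b) as.numeric(b[1]) < as.numeric(b[2]),
                    logical(1))
  result <- rep(FALSE, length(age_group))
  result[well_formed] <- ordered
  result
}

ok <- check_age_group(c("0-130", "65-5", "5_17"))
```

In this toy input, "65-5" fails because the bounds are reversed and "5_17" fails because it does not follow the expected format.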

File Checks Running Locally

Each submission will be validated using the validate_submission() function from the SMHvalidation R package. The package is currently only available on GitHub; to install it, please follow these steps:

install.packages("remotes")
remotes::install_github("midas-network/SMHvalidation", 
                        build_vignettes = TRUE,
                        ref = "main") 

or it can be manually installed by directly cloning/forking/downloading the package from GitHub.

To load the package, execute the following command:

library(SMHvalidation)

The package contains a validate_submission() function allowing the user to check their SMH submissions locally.

To validate round 1, please use the SMHvalidation package version 0.0.22 (latest version) to have all the tests. Version 0.0.21 will also work, but does not include the tests on column and origin_date format.

Prerequisite

To test a submission file, the function requires multiple parameters:

  • path: path to the submission file(s). The SMHvalidation package contains multiple example files that can be used to test the function. Please refer to the package documentation for more information.
  • js_def: path to a JSON file containing the round-specific and scenario information, following the Consortium of Infectious Disease Modeling Hubs standard.
  • lst_gs: This parameter can be set to NULL if no COVID-19 observed data comparison is required.
  • pop_path: path to a table containing the population size of each geographical entity by FIPS (in a column "location") and by location name.
  • merge_sample_col: Boolean indicating whether, for the output type "sample", the output_type_id column is set to NA and the sample identifier information is contained in 2 columns: "run_grouping" and "stochastic_run".

Run the Validation

Run without testing against observed data:

js_def <- "https://raw.githubusercontent.com/midas-network/rsv-scenario-modeling-hub/main/hub-config/tasks.json"
pop_path <- "https://raw.githubusercontent.com/midas-network/rsv-scenario-modeling-hub/main/auxiliary-data/locations.csv"
lst_gs <- NULL
validate_submission("PATH/TO/SUBMISSION", js_def, lst_gs, pop_path, merge_sample_col = TRUE)

File Visualization Running Locally (only for quantiles values)

The SMHvalidation R package contains plotting functionality to output a plot of each location and target, with all scenarios, for the quantile output type only.

To run this visualization locally:

generate_validation_plots(path_proj = "PATH/TO/SUBMISSION", lst_gs = NULL, save_path = getwd(), y_sqrt = FALSE, plot_quantiles = c(0.025, 0.975))