Config reader with expected schema validation #7

zsusswein · 2024-08-07T20:46:39Z

Note

Edit 2024-09-12: Marking as draft while reworking. Will re-open when ready for further review.

Read in the Rt run config (this is meant to be task-specific), validate it against a specified schema, and throw an error if anything doesn't match. As part of the error, dump a description of what's wrong with the config.

Note that this process assumes that all keys are specified. We don't, for example, have a default prior not in the config.

Please take a look at the proposed sample config in tests/testthat/data/sample_config.json! I tried to make it both opinionated and flexible. There are some particular choices in there, like specifying both job and task IDs in the config as UUIDs. Maybe that's too stringent and it would be good to catch that now.

zsusswein · 2024-08-08T12:59:24Z

There's one NOTE in R CMD check but it's because the azure download is specified in #7 and I don't include it here.

zsusswein · 2024-08-08T13:13:52Z

@kaitejohnson feedback on extensibility of this setup for WW modeling would be great!

natemcintosh

Just a few questions. Overall, I think having the schema like this is going to make our lives so much simpler. We just have to make sure we always use the config values throughout the Rt code now 😅

R/config.R

inst/extdata/config_schema.json

natemcintosh · 2024-08-08T13:51:21Z

inst/extdata/config_schema.json

+                "seed": {
+                    "type": "integer"
+                },
+                "horizon": {


Suggested change

"horizon": {

"horizon_days": {

just to be a little more clear

Is this the specified number of days to forecast from the as_of_date?

tests/testthat/data/sample_config.json

natemcintosh · 2024-08-08T13:59:13Z

inst/extdata/config_schema.json

+                        "format": "date"
+                    }
+                },
+                "reference_date": {


Each report date runs on all the reference dates?

Hmmmmm -- yeah this is a good flag. That's a bad assumption. Let me revisit.

Is this intended to be, for at least the EpiNow2 example, this vector of dates corresponding to the time series data passed in? E.g the date of admissions

And for EpiNow2, you would just have a single report date? So for example for this week if I ran EpiNow2 today, assuming old NHSN data reporting. I'd have:
as_of_date = "2024-08-08",
report_date = "2024-08-07",
reference_date = a vector of dates going back some specified calibration period up until "2024-08-02" (last friday)

inst/extdata/config_schema.json

natemcintosh · 2024-08-08T14:01:23Z

inst/extdata/config_schema.json

+            "additionalProperties": false,
+            "properties": {
+                "mean": {
+                    "type": "integer"


why integer instead of number?

natemcintosh · 2024-08-08T14:09:39Z

inst/extdata/config_schema.json

+            "type": "object",
+            "additionalProperties": false,
+            "properties": {
+                "job_id": {


My only slight hesitation with UUIDs for the job IDs is if we run multiple jobs, it would just be a bit harder to know which job is which. That said, it would mean we never run into the annoying error "This job already exists" because we forgot to delete it.

What about, e.g. Rt-estimation-2024-08-08T10:08:34 as job name?

This assumes that you will pass in a UUID that is generated somewhere else right?

Probably not for this PR, I would add more metadata. For example, if this job id is under the "EpiNow2" umbrella, I would want something that names the job based on the name of the package being used.

I like @natemcintosh's suggestion as long as we (1) store the date timestamp inside the metadata, not just in the path name, and (2) as long as there are no special character concerns using this as a path name 😬

I was testing out this naming scheme idea on something else, and discovered that Azure was not happy with :, so I replaced it with -.

So this might be something more like Rt-estimation-2024-08-08T10-08-34

inst/extdata/config_schema.json

kaitejohnson

Probably asking some obvious questions... feel free to ignore if they are obviously googleable.

My main question is about the flexibiltiy of the format of the config. E.g. Does it need to take in a separate arg for Priors, and do those Priors have to have very specific args of rt and Gp?

How would multiple data paths be specified in this workflow?

kaitejohnson · 2024-08-08T16:27:35Z

R/config.R

+#'
+#' @param config_path The path to the config file, either in the local
+#'   filesystem or with an Azure Blob Storage container. If
+#'   `blob_storage_container` is specified, the the path is assumed to be within


Suggested change

#' `blob_storage_container` is specified, the the path is assumed to be within

#' `blob_storage_container` is specified, the path is assumed to be within

kaitejohnson · 2024-08-08T16:28:32Z

R/config.R

+#'   `blob_storage_container` is specified, the the path is assumed to be within
+#'   the specified container otherwise it is assumed to be in the local
+#'   filesystem.
+#' @param local_dest The local directory to write the config to when downloading


What happens if local_dest doesn't exist?

kaitejohnson · 2024-08-08T16:29:36Z

R/config.R

+#' The validation relies on `inst/data/config_schema.json` for validation. This
+#' file is in `json-schema` notation and generated programatically via
+#' https://www.jsonschema.net/.
+#'


General comment, I think these are all character strings but I find it helpful when reading documentation when the type is explicitly specified.

kaitejohnson · 2024-08-08T16:37:32Z

inst/extdata/config_schema.json

+            "type": "object",
+            "additionalProperties": false,
+            "properties": {
+                "job_id": {


This assumes that you will pass in a UUID that is generated somewhere else right?

Probably not for this PR, I would add more metadata. For example, if this job id is under the "EpiNow2" umbrella, I would want something that names the job based on the name of the package being used.

inst/extdata/config_schema.json

kaitejohnson · 2024-08-08T16:45:10Z

inst/extdata/config_schema.json

+                "seed",
+                "task_id"
+            ],
+            "title": "Epinow2"


Just want to make sure I understand, currently this example is for "Epinow2" but you want to be able to swap this for another package name right?

kaitejohnson · 2024-08-08T16:47:00Z

inst/extdata/config_schema.json

+                        "format": "date"
+                    }
+                },
+                "reference_date": {


Is this intended to be, for at least the EpiNow2 example, this vector of dates corresponding to the time series data passed in? E.g the date of admissions

kaitejohnson · 2024-08-08T16:49:53Z

inst/extdata/config_schema.json

+                        "format": "date"
+                    }
+                },
+                "reference_date": {


And for EpiNow2, you would just have a single report date? So for example for this week if I ran EpiNow2 today, assuming old NHSN data reporting. I'd have:
as_of_date = "2024-08-08",
report_date = "2024-08-07",
reference_date = a vector of dates going back some specified calibration period up until "2024-08-02" (last friday)

kaitejohnson · 2024-08-08T16:53:22Z

inst/extdata/config_schema.json

+            ],
+            "title": "Parameters"
+        },
+        "Priors": {


Would you always needs to pass in these arguments or is this specific to Epinow2? I think there might be other packages where you would handle specifying priors differently (e.g. in the ww package we're developing, priors are lumped in with parameters...)

My assumption is that you'd have to edit this in another version of the repo to enforce that schema, but I think the framework is adaptable!

Given that something like fetch_config is not EpiNow2 specific there is some argument it shouldn't be in this package. It seems like if we are to have another package cfa-newpackage-pipeline then it'll also need things function. So the schema is EpiNow2 specific but the surrounding functions are not

kaitejohnson · 2024-08-08T16:56:05Z

tests/testthat/data/sample_config.json

+       "blob_storage_container": null
+    },
+    "data": {
+        "path": "gold/",


As written, how would you point to multiple data sources when it seems that data just has one path option

kgostic

This is good but I have some big picture questions and comments:

As noted within, I think there are a few runtime pars we want to add to the schema
Will the output file path be specified elsewhere? Did I miss this?
How will this be applied across jurisdictions? Is this meant to be a stem that we can later append the jurisdiction ID to? One use case that isn't covered here is the ability to change a parameter for just one jurisdiction.
Are you envisioning that config specification and validation happen before kicking off the Batch job or inside Azure?
I think this should be out of scope for now, but I'm wondering if it's worth adding at some point "optional_validators" to the schema -- this would be a list of function names that could be turned on in the config, e.g. "validate_for_production" could check that we're using the right days of week, etc. but also could be easily turned off.
I think we need to start writing documentation as we merge PRs or we're never going to be able to remember ourselves how things work, let alone enforce adherence to the existing standards and architecture and avoid this disintegrating back into spaghetti code. At minimum, I think for the config we need:

a "data dictionary" explaining what each parameter means
some guidance for how to build, store, edit, and pass a config file into a run
some guidance on how to change the config schema (edit the example files, the tests, the config_schema.json)

kgostic · 2024-08-09T19:58:04Z

inst/extdata/config_schema.json

+            "type": "object",
+            "additionalProperties": false,
+            "properties": {
+                "job_id": {


I like @natemcintosh's suggestion as long as we (1) store the date timestamp inside the metadata, not just in the path name, and (2) as long as there are no special character concerns using this as a path name 😬

kgostic · 2024-08-09T20:08:35Z

inst/extdata/config_schema.json

+                }
+            },
+            "required": [
+                "as_of_date",


Suggested change

"as_of_date",

"as_of_date",

"timeseries_end_date",

"timeseries_length_weeks",

These are needed so that we can do things like:

Change the number of weeks in the sliding window

Kick off retrospective runs (e.g. where the as_of_date is much later than the timeseries_end_date)

What is the logical relationship between as_of_date here, and report_date below inside the data block? Do we need to enforce / validate this relationship somewhere?

kgostic · 2024-08-09T20:09:16Z

inst/extdata/config_schema.json

+            },
+            "required": [
+                "as_of_date",
+                "data",


I'm sure this is explained later, but what does "data" mean?

kgostic · 2024-08-09T20:14:40Z

inst/extdata/config_schema.json

+            ],
+            "title": "Parameters"
+        },
+        "Priors": {


My assumption is that you'd have to edit this in another version of the repo to enforce that schema, but I think the framework is adaptable!

kgostic · 2024-08-09T20:27:41Z

inst/extdata/config_schema.json

+                    "type": "string",
+                    "format": "uuid"
+                },
+                "as_of_date": {


See comment below

kgostic · 2024-08-09T20:29:29Z

inst/extdata/config_schema.json

+            ],
+            "title": "Data"
+        },
+        "Parameters": {


What if we want to point to more than one container to pull in parameters?

kgostic · 2024-08-09T20:33:19Z

inst/extdata/config_schema.json

+                "alpha_sd"
+            ],
+            "title": "Gp"
+        },


I think we want to add specification of the ls mean via gp_opts() https://epiforecasts.io/EpiNow2/reference/gp_opts.html

kgostic · 2024-08-09T20:34:37Z

inst/extdata/config_schema.json

+        "SamplerOpts": {
+            "type": "object",
+            "additionalProperties": false,
+            "properties": {


Can we add the number of samples run here as a parameter -- seems like something we could want to change at runtime, and I think helpful to encode it explicitly instead of using the defaults

athowes · 2024-08-13T15:03:00Z

tests/testthat/test-fetch_config.R

+  expect_equal(actual, expected)
+})
+
+test_that("Bad config errors", {


Use more descriptive test name?

athowes · 2024-08-13T15:03:12Z

tests/testthat/test-fetch_config.R

+  )
+})
+
+test_that("Test config validates", {


Same: use more descriptive test name?

athowes · 2024-08-14T14:58:09Z

R/config.R

+  return(config)
+}
+
+#' Compare loaded json against expectation in `inst/data/config-schema.json`


JSON (caps)?

For this kind of thing, please use the "Add a suggestion" box (in the middle of the commenting header with a +/-). It speeds up adding the edit!

Suggested change

#' Compare loaded json against expectation in `inst/data/config-schema.json`

#' Compare loaded JSON against expectation in `inst/data/config-schema.json`

athowes · 2024-08-14T14:59:00Z

R/config.R

+#' and downloads the Rt model run's config to the local config (if
+#' `blob_storage_container` is specified), reads the config in from the
+#' filesystem, and validates that it matches expectations. If any of these steps
+#' fails, the pipeline fails with an informative error message. Note, however,


Suggested change

#' fails, the pipeline fails with an informative error message. Note, however,

#' fail, the pipeline fails with an informative error message. Note, however,

athowes · 2024-08-14T15:49:44Z

R/config.R

+#' `blob_storage_container` is specified), reads the config in from the
+#' filesystem, and validates that it matches expectations. If any of these steps
+#' fails, the pipeline fails with an informative error message. Note, however,
+#' that a failure in this initial step suggests that something fundamental is


"fundamental" is quite vague

athowes · 2024-08-16T13:55:55Z

I think I might be getting at similar things as Katie but:

Can we do documentation of extdata using standard R package approaches?
- On that topic, is there any way to create automatic human readable versions schema like inst/extdata/config_schema.json?
As I put in a comment, some of these functions seem to be non-EpiNow2 specific

for more information, see https://pre-commit.ci

zsusswein force-pushed the zs-read-config branch 3 times, most recently from 0e566c7 to 196d7e2 Compare August 7, 2024 21:53

Config reader with expected schema validation

a7602bb

zsusswein force-pushed the zs-read-config branch from 196d7e2 to a7602bb Compare August 8, 2024 12:56

zsusswein requested review from natemcintosh and kgostic August 8, 2024 12:57

zsusswein marked this pull request as ready for review August 8, 2024 12:58

zsusswein requested a review from kaitejohnson August 8, 2024 13:13

natemcintosh approved these changes Aug 8, 2024

View reviewed changes

kaitejohnson reviewed Aug 8, 2024

View reviewed changes

kgostic reviewed Aug 9, 2024

View reviewed changes

athowes reviewed Aug 13, 2024

View reviewed changes

tests/testthat/test-fetch_config.R

)

})

test_that("Test config validates", {

Copy link

Collaborator

athowes Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same: use more descriptive test name?

athowes reviewed Aug 14, 2024

View reviewed changes

Merge branch 'main' into zs-read-config

51b5a25

zsusswein marked this pull request as draft September 12, 2024 23:13

Document config parameter meanings

981e158

zsusswein force-pushed the zs-read-config branch from 125a5f8 to 981e158 Compare September 12, 2024 23:22

[pre-commit.ci] auto fixes from pre-commit.com hooks

a006e91

for more information, see https://pre-commit.ci

	#' `blob_storage_container` is specified, the the path is assumed to be within
	#' `blob_storage_container` is specified, the path is assumed to be within

	#' Compare loaded json against expectation in `inst/data/config-schema.json`
	#' Compare loaded JSON against expectation in `inst/data/config-schema.json`

	#' fails, the pipeline fails with an informative error message. Note, however,
	#' fail, the pipeline fails with an informative error message. Note, however,

Config reader with expected schema validation #7

Are you sure you want to change the base?

Config reader with expected schema validation #7

Conversation

zsusswein commented Aug 7, 2024 • edited Loading

zsusswein commented Aug 8, 2024

zsusswein commented Aug 8, 2024

natemcintosh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

natemcintosh Aug 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

natemcintosh Aug 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kaitejohnson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kgostic left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

athowes commented Aug 16, 2024

zsusswein commented Aug 7, 2024 •

edited

Loading

natemcintosh Aug 8, 2024 •

edited

Loading

natemcintosh Aug 8, 2024 •

edited

Loading