This package contains models for estimating epi-curves and individual infection progression distributions simultaneously. The estimators require both total case data by date as well as detailed ("line list") information for a subset of individual.
This models the delay between the onset of symptoms and the presenting/testing/reporting of cases to health authorities. Early in outbreaks these delays can frequently be large due to lack of awareness of the symptoms, however, with increased awareness due to public health information the delays will decrease in time. Even if these delays are constant with time, when estimating their distribution it is necessary to consider both censoring effects and the underlying dynamics of the infection to avoid biases. Conversely, when estimating the dynamics of then infection (e.g. r(t) or R(t)) from reported cases, it is necessary to know these distributions. Therefore, if both infection dynamics and reporting delays are varying at the same time, the best way to account for the biases is to simultaneously estimate both.
The aim of the model is to understand the interaction between the symptom-report time distribution and the underlying dynamics of the infection rate, therefore we use a very simple model for the number of people developing the symptoms each day.
We model the daily growth rate
where
where
where
The symptom-report time distribution must support both positive and negative values. In addition, empirically it is observed that this distribution can be highly skewed with heavy tails, therefore we model it using the Johnson SU distribution which contains 4 parameters
At then end of the reporting period, there may not be many reported case for each symptoms date (since the data is right-censored), therefore there is the option to to make the distribution static after a particular time
where the
Range priors are put on the initial values of all the parameters and the variances of the Gaussian processes.
For efficiency, we allow for the period between sampling of the Gaussian processes to be greater than day, in which case we linearly interpolate for the intermediate days.
The posterior of the model is sampled using Stan and is contained within this R package.
The function symptom_report.fit
fits the model to data and symptom_report.simulator
generates simulated data under the model.
We now demonstrate the model using simulated data which is contained in examples/linear_r_dist.R
.
In this example, the daily growth rate
library( EpiLine )
set.seed( 1 )
# define the length of the simulatiopn
t_rep <- 50 # length of time for which data is reported
t_symptom_pre <- 30 # time before the reporting period to simulate
t_symptom_post <- 5 # time after the reporting period to simulate
t_max <- t_rep + t_symptom_post + t_symptom_pre
# set up the variable r(t) and distribution
symptom_0 <- 2 # initial number of symptomatic people
r <- 0.1 - 0.13 * ( 1:t_max ) / t_max # r(t) in the simulation
xi <- -1 + 6 * ( t_max:1 ) / t_max # xi parameter in the symptom-report dist
lambda <- 2 + ( t_max:1 ) / t_max # lambda parameter in the symptom-report dist
simulation <- symptom_report.simulator(
t_rep = t_rep,
t_symptom_pre = t_symptom_pre,
t_symptom_post = t_symptom_post,
symptom_0 = symptom_0,
r = r,
dist_xi = xi,
dist_lambda = lambda
)
We next sample from the posterior of the model using Stan (note we're using very short chains for computational speed so mean and variance of parameter estimates will be of limited accuracy).
# data
reported <- simulation$reported
ll_report <- simulation$linelist$report
ll_symptom <- simulation$linelist$symptom
report_date <- as.Date("2022-04-01")
# fit using model
mcmc_n_samples <- 100
mcmc_n_chains <- 1
fit <- symptom_report.fit( reported, ll_symptom, ll_report, report_date = report_date,
mcmc_n_samples = mcmc_n_samples, mcmc_n_chains = mcmc_n_chains )
Once the fit is complete (this examples takes roughly 50s), we can plot the posterior of the fitted parameters against the simulation parameters.
First we consider the estimate of the number of people developing symptoms on each day (fit$plot.symptoms( simulation = simulation
).
The dotted vertical lines in the chart show the period over which case reports were provided.
The extended time pre- and post- the reporting window is required because some of the reported cases in this period will have developed symptoms outside of the window.
Next we plot the estimated fit$plot.r( simulation = simulation
).
Note that outside the reporting window the estimated
Finally we plot the estimated symptom-report distribution at the start and end of the reporting period (fit$plot.symptom_report.dist( simulation = simulation )
).
Note that model successfully estimates the distribution at the start and end of the reporting period.
Alternatively we can plot the posterior distribution of different quantiles of the symptom-report distribution with time (fit$plot.symptom_report.quantiles( simulation = simulation, quantiles = c( 0.1,0.5,0.9) )
).
To run the model on real data you need to provide the function symptom_report.fit()
with both the time-series of the number of reported cases and the line list of symptom-report case pairs. Note that it is not necessary to have line list data for all reported cases, so for a reported case where there is no data about the time of symptoms, it should be included in the time-series but not the line list.
The data to fit can be supplied to the model as 3 vectors (symptom_report.fit( reported, linelist_symptom, linelist_report )
), see examples\flat_r.R
.
reported
- a vector of integers with length of the reporting period and containing the total number of cases reported each day. The first entry is for the earliest report date with following entries being ordered for each day in the reporting period (if no reported cases on a day then it should be entered as a 0).linelist_symptom
- a vector with length of the number of cases for which both symptom and report date are known. The value is an integer of the date that each case developed symptoms relative to the reporting period start date (a value of 1 means symptoms were developed on the first day for which there are reported cases). Note that values are allowed to be negative.linelist_report
- a vector with length of the number of cases for which both symptom and report date are known, paired withlinelist_symptom
. The value is an integer of that each case was reported relative to the reporting period start date. Note that all reported cases should be during the reporting periods used inreported
, i.e. all values are between 1 and the length ofreported
.
Alternatively data can be read directly from csv files (symptom_report.fit( file_reported = file_reported, file_linelist = file_linelist )
, see examples\from_csv.R
.
file_reported
- is a path to a csv file which has 2 columns, the report date and the total number of cases reported on that day. The first row is the headerdate,reported
(seeexamples\linear_r_reported.csv
).file_linelist
- is a path to a csv file which has 2 columns, the report date and the symptom date for each case for which the pair is known. The first row is the headerreport,symptom
(seeexamples\linear_r_linelist.csv
). Note, cases for which the only report date is known should be excluded from this file.
Option arguments which can be set are:
report_date
- the date of the start of the reporting period (e.g.as.Date("2022-04-01")
). This is only used for the final plotting of the posteriors.mcmc_n_samples
- the number of samples that the MCMC chains in Stan run for. This is default to100
for a quick and dirty result (and will produce Stan warnings when run). We recommend using1000
or2000
for producing accurate answers.mcmc_n_chains
- the number of MCMC chains in Stan. This is default to1
for a quick and dirty result, we recommend using at least3
to allow for cross-chain checks.prior_xxxxx
- for setting the priors in the model.
The output charts are member functions of the R6 fit object returned by fit=symptom_report.fit()
. The output is shown in the Example section above for results on simulated data (note with real data you do not provide a simulation object i.e. just call fit$plot.symptoms()
).
-
fit$plot.symptoms()
- shows a plot of the posterior distribution for the number of people developing symptoms at each time point along with the report data (note this in general is lagged compared to symptoms). -
fit$plot.r()
- shows a plot of the posterior distribution of the daily growth rate r(t). -
fit$plot.symptom_report.dist()
- shows a plot of the posterior distribution of symptom-report distribution for the start and end of the reporting period. -
fit$plot.symptom_report.quantiles()
- shows a plot of the posterior distribution for quantiles of the symptom-report distribution from the start to the end of the reporting period.
This is a R package and during the package build the Stan code is compiled. To build this package, clone the repository and then call R CMD INSTALL --no-multiarch --with-keep.source $FULL_PATH_REPO_DIR
, where $FULL_PATH_REPO_DIR
is the full path to the directory where the repository was cloned to. The package require rstan
, Rcpp
, rstantools
, StanHeaders
, data.table
, moments
, R6
, matrixStats
and plotly
to be installed (all available on CRAN).