# Survey Calibration
```{r, message = FALSE, echo = FALSE}
library(dplyr)
library(survey)
library(kableExtra)   # kable_styling(), column_spec()
library(CVXR)         # Variable(), Problem(), Minimize(), solve()

## Summarize calibrated weights by school type and target status.
build_df <- function(api, method, wt_value) {
  d <- data.frame(api$stype, api$sch.wide, wt_value)
  names(d) <- c("stype", "sch.wide", "weight")
  rownames(d) <- NULL
  d <- d %>%
    group_by(stype, sch.wide) %>%
    summarize(value = first(weight), frequency = n(), .groups = "drop")
  names(d) <- c("stype", "sch.wide", paste(method, "wts."), "Frequency")
  d
}

## Display two weight summaries side by side.
build_table <- function(d1, d2, title) {
  d <- inner_join(d1, d2, by = c("stype", "sch.wide"))
  names(d) <- gsub("Frequency.x|Frequency.y", "Frequency", names(d))
  d %>%
    knitr::kable(format = "html", digits = 3, caption = title) %>%
    kable_styling("striped") %>%
    column_spec(1:6, background = "#ececec")
}
```
## Goals
- Demonstrate different metrics for calibrating survey data with `CVXR` and compare the results with the `survey` package
## Survey Calibration Problem
Calibration is a widely used technique in survey sampling. Suppose
$m$ sampling units in a survey have been assigned initial weights
$d_i$ for $i = 1,\ldots,m$, and furthermore, there are $n$ auxiliary
variables whose values in the sample are known. Calibration seeks to
improve the initial weights $d_i$ by finding new weights $w_i$ that
incorporate this auxiliary information while perturbing the initial
weights as little as possible, i.e., the ratio $g_i = w_i/d_i$ must
be close to one. Such reweighting improves the precision of estimates
(Chapter 7, @Lumley:2010).
Let $X \in {\mathbf R}^{m \times n}$ be the matrix of survey samples, with
each column corresponding to an auxiliary variable. Reweighting can be
expressed as the optimization problem (see @Davies:2016):
\[
\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^m d_i\phi(g_i) \\
\mbox{subject to} & A^Tg = r
\end{array}
\]
with respect to $g \in {\mathbf R}^m$, where $\phi:{\mathbf R} \rightarrow
{\mathbf R}$ is a strictly convex function with $\phi(1) = 0$, $r \in
{\mathbf R}^n$ are the known population totals of the auxiliary variables,
and $A \in {\mathbf R}^{m \times n}$ is related to $X$ by $A_{ij} =
d_iX_{ij}$ for $i = 1,\ldots,m$ and $j = 1,\ldots,n$.
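Before turning to specific metrics, here is a minimal generic sketch of how this problem maps onto `CVXR`. The helper `calibrate_cvxr` is ours, not part of `CVXR` or `survey`: it accepts any DCP-compliant elementwise penalty `phi` and returns the calibrated weights $w = dg$.
```{r}
## Hypothetical helper: solve the calibration problem for a DCP-compliant
## elementwise penalty `phi`, returning the calibrated weights w = d * g.
calibrate_cvxr <- function(d, X, r, phi, bounds = NULL) {
  g <- Variable(nrow(X))
  A <- d * X                           # A_ij = d_i X_ij
  constraints <- list(t(A) %*% g == r)
  if (!is.null(bounds)) {
    constraints <- c(constraints, list(g >= bounds[1], g <= bounds[2]))
  }
  result <- solve(Problem(Minimize(sum(d * phi(g))), constraints))
  d * result$getValue(g)
}
```
For example, raking (introduced next) would be `calibrate_cvxr(di, X, totals, function(g) -entr(g) - g + 1)`, using the objects defined below.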
## Raking
A common calibration technique is _raking_, which uses the penalty
function $\phi(g_i) = g_i\log(g_i) - g_i + 1$ as the calibration
metric.
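As a quick illustration (the helper `phi_raking` below is ours, not part of any package), this penalty vanishes at $g = 1$ and grows as $g$ moves away from one:
```{r}
phi_raking <- function(g) g * log(g) - g + 1
phi_raking(c(0.5, 1, 2))  # zero at g = 1, positive elsewhere
```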
We illustrate this with the California Academic Performance Index data in
the `survey` package (@lumley:2018), which also supplies facilities for
calibration via the function `calibrate`. Both the population dataset
(`apipop`) and a simple random sample of $m = 200$ (`apisrs`) are
provided. Suppose that we wish to reweight the observations in the
sample using known totals for two variables from the population:
`stype`, the school type (elementary, middle, or high), and
`sch.wide`, whether the school met its yearly target. This reweighting
would make the sample more representative of the general population.
The code below estimates the weights using `survey::calibrate`.
```{r}
data(api)
design_api <- svydesign(id = ~dnum, weights = ~pw, data = apisrs)
formula <- ~stype + sch.wide
## Known population totals of the auxiliary variables. (We avoid the
## name `T`, which shadows TRUE in R.)
totals <- colSums(model.matrix(object = formula, data = apipop))
cal_api <- calibrate(design_api, formula, population = totals,
                     calfun = cal.raking)
w_survey <- weights(cal_api)
```
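As a quick sanity check (ours, not part of the original flow), the calibrated weights should reproduce the known population totals exactly:
```{r}
## Calibration forces sum_i w_i x_ij = r_j for every auxiliary variable.
all.equal(colSums(w_survey * model.matrix(object = formula, data = apisrs)),
          totals)
```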
The `CVXR` formulation follows.
```{r}
di <- apisrs$pw
X <- model.matrix(object = formula, data = apisrs)
A <- di * X
m <- nrow(apisrs)   # number of sampling units, matching the notation above
g <- Variable(m)
constraints <- list(t(A) %*% g == totals)
## Raking penalty: phi(g) = g*log(g) - g + 1, written with entr(g) = -g*log(g).
Phi_R <- Minimize(sum(di * (-entr(g) - g + 1)))
p <- Problem(Phi_R, constraints)
res <- solve(p)
w_cvxr <- di * res$getValue(g)
```
The results are identical, as the table below shows.
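As a quick numerical check (ours, not in the original text), the two weight vectors agree to solver tolerance:
```{r}
## Maximum absolute difference between the survey and CVXR weights.
max(abs(w_survey - w_cvxr))
```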
```{r, echo = FALSE}
## Uses the helper functions defined in the unechoed preamble chunk above.
build_table(d1 = build_df(apisrs, "Survey", w_survey),
d2 = build_df(apisrs, "CVXR", w_cvxr),
title = "Calibration weights from Raking")
```
### Exercise
The quadratic penalty function
\[
\phi^{Q}(g) = \frac{1}{2}(g-1)^2
\]
is also sometimes used. Calibrate using `CVXR` and the `SCS` solver and compare with the
`survey::cal.linear` results, which can be obtained via
```{r}
w_survey_q <- weights(calibrate(design_api, formula, population = totals,
                                calfun = cal.linear))
```
#### Solution
```{r}
## Quadratic penalty: phi(g) = (g - 1)^2 / 2. The design weights di are
## constant for this simple random sample, so including them does not
## change the solution, but it matches the general objective
## sum(di * phi(g)).
Phi_Q <- Minimize(sum(di * (g - 1)^2) / 2)
p <- Problem(Phi_Q, constraints)
res <- solve(p, solver = "SCS")
w_cvxr_q <- di * res$getValue(g)
```
Solving the same problem with the default `ECOS` solver produces a
different number of unique weights; small numerical differences like
this are not unusual among solvers.
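To see this concretely (an illustrative check; `p` still holds the quadratic problem defined above):
```{r}
res_ecos <- solve(p)  # default solver, typically ECOS for this problem
w_cvxr_q_ecos <- di * res_ecos$getValue(g)
## Count distinct weights, rounded to suppress solver noise.
c(ECOS = length(unique(round(w_cvxr_q_ecos, 8))),
  SCS  = length(unique(round(w_cvxr_q, 8))))
```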
```{r, echo = FALSE}
build_table(d1 = build_df(apisrs, "Survey", w_survey_q),
d2 = build_df(apisrs, "CVXR", w_cvxr_q),
title = "Calibration weights from Quadratic metric")
```
### Exercise
Repeat the above exercise with the logistic function
\[
\phi^{L}(g; l, u) = \frac{1}{C}\biggl[ (g-l)\log\left(\frac{g-l}{1-l}\right) +
(u-g)\log\left(\frac{u-g}{u-1}\right) \biggr] \mbox{ for } C = \frac{u-l}{(1-l)(u-1)},
\]
which requires bounds $l$ and $u$ on the ratios $g_i$. Use $l=0.9$ and $u=1.1$. The results from `survey::cal.logit` can be obtained with the code below.
```{r}
u <- 1.10
l <- 0.90
w_survey_l <- weights(calibrate(design_api, formula, population = totals,
                                calfun = cal.logit, bounds = c(l, u)))
```
#### Solution
```{r}
## Logit penalty, up to constant scalings and affine terms in g; those
## terms are fixed by the constraints (the intercept column pins down
## sum(di * g)), so they do not affect the minimizer.
Phi_L <- Minimize(sum(di * (-entr((g - l) / (u - l)) -
                            entr((u - g) / (u - l)))))
p <- Problem(Phi_L, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_l <- di * res$getValue(g)
```
```{r, echo = FALSE}
build_table(d1 = build_df(apisrs, "Survey", w_survey_l),
d2 = build_df(apisrs, "CVXR", w_cvxr_l),
title = "Calibration weights from Logit metric")
```
### Exercise
Repeat with the Hellinger distance
\[
\phi^{H}(g) = 2\left(\sqrt{g} - 1\right)^2
\]
and the following results. Note that `survey` works with the
corresponding calibration function $F(u) = (1 - u/2)^{-2}$, the
inverse of the derivative of $\phi^{H}$, which `make.calfun` takes as
`Fm1` $= F - 1$ together with its derivative `dF`.
```{r}
hellinger <- make.calfun(Fm1 = function(u, bounds) (1 - u / 2)^(-2) - 1,
                         dF = function(u, bounds) (1 - u / 2)^(-3),
                         name = "Hellinger distance")
w_survey_h <- weights(calibrate(design_api, formula, population = totals,
                                calfun = hellinger))
```
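A quick check (the helper `F_hellinger` is ours, purely illustrative): the calibration function satisfies $F(0) = 1$, so a zero perturbation leaves the design weights unchanged.
```{r}
F_hellinger <- function(u) (1 - u / 2)^(-2)
F_hellinger(0)  # equals 1
```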
#### Solution
```{r}
## Hellinger penalty: (sqrt(g) - 1)^2 expanded as g - 2*sqrt(g) + 1 so
## the expression is DCP-compliant; the overall factor of 2 does not
## change the minimizer.
Phi_h <- Minimize(sum(di * (g - 2 * sqrt(g) + 1)))
p <- Problem(Phi_h, constraints)
res <- solve(p)
w_cvxr_h <- di * res$getValue(g)
```
```{r, echo = FALSE}
build_table(d1 = build_df(apisrs, "Survey", w_survey_h),
d2 = build_df(apisrs, "CVXR", w_cvxr_h),
title = "Calibration weights from Hellinger distance metric")
```
### Exercise
Lastly, use the derivative of the hyperbolic sine, i.e., the hyperbolic cosine
\[
\phi^{S}(g) = \frac{1}{2}\left(e^{g} + e^{-g}\right) = \cosh(g),
\]
along with the results from `survey::cal.sinh` produced below.
```{r}
w_survey_s <- weights(calibrate(design_api, formula, population = totals,
                                calfun = cal.sinh, bounds = c(l, u)))
```
#### Solution
```{r}
## Hyperbolic cosine penalty; the bounds keep g within [l, u].
Phi_s <- Minimize(sum(di * (exp(g) + exp(-g)) / 2))
p <- Problem(Phi_s, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_s <- di * res$getValue(g)
```
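As an illustrative check (ours), both solutions keep the ratios $g_i = w_i/d_i$ within the bounds $[l, u]$:
```{r}
range(w_survey_s / di)
range(w_cvxr_s / di)
```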
```{r, echo = FALSE}
build_table(d1 = build_df(apisrs, "Survey", w_survey_s),
d2 = build_df(apisrs, "CVXR", w_cvxr_s),
title = "Calibration weights from derivative of sinh metric")
```
## References