# Survey Calibration
```{r, message = FALSE, echo = FALSE}
library(dplyr)
library(survey)
library(kableExtra)   # kable_styling(), column_spec()
library(CVXR)         # Variable(), Problem(), Minimize(), solve()

## Summarize calibrated weights by school type and target status.
build_df <- function(api, method, wt_value) {
  d <- data.frame(api$stype, api$sch.wide, wt_value)
  names(d) <- c("stype", "sch.wide", "weight")
  rownames(d) <- NULL
  d <- d %>%
    group_by(stype, sch.wide) %>%
    summarize(value = first(weight), frequency = n(), .groups = "drop")
  names(d) <- c("stype", "sch.wide", paste(method, "wts."), "Frequency")
  d
}

## Display two weight summaries side by side.
build_table <- function(d1, d2, title) {
  d <- inner_join(d1, d2, by = c("stype", "sch.wide"))
  names(d) <- gsub("Frequency.x|Frequency.y", "Frequency", names(d))
  d %>%
    knitr::kable(format = "html", digits = 3, caption = title) %>%
    kable_styling("striped") %>%
    column_spec(1:6, background = "#ececec")
}
```
## Goals
- Demonstrate different metrics for calibrating survey data with `CVXR` and compare the results with the `survey` package
## Survey Calibration Problem
Calibration is a widely used technique in survey sampling. Suppose
$m$ sampling units in a survey have been assigned initial weights
$d_i$ for $i = 1,\ldots,m$, and furthermore, there are $n$ auxiliary
variables whose values in the sample are known. Calibration seeks to
improve the initial weights $d_i$ by finding new weights $w_i$ that
incorporate this auxiliary information while perturbing the initial
weights as little as possible, i.e., the ratio $g_i = w_i/d_i$ must
be close to one. Such reweighting improves the precision of estimates
(Chapter 7, @Lumley:2010).
Let $X \in {\mathbf R}^{m \times n}$ be the matrix of survey samples, with
each column corresponding to an auxiliary variable. Reweighting can be
expressed as the optimization problem (see @Davies:2016):
\[
\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^m d_i\phi(g_i) \\
\mbox{subject to} & A^Tg = r
\end{array}
\]
with respect to $g \in {\mathbf R}^m$, where $\phi:{\mathbf R} \rightarrow
{\mathbf R}$ is a strictly convex function with $\phi(1) = 0$, $r \in
{\mathbf R}^n$ are the known population totals of the auxiliary variables,
and $A \in {\mathbf R}^{m \times n}$ is related to $X$ by $A_{ij} =
d_iX_{ij}$ for $i = 1,\ldots,m$ and $j = 1,\ldots,n$.
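Before turning to specific metrics, here is a minimal generic sketch of how this problem maps onto `CVXR`. The helper `calibrate_cvxr` is ours, not part of `CVXR` or `survey`: it accepts any DCP-compliant elementwise penalty `phi` and returns the calibrated weights $w = dg$.
```{r}
## Hypothetical helper: solve the calibration problem for a DCP-compliant
## elementwise penalty `phi`, returning the calibrated weights w = d * g.
calibrate_cvxr <- function(d, X, r, phi, bounds = NULL) {
  g <- Variable(nrow(X))
  A <- d * X                           # A_ij = d_i X_ij
  constraints <- list(t(A) %*% g == r)
  if (!is.null(bounds)) {
    constraints <- c(constraints, list(g >= bounds[1], g <= bounds[2]))
  }
  result <- solve(Problem(Minimize(sum(d * phi(g))), constraints))
  d * result$getValue(g)
}
```
For example, raking (introduced next) would be `calibrate_cvxr(di, X, totals, function(g) -entr(g) - g + 1)`, using the objects defined below.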
## Raking
A common calibration technique is _raking_, which uses the penalty
function $\phi(g_i) = g_i\log(g_i) - g_i + 1$ as the calibration
metric.
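As a quick illustration (the helper `phi_raking` below is ours, not part of any package), this penalty vanishes at $g = 1$ and grows as $g$ moves away from one:
```{r}
phi_raking <- function(g) g * log(g) - g + 1
phi_raking(c(0.5, 1, 2))  # zero at g = 1, positive elsewhere
```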
We illustrate this with the California Academic Performance Index data in
the `survey` package (@lumley:2018), which also supplies facilities for
calibration via the function `calibrate`. Both the population dataset
(`apipop`) and a simple random sample of $m = 200$ (`apisrs`) are
provided. Suppose that we wish to reweight the observations in the
sample using known totals for two variables from the population:
`stype`, the school type (elementary, middle, or high), and
`sch.wide`, whether the school met its yearly target. This reweighting
would make the sample more representative of the general population.
The code below estimates the weights using `survey::calibrate`.
```{r}
data(api)
design_api <- svydesign(id = ~dnum, weights = ~pw, data = apisrs)
formula <- ~stype + sch.wide
## Known population totals of the auxiliary variables. (We avoid the
## name `T`, which shadows TRUE in R.)
totals <- colSums(model.matrix(object = formula, data = apipop))
cal_api <- calibrate(design_api, formula, population = totals,
                     calfun = cal.raking)
w_survey <- weights(cal_api)
```
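As a quick sanity check (ours, not part of the original flow), the calibrated weights should reproduce the known population totals exactly:
```{r}
## Calibration forces sum_i w_i x_ij = r_j for every auxiliary variable.
all.equal(colSums(w_survey * model.matrix(object = formula, data = apisrs)),
          totals)
```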
The `CVXR` formulation follows.
```{r}
di <- apisrs$pw
X <- model.matrix(object = formula, data = apisrs)
A <- di * X
m <- nrow(apisrs)   # number of sampling units, matching the notation above
g <- Variable(m)
constraints <- list(t(A) %*% g == totals)
## Raking penalty: phi(g) = g*log(g) - g + 1, written with entr(g) = -g*log(g).
Phi_R <- Minimize(sum(di * (-entr(g) - g + 1)))
p <- Problem(Phi_R, constraints)
res <- solve(p)
w_cvxr <- di * res$getValue(g)
```
The results are identical, as the table below shows.
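As a quick numerical check (ours, not in the original text), the two weight vectors agree to solver tolerance:
```{r}
## Maximum absolute difference between the survey and CVXR weights.
max(abs(w_survey - w_cvxr))
```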
```{r, echo = FALSE}
## Uses the helper functions defined in the unechoed preamble chunk above.
build_table(d1 = build_df(apisrs, "Survey", w_survey),
d2 = build_df(apisrs, "CVXR", w_cvxr),
title = "Calibration weights from Raking")
```
### Exercise
The quadratic penalty function
\[
\phi^{Q}(g) = \frac{1}{2}(g-1)^2
\]
is also sometimes used. Calibrate using `CVXR` and the `SCS` solver and compare with the
`survey::cal.linear` results, which can be obtained via
```{r}
w_survey_q <- weights(calibrate(design_api, formula, population = totals,
                                calfun = cal.linear))
```
#### Solution
```{r}
## Quadratic penalty: phi(g) = (g - 1)^2 / 2. The design weights di are
## constant for this simple random sample, so including them does not
## change the solution, but it matches the general objective
## sum(di * phi(g)).
Phi_Q <- Minimize(sum(di * (g - 1)^2) / 2)
p <- Problem(Phi_Q, constraints)
res <- solve(p, solver = "SCS")
w_cvxr_q <- di * res$getValue(g)
```
Solving the same problem with the default `ECOS` solver produces a
different number of unique weights; small numerical differences like
this are not unusual among solvers.
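To see this concretely (an illustrative check; `p` still holds the quadratic problem defined above):
```{r}
res_ecos <- solve(p)  # default solver, typically ECOS for this problem
w_cvxr_q_ecos <- di * res_ecos$getValue(g)
## Count distinct weights, rounded to suppress solver noise.
c(ECOS = length(unique(round(w_cvxr_q_ecos, 8))),
  SCS  = length(unique(round(w_cvxr_q, 8))))
```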
```{r, echo = FALSE}
build_table(d1 = build_df(apisrs, "Survey", w_survey_q),
d2 = build_df(apisrs, "CVXR", w_cvxr_q),
title = "Calibration weights from Quadratic metric")
```
### Exercise
Repeat the above exercise with the logistic function
\[
\phi^{L}(g; l, u) = \frac{1}{C}\biggl[ (g-l)\log\left(\frac{g-l}{1-l}\right) +
(u-g)\log\left(\frac{u-g}{u-1}\right) \biggr] \mbox{ for } C = \frac{u-l}{(1-l)(u-1)},
\]
which requires bounds $l$ and $u$ on the ratios $g_i$. Use $l=0.9$ and $u=1.1$. The results from `survey::cal.logit` can be obtained with the code below.
```{r}
u <- 1.10
l <- 0.90
w_survey_l <- weights(calibrate(design_api, formula, population = totals,
                                calfun = cal.logit, bounds = c(l, u)))
```
#### Solution
```{r}
## Logit penalty, up to constant scalings and affine terms in g; those
## terms are fixed by the constraints (the intercept column pins down
## sum(di * g)), so they do not affect the minimizer.
Phi_L <- Minimize(sum(di * (-entr((g - l) / (u - l)) -
                            entr((u - g) / (u - l)))))
p <- Problem(Phi_L, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_l <- di * res$getValue(g)
```
```{r, echo = FALSE}
build_table(d1 = build_df(apisrs, "Survey", w_survey_l),
d2 = build_df(apisrs, "CVXR", w_cvxr_l),
title = "Calibration weights from Logit metric")
```
### Exercise
Repeat with the Hellinger distance
\[
\phi^{H}(g) = 2\left(\sqrt{g} - 1\right)^2
\]
and the following results. Note that `survey` works with the
corresponding calibration function $F(u) = (1 - u/2)^{-2}$, the
inverse of the derivative of $\phi^{H}$, which `make.calfun` takes as
`Fm1` $= F - 1$ together with its derivative `dF`.
```{r}
hellinger <- make.calfun(Fm1 = function(u, bounds) (1 - u / 2)^(-2) - 1,
                         dF = function(u, bounds) (1 - u / 2)^(-3),
                         name = "Hellinger distance")
w_survey_h <- weights(calibrate(design_api, formula, population = totals,
                                calfun = hellinger))
```
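A quick check (the helper `F_hellinger` is ours, purely illustrative): the calibration function satisfies $F(0) = 1$, so a zero perturbation leaves the design weights unchanged.
```{r}
F_hellinger <- function(u) (1 - u / 2)^(-2)
F_hellinger(0)  # equals 1
```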
#### Solution
```{r}
## Hellinger penalty: (sqrt(g) - 1)^2 expanded as g - 2*sqrt(g) + 1 so
## the expression is DCP-compliant; the overall factor of 2 does not
## change the minimizer.
Phi_h <- Minimize(sum(di * (g - 2 * sqrt(g) + 1)))
p <- Problem(Phi_h, constraints)
res <- solve(p)
w_cvxr_h <- di * res$getValue(g)
```
```{r, echo = FALSE}
build_table(d1 = build_df(apisrs, "Survey", w_survey_h),
d2 = build_df(apisrs, "CVXR", w_cvxr_h),
title = "Calibration weights from Hellinger distance metric")
```
### Exercise
Lastly, use the derivative of the hyperbolic sine, i.e., the hyperbolic cosine
\[
\phi^{S}(g) = \frac{1}{2}\left(e^{g} + e^{-g}\right) = \cosh(g),
\]
along with the results from `survey::cal.sinh` produced below.
```{r}
w_survey_s <- weights(calibrate(design_api, formula, population = totals,
                                calfun = cal.sinh, bounds = c(l, u)))
```
#### Solution
```{r}
## Hyperbolic cosine penalty; the bounds keep g within [l, u].
Phi_s <- Minimize(sum(di * (exp(g) + exp(-g)) / 2))
p <- Problem(Phi_s, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_s <- di * res$getValue(g)
```
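As an illustrative check (ours), both solutions keep the ratios $g_i = w_i/d_i$ within the bounds $[l, u]$:
```{r}
range(w_survey_s / di)
range(w_cvxr_s / di)
```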
```{r, echo = FALSE}
build_table(d1 = build_df(apisrs, "Survey", w_survey_s),
d2 = build_df(apisrs, "CVXR", w_cvxr_s),
title = "Calibration weights from derivative of sinh metric")
```
## References