# Model-based optimisation of probability sampling designs {#MBpredictionofDesignVariance}
Chapter \@ref(Modelassisted) on model-assisted estimation explained how a linear regression model or a non-linear model obtained with a machine learning algorithm can be used to increase the precision of design-based estimates of the population mean or total using the data collected by a given probability sampling design. This chapter will explain how a model of the study variable can be used at an earlier stage to optimise probability sampling designs. It will show how a model can help in choosing between alternative sampling design types, for instance between systematic random sampling, spreading the sampling units throughout the study area, and two-stage cluster random sampling, resulting in spatial clusters of sampling units. Besides, the chapter will explain how to use a model to optimise the sample size of a given sampling design type, for instance, the number of primary and secondary sampling units with two-stage cluster random sampling. The final section of this chapter is about how a model can be used to optimise spatial strata for stratified simple random sampling.
The models used in this chapter are all geostatistical models of the spatial variation. Chapter \@ref(Introkriging) is an introduction to geostatistical modelling. Several geostatistical concepts explained in that chapter are needed here to predict the sampling variance.
A general geostatistical model of the spatial variation is
\begin{equation}
\begin{split}
Z(\mathbf{s}) &= \mu(\mathbf{s}) + \epsilon(\mathbf{s}) \\
\epsilon(\mathbf{s}) &\sim \mathcal{N}(0,\sigma^2) \\
\mathrm{Cov}(\epsilon(\mathbf{s}),\epsilon(\mathbf{s}^{\prime})) &= C(\mathbf{h}) \;,
\end{split}
(\#eq:geostatmodel)
\end{equation}
with $Z(\mathbf{s})$ the study variable at location $\mathbf{s}$, $\mu(\mathbf{s})$ the mean at location $\mathbf{s}$, $\epsilon(\mathbf{s})$ the residual at location $\mathbf{s}$, and $C(\mathbf{h})$ the covariance of the residuals at two locations separated by vector $\mathbf{h} = \mathbf{s}-\mathbf{s}^{\prime}$. The residuals are assumed to have a normal distribution with zero mean and a constant variance $\sigma^2$ ($\mathcal{N}(0,\sigma^2)$).
The model of the spatial variation has several parameters. In case of a model in which the mean is a linear combination of covariates, these are the regression coefficients associated with the covariates and the parameters of a semivariogram describing the spatial dependence of the residuals. A semivariogram is a model for half the expectation of the squared difference of the study variable or the residuals of a model at two locations, referred to as the semivariance, as a function of the length (and direction) of the vector separating the two locations (Chapter \@ref(Introkriging)).
Using the model to predict the sampling variance of a design-based estimator of a population mean requires prior knowledge of the semivariogram. When data from the study area of interest are available, these data can be used to choose a semivariogram model and to estimate the parameters of the model. If no such data are available, we must make a best guess, based on data collected in other areas. In all cases I recommend keeping the model as simple as possible.
## Model-based optimisation of sampling design type and sample size
Chapter \@ref(RequiredSampleSize) presented methods and formulas for computing the required sample size given various measures to quantify the quality of the survey result. These required sample sizes are for simple random sampling. For other types of sampling design, the required sample size can be approximated by multiplying the required sample size for simple random sampling by the design effect, see Section \@ref(DesignEffect). An alternative is to use a model of the spatial variation to predict the sampling variance of the estimator of the mean for the type of sampling design under study and a range of sample sizes, to plot the predicted variance, or standard error, against the sample size, and to use this plot inversely to derive the required sample size given a constraint on the sampling variance or standard error.
The computed required sample size applies to several given parameters of the sampling design. For instance, for stratified random sampling, the sample size is computed for a given stratification and sample size allocation scheme, for cluster random sampling for given clusters, and for two-stage cluster random sampling for given primary sampling units (PSUs) and the number of secondary sampling units (SSUs) selected per PSU draw. However, the model can also be used to optimise these sampling design parameters. For stratified random sampling, the optimal allocation can be computed by predicting the population standard deviations within strata and using these predicted standard deviations in Equation \@ref(eq:optallocation). Even the stratification can be optimised, see Section \@ref(Ospats). If we have a costs model for two-stage cluster random sampling, the number of PSU draws and the number of SSUs per PSU draw can be optimised.
Model-based prediction of the sampling variance can also be useful to compare alternative types of sampling design at equal total costs or equal variance of the estimated population mean or total. For instance, systematic random sampling, which leads to good spatial coverage, can be compared with two-stage cluster random sampling, which results in spatial clusters of observations.
Three approaches for model-based prediction of the sampling variance of a design-based estimator of the population mean, or total, are described: the analytical approach (Subsection \@ref(AnalyticalApproach)), the geostatistical simulation approach (Subsection \@ref(GeostatisticalSimulationApproach)), and the Bayesian approach (Subsection \@ref(MBpredSamplingVarBayes)). In the analytical approach we assume that the mean, $\mu(\mathbf{s})$ in Equation \@ref(eq:geostatmodel), is everywhere the same. This assumption is relaxed in the geostatistical simulation approach. This simulation approach can also accommodate models in which the mean is a linear combination of covariates and/or spatial coordinates to predict the sampling variance. Furthermore, this simulation approach can be used to predict the sampling variance of the estimator of the mean of a trans-Gaussian variable, i.e., a random variable that can be transformed to a Gaussian variable.
The predicted sampling variances of the estimated population mean obtained with the analytical and the geostatistical simulation approach are conditional on the model of the spatial variation. Uncertainty about this model is not accounted for. In contrast, in the Bayesian approach, we do account for our uncertainty about the assumed model, and we analyse how this uncertainty propagates to the sampling variance of the estimator of the mean.
In some publications, the sampling variance as predicted with a statistical model is referred to as the anticipated variance\index{Anticipated variance} (@sar92, p. 450).
### Analytical approach {#AnalyticalApproach}
In the analytical approach, the sampling variance of the estimator of the mean is derived from mean semivariances within the study area\index{Mean semivariance!within a study area} and mean semivariances within the sample\index{Mean semivariance!within a sample}. Assuming isotropy, the mean semivariances are a function of the separation distance between pairs of points.
The sampling variance of the $\pi$ estimator of the population mean can be predicted by (@dom94, @gru06)
\begin{equation}
E_{\xi}\{V_p(\hat{\bar{z}})\}=\bar{\gamma}-E_p(\pmb{\lambda}^{\mathrm{T}}\pmb{\Gamma}_{\mathcal{S}}\pmb{\lambda}) \;,
(\#eq:MBdesignvarAnydesign)
\end{equation}
where $E_{\xi}(\cdot)$ is the statistical expectation over realisations from the model $\xi$, $E_{p}(\cdot)$ is the statistical expectation over repeated sampling with sampling design $p$, $V_{p}(\hat{\bar{z}})$ is the variance of the $\pi$ estimator of the population mean over repeated sampling with sampling design $p$, $\bar{\gamma}$ is the mean semivariance of the random variable at two randomly selected locations in the study area, $\pmb{\lambda}$ is the vector of design-based weights of the units of a sample selected with design $p$, and $\pmb{\Gamma}_{\mathcal{S}}$ is the matrix of semivariances between the units of a sample $\mathcal{S}$ selected with design $p$.
The mean semivariance $\bar{\gamma}$ is a model-based prediction of the population variance, or spatial variance, i.e., the model-expectation of the population variance:
\begin{equation}
\bar{\gamma} = E_{\xi}\{S^2(z)\}\:.
(\#eq:meansemivariance)
\end{equation}
The mean semivariance $\bar{\gamma}$ can be calculated by discretising the study area by a fine square grid, computing the matrix with geographical distances between the discretisation nodes, transforming this into a semivariance matrix, and computing the average of all elements of the semivariance matrix. The second term $E_p(\pmb{\lambda}^{\mathrm{T}}\pmb{\Gamma}_{\mathcal{S}}\pmb{\lambda})$ can be evaluated by Monte Carlo simulation, repeatedly selecting a sample according to design $p$, calculating $\pmb{\lambda}^{\mathrm{T}}\pmb{\Gamma}_{\mathcal{S}}\pmb{\lambda}$, and averaging.
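As an illustration of this generic procedure, the Monte Carlo evaluation of $E_p(\pmb{\lambda}^{\mathrm{T}}\pmb{\Gamma}_{\mathcal{S}}\pmb{\lambda})$ could be sketched as follows. This is a minimal sketch only, assuming simple random sampling (for which all design-based weights equal $1/n$) and assuming a matrix `mygrid` with the coordinates of the discretisation nodes, a fitted semivariogram `vgm_MoM`, and a mean semivariance `m_semivar_field`, like the objects constructed later in this section:

```{r, eval = FALSE}
n <- 25; n_MC <- 1000
#design-based weights for simple random sampling
lambda <- rep(1 / n, n)
lGl <- numeric(length = n_MC)
for (i in 1:n_MC) {
  #select a simple random sample of discretisation nodes
  units <- sample(nrow(mygrid), size = n, replace = TRUE)
  #matrix with distances between the selected nodes
  H_S <- as.matrix(dist(mygrid[units, ]))
  #transform the distance matrix into a semivariance matrix
  Gamma_S <- variogramLine(vgm_MoM, dist_vector = H_S)
  lGl[i] <- drop(t(lambda) %*% Gamma_S %*% lambda)
}
#predicted sampling variance for design p
Exi_V_p <- m_semivar_field - mean(lGl)
```

For more complex designs, only the selection of the sample and the vector `lambda` change; the rest of the procedure stays the same.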
```{block2, type='rmdnote'}
The semivariance at zero distance (same location) is 0, so the diagonal of a semivariance matrix is filled with zeroes. If a semivariogram with nugget is assumed, the zeroes on the diagonal must be replaced by the nugget to compute $\bar{\gamma}$. The same holds for the diagonal zeroes in $\pmb{\Gamma}_{\mathcal{S}}$.
```
This generic procedure is still computationally demanding, but it is the only option for complex spatial sampling designs. For basic sampling designs, the general formula can be worked out. For simple random sampling, the sampling variance can be predicted by
\begin{equation}
E_{\xi}\{V_\mathrm{SI}(\hat{\bar{z}})\}=\bar{\gamma }/n \;,
(\#eq:MBdesignvarSI)
\end{equation}
and for stratified simple random sampling by
\begin{equation}
E_{\xi}\{V_\mathrm{STSI}(\hat{\bar{z}})\}=\sum_{h=1}^H w^2_h \bar{\gamma}_h/n_h \;,
(\#eq:MBdesignvarSTSI)
\end{equation}
with $H$ the number of strata, $w_h$ the stratum weight (relative size), $\bar{\gamma}_h$ the mean semivariance of stratum $h$, and $n_h$ the number of sampling units of stratum $h$.
```{block2, type='rmdnote'}
The model-based predictions of the variances within the strata, $\bar{\gamma}_h$, can also be used to compute the sample sizes for Neyman allocation, which are the optimal sample sizes when the mean costs per unit are equal for the strata. To compute these sample sizes, the standard deviations $S_h(z)$ in Equation \@ref(eq:Neymanallocation) are replaced by $\sqrt{\bar{\gamma}_h}$.
```
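For example, replacing $S_h(z)$ by $\sqrt{\bar{\gamma}_h}$ as described in the note above, Neyman allocation can be sketched as follows. The stratum weights and mean semivariances below are hypothetical, not taken from the case study:

```{r}
#hypothetical stratum weights and predicted mean semivariances
w_h <- c(0.5, 0.3, 0.2)
m_semivar_h <- c(400, 900, 1600)
#total sample size to be allocated
n <- 50
#Neyman allocation, with sqrt of mean semivariance as predicted stratum standard deviation
n_h <- round(n * w_h * sqrt(m_semivar_h) / sum(w_h * sqrt(m_semivar_h)))
```

With these hypothetical inputs the allocation is 19, 17, and 15 points. Due to rounding, the allocated sample sizes need not sum exactly to $n$.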
For systematic random sampling, i.e., sampling on a randomly placed grid, the variance can be predicted by
\begin{equation}
E_{\xi}\{V_\mathrm{SY}(\hat{\bar{z}})\}=\bar{\gamma} - E_{\mathrm{SY}}(\bar{\gamma}_{\mathrm{SY}}) \;,
(\#eq:MBdesignvarSY)
\end{equation}
with $E_{\mathrm{SY}}(\bar{\gamma}_{\mathrm{SY}})$ the design-expectation, i.e., the expectation over repeated systematic sampling, of the mean semivariance within the grid. With systematic random sampling, the number of grid points within the study area can vary among samples, as well as the spatial pattern of the points (Chapter \@ref(SY)). Therefore, multiple systematic random samples must be selected, and the average of the mean semivariance within the systematic sample must be computed.
The analytical approach is illustrated with the data of agricultural field Leest [@HofmanBrus2021]. Nitrate-N (NO$_3$-N) in kg ha^-1^ in the layer $0-90$ cm, using a standard soil density of 1,500 kg m^-3^, is measured at 30 points. In the next code chunk, these data are used to compute a sample semivariogram using the method-of-moments with function `variogram`, see Chapter \@ref(Introkriging). A spherical model without nugget is fitted to the sample semivariogram using function `fit.variogram`. The numbers in this plot are the numbers of pairs of points used to compute the semivariances.
(ref:VariogramLeestlabel) Sample semivariogram and fitted spherical model for NO~3~-N in field Leest. The numbers refer to point-pairs used in computing semivariances.
```{r}
library(gstat)
coordinates(sampleLeest) <- ~ s1 + s2
vg <- variogram(N ~ 1, data = sampleLeest)
vgm_MoM <- fit.variogram(
vg, model = vgm(model = "Sph", psill = 2000, range = 20))
```
```{r VariogramLeest, echo = FALSE, fig.asp=0.7, fig.cap="(ref:VariogramLeestlabel)"}
fitted <- variogramLine(vgm_MoM, maxdist = 50, n = 1000)
ggplot(data = vg) +
geom_point(mapping = aes(x = dist, y = gamma), size = 2) +
geom_text(mapping = aes(x = dist, y = gamma, label = np), nudge_x = 1.2) +
  geom_line(data = fitted, mapping = aes(x = dist, y = gamma), colour = "red") +
scale_x_continuous(name = "Distance (m)") +
scale_y_continuous(name = "Semivariance", limits = c(0, 1250))
```
The few data lead to a very noisy sample semivariogram. For the moment, I ignore my uncertainty about the semivariogram parameters; Subsection \@ref(MBpredSamplingVarBayes) will show how we can account for our uncertainty about the semivariogram parameters in model-based prediction of the sampling variance. A spherical semivariogram model\index{Semivariogram model} without nugget is fitted to the sample semivariogram, i.e., the intercept is 0. The fitted range of the model is `r formatC(vgm_MoM$range, 0, format = "f")` m, and the fitted sill equals `r formatC(vgm_MoM$psill, 0, format = "f")` (kg ha^-1^)^2^. The fitted semivariogram is used to predict the sampling variance for three sampling designs: simple random sampling, stratified simple random sampling, and systematic random sampling. The costs of these three design types will be about equal, because the study area is small, so that the access time of the sampling points is about equal for the three designs. The sample size of the evaluated sampling designs is 25 points. Because with systematic random sampling the number of points varies among the samples, this design has an *expected* sample size of 25 points.
#### Simple random sampling {-}
For simple random sampling, we must compute the mean semivariance within the field (Equation \@ref(eq:MBdesignvarSI)). As shown in the next code chunk, the mean semivariance is approximated by discretising the field by a square grid of 2,000 points, computing the 2,000 $\times$ 2,000 matrix with distances between all pairs of discretisation nodes, transforming this distance matrix into a semivariance matrix using function `variogramLine` of package **gstat** [@peb04], and finally averaging the semivariances. Note that in this case we do not need to replace the zeroes on the diagonal of the semivariance matrix by the nugget, as a model without nugget is fitted. The geopackage file is read with function `read_sf` of package **sf** [@sf], resulting in a simple feature object. The projection attributes of this object are removed with function `st_set_crs`. The centres of square grid cells are selected with function `st_make_grid`. The centres inside the field are selected with `mygrid[field]`, and finally the coordinates are extracted with function `st_coordinates`.
```{r ExiVarSILeest}
field <- read_sf(system.file("extdata/leest.gpkg", package = "sswr")) %>%
st_set_crs(NA_crs_)
mygrid <- st_make_grid(field, cellsize = 2, what = "centers")
mygrid <- mygrid[field] %>%
  st_coordinates()
H <- as.matrix(dist(mygrid))
G <- variogramLine(vgm_MoM, dist_vector = H)
m_semivar_field <- mean(G)
n <- 25
Exi_V_SI <- m_semivar_field / n
```
The model-based prediction of the sampling variance of the estimator of the mean with this design equals `r formatC(Exi_V_SI, 1, format = "f")`.
#### Stratified simple random sampling {-}
The strata of the stratified simple random sampling design are compact geographical strata\index{Compact geographical stratum} of equal size (Section \@ref(geostrata)). The number of geostrata is equal to the sample size, 25 points, so that we have one point per stratum. With this design, the sampling points are reasonably well spread over the field, but not as well as with systematic random sampling. To predict the sampling variance, we must compute the mean semivariances within the geostrata, see Equation \@ref(eq:MBdesignvarSTSI). Note that the stratum weights are constant as the strata have equal size, $w_h = 1/n$, and that $n_h=1$. Therefore, Equation \@ref(eq:MBdesignvarSTSI) reduces to
\begin{equation}
E_{\xi}\{V_\mathrm{STSI}(\hat{\bar{z}})\}=\frac{1}{n^2}\sum_{h=1}^H \bar{\gamma}_h \;.
(\#eq:MBdesignvarSTSI2)
\end{equation}
The next code chunk shows the computation of the mean semivariance per stratum and the model-based prediction of the sampling variance of the estimator of the mean. The matrix with the coordinates is first converted to a tibble with function `as_tibble`.
```{r ExiVarSTSILeest}
library(spcosa)
mygrid <- mygrid %>%
as_tibble() %>%
setNames(c("x1", "x2"))
gridded(mygrid) <- ~ x1 + x2
mygeostrata <- stratify(mygrid, nStrata = n, equalArea = TRUE, nTry = 10) %>%
as("data.frame")
m_semivar_geostrata <- numeric(length = n)
for (i in 1:n) {
ids <- which(mygeostrata$stratumId == (i - 1))
mysubgrd <- mygeostrata[ids, ]
H_geostratum <- as.matrix(dist(mysubgrd[, c(2, 3)]))
G_geostratum <- variogramLine(vgm_MoM, dist_vector = H_geostratum)
m_semivar_geostrata[i] <- mean(G_geostratum)
}
Exi_V_STSI <- sum(m_semivar_geostrata) / n^2
```
The model-based prediction of the sampling variance with this design equals `r formatC(Exi_V_STSI, 1, format = "f")` (kg ha^-1^)^2^, which is much smaller than with simple random sampling. The large stratification effect\index{Stratification effect} can be explained by the assumed strong spatial structure of NO$_3$-N in the agricultural field and the improved geographical spreading of the sampling points, see Figure \@ref(fig:VariogramLeest).
#### Systematic random sampling {-}
To predict the sampling variance for systematic random sampling with an expected sample size of 25 points, we must compute the design-expectation of the mean semivariance within the systematic sample (Equation \@ref(eq:MBdesignvarSY)). As shown in the next code chunk, I approximated this expectation by selecting 100 systematic random samples, computing the mean semivariance for each sample, and averaging. Finally, the model-based prediction of the sampling variance is computed by subtracting the average of the mean semivariances within a systematic sample from the mean semivariance within the field computed above.
```{r}
set.seed(314)
m_semivar_SY <- numeric(length = 100)
for (i in 1:100) {
mySYsample <- spsample(x = mygrid, n = n, type = "regular") %>%
as("data.frame")
H_SY <- as.matrix(dist(mySYsample))
G_SY <- variogramLine(vgm_MoM, dist_vector = H_SY)
m_semivar_SY[i] <- mean(G_SY)
}
Exi_V_SY <- m_semivar_field - mean(m_semivar_SY)
```
The model-based prediction of the sampling variance of the estimator of the mean with this design equals `r formatC(Exi_V_SY, 1, format = "f")` (kg ha^-1^)^2^, which is smaller than that of stratified simple random sampling. This can be explained by the improved geographical spreading of the sampling points with systematic random sampling as compared to stratified simple random sampling with compact geographical strata.
#### Bulking soil aliquots into a composite sample
If the soil aliquots collected at the points of the stratified random sample are bulked into a composite, as is usually done in soil testing of agricultural fields, the procedure for predicting the variance of the estimator of the mean is slightly different. Only the composite sample\index{Composite sampling} is analysed in the laboratory for NO$_3$-N, not the individual soil aliquots\index{Soil aliquot}. This implies that the contribution of the measurement error\index{Measurement error} to the total uncertainty about the population mean is larger. To predict the sampling variance in this situation, we need the semivariogram of errorless measurements of NO$_3$-N, i.e., of the true NO$_3$-N contents of soil aliquots collected at points. The sill of this semivariogram will be smaller than the sill of the semivariogram of the measured NO$_3$-N data. A simple option is to subtract an estimate of the measurement error variance from the nugget of the semivariogram fitted to the measured NO$_3$-N data, which contain a measurement error. This may lead to a negative nugget, which is not allowed (a variance cannot be negative). The preferable alternative is to add the measurement error variance to the diagonal of the covariance matrix of the data when fitting the model by maximum likelihood, see function `ll` in Subsection \@ref(MBpredSamplingVarBayes).
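The simple option can be sketched as follows. The measurement error variance and the semivariogram parameters below are hypothetical, chosen for illustration only:

```{r}
#hypothetical measurement error variance, (kg ha^-1)^2
v_me <- 100
#hypothetical semivariogram of the measured data, with nugget
vgm_meas <- vgm(model = "Sph", nugget = 150, psill = 816, range = 45)
#subtract the measurement error variance from the nugget
nugget_true <- 150 - v_me
#a negative nugget is not allowed
stopifnot(nugget_true >= 0)
#semivariogram of errorless measurements
vgm_true <- vgm(model = "Sph", nugget = nugget_true, psill = 816, range = 45)
```

When the assumed measurement error variance exceeds the fitted nugget, this check fails, and the maximum likelihood alternative described above should be used.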
#### Exercises {-}
1. Write an **R** script to predict the sampling variance of the estimator of the mean of NO$_3$-N of agricultural field Leest, for simple random sampling and a sample size of 25 points. In the prediction, use a spherical semivariogram with a nugget of 483, a partial sill of 483, and a range of 44.6 m. The sum of the nugget and partial sill (966) is equal to the sill of the semivariogram used above in predicting sampling variances. Compare the predicted sampling variance with the predicted sampling variance for the same sampling design, obtained with the semivariogram without nugget. Explain the difference.
2. Write an **R** script to compute the required sample size for simple random sampling of agricultural field Leest, for a maximum length of a 95\% confidence interval of 20. Use the semivariogram without nugget in predicting the sampling variance. See Section \@ref(ReqSampleSizeLengthCI) (Equation \@ref(eq:nreqwidthCI)) for how to compute the required sample size given a prior estimate of the standard deviation of the study variable in the population.
3. Do the same for systematic random sampling. Note that for this sampling design, no such formula is available. Predict for a series of *expected* sample sizes, $n = 5,6,\dots , 40$, the sampling variance of the estimator of the mean, using Equation \@ref(eq:MBdesignvarSY). Approximate $E_{\mathrm{SY}}(\bar{\gamma}_{\mathrm{SY}})$ from ten repeated selections. Compute the length of the confidence interval from the predicted sampling variances, and plot the interval length against the sample size. Finally, determine the required sample size for a maximum length of 20. What is the design effect for an expected sample size of 34 points (the required sample size for simple random sampling)? See Equation \@ref(eq:designeffect). Also compute the design effect for expected sample sizes of $5,6,\dots , 40$. Explain why the design effect is not constant.
### Geostatistical simulation approach {#GeostatisticalSimulationApproach}
The alternative to the analytical approach is to use a geostatistical simulation\index{Geostatistical simulation} approach. It is computationally more demanding, but an advantage of this approach is its flexibility. It can also be used to predict the sampling variance of the estimator of the mean using a geostatistical model with a non-constant mean. Moreover, this approach can handle trans-Gaussian variables\index{Trans-Gaussian variable}, i.e., variables whose distribution can be transformed into a normal distribution. In Subsection \@ref(MBpredSamplingVarBayes), geostatistical simulation is used to predict the variance of the estimator of the mean of a lognormal variable.
The geostatistical simulation approach for predicting the sampling variance of a design-based estimator of the population mean involves the following steps:
1. Select a large number $S$ of random samples with sampling design $p$.
2. Use the model to simulate values of the study variable for all sampling points.
3. Estimate for each sample the population mean, using the design-based estimator of the population mean for sampling design $p$. This results in $S$ estimated population means.
4. Compute the variance of the $S$ estimated means.
5. Repeat steps 1 to 4 $R$ times, and compute the mean of the $R$ variances.
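The five steps can be sketched in a self-contained way. The sketch below assumes simple random sampling from a unit square, a constant mean, and a hypothetical exponential covariance function, so that values at the sampling points can be simulated directly via a Cholesky decomposition instead of with **gstat**; all parameter values are illustrative:

```{r}
set.seed(314)
S <- 200; R <- 10; n <- 25
#hypothetical sill and distance parameter of exponential covariance function
sill <- 1; phi <- 0.2
v_mz <- numeric(length = R)
for (r in 1:R) {
  mz <- numeric(length = S)
  for (s in 1:S) {
    #step 1: select a simple random sample of n points from the unit square
    pts <- cbind(runif(n), runif(n))
    #step 2: simulate values at the sampling points
    Cmat <- sill * exp(-as.matrix(dist(pts)) / phi)
    z <- drop(t(chol(Cmat)) %*% rnorm(n))
    #step 3: design-based estimate of the mean (sample mean for SI)
    mz[s] <- mean(z)
  }
  #step 4: variance of the S estimated means
  v_mz[r] <- var(mz)
}
#step 5: average of the R variances
Exi_V_SI_sim <- mean(v_mz)
```

For other sampling designs, only steps 1 and 3 change: the sample selection and the design-based estimator must match the design $p$ under study.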
This approach is illustrated with the western part of the Amhara region in Ethiopia (hereafter referred to as West-Amhara), where a large sample is available with organic matter data in the topsoil (SOM) in decagram per kg dry soil (dag kg^-1^; 1 decagram = 10 gram). The soil samples are collected along roads (see Figure \@ref(fig:spatialinfillEthiopia)). It is a convenience sample\index{Convenience sample}, not a probability sample, so these sample data cannot be used in design-based or model-assisted estimation of the mean or total soil carbon stock in the study area. However, the data can be used to model the spatial variation of the SOM concentration, and this geostatistical model\index{Geostatistical model} can then be used to design a probability sample for design-based estimation of the total mass of SOM. Apart from the point data of the SOM concentration, maps of covariates are available, such as a digital elevation model and remote sensing reflectance data. In the next code chunk, four covariates are selected to model the mean of the SOM concentration: elevation (dem), average near infrared reflectance (rfl-NIR), average red reflectance (rfl-red), and average land surface temperature (lst). I assume a normal distribution for the residuals of the linear model. The model parameters are estimated by restricted maximum likelihood\index{Restricted maximum likelihood estimation} (REML), using package **geoR** [@geoR], see Subsection \@ref(REML) for details on REML estimation of a geostatistical model. As a first step, the projected coordinates of the sampling points are changed from m to km using function `mutate`; with coordinates in m, function `likfit` could not find an optimal estimate for the range.
```{r}
library(geoR)
sampleAmhara <- sampleAmhara %>%
mutate(s1 = s1 / 1000, s2 = s2 / 1000)
dGeoR <- as.geodata(obj = sampleAmhara, header = TRUE,
coords.col = c("s1", "s2"), data.col = "SOM",
covar.col = c("dem", "rfl_NIR", "rfl_red", "lst"))
vgm_REML <- likfit(geodata = dGeoR,
trend = ~ dem + rfl_NIR + rfl_red + lst,
cov.model = "spherical", ini.cov.pars = c(1, 50), nugget = 0.2,
lik.method = "REML", messages = FALSE)
```
The estimated parameters of the residual semivariogram of the SOM concentration are shown in Table \@ref(tab:VariogramREMLEthiopia). The estimated regression coefficients are `r formatC(vgm_REML$beta[1], 1, format = "f")` for the intercept, `r formatC(vgm_REML$beta[2], 3, format = "f")` for elevation (dem), `r formatC(vgm_REML$beta[3], 2, format = "f")` for NIR reflectance, `r formatC(vgm_REML$beta[4], 2, format = "f")` for red reflectance, and `r formatC(vgm_REML$beta[5], 3, format = "f")` for land surface temperature.
Package **gstat** is used for geostatistical simulation, and therefore first the REML estimates of the semivariogram parameters are passed to function `vgm` using arguments `nugget`, `psill`, and `range` of function `vgm` of this package.
```{r}
vgm_REML_gstat <- vgm(model = "Sph", nugget = vgm_REML$tausq,
psill = vgm_REML$sigmasq, range = vgm_REML$phi)
```
The fitted model of the spatial variation of the SOM concentration is used to compare systematic random sampling and two-stage cluster random sampling at equal variances of the estimator of the mean.
#### Systematic random sampling {-}
One hundred systematic random samples ($S=100$) with an expected sample size of 50 points ($E[n]=50$) are selected. The four covariates at the selected sampling points are extracted by overlaying the `SpatialPointsDataFrame` `mySYsamples` and the `SpatialPixelsDataFrame` `grdAmhara` with function `over` of package **sp** [@Pebesma2005]. Values at the sampling points are simulated by sequential Gaussian simulation\index{Sequential Gaussian simulation} [@goo97], using function `krige` with argument `nsim = 1` of package **gstat**. Argument `dummy` is set to `TRUE` to enforce unconditional simulation\index{Geostatistical simulation!unconditional}.
```{block2, type='rmdnote'}
The alternative is conditional simulation\index{Geostatistical simulation!conditional}, using the data of the convenience sample as conditioning data. Conditional simulation is only recommended if the quality of these legacy data is sufficient, and we may trust that the study variable at the legacy points has not changed since these legacy data were collected.
```
Note the order of operations: first the 100 samples are drawn, and then values of $z$ are simulated at the selected sampling points. Had we first simulated values of $z$ at the nodes of a discretisation grid and then selected the samples and overlaid them with the simulated field, all points in the same discretisation cell would have received the same value. With the chosen order, the simulated values of points in the same discretisation cell differ, so that we account for the infinite number of points in the population.
With systematic random sampling, the sample mean is an approximately unbiased estimator of the population mean (Chapter \@ref(SY)). Therefore, for each sample the mean of the simulated values is computed, using function `tapply`. Finally, the variance of the 100 sample means is computed. This is a conditional variance, i.e., conditional on the simulated values. In the code chunk below, the whole procedure is repeated 100 times ($R=100$), leading to 100 conditional variances of sample means.
```{r MonteCarloEthiopia, eval = FALSE}
grdAmhara <- grdAmhara %>%
mutate(s1 = s1 / 1000, s2 = s2 / 1000)
gridded(grdAmhara) <- ~ s1 + s2
S <- R <- 100
v_mzsim_SY <- numeric(length = R)
set.seed(314)
for (i in 1:R) {
mySYsamples <- NULL
for (j in 1:S) {
xy <- spsample(x = grdAmhara, n = 50, type = "regular")
mySY <- data.frame(
s1 = xy$x1, s2 = xy$x2, sample = rep(j, length(xy)))
mySYsamples <- rbind(mySYsamples, mySY)
}
coordinates(mySYsamples) <- ~ s1 + s2
res <- over(mySYsamples, grdAmhara)
mySYs <- data.frame(mySYsamples, res[, c("dem", "rfl_NIR", "rfl_red", "lst")])
coordinates(mySYs) <- ~ s1 + s2
zsim <- krige(
dummy ~ dem + rfl_NIR + rfl_red + lst,
locations = mySYs, newdata = mySYs,
model = vgm_REML_gstat, beta = vgm_REML$beta,
nmax = 20, nsim = 1,
dummy = TRUE,
debug.level = 0) %>% as("data.frame")
m_zsim <- tapply(zsim$sim1, INDEX = mySYs$sample, FUN = mean)
v_mzsim_SY[i] <- var(m_zsim)
}
grdAmhara <- as_tibble(grdAmhara)
```
```{r, eval = FALSE, echo = FALSE}
write_rds(v_mzsim_SY, file = "results/MBpredVariance_SY_Amhara.rds")
```
```{r, echo = FALSE}
v_mzsim_SY <- read_rds(file = "results/MBpredVariance_SY_Amhara.rds")
```
```{r}
Exi_vmz_SY <- mean(v_mzsim_SY)
```
The mean of the 100 conditional variances equals `r formatC(Exi_vmz_SY, 3, format = "f")` (dag kg^-1^)^2^. This is a Monte Carlo approximation of the model-based prediction of the sampling variance of the ratio estimator of the mean for systematic random sampling with an expected sample size of 50.
#### Two-stage cluster random sampling {-}
Due to the geographical spreading of the sampling points with systematic random sampling, the accuracy of the estimated mean is expected to be high compared to that of other sampling designs of the same size. However, with large areas, the time needed for travelling to the sampling points can become substantial, lowering the sampling efficiency. With large areas, sampling designs leading to spatial clusters of sampling points can be an attractive alternative. One option then is two-stage cluster random sampling, see Chapter \@ref(Twostage). The question is whether this alternative design is more efficient than systematic random sampling.
In the next code chunk, 100 compact geostrata (see Section \@ref(geostrata)) are computed for West-Amhara. Here, these geostrata are not used as strata in stratified random sampling, but as PSUs in two-stage cluster random sampling. The difference is that in stratified random sampling at least one sampling unit is selected from each geostratum, whereas in two-stage cluster random sampling only a randomly selected subset of the geostrata is sampled. The compact geostrata, used as PSUs, are computed with function `kmeans`, and as a consequence the PSUs are not of equal size, see the output of the code chunk below. Equal sizes are not required in two-stage cluster random sampling, see Chapter \@ref(Twostage). If PSUs of equal size are preferred, these can be computed with function `stratify` of package **spcosa** with argument `equalArea = TRUE`, see Section \@ref(geostrata).
<!--Projection attributes of the file with covariates: crs <- st_crs("+proj=laea +ellps=WGS84 +lat_0=5 +lon_0=20 +no_defs")-->
```{r}
set.seed(314)
res <- kmeans(
grdAmhara[, c("s1", "s2")], iter.max = 1000, centers = 100, nstart = 100)
mypsus <- res$cluster
psusize <- as.numeric(table(mypsus))
summary(psusize)
```
In the next code chunks, I assume that the PSUs are selected with probabilities proportional to their size and with replacement (ppswr sampling), see Chapter \@ref(Twostage). In Section \@ref(twostagesamplingestimators) formulas are presented for computing the optimal number of PSU draws and SSU draws per PSU draw. The optimal sample sizes are a function of the pooled variance of PSU means, $S^2_{\mathrm{b}}$, and the pooled variance of secondary units (points) within the PSUs, $S^2_{\mathrm{w}}$. In the current subsection, these variance components are predicted with the geostatistical model.
As a first step, a large number of maps are simulated.
```{r}
grdAmhara$psu <- mypsus
coordinates(grdAmhara) <- ~ s1 + s2
set.seed(314)
zsim <- krige(
dummy ~ dem + rfl_NIR + rfl_red + lst,
locations = grdAmhara, newdata = grdAmhara,
model = vgm_REML_gstat, beta = vgm_REML$beta,
nmax = 20, nsim = 1000,
dummy = TRUE, debug.level = 0) %>% as("data.frame")
zsim <- zsim[, -c(1, 2)]
```
For each simulated field, the means of the PSUs and the variances of the simulated values within the PSUs are computed using function `tapply` in function `apply`.
```{r}
m_zsim_psu <- apply(zsim, MARGIN = 2, FUN = function(x)
tapply(x, INDEX = grdAmhara$psu, FUN = mean))
v_zsim_psu <- apply(zsim, MARGIN = 2, FUN = function(x)
tapply(x, INDEX = grdAmhara$psu, FUN = var))
```
Next, for each simulated field, the pooled variance of the PSU means and the pooled variance within the PSUs are computed, and finally these pooled variances are averaged over all simulated fields. The averages are approximations of the model-expectations of the pooled between-unit and within-unit variances, $E_{\xi}[S^2_{\mathrm{b}}]$ and $E_{\xi}[S^2_{\mathrm{w}}]$.
```{r}
p_psu <- psusize / sum(psusize)
S2b <- apply(m_zsim_psu, MARGIN = 2, FUN = function(x)
sum(p_psu * (x - sum(p_psu * x))^2))
S2w <- apply(v_zsim_psu, MARGIN = 2, FUN = function(x)
sum(p_psu * x))
Exi_S2b <- mean(S2b)
Exi_S2w <- mean(S2w)
```
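The two pooled variances form a variance decomposition: with population-style variances (divisor $N_h$ rather than $N_h-1$), the size-weighted between-PSU variance and the pooled within-PSU variance sum exactly to the overall population variance, by the law of total variance. A minimal sketch with toy data (all values below are hypothetical, not those of the case study):

```{r}
# Sketch: size-weighted variance decomposition (law of total variance),
# using population-style variances (divisor N, not N - 1). Toy data.
z_toy <- c(1, 3, 5, 2, 8, 9, 4)     # hypothetical point values
psu_toy <- c(1, 1, 1, 2, 2, 2, 2)   # hypothetical PSU labels
popvar <- function(x) mean((x - mean(x))^2)
p_toy <- as.numeric(table(psu_toy)) / length(z_toy)    # PSU size weights
m_toy <- tapply(z_toy, INDEX = psu_toy, FUN = mean)    # PSU means
v_toy <- tapply(z_toy, INDEX = psu_toy, FUN = popvar)  # within-PSU variances
S2b_toy <- sum(p_toy * (m_toy - sum(p_toy * m_toy))^2) # between-PSU variance
S2w_toy <- sum(p_toy * v_toy)                          # pooled within-PSU variance
# S2b_toy + S2w_toy equals the overall population variance of z_toy
```

Note that function `var`, used in the code chunk above, applies the divisor $N_h-1$, so there the decomposition holds only approximately.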
The optimal sample sizes are computed for a simple linear costs model: $C = c_0 + c_1n + c_2nm$, with $c_0$ the fixed costs, $c_1$ the access costs per PSU, including the access costs of the SSUs (points) within a given PSU, and $c_2$ the observation costs per SSU. In the next code chunk, I use $c_1=2$ and $c_2=1$. For the optimal sample sizes only the ratio of $c_1$ and $c_2$ is important, not their absolute values.
Given values for $c_1$ and $c_2$, the optimal number of PSU draws $n$ and the optimal number of SSU draws per PSU draw $m$ are computed, required for a sampling variance of the estimator of the mean equal to the sampling variance with systematic random sampling of 50 points, see Equations \@ref(eq:nopt) and \@ref(eq:mopt).
```{r}
c1 <- 2; c2 <- 1
nopt <- 1 / Exi_vmz_SY * (sqrt(Exi_S2w * Exi_S2b) * sqrt(c2 / c1) + Exi_S2b)
mopt <- sqrt(Exi_S2w / Exi_S2b) * sqrt(c1 / c2)
```
The optimal number of PSU draws is `r ceiling(nopt)`, and the optimal number of points per PSU draw equals `r ceiling(mopt)`. The total number of sampling points is `r ceiling(nopt)` $\times$ `r ceiling(mopt)` = `r ceiling(nopt)*ceiling(mopt)`. This is much larger than the sample size of 50 obtained with systematic random sampling. The total observation costs therefore are substantially larger. However, the access time can be substantially smaller due to the spatial clustering of sampling points. To answer the question of whether the costs saved by this reduced access time outweigh the extra costs of observation, the model for the access costs and observation costs must be further developed.
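The two formulas for the optimal sample sizes can be wrapped in a small helper function to explore their sensitivity to the cost ratio. The variance components, costs, and target variance below are hypothetical values, not those of the case study:

```{r}
# Sketch: optimal two-stage sample sizes under the linear cost model
# C = c0 + c1*n + c2*n*m. All numeric values below are hypothetical.
opt_twostage <- function(S2b, S2w, c1, c2, V) {
  m <- sqrt(S2w / S2b) * sqrt(c1 / c2)              # optimal SSU draws per PSU draw
  n <- (sqrt(S2w * S2b) * sqrt(c2 / c1) + S2b) / V  # PSU draws for target variance V
  c(n = n, m = m)
}
opt1 <- opt_twostage(S2b = 0.5, S2w = 1.5, c1 = 2, c2 = 1, V = 0.05)
# Doubling the access costs per PSU increases the number of points per PSU
opt2 <- opt_twostage(S2b = 0.5, S2w = 1.5, c1 = 4, c2 = 1, V = 0.05)
```

By construction the achieved variance $(S^2_{\mathrm{b}} + S^2_{\mathrm{w}}/m)/n$ equals the target variance $V$, and increasing the ratio $c_1/c_2$ shifts effort from selecting more PSUs to observing more points within each selected PSU.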
### Bayesian approach {#MBpredSamplingVarBayes}
The model-based prediction of the variance of the design-based estimator of the population mean for a given sampling design is conditional on the model. If we change the model type or the model parameters, the predicted sampling variance also changes. In most situations we are quite uncertain about the model, even in situations where we have data that can be used to estimate the model parameters, as in the West-Amhara case study. Instead of using the best estimated model to predict the sampling variance as done in the previous sections, we may prefer to account for the uncertainty about the model parameters. This can be done through a Bayesian approach\index{Bayesian approach!to prediction of sampling variance}, in which the legacy data are used to update a prior distribution of the model parameters to a posterior distribution. For details about a Bayesian approach for estimating model parameters, see Section \@ref(BayesianGridSpacing). A sample from the posterior distribution of the model parameters is used one-by-one to predict the sampling variance. This can be done either analytically, as described in Subsection \@ref(AnalyticalApproach), or through geostatistical simulation, as described in Subsection \@ref(GeostatisticalSimulationApproach). Both approaches result in a *distribution* of sampling variances, reflecting our uncertainty about the sampling variance of the estimator of the population mean due to uncertainty about the model parameters. The mean or median of the distribution of sampling variances can be used as the predicted sampling variance.
The Bayesian approach is illustrated with a case study on predicting the sampling variance of NO$_3$-N in agricultural field Melle in Belgium [@HofmanBrus2021]. As for field Leest used in Subsection \@ref(AnalyticalApproach), data of NO$_3$-N are available at 30 points. The sampling points are approximately on the nodes of a square grid with a spacing of about 4.5 m. As a first step, I check whether we can safely assume that the data come from a normal distribution.
(ref:qqplotMellelabel) Q-Q plot of NO~3~-N of field Melle.
```{r qqplotMelle, fig.asp=0.7, fig.width = 5, fig.cap="(ref:qqplotMellelabel)"}
ggplot(sampleMelle, aes(sample = N)) +
geom_qq() +
geom_qq_line()
pvalue <- shapiro.test(sampleMelle$N)$p.value
```
The Q-Q plot\index{Q-Q plot} (Figure \@ref(fig:qqplotMelle)) shows that a normal distribution is not very likely: there are too many large values, i.e., the distribution is skewed to the right. Also the *p*-value\index{Significance level}\index{\emph{p}-value of a test} of the Shapiro-Wilk test\index{Shapiro-Wilk test} shows that we should reject the null hypothesis of a normal distribution for the data: $p=0.0028$. I therefore proceed with the natural log of NO$_3$-N, in short lnN.
```{r}
sampleMelle$lnN <- log(sampleMelle$N)
```
```{r, echo=FALSE}
min_s1 <- min(sampleMelle$s1)
min_s2 <- min(sampleMelle$s2)
sampleMelle$s1 <- sampleMelle$s1 - min_s1
sampleMelle$s2 <- sampleMelle$s2 - min_s2
```
As a first step, the semivariogram of lnN is estimated by maximum likelihood\index{Maximum likelihood estimation} (Subsection \@ref(MLestimationVariogram)). An exponential semivariogram model is assumed, see Equation \@ref(eq:exponential).
```{block2, type = 'rmdnote'}
The parameters that are estimated are the reciprocal of the sill $\lambda$, the ratio of spatial dependence\index{Ratio of spatial dependence} $\xi$, defined as the partial sill divided by the sill, and the distance parameter $\phi$. This parameterisation of the semivariogram is chosen because, in the Bayesian approach that follows, prior distributions are defined for these parameters.
```
The likelihood function is defined, using a somewhat unusual parameterisation, tailored to the Markov chain Monte Carlo (MCMC) sampling\index{Markov chain Monte Carlo sampling} from the posterior distribution of the semivariogram parameters. In MCMC a Markov chain of sampling units (vectors with semivariogram parameters) is generated, using the previous sampling unit to randomly generate the next sampling unit [@Gelman2013, chapter 11]. In MCMC sampling, the probability of accepting a proposed sampling unit $\pmb{\theta}^*$ is a function of the ratio of the posterior density of the proposed sampling unit and that of the current sampling unit, $f(\pmb{\theta}^*|\mathbf{z})/f(\pmb{\theta}_{t-1}|\mathbf{z})$, so that the normalising constant, the denominator of Equation \@ref(eq:BayesTheorem), cancels.
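The cancellation of the normalising constant can be illustrated with a minimal random-walk Metropolis sampler for a toy one-parameter target (the target and proposal below are illustrative assumptions, not the semivariogram posterior):

```{r}
# Sketch: random-walk Metropolis sampling from an unnormalised target.
# Only the ratio of densities enters the acceptance probability, so the
# normalising constant of Bayes' theorem cancels. Toy target, not the
# semivariogram posterior.
log_post <- function(theta) dnorm(theta, mean = 2, sd = 1, log = TRUE)
set.seed(314)
n_iter <- 5000
chain <- numeric(n_iter)
theta <- 0                                        # initial value
for (iter in 1:n_iter) {
  theta_star <- theta + rnorm(1, sd = 1)          # proposal
  log_r <- log_post(theta_star) - log_post(theta) # log of density ratio
  if (log(runif(1)) < log_r) theta <- theta_star  # accept with prob min(1, r)
  chain[iter] <- theta
}
# After discarding a burn-in, the chain approximates the target distribution
```

The samplers in **BayesianTools**, used below, are more sophisticated, but the role of the density ratio is the same.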
```{r}
library(mvtnorm)
ll <- function(thetas) {
sill <- 1 / thetas[1]
psill <- thetas[2] * sill
nugget <- sill - psill
vgmodel <- vgm(
model = model, psill = psill, range = thetas[3], nugget = nugget)
C <- variogramLine(vgmodel, dist_vector = D, covariance = TRUE)
XCX <- crossprod(X, solve(C, X))
XCz <- crossprod(X, solve(C, z))
betaGLS <- solve(XCX, XCz)
mu <- as.numeric(X %*% betaGLS)
logLik <- dmvnorm(x = z, mean = mu, sigma = C, log = TRUE)
logLik
}
```
Next, initial estimates of the semivariogram parameters are computed by maximising the likelihood, using function `optim`.
```{r}
lambda.ini <- 1 / var(sampleMelle$lnN)
xi.ini <- 0.5
phi.ini <- 20
pars <- c(lambda.ini, xi.ini, phi.ini)
D <- as.matrix(dist(sampleMelle[, c("s1", "s2")]))
X <- matrix(1, nrow(sampleMelle), 1)
z <- sampleMelle$lnN
model <- "Exp"
vgML <- optim(pars, ll, control = list(fnscale = -1),
lower = c(1e-6, 0, 1e-6), upper = c(1000, 1, 150), method = "L-BFGS-B")
```
The maximum likelihood (ML) estimates of the semivariogram parameters are used as initial values in MCMC sampling. A uniform prior is used for the inverse of the sill, $\lambda=1/\sigma^2$, with a lower bound of $10^{-6}$ and an upper bound of 1,000. For the ratio of spatial dependence $\xi$, a uniform prior is assumed with a lower bound of 0 and an upper bound of 1. For the distance parameter $\phi$ of the exponential semivariogram, a uniform prior is assumed with a lower bound of $10^{-6}$ m and an upper bound of 150 m.
These priors can be defined with function `createUniformPrior` of package **BayesianTools** [@Hartig2018]. Function `createBayesianSetup` is then used to define the setup of the MCMC sampling, specifying the likelihood function, the prior, and, through argument `best`, the vector with the best prior estimates of the model parameters. Argument `sampler` of function `runMCMC` specifies the type of MCMC sampler. I used the differential evolution algorithm\index{Differential evolution algorithm} of @terBraak2008. Argument `start` of function `getSample` specifies the burn-in period\index{Burn-in period}, i.e., the number of initial samples that are discarded to diminish the influence of the starting values of the semivariogram parameters. Argument `numSamples` specifies the sample size, i.e., the number of saved vectors with semivariogram parameter values, drawn from the posterior distribution.
```{r MCMCMelles, eval = FALSE}
library(BayesianTools)
priors <- createUniformPrior(lower = c(1e-6, 0, 1e-6),
upper = c(1000, 1, 150))
bestML <- c(vgML$par[1], vgML$par[2], vgML$par[3])
setup <- createBayesianSetup(likelihood = ll, prior = priors,
best = bestML, names = c("lambda", "xi", "phi"))
set.seed(314)
res <- runMCMC(setup, sampler = "DEzs")
MCMCsample <- getSample(res, start = 1000, numSamples = 1000) %>% data.frame()
```
```{r, eval = FALSE, echo = FALSE}
write_rds(MCMCsample, file = "results/MCMC_Melles.rds")
```
Figure \@ref(fig:MCMCsampleVariogramlnN) shows several semivariograms, sampled by MCMC from the posterior distribution of the estimated semivariogram parameters.
(ref:MCMCsampleVariogramlnNlabel) Semivariograms of the natural log of NO~3~-N for field Melle obtained by MCMC sampling from posterior distribution of the estimated semivariogram parameters.
```{r MCMCsampleVariogramlnN, echo = FALSE, fig.width = 5, fig.asp = 0.7, fig.cap = "(ref:MCMCsampleVariogramlnNlabel)"}
MCMCsample <- read_rds(file = "results/MCMC_Melles.rds")
d <- 1:150
semivar <- matrix(nrow = length(d), ncol = 20)
for (i in 1:20) {
sill <- 1 / MCMCsample$lambda[i]
psill <- MCMCsample$xi[i] * sill
nugget <- sill - psill
phi <- MCMCsample$phi[i]
g <- variogramLine(vgm(model = "Exp", psill = psill, range = phi, nugget = nugget), dist_vector = d)
semivar[, i] <- g$gamma
}
dfsemi <- data.frame(d = d, semivar)
df_lf <- dfsemi %>%
pivot_longer(cols = names(dfsemi)[-1])
ggplot(df_lf, mapping = aes(x = d, y = value, group = name)) +
geom_line() +
scale_x_continuous(name = "Distance (m)") +
scale_y_continuous(name = "Semivariance", limits = c(0, NA))
```
The evaluated sampling design is the same as used in Subsection \@ref(AnalyticalApproach) for field Leest: stratified simple random sampling, using compact geographical strata of equal size, a total sample size of 25 points, and one point per stratum.
```{r, echo = FALSE}
field <- read_sf(system.file("extdata/melle.gpkg", package = "sswr")) %>%
st_set_crs(NA_crs_)
mygrid <- st_make_grid(field, cellsize = 2, what = "centers")
mygrid <- mygrid[field] %>%
st_coordinates(mygrid) %>%
as_tibble() %>%
mutate(
x1 = X - min_s1,
x2 = Y - min_s2
)
gridded(mygrid) <- ~ x1 + x2
n <- 25
mygeostrata <- stratify(mygrid, nStrata = n, equalArea = TRUE, nTry = 10)
```
The next step is to simulate with each of the sampled semivariograms a large number of maps of lnN. This is done by sequential Gaussian simulation, conditional on the available data. The simulated values are backtransformed. Each simulated map is then used to compute the variance of the simulated values within the geostrata $S^2_h$. These stratum variances are used to compute the sampling variance of the estimator of the mean. Plugging $w_h = 1/n$ (all strata have equal size) into Equation \@ref(eq:EstVarMeanSTSI) and using $n_h=1$ in Equation \@ref(eq:EstVarstratummean) yield (compare with Equation \@ref(eq:MBdesignvarSTSI2))
\begin{equation}
V(\hat{\bar{z}}) = \frac{1}{n^2}\sum_{h=1}^H S^2_h \;.
(\#eq:VarSTSIMelles)
\end{equation}
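This simplification can be verified numerically: with $w_h = 1/n$ and $n_h = 1$, the general stratified variance formula $\sum_h w_h^2 S^2_h/n_h$ equals $\frac{1}{n^2}\sum_h S^2_h$. A minimal sketch with hypothetical stratum variances:

```{r}
# Sketch: with w_h = 1/n and n_h = 1, the general stratified sampling
# variance sum(w_h^2 * S2h / n_h) reduces to sum(S2h) / n^2.
S2h_toy <- c(0.8, 1.2, 0.5, 2.0, 1.1)   # hypothetical stratum variances
n_toy <- length(S2h_toy)                # one point per stratum
w_toy <- rep(1 / n_toy, n_toy)          # equal stratum weights
nh_toy <- rep(1, n_toy)                 # one sampling point per stratum
V_general <- sum(w_toy^2 * S2h_toy / nh_toy)
V_reduced <- sum(S2h_toy) / n_toy^2
```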
In the code chunk below, I use the first 100 sampled semivariograms and simulate 100 maps with each of them.
```{r, echo=FALSE}
mygeostrata <- as(mygeostrata, "data.frame")
```
```{r simulatevariances, eval = FALSE}
V <- matrix(data = NA, nrow = 100, ncol = 100)
coordinates(sampleMelle) <- ~ s1 + s2
set.seed(314)
for (i in 1:100) {
sill <- 1 / MCMCsample$lambda[i]
psill <- MCMCsample$xi[i] * sill
nug <- sill - psill
range <- MCMCsample$phi[i]
vgmdl <- vgm(model = "Exp", nugget = nug, psill = psill, range = range)
ysim <- krige(
lnN ~ 1, locations = sampleMelle, newdata = mygrid,
model = vgmdl,
nmax = 20, nsim = 100,
debug.level = 0) %>% as("data.frame")
zsim <- exp(ysim[, -c(1, 2)])
S2h <- apply(zsim, MARGIN = 2, FUN = function(x)
tapply(x, INDEX = as.factor(mygeostrata$stratumId), FUN = var))
V[i, ] <- 1 / n^2 * apply(S2h, MARGIN = 2, FUN = sum)
}
```
```{r, eval = FALSE, echo = FALSE}
write_rds(V, file = "results/SimulatedVariances_Melle.rds")
```
Figure \@ref(fig:plotsimulatedmapsMelle) shows 16 maps simulated with the first four semivariograms. The four maps in a row (a to d) are simulated with the same semivariogram. All maps show that the simulated data have positive skew, which is in agreement with the prior data. The data obtained by simulating from a lognormal distribution are always strictly positive. This is not guaranteed when simulating from a normal distribution.
```{r, echo = FALSE, eval = FALSE}
#coordinates(sampleMelle) <- ~ s1 + s2
v <- matrix(data = NA, nrow = 4, ncol = 4)
Zsim <- NULL
set.seed(314)
for (i in 1:4) {
sill <- 1 / MCMCsample$lambda[i]
psill <- MCMCsample$xi[i] * sill
nugget <- sill - psill
range <- MCMCsample$phi[i]
for (j in 1:4) {
ysim <- krige(
lnN ~ 1,
locations = sampleMelle,
newdata = mygrid,
model = vgm(model = "Exp", nugget = nugget, psill = psill, range = range),
nmax = 20,
nsim = 1,
debug.level = 0) %>%
as("data.frame")
zsim <- exp(ysim$sim1)
S2h <- tapply(zsim, INDEX = as.factor(mygeostrata$stratumId), FUN = var)
v[i, j] <- 1 / n^2 * sum(S2h)
fld <- paste0(i, letters[j])
zsimdf <- data.frame(coordinates(mygrid), zsim = zsim, fld = fld)
Zsim <- rbind(Zsim, zsimdf)
}
}
save(Zsim, v, file = "results/SimulatedMaps_Melle.rda")
```
(ref:plotsimulatedmapsMellelabel) Maps of NO~3~-N of field Melle simulated with four semivariograms (rows). Each semivariogram is used to simulate four maps (columns a-d).
```{r, plotsimulatedmapsMelle, echo = FALSE, out.width = "100%", fig.cap = "(ref:plotsimulatedmapsMellelabel)"}
load(file = "results/SimulatedMaps_Melle.rda")
ggplot(data = Zsim) +
geom_raster(mapping = aes(x = x1, y = x2, fill = zsim)) +
scale_x_continuous(name = "") +
scale_y_continuous(name = "") +
coord_fixed(ratio = 1) +
scale_fill_viridis_c(name = "Nsim", limits = c(0, 100)) +
facet_wrap(~ fld, nrow = 4, ncol = 4)
```
The sampling variances of the estimated mean of NO$_3$-N obtained with these 16 maps are shown below.
```{r, echo = FALSE}
colnames(v) <- c("a", "b", "c", "d")
rownames(v) <- 1:4
round(v, 3)
```
The sampling variance shows quite strong variation among the maps. The frequency distribution of Figure \@ref(fig:histogramNMelle) shows our uncertainty about the sampling variance, due to uncertainty about the semivariogram, and about the spatial distribution of NO$_3$-N within the agricultural field given the semivariogram and the available data from that field.
(ref:histogramNMellelabel) Frequency distribution of simulated sampling variances of the $\pi$ estimator of the mean of NO~3~-N of field Melle, for stratified simple random sampling, using 25 compact geostrata of equal size, and one point per stratum.
```{r histogramNMelle, echo = FALSE, fig.width = 5, fig.asp = 0.7, fig.cap = "(ref:histogramNMellelabel)"}
V <- read_rds(file = "results/SimulatedVariances_Melle.rds")
ggplot() +
geom_histogram(aes(x = V), binwidth = 0.2, fill = "black", alpha = 0.5, colour = "black") +
scale_y_continuous(name = "Count") +
scale_x_continuous(name = "Variance of estimator of mean")
```
```{r, echo = FALSE}
Exi_V_STSI <- mean(V)
P50_V_STSI <- quantile(V, 0.5)
P90_V_STSI <- quantile(V, 0.9)
```
As a model-based prediction of the sampling variance, we can take the mean or the median of the sampling variances over all 100 $\times$ 100 simulated maps, which are equal to `r formatC(Exi_V_STSI, 3, format = "f")` (dag kg^-1^)^2^ and `r formatC(P50_V_STSI, 3, format = "f")` (dag kg^-1^)^2^, respectively. To be on the safe side, we can take a high quantile, e.g., the P90 of this distribution, as the predicted sampling variance, which is equal to `r formatC(P90_V_STSI, 3, format = "f")` (dag kg^-1^)^2^.
I used the 30 available NO$_3$-N data as conditioning data in geostatistical simulation. Unconditional simulation is recommended if we cannot rely on the quality of the legacy data, for instance due to a temporal change in lnN since the time the legacy data were observed. For NO$_3$-N this might well be the case. Although the effect of the 30 observations on the simulated fields and on the uncertainty distribution of the sampling variance will be very small, one may still prefer unconditional simulation. With unconditional simulation, we must assign the model-mean $\mu$ to argument `beta` of function `krige`. The model-mean can be estimated by generalised least squares, see function `ll` above.
## Model-based optimisation of spatial strata {#Ospats}
In this section a spatial stratification method is described that uses model predictions of the study variable as a stratification variable. As opposed to *cum-root-f* stratification, this spatial stratification method accounts for errors in the predictions, as well as for spatial correlation of the prediction errors [@deGruijter2015].
The **Julia** package **Ospats** is an implementation of this stratification method. In **Ospats**\index{Optimal spatial stratification} the stratification is optimised through iterative reallocation of the raster cells to the strata. Recently, this stratification method was also implemented in the package **SamplingStrata** [@Barcaroli2014; @Barcaroli2020]. However, the algorithm used to optimise the strata differs from that in **Ospats**. In **SamplingStrata** the stratification is optimised by optimising the bounds\index{Stratum bound} (splitting points) on the stratification variable with a genetic algorithm\index{Genetic algorithm}. Optimisation of the strata through optimisation of the bounds on the stratification variable necessarily leads to non-overlapping strata, while with iterative reallocation\index{Iterative reallocation} the strata may overlap, i.e., when the strata are sorted on the mean of the stratification variable, the upper bound of a stratum can be larger than the lower bound of the next stratum. As argued by @deGruijter2015, optimisation of strata through optimisation of the stratum bounds can therefore be suboptimal. On the other hand, optimisation of the stratum bounds requires fewer computations and therefore is quicker.
The situation considered in this section is that prior data are available, either from the study area itself or from another similar area, that can be used to fit a linear regression model for the study variable, using one or more quantitative covariates and/or factors as predictors. These predictors must be available in the study area so that the fitted model can be used to map the study variable in the study area. We wish to collect more data by stratified simple random sampling, to be used in design-based estimation of the population mean or total of the study variable. The central research question then is how to construct these strata.
Recall the variance estimator of the mean estimator for stratified simple random sampling (Equations \@ref(eq:EstVarMeanSTSI) and \@ref(eq:EstVarstratummean)):
\begin{equation}
V\!\left(\hat{\bar{z}}\right)=\sum\limits_{h=1}^{H}w_{h}^{2} \frac{S^2_h(z)}{n_h}\;.
(\#eq:VarMeanSTSI2)
\end{equation}
Plugging the stratum sample sizes under optimal allocation (Equation \@ref(eq:optallocation)) into Equation \@ref(eq:VarMeanSTSI2) yields
\begin{equation}
V\!\left(\hat{\bar{z}}\right)=\frac{1}{n}\left(\sum\limits_{h=1}^{H}w_h S_h(z) \sqrt{c_h} \sum_{h=1}^H \frac{w_h S_h(z)}{\sqrt{c_h}}\right)\;.
(\#eq:VSTSINeyman)
\end{equation}
So, given the total sample size $n$, the variance of the estimator of the mean is minimal when the criterion
\begin{equation}
O = \sum\limits_{h=1}^{H}w_h S_h(z) \sqrt{c_h} \sum_{h=1}^H \frac{w_h S_h(z)}{\sqrt{c_h}}
(\#eq:minicritospats)
\end{equation}
is minimised.
Assuming that the costs are equal for all population units, so that the mean costs are the same for all strata, the minimisation criterion reduces to
\begin{equation}
O = \left(\sum\limits_{h=1}^H w_h S_h(z)\right)^2\;.
(\#eq:EOconstantch)
\end{equation}
In practice, we do not know the values of the study variable $z$. @deGruijter2015 consider the situation where we have predictions of the study variable from a linear regression model: $\hat{z} = z + \epsilon$, with $\epsilon$ the prediction error. This implies that we do not know the population standard deviations within the strata, $S_h(z)$ of Equation \@ref(eq:VSTSINeyman). What we do have are the stratum standard deviations of the predictions of $z$: $S_h(\hat{z})$. With many statistical models, such as regression and kriging models, the standard deviation of the predictions is smaller than that of the study variable: $S_h(\hat{z}) < S_h(z)$. This is known as the smoothing or levelling effect.
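The smoothing effect can be illustrated with ordinary least squares on simulated data: with an intercept in the model, the variance of the fitted values equals $R^2$ times the variance of the observations, so the predictions are less variable than the study variable whenever $R^2 < 1$. The data and model below are a toy example:

```{r}
# Sketch: smoothing effect of regression predictions, with simulated data.
# For OLS with an intercept, var(fitted) = R^2 * var(z): the predictions
# are less variable than the study variable itself.
set.seed(314)
df_toy <- data.frame(x = runif(200))
df_toy$z <- 2 + 3 * df_toy$x + rnorm(200, sd = 1)  # hypothetical study variable
ols <- lm(z ~ x, data = df_toy)
r2 <- summary(ols)$r.squared
var(fitted(ols)) / var(df_toy$z)  # equals r2, so the fitted values are smoother
```

This identity is also the motivation for the $R^2$ correction introduced below.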
The stratum standard deviations in the minimisation criterion are replaced by model-expectations of the stratum standard deviations, i.e., by model-based predictions of the stratum standard deviations, $E_{\xi}[S_h(z)]$. This leads to the following minimisation criterion:
\begin{equation}
E_{\xi}[O] = \left(\sum\limits_{h=1}^H w_h E_{\xi}[S_h(z)]\right)^2\;.
(\#eq:EOconstantch2)
\end{equation}
The stratum variances are predicted by
\begin{equation}
E_{\xi}[S^2_h(z)]=\frac{1}{N^2_h}\sum_{i=1}^{N_h-1}\sum_{j=i+1}^{N_h}E_{\xi}[d^2_{ij}]\;,
\end{equation}
with $d^2_{ij} = (z_i-z_j)^2$ the squared difference of the study variable values at two nodes of a discretisation grid. The model-expectations of the squared differences are equal to
\begin{equation}
E_{\xi}[d^2_{ij}] = (\hat{z}_i-\hat{z}_j)^2+S^2(\epsilon_i)+S^2(\epsilon_j)-2S^2(\epsilon_i,\epsilon_j)
(\#eq:Exid2ij)
\;,
\end{equation}
with $S^2(\epsilon_i)$ the variance of the prediction error at node $i$ and $S^2(\epsilon_i,\epsilon_j)$ the covariance of the prediction errors at nodes $i$ and $j$. The authors then argue that for smoothers, such as kriging and regression, the first term must be divided by the squared correlation coefficient $R^2$:
\begin{equation}
E_{\xi}[d^2_{ij}] = \frac{(\hat{z}_i-\hat{z}_j)^2}{R^2}+S^2(\epsilon_i)+S^2(\epsilon_j)-2S^2(\epsilon_i,\epsilon_j) \;.
(\#eq:Exid2ij2)
\end{equation}
The predicted stratum standard deviations are approximated by the square root of the predicted stratum variances, obtained by averaging the model-expected squared differences of Equation \@ref(eq:Exid2ij2). Plugging these model-based predictions of the stratum standard deviations into the minimisation criterion of Equation \@ref(eq:EOconstantch) and dropping the square, which does not affect the minimisation, yields
\begin{equation}
E_{\xi}[O] = \frac{1}{N} \sum\limits_{h=1}^H \left( \sum_{i=1}^{N_h-1}\sum_{j=i+1}^{N_h}\frac{(\hat{z}_i-\hat{z}_j)^2}{R^2}+S^2(\epsilon_i)+S^2(\epsilon_j)-2S^2(\epsilon_i,\epsilon_j) \right)^{1 / 2}\;.
(\#eq:minicritospatsDeGruijter)
\end{equation}
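A minimal numerical sketch of the model-expected squared differences with the $R^2$ correction, for a single 'stratum' of four nodes. The predictions, the error covariance matrix, and the value of $R^2$ are hypothetical:

```{r}
# Sketch: model-expected squared differences with the R^2 correction,
# for one 'stratum' of four nodes. Predictions, error covariances, and
# R^2 are hypothetical values.
zhat <- c(2.1, 2.8, 3.5, 1.9)                    # hypothetical predictions
Cov_e <- diag(0.4, 4)                            # hypothetical error covariance
Cov_e[abs(row(Cov_e) - col(Cov_e)) == 1] <- 0.1  # correlated neighbouring errors
R2_toy <- 0.6                                    # hypothetical squared correlation
N_toy <- length(zhat)
Exi_d2 <- matrix(0, N_toy, N_toy)
for (i in 1:(N_toy - 1)) {
  for (j in (i + 1):N_toy) {
    Exi_d2[i, j] <- (zhat[i] - zhat[j])^2 / R2_toy +
      Cov_e[i, i] + Cov_e[j, j] - 2 * Cov_e[i, j]
  }
}
# Predicted stratum variance: sum over all pairs, divided by N^2
Exi_S2h <- sum(Exi_d2) / N_toy^2
```

A positive covariance between the errors at two nodes reduces their expected squared difference, which is why spatially correlated prediction errors must be accounted for.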
Optimal spatial stratification with package **SamplingStrata** is illustrated with a survey of the SOM concentration (g kg^-1^) in the topsoil (A horizon) of Xuancheng (China). Three samples are available, which are merged into one sample with a total of 183 sampling points. This merged sample is used to fit a simple linear regression model for the SOM concentration, with the elevation of the surface (dem) as a predictor. Function `lm` of the **stats** package is used to fit the model.
<!-- projection attributes of Xuancheng: wgs_1984_UTM_Zone_50N -->
```{r}
lm_SOM <- lm(SOM_A_hori ~ dem, data = sampleXuancheng)
```
In fitting a linear regression model, we assume that the relation is linear, that the residual variance is constant (independent of the fitted value), and that the residuals have a normal distribution. The first two assumptions are checked with a scatter plot of the residuals against the fitted values, the third with a Q-Q plot\index{Q-Q plot} (Figure \@ref(fig:checkassumptions)).
```{r checkassumptions, echo = FALSE, out.width = "100%", fig.asp = 0.5, fig.cap = "Scatter plot of residuals against fitted value and Q-Q plot of residuals, for a simple linear regression model of the SOM concentration in Xuancheng, using elevation as a predictor."}
fit <- fitted(lm_SOM)
e <- residuals(lm_SOM)
df <- data.frame(fit = fit, resid = e)
plt1 <- ggplot(df) +
geom_point(mapping = aes(x = fit, y = e)) +
geom_hline(yintercept = 0) +
scale_x_continuous(name = "Fitted value") +
scale_y_continuous(name = "Residuals")
plt2 <- ggplot(df, aes(sample = e)) +
geom_qq() +
geom_qq_line()
grid.arrange(plt1, plt2, nrow = 1)
```
The scatter plot shows that the first assumption is realistic: at all fitted values, the residuals are scattered around the horizontal line, without a discernible pattern. However, the second and third assumptions are questionable: the residual variance clearly increases with the fitted value, and the distribution of the residuals has positive skew, i.e., a long upper tail. Possibly these problems can be solved by fitting a model for the natural log of the SOM concentration.
```{r checkassumptions2, echo = FALSE, out.width = "100%", fig.asp = 0.5, fig.cap = "Scatter plot of residuals against fitted value and Q-Q plot of residuals, for a simple linear regression model of the natural log of the SOM concentration in Xuancheng, using elevation as a predictor."}
sampleXuancheng$lnSOM <- log(sampleXuancheng$SOM_A_hori)
lm_lnSOM <- lm(lnSOM ~ dem, data = sampleXuancheng)
e <- residuals(lm_lnSOM)
fit <- fitted(lm_lnSOM)
df <- data.frame(fit, e)
plt1 <- ggplot(df) +
geom_point(mapping = aes(x = fit, y = e)) +
geom_hline(yintercept = 0) +
scale_x_continuous(name = "Fitted value") +
scale_y_continuous(name = "Residuals")
plt2 <- ggplot(df, aes(sample = e)) +
geom_qq() +
geom_qq_line()
grid.arrange(plt1, plt2, nrow = 1)
```
The variance of the residuals is more constant (Figure \@ref(fig:checkassumptions2)), and the Q-Q plot is improved, although we now have too many strong negative residuals for a normal distribution. I proceed with the model for natural-log transformed SOM (lnSOM). The fitted linear regression model is used to predict lnSOM at the nodes of a 200 m $\times$ 200 m discretisation grid.
```{r}
res <- predict(lm_lnSOM, newdata = as(grdXuancheng, "data.frame"), se.fit = TRUE)
grdXuancheng <- within(grdXuancheng, {
lnSOMpred <- res$fit; varpred <- res$se.fit^2})
```
The predictions and their standard errors are shown in Figure \@ref(fig:predictedlnSOM).
(ref:predictedlnSOMlabel) Predicted natural log of the SOM concentration (g kg^-1^) in the topsoil of Xuancheng and the standard error (se) of the predictor, obtained with a linear regression model with elevation as a predictor.
```{r predictedlnSOM, echo = FALSE, out.width = "100%", fig.cap = "(ref:predictedlnSOMlabel)"}
plt1 <- ggplot(grdXuancheng) +
geom_raster(mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = lnSOMpred)) +
scale_fill_viridis_c(name = "lnSOM") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
plt2 <- ggplot(grdXuancheng) +
geom_raster(mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = sqrt(varpred))) +
scale_fill_viridis_c(name = "se") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
grid.arrange(plt1, plt2, ncol = 1)
```
Let us check now whether the spatial structure of the study variable lnSOM is fully captured by the mean, modelled as a linear function of elevation. This can be checked by estimating the semivariogram of the model residuals. If the semivariogram of the residuals is pure nugget (the semivariance does not increase with distance), then we can assume that the prediction errors are independent. In that case, we do not need to account for a covariance of the prediction errors in optimisation of the spatial strata. However, if the semivariogram does show spatial structure, we must account for a covariance of the prediction errors. Figure \@ref(fig:residualvariogramSOM) shows the sample semivariogram of the residuals computed with function `variogram` of package **gstat**.
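The estimator behind `variogram` is simple: for all point pairs whose separation distance falls in a distance class, the semivariance is half the mean squared difference of the values. A hand computation on four hypothetical residual values (not the Xuancheng data) illustrates it:

```{r}
# Residuals at four hypothetical points; points 1-2 and 3-4 are close
# together, all other pairs are far apart
z <- c(1.0, 1.2, 3.1, 2.9)
pairs_close <- rbind(c(1, 2), c(3, 4))
pairs_far <- rbind(c(1, 3), c(1, 4), c(2, 3), c(2, 4))
# Semivariance = half the mean squared difference within a distance class
gamma_close <- mean((z[pairs_close[, 1]] - z[pairs_close[, 2]])^2) / 2
gamma_far <- mean((z[pairs_far[, 1]] - z[pairs_far[, 2]])^2) / 2
c(gamma_close, gamma_far)  # 0.02 1.815
```

A semivariance that is smaller at short distances than at long distances, as here, is the signature of spatially structured residuals; a pure nugget semivariogram would show no such increase with distance.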
```{r}
library(gstat)
sampleXuancheng <- sampleXuancheng %>%
mutate(s1 = s1 / 1000, s2 = s2 / 1000)
coordinates(sampleXuancheng) <- ~ s1 + s2
vg <- variogram(lnSOM ~ dem, data = sampleXuancheng)
```
```{r residualvariogramSOM, echo = FALSE, fig.asp = 0.7, fig.cap = "Sample semivariogram of the residuals of a simple linear regression model for the natural log of the SOM concentration in Xuancheng. Numbers refer to point-pairs used in computing semivariances."}
ggplot(data = vg) +
geom_point(mapping = aes(x = dist, y = gamma), size = 2) +
geom_text(mapping = aes(x = dist, y = gamma, label = np), nudge_x = 1.2) +
scale_x_continuous(name = "Distance (km)") +
scale_y_continuous(name = "Semivariance", limits = c(0, 0.20))
```
The sample semivariogram does not show much spatial structure, but the semivariances at the first two points are somewhat smaller. This indicates that the residuals at two points less than about 5 km apart are not independent, whereas at larger separation distances they are. This spatial dependence of the residuals can be modelled, e.g., by an exponential function. The exponential semivariogram has three parameters: the nugget\index{Nugget} variance $c_0$, the partial sill\index{Partial sill} $c_1$, and the distance parameter\index{Distance parameter} $\phi$. The total number of model parameters now is five: two regression coefficients (intercept and slope for elevation) and three semivariogram parameters. All five parameters are best estimated by restricted maximum likelihood\index{Restricted maximum likelihood estimation}, see Subsection \@ref(REML). Table \@ref(tab:TableModelXuancheng) shows the estimated regression coefficients and semivariogram parameters. Up to a distance of about three times the estimated distance parameter $\phi$, i.e., about 8 km, the residuals are spatially correlated; beyond that distance, they are hardly correlated anymore.
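For reference, the exponential model and the effective-range rule of thumb can be written as a small helper function (an illustration only, not part of **geoR**):

```{r}
# Exponential semivariogram: gamma(h) = c0 + c1 * (1 - exp(-h / phi)), h > 0
semivar_exp <- function(h, c0, c1, phi) c0 + c1 * (1 - exp(-h / phi))
# At h = 3 * phi the model has reached c0 + 0.95 * c1, i.e., about 95%
# of the sill: hence the effective range of about three times phi
semivar_exp(h = 3, c0 = 0, c1 = 1, phi = 1)  # 1 - exp(-3), about 0.95
```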
```{r, echo = FALSE}
library(geoR)
sampleXuancheng <- as_tibble(sampleXuancheng)
dGeoR <- as.geodata(obj = sampleXuancheng, header = TRUE,
coords.col = c("s1", "s2"), data.col = "lnSOM", covar.col = "dem"
)
vgm_REML <- likfit(
geodata = dGeoR, trend = ~ dem, cov.model = "exponential",
ini.cov.pars = c(0.05, 3), nugget = 0.1, lik.method = "REML", messages = FALSE)
```
(ref:TableModelXuanchenglabel) Estimated regression coefficients (intercept and slope for dem) and parameters of an exponential semivariogram for the natural log of the SOM concentration (g kg^-1^) in Xuancheng.
```{r TableModelXuancheng, echo = FALSE}
int <- round(vgm_REML$beta[1], 3)
dem <- round(vgm_REML$beta[2], 5)
nugget <- c(round(vgm_REML$nugget, 3))
psill <- c(round(vgm_REML$sigmasq, 3))
range <- c(round(vgm_REML$phi, 3))
coefs <- data.frame(int, dem, nugget, psill, range)
rownames(coefs) <- c()
knitr::kable(
coefs, caption = "(ref:TableModelXuanchenglabel)",
booktabs = TRUE,
col.names = c("Int", "dem", "Nugget", "Partial sill", "Distance parameter (km)"),
linesep = ""
) %>%
kable_classic()
```
We conclude that the errors in the regression model predictions are not independent, although the correlation will be weak in this case, and that we must account for this correlation in optimising the spatial strata.
The discretisation grid with predicted lnSOM consists of 115,526 nodes. These are too many for function `optimStrata`. The grid is therefore thinned to a grid with a spacing of 800 m $\times$ 800 m, resulting in 7,257 nodes.
```{r, echo = FALSE}
library(sp)
gridded(grdXuancheng) <- ~ s1 + s2
subgrd <- spsample(grdXuancheng, type = "regular", cellsize = 800, offset = c(0.5, 0.5))
subgrd <- data.frame(coordinates(subgrd), over(subgrd, grdXuancheng))
```
The first step in optimisation of spatial strata with package **SamplingStrata** is to build the sampling frame with function `buildFrameSpatial`. Argument `X` specifies the stratification variables, and argument `Y` specifies the study variables. In our case, we have only one stratification variable and one study variable, and these are the same variable. Argument `variance` specifies the variance of the prediction error of the study variable. Variable `dom` is an identifier of the domain of interest for which we want to estimate the mean or total. I assign the value 1 to all population units (see the code chunk below), which implies that the stratification is optimised for the entire population. If we have multiple domains of interest, the stratification is optimised for each domain separately.
Finally, as a preparatory step we must specify how precise the estimated mean should be. This precision must be specified as a coefficient of variation (cv), i.e., the standard error of the estimated mean divided by the mean. I use a cv of 0.005. In the case of multiple domains of interest and multiple study variables, a cv must be specified per domain and per study variable. This precision requirement is used to compute the sample size for Neyman allocation (Equation \@ref(eq:Neymanallocation))^[For multivariate stratification, i.e., stratification with multiple study variables, Bethel allocation is used to compute the required sample size.]. The optimal stratification itself is independent of the precision requirement.
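To make the link between the cv requirement and the sample size concrete, here is a sketch of Neyman allocation with hypothetical stratum sizes, within-stratum standard deviations, and population mean (the numbers are made up, not taken from the Xuancheng frame):

```{r}
# Hypothetical strata: sizes N_h and within-stratum standard deviations S_h
N_h <- c(3000, 2500, 1000)
S_h <- c(0.20, 0.35, 0.50)
mz <- 4.0            # assumed population mean
cv <- 0.005          # target coefficient of variation
V <- (cv * mz)^2     # required variance of the estimated mean
w_h <- N_h / sum(N_h)
# Total sample size for Neyman allocation (ignoring the finite population
# correction), allocated to the strata proportionally to w_h * S_h
n <- ceiling(sum(w_h * S_h)^2 / V)
n_h <- round(n * w_h * S_h / sum(w_h * S_h))
n
```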
```{r}
library(SamplingStrata)
subgrd$id <- seq_len(nrow(subgrd))
subgrd$dom <- rep(1, nrow(subgrd))
frame <- buildFrameSpatial(df = subgrd, id = "id", X = c("lnSOMpred"),
Y = c("lnSOMpred"), variance = c("varpred"), lon = "x1", lat = "x2",
domainvalue = "dom")
cv <- as.data.frame(list(DOM = "DOM1", CV1 = 0.005, domainvalue = 1))
```
The optimal spatial stratification can be computed with function `optimStrata`, with argument `method = "spatial"`. The $R^2$ value of the linear regression model, used in the minimisation criterion (Equation \@ref(eq:minicritospatsDeGruijter)), can be specified with argument `fitting`.
```{block2, type = 'rmdnote'}
I am not sure that the correction factor $R^2$ in Equation \@ref(eq:Exid2ij2) is really needed. I believe that the smoothing effect is already accounted for by the variances and covariances of the prediction errors. I used an $R^2$ value of 1.
```
Arguments `range` and `kappa` are parameters of an exponential semivariogram, needed for computing the covariance of the prediction errors. Function `optimStrata` uses an extra parameter in the exponential semivariogram: $c_0+c_1\mathrm{exp}(-\kappa h/\phi)$. So, for the usual exponential semivariogram (Equation \@ref(eq:exponential)) `kappa` equals 1.
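The role of `kappa` can be seen from the correlation function implied by the exponential model, $\rho(h)=\mathrm{exp}(-\kappa h/\phi)$. A small sketch (`rho_exp` is an illustrative helper, not a **SamplingStrata** function, and the value of `phi` is arbitrary):

```{r}
# Correlation implied by the exponential model; kappa = 1 gives the usual
# exponential correlation function with distance parameter phi
rho_exp <- function(h, phi, kappa = 1) exp(-kappa * h / phi)
rho_exp(h = 0, phi = 2.7)        # 1: coinciding points, fully correlated
rho_exp(h = 3 * 2.7, phi = 2.7)  # exp(-3), about 0.05: effective range
```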
```{r, eval = FALSE}
res <- optimStrata(
framesamp = frame, method = "spatial", errors = cv, nStrata = 5,
fitting = 1, range = c(vgm_REML$phi), kappa = 1, showPlot = FALSE)
```
```{r, eval = FALSE, echo = FALSE}
write_rds(res, file = "results/OptimalStrata_Xuancheng.rds")
```
```{r, echo = FALSE}
res <- read_rds(file = "results/OptimalStrata_Xuancheng.rds")
```
A summary of the optimised strata can be obtained with function `summaryStrata`.
```{r}
print(smr_strata <- summaryStrata(
res$framenew, res$aggr_strata, progress = FALSE))
```
In the next code chunk, it is checked whether the coefficient of variation is indeed equal to the desired value of 0.005.
```{r}
strata <- res$aggr_strata
framenew <- res$framenew
N_h <- strata$N
w_h <- N_h / sum(N_h)
se <- sqrt(sum(w_h^2 * strata$S1^2 / strata$SOLUZ))
se / mean(framenew$Y1)
```
The coefficient of variation can also be computed with function `expected_CV`.
```{r}
expected_CV(strata)
```
Figure \@ref(fig:optimalstrataXuanchengSamplingStrata) shows the optimised strata. I used the stratum bounds in `data.frame` `smr_strata` to compute the stratum of each raster cell of the original 200 m $\times$ 200 m grid.
```{r optimalstrataXuanchengSamplingStrata, echo = FALSE, out.width = "100%", fig.cap = "Model-based optimal strata for estimating the mean of the natural log of the SOM concentration in Xuancheng."}
grdXuancheng <- as_tibble(grdXuancheng)
strat <- findInterval(grdXuancheng$lnSOMpred, smr_strata$Upper_X1[-nrow(smr_strata)])
labels <- as.character(round(smr_strata$Upper_X1, 2))
first <- paste("<", labels[1])
second <- paste(labels[1], "-", labels[2])
third <- paste(labels[2], "-", labels[3])
fourth <- paste(labels[3], "-", labels[4])
fifth <- paste(">", labels[4])
labs <- c(first, second, third, fourth, fifth)
ggplot(grdXuancheng) +
geom_raster(mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = factor(strat))) +
scale_fill_viridis_d(name = "lnSOM", labels = labs) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
```{r, echo = FALSE}
rm(list = ls())
```