Skip to content

sparsegdm/sgdm_package

Repository files navigation

# sgdm: an R package for performing sparse generalized dissimilarity modelling with tools for gdm

Pedro J. Leitão, Marcel Schwieder, and Cornelius Senf

**Reference**

Leitão, J. P., Schwieder, M., and Senf, C. (2017) sgdm: an R package for performing sparse generalized dissimilarity modelling, including tools for gdm. *ISPRS International Journal of Geo-Information*, 6(1), 23-35.

## Installing `sgdm`

In order to install `sgdm` from GitHub, you need to install the `devtools` package first. Using the `install_git` function in the `devtools` package, the `sgdm` package can be easily installed:

```{r, eval=FALSE}
library(devtools)
devtools::install_github("sparsegdm/sgdm_package")
library(sgdm)
```

```{r, echo=FALSE}
library(sgdm)
```

**Note 1:** On a Linux (Ubuntu) machine things might be a little more complicated and you first need to install openssl for linux:

- sudo apt-get install libcurl4-openssl-dev libxml2-dev
- sudo apt-get install libssl-dev

Before you can install the `devtools` package.

**Note 2:** The `PMA` package, which is required by SGDDM and installed automatically, needs the `impute` package from the bioclite repository. However, it seems that `PMA` doe not automatically load `impute`. If you haven't installed either `PMA` or `impute` before, you need to first install the `impute` package:

```{r, echo=FALSE}
source("https://bioconductor.org/biocLite.R")
biocLite("impute")
```

## Exemplary data

The package includes nine functions, and three exemplary datasets. The exemplary datasets include one biological dataset, one predictor dataset and one predictor map. The trees biological dataset is composed of 30 observations with abundance values for 48 different tree families in an area of natural vegetation in the Brazilian Cerrado. The spectra predictor dataset is composed of the same 30 observations, with reflectance values for 83 narrow spectral bands; extracted from spaceborne hyperspectral Hyperion imagery after pre-processing and band quality screening. Both datasets include an ID column and the later also include two geographical coordinate (X and Y) columns. The spectral.image predictor map is a raster object constituting of a subset (100 times 100 pixels) of the respective Hyperion image.

## Running a `sgdm` model

Running a full `sgdm` model requires five steps:

- Parameterize and train the SGDM model
- Reduce the SGDM model by identifying non-significant predictors
- Validate the SGDM model
- Map community composition patterns

In the following we will go through each of the four steps in order to examplify how the `sgdm` package works.

### Parameterize and train the SGDM model

For parameterizing a `sgdm` model, it is necessary to identify the best penalization parameter pair using a heuristic grid search implemented in the function `sgdm.param`:

```{r, results='hide'}
sgdm.gs <- sgdm.param(predData = spectra, bioData = trees, k = 30, predPenalization = seq(0.6, 1, 0.1), bioPenalization = seq(0.6, 1, 0.1), geo = F)
```

The parameter `k` sets the number of components to be used and the parameters `predPenalization` and `bioPenalization` set the penalization values to be tested (ranging from 0 (strong penalization) to 1 (weak penalization). The standard values implemented in this function range from 0.6 to 1 in 0.1 steps, but these values can be manually configured to better match the used datasets. Finally, the user can specify if geographical distance as a variable in the GDM model should be used. Note that this function implementation only allows biological data in **format 1** specification as described in the `gdm` package.

The function `sgdm.param` returns a performance matrix, with the performance values (RMSE) of each parameter pair estimated from 5-fold cross-validation:

```{r}
print(sgdm.gs, digits = 4)
```

Using the performance matrix and the `sgdm.best` function, we can retreive the best model (`output = "m"`):

```{r, results='hide'}
sgdm.model <- sgdm.best(perf.matrix = sgdm.gs, predData = spectra, bioData = trees, output = "m", k = 30)
```

Summary of the model:

```{r, results='hide'}
summary(sgdm.model)
```

Alternatively, the sgdm.best function also allows the user to retrieve the resulting sparse canonical components (`output = "c"`) or the respective canonical vectors (`output = "v"`).

```{r, results='hide'}
sgdm.sccbest <- sgdm.best(perf.matrix = sgdm.gs, predData = spectra, bioData = trees, output = "c", k = 30)
sgdm.vbest <- sgdm.best(perf.matrix = sgdm.gs, predData = spectra, bioData = trees, output = "v", k = 30)
```

### Reduce the SGDM model by removing non-significant predictors

Some of the sparse canonical components might be less important than others. In order to reduce the model complexity, we can utilize the `gdm.varsig` function with the sparse canonical components and the biological data:

```{r, results='hide'}
sigtest.sgdm <- gdm.varsig(predData = sgdm.sccbest, bioData = trees)
```

and use this significance test result to reduce the predictor data for re-training the model:

```{r}
sgdm.sccbest.red <- data.reduce(data = sgdm.sccbest, datatype = "pred", sigtest = sigtest.sgdm)
```

In order to train the final model, the significant sparse canonical components must be combined with the biological dataset in a site pair dataset using the function `formatsitepair` in the `gdm` package:

```{r}
spData.sccabest.red <- gdm::formatsitepair(bioData = trees, bioFormat = 1, dist = "bray",
                                           abundance = TRUE, siteColumn = "Plot_ID", 
                                           XColumn = "X",YColumn = "Y", predData = sgdm.sccbest.red)

sgdm.model.red <- gdm::gdm(data = spData.sccabest.red)
```

The final model can now be inspected and the predicted dissimilarities plotted:

```{r, results='hide'}
summary(sgdm.model.red)
```

```{r}
plot(sgdm.model.red$predicted, sgdm.model.red$observed, xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1)
```

### Validate the SGDM model

For independent validation, there is a function in the `sgdm` package called `gdm.cv`, which performes n-fold cross-validation on a (S)GDM model:

```{r}
gdm.cv(spData = spData.sccabest.red, nfolds = 10)
```

The functions returns the cross-validated RMSE value (default). Alternatively, the cross-validated coefficient of determination (r$^2$) can be returned:

```{r}
gdm.cv(spData = spData.sccabest.red, nfolds = 10, performance = "r2")
```

### Map community composition patterns

The community composition patterns can be plotted along the main axes of variation following an NMDS transformation of the predicted dissimilarities between sample pairs:

```{r, results='hide'}
community.samples <- gdm.map(spData = spData.sccabest.red, model = sgdm.model.red, k = 0, t = 0.1)
```

The number of NMDS axes to be extracted is by default automatically determined by the resulting mean NMDS stress values out of 20 iterations following:

- less than 0.05: excellent
- less than 0.1: good
- greater 0.1: poor

The number of NMDS axes (`k`) can also be set by the user.

These patterns can also be mapped in space using a predictor map. As it would be unfeasible to run a NMDS on the dissimilarities between all possible pairs of image pixels, it is possible to assign the pixels to the NMDS axes from the sample pairs, through knn-imputation.

The predictor map used here is called `spectral.image` and is a `RasterStack` object, which can be plotted in false color:

```{r}
raster::plotRGB(spectral.image, r = 43, g = 22, b = 12, stretch = "hist")
```

In order to map the community composition patterns predicted by sgdm, it is necessary to apply the canonical transformation onto the prediction map. The resulting component map must also to be reduced according to the variable significance test:

```{r}
component.image <- predData.transform(predData = spectral.image, v = sgdm.vbest)
component.image.red <- data.reduce(component.image, datatype = "pred", sigtest = sigtest.sgdm)
```

With the resulting reduced component image, it is possible to map the community composition patterns in space. For example, in order to visualize them in RGB space, it is possible to extract 3 NMDS axes:

```{r}
map.sgdm.red <- gdm.map(spData = spData.sccabest.red, predMap = component.image.red, model = sgdm.model.red, k = 3)
raster::plotRGB(map.sgdm.red, r = 3, g = 2, b = 1, stretch = "hist")
```

About

SGDM package

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages