Add new proper vignettes (#35)

- moved statistical examples into their own vignettes - they will now be rendered by `pkgdown` - simplifies `README` - four new vignettes: - loading-and-wrangling - binary-classification - linear-regression - two-group-comparison - fixes #35
SomaLogic · Apr 6, 2023 · e2e6630 · e2e6630
1 parent dc7bff9
commit e2e6630
Show file tree

Hide file tree

Showing 9 changed files with 703 additions and 7 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -37,6 +37,7 @@ Suggests:
     Biobase,
     ggplot2,
     knitr,
+    purrr,
     recipes,
     rmarkdown,
     spelling,

diff --git a/_pkgdown.yml b/_pkgdown.yml
@@ -27,6 +27,35 @@ articles:
     contents:
     - SomaDataIO
 
+  - title: Loading and Wrangling
+    navbar: ~
+    desc: >
+      How to load and manipulate a 'SomaScan' flat text file into 
+      and R environment.
+    contents:
+    - loading-and-wrangling
+
+  - title: Two-group Comparison
+    navbar: ~
+    desc: >
+      Typical two-group comparison of 'SomaScan' data.
+    contents:
+    - two-group-comparison
+
+  - title: Binary Classification
+    navbar: ~
+    desc: >
+      Typical binary classification of 'SomaScan' data.
+    contents:
+    - binary-classification
+
+  - title: Linear Regression
+    navbar: ~
+    desc: >
+      Typical linear regression of continuous 'SomaScan' data.
+    contents:
+    - linear-regression
+
 reference:
   - title: Load an ADAT
     desc: >

diff --git a/vignettes/.gitignore b/vignettes/.gitignore
@@ -0,0 +1,2 @@
+*.html
+*.R
diff --git a/vignettes/SomaDataIO.Rmd b/vignettes/SomaDataIO.Rmd
@@ -1,15 +1,17 @@
 ---
 title: "Introduction to SomaDataIO"
-output: rmarkdown::html_vignette
+output:
+  rmarkdown::html_vignette:
+    fig_caption: yes
 vignette: >
   %\VignetteIndexEntry{Introduction to SomaDataIO}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 ```{r setup, include = FALSE}
-library(SomaDataIO)
 knitr::opts_chunk$set(
+  echo = TRUE,
   collapse = TRUE,
   comment = "#>"
 )
@@ -20,7 +22,8 @@ knitr::opts_chunk$set(
 This document accompanies the `SomaDataIO` R package, which loads
 and exports 'SomaScan' data via the SomaLogic Operating Co., Inc.
 proprietary text file called an ADAT (`*.adat`).
-For file format see [here](https://github.com/SomaLogic/SomaLogic-Data/blob/master/README.md).
+For file format see
+[here](https://github.com/SomaLogic/SomaLogic-Data/blob/master/README.md).
 The package also exports auxiliary functions for manipulating, wrangling,
 and extracting relevant information from an ADAT object once in memory.
 Basic familiarity with the R environment is assumed, as is the ability to install
@@ -42,19 +45,28 @@ contributed packages from the Comprehensive R Archive Network (CRAN).
     + `?SeqId` analyte (feature) matching.
     + `dplyr` and `tidyr` verb S3 methods for the `soma_adat` class.
     + `?rownames` helpers that do not break `soma_adat` attributes.
+    + please see vignette `vignette("loading-and-wrangling", package = "SomaDataIO")`
 
 * Exporting data (Output)
     + write out a `soma_adat` object as a `*.adat` text file.
 
 
 ----------------------
 
-## Workflows and Analysis (TODO)
+## Workflows and Analysis
 This section will become more fleshed out in future versions of
-`SomaDataIO`
+`SomaDataIO`. In the meantime, below are 3 examples of typical
+primary statistical analyses that are commonly performed on
+'SomaScan' data:
 
-#### In the meantime please see the package
-[README](https://github.com/SomaLogic/SomaDataIO/blob/main/README.md)
+- Two-group comparison (e.g. differential expression) via *t*-test
+  + see vignette `vignette("two-group-comparison", package = "SomaDataIO")`
+
+- Binary classification
+  + see vignette `vignette("binary-classification", package = "SomaDataIO")`
+
+- Linear regression
+  + see vignette `vignette("linear-regression", package = "SomaDataIO")`
 
 
 ---------------------

diff --git a/vignettes/binary-classification.Rmd b/vignettes/binary-classification.Rmd
@@ -0,0 +1,148 @@
+---
+title: "Binary Classification"
+author: "Stu Field, SomaLogic Operating Co., Inc."
+output:
+  rmarkdown::html_vignette:
+    fig_caption: yes
+vignette: >
+  %\VignetteIndexEntry{Binary Classification}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r setup, include = FALSE}
+library(SomaDataIO)
+library(dplyr)
+library(tidyr)
+library(purrr)
+knitr::opts_chunk$set(
+  echo = TRUE,
+  collapse = TRUE,
+  comment = "#>",
+  fig.path = "figures/classify-"
+)
+```
+
+
+----------------
+
+
+## Classification via Logistic Regression
+
+
+Although targeted statistical analyses are beyond the scope of 
+the `SomaDataIO` package, below is an example analysis
+that typical users/customers would perform on 'SomaScan' data.
+
+It is not intended to be a definitive guide in statistical
+analysis and existing packages do exist in the `R` ecosystem that perform
+parts or extensions of these techniques. Many variations of the workflow
+below exist, however the framework highlights how one could perform standard
+_preliminary_ analyses on 'SomaScan' data.
+
+
+## Data Preparation
+```{r data-prep}
+# the `example_data` package data
+dim(example_data)
+
+table(example_data$SampleType)
+
+# center/scale
+cs <- function(.x) {    # .x = numeric vector
+  out <- .x - mean(.x)  # center
+  out / sd(out)         # scale
+}
+
+# prepare data set for analysis
+cleanData <- example_data |>
+  filter(SampleType == "Sample") |>               # rm control samples
+  drop_na(Sex) |>                                 # rm NAs if present
+  log10() |>                                      # log10-transform (Math Generic)
+  mutate(Group = as.numeric(factor(Sex)) - 1) |>  # map Sex -> 0/1
+  modify_at(getAnalytes(example_data), cs)
+
+table(cleanData$Sex)
+
+table(cleanData$Group)    # F = 0; M = 1
+```
+
+## Set up Train/Test Data
+
+```{r train-test}
+# idx = hold-out 
+# seed resulting in 50/50 class balance
+idx   <- withr::with_seed(3, sample(1:nrow(cleanData), size = nrow(cleanData) - 50))
+train <- cleanData[idx, ]
+test  <- cleanData[-idx, ]
+
+# assert no overlap
+isTRUE(
+  all.equal(intersect(rownames(train), rownames(test)), character(0))
+)
+```
+
+
+## Logistic Regression
+We use the `cleanData`, `train`, and `test` data objects from above.
+
+### Predict Sex
+```{r logreg-tbl}
+LR_tbl <- getAnalyteInfo(train) |>
+  select(AptName, SeqId, Target = TargetFullName, EntrezGeneSymbol, UniProt) |>
+  mutate(
+    formula  = map(AptName, ~ as.formula(paste("Group ~", .x))),  # create formula
+    model    = map(formula, ~ stats::glm(.x, data = train, family = "binomial", model = FALSE)),  # fit glm()
+    beta_hat = map_dbl(model, ~ coef(.x)[2L]),      # pull out coef Beta
+    p.value  = map2_dbl(model, AptName, ~ {
+      summary(.x)$coefficients[.y, "Pr(>|z|)"] }),  # pull out p-values
+    fdr      = p.adjust(p.value, method = "BH")     # FDR correction multiple testing
+  ) |>
+  arrange(p.value) |>            # re-order by `p-value`
+  mutate(rank = row_number())    # add numeric ranks
+
+LR_tbl
+```
+
+
+### Fit Model | Calculate Performance
+
+Next, select features for the model fit. We have a good idea of reasonable `Sex`
+markers from prior knowledge (`CGA*`), and fortunately many of these are highly
+ranked in `LR_tbl`. Below we fit a 4-marker logistic regression model from
+cherry-picked gender-related features:
+
+```{r fit-logreg}
+# AptName is index key between `LR_tbl` and `train`
+feats <- LR_tbl$AptName[c(1L, 3L, 5L, 7L)]
+form  <- as.formula(paste("Group ~", paste(feats, collapse = "+")))
+fit   <- glm(form, data = train, family = "binomial", model = FALSE)
+pred  <- tibble(
+  true_class = test$Sex,                                         # orig class label
+  pred       = predict(fit, newdata = test, type = "response"),  # prob. 'Male'
+  pred_class = ifelse(pred < 0.5, "F", "M"),                     # class label
+)
+conf <- table(pred$true_class, pred$pred_class, dnn = list("Actual", "Predicted"))
+tp   <- conf[2L, 2L]
+tn   <- conf[1L, 1L]
+fp   <- conf[1L, 2L]
+fn   <- conf[2L, 1L]
+
+# Confusion matrix
+conf
+
+# Classification metrics
+tibble(Sensitivity = tp / (tp + fn),
+       Specificity = tn / (tn + fp),
+       Accuracy    = (tp + tn) / sum(conf),
+       PPV         = tp / (tp + fp),
+       NPV         = tn / (tn + fn)
+)
+```
+
+
+---------------------
+
+
+Created by [Rmarkdown](https://github.com/rstudio/rmarkdown)
+(v`r utils::packageVersion("rmarkdown")`) and `r R.version$version.string`.
diff --git a/vignettes/figures/.gitignore b/vignettes/figures/.gitignore
@@ -0,0 +1,2 @@
+*.png
+*.html