From 2ceea39817d18816c397e175119e89b6f38c51a4 Mon Sep 17 00:00:00 2001 From: Stu Field Date: Wed, 13 Dec 2023 16:53:12 -0700 Subject: [PATCH] Add new lifting and bridging vignette - word smithing of lifting documentation - addition of new vignette with lifting guidance - closes #77 --- R/adat-helpers.R | 2 +- R/lift-adat.R | 34 ++-- _pkgdown.yml | 16 +- man/adat-helpers.Rd | 2 +- man/lift_adat.Rd | 34 ++-- vignettes/lifting-and-bridging.Rmd | 307 +++++++++++++++++++++++++++++ 6 files changed, 343 insertions(+), 52 deletions(-) create mode 100644 vignettes/lifting-and-bridging.Rmd diff --git a/R/adat-helpers.R b/R/adat-helpers.R index bed00ca..998028c 100644 --- a/R/adat-helpers.R +++ b/R/adat-helpers.R @@ -8,7 +8,7 @@ #' [checkSomaScanVersion()] determines if the version of #' is a recognized version of SomaScan.\cr #' \cr -#' Table of SomaScan Assay versions: +#' Table of SomaScan assay versions: #' \tabular{lll}{ #' **Version** \tab **Commercial Name** \tab **Size** \cr #' `V4` \tab 5k \tab 5284 \cr diff --git a/R/lift-adat.R b/R/lift-adat.R index d729b81..362a239 100644 --- a/R/lift-adat.R +++ b/R/lift-adat.R @@ -5,7 +5,7 @@ #' between assay versions; from changing reagents, liquid handling equipment, #' well volumes, and content expansion. #' -#' Table of SomaScan Assay versions: +#' Table of SomaScan assay versions: #' #' \tabular{lll}{ #' **Version** \tab **Commercial Name** \tab **Size** \cr @@ -36,34 +36,22 @@ #' @details #' Matched samples across assay versions are used to calculate bridging #' scalars. For each analyte, this scalar is computed as the ratio of -#' population _medians_ (\eqn{n > 1000}) between assay versions. For example, -#' the linear scalar for the \eqn{i^{th}} analyte translating from `11k` -> `7k` -#' is defined as: -#' -#' \deqn{R_i = \frac{\hat\mu_{7k}}{\hat\mu_{11k}}} +#' population _medians_ across assay versions. 
+#' Please see the lifting vignette +#' `vignette("lifting-and-bridging", package = "SomaDataIO")` +#' for more details. #' #' @section Lin's CCC: -#' Calculating analyte-specific bridging scalars involves a careful evaluation -#' of the correlation of post-lifting RFU values in the reference population -#' used to calculate the linear scalars. The Lin's Concordance Correlation -#' Coefficient (CCC) is calculated between matched samples from the original -#' SomaScan signal space and the identical lifted samples that have been -#' scaled back to the original signal space. This CCC value is an estimate -#' of how well an analyte can be bridged across specific SomaScan versions. -#' Factors affecting an analyte's lifting CCC are: reagents with high -#' intra-assay CV (Coefficient of Variation) and reagents signaling -#' near background or saturation levels. +#' The Lin's Concordance Correlation Coefficient (CCC) is calculated +#' by computing the correlation between post-lift RFU values and the +#' RFU values generated on the original SomaScan version. +#' This CCC estimate is a measure of how well an analyte can be bridged +#' across SomaScan versions. +#' See `vignette("lifting-and-bridging", package = "SomaDataIO")`. #' As with the lifting scalars, if you have an annotations file #' you may view the analyte-specific CCC values via [read_annotations()]. #' Alternatively, [getSomaScanLiftCCC()] retrieves these values #' from an internal object for both `"serum"` and `"plasma"`. -#' Lin's CCC (\eqn{p_c}) is defined by: -#' -#' \deqn{p_c = \frac{2\rho\hat\sigma_x\hat\sigma_y}{(\hat\mu_x - \hat\mu_y)^2 + \hat\sigma^2_x + \hat\sigma^2_y}} -#' -#' where \eqn{\rho}, \eqn{\mu}, and \eqn{\sigma} are the Pearson correlation -#' coefficient, and estimated median and standard deviation estimates from -#' assay version groups \eqn{x} and \eqn{y} respectively. #' #' @section Extra Columns: #' * Newer versions of SomaScan typically have additional content, i.e. 
diff --git a/_pkgdown.yml b/_pkgdown.yml index d82b4aa..c7a1541 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -79,7 +79,7 @@ navbar: href: articles/stat-binary-classification.html - text: Linear Regression href: articles/stat-linear-regression.html - + FAQs: text: Coming Soon menu: @@ -106,6 +106,11 @@ articles: contents: - cli-merge-tool + - title: Lifting and Bridging + navbar: ~ + contents: + - lifting-and-bridging + - title: Statistical Workflow Examples contents: - starts_with("articles/stat-") @@ -139,12 +144,15 @@ reference: - title: Transform Between SomaScan Versions desc: > - Functionality required to convert between SomaScan versions, - e.g. v4.1 -> v4.0, sometimes referred to as "lifting". + Functionality required to bridge between SomaScan versions, + e.g. 11k -> 7k, sometimes referred to as "lifting". contents: - - read_annotations - lift_adat + - read_annotations - transform + - getSomaScanLiftCCC + - getSomaScanVersion + - getSignalSpace - title: Expression Data desc: > diff --git a/man/adat-helpers.Rd b/man/adat-helpers.Rd index 787c0d8..9c560e7 100644 --- a/man/adat-helpers.Rd +++ b/man/adat-helpers.Rd @@ -48,7 +48,7 @@ These are elements of the \code{HEADER} attributes of the object.\cr\cr \code{\link[=checkSomaScanVersion]{checkSomaScanVersion()}} determines if the version of is a recognized version of SomaScan.\cr \cr -Table of SomaScan Assay versions: +Table of SomaScan assay versions: \tabular{lll}{ \strong{Version} \tab \strong{Commercial Name} \tab \strong{Size} \cr \code{V4} \tab 5k \tab 5284 \cr diff --git a/man/lift_adat.Rd b/man/lift_adat.Rd index 7008369..5131dd6 100644 --- a/man/lift_adat.Rd +++ b/man/lift_adat.Rd @@ -34,7 +34,7 @@ The SomaScan platform continually improves its technical processes between assay versions; from changing reagents, liquid handling equipment, well volumes, and content expansion. 
-Table of SomaScan Assay versions: +Table of SomaScan assay versions: \tabular{lll}{ \strong{Version} \tab \strong{Commercial Name} \tab \strong{Size} \cr @@ -65,35 +65,23 @@ See below for all options for the \code{bridge} argument. \details{ Matched samples across assay versions are used to calculate bridging scalars. For each analyte, this scalar is computed as the ratio of -population \emph{medians} (\eqn{n > 1000}) between assay versions. For example, -the linear scalar for the \eqn{i^{th}} analyte translating from \verb{11k} -> \verb{7k} -is defined as: - -\deqn{R_i = \frac{\hat\mu_{7k}}{\hat\mu_{11k}}} +population \emph{medians} across assay versions. +Please see the lifting vignette +\code{vignette("lifting-and-bridging", package = "SomaDataIO")} +for more details. } \section{Lin's CCC}{ -Calculating analyte-specific bridging scalars involves a careful evaluation -of the correlation of post-lifting RFU values in the reference population -used to calculate the linear scalars. The Lin's Concordance Correlation -Coefficient (CCC) is calculated between matched samples from the original -SomaScan signal space and the identical lifted samples that have been -scaled back to the original signal space. This CCC value is an estimate -of how well an analyte can be bridged across specific SomaScan versions. -Factors affecting an analyte's lifting CCC are: reagents with high -intra-assay CV (Coefficient of Variation) and reagents signaling -near background or saturation levels. +The Lin's Concordance Correlation Coefficient (CCC) is calculated +by computing the correlation between post-lift RFU values and the +RFU values generated on the original SomaScan version. +This CCC estimate is a measure of how well an analyte can be bridged +across SomaScan versions. +See \code{vignette("lifting-and-bridging", package = "SomaDataIO")}. 
As with the lifting scalars, if you have an annotations file you may view the analyte-specific CCC values via \code{\link[=read_annotations]{read_annotations()}}. Alternatively, \code{\link[=getSomaScanLiftCCC]{getSomaScanLiftCCC()}} retrieves these values from an internal object for both \code{"serum"} and \code{"plasma"}. -Lin's CCC (\eqn{p_c}) is defined by: - -\deqn{p_c = \frac{2\rho\hat\sigma_x\hat\sigma_y}{(\hat\mu_x - \hat\mu_y)^2 + \hat\sigma^2_x + \hat\sigma^2_y}} - -where \eqn{\rho}, \eqn{\mu}, and \eqn{\sigma} are the Pearson correlation -coefficient, and estimated median and standard deviation estimates from -assay version groups \eqn{x} and \eqn{y} respectively. } \section{Extra Columns}{ diff --git a/vignettes/lifting-and-bridging.Rmd b/vignettes/lifting-and-bridging.Rmd new file mode 100644 index 0000000..04f5c96 --- /dev/null +++ b/vignettes/lifting-and-bridging.Rmd @@ -0,0 +1,307 @@ +--- +title: "Lifting and Bridging SomaScan" +author: "Stu Field, SomaLogic Operating Co., Inc." +description: > + A primer on lifting and bridging 'SomaScan' data. +output: + rmarkdown::html_vignette: + fig_caption: yes +vignette: > + %\VignetteIndexEntry{Lifting and Bridging SomaScan} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +editor_options: + chunk_output_type: console +--- + +```{r setup, include = FALSE} +library(SomaDataIO) +library(ggplot2) +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +calc_ccc <- function(x, y) { + k <- length(x) + sdx <- sd(x) + sdy <- sd(y) + rho <- stats::cor(x, y, method = "pearson") + v <- sdx / sdy # scale shift + sx2 <- stats::var(x) * (k - 1) / k + sy2 <- stats::var(y) * (k - 1) / k + # location shift relative to scale + u <- ( mean(x) - mean(y) ) / ( (sx2 * sy2)^0.25 ) + rho * ( (v + 1 / v + u^2 ) / 2 )^-1 +} +``` + + +# Overview + +`SomaDataIO` contains functionality to bridge +(aka "lift") between various SomaScan versions by linear transformations +of RFU data. 
+Lifting between various versions is essentially a calibration of the +analyte/features in RFU space. + + +## Why lift? + +The SomaScan platform continually improves its technical processes +between assay versions; from changing reagents, liquid handling equipment, +well volumes, and content expansion. + +However, for a given analyte, these technical upgrades can result +in minute measurement signal differences, +requiring a calibration (aka "lifting" or "bridging") to bring RFUs into a +comparable signal space. +This is accomplished by applying an analyte-specific scalar, +a linear transformation, to each analyte RFU measurement (column). + +### Current SomaScan Versions + +| **Version** | **Commercial Name** | **Size** | +|:------------- |:------------------- |:------------- | +| `V4` | 5k | 5284 | +| `v4.1` | 7k | 7596 | +| `v5.0` | 11k | 11083 | + + +### Lifting Requirements + +There are 4 main requirements in order to reliably bridge +across SomaScan signal space: + +1. the object attributes, where signal information is stored, must be intact. +1. the sample matrix must be either serum or plasma. No other matrices + are currently supported. +1. the RFU data must have been normalized by Adaptive Normalization via + Maximum-Likelihood (ANML). This is the standard normalization for + most SomaScan deliveries. +1. the current SomaScan version and signal space must be one of those + above (see table), i.e. one of `5k`, `7k`, or `11k`. Older versions + of SomaScan are not supported. + +--------------- + +## Lifting Scalars + +Matched samples (n > 1000) from a healthy, normal reference population +were run across assay versions and used to calculate bridging scalars. +This experiment was run separately for both serum and plasma and all +SomaScan runs were first normalized as per the standard normalization +procedure, and flagged samples were removed prior to further analysis. 
+ +For each analyte, the "lifting" or "bridging" scalar is computed as the +ratio of population _medians_ between assay versions. For example, +the linear scalar for the $i^{th}$ analyte translating from +`11k` $\rightarrow$ `7k` is defined as: + +$$ +R_i = \frac{\hat\mu_{7k}}{\hat\mu_{11k}} +$$ + +Signals generated in `11k` space can be multiplied by this scale factor +to translate into `7k` space. + +Below is a concordance plot of what this shift would look like for a single +analyte on a _simulated_ reference population. Please see the section on +Lin's CCC for its definition and interpretation. + +```{r lift-concord1, echo = FALSE, fig.width = 6, fig.height = 4} +rfu <- dplyr::filter(example_data, SampleType == "Sample")$seq.9016.12 +L <- length(rfu) +rfu2 <- rfu + + withr::with_seed(123, rnorm(L, mean = 500, sd = sd(rfu) / 3)) +sf <- median(rfu) / median(rfu2) +pre <- data.frame(x = rfu, y = rfu2) +pre$group <- sprintf("pre-lift (%0.3f)", calc_ccc(pre$x, pre$y)) +post <- data.frame(x = rfu, y = rfu2 * sf) +post$group <- sprintf("post-lift (%0.3f)", calc_ccc(post$x, post$y)) +plot_df <- rbind(pre, post) +plot_df$group <- factor(plot_df$group, levels = rev(sort(unique(plot_df$group)))) +lims <- range(plot_df[, -3L]) +plot_df |> + ggplot(aes(x = x, y = y, colour = group)) + + geom_point(alpha = 0.5, size = 3) + + scale_x_log10(guide = "axis_logticks") + + scale_y_log10(guide = "axis_logticks") + + scale_colour_manual(name = "CCC", values = c("#00A499", "#24135F")) + + expand_limits(x = lims, y = lims) + + labs(x = "SomaScan 7k", y = "SomaScan 11k", + title = sprintf("Lifting Concordance (Scalar = %0.3f)", sf)) + + geom_abline(slope = 1, intercept = 0, color = "black") +``` + +--------------- + +## Lin's CCC + +The Lin's Concordance Correlation Coefficient (CCC) is calculated +by computing the correlation between post-lift RFU values and the +RFU values generated on the original SomaScan version. 
Measurements +generated from the matched samples used to calculate the lifting scalars +were also used to calculate the post-hoc CCC estimates for the SomaScan +bridge. +Lin's CCC ($p_c$) is defined by: + +$$ +p_c = \frac{2\rho\hat\sigma_x\hat\sigma_y}{(\hat\mu_x - \hat\mu_y)^2 + \hat\sigma^2_x + \hat\sigma^2_y} +$$ + +where $\rho$ is the Pearson correlation coefficient, and $\hat\mu$ and +$\hat\sigma$ are the estimated mean and standard deviation for +assay version groups _x_ and _y_, respectively. + + +### Interpretation of CCC + +This CCC value can be viewed as an estimate of the confidence in the +bridging transformation across SomaScan versions. +Examples of factors that could affect an analyte's lifting CCC are: + +- analytes/reagents with high intra-assay CV (Coefficient of Variation) +- analytes/reagents signaling near background or saturation levels + + + + +### Accessing CCC + +The `getSomaScanLiftCCC()` function retrieves these values +from an internal object for either `"serum"` or `"plasma"`. + +```{r ccc} +plasma <- getSomaScanLiftCCC("p") +plasma + +serum <- getSomaScanLiftCCC("s") +serum +``` + +```{r cdf, fig.width = 6, fig.height = 5} +cdf_df <- data.frame( + ccc = c(plasma$plasma_11k_to_7k_ccc, serum$serum_11k_to_7k_ccc), + matrix = rep(c("plasma", "serum"), each = nrow(plasma)) +) +cdf_df <- cdf_df[!is.na(cdf_df$ccc), ] # rm NAs; non-comparable analytes +ggplot(cdf_df, aes(x = ccc, colour = matrix)) + + stat_ecdf(linewidth = 0.75) + + scale_colour_manual(name = "", values = c("#00A499", "#24135F")) + + labs(title = "CDF of CCC Values", + x = "Lin's CCC (11k -> 7k)", y = "P(X < x)") + + coord_cartesian() +``` + +As shown in the distribution above, for the `11k` $\rightarrow$ `7k` lift, approximately +88% and 84% of the SomaScan menu have post-bridging CCC values above 0.75 +(high quality) for plasma and serum, respectively. + + +## Column Setdiff + +There are two scenarios to consider: + +* Newer versions of SomaScan typically have additional content, i.e. 
+ new reagents added to the multi-plex assay that bind to additional proteins. + When lifting _to_ a previous SomaScan version, new reagents that do _not_ + exist in the "earlier" assay version are scaled by 1.0, and thus + maintained unmodified in the returned object. Users may need to drop + these columns in order to combine these data with a previous study + from an earlier SomaScan version, e.g. with `collapseAdats()`. +* In the inverse scenario, lifting "forward" _from_ a previous, lower-plex + version, there will be extra reference values that are unnecessary + to perform the lift, and a warning is triggered. The resulting data + consists of RFU data in the "new" signal space, but with fewer analytes + than would otherwise be expected (e.g. `11k` space with only 5284 + analytes; see example below). + + +------------ + + +## Example: `5k` $\rightarrow$ `11k` + +Since the `example_data` object was originally run on SomaScan +`r getSignalSpace(example_data)`, this vignette will demonstrate +the lifting/bridging process _from_ `5k` $\rightarrow$ `11k` signal space, +the latter being the most recent SomaScan version. + +### Steps + +1. Determine that attributes are intact. + ```r + is_intact_attr(adat) + ``` +1. Determine the matrix type of the data (serum or plasma). + ```r + attr(adat, "Header.Meta")$HEADER$StudyMatrix + ``` +1. Determine whether the SomaScan version is supported. + ```r + checkSomaScanVersion(getSomaScanVersion(adat)) + ``` +1. Confirm the current `soma_adat` is in the SomaScan signal space you expect. + ```r + getSignalSpace(adat) + ``` +1. Apply the scale factors using `lift_adat()`. + ```r + lift_adat(adat, bridge = "") + ``` + +### Step 1 +```{r attr} +# determine intact attributes +# must be TRUE +is_intact_attr(example_data) +``` + +### Step 2 +```{r matx} +# determine study matrix +# must be Serum or Plasma +attr(example_data, "Header.Meta")$HEADER$StudyMatrix |> as.character() +``` + +Confirm that the matrix of the SomaScan run was `"EDTA Plasma"`. 
+ +### Step 3 +```{r version} +# determine if current space can be lifted +# must be TRUE, i.e. checkSomaScanVersion() returned NULL +is.null(checkSomaScanVersion(getSomaScanVersion(example_data))) +``` + +### Step 4 +```{r space} +# determine current SomaScan space +# must be V4, v4.1, or v5.0 +getSignalSpace(example_data) +``` + +Finally, invoke `lift_adat()` to perform the lift/bridge/transformation. + +### Step 5 +```{r lift} +lift_11k <- lift_adat(example_data, bridge = "5k_to_11k") + +is_lifted(lift_11k) + +is.soma_adat(lift_11k) + +getSignalSpace(lift_11k) + +getSomaScanVersion(lift_11k) +``` + + +------------- + + +## Considerations + +### Should you filter analytes? + +### Was the SomaScan bridge successful? +