diff --git a/R/adat-helpers.R b/R/adat-helpers.R
index 803f7fb..7588d8d 100644
--- a/R/adat-helpers.R
+++ b/R/adat-helpers.R
@@ -8,7 +8,7 @@
 #' [checkSomaScanVersion()] determines if the version of
 #' is a recognized version of SomaScan.\cr
 #' \cr
-#' Table of SomaScan Assay versions:
+#' Table of SomaScan assay versions:
 #' \tabular{lll}{
 #' **Version** \tab **Commercial Name** \tab **Size** \cr
 #' `V4` \tab 5k \tab 5284 \cr
diff --git a/R/lift-adat.R b/R/lift-adat.R
index 616e94f..26284c8 100644
--- a/R/lift-adat.R
+++ b/R/lift-adat.R
@@ -5,7 +5,7 @@
 #' between assay versions; from changing reagents, liquid handling equipment,
 #' well volumes, and content expansion.
 #'
-#' Table of SomaScan Assay versions:
+#' Table of SomaScan assay versions:
 #'
 #' \tabular{lll}{
 #' **Version** \tab **Commercial Name** \tab **Size** \cr
@@ -36,36 +36,24 @@
 #' @details
 #' Matched samples across assay versions are used to calculate bridging
 #' scalars. For each analyte, this scalar is computed as the ratio of
-#' population _medians_ (\eqn{n > 1000}) between assay versions. For example,
-#' the linear scalar for the \eqn{i^{th}} analyte translating from `11k` -> `7k`
-#' is defined as:
-#'
-#' \deqn{R_i = \frac{\hat\mu_{7k}}{\hat\mu_{11k}}}
+#' population _medians_ across assay versions.
+#' Please see the lifting vignette
+#' `vignette("lifting-and-bridging", package = "SomaDataIO")`
+#' for more details.
 #'
 #' @section Lin's CCC:
-#' Calculating analyte-specific bridging scalars involves a careful evaluation
-#' of the correlation of post-lifting RFU values in the reference population
-#' used to calculate the linear scalars. The Lin's Concordance Correlation
-#' Coefficient (CCC) is calculated between matched samples from the original
-#' SomaScan signal space and the identical lifted samples that have been
-#' scaled back to the original signal space. This CCC value is an estimate
-#' of how well an analyte can be bridged across specific SomaScan versions.
-#' Factors affecting lifting CCCs are: reagents with high
-#' intra-assay CV (Coefficient of Variation) and reagents signaling
-#' near background or saturation levels.
+#' The Lin's Concordance Correlation Coefficient (CCC) is calculated
+#' by computing the correlation between post-lift RFU values and the
+#' RFU values generated on the original SomaScan version.
+#' This CCC estimate is a measure of how well an analyte can be bridged
+#' across SomaScan versions.
+#' See `vignette("lifting-and-bridging", package = "SomaDataIO")`.
 #' As with the lifting scalars, if you have an annotations file
 #' you may view the analyte-specific CCC values via [read_annotations()].
 #' Alternatively, [getSomaScanLiftCCC()] retrieves these values
 #' from an internal object for both `"serum"` and `"plasma"`.
-#' Lin's CCC (\eqn{p_c}) is defined by:
-#'
-#' \deqn{p_c = \frac{2\rho\hat\sigma_x\hat\sigma_y}{(\hat\mu_x - \hat\mu_y)^2 + \hat\sigma^2_x + \hat\sigma^2_y}}
-#'
-#' where \eqn{\rho}, \eqn{\mu}, and \eqn{\sigma} are the Pearson correlation
-#' coefficient, and estimated median and standard deviation estimates from
-#' assay version groups \eqn{x} and \eqn{y} respectively.
 #'
-#' @section Column Setdiff:
+#' @section Analyte Setdiff:
 #' * Newer versions of SomaScan typically have additional content, i.e.
 #' new reagents added to the multi-plex assay that bind to additional proteins.
 #' When lifting _to_ a previous SomaScan version, new reagents that do _not_
diff --git a/_pkgdown.yml b/_pkgdown.yml
index e0c37f7..f45e7c0 100644
--- a/_pkgdown.yml
+++ b/_pkgdown.yml
@@ -79,7 +79,7 @@ navbar:
           href: articles/stat-binary-classification.html
         - text: Linear Regression
           href: articles/stat-linear-regression.html
-  
+
     FAQs:
       text: Coming Soon
       menu:
@@ -88,7 +88,6 @@ navbar:
         - text: Standard Process
         - text: Best Practices
         - text: ----
-        - text: Lifting SomaScan
         - text: Non-Standard Matrices
         - text: Limits of Detection (LoD)
 
@@ -106,6 +105,11 @@ articles:
     contents:
       - cli-merge-tool
 
+  - title: Lifting and Bridging
+    navbar: ~
+    contents:
+      - articles/lifting-and-bridging
+
   - title: Statistical Workflow Examples
     contents:
       - starts_with("articles/stat-")
@@ -140,12 +144,14 @@ reference:
 
   - title: Transform Between SomaScan Versions
     desc: >
-      Functionality required to convert between SomaScan versions,
-      e.g. v4.1 -> v4.0, sometimes referred to as "lifting".
+      Functionality required to bridge between SomaScan versions,
+      e.g. 11k -> 7k, sometimes referred to as "lifting".
     contents:
-      - read_annotations
       - lift_adat
+      - read_annotations
       - transform
+      - starts_with("getSomaScan")
+      - getSignalSpace
 
   - title: Expression Data
     desc: >
diff --git a/man/adat-helpers.Rd b/man/adat-helpers.Rd
index 29408a4..3734067 100644
--- a/man/adat-helpers.Rd
+++ b/man/adat-helpers.Rd
@@ -53,7 +53,7 @@ that generated RFU measurements within a \code{soma_adat} object.
 \code{\link[=checkSomaScanVersion]{checkSomaScanVersion()}} determines if the version of
 is a recognized version of SomaScan.\cr
 \cr
-Table of SomaScan Assay versions:
+Table of SomaScan assay versions:
 \tabular{lll}{
 \strong{Version} \tab \strong{Commercial Name} \tab \strong{Size} \cr
 \code{V4} \tab 5k \tab 5284 \cr
diff --git a/man/lift_adat.Rd b/man/lift_adat.Rd
index dd25b0d..38433e2 100644
--- a/man/lift_adat.Rd
+++ b/man/lift_adat.Rd
@@ -34,7 +34,7 @@ The SomaScan platform continually improves its technical processes
 between assay versions; from changing reagents, liquid handling equipment,
 well volumes, and content expansion.
 
-Table of SomaScan Assay versions:
+Table of SomaScan assay versions:
 
 \tabular{lll}{
 \strong{Version} \tab \strong{Commercial Name} \tab \strong{Size} \cr
@@ -65,38 +65,26 @@ See below for all options for the \code{bridge} argument.
 \details{
 Matched samples across assay versions are used to calculate bridging
 scalars. For each analyte, this scalar is computed as the ratio of
-population \emph{medians} (\eqn{n > 1000}) between assay versions. For example,
-the linear scalar for the \eqn{i^{th}} analyte translating from \verb{11k} -> \verb{7k}
-is defined as:
-
-\deqn{R_i = \frac{\hat\mu_{7k}}{\hat\mu_{11k}}}
+population \emph{medians} across assay versions.
+Please see the lifting vignette
+\code{vignette("lifting-and-bridging", package = "SomaDataIO")}
+for more details.
 }
 \section{Lin's CCC}{
 
-Calculating analyte-specific bridging scalars involves a careful evaluation
-of the correlation of post-lifting RFU values in the reference population
-used to calculate the linear scalars. The Lin's Concordance Correlation
-Coefficient (CCC) is calculated between matched samples from the original
-SomaScan signal space and the identical lifted samples that have been
-scaled back to the original signal space. This CCC value is an estimate
-of how well an analyte can be bridged across specific SomaScan versions.
-Factors affecting lifting CCCs are: reagents with high
-intra-assay CV (Coefficient of Variation) and reagents signaling
-near background or saturation levels.
+The Lin's Concordance Correlation Coefficient (CCC) is calculated
+by computing the correlation between post-lift RFU values and the
+RFU values generated on the original SomaScan version.
+This CCC estimate is a measure of how well an analyte can be bridged
+across SomaScan versions.
+See \code{vignette("lifting-and-bridging", package = "SomaDataIO")}.
 As with the lifting scalars, if you have an annotations file
 you may view the analyte-specific CCC values via \code{\link[=read_annotations]{read_annotations()}}.
 Alternatively, \code{\link[=getSomaScanLiftCCC]{getSomaScanLiftCCC()}} retrieves these values
 from an internal object for both \code{"serum"} and \code{"plasma"}.
-Lin's CCC (\eqn{p_c}) is defined by:
-
-\deqn{p_c = \frac{2\rho\hat\sigma_x\hat\sigma_y}{(\hat\mu_x - \hat\mu_y)^2 + \hat\sigma^2_x + \hat\sigma^2_y}}
-
-where \eqn{\rho}, \eqn{\mu}, and \eqn{\sigma} are the Pearson correlation
-coefficient, and estimated median and standard deviation estimates from
-assay version groups \eqn{x} and \eqn{y} respectively.
 }
 
-\section{Column Setdiff}{
+\section{Analyte Setdiff}{
 
 \itemize{
 \item Newer versions of SomaScan typically have additional content, i.e.
diff --git a/vignettes/.gitignore b/vignettes/.gitignore
index 097b241..2b3d570 100644
--- a/vignettes/.gitignore
+++ b/vignettes/.gitignore
@@ -1,2 +1,3 @@
 *.html
+*.png
 *.R
diff --git a/vignettes/articles/figures/.gitignore b/vignettes/articles/figures/.gitignore
deleted file mode 100644
index 506eee3..0000000
--- a/vignettes/articles/figures/.gitignore
+++ /dev/null
@@ -1,2 +0,0 @@
-*.png
-*.html
diff --git a/vignettes/articles/lifting-and-bridging.Rmd b/vignettes/articles/lifting-and-bridging.Rmd
new file mode 100644
index 0000000..f5763b3
--- /dev/null
+++ b/vignettes/articles/lifting-and-bridging.Rmd
@@ -0,0 +1,385 @@
+---
+title: "Lifting and Bridging SomaScan"
+author: "Stu Field, SomaLogic Operating Co., Inc."
+description: >
+  A primer on lifting and bridging 'SomaScan' data.
+output:
+  rmarkdown::html_vignette:
+    fig_caption: yes
+vignette: >
+  %\VignetteIndexEntry{Lifting and Bridging SomaScan}
+  %\VignetteEncoding{UTF-8}
+  %\VignetteEngine{knitr::rmarkdown}
+editor_options:
+  chunk_output_type: console
+---
+
+```{r setup, include = FALSE}
+library(SomaDataIO)
+library(ggplot2)
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>",
+  fig.path = "figures/lifting-"
+)
+calc_ccc <- function(x, y) {
+  k <- length(x)
+  sdx <- sd(x)
+  sdy <- sd(y)
+  rho <- stats::cor(x, y, method = "pearson")
+  v <- sdx / sdy # scale shift
+  sx2 <- stats::var(x) * (k - 1) / k
+  sy2 <- stats::var(y) * (k - 1) / k
+  # location shift relative to scale
+  u <- ( mean(x) - mean(y) ) / ( (sx2 * sy2)^0.25 )
+  rho * ( (v + 1 / v + u^2 ) / 2 )^-1
+}
+```
+
+
+# Overview
+
+`SomaDataIO` contains functionality to bridge (aka "lift") between
+various SomaScan versions by linear transformations of RFU data.
+Lifting between various versions is essentially a calibration of the
+analytes/features in RFU space.
+
+
+## Why lift?
+
+The SomaScan platform continually improves its technical processes
+between assay versions. The primary change of interest is content expansion,
+and other protocol changes may be implemented including: changing reagents,
+liquid handling equipment, and well volumes.
+
+For any given analyte, these technical upgrades may result
+in minute measurement signal differences, requiring a
+calibration (aka "lifting" or "bridging") to bring RFU
+values into a comparable signal space. This is accomplished
+by applying an analyte-specific scalar, a linear transformation,
+to each analyte RFU measurement (column).
+
+### Current SomaScan Versions
+
+| **Version** | **Commercial Name** | **Size** |
+|:------------- |:------------------- |:------------- |
+| `V4` | 5k | 5284 |
+| `v4.1` | 7k | 7596 |
+| `v5.0` | 11k | 11083 |
+
+
+### Lifting Requirements
+
+There are 4 main requirements in order to reliably bridge
+across SomaScan signal space:
+
+1. the `soma_adat` object attributes, where SomaScan signal information is
+   stored, must be intact (see `is_intact_attr()`).
+1. the sample matrix must be either human serum or human EDTA-plasma.
+   No other matrices are currently supported. Additionally, bridging
+   must *not* be applied across matrices (i.e. serum $\leftrightarrow$ plasma).
+1. the RFU data must have been normalized by Adaptive Normalization via
+   Maximum-Likelihood (ANML). This is the standard normalization for
+   most SomaScan deliveries.
+1. the current SomaScan version and signal space must be one of those
+   above (see table), i.e. one of `5k`, `7k`, or `11k`. Older versions
+   of SomaScan are not supported.
+
+---------------
+
+## Lifting Scalars
+
+Lifting (aka "bridging") scalars are numeric values used to multiply a
+vector of RFU values to linearly transform them into another signal space.
+
+Lifting scalars are generated from matched samples (n $>$ 1000) from a
+healthy, normal reference population that were run across assay versions.
+This experiment was run separately for both serum and plasma and all
+SomaScan runs were first normalized as per the standard normalization
+procedure, and flagged samples were removed prior to further analysis.
+
+For each analyte, the lifting scalar is computed as the ratio of
+population _medians_ between assay versions. For example,
+the linear scalar for the $i^{th}$ analyte translating from
+`11k` $\rightarrow$ `7k` is defined as:
+
+$$
+R_i = \frac{\hat\mu_{7k}}{\hat\mu_{11k}},
+$$
+
+where $\hat\mu$ is the _median_ signal for the $i^{th}$ analyte.
+Signals generated in `11k` space can be multiplied by this scale factor
+to translate into `7k` space.
+
+Below is a concordance plot of what this shift would look like for a single
+analyte on a _simulated_ reference population. Please see the section below
+on Lin's CCC for its definition and interpretation.
+
+```{r lift-concord, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Figure 1. Signal concordance for a single analyte pre- and post-lifting."}
+rfu <- dplyr::filter(example_data, SampleType == "Sample")$seq.9016.12
+L <- length(rfu)
+rfu2 <- rfu +
+  withr::with_seed(123, rnorm(L, mean = 500, sd = sd(rfu) / 3))
+sf <- median(rfu) / median(rfu2)
+pre <- data.frame(x = rfu, y = rfu2)
+pre$group <- sprintf("pre-lift (%0.3f)", calc_ccc(pre$x, pre$y))
+post <- data.frame(x = rfu, y = rfu2 * sf)
+post$group <- sprintf("post-lift (%0.3f)", calc_ccc(post$x, post$y))
+plot_df <- rbind(pre, post)
+plot_df$group <- factor(plot_df$group, levels = rev(sort(unique(plot_df$group))))
+lims <- range(plot_df[, -3L])
+plot_df |>
+  ggplot(aes(x = x, y = y, colour = group)) +
+  geom_point(alpha = 0.5, size = 3) +
+  scale_x_log10(guide = "axis_logticks") +
+  scale_y_log10(guide = "axis_logticks") +
+  scale_colour_manual(name = "CCC", values = c("#00A499", "#24135F")) +
+  expand_limits(x = lims, y = lims) +
+  labs(x = "SomaScan 7k", y = "SomaScan 11k",
+       title = sprintf("Lifting Concordance (Scalar = %0.3f)", sf)) +
+  geom_abline(slope = 1, intercept = 0, color = "black")
+```
+
+---------------
+
+## Lifting Concordance
+
+Measurements generated from the matched samples used to calculate
+the lifting scalars were also used to calculate the post-hoc
+Lin's Concordance Correlation Coefficient (CCC) estimates
+of the SomaScan bridge.
+
+Lin's CCC is calculated by computing the correlation between
+post-lift RFU values and the RFU values generated on the
+original SomaScan version, and is defined by:
+
+$$
+CCC = \frac{2\rho\hat\sigma_x\hat\sigma_y}{(\hat\mu_x - \hat\mu_y)^2 + \hat\sigma^2_x + \hat\sigma^2_y},
+$$
+
+where $\rho$, $\hat\mu$, and $\hat\sigma$ are the Pearson correlation
+coefficient, and the estimated mean and standard deviation from
+assay version groups _x_ and _y_ respectively.
+
+
+### Interpretation of CCC
+
+Lin's CCC was chosen to evaluate lifting performance because it is
+characterized not only by correlation (Pearson's $\rho$), but
+also accounts for deviation from the $y = x$ unit line (diagonal).
+CCC range is in $[-1, 1]$ and can be viewed as an estimate of the
+confidence in the bridging transformation (in normal reference
+samples) across SomaScan versions.
+Examples of factors that could affect lifting CCC are:
+
+- analytes/reagents with high intra-assay CV (Coefficient of Variation)
+- analytes/reagents signaling near background or saturation levels
+
+
+### Accessing CCC
+
+The `getSomaScanLiftCCC()` function retrieves these values
+from an internal object for either `"serum"` or `"plasma"`.
+
+```{r ccc}
+plasma <- getSomaScanLiftCCC("p")
+plasma
+
+serum <- getSomaScanLiftCCC("s")
+serum
+```
+
+```{r cdf-ccc, fig.width = 6, fig.height = 5, fig.cap = "Figure 2. Cumulative distribution function of CCC values for the 11k -> 7k lift."}
+cdf_df <- data.frame(
+  ccc = c(plasma$plasma_11k_to_7k_ccc, serum$serum_11k_to_7k_ccc),
+  matrix = rep(c("plasma", "serum"), each = nrow(plasma))
+)
+cdf_df <- cdf_df[!is.na(cdf_df$ccc), ] # rm NAs; non-comparable analytes
+ggplot(cdf_df, aes(x = ccc, colour = matrix)) +
+  stat_ecdf(linewidth = 0.75) +
+  scale_colour_manual(name = "", values = c("#00A499", "#24135F")) +
+  labs(title = "CDF of CCC Values",
+       x = "Lin's CCC (11k -> 7k)", y = "P(X < x)") +
+  coord_cartesian()
+```
+
+As shown in the distribution above, for the `11k` $\rightarrow$ `7k` lift,
+post-bridging CCC values above 0.75 (considered high quality) are
+approximately 88% and 84% of the SomaScan menu for plasma
+and serum respectively. In fact, characterizing CCC lifting quality into 3
+categories (Low, Medium, High) yields the table below:
+
+```{r ecdf-table, echo = FALSE}
+fn <- function(x) {
+  cdf <- stats::ecdf(x)
+  data.frame(lo = cdf(0.5), med = cdf(0.75) - cdf(0.5), hi = 1 - cdf(0.75))
+}
+do.call(rbind, tapply(cdf_df$ccc, cdf_df$matrix, fn)) |>
+  round(3L) |>
+  set_rn(c("Plasma", "Serum")) |>
+  rn2col("Matrix") |>
+  knitr::kable(
+    col.names = c("Matrix", "Low [0, 0.5)", "Medium [0.5, 0.75)", "High [0.75, 1]"),
+    caption = "Table 1. The proportion of the SomaScan menu split into 3 categories by CCC."
+  )
+```
+
+
+## SomaScan Analyte Setdiff
+
+For any given bridge, there is a common, intersecting subset of
+analytes between SomaScan versions. Non-intersecting analytes will be
+either missing or added in the new signal space. As a result,
+bridging data across SomaScan may involve either skipping analytes (columns)
+or scaling by 1.0. `SomaDataIO` has internal checks that trigger
+warnings if these conditions are met.
+
+There are two scenarios to consider:
+
+* Newer versions of SomaScan typically have additional content,
+  i.e. new reagents added to the multi-plex assay that bind to additional
+  proteins. When lifting _to_ a previous SomaScan version,
+  new reagents that do _not_ exist in the "earlier" assay version
+  are scaled by 1.0, and thus are maintained, unmodified in
+  the returned object.
+  Downstream analysis may require removing these columns in order
+  to combine these data with a previous study from an earlier
+  SomaScan version, e.g. with `collapseAdats()`.
+* In the inverse scenario, lifting "forward" _from_ a previous, lower-plex
+  version, there will be extra reference values that are unnecessary
+  to perform the lift, and a warning is triggered. The resulting data
+  consists of RFU data in the "new" signal space, but with fewer analytes
+  than would otherwise be expected (e.g. `11k` space with only 5284
+  analytes; see example below).
+
+
+------------
+
+
+## Example: `5k` $\rightarrow$ `11k`
+
+Since the `example_data` object was originally run on SomaScan
+`r getSignalSpace(example_data)`, this vignette will demonstrate
+the lifting/bridging process _from_ a `5k` $\rightarrow$ `11k`
+signal space, the most recent SomaScan version.
+
+### Steps
+
+1. Determine that attributes are intact.
+   ```r
+   is_intact_attr(adat)
+   ```
+1. Determine the matrix type of the data (serum or plasma).
+   ```r
+   attr(adat, "Header.Meta")$HEADER$StudyMatrix
+   ```
+1. Ensure the current SomaScan signal space is lift-supported.
+   ```r
+   getSignalSpace(adat)
+   checkSomaScanVersion(getSignalSpace(adat))
+   ```
+1. Apply analyte-specific scalars to their corresponding
+   columns via `lift_adat()`.
+   ```r
+   lift_adat(adat, bridge = "")
+   ```
+   Current `bridge` options are: `r dQuote(eval(formals(lift_adat)$bridge))`.
+
+
+### Step 1
+```{r attr}
+# determine intact attributes
+# must be TRUE
+is_intact_attr(example_data)
+```
+
+### Step 2
+```{r matx}
+# determine study matrix
+# must be Human Serum or EDTA-Plasma
+attr(example_data, "Header.Meta")$HEADER$StudyMatrix |> as.character()
+```
+
+Confirm that the matrix of the SomaScan run was `"EDTA Plasma"`:
+
+### Step 3
+```{r version}
+# determine if current space can be lifted
+# must be V4, v4.1, or v5.0
+from_space <- getSignalSpace(example_data)
+from_space
+
+# must be NULL
+is.null(checkSomaScanVersion(from_space))
+```
+
+Finally, invoke `lift_adat()` to perform the bridge/transformation:
+
+### Step 4
+```{r lift}
+lift_11k <- lift_adat(example_data, bridge = "5k_to_11k")
+
+is_lifted(lift_11k) # signal space was lifted
+
+is.soma_adat(lift_11k) # preserves 'soma_adat' class
+
+getSignalSpace(lift_11k) # current space
+
+getSomaScanVersion(lift_11k) # original space
+```
+
+
+-------------
+
+
+## Considerations
+
+### Was the SomaScan bridge successful?
+
+Lifting SomaScan involves a simple linear transformation of a
+numeric vector (of RFU values), thus in one sense it will always
+be "successful". However, users often wish to know if this was
+the correct course of action.
+
+From the concordance plot in **Figure 1**, we can see that the
+transformation is *reducing* the RFU brightness by ~`r round(100*(1 - sf))`%
+in the `11k` space, in accordance with the median signal difference
+in the reference population (of healthy normals).
+Rare edge cases aside, this is *usually* the desired outcome, otherwise
+downstream analysis would be confounded by the uncorrected shift in
+SomaScan space, and would likely result in significant differences related
+to signal space rather than actual biology.
+
+### Should you filter analytes?
+
+Users often ask if they should remove certain analytes *prior* to
+beginning an analysis based on a given CCC threshold.
+The issue of choosing an appropriate threshold aside, unless
+there is prior knowledge justifying removal, we do not recommend
+removing analytes based on CCC alone.
+
+This advice stems from how the CCC values are initially calculated,
+from a healthy, normal reference population sampled across two
+versions of SomaScan. Recall that CCC is influenced by CV and
+signaling range. So for example, if a given analyte is near its
+limit of detection in a healthy population, and therefore likely
+has a low CCC (high CV), removing this analyte may *not* be the
+desired course of action in a disease population where that
+analyte may be signaling in the linear range.
+
+Therefore, we currently recommend careful evaluation on a
+case by case basis using prior knowledge and orthogonal
+justification before filtering analytes from discovery or
+exploratory analyses.
+
+
+-------------
+
+## Questions
+
+As always, if you have any bridging or lifting questions,
+we are here to help. Please reach out to us via:
+
+* GitHub [SUPPORT](https://somalogic.github.io/SomaDataIO/SUPPORT.html)
+* Global Scientific Engagement Team:
+* General SomaScan inquiries: