diff --git a/R/adat-helpers.R b/R/adat-helpers.R index bed00ca..998028c 100644 --- a/R/adat-helpers.R +++ b/R/adat-helpers.R @@ -8,7 +8,7 @@ #' [checkSomaScanVersion()] determines if the version of #' is a recognized version of SomaScan.\cr #' \cr -#' Table of SomaScan Assay versions: +#' Table of SomaScan assay versions: #' \tabular{lll}{ #' **Version** \tab **Commercial Name** \tab **Size** \cr #' `V4` \tab 5k \tab 5284 \cr diff --git a/R/lift-adat.R b/R/lift-adat.R index d729b81..362a239 100644 --- a/R/lift-adat.R +++ b/R/lift-adat.R @@ -5,7 +5,7 @@ #' between assay versions; from changing reagents, liquid handling equipment, #' well volumes, and content expansion. #' -#' Table of SomaScan Assay versions: +#' Table of SomaScan assay versions: #' #' \tabular{lll}{ #' **Version** \tab **Commercial Name** \tab **Size** \cr @@ -36,34 +36,22 @@ #' @details #' Matched samples across assay versions are used to calculate bridging #' scalars. For each analyte, this scalar is computed as the ratio of -#' population _medians_ (\eqn{n > 1000}) between assay versions. For example, -#' the linear scalar for the \eqn{i^{th}} analyte translating from `11k` -> `7k` -#' is defined as: -#' -#' \deqn{R_i = \frac{\hat\mu_{7k}}{\hat\mu_{11k}}} +#' population _medians_ across assay versions. +#' Please see the lifting vignette +#' `vignette("lifting-and-bridging", package = "SomaDataIO")` +#' for more details. #' #' @section Lin's CCC: -#' Calculating analyte-specific bridging scalars involves a careful evaluation -#' of the correlation of post-lifting RFU values in the reference population -#' used to calculate the linear scalars. The Lin's Concordance Correlation -#' Coefficient (CCC) is calculated between matched samples from the original -#' SomaScan signal space and the identical lifted samples that have been -#' scaled back to the original signal space. This CCC value is an estimate -#' of how well an analyte can be bridged across specific SomaScan versions. 
-#' Factors affecting an analyte's lifting CCC are: reagents with high -#' intra-assay CV (Coefficient of Variation) and reagents signaling -#' near background or saturation levels. +#' The Lin's Concordance Correlation Coefficient (CCC) is calculated +#' by computing the correlation between post-lift RFU values and the +#' RFU values generated on the original SomaScan version. +#' This CCC estimate is a measure of how well an analyte can be bridged +#' across SomaScan versions. +#' See `vignette("lifting-and-bridging", package = "SomaDataIO")`. #' As with the lifting scalars, if you have an annotations file #' you may view the analyte-specific CCC values via [read_annotations()]. #' Alternatively, [getSomaScanLiftCCC()] retrieves these values #' from an internal object for both `"serum"` and `"plasma"`. -#' Lin's CCC (\eqn{p_c}) is defined by: -#' -#' \deqn{p_c = \frac{2\rho\hat\sigma_x\hat\sigma_y}{(\hat\mu_x - \hat\mu_y)^2 + \hat\sigma^2_x + \hat\sigma^2_y}} -#' -#' where \eqn{\rho}, \eqn{\mu}, and \eqn{\sigma} are the Pearson correlation -#' coefficient, and estimated median and standard deviation estimates from -#' assay version groups \eqn{x} and \eqn{y} respectively. #' #' @section Extra Columns: #' * Newer versions of SomaScan typically have additional content, i.e. diff --git a/_pkgdown.yml b/_pkgdown.yml index d82b4aa..c7a1541 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -79,7 +79,7 @@ navbar: href: articles/stat-binary-classification.html - text: Linear Regression href: articles/stat-linear-regression.html - + FAQs: text: Coming Soon menu: @@ -106,6 +106,11 @@ articles: contents: - cli-merge-tool + - title: Lifting and Bridging + navbar: ~ + contents: + - lifting-and-bridging + - title: Statistical Workflow Examples contents: - starts_with("articles/stat-") @@ -139,12 +144,15 @@ reference: - title: Transform Between SomaScan Versions desc: > - Functionality required to convert between SomaScan versions, - e.g. 
v4.1 -> v4.0, sometimes referred to as "lifting". + Functionality required to bridge between SomaScan versions, + e.g. 11k -> 7k, sometimes referred to as "lifting". contents: - - read_annotations - lift_adat + - read_annotations - transform + - getSomaScanLiftCCC + - getSomaScanVersion + - getSignalSpace - title: Expression Data desc: > diff --git a/man/adat-helpers.Rd b/man/adat-helpers.Rd index 787c0d8..9c560e7 100644 --- a/man/adat-helpers.Rd +++ b/man/adat-helpers.Rd @@ -48,7 +48,7 @@ These are elements of the \code{HEADER} attributes of the object.\cr\cr \code{\link[=checkSomaScanVersion]{checkSomaScanVersion()}} determines if the version of is a recognized version of SomaScan.\cr \cr -Table of SomaScan Assay versions: +Table of SomaScan assay versions: \tabular{lll}{ \strong{Version} \tab \strong{Commercial Name} \tab \strong{Size} \cr \code{V4} \tab 5k \tab 5284 \cr diff --git a/man/lift_adat.Rd b/man/lift_adat.Rd index 7008369..5131dd6 100644 --- a/man/lift_adat.Rd +++ b/man/lift_adat.Rd @@ -34,7 +34,7 @@ The SomaScan platform continually improves its technical processes between assay versions; from changing reagents, liquid handling equipment, well volumes, and content expansion. -Table of SomaScan Assay versions: +Table of SomaScan assay versions: \tabular{lll}{ \strong{Version} \tab \strong{Commercial Name} \tab \strong{Size} \cr @@ -65,35 +65,23 @@ See below for all options for the \code{bridge} argument. \details{ Matched samples across assay versions are used to calculate bridging scalars. For each analyte, this scalar is computed as the ratio of -population \emph{medians} (\eqn{n > 1000}) between assay versions. For example, -the linear scalar for the \eqn{i^{th}} analyte translating from \verb{11k} -> \verb{7k} -is defined as: - -\deqn{R_i = \frac{\hat\mu_{7k}}{\hat\mu_{11k}}} +population \emph{medians} across assay versions. +Please see the lifting vignette +\code{vignette("lifting-and-bridging", package = "SomaDataIO")} +for more details. 
} \section{Lin's CCC}{ -Calculating analyte-specific bridging scalars involves a careful evaluation -of the correlation of post-lifting RFU values in the reference population -used to calculate the linear scalars. The Lin's Concordance Correlation -Coefficient (CCC) is calculated between matched samples from the original -SomaScan signal space and the identical lifted samples that have been -scaled back to the original signal space. This CCC value is an estimate -of how well an analyte can be bridged across specific SomaScan versions. -Factors affecting an analyte's lifting CCC are: reagents with high -intra-assay CV (Coefficient of Variation) and reagents signaling -near background or saturation levels. +The Lin's Concordance Correlation Coefficient (CCC) is calculated +by computing the correlation between post-lift RFU values and the +RFU values generated on the original SomaScan version. +This CCC estimate is a measure of how well an analyte can be bridged +across SomaScan versions. +See \code{vignette("lifting-and-bridging", package = "SomaDataIO")}. As with the lifting scalars, if you have an annotations file you may view the analyte-specific CCC values via \code{\link[=read_annotations]{read_annotations()}}. Alternatively, \code{\link[=getSomaScanLiftCCC]{getSomaScanLiftCCC()}} retrieves these values from an internal object for both \code{"serum"} and \code{"plasma"}. -Lin's CCC (\eqn{p_c}) is defined by: - -\deqn{p_c = \frac{2\rho\hat\sigma_x\hat\sigma_y}{(\hat\mu_x - \hat\mu_y)^2 + \hat\sigma^2_x + \hat\sigma^2_y}} - -where \eqn{\rho}, \eqn{\mu}, and \eqn{\sigma} are the Pearson correlation -coefficient, and estimated median and standard deviation estimates from -assay version groups \eqn{x} and \eqn{y} respectively. 
} \section{Extra Columns}{ diff --git a/vignettes/lifting-and-bridging.Rmd b/vignettes/lifting-and-bridging.Rmd new file mode 100644 index 0000000..d1b6b11 --- /dev/null +++ b/vignettes/lifting-and-bridging.Rmd @@ -0,0 +1,302 @@ +--- +title: "Lifting and Bridging SomaScan" +author: "Stu Field, SomaLogic Operating Co., Inc." +description: > + A primer on lifting and bridging 'SomaScan' data. +output: + rmarkdown::html_vignette: + fig_caption: yes +vignette: > + %\VignetteIndexEntry{Lifting and Bridging SomaScan} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +editor_options: + chunk_output_type: console +--- + +```{r setup, include = FALSE} +library(SomaDataIO) +library(ggplot2) +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + + +# Overview + +`SomaDataIO` contains functionality to bridge +(aka "lift") between various SomaScan versions by linear transformations +of RFU data. +Lifting between various versions is essentially a calibration of the +analyte/features in RFU space. + + +## Why lift? + +The SomaScan platform continually improves its technical processes +between assay versions; from changing reagents, liquid handling equipment, +well volumes, and content expansion. + +However, for a given analyte, these technical upgrades can result +in minute measurement signal differences, +requiring a calibration (aka "lifting" or "bridging") to bring RFUs into a +comparable signal space. +This is accomplished by applying an analyte-specific scalar, +a linear transformation, to each analyte RFU measurement (column). + +### Current SomaScan Versions + +| **Version** | **Commercial Name** | **Size** | +|:------------- |:------------------- |:------------- | +| `V4` | 5k | 5284 | +| `v4.1` | 7k | 7596 | +| `v5.0` | 11k | 11083 | + + +### Lifting Requirements + +There are 4 main requirements in order to reliably bridge +across SomaScan signal space: + +1. the object attributes, where signal information is stored, must be intact. +1. 
the sample matrix must be either serum or plasma. No other matrices + are currently supported. +1. the RFU data must have been normalized by Adaptive Normalization via + Maximum-Likelihood (ANML). This is the standard normalization for + most SomaScan deliveries. +1. the current SomaScan version and signal space must be one of those + above (see table), i.e. one of `5k`, `7k`, or `11k`. Older versions + of SomaScan are not supported. + +--------------- + +## Lifting Scalars + +Matched samples (n > 1000) from a healthy, normal reference population +were run across assay versions and used to calculate bridging scalars. +This experiment was run separately for serum and plasma. All +SomaScan runs were first normalized as per the standard normalization +procedure, and flagged samples were removed prior to further analysis. + +For each analyte, the "lifting" or "bridging" scalar is computed as the +ratio of population _medians_ between assay versions. For example, +the linear scalar for the $i^{th}$ analyte translating from `11k -> 7k` +is defined as: + +$$ +R_i = \frac{\hat\mu_{7k}}{\hat\mu_{11k}} +$$ + +Signals generated in `11k` space can be multiplied by this scale factor +to translate into `7k` space. + + +## Lin's CCC: + +The Lin's Concordance Correlation Coefficient (CCC) is calculated +by computing the correlation between post-lift RFU values and the +RFU values generated on the original SomaScan version. Measurements +generated from the matched samples used to calculate the lifting scalars +were also used to calculate the post-hoc CCC estimates for the SomaScan +bridge. +Lin's CCC ($p_c$) is defined by: + +$$ +p_c = \frac{2\rho\hat\sigma_x\hat\sigma_y}{(\hat\mu_x - \hat\mu_y)^2 + \hat\sigma^2_x + \hat\sigma^2_y} +$$ + +where $\rho$ is the Pearson correlation coefficient, and $\hat\mu$ and +$\hat\sigma$ are the estimated median and standard deviation from +assay version groups _x_ and _y_, respectively. 
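+
+As a rough worked example (the values are hypothetical, chosen only for
+illustration, and are not taken from SomaScan data), suppose an analyte is
+perfectly correlated across versions ($\rho = 1$) with equal spread
+($\hat\sigma_x = \hat\sigma_y = \hat\sigma$), but its location is offset by
+one standard deviation ($\hat\mu_x - \hat\mu_y = \hat\sigma$). Then:
+
+$$
+p_c = \frac{2 \cdot 1 \cdot \hat\sigma \cdot \hat\sigma}{\hat\sigma^2 + \hat\sigma^2 + \hat\sigma^2} = \frac{2}{3} \approx 0.67
+$$
+
+Unlike the Pearson correlation, the CCC is reduced by this residual offset,
+so a high CCC requires agreement in location and scale as well as correlation.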
+ + +### Interpretation of CCC + +This CCC value can be viewed as an estimate of the confidence in the +bridging transformation across SomaScan versions. + +Factors affecting an analyte's lifting CCC are: reagents with high +intra-assay CV (Coefficient of Variation) and reagents signaling +near background or saturation levels. + + + +### Accessing CCC + +The `getSomaScanLiftCCC()` function retrieves these values +from an internal object for either `"serum"` or `"plasma"`. + +```{r ccc} +plasma <- getSomaScanLiftCCC("p") +plasma + +serum <- getSomaScanLiftCCC("s") +serum +``` + +```{r cdf, fig.width = 6, fig.height = 5} +cdf_df <- data.frame( + ccc = c(plasma$plasma_11k_to_7k_ccc, serum$serum_11k_to_7k_ccc), + matrix = rep(c("plasma", "serum"), each = nrow(plasma)) +) +cdf_df <- cdf_df[!is.na(cdf_df$ccc), ] # rm NAs; non-comparable analytes +ggplot(cdf_df, aes(x = ccc, colour = matrix)) + + stat_ecdf(linewidth = 0.75) + + scale_colour_manual(name = "", values = c("#00A499", "#24135F")) + + labs(title = "CDF of CCC Values", + x = "Lin's CCC (11k -> 7k)", y = "P(X < x)") + + coord_cartesian() +``` + +As shown in the distribution above, for the `11k -> 7k` lift, approximately +88% and 84% of the SomaScan menu have post-bridging CCC values above 0.75 +(high quality) for plasma and serum, respectively. + + +## Column Setdiff + +There are two scenarios to consider: + +* Newer versions of SomaScan typically have additional content, i.e. + new reagents added to the multi-plex assay that bind to additional proteins. + When lifting _to_ a previous SomaScan version, new reagents that do _not_ + exist in the "earlier" assay version are scaled by 1.0, and thus + maintained, unmodified in the returned object. Users may need to drop + these columns in order to combine these data with a previous study + from an earlier SomaScan version, e.g. with `collapseAdats()`. 
+* In the inverse scenario, lifting "forward" _from_ a previous, lower-plex + version, there will be extra reference values that are unnecessary + to perform the lift, and a warning is triggered. The resulting data + consists of RFU data in the "new" signal space, but with fewer analytes + than would otherwise be expected (e.g. `11k` space with only 5284 + analytes; see example below). + + +------------ + + +## Example: `5k` -> `11k` +### Steps: + +1. Determine that attributes are intact: + ```r + is_intact_attr(adat) + ``` +1. Determine the matrix type of the data (serum or plasma) + ```r + attr(adat, "Header.Meta")$HEADER$StudyMatrix + ``` +1. Determine whether the SomaScan version is supported + ```r + checkSomaScanVersion(getSomaScanVersion(adat)) + ``` +1. Ensure the current `soma_adat` is in the SomaScan signal space you expect + ```r + getSignalSpace(adat) + ``` +1. Apply the scale factors using `lift_adat()` + ```r + lift_adat(adat, bridge = "") + ``` + +### Step 1 +```{r attr} +# determine intact attributes +is_intact_attr(example_data) +``` + +### Step 2 +```{r matx} +# determine study matrix +attr(example_data, "Header.Meta")$HEADER$StudyMatrix |> as.character() +``` + +Confirm that the matrix of the SomaScan run was `"EDTA Plasma"`. + +### Step 3 +```{r version} +# determine if current space can be lifted +is.null(checkSomaScanVersion(getSomaScanVersion(example_data))) +``` + +### Step 4 +```{r space} +# determine current SomaScan space +getSignalSpace(example_data) +``` + +The `example_data` object is in `r getSignalSpace(example_data)` space, +so for this vignette assume we wish to lift to `11k` signal space. 
+Invoke `lift_adat()` to perform the lift/bridge/transformation: + +### Step 5 +```{r lift} +lift_11k <- lift_adat(example_data, bridge = "5k_to_11k") + +is_lifted(lift_11k) + +is.soma_adat(lift_11k) + +getSignalSpace(lift_11k) + +getSomaScanVersion(lift_11k) +``` + + +------------- + + +## Visualize Linear Transformation + +It is always a good idea to plot the before/after to ensure it actually did +what you think (in this case a linear shift or bias). +In this example let's randomly select an analyte: + +```{r seq} +seq <- withr::with_seed(101, sample(getAnalytes(lift_11k), 1L)) +seq + +tbl <- getAnalyteInfo(lift_11k) +filter(tbl, AptName == seq) |> + unlist() |> + tibble::enframe() +``` + +If you have downloaded and are familiar with our plotting package +[SomaPlotr](https://github.com/SomaLogic/SomaPlotr), it can be useful to +plot concordance between 2 numeric vectors. +See `SomaPlotr::plotConcord()`. + +Below we visualize the linear shift that was performed for `r seq`. + +```{r plot, echo = FALSE, fig.width = 6, fig.height = 5} +target_tbl <- getTargetNames(tbl) +plot_df <- data.frame( + orig = rep(example_data[[seq]], 2L), + lift = c(example_data[[seq]] / 0.75, example_data[[seq]]), + group = rep(c("pre-lift", "post-lift"), each = nrow(lift_11k)) +) +lims <- range(plot_df[, -3L]) +plot_df |> + ggplot(aes(x = orig, y = lift, colour = group)) + + geom_point(alpha = 0.5, size = 3) + + scale_x_log10() + + scale_y_log10() + + scale_colour_manual(name = "", values = c("#00A499", "#24135F")) + + expand_limits(x = lims, y = lims) + + labs(x = "5k", y = "11k", + title = sprintf("Lifting Concordance: %s", target_tbl[[seq]])) + + geom_abline(slope = 1, intercept = 0, color = "black") +``` + +The figure above highlights the linear shift (down) that was applied to the `5k` +RFU signal space to transform it into `11k` signal space. +The low-signaling samples ($n = 6$) at ~100 RFU are control (`Buffer`) +samples containing no protein. 
+ +```{r buffers} +table(lift_11k$SampleType) +``` +