sccomp tests differences in cell-type proportions from single-cell
data. It is robust against outliers, models continuous and discrete
factors, and is capable of random-effect/intercept modelling.
Please cite PNAS - sccomp: Robust differential composition and variability analysis for single-cell data
- Complex linear models with continuous and categorical covariates
- Multilevel modelling, with population fixed and random effects/intercept
- Modelling data from counts
- Testing differences in cell-type proportionality
- Testing differences in cell-type specific variability
- Cell-type information sharing for variability-adaptive shrinkage
- Testing differential variability
- Probabilistic outlier identification
- Cross-dataset learning (hyperpriors).
Bioconductor
if (!requireNamespace("BiocManager")) install.packages("BiocManager")
BiocManager::install("sccomp")
Github
devtools::install_github("stemangiola/sccomp")
Function | Description |
---|---|
sccomp_estimate | Fit the model onto the data, and estimate the coefficients |
sccomp_remove_outliers | Identify outliers probabilistically based on the model fit, and exclude them from the estimation |
sccomp_test | Calculate the probability that the coefficients are outside the H0 interval (i.e. test_composition_above_logit_fold_change) |
sccomp_replicate | Simulate data from the model, or part of the model |
sccomp_predict | Predict proportions, based on the model, or part of the model |
sccomp_remove_unwanted_variation | Remove the variability for unwanted factors |
plot | Plot summary plots to assess significance |
sccomp can model changes in composition and variability. By default,
the formula for variability is ~1, which assumes that the cell-group
variability is independent of any covariate. Alternatively, it can be
~ factor_of_interest, which assumes that the variability depends on the
factor of interest only. The variability model must be a subset of the
model for composition.
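The subset requirement can be checked before fitting. The sketch below is a minimal base R illustration, not part of sccomp: the helper name `is_variability_subset` and the `age` covariate are hypothetical, and the check only compares term labels.

```r
# Hypothetical helper (not part of sccomp): TRUE when every term of the
# variability formula also appears in the composition formula.
is_variability_subset <- function(formula_composition, formula_variability) {
  composition_terms <- attr(terms(formula_composition), "term.labels")
  variability_terms <- attr(terms(formula_variability), "term.labels")
  all(variability_terms %in% composition_terms)
}

is_variability_subset(~ type + age, ~ type)  # TRUE
is_variability_subset(~ type, ~ 1)           # TRUE: intercept-only has no terms
is_variability_subset(~ type, ~ age)         # FALSE
```

An intercept-only variability formula has no term labels, so it passes for any composition formula.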
In the output table, estimate columns with the prefix c_ refer to
composition, and those with the prefix v_ refer to variability (when
formula_variability is set).
# From a single-cell object (e.g. Seurat or SingleCellExperiment)
sccomp_result =
  single_cell_object |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    bimodal_mean_variability_association = TRUE,
    cores = 1
  ) |>
  sccomp_remove_outliers(cores = 1) |> # Optional
  sccomp_test()
# From a table of counts per sample and cell group
sccomp_result =
  counts_obj |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    .count = count,
    bimodal_mean_variability_association = TRUE,
    cores = 1, verbose = FALSE
  ) |>
  sccomp_remove_outliers(cores = 1, verbose = FALSE) |> # Optional
  sccomp_test()
Here you see the results of the fit: the effects of the factor on composition and variability, together with the uncertainty around those effects.
sccomp_result
## # A tibble: 72 × 18
## cell_group parameter factor c_lower c_effect c_upper c_pH0 c_FDR c_n_eff
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 B1 (Intercep… <NA> 0.881 1.11 1.32 0 0 4693.
## 2 B1 typecancer type -1.16 -0.747 -0.361 5.00e-4 1.00e-4 2396.
## 3 B2 (Intercep… <NA> 0.404 0.703 0.990 2.50e-4 1.25e-5 4775.
## 4 B2 typecancer type -1.23 -0.722 -0.253 7.25e-3 1.03e-3 3857.
## 5 B3 (Intercep… <NA> -0.674 -0.384 -0.104 2.38e-2 2.48e-3 4022.
## 6 B3 typecancer type -0.749 -0.313 0.0808 1.47e-1 3.81e-2 2849.
## 7 BM (Intercep… <NA> -1.32 -1.03 -0.753 0 0 3497.
## 8 BM typecancer type -0.746 -0.320 0.0943 1.53e-1 4.34e-2 3270.
## 9 CD4 1 (Intercep… <NA> 0.0795 0.303 0.507 3.70e-2 3.81e-3 3528.
## 10 CD4 1 typecancer type -0.102 0.187 0.472 2.65e-1 6.63e-2 3734.
## # ℹ 62 more rows
## # ℹ 9 more variables: c_R_k_hat <dbl>, v_lower <dbl>, v_effect <dbl>,
## # v_upper <dbl>, v_pH0 <dbl>, v_FDR <dbl>, v_n_eff <dbl>, v_R_k_hat <dbl>,
## # count_data <list>
The estimated effects are expressed in the unconstrained space of the parameters, similarly to differential expression analyses that express change in terms of log fold change. For differences in proportion, however, logit fold change must be used. This measure is harder to interpret and understand.
Therefore, we provide a more intuitive proportional fold change, which can be more easily understood. However, it cannot be used to infer significance (use sccomp_test() instead), and a lot of care must be taken given the nonlinearity of this measure (a 2-fold increase from 0.0001 to 0.0002 carries a different weight than a 2-fold increase from 0.4 to 0.8).
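The nonlinearity can be seen with a small base R calculation. This is only a sketch: it uses the binary logit (`qlogis`) as a simplified stand-in for the unconstrained scale, which is an assumption and not necessarily the transformation sccomp applies internally; `logit_fold_change` is a hypothetical helper.

```r
# Binary logit used as a simplified stand-in for the unconstrained scale.
logit_fold_change <- function(from, to) qlogis(to) - qlogis(from)

small <- logit_fold_change(0.0001, 0.0002)
large <- logit_fold_change(0.4, 0.8)

# Both pairs double the proportion, yet the logit differences differ:
# roughly 0.693 for the rare cell group and 1.792 for the abundant one.
round(c(small = small, large = large), 3)
```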
From your estimates, you can state which effects you are interested in (this can be a subset of the full model, in case you want to exclude unwanted effects), and the two points you would like to compare.
In the case of a categorical variable, the starting and ending points are categories.
sccomp_result |>
  sccomp_proportional_fold_change(
    formula_composition = ~ type,
    from = "healthy",
    to = "cancer"
  ) |>
  select(cell_group, statement)
## # A tibble: 36 × 2
## cell_group statement
## <chr> <glue>
## 1 B1 2.1-fold decrease (from 0.0562 to 0.0264)
## 2 B2 2.1-fold decrease (from 0.0374 to 0.0181)
## 3 B3 1.4-fold decrease (from 0.0127 to 0.0092)
## 4 BM 1.4-fold decrease (from 0.0066 to 0.0048)
## 5 CD4 1 1.2-fold increase (from 0.025 to 0.0301)
## 6 CD4 2 1.5-fold increase (from 0.0488 to 0.0732)
## 7 CD4 3 2.7-fold decrease (from 0.0863 to 0.0321)
## 8 CD4 4 1-fold increase (from 0.0016 to 0.0016)
## 9 CD4 5 1.1-fold increase (from 0.0297 to 0.0313)
## 10 CD8 1 1.2-fold increase (from 0.1051 to 0.1234)
## # ℹ 26 more rows
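To make the statement column concrete, the sketch below rebuilds the same kind of sentence from two proportions in base R. The helper `fold_change_statement` is hypothetical and is not sccomp's implementation; it assumes the fold is the ratio of the larger to the smaller proportion, rounded to one decimal.

```r
# Hypothetical helper (not sccomp's implementation): describe the change
# between two proportions as an "x-fold increase/decrease" statement.
fold_change_statement <- function(from, to) {
  increased <- to >= from
  fold <- if (increased) to / from else from / to
  direction <- if (increased) "increase" else "decrease"
  paste0(round(fold, 1), "-fold ", direction, " (from ", from, " to ", to, ")")
}

fold_change_statement(0.0562, 0.0264)
fold_change_statement(0.0016, 0.0016)
```

Under these assumptions the two calls reproduce the wording of the B1 and CD4 4 rows above.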
A plot of group proportion, faceted by groups. The blue box plots
represent the posterior predictive check. If the model is likely to be
descriptively adequate to the data, the blue box plots should roughly
overlay the black box plots, which represent the observed data. The
outliers are coloured in red. A box plot will be returned for every
(discrete) covariate present in formula_composition. The colour coding
represents the significant associations for composition and/or
variability.
sccomp_result |>
sccomp_boxplot(factor = "type")
## Joining with `by = join_by(cell_group, sample)`
## Joining with `by = join_by(cell_group, type)`
A plot of estimates of differential composition (c_) on the x-axis and differential variability (v_) on the y-axis. The error bars represent 95% credible intervals. The dashed lines represent the minimal effect that the hypothesis test is based on. An effect is labelled as significant if bigger than the minimal effect according to the 95% credible interval. Facets represent the covariates in the model.
sccomp_result |>
plot_1D_intervals()
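The relationship between a credible interval, the minimal effect, and a pH0-like probability can be sketched with simulated draws in base R. Everything here is illustrative: the draws are simulated, the threshold of 0.1 is an assumed value, and sccomp's own computation of c_pH0 may differ.

```r
set.seed(42)

# Simulated posterior draws of one effect on the logit scale (illustrative only).
draws <- rnorm(4000, mean = -0.75, sd = 0.2)

# Hypothetical minimal effect defining the H0 interval [-threshold, threshold].
threshold <- 0.1

# 95% credible interval from the draws.
credible_interval <- quantile(draws, probs = c(0.025, 0.975))

# Share of posterior mass inside the H0 interval (a pH0-like quantity).
prob_inside_h0 <- mean(abs(draws) <= threshold)

# Label the effect when the whole interval lies beyond the minimal effect.
is_beyond_threshold <- credible_interval[[2]] < -threshold ||
  credible_interval[[1]] > threshold
```

With these simulated draws the whole interval lies below -0.1, so the effect would be labelled, and almost no posterior mass falls inside the H0 interval.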
We can plot the relationship between abundance and variability. As we can see below, they are positively correlated. You can also appreciate that this relationship is bimodal for single-cell RNA sequencing data.
sccomp models this relationship to obtain a shrinkage effect on the
estimates of both the abundance and the variability. This shrinkage is
adaptive as it is modelled jointly, thanks to Bayesian inference.
sccomp_result |>
plot_2D_intervals()
You can produce the series of plots by calling the plot method.
sccomp_result |> plot()
seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ 0 + type,
    .sample = sample,
    .cell_group = cell_group,
    bimodal_mean_variability_association = TRUE,
    cores = 1, verbose = FALSE
  ) |>
  sccomp_test(contrasts = c("typecancer - typehealthy", "typehealthy - typecancer"))
## # A tibble: 60 × 18
## cell_group parameter factor c_lower c_effect c_upper c_pH0 c_FDR c_n_eff
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 B immature typecanc… <NA> -1.92 -1.40 -0.903 0 0 NA
## 2 B immature typeheal… <NA> 0.903 1.40 1.92 0 0 NA
## 3 B mem typecanc… <NA> -2.34 -1.71 -1.06 0 0 NA
## 4 B mem typeheal… <NA> 1.06 1.71 2.34 0 0 NA
## 5 CD4 cm S10… typecanc… <NA> -1.48 -1.03 -0.596 0 0 NA
## 6 CD4 cm S10… typeheal… <NA> 0.596 1.03 1.48 0 0 NA
## 7 CD4 cm hig… typecanc… <NA> 0.809 1.76 2.88 0 0 NA
## 8 CD4 cm hig… typeheal… <NA> -2.88 -1.76 -0.809 0 0 NA
## 9 CD4 cm rib… typecanc… <NA> 0.327 0.994 1.68 0.00375 0.00117 NA
## 10 CD4 cm rib… typeheal… <NA> -1.68 -0.994 -0.327 0.00375 0.00117 NA
## # ℹ 50 more rows
## # ℹ 9 more variables: c_R_k_hat <dbl>, v_lower <dbl>, v_effect <dbl>,
## # v_upper <dbl>, v_pH0 <dbl>, v_FDR <dbl>, v_n_eff <dbl>, v_R_k_hat <dbl>,
## # count_data <list>
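The two contrasts in the output above are mirror images of each other. The base R sketch below illustrates this with the B immature row; `reverse_contrast` is a hypothetical helper, not an sccomp function.

```r
# Interval for "typecancer - typehealthy" (B immature row of the output above).
cancer_vs_healthy <- c(lower = -1.92, effect = -1.40, upper = -0.903)

# Hypothetical helper: reversing a contrast negates the effect and swaps the bounds.
reverse_contrast <- function(interval) {
  c(
    lower  = -interval[["upper"]],
    effect = -interval[["effect"]],
    upper  = -interval[["lower"]]
  )
}

reverse_contrast(cancer_vs_healthy)
```

The result matches the "typehealthy - typecancer" row for the same cell group, so in practice only one direction of the contrast needs to be specified.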
This is achieved through model comparison with loo. In the following
example, the model with association with factors fits the data better
than the baseline model with no factor association. For comparisons,
check_outliers must be set to FALSE, as leave-one-out must work with the
same amount of data, while outlier elimination does not guarantee it.
If elpd_diff is more than 5 se_diff away from zero, we can be confident
that one model is better than the other (reference).
In this case, -79.9 / 11.5 = -6.9, therefore we can conclude that model
one, the one with factor association, is better than model two.
library(loo)
# Fit first model
model_with_factor_association =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    bimodal_mean_variability_association = TRUE,
    cores = 1,
    enable_loo = TRUE
  )
# Fit second model
model_without_association =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ 1,
    .sample = sample,
    .cell_group = cell_group,
    bimodal_mean_variability_association = TRUE,
    cores = 1,
    enable_loo = TRUE
  )
# Compare models
loo_compare(
  model_with_factor_association |> attr("fit") |> loo(),
  model_without_association |> attr("fit") |> loo()
)
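The decision rule described before the code can be written out as a short base R calculation. The two numbers are the ones quoted in the text; in practice they would be read from the loo_compare() output, and the variable names are only illustrative.

```r
# Values quoted in the text above; in practice they come from loo_compare().
elpd_diff <- -79.9
se_diff <- 11.5

# Number of standard errors separating the two models.
ratio <- elpd_diff / se_diff

# Rule of thumb from the text: more than 5 standard errors away from zero.
clearly_different <- abs(ratio) > 5

round(ratio, 1)
```

Here the ratio rounds to -6.9, which is beyond the threshold of 5, in line with the conclusion stated above.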
We can also model the cell-group variability as dependent on the type, and so test differences in variability.
res =
  seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ type,
    formula_variability = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    bimodal_mean_variability_association = TRUE,
    cores = 1, verbose = FALSE
  )
res
## # A tibble: 60 × 14
## cell_group parameter factor c_lower c_effect c_upper c_n_eff c_R_k_hat
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 B immature (Interce… <NA> 0.366 0.771 1.18 5085. 1.00
## 2 B immature typeheal… type 0.857 1.43 1.98 5419. 1.00
## 3 B mem (Interce… <NA> -1.49 -0.873 -0.216 5736. 1.00
## 4 B mem typeheal… type 1.08 1.85 2.63 4874. 1.00
## 5 CD4 cm S100A4 (Interce… <NA> 1.30 1.65 1.98 6680. 1.00
## 6 CD4 cm S100A4 typeheal… type 0.488 0.943 1.43 5134. 1.00
## 7 CD4 cm high cyto… (Interce… <NA> -1.06 -0.542 0.0339 4705. 1.00
## 8 CD4 cm high cyto… typeheal… type -3.07 -1.24 1.15 4987. 1.00
## 9 CD4 cm ribosome (Interce… <NA> -0.0718 0.311 0.706 4282. 1.00
## 10 CD4 cm ribosome typeheal… type -1.81 -0.966 0.0231 5178. 1.00
## # ℹ 50 more rows
## # ℹ 6 more variables: v_lower <dbl>, v_effect <dbl>, v_upper <dbl>,
## # v_n_eff <dbl>, v_R_k_hat <dbl>, count_data <list>
For single-cell RNA sequencing data, we recommend setting
bimodal_mean_variability_association = TRUE. The bimodality of the
mean-variability association can be confirmed from
plots$credible_intervals_2D (see below).
For CyTOF and microbiome data, we recommend setting
bimodal_mean_variability_association = FALSE (default).
It is possible to directly evaluate the posterior distribution. In this example, we plot the Monte Carlo chain for the slope parameter of the first cell type. We can see that it has converged and is negative with probability 1.
res |> attr("fit") |> rstan::traceplot("beta[2,1]")
Plot 1D significance plot
plots = res |> sccomp_test() |> plot()
## Joining with `by = join_by(cell_group, sample)`
## Joining with `by = join_by(cell_group, type)`
plots$credible_intervals_1D
Plot 2D significance plot. Data points are cell groups. Error bars are the 95% credible intervals. The dashed lines represent the default threshold fold change for which the probabilities (c_pH0, v_pH0) are calculated. A pH0 of 0 represents the rejection of the null hypothesis that no effect is observed.
This plot is provided only if differential variability has been tested.
The differential variability estimates are reliable only if the linear
association between mean and variability for (Intercept) (left-hand
side facet) is satisfied. A scatterplot (besides the Intercept) is
provided for each category of interest. For each category of interest,
the composition and variability effects should be generally
uncorrelated.
plots$credible_intervals_2D
The new tidy framework was introduced in 2024. To understand the differences and improvements compared to the old framework, please read this blog post.