chronospace
: statistical exploration of time-calibrated phylogenies and the relative impact of methodological decisions.
Calibrating phylogenies to absolute time is a complex and time-consuming task, yet it is a crucial first step in the inference of macroevolutionary history. Bayesian approaches to time scaling phylogenetic trees have dramatically grown in complexity, and hinge today upon numerous methodological choices (e.g., gene sampling strategy, type of clock, model of molecular evolution, shape of the priors implemented on calibrated nodes, set of fossil calibrations used, type of tree prior, etc.). Justifying all of these, or even a subset, can be challenging and computationally intensive. As a consequence, divergence times are routinely inferred under a limited range of parametric conditions, often chosen arbitrarily. Attempts to measure and summarize the sensitivity of results to such decisions rely on either tables summarizing results for selected clades or stacking consensus trees, but both approaches can be cumbersome to interpret, and lack a consistent way of quantifying (and therefore comparing) the relative effect of each choice.
chronospace
is an R package devised to help visualizing and
quantifying the sensitivity of results to methodological decisions
through the use of chronospaces, i.e., graphical representations
summarizing variation in the node ages contained in populations of
time-calibrated trees inferred using a fixed topology. In particular,
this package uses between-groups PCA (bgPCA) to summarize variation in
node ages induced by varying specific methodological choices, and
estimates their impact using a Sum of squares (SSQ) approach for
measuring their effect sizes. Visualization of chronospaces and other
graphical tools included in chronospace
are built with the aid of
ggplot2
(Wickham 2016), ggtree
(Yu 2017), and deeptime
(Gearty
2023), among others packages.
Note: this repository addresses the version of chronospace
described
in Mongiardino Koch and Milla Carmona
(2024). If
you are interested in the data and procedures described in Mongiardino
Koch et al. 2022 eLife paper (the precursor of this package), those
files are now deposited in the chronospaces_eLife
repository.
The development version of chronospace
can be installed from
GitHub using:
# install.packages("devtools")
devtools::install_github("mongiardino/chronospace")
Below, the capabilities of chronospace
are showcased using a dataset
of echinoid node
ages,
which represents of a subset of the inferences performed by Mongiardino
Koch et al. 2022. It is
comprised of six posterior distributions of time-calibrated trees
obtained using PhyloBayes (Lartillot et al. 2013). These were all run
using the same constrained topology (as required by the chronospace
suite of functions), but varying two methodological decisions:
-
Gene subsampling strategy: three different sets of 100 genes, subsampled from a larger phylogenomic data set based on their level of A) clock-likeness, B) phylogenetic signal, or C) selected at random.
-
Model of molecular evolution: each gene subsample was also run under two models of molecular evolution, the site-homogeneous GTR+G and the site-heterogeneous CAT+GTR+G (Lartillot & Philippe 2004).
This setting will allow us to evaluate whether inferred ages are more sensitive to the choice of loci or to the choice of model of molecular evolution.
The function extract_ages
imports and formats temporal data from a
series of external tree files (these must be in Newick format). Each
file should represent the posterior distribution of trees obtained from
an individual run or chain, and all trees (within and among files)
should share the same tree topology. Furthermore, they should be placed
together in an folder with no other file.
extract_ages
has three main arguments: path
- specifying the folder
where files are stored; type
- a list of vectors, one for each factor
being tested (in this case two, loci choice and model of evolution),
specifying the group (i.e., factor levels) to which the chronograms from
each file will be assigned to (in this case, three groups for loci
choice and two for model of evolution, see example below); and
sample
- the fixed number of topologies to retain from each file (here
we will sample 500 trees from each file for a total of 3000 trees).
Note: extract_ages
will import the files in the order they appear in
the folder; the list provided in type
must follow this same order.
#load chronospace
library(chronospace)
#> Registered S3 methods overwritten by 'treeio':
#> method from
#> MRCA.phylo tidytree
#> MRCA.treedata tidytree
#> Nnode.treedata tidytree
#> Ntip.treedata tidytree
#> ancestor.phylo tidytree
#> ancestor.treedata tidytree
#> child.phylo tidytree
#> child.treedata tidytree
#> full_join.phylo tidytree
#> full_join.treedata tidytree
#> groupClade.phylo tidytree
#> groupClade.treedata tidytree
#> groupOTU.phylo tidytree
#> groupOTU.treedata tidytree
#> is.rooted.treedata tidytree
#> nodeid.phylo tidytree
#> nodeid.treedata tidytree
#> nodelab.phylo tidytree
#> nodelab.treedata tidytree
#> offspring.phylo tidytree
#> offspring.treedata tidytree
#> parent.phylo tidytree
#> parent.treedata tidytree
#> root.treedata tidytree
#> rootnode.phylo tidytree
#> sibling.phylo tidytree
#Important! Check files names, compare against type order below
list.files(temp)
#> [1] "clockCATGTR_ln_sample.datedist" "clockGTR_ln_sample.datedist"
#> [3] "randomCATGTR_ln_sample.datedist" "randomGTR_ln_sample.datedist"
#> [5] "signalCATGTR_ln_sample.datedist" "signalGTR_ln_sample.datedist"
#Set type of runs and number of chronograms to be retained.
#Name the listed elements in type to add factor names; otherwise,
#generic names will be given.
type <- list(loci = c('clock', 'clock', 'random', 'random', 'signal', 'signal'),
model = c('CATGTR', 'GTR', 'CATGTR', 'GTR', 'CATGTR', 'GTR'))
sample <- 500
#Import data to R (this might take a minute)
data <- extract_ages(path = temp, type = type, sample = sample)
#> Check that labels are assigned correctly to the input files.
#> If there is an error, modify the order of factors in 'type'
#> or the name of input files, for the two to match.
#>
#> file = clockCATGTR_ln_sample.datedist | type = clock - CATGTR
#> file = clockGTR_ln_sample.datedist | type = clock - GTR
#> file = randomCATGTR_ln_sample.datedist | type = random - CATGTR
#> file = randomGTR_ln_sample.datedist | type = random - GTR
#> file = signalCATGTR_ln_sample.datedist | type = signal - CATGTR
#> file = signalGTR_ln_sample.datedist | type = signal - GTR
The function will organize these files into an object of class
"nodeAges"
, containing 1) a matrix where each row is a particular tree
in the posterior distribution and each column represents the estimated
age of a given node, 2) a data frame where each methodological choice is
coded as a separate factor, and 3) the fixed topology (used for
graphical output).
#Print nodeAges object
data
#> Data from 3000 trees with 65 internal nodes (see $ages and $topology),
#> Obtained using 6 methodological pathways (see $factors).
The node age data stored in objects of class "nodeAges"
can be
summarized using the chronospace
function, which uses bgPCA to find
the major directions of variation between groups. These bgPC axes are
directions across the multivariate chronospace along which groups of
chronograms that were inferred under different conditions, such as
different models of molecular evolution (e.g., GTR+G vs CAT+GTR+G), are
maximally discriminated. The function reports the percentage of
variation explained by each factor, both in relation to the total amount
of raw variation in node ages and for each individual bgPC extracted.
This information can be used to gauge the relative importance of each
choice for the estimation of node ages.
As can be seen below, in the specific case of the echinoid data set, the choice of loci explains a much larger fraction of the total variability in node ages than the choice of model of molecular evolution. This can be interpreted to mean that the choice of loci is a more important determinant of the results obtained, having a larger effect size. Finding ways of justifying that choice, or otherwise exploring the uncertainty of results of interest to it, is therefore more important than spending time/computational resources on the model of molecular evolution.
#Summarize chronospace
cspace <- chronospace(data)
#> loci (%) model (%) Unaccounted (%)
#> Total variation 14.79692 4.69255 80.51053
#> _________________________________________________________________________________
#> --- Results for loci (clock/random/signal) ---
#> loci (%) model (%) Unaccounted (%)
#> bgPC1(18.11%) 11.21472 0.11907 6.77879
#> bgPC2(7.58%) 3.58220 0.12878 3.86595
#> ---------------------------------------------------------------------------------
#> --- Results for model (CATGTR/GTR) ---
#> loci (%) model (%) Unaccounted (%)
#> bgPC1(10.42%) 0.38288 4.69255 5.3465
#> ---------------------------------------------------------------------------------
#> * All percentages are relative to the total amount of variation in node ages
Chronospaces can be also assessed using visualization tools. The plot
function implements methods to display scatter plots (if there are three
or more groups) or histograms (when there are only two), depicting the
distribution of trees in the space configured by bgPC axes. A few
visualization alternatives are shown below, including the use of
confidence ellipses, multivariate distances between centroids, and
opacity:
#Plot scatterplot for loci
plot(cspace, factor = 1, output = "ordination",
ellipses = TRUE)
#Plot scatterplot for loci, show distances between centroids
#in multivariate space
plot(cspace, factor = 1, output = "ordination",
distances = TRUE, pt.alpha = 0.05)
#Plot histogram for model
plot(cspace, factor = 2, output = "ordination")
Another graphical output that can be generated are pairs of chronograms showing the change in node ages at the extremes of bgPC axes. These are equivalent to the depictions of how landmarks move as one traverses a morphospace axis, and we therefore refer to them as chronogram warping diagrams. Change is represented by color-coding branches according to their degree of contraction/expansion along the bgPCA axis, depicting the way in which different node ages are realized (i.e., are younger ages for a given node induced by shortening of branches ahead of it, or expansion of branches behind it):
#Plot between-models PC1 extremes
plot(cspace, factor = 1, output = "extremes")
"nodeAges"
objects can also be fed to a number of functions that help
characterize the impact of methodological choices over particular
aspects of evolutionary history.
sensitive_nodes
plots the posterior distribution of ages for the nodes
who are most sensitive to each individual methodological choice. The
nodes most affected by varying different aspects of the analysis are
generally different. The user can either ask for a fixed number of most
affected nodes (argument num_clades
), or specify a threshold for the
amount of change above which nodes are considered to be relevant
(arguments amount_of_change
).
#Plot the ages distribution of the three nodes most affected by the model of
#evolution chosen
sensitive_nodes(echinoid_dates, num_clades = 3, factor = 2)
specified_node
produces a similar output, but instead of showing the
most affected nodes it displays the distribution of posterior ages for a
particular node of interest.This node is specified by the user through
the argument tips
, which takes a vector of length two specifying the
names of terminals used to find their most common recent ancestor.
#Plot the ages distribution of the MRCA of Brissus obesus and Abatus cordatus
#under different loci subsampling strategies
specified_node(echinoid_dates, factor = 1,
tips = c('Brissus_obesus', 'Abatus_cordatus'))
Finally, ltt_sensitivity
allows exploring the impact of methodological
choices on diversification rates by displaying separate
lineage-through-time curves for each level of a particular factor.
Averaging node ages across subsamples of chronograms can be computed
using either the mean or the median (argument summary
)
#Plot lineage through time plots for each model of evolution
ltt_sensitivity(echinoid_dates, factor = 1,
summary = "median", gscale = TRUE)
Gearty W. (2023). deeptime: Plotting Tools for Anyone Working in Deep Time. R package version 1.0.1, https://CRAN.R-project.org/package=deeptime.
Lartillot N., & Philippe H. (2004). A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. 21, 1095–1109. https://doi.org/10.1093/molbev/msh112.
Lartillot N., Rodrigue N., Stubbs D., & Richer J. (2013). PhyloBayes MPI. Phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst. Biol. 62, 611–615. https://doi.org/10.1093/sysbio/syt022.
Mongiardino Koch N. & Milla Carmona P. (2024). Chronospaces: an R package for the statistical exploration of divergence times reveals extreme dependence on molecular clocks and gene choice. bioRxiv 2024.02.04.578835. https://doi.org/10.1101/2024.02.04.578835.
Mongiardino Koch N., Thompson J.R., Hiley A.S., McCowin M.F., Armstrong A.F., Coppard S.E., Aguilera F., Bronstein O., Kroh A., Mooi, R. & Rouse G.W. (2022). Phylogenomic analyses of echinoid diversification prompt a re-evaluation of their fossil record. Elife, 11, e72460. https://doi.org/10.7554/eLife.72460.
Wickham H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Yu G., Smith D.K., Zhu H., Guan Y., & Lam T.T.Y. (2017). ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. MEE, 8(1), 28-36. https://doi.org/10.1111/2041-210X.12628.