# Microbiome Data Analysis {#microbiome}
\usepackage{amsmath} \usepackage{amssymb}
```{r, message=F, warning=F,echo=F}
library(mgcv)
library(vegan)
library(lattice)
library(labdsv)
library(tidyverse)
```
## Loading and Exploring the Guerrero Negro Data
Metagenome sequence information was downloaded from the NCBI link provided in the tutorial summary page on Learn. The OTUs were then clustered and classified, resulting in a read table ("table") and a taxa table ("utax").
To load the tables, we'll do the following:
```{r,include=F}
#### set working directory. make sure to change this!
load('~/bioinformatics_class/data/data_gn/GN16SData.RData')
```
```{r,warning=F}
counts <- GN16SData$otuTable
tax <- tibble(tax=rownames(counts)) %>%
separate(tax,c('Kingdom','Phylum','Class','Order','Family','Genus','Species','Subspecies'),sep=';') %>%
data.frame(stringsAsFactors=FALSE)
tax[is.na(tax)] <- ''
rownames(counts) <- sapply(1:nrow(counts),function(x) paste0('OTU_',x,collapse=''))
rownames(tax) <- sapply(1:nrow(counts),function(x) paste0('OTU_',x,collapse=''))
depth <- as.numeric(gsub('^GNmat\\.(.*?)mm$','\\1',colnames(counts)))
zone <- if_else(depth < 2,'A',if_else(depth >= 6,'C','B'))
```
Knowing what every command means is probably beyond the scope of this tutorial. We simply want two clean tables with the OTU numbers set as the row names.
Counts is a table of OTU counts for each mat; for example, rows 1 (OTU 4973) and 2 (OTU 41) of column 1 contain 1 and 14 occurrences, respectively.
```{r}
counts[1,1] ## row 1, column 1
counts[2,1] ## row 2, column 1
```
Note the sparsity of the count table, which could pose problems for traditional statistical methods.
```{r}
mean(counts == 0)
```
The taxa table shows the taxonomic breakdown (Kingdom - Phylum - ... - Species) for each OTU.
```{r}
head(tax)
```
Note that it's not possible to classify every OTU taxonomically, so you will see identical taxonomic assignments for different OTUs. These should still be considered distinct. For example, look at OTUs 8 and 9:
```{r}
tax["OTU_8",]
tax["OTU_9",]
```
They have the same taxonomic names, but, again, they should be considered unique OTUs.
## Statistical Analysis
The goal is to cover Canonical Correspondence Analysis (CCA) and Redundancy Analysis (RDA). These are extensions of "simpler" algorithms, Correspondence Analysis (CA) and Principal Component Analysis (PCA), respectively, so it's best to cover those first.
### PCA
PCA is an unsupervised learning method for finding structure in unlabeled data. The trick is dimensionality reduction: we end up with principal components that explain as much variation as possible, making latent groups more separable. The fitting criterion is minimizing the squared orthogonal projection distance, a problem similar to minimizing the sum of squared residuals in linear regression.
Given a unit vector $u$ and data vectors $x_i$, minimizing the projection residuals means minimizing the following:
$$
\begin{align}
\text{MSE}(u) &= \frac{1}{n}\sum_{i}(||x_i||^2-(x_{i}^{T} u)^2)\\
&= \frac{1}{n}\sum_{i}||x_i||^2 - \frac{1}{n}\sum_{i}(x_{i}^{T} u)^2
\end{align}
$$
As the second term gets large, the mean squared error of the projection residuals gets small. If we assume $||u||=1$, then the length of the projection of $x_i$ onto $u$ is $x_{i}^{T}u$:
$$
\text{proj}_{u}(x)=\frac{x^{T}u}{||u||}=x^{T}u
$$
The plan is then to choose a vector $u$ that maximizes
$$
\frac{1}{n}\sum_{i}(x_{i}^{T}u)^2\\
\text{s.t. }||u||=1
$$
We can rearrange the equation, so it will make a little more sense:
$$
\begin{align}
\frac{1}{n} \sum_{i}(x_{i}^{T}u)^2 &= \frac{1}{n}\sum_{i}(u^{T}x_{i})(x_{i}^{T}u)\\
&= u^{T}\left(\frac{1}{n}\sum_{i}x_{i}x_{i}^{T}\right)u\\
&= u^{T}\Sigma u
\end{align}
$$
Now the constrained maximization is handled with a Lagrange multiplier:
$$
\begin{align}
\mathcal{L}(u,\lambda) &= u^{T}\Sigma u-\lambda(u^{T}u-1)\\
\nabla_{u}\mathcal{L} &= 2(\Sigma u - \lambda u)
\end{align}
$$
Which we set to $0$ and solve:
$$
\begin{align}
\Sigma u - \lambda u &= 0\\
\Sigma u &= \lambda u
\end{align}
$$
Let's rearrange this one last time:
$$
u^{T}\Sigma u = \lambda
$$
We can think of the left side as the variation. Remember, the goal is to maximize this to give us the greatest separation. There are a few things that should be mentioned here. First, $\Sigma$ is the covariance matrix in this problem (or correlation matrix, assuming scaling beforehand), $u$ is an eigenvector, and $\lambda$ is the associated eigenvalue. By finding the eigenvalues and choosing the largest ones, we will effectively maximize the left side, the variation.
Assuming we have a matrix $X$ of centered, rescaled data, PCA proceeds as follows. We first calculate the correlation matrix of $X$ and then compute its eigenvalues and eigenvectors. We then find the largest eigenvalue $\lambda_{1}$ and project $X$ onto its corresponding eigenvector $u_{1}$, yielding the first principal component. We repeat this process with the second largest eigenvalue $\lambda_{2}$, projecting onto $u_{2}$ to acquire the second principal component. These two vectors can be regressed against one another, effectively reducing the dimensionality of $X$ to $\mathbb{R}^{2}$. We can proceed to calculate the remaining principal components, regressing different combinations to find patterns.

One indirect assumption in PCA (which will also be quite important for RDA) is that the covariance (correlation) matrix is calculated via the Pearson method, which assumes a linear association between two variables. Our data of interest could very well be related, but not linearly, posing problems for computing the correlations, which in turn leads to problems in the PCA. Considering OTU data tables are often zero-inflated, it shouldn't be surprising that the linearity assumption could be violated.
Let's quickly run through some sample data (where the labels are known), performing PCA "manually," to show you what's happening. Then, we'll rerun the analysis with the same sample data but with a base PCA command in R called princomp(). Finally, we'll run PCA on the Guerrero Negro matrix.
#### PCA Example with Labeled Data
Let's say we are given weight lifting data. The PCA won't know this, but we know the labels beforehand, which break down into three groups: (1) average-sized men, (2) very tall men, and (3) average-sized women. The recorded information ("features") is height, weight, and amount of weight moved (say, for benching).
First, we have to generate the data:
```{r}
set.seed(4131)
x1 <- c(rnorm(100,68,5),rnorm(100,78,5),rnorm(100,63,4)) ## height
x2 <- c(rnorm(100,215,40),rnorm(100,200,20),rnorm(100,125,15)) ## weight
x3 <- c(rnorm(100,280,50),rnorm(100,180,15),rnorm(100,95,15)) ## exercise
label <- rep(c("Compact Guys","Tall Guys","Women"),each=100) ## labels
```
Now, scale and center the data $(x_{i}-\bar x)/\sigma_{i}$:
```{r}
x1 <- scale(x1)
x2 <- scale(x2)
x3 <- scale(x3)
```
Form the data matrix $X$ and the covariance matrix $\Sigma$:
```{r}
x <- cbind(x1,x2,x3)
colnames(x) <- c("height","weight","exercise") ## name the feature columns
E <- cov(x)
```
This was mentioned above in passing, but just to prove it: because we scaled the data, the covariance matrix is essentially the correlation matrix:
```{r}
cor(x)
cov(x)
```
Perform an eigendecomposition of $\Sigma$ by solving $\text{det}(\Sigma - \lambda I)=0$:
```{r}
evd <- eigen(E)
```
This results in a list of 3 eigenvalues and 3 eigenvectors:
```{r}
print(evd)
```
Project the data $X$ onto the eigenvectors:
```{r}
ev1 <- evd$vectors[,1] ## eigenvector 1, corresponding to the largest eigenvalue thus most variation
ev2 <- evd$vectors[,2] ## eigenvector 2
ev3 <- evd$vectors[,3] ## eigenvector 3
pc1 <- x %*% ev1
pc2 <- x %*% ev2
pc3 <- x %*% ev3
```
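As a quick check on the math above, the sample variance of each principal component should equal its corresponding eigenvalue:
```{r}
## Variance of each principal component vs. the eigenvalues from the decomposition
round(c(var(pc1), var(pc2), var(pc3)), 4)
round(evd$values, 4)
```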
Regress the first two principal components:
```{r}
df <- data.frame(cbind(pc1,pc2),label)
names(df) <- c('PC1','PC2','Group')
ggplot(df,aes(x=PC1,y=PC2,colour=Group)) + geom_point(alpha=1)
```
The figure shows clear separation between men and women, and clustering of all three groups. Had we not known the labels to color-code the output, we would at least have been able to confidently separate the data into two groups, men and women.
Now, let's see what the base PCA command gives us.
```{r}
out <- princomp(x)
df <- data.frame(cbind(out$scores[,1],out$scores[,2]),label)
names(df) <- c('PC1','PC2','Group')
ggplot(df,aes(x=PC1,y=PC2,colour=Group)) + geom_point(alpha=1)
```
Not surprisingly, it's the same figure, albeit reflected because of sign differences between the respective outputs:
```{r}
cbind(ev1,ev2,ev3)
out$loadings
```
We should look at one other figure, one that shows the variation as a function of the principal components. While not incredibly useful here because we are only working with three features, in other situations, being able to quickly identify which components explain the majority of the variation will help home in on important components for regression and the like.
```{r}
plot(out,type="l")
```
We can see that most of the variance is accounted for by the first two components, which is consistent with the fact that the groups were indeed separable in the figure.
Here's a figure of the third component plotted against the second to show how the separability is far less pronounced since these two explain far less variation than PC1.
```{r}
df <- data.frame(cbind(out$scores[,2],out$scores[,3]),label)
names(df) <- c('PC2','PC3','Group')
ggplot(df,aes(x=PC2,y=PC3,colour=Group)) + geom_point(alpha=1)
```
#### PCA with Guerrero Negro Data
If you no longer have the GN data loaded, go ahead and reload it. We need the "counts" matrix. Let's go ahead and perform PCA. We will center the data but abstain from rescaling because we are working with counts.
```{r}
cen.counts <- counts - mean(counts)
gn.out <- princomp(cen.counts)
```
If you print the output, you'll notice that there are 10 PCs now, since there are 10 mats (columns).
```{r}
head(gn.out$scores)
```
Considering there are so many, figuring out how many explain most of the variance would be beneficial.
```{r}
plot(gn.out,type="l",main="Explained Variance")
summary(gn.out)
```
We can see that the first 6 PCs explain more than 95% of the variance. Let's plot the first two components.
```{r}
df <- data.frame(cbind(gn.out$scores[,1],gn.out$scores[,2]),rownames(cen.counts),tax[,2])
names(df) <- c('PC1','PC2','OTU','Phylum')
ggplot(df,aes(x=PC1,y=PC2,label=OTU)) + geom_text(size=3,position="jitter")
ggplot(df,aes(x=PC1,y=PC2,label=Phylum)) + geom_text(size=3,position="jitter")
```
Given the number of OTUs, this figure is difficult to read. One thing to note is that the taxa information we loaded from before can be quite useful. Notice in the second plot, instead of using OTU labels, we used phylum labels. We could potentially see clusters of the same phyla, then look at their OTUs and mat location, and draw interesting conclusions.
Let's quickly zoom in:
```{r}
ggplot(df,aes(x=PC1,y=PC2,label=Phylum)) + geom_text(size=3,position="jitter") +xlim(-50,20) +ylim(-25,25)
```
Note the peculiar shape of the distribution; it's curved, "horseshoe-like." This is a serious problem resulting from secondary gradients in the data; hence, the figure cannot be trusted. A detailed explanation of this artifact is below in the "arch effect" section ("arch effect" is simply the term for essentially the same problem in CA).
Despite this analysis being flawed, we can still explore some of the PCA output. Here are the loadings for the PCA results. These values tell us how much each variable contributes to each PC.
```{r}
gn.out$loadings
```
A biplot will show the same distribution of points we've been looking at in the above PCA plots where we regress components against each other, but it also plots information about these loadings.
```{r,fig.height=10}
biplot(gn.out)
```
What we care about here are those red lines. You can interpret their orientation the same way you'd interpret regression lines. Notice in the printout of the loadings above that mats 5 and 6 contribute very little to PC2; we would therefore expect a nearly zero slope with respect to PC2, something easily identifiable in the biplot.
Other useful information we can gather is based on which mats pair with one another. If mats explain a similar amount of information for both PCs, then one would expect that their "lines" would be nearly identically positioned. We can extrapolate from this that, say, mats 1 and 2 are likely similar, whereas mats 1 and 9, which are at right angles, are quite different.
### RDA
Because we got PCA out of the way, we've basically covered the second step in Redundancy Analysis (RDA). The term "redundancy" in this context can be thought of as a synonym for "explained variation." The procedure is quite simple. Assume we are working with the same data matrix, "counts." Now, say we have some explanatory variables that are capable of explaining some of the variation across mats. We actually loaded two of these earlier: depth and zone. In our case, multiple regression would be performed using these two variables, with a particular OTU as the response. The fitted values from the best-fit line would then replace the original response values in our matrix. Regressions are performed for every OTU. Essentially, each column is being smoothed based on the linear relationships between OTUs and the explanatory variables. This new "smoothed" matrix is then fed into PCA. The reason this is called a "redundancy" analysis may make more sense now: say two OTUs have very similar best-fit lines after their respective regressions; their fitted values would then be nearly the same, and we can consider these OTUs to be redundant -- i.e., they explain the same information.

It should also be noted that the linearity assumption presents itself twice in RDA: first, the regressions themselves assume a linear association between variables, and second, as mentioned before, the covariances calculated for the PCA covariance matrix also assume a linear association between variables. Zero-inflated data could very well pose a problem.
We'll start by demonstrating exactly how this new matrix is generated. Because we're performing a linear regression, and we care about accurate estimates, the typical regression assumptions apply. While this can be tedious for large datasets, it's nevertheless good practice to deal with normality and outlier issues. If you are familiar with generalized linear models, you may be concerned about the sparsity of the matrix, which would severely affect the accuracy of the predictions, and you might be considering some zero-inflated implementation. While that is beyond the scope of this tutorial, the sparsity is certainly a concern.
First, let's transpose the matrix to make our responses column vectors.
```{r}
cen.counts <- t(cen.counts)
```
The overall regression model will be
$$
\hat{\text{Counts}}_{\cdot j} = \beta_{1}\text{Depth} + \beta_{2}\text{Zone} + \beta_{3}\,\text{Depth}\times\text{Zone}
$$
Note that $\hat{\text{Counts}}_{\cdot j}$ is column vector $j$ of the matrix $\hat{\text{Counts}}$. Also, notice that we have added an interaction term. Let's work with only column 3 and just the explanatory variable depth. We're going to center and scale depth as well.
```{r}
cen.depth <- scale(depth)
m1 <- lm(cen.counts[,3] ~ cen.depth)
fit1 <- fitted(m1)
plot(cen.depth,cen.counts[,3],pch=19,ylab="counts")
points(cen.depth,fit1,type="l",col="red",lwd=3)
```
Here, we would replace the raw values in column 3 of the matrix with the values from the best fit line. For comparison, here are the raw and fitted values:
```{r}
rbind("raw"=cen.counts[,3],"fit"=fit1)
```
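Before switching to vegan, here is a minimal sketch of the full manual procedure described above: fit the same regression to every OTU column, collect the fitted values, and run a PCA on the fitted matrix. This is only an illustration of the idea (vegan's rda() applies its own scaling internally), reusing the cen.counts, cen.depth, and zone objects from the chunks above.
```{r}
## Regress every OTU column on depth, zone, and their interaction,
## keeping only the fitted ("smoothed") values
fit.mat <- apply(cen.counts, 2, function(y) fitted(lm(y ~ cen.depth * factor(zone))))
## PCA of the fitted matrix; prcomp() is used because there are far more
## OTUs (columns) than mats (rows)
man.rda <- prcomp(fit.mat)
head(man.rda$x[, 1:2]) ## mat scores on the first two constrained axes (up to scaling)
```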
The plotting can be tricky, so it's probably easiest to jump right into the rda() function from vegan. Let's run rda() on our matrix. Note that I'm setting zone to a factor. Most functions do this automatically, but it's nevertheless good practice. In case you are unsure, you can think of a factor the way you would in statistics: a categorical variable composed of levels. (I'm simply renaming the variables here so the figures later on will be easier to read because the names will be shorter.)
```{r}
Zone <- factor(zone)
Depth <- cen.depth
gn.rda <- rda(cen.counts ~ Depth*Zone)
print(gn.rda)
```
Focus on the proportions of constrained and unconstrained variance. RDA is a "constrained" ordination model, so the proportion of constrained variance is the amount of variance explained by our explanatory variables, depth and zone. We can see that this amounts to about 81% of the variance. Note that this is analogous to an $R^2$ value in regression models. We can obtain an adjusted $R^2$ via the following command:
```{r}
RsquareAdj(gn.rda)$adj.r.squared
```
There's also a way to compare our model to chance:
```{r}
anova(gn.rda)
```
This performs a permutation test, where the unconstrained data matrix is randomized by row. Given the small p-value, we can assume that the constraining variables, and hence our model, performed better than the permuted matrix. Between the amount of constrained variance and the significance of the permutation test, there's quite a bit of evidence that these variables help explain the distribution of OTUs across mats.
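If you want to dig a little deeper, vegan's anova() method for rda objects can also test each term separately; a quick sketch (the permutation p-values will vary slightly from run to run):
```{r}
## Sequential permutation tests for Depth, Zone, and their interaction
anova(gn.rda, by = "terms", permutations = 999)
```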
Lastly, we'll plot the results.
```{r}
plot(gn.rda,scaling=2,type="n")
points(gn.rda,"sp",col="darkgreen",pch=19)
text(gn.rda,display="sites",cex=.6)
text(gn.rda,display="cn",cex=.8,col="blue")
```
The figure tells us quite a bit. Note that zone A clusters with the more superficial mats, which should be expected. This relationship is found among the other zones and mats as well. The interaction levels differ from one another, with zone B's behavior with respect to depth differing the most from zone A's. Still, recall that odd shape in the PCA: it's present in the RDA model as well. Notice the arrangement of the mats.
### CA
Before we begin with CCA, we have to first understand CA. Recall that RDA is simply an extension of PCA, where we perform multivariate regression followed by PCA, an eigen-decomposition of a covariance matrix. CCA is the analogous extension of CA: it involves *weighted* linear regression followed by CA, an eigen-decomposition of a *Chi-squared distance* matrix.
Chi-squared distance is defined as follows:
$$
\begin{align}
\chi_{ij}^{2} &= \frac{(Obs_{ij}-\mathbb{E}_{ij})^2}{\mathbb{E}_{ij}}\\
\chi_{ij} &= \frac{Obs_{ij}-\mathbb{E}_{ij}}{\sqrt{\mathbb{E}_{ij}}}\\
\end{align}
$$
where $Obs_{ij}$ is the observed cell *frequency* for element $(i,j)$ and $\mathbb{E}_{ij}$ is the expected *frequency* for element $(i,j)$. One thing that should be stressed here: unlike the Pearson correlation used in PCA (and therefore RDA), the Chi-squared distance **does not** assume a linear relationship between variables.
To be consistent with various packages and with Legendre and Legendre, we need to decide which dimension holds the species (OTUs) and which holds the sites (mats). While Legendre and Legendre assume $|r|>|c|$, we want to compare our results with vegan's cca() output, so we'll transpose the matrix, setting the mats as the rows and the OTUs as the columns.
```{r}
counts <- t(counts)
counts[1:5,1:5]
```
We begin by calculating the Chi-squared distances for our count matrix, but hopefully you noticed that the distance deals with frequencies. Let's first convert our count matrix to a frequency matrix, which we'll call $O_{ij}$.
```{r}
Oij <- as.matrix(counts/sum(counts)) ## counts to frequencies
```
Now, $\mathbb{E}_{ij}$ is simply the outer product of this matrix's row and column sums. Note that the row sums (the total relative abundance of OTUs within each mat) will be used as the "OTU weights," whereas the column sums (each OTU's total relative abundance across mats) will be used as the "mat weights." These will come in handy later because they will be our weights for visualization (and recall CCA uses *weighted* regression):
```{r}
pi <- rowSums(Oij) ## OTU weight
pj <- colSums(Oij) ## mat weight
Eij <- outer(pi,pj)
```
We can now compute the Chi-squared distance matrix, which we'll call $\bar Q$:
```{r}
Qbar <- (Oij-Eij)/sqrt(Eij)
```
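As an aside (not part of the CA algorithm itself), the entries of $\bar Q$ are standardized Pearson residuals of the frequency table, so the classic Pearson chi-squared statistic of the raw count table should equal the grand total times the sum of squared entries of $\bar Q$:
```{r,warning=F}
## chisq.test() warns about small expected counts here; that's fine for this check
c(chisq.test(counts)$statistic, sum(counts) * sum(Qbar^2))
```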
You can imagine we are at the step in PCA where we have just computed the covariance matrix. If you recall, we would now perform an eigenvalue decomposition. The same holds true here for CA, but with our Chi-squared matrix. Because an eigenvalue decomposition requires a square matrix, we instead perform a singular value decomposition with the svd() function. (Tangent: there are two PCA functions in base R, prcomp() and princomp(), which perform singular value decomposition and eigenvalue decomposition, respectively, with similar results.)
```{r}
Qsvd <- svd(Qbar) ## singular value decomposition of the Chi-squared matrix
```
The SVD factorizes $\bar Q = U D V^{T}$, where the columns of $U$ and $V$ are the left and right singular vectors, respectively. We use the row and column weights from above to adjust the positions of the points in $\bar Q$ for visualization. As we'll see, CA (and CCA) can be plotted on two different scales; therefore, we need two separate adjustments of $\bar Q$ for two separate sets of positions in space.
$U$ is adjusted with a diagonal matrix of the inverse square roots of the row weights, $p_{i}$, and $V$ with a diagonal matrix of the inverse square roots of the column weights, $p_{j}$. These weighted matrices will be called $\hat V$ and $V$.
```{r}
Vhat <- diag(1/sqrt(pi)) %*% Qsvd$u
V <- diag(1/sqrt(pj)) %*% Qsvd$v
```
Finally, we calculate $F$ and $\hat F$:
```{r}
FF <- diag(1/pi) %*% Oij %*% V
FFhat <- diag(1/pj) %*% t(Oij) %*% Vhat
```
We now have two sets of spatial positions for two scalings:
* scaling 1 involves plotting the second column versus the first column of $F$ (the mat coordinates) and of $V$ (the OTU coordinates) on the same figure.
* scaling 2 involves plotting the second column versus the first column of $\hat F$ (the OTU coordinates) and of $\hat V$ (the mat coordinates) on the same figure.
```{r}
df <- data.frame("x" = c(FF[,1],V[,1],Vhat[,1],FFhat[,1]),
"y" = c(FF[,2],V[,2],Vhat[,2],FFhat[,2]),
"points" = c(rep(1:2,c(dim(FF)[1],dim(V)[1])),rep(1:2,c(dim(Vhat)[1],dim(FFhat)[1]))),
"scaling" = rep(1:2,c(dim(FF)[1]+dim(V)[1],dim(Vhat)[1]+dim(FFhat)[1])),
"names" = rep(c(rownames(counts),colnames(counts)),2))
df$points <- factor(df$points)
df$scaling <- factor(df$scaling)
ggplot(df,aes(x,y,colour=points,size=points)) +
facet_wrap(~ scaling) + geom_point(alpha=.3) +
scale_size_manual(values=c(1,3)) + scale_colour_manual(values=c("black","red")) +
geom_text(data=subset(df,points=="1"),aes(label=names),size=4) +
theme(legend.position = "none")
```
Which is the same as what vegan gives us (except prettier):
```{r,fig.height=7}
m1 <- cca(counts)
par(mfrow=c(2,1))
plot(m1,scaling=1)
text(m1, "sites", col="black")
plot(m1,scaling=2)
text(m1, "sites", col="black")
```
It might be difficult to appreciate what the scaling is doing, but look at scaling 1. Here we first use the $F$ matrix to plot the mat centroids ("center"). This matrix $F$ preserves the Chi-squared distance between mats, so this figure is a spatially accurate representation of the relationship among sites. The positions of the centroids are weighted (and thus determined) by the relative frequencies of each OTU (i.e., each column). Finally, we superimpose the $V$ matrix (the OTU coordinates) on top. The OTUs in the vicinity of a particular mat are more likely to be found in that mat.
For scaling 2, we first plot $\hat F$, which are the centroids of the OTU data (which are positioned by weighting them with the relative frequencies of each mat), and then superimpose $\hat V$, the coordinates for mats. This scaling is therefore an accurate spatial representation of the relationship among OTUs. Like scaling 1, any species in the vicinity of a specific mat is more likely to be found in that mat.
Clearly, the gradient for the mats we've seen so far is also present in the CA plots. Something we could try is to color code the OTUs based on taxa. This will allow us to appreciate scaling 2 more since we can visualize the relationship among OTUs. Let's just work with Archaea, only using scaling 2:
```{r}
df <- data.frame("x" = c(FF[,1],V[,1],Vhat[,1],FFhat[,1]),
"y" = c(FF[,2],V[,2],Vhat[,2],FFhat[,2]),
"points" = c(rep(1:2,c(dim(FF)[1],dim(V)[1])),rep(1:2,c(dim(Vhat)[1],dim(FFhat)[1]))),
"scaling" = rep(1:2,c(dim(FF)[1]+dim(V)[1],dim(Vhat)[1]+dim(FFhat)[1])),
"names" = rep(c(rownames(counts),colnames(counts)),2),
"kingdom" = c(tax[,1],rep("Site",10),tax[,1],rep("Site",10)),
"phylum" = c(tax[,2],rep("Site",10),tax[,2],rep("Site",10)),
"class" = c(tax[,3],rep("Site",10),tax[,3],rep("Site",10)))
df.scale2 <- subset(df, scaling==2)
ggplot(subset(df.scale2,kingdom=="Bacteria"),
aes(x,y,colour=kingdom,alpha=kingdom,size=kingdom)) +
geom_point(colour="gray",alpha=.7) +
geom_point(data=subset(df.scale2,kingdom=="Archaea"),
aes(colour=phylum),size=4,alpha=1) +
geom_text(data=subset(df.scale2,points=="1"),
aes(label=names),size=4,colour="red",alpha=.6) +
guides(size=FALSE,alpha=FALSE)
```
The green and blue dots represent the two Archaea phyla in the data set. There doesn't seem to be much of a pattern, but let's see what those are as a function of class (one taxonomic level lower).
```{r}
ggplot(subset(df.scale2,kingdom=="Bacteria"),
aes(x,y,colour=kingdom,alpha=kingdom,size=kingdom)) +
geom_point(colour="gray",alpha=.7) +
geom_point(data=subset(df.scale2,kingdom=="Archaea"),
aes(colour=class),size=4,alpha=1,
position=position_jitter(w=.075,h=.075)) +
geom_text(data=subset(df.scale2,points=="1"),
aes(label=names),size=4,colour="red",alpha=.6) +
guides(size=FALSE,alpha=FALSE)
```
Note that the points are jittered to make them easier to distinguish. There now seems to be an interesting trend: Halobacteria species are found more toward the center, whereas Methanomicrobia, Thermococci, and Thermoplasmata are located at the extremes (i.e., superficial or deep mats).
Lastly, let's just color-code all of the OTUs based on phylum.
```{r,fig.width=10}
ggplot(df.scale2,aes(x,y,colour=phylum)) + geom_point(alpha=.5) +
geom_text(data=subset(df.scale2,points=="1"),
aes(label=names),size=4,colour="black") +
guides(size=FALSE, colour=guide_legend(ncol=3))
```
There doesn't seem to be much of a pattern.
#### Arch effect
If you look at the figures we've been working with, you may have noticed their shape: they tend to look like an arch. This is due to our mats being correlated with one another. This should be no surprise considering the whole system is built as a gradient, so mats 1 and 2 are far more related than mats 1 and 9. This gradient manifests itself in the matrix, and hence the analysis can't help but join correlated sites to one another.
To demonstrate this, consider the following simple matrix:
```{r}
comm <- matrix(0,nrow=10,ncol=10)
rownames(comm) <- paste("site",1:dim(comm)[1])
colnames(comm) <- paste("sp",1:dim(comm)[2])
for (i in 1:10){
left <- rep(0,i-1)
right <- rep(0,10-i)
comm[i,] <- c(left,1,right)
}
levelplot(comm,
panel = function(...){
panel.levelplot(...)
panel.abline(h=seq(.5,9.5,1))
panel.abline(v=seq(.5,9.5,1))
},colorkey=FALSE)
```
Let pink and cyan represent 0s and 1s, respectively. For this matrix, there is no correlation between sites; each site is associated with its own unique species with no overlap. There is no implicit way for the matrix to "know" that site 1 could be similar to site 2 because there is no information in the matrix reflecting that. By extension, there should be no pattern between sites when we perform CA.
```{r}
arch1 <- cca(comm)
plot(arch1,scaling=1)
```
It might be difficult to tell, but the site text is superimposed onto the species text, so as we saw in the matrix, each site is associated with its own species, and there is no pattern between sites because there are no overlapping species.
Now, take a look at this matrix:
```{r}
comm <- matrix(0,nrow=8,ncol=10)
rownames(comm) <- paste("site",1:dim(comm)[1])
colnames(comm) <- paste("sp",1:dim(comm)[2])
for (i in 1:8){
left <- rep(0,i-1)
right <- rep(0,10-2-i)
comm[i,] <- c(left,rep(1,3),right)
}
levelplot(comm,
panel = function(...){
panel.levelplot(...)
panel.abline(h=seq(.5,9.5,1))
panel.abline(v=seq(.5,7.5,1))
},colorkey=FALSE)
```
Now there are shared species among sites: site 1 shares species 2 and 3 with site 2, site 2 shares species 3 and 4 with site 3, and so on. The sites are therefore correlated and CA will do its best to maintain the pattern we're seeing between sites: site 1 will be near site 2, site 2 near site 3, etc. Let's look at the result from CA.
```{r}
arch2 <- cca(comm)
plot(arch2,scaling=1)
```
Unsurprisingly, the gradient is preserved, but what causes the arch? It is an artifact of forcing the data into two dimensions. There are a few things going on here. Notice that the first ordination axis, alone, accurately separates the data:
```{r}
summary(arch2)$sites
plot(summary(arch2)$sites[,1],rep(0,8),type="n",ylab="",xlab="CA1")
text(summary(arch2)$sites[,1],rep(0,8),rownames(comm))
```
CA attempts to maximize the variation (i.e., separation) between sites, which the first axis accomplishes completely. The second axis nevertheless tries, but it must remain uncorrelated with the first. Because the sites are already completely separated, there are no remaining gradients to drive the second axis. The only candidate left for the second axis is a folded version of the first; the folding means the two axes are no longer linearly correlated, satisfying the aforementioned constraint, and after folding it explains over half the variance explained by the first axis (which shouldn't be surprising, considering it's essentially the first axis):
```{r}
summary(arch2)$cont
```
You can gain appreciation for the folding if you simply plot the two axes on top of one another:
```{r}
plot(summary(arch2)$sites[,1],type="b",lwd=2,ylab="CA Axes")
points(summary(arch2)$sites[,2],type="b",lwd=2,col="red",ylab="CA Axes")
```
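Another way to see the folding, as a quick check using the same site scores extracted above: CA2 is essentially a quadratic function of CA1, even though the two axes are linearly uncorrelated.
```{r}
ca1 <- summary(arch2)$sites[,1]
ca2 <- summary(arch2)$sites[,2]
cor(ca1, ca2)                               ## near 0: no linear correlation
summary(lm(ca2 ~ ca1 + I(ca1^2)))$r.squared ## close to 1: CA2 is largely a quadratic function of CA1
```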
The underlying issue with the arch effect is that CA2 is not interpretable, rendering these arch-shaped figures flawed. Just look at the shifts in spacing between sites and species in $\mathbb{R}^1$ and $\mathbb{R}^2$.
```{r,fig.height=7}
par(mfrow=c(2,1))
plot(summary(arch2)$species[,1],rep(0,10),type="n",ylab="CA1",ylim=range(-2,2),xlab="")
text(summary(arch2)$sites[,1],rep(0,8),rownames(comm))
text(summary(arch2)$species[,1],rep(0,10),colnames(comm),col="red")
plot(arch2,scaling=2)
```
The separation at the extremes changes when plotted against CA2, and since CA2 is an artifact, that shift in separation is not to be trusted.
#### DCA
One solution to the arch effect is to perform a detrended CA (DCA), which can also alleviate some of the compression between points at the ends of the gradient. We'll use the decorana() function:
```{r}
arch3 <- decorana(comm)
plot(arch3)
```
And let's compare this to the CA figures in $\mathbb{R}^1$ and $\mathbb{R}^2$
```{r}
levelplot(comm,
panel = function(...){
panel.levelplot(...)
panel.abline(h=seq(.5,9.5,1))
panel.abline(v=seq(.5,7.5,1))
},colorkey=FALSE)
```
```{r,fig.height=10}
par(mfrow=c(3,1))
plot(summary(arch2)$species[,1],rep(0,10),
type="n",ylab="CA1",ylim=range(-2,2),xlab="",main="CA 1D")
text(summary(arch2)$sites[,1],rep(0,8),rownames(comm))
text(summary(arch2)$species[,1],rep(0,10),colnames(comm),col="red")
plot(arch2,scaling=2,main="CA 2D")
plot(arch3,main="DCA")
```
If you take a look at the matrix, you can see that species 1 and 10 are exclusive to sites 1 and 8, respectively. This is a feature that neither of our previous methods picked up, but DCA did. Moreover, it makes the most sense for site 1 to be centered among species 1, 2, and 3, and for site 8 to be centered among species 8, 9, and 10, which DCA clearly accomplishes.
DCA basically runs CA first, allowing the first axis to bend into the second, and then divides the first axis into segments. Within each segment, the mean of the corresponding second-axis scores is forced to 0. With a zero mean in each segment, the overall mean of axis 2 is now zero, forcing the "arch" into a straight line. By straightening the second axis, any compression at the ends is also alleviated.
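Here is a rough illustration of the detrending-by-segments idea, using the arch example from above. This is only a caricature: decorana() also rescales the axes and chooses its own segments.
```{r}
## Split CA1 into segments and remove the segment-wise mean from CA2
seg <- cut(summary(arch2)$sites[,1], breaks = 4)
ca2.detr <- ave(summary(arch2)$sites[,2], seg, FUN = function(z) z - mean(z))
plot(summary(arch2)$sites[,1], ca2.detr, pch = 19,
     xlab = "CA1", ylab = "CA2 (detrended by segment)")
```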
Let's quickly run DCA on our real data.
```{r}
gn.dca <- decorana(counts,iweigh=1)
```
Note that I have "iweigh" set to 1. This downweights rare species, since CA is sensitive to these outliers.
```{r}
plot(gn.dca,display="species",type="n")
points(gn.dca,display="species",col="black",pch=3)
points(gn.dca,display="sites",col="red",pch=19)
out <- summary(gn.dca,scaling=1)
df <- data.frame("DCA1"=c(out$spec.scores[,1],out$site.scores[,1]),
"DCA2"=c(out$spec.scores[,2],out$site.scores[,2]),
"points" = rep(1:2,c(dim(counts)[2],dim(counts)[1])),
"kingdom" = c(tax[,1],rep("site",dim(counts)[1])),
"phylum" = c(tax[,2],rep("site",dim(counts)[1])),
"order" = c(tax[,3],rep("site",dim(counts)[1])),
"family" = c(tax[,4],rep("site",dim(counts)[1])),
"names" = c(colnames(counts),rownames(counts))
)
df$points <- factor(df$points)
df.sp <- subset(df,points == "1" & kingdom != "Archaea")
ggplot(df.sp,aes(DCA1,DCA2)) + geom_point(colour="black",alpha=.2) +
geom_text(data=subset(df,points == "2"),aes(label=names),size=5) +
geom_point(data=subset(df,kingdom=="Archaea"),
aes(colour=phylum),size=4,alpha=1)
```
### CCA
Canonical Correspondence Analysis (CCA) involves steps similar to those in RDA. We first measure the distances between rows (mats) and columns (OTUs) in our matrix using the Chi-squared distance. Weighted linear regression is then performed on the Chi-squared matrix versus a set of explanatory variables (e.g., Depth and Zone), replacing the original distances with the fitted distances, just like RDA. Note that the weights are the same weights used in CA -- i.e., the row sums (a vector of the total abundances for each mat). Finally, we project the matrix via eigenvalue decomposition (or SVD).
Let's start with the computation performed by CCA and note the similarities with CA and RDA. CCA begins by calculating the Chi-squared distance matrix $\bar Q$ seen in CA. Let's quickly recalculate $\bar Q$:
```{r}
Oij <- as.matrix(counts/sum(counts)) ## counts to frequencies
pi <- rowSums(Oij) ## OTU weight
pj <- colSums(Oij) ## mat weight
Eij <- outer(pi,pj) ## expected value for each element in Oij
Qbar <- (Oij-Eij)/sqrt(Eij) ## Chi-squared distance matrix
```
So that much is identical to CA and is equivalent to calculating the covariance matrix in RDA. Now, like RDA, we perform linear regression, but this time it will be **weighted** linear regression. We'll take it easy for now and only use one of our two explanatory variables, so we'll be regressing each OTU's set of mat values against the explanatory variable depth. First, we'll adjust depth with our weights:
```{r}
xw <- sum(depth*pi) ## weighted mean of depth (pi, the row weights from above, sum to 1)
xw <- (depth-xw)*sqrt(pi) ## center depth at the weighted mean, then weight by sqrt(pi)
```
The weights adjust depth by the square root of each mat's total relative OTU abundance:
```{r}
par(mfrow=c(1,2))
plot(depth,pch=19,ylim=range(min(xw,depth),max(xw,depth)),xlab="",ylab="Depth",main="Depth")
points(xw,pch=19,col="red")
plot(sqrt(pi),pch=19,ylim=range(0,.5),
     col="red",xlab="",ylab=expression(paste(sqrt(OTU~~Weight))),
     main=expression(paste("OTU Weight (",sqrt(p[i]),")")))
```
The left plot shows the raw depths (black) and the weighted depths (red). Because the weights for each mat are quite similar, we are effectively rescaling the raw depths to about 30% of their original values. While our data set isn't a great example, one can imagine how these weights could influence the explanatory variables if the site (mat) proportions varied drastically.
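To see where that ~30% comes from, note that the mats contribute roughly equal totals, so each weight is close to $1/10$ and its square root is close to $\sqrt{0.1}\approx 0.32$:
```{r}
round(sqrt(pi), 3) ## the per-mat factors used to rescale depth
```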
Now we can perform our regression:
```{r}
yhat <- fitted(lm(Qbar ~ 0 + xw))
```
In case you're curious, here are the simple and weighted regression lines for OTU 43 (column 3, the same regression shown for RDA).
```{r}
simp <- fitted(lm(Qbar[,3] ~ 0 + depth))
plot(xw,Qbar[,3],pch=19,xlab="",ylab="OTU Freq.")
points(xw,yhat[,3],type="l",col="red",lwd=3)
points(xw,simp,type="l",col="blue",lwd=3)
points(xw,pi,col="gray",pch=3)
legend(-4,.1,c("Weighted","Unweighted"),col=c("red","blue"),lty=c(1,1))
```
It might seem as if that extreme value is having more of an effect, but what you should focus on is that the weighted regression line fits the first 7 values much better; hence, the line is driven more by them than by the last 3.
The next step in CCA is what you'd expect if you remembered what came after regression in RDA: we now perform a singular value decomposition (remember, the matrix isn't square, hence SVD over EVD).
```{r}
yout <- svd(yhat)
```
Vegan performs the following normalizations, so we'll do the same to compare our results:
```{r}
m3.man <- list()
m3.man$eig <- yout$d[1]^2
m3.man$u <- yout$u[,1]*(1/sqrt(pi))
m3.man$v <- yout$v[,1]*(1/sqrt(pj))
m3.man$wa <- (Qbar %*% yout$v[,1])*(1/sqrt(pi))*(1/yout$d[1])
```
Now, let's compare our results to vegan's:
```{r}
m3.veg <- cca(counts ~ depth)
cbind(m3.man$eig,m3.veg$CCA$eig)
head(cbind(m3.man$u, m3.veg$CCA$u))
head(cbind(m3.man$v, m3.veg$CCA$v))
head(cbind(m3.man$wa, m3.veg$CCA$wa))
```
And that's the math behind CCA. Because we only used one explanatory variable, we only have one component to work with. Vegan, by default, plots this component (CCA1) against the first (normalized) component from CA. Because of the normalization procedure, our CA1 from earlier will not match vegan's, so it's probably better to go ahead and add our second explanatory variable. We could have manually calculated the components for the full model, but because zone is categorical, we'd have to work with dummy variables, which can make it seem far more confusing than it needs to be.
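For the curious, here is roughly what the expanded design matrix looks like; the categorical Zone becomes dummy (indicator) columns, which is why the manual calculation gets messier:
```{r}
head(model.matrix(~ Depth * Zone)) ## first few rows of the expanded design matrix
```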
Let's build our full model, run it in vegan, and plot the results. We'll use an interaction term like our RDA model; hence, the overall regression model will be the same:
$$
\hat{\text{Counts}}_{\cdot j} = \beta_{1}\text{Depth} + \beta_{2}\text{Zone} + \beta_{3}\,\text{Depth}\times\text{Zone}
$$
```{r}
gn.cca <- cca(counts ~ Depth*Zone)
```
We can perform the same analyses as we did for RDA. For example, here is the permutation test against the null model:
```{r}
anova(gn.cca)
```
And the model output:
```{r}
print(gn.cca)
```
The constrained variance is much greater than unconstrained, implying our independent variables did a good job of accounting for the variation. Let's plot the figure:
```{r}
plot(gn.cca,scaling=2,type="n")
points(gn.cca,"sp",col="darkgreen",pch=3)
text(gn.cca,display="sites",cex=1)
text(gn.cca,display="cn",cex=.8,col="blue")
```
And there is the amazing arch effect. Per the R documentation for cca(), the arch effect should be corrected by the model *unless* the constraints (explanatory variables) are curvilinear with one another. Considering our two constraints, depth and zone, likely explain the same information (the variation in OTU abundance with respect to depth into the system), a curvilinear relationship should not be surprising. Ultimately, the poor fit to our data boils down to a lack of adequate constraints.
Here are the RDA and CCA models plotted side by side:
```{r}
par(mfrow=c(1,2))
plot(gn.rda,scaling=2,type="n")
points(gn.rda,"sp",col="darkgreen",pch=19)
text(gn.rda,display="sites",cex=.6)
text(gn.rda,display="cn",cex=.8,col="blue")
plot(gn.cca,scaling=2,type="n")
points(gn.cca,"sp",col="darkgreen",pch=3)
text(gn.cca,display="sites",cex=1)
text(gn.cca,display="cn",cex=.8,col="blue")
```
Although we didn't introduce the arch effect until after our RDA run, it's quite clear that both models suffer from this artifact.
### PCoA
This last section will quickly tie the above analyses to an unsupervised ordination method you may have heard of: Principal **Coordinate** Analysis (PCoA, cf. PCA). You can think of PCoA as a more general ordination method that can involve various distance metrics. If one uses Euclidean distance, the ordination is equivalent to PCA; on the other hand, if the distance metric is the Chi-squared distance, the ordination is essentially the same as CA.
Let's compare. Note that we'll use a smaller matrix to speed up the computation.
```{r}
counts <- t(counts)
pcoa.counts <- counts[1:500,]
pcoa.euc <- pco(dist(pcoa.counts,"euclidean"))
pcoa.pca <- prcomp(pcoa.counts)
par(mfrow=c(1,2))
plot(pcoa.euc$points[,1],pcoa.euc$points[,2],
pch=3,xlab="PC1",ylab="PC2",main="PCoA: Euclidean")
plot(pcoa.pca$x[,1],pcoa.pca$x[,2],
pch=3,xlab="PC1",ylab="PC2",main="PCA")
pcoa.chi <- pco(dsvdis(pcoa.counts,index="chisq"))
pcoa.ca <- scores(cca(pcoa.counts),scaling=1)
plot(pcoa.chi$points[,1],pcoa.chi$points[,2],
pch=3,xlab="PC1",ylab="PC2",main="PCoA: Chi-Squared",
xlim=range(-2,2),ylim=range(-2,2))
plot(-pcoa.ca$sites[,1],-pcoa.ca$sites[,2],
pch=3,xlab="PC1",ylab="PC2",main="CA",
xlim=range(-2,2),ylim=range(-2,2))
```
There are many other distance metrics that can be used; a popular one is Bray-Curtis.
```{r}
pcoa.bc <- pco(dsvdis(pcoa.counts,index="bray/curtis"))
plot(pcoa.bc$points[,1],pcoa.bc$points[,2],
pch=3,xlab="PC1",ylab="PC2",main="PCoA: Bray-Curtis",
xlim=range(-2,2),ylim=range(-2,2))
```