The key to multivariate data analysis is understanding the patterns. Ordination approaches provide a useful means of examining patterns in high-dimensional dataspaces.
Principal components analysis (PCA) is the work-horse of ordination. There are numerous implementations in R. The base function prcomp is quite useful:
library(RRPP)
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7
bumpus<-read.csv("data/bumpus.csv",header=T)
Y<-bumpus[,5:12]
Y <- scale(Y, scale = FALSE) #center data
gp.bumpus <- as.factor(bumpus$sex)
pca.bumpus<-prcomp(Y)
summary(pca.bumpus)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 6.2796 2.3278 1.22784 0.94189 0.50071 0.35513 0.29091
## Proportion of Variance 0.8256 0.1134 0.03156 0.01857 0.00525 0.00264 0.00177
## Cumulative Proportion 0.8256 0.9390 0.97057 0.98914 0.99439 0.99703 0.99880
## PC8
## Standard deviation 0.2390
## Proportion of Variance 0.0012
## Cumulative Proportion 1.0000
Using summary on the prcomp object returns a table of the eigenvalues, and percent variation described by each PC axis. For this example, the first 3 PC axes describe 76% of the variation. Note that the last 4 dimensions have precisely 0% variation, meaning that there were redundancies in this data (standard parametric-based significance testing in MANOVA would not work, as the covariance matrix would be singular).
One can visualize this using a scree plot:
library(vegan)
screeplot(pca.bumpus,bstick = TRUE)
A better understanding of the PC axes is found by examining the loadings of the original variables on each axis.
pca.bumpus$rotation[,1] #1st PC axis
## TL AE WT BHL HL FL TTL
## 0.46642125 0.86020181 0.14948852 0.06078853 0.06403447 0.05835265 0.09008610
## SW
## 0.02900088
Variables with loadings close to zero are relatively unimportant for that PC axis, while loadings closer to +1 and -1 are more important.
Finally, here is the PCA plot.
PC.scores<-pca.bumpus$x
plot(PC.scores,xlab="PC I", ylab="PC II",asp=1,pch=21,bg=gp.bumpus,cex = 1.5)
legend("topright", levels(gp.bumpus), pch = 21,pt.bg=1:2)
Interpreting ordination plots requires an understanding of what went into the plot, and what is preserved. For PCA, distances and directions among objects are preserved. Thus the approach has only rotated the data using a rigid-rotation. Further, if Euclidean data (or a metric distance) was used as input, then one may interpret distances and directions in the resulting ordination plot.
Remember however, that the plot still represents a projection into a lower-dimensional space, so there may be additional pattern in the data that is not well-represented. For instance, large distances between specimens are a better representation of the true distance between those specimens than are smaller distances in the plot. The reason is that two specimens close together in the PC1 - PC2 plot may still be separated along PC3, etc. This is the case for all ordination approaches.
To see this, one can do a plot of the true distances between objects in the full dataspace versus the distances as represented in the low-dimensional space:
plot(dist(Y),dist(PC.scores[,1:2]))
cor(dist(Y),dist(PC.scores[,1:2]))
## [1] 0.9949183
Here one can see greater ‘spread’ for the smaller distances (we have a teardrop-shaped distribution). Note also that the correlation of distances in the 2D-PC space with the full space is 0.99. Thus, very little information is ‘lost’ when viewing this 2D projection.
Sometimes it is useful to superimpose vectors for the variables in the PCA space. This is accomplished using a bi-plot (interpretation of the bi-plot was discussed in lecture).
biplot(pca.bumpus)
PCoA, or metric (classic) multidimensional scaling provides a Euclidean ordination beginning with a distance matrix. The original approach was found in Gower (1966), who showed that for multivariate normal data, PCA from the covariance matrix, and PCoA from the Euclidean distance matrix, were equivalent. However, the advantage of PCoA is that one may use other distance matrices as input.
As the name implies, PCoA is metric, meaning distances and directions are preserved during the approach. Thus, if one has supplied a metric distance matrix, one is free to interpret distances and directions in the resulting ordination plot.
bumpus.dist<-dist(Y)
PCoA<-cmdscale(bumpus.dist) #from vegan
plot(-1*PCoA[,1], PCoA[,2],pch=21,bg=gp.bumpus,cex=1.5,asp=1)
legend("topright", levels(gp.bumpus), pch = 21,pt.bg=1:2)
Note that for this example, the PCoA plot is identical to that of PCA (with the exception of reflections of the axes, which is arbitrary).
Another approach is to use NMDS. Here, the rank-order of distances are preserved, rather than the distances themselves. Because of this, one should take great caution to NOT overinterpret patterns in these plots (distances and directions are not preserved).
In other words, NMDS does not preserve distances and directions (even if a metric distance measure is used), so such patterns should be interpreted with extreme caution.
bumpus.nmds <- metaMDS(dist(Y), autotransform=FALSE, k=2)
## Run 0 stress 0.04172717
## Run 1 stress 0.07254343
## Run 2 stress 0.07254063
## Run 3 stress 0.07049132
## Run 4 stress 0.0417409
## ... Procrustes: rmse 0.0006457776 max resid 0.006367065
## ... Similar to previous best
## Run 5 stress 0.06096943
## Run 6 stress 0.05751043
## Run 7 stress 0.04174096
## ... Procrustes: rmse 0.0006466476 max resid 0.006355191
## ... Similar to previous best
## Run 8 stress 0.04174089
## ... Procrustes: rmse 0.0006455159 max resid 0.006379368
## ... Similar to previous best
## Run 9 stress 0.0725406
## Run 10 stress 0.04172719
## ... Procrustes: rmse 3.023738e-05 max resid 0.0003046057
## ... Similar to previous best
## Run 11 stress 0.05989338
## Run 12 stress 0.0609695
## Run 13 stress 0.05751035
## Run 14 stress 0.4145329
## Run 15 stress 0.05988179
## Run 16 stress 0.4145309
## Run 17 stress 0.07401851
## Run 18 stress 0.04172727
## ... Procrustes: rmse 5.266975e-05 max resid 0.0005699457
## ... Similar to previous best
## Run 19 stress 0.07401094
## Run 20 stress 0.04174089
## ... Procrustes: rmse 0.000646461 max resid 0.006411793
## ... Similar to previous best
## *** Solution reached
plot(bumpus.nmds$points, asp=1,pch=21, bg=gp.bumpus, cex=1.5,xlab="NMDS1", ylab="NMDS2")
legend("topright", levels(gp.bumpus), pch = 21,pt.bg=1:2)
Note that here as well, one can input non-Euclidean distances.
data(dune) #from vegan
dune
## Achimill Agrostol Airaprae Alopgeni Anthodor Bellpere Bromhord Chenalbu
## 1 1 0 0 0 0 0 0 0
## 2 3 0 0 2 0 3 4 0
## 3 0 4 0 7 0 2 0 0
## 4 0 8 0 2 0 2 3 0
## 5 2 0 0 0 4 2 2 0
## 6 2 0 0 0 3 0 0 0
## 7 2 0 0 0 2 0 2 0
## 8 0 4 0 5 0 0 0 0
## 9 0 3 0 3 0 0 0 0
## 10 4 0 0 0 4 2 4 0
## 11 0 0 0 0 0 0 0 0
## 12 0 4 0 8 0 0 0 0
## 13 0 5 0 5 0 0 0 1
## 14 0 4 0 0 0 0 0 0
## 15 0 4 0 0 0 0 0 0
## 16 0 7 0 4 0 0 0 0
## 17 2 0 2 0 4 0 0 0
## 18 0 0 0 0 0 2 0 0
## 19 0 0 3 0 4 0 0 0
## 20 0 5 0 0 0 0 0 0
## Cirsarve Comapalu Eleopalu Elymrepe Empenigr Hyporadi Juncarti Juncbufo
## 1 0 0 0 4 0 0 0 0
## 2 0 0 0 4 0 0 0 0
## 3 0 0 0 4 0 0 0 0
## 4 2 0 0 4 0 0 0 0
## 5 0 0 0 4 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 2
## 8 0 0 4 0 0 0 4 0
## 9 0 0 0 6 0 0 4 4
## 10 0 0 0 0 0 0 0 0
## 11 0 0 0 0 0 2 0 0
## 12 0 0 0 0 0 0 0 4
## 13 0 0 0 0 0 0 0 3
## 14 0 2 4 0 0 0 0 0
## 15 0 2 5 0 0 0 3 0
## 16 0 0 8 0 0 0 3 0
## 17 0 0 0 0 0 2 0 0
## 18 0 0 0 0 0 0 0 0
## 19 0 0 0 0 2 5 0 0
## 20 0 0 4 0 0 0 4 0
## Lolipere Planlanc Poaprat Poatriv Ranuflam Rumeacet Sagiproc Salirepe
## 1 7 0 4 2 0 0 0 0
## 2 5 0 4 7 0 0 0 0
## 3 6 0 5 6 0 0 0 0
## 4 5 0 4 5 0 0 5 0
## 5 2 5 2 6 0 5 0 0
## 6 6 5 3 4 0 6 0 0
## 7 6 5 4 5 0 3 0 0
## 8 4 0 4 4 2 0 2 0
## 9 2 0 4 5 0 2 2 0
## 10 6 3 4 4 0 0 0 0
## 11 7 3 4 0 0 0 2 0
## 12 0 0 0 4 0 2 4 0
## 13 0 0 2 9 2 0 2 0
## 14 0 0 0 0 2 0 0 0
## 15 0 0 0 0 2 0 0 0
## 16 0 0 0 2 2 0 0 0
## 17 0 2 1 0 0 0 0 0
## 18 2 3 3 0 0 0 0 3
## 19 0 0 0 0 0 0 3 3
## 20 0 0 0 0 4 0 0 5
## Scorautu Trifprat Trifrepe Vicilath Bracruta Callcusp
## 1 0 0 0 0 0 0
## 2 5 0 5 0 0 0
## 3 2 0 2 0 2 0
## 4 2 0 1 0 2 0
## 5 3 2 2 0 2 0
## 6 3 5 5 0 6 0
## 7 3 2 2 0 2 0
## 8 3 0 2 0 2 0
## 9 2 0 3 0 2 0
## 10 3 0 6 1 2 0
## 11 5 0 3 2 4 0
## 12 2 0 3 0 4 0
## 13 2 0 2 0 0 0
## 14 2 0 6 0 0 4
## 15 2 0 1 0 4 0
## 16 0 0 0 0 4 3
## 17 2 0 0 0 0 0
## 18 5 0 2 1 6 0
## 19 6 0 2 0 3 0
## 20 2 0 0 0 4 3
dune.dist<-vegdist(dune) #default = Bray-Curtis distance
dune.nmds <- metaMDS(dune.dist, autotransform=FALSE, k=2)
## Run 0 stress 0.1192678
## Run 1 stress 0.1192678
## ... Procrustes: rmse 8.590431e-05 max resid 0.0002645623
## ... Similar to previous best
## Run 2 stress 0.1192679
## ... Procrustes: rmse 9.50491e-05 max resid 0.0002919219
## ... Similar to previous best
## Run 3 stress 0.2838977
## Run 4 stress 0.1812932
## Run 5 stress 0.1183186
## ... New best solution
## ... Procrustes: rmse 0.02027188 max resid 0.06497132
## Run 6 stress 0.119268
## Run 7 stress 0.1183186
## ... New best solution
## ... Procrustes: rmse 2.238058e-05 max resid 7.094045e-05
## ... Similar to previous best
## Run 8 stress 0.1183186
## ... Procrustes: rmse 9.14787e-06 max resid 2.951255e-05
## ... Similar to previous best
## Run 9 stress 0.1192678
## Run 10 stress 0.1183186
## ... Procrustes: rmse 2.298179e-05 max resid 7.13856e-05
## ... Similar to previous best
## Run 11 stress 0.1192678
## Run 12 stress 0.1192678
## Run 13 stress 0.1192678
## Run 14 stress 0.1183186
## ... New best solution
## ... Procrustes: rmse 8.015474e-06 max resid 2.507029e-05
## ... Similar to previous best
## Run 15 stress 0.1939202
## Run 16 stress 0.1183186
## ... Procrustes: rmse 2.313646e-05 max resid 7.617391e-05
## ... Similar to previous best
## Run 17 stress 0.1192679
## Run 18 stress 0.1183186
## ... Procrustes: rmse 1.929154e-06 max resid 6.041611e-06
## ... Similar to previous best
## Run 19 stress 0.1192678
## Run 20 stress 0.1192678
## *** Solution reached
plot(dune.nmds)
## species scores not available
Correspondence Analysis (CA) is another ordination approach for count data.
dune.cca<-cca(dune) #from vegan
plot(dune.cca)