Motivation

Today we discuss methods of multivariate association, as well as canonical ordination techniques. Multivariate association methods are generalizations of bivariate correlation, but where Y1 and Y2 are matrices. There are several general approaches.

1: Multivariate Association Methods

1a: Mantel tests

Mantel tests evaluate the association between distance matrices. The procedure is straightforward: obtain the appropriate distance matrix for each multivariate dataset, and calculate the pairwise Mantel coefficient from them. This may be accomplished in the R-package vegan.

First we’ll load some data and libraries:

library(geomorph)
## Loading required package: RRPP
## Loading required package: rgl
## Loading required package: Matrix
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7
library(RRPP)

data(pupfish)
p1<-c(4,10:17, 39:56) #variables [landmarks] in Y1
  Y1<-two.d.array(pupfish$coords[p1,,]) # head data as 2D matrix
  Y2<-two.d.array(pupfish$coords[-p1,,]) # body data as 2D matrix
  Group<-as.factor(paste(pupfish$Pop,pupfish$Sex)) 
cols <- rep(1, 56)
cols[p1] <- 2
plotAllSpecimens(pupfish$coords, mean = FALSE, plot.param=list(pt.bg = cols)) #to show the data

Now let’s run Mantel tests: first a pairwise and then a three-way Mantel test.

mantel(dist(Y1),dist(Y2)) 
## 
## Mantel statistic based on Pearson's product-moment correlation 
## 
## Call:
## mantel(xdis = dist(Y1), ydis = dist(Y2)) 
## 
## Mantel statistic r: 0.5683 
##       Significance: 0.001 
## 
## Upper quantiles of permutations (null model):
##    90%    95%  97.5%    99% 
## 0.0634 0.0827 0.0990 0.1209 
## Permutation: free
## Number of permutations: 999
 plot(dist(Y1),dist(Y2))

mantel.partial(dist(Y1),dist(Y2),dist(pupfish$CS))  #3-way Mantel
## 
## Partial Mantel statistic based on Pearson's product-moment correlation 
## 
## Call:
## mantel.partial(xdis = dist(Y1), ydis = dist(Y2), zdis = dist(pupfish$CS)) 
## 
## Mantel statistic r: 0.5777 
##       Significance: 0.001 
## 
## Upper quantiles of permutations (null model):
##    90%    95%  97.5%    99% 
## 0.0734 0.0877 0.1046 0.1178 
## Permutation: free
## Number of permutations: 999

As discussed in class, Mantel tests are straightforward and easy to use. However, they tend to have elevated type I error rates, low power, and can display significant bias (see references in lecture). Thus, alternative multivariate association methods are preferred.

1b: The RV Coefficient

One alternative is the RV coefficient, which characterizes the strength of association between two sets of continuous variables. The measure ranges from 0 to 1, with higher values indicative of a greater degree of covariation between the two datasets. The function below calculates the RV coefficient and assesses its significance using permutation, where rows of one dataset are permuted relative to the other.

RV<-function(x,y,iter=999){
  n<-nrow(x)
  x<-scale(x,scale=FALSE); y<-scale(y,scale=FALSE)
  ind <- c(list(1:n), (Map(function(x) sample.int(n, n), 1:iter)))
  y.rand <- lapply(1:(iter+1), function(i) y[ind[[i]], ])
  S11<-crossprod(x, x)/(n - 1)
  S22.r<-lapply(1:(iter+1), function(i) crossprod(y.rand[[i]], y.rand[[i]])/(n - 1))
  S12.r<-lapply(1:(iter+1), function(i) crossprod(x, y.rand[[i]])/(n - 1))
  RV.r<-unlist(lapply(1:(iter+1), function(i) sum(diag(S12.r[[i]]%*%t(S12.r[[i]]))) /
                 sqrt(sum(diag(S11%*%S11)) * sum(diag(S22.r[[i]]%*%S22.r[[i]])))))
  p.val<- 1-(rank(RV.r)[1])/(iter+1)
  p.val<-ifelse(p.val==0,1/(iter+1),p.val)
  return(list(RV=RV.r[[1]],pvalue=p.val))
}

RV(Y1,Y2)
## $RV
## [1] 0.608408
## 
## $pvalue
## [1] 0.001
sqrt(RV(Y1,Y2)$RV)  #closer to a correlation
## [1] 0.7800051

1c: Two-Block Partial Least Squares (PLS)

Another option is partial least squares, which characterizes the maximal covariation between two sets of variables. This is found from a singular-value decomposition (SVD) of the cross-covariance matrix: S12. As above significance using permutation, where rows of one dataset are permuted relative to the other.

pls.res<-two.b.pls(Y1,Y2,print.progress = FALSE)
## Data in either A1 or A2 do not have names.  It is assumed data in both A1 and A2 are ordered the same.
summary(pls.res)
## 
## Call:
## two.b.pls(A1 = Y1, A2 = Y2, print.progress = FALSE) 
## 
## 
## 
## r-PLS: 0.917
## 
## Effect Size (Z): 5.4019
## 
## P-value: 0.001
## 
## Based on 1000 random permutations
plot(pls.res)

2: Canonical Ordination Methods

Canonical ordinations are ordinations that take into account some linear model (Y~X). Thus, they do not generate ordinations of the actual dataspace, but rather provide a representation of that space under the model being examined. Visually, they are often very appealing; however it must be remembered that these represent a distorted (or stylized) view of the dataspace! Thus, they should never be used in place of PCA or PCoA.

2a: Canonical Variates Analysis

CVA (equivalent to Discriminant Analysis), provides an ordination of between-group variation relative to within-group variation. The plot maximally separates groups relative to within-group variability. When plotted properly, within-group scatter should be approximately circular, as this is standardized during the analysis.

Here is a simple example with 4 groups. First, the actual dataspace is displayed (using PCA) followed by the CVA representation.

mydata<-read.csv("data/Lab-10.Pupfish.csv",header=T)
species<-as.factor(mydata[,1]); sex<-as.factor(mydata[,2])
SL<-(mydata[,3]); Y<-as.matrix(mydata[,-(1:3)]) 
Group<-as.factor(paste(species,sex))  

Y<-prcomp(Y)$x  
 col.gp<-rep("green",nrow(Y));   col.gp[which(species== 'FW')]<-"red"
 shape.gp<-rep(21,nrow(Y));   shape.gp[which(sex== 'M')]<-22
 rdf <- rrpp.data.frame(Y=Y, SL=SL, sex=sex, species=species)
 
plot(Y,pch=shape.gp,bg=col.gp,asp=1,cex=1.5)

## CVA
library(MASS)
lda.pupfish<-lda(Y,Group)
cva.pupfish<-predict(lda.pupfish,Y)
cv.scores<-cva.pupfish$x

plot(cv.scores,pch=shape.gp,bg=col.gp,asp=1,cex=1.5)

CVA: Caveats

A major caveat with CVA is that both distances and directions in the CVA space do not represent the actual distances in the multivariate dataspace, because variation has been re-represented in a relative form (between relative to within). Thus, groups are generally more separated in CVA plots than they are in the actual datspace. Also, it can be the case that some groups are displayed further from one another, while others closer than they are (see example in lecture).

library(mvtnorm) 
corr.val<-0.7
groups<-gl(3,50)
a<- rmvnorm(n=50,mean=c(-3.4,0),sigma=matrix(c(1,corr.val,corr.val,1),2,2))
b<- rmvnorm(n=50,mean=c(3.4,0),sigma=matrix(c(1,corr.val,corr.val,1),2,2))
c<- rmvnorm(n=50,mean=c(0,6),sigma=matrix(c(1,corr.val,corr.val,1),2,2))
orig.data<-rbind(a,b,c)
col.gp.r<-rep("black",nrow(orig.data)); col.gp.r[which(groups== '2')]<-"red"; col.gp.r[which(groups== '3')]<-"green"

plot(orig.data,pch=21,bg=col.gp.r,asp=1,cex=1.5)
ordiellipse(orig.data, groups,conf=0.95)

plot(predict(lda(orig.data,groups))$x,pch=21,bg=col.gp.r,asp=1,cex=1.5)
ordiellipse(predict(lda(orig.data,groups))$x, groups,conf=0.95)

Finally, CVA can separate groups that are actually completely overlapping in the multivariate space. This last issue is shown below.

data.rand<-matrix(rnorm(150*150),ncol=150)
plot(data.rand,pch=21,bg=col.gp.r,asp=1,cex=1.5)

plot(predict(lda(data.rand[,1:150],groups))$x,pch=21,bg=col.gp.r,asp=1,cex=1.5)
## Warning in lda.default(x, grouping, ...): variables are collinear

2b: Redundancy Analysis

RDA provides an ordination of predicted values from a linear model. Again it is a stylized representation of the actual multivariate dataspace.

plot(Y,pch=shape.gp,bg=col.gp,asp=1,cex=1.5)

pupfish.rda<-rda(Y~SL+species+sex+species:sex)
rda.scores<-predict(pupfish.rda)
plot(rda.scores,pch=shape.gp,bg=col.gp,asp=1,cex=1.5,xlab="RDA 1", ylab="RDA 2")

Redundancy Analysis: Caveats

A major caveat with RDA is if an incorrect model utilized, the resulting ordination is misleading. For the data above, it turns out there is a significant SL:species interaction. This is easily seen in the PC space and MANOVA, so now run the RDA with this interaction:

anova(lm.rrpp(Y~SL*species*sex, data=rdf, print.progress = FALSE))$table
##                Df     SS     MS     Rsq         F       Z Pr(>F)   
## SL              1 632.53 632.53 0.91317 1326.4627  9.3185  0.001 **
## species         1  20.03  20.03 0.02892   42.0070  4.8100  0.001 **
## sex             1   7.33   7.33 0.01059   15.3816  4.2227  0.001 **
## SL:species      1   6.18   6.18 0.00892   12.9539  3.6882  0.001 **
## SL:sex          1   0.65   0.65 0.00094    1.3679  0.6042  0.282   
## species:sex     1   0.35   0.35 0.00050    0.7318 -0.0435  0.516   
## SL:species:sex  1   0.81   0.81 0.00117    1.6941  0.9719  0.160   
## Residuals      52  24.80   0.48 0.03580                            
## Total          59 692.68                                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(Y,pch=shape.gp,bg=col.gp,asp=1,cex=1.5)

plot(predict(rda(Y~SL*species*sex)),pch=shape.gp,
     bg=col.gp,asp=1,cex=1.5,xlab="RDA 1", ylab="RDA 2")

Finally, here is the RDA when only groups are used:

plot(predict(rda(Y~species*sex)),pch=shape.gp,
     bg=col.gp,asp=1,cex=1.5,xlab="RDA 1", ylab="RDA 2")

Obviously, what one uses as X in RDA matters a great deal!