Today we explore three ways to evaluate alternative explanatory models: likelihood ratio tests (LRT), information criteria (AIC), and cross-validation. All three methods work with both univariate and multivariate data.
Let’s first read in some data and prepare various models for investigation:
# Read data from Smith and Collyer 2008
snake<-read.csv("data/Lab-11-snake.csv",header=T)
site<-as.factor(snake[,1]);region<-as.factor(snake[,2]);sex<-as.factor(snake[,3])
svl<-snake[,4]; hs<-snake[,5] # hs= head size, calculated as centroid size from lm
Y<-as.matrix(snake[,-(1:5)])
##Setup: Univariate and multivariate models
#Univariate
hs.svl<-lm(hs~svl, model=T,x=T)
hs.sex<-lm(hs~sex, model=T,x=T)
hs.reg<-lm(hs~region, model=T,x=T)
hs.svl.reg<-lm(hs~svl + region, model=T,x=T)
hs.svl.by.reg<-lm(hs~svl*region, model=T,x=T)
hs.svl.sex<-lm(hs~svl + sex, model=T,x=T)
hs.svl.by.sex<-lm(hs~svl*sex, model=T,x=T)
hs.ancova<-lm(hs~svl + sex*region, model=T,x=T)
hs.full<-lm(hs~svl*sex*region, model=T,x=T)
#multivariate
Y.svl<-lm(Y~svl, model=T,x=T)
Y.sex<-lm(Y~sex, model=T,x=T)
Y.reg<-lm(Y~region, model=T,x=T)
Y.svl.reg<-lm(Y~svl+region, model=T,x=T)
Y.svl.by.reg<-lm(Y~svl*region, model=T,x=T)
Y.svl.sex<-lm(Y~svl + sex, model=T,x=T)
Y.svl.by.sex<-lm(Y~svl*sex, model=T,x=T)
Y.mancova<-lm(Y~svl + sex*region, model=T,x=T)
Y.full<-lm(Y~svl*sex*region, model=T,x=T)
LRT are useful for comparing nested models. This approach is familiar: we have used it throughout the semester when examining linear models, which by definition compare null and alternative models. Such comparisons may be implemented using ‘anova’.
Now let’s compare some models. First, we will run a factorial model, which (with type I SS) is a series of sequential LRTs.
anova(hs.full)
## Analysis of Variance Table
##
## Response: hs
## Df Sum Sq Mean Sq F value Pr(>F)
## svl 1 293048 293048 9.4015 0.002823 **
## sex 1 5224 5224 0.1676 0.683177
## region 2 209225 104612 3.3562 0.039051 *
## svl:sex 1 887 887 0.0285 0.866375
## svl:region 2 73702 36851 1.1822 0.311059
## sex:region 2 28115 14058 0.4510 0.638354
## svl:sex:region 2 33126 16563 0.5314 0.589537
## Residuals 95 2961177 31170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, we interpret the significance of higher-order effects relative to the main effects already present in the model. To see more clearly that this is in fact a series of LRTs, try the following:
anova(hs.svl.reg)
## Analysis of Variance Table
##
## Response: hs
## Df Sum Sq Mean Sq F value Pr(>F)
## svl 1 293048 293048 9.7245 0.002358 **
## region 2 207559 103779 3.4438 0.035666 *
## Residuals 103 3103897 30135
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(hs.svl,hs.svl.reg)
## Analysis of Variance Table
##
## Model 1: hs ~ svl
## Model 2: hs ~ svl + region
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 105 3311456
## 2 103 3103897 2 207559 3.4438 0.03567 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that the SS, F, and P-value for region in the sequential ANOVA table are identical to those from the direct comparison of the two models. Thus, for nested models, evaluating effects in a model is accomplished using LRT.
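The likelihood connection can be made explicit. For the nested pair just compared, the LR statistic is -2(logL.reduced - logL.full), which is asymptotically chi-squared with df equal to the difference in parameters (a quick sketch using R’s logLik):

```r
#LR statistic for hs~svl vs. hs~svl+region (asymptotically chi-squared, df=2)
LRstat<- -2*(as.numeric(logLik(hs.svl))-as.numeric(logLik(hs.svl.reg)))
LRstat
pchisq(LRstat,df=2,lower.tail=FALSE) #asymptotic P-value; the F-test above is the small-sample version
```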
The approach works identically for multivariate data:
anova(Y.full)
## Analysis of Variance Table
##
## Df Pillai approx F num Df den Df Pr(>F)
## (Intercept) 1 0.00000 0.0000 15 81 1.0000000
## svl 1 0.39072 3.4629 15 81 0.0001482 ***
## sex 1 0.21061 1.4407 15 81 0.1487897
## region 2 0.57957 2.2305 30 164 0.0007565 ***
## svl:sex 1 0.11365 0.6924 15 81 0.7847519
## svl:region 2 0.34241 1.1292 30 164 0.3077880
## sex:region 2 0.33155 1.0863 30 164 0.3589074
## svl:sex:region 2 0.32038 1.0428 30 164 0.4152491
## Residuals 95
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Y.svl.reg,Y.svl)
## Analysis of Variance Table
##
## Model 1: Y ~ svl + region
## Model 2: Y ~ svl
## Res.Df Df Gen.var. Pillai approx F num Df den Df Pr(>F)
## 1 103 7.3356e-05
## 2 105 2 7.5192e-05 0.55423 2.3001 30 180 0.00042 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Y.svl.reg,Y.reg)
## Analysis of Variance Table
##
## Model 1: Y ~ svl + region
## Model 2: Y ~ region
## Res.Df Df Gen.var. Pillai approx F num Df den Df Pr(>F)
## 1 103 7.3356e-05
## 2 104 1 7.4736e-05 0.34579 3.1361 15 89 0.0003966 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Y.mancova,Y.svl.reg)
## Analysis of Variance Table
##
## Model 1: Y ~ svl + sex * region
## Model 2: Y ~ svl + region
## Res.Df Df Gen.var. Pillai approx F num Df den Df Pr(>F)
## 1 100 7.3000e-05
## 2 103 3 7.3356e-05 0.4598 1.0619 45 264 0.375
A second useful way of comparing models is with AIC (or related) values. These methods penalize the log-likelihood of the data given the model (logL) by the number of parameters in the model, thereby favoring the best description of the data achieved with the fewest parameters. For the univariate models above:
aic.summary<-AIC(hs.svl,hs.sex,hs.reg,hs.svl.reg,hs.svl.by.reg,hs.svl.sex,hs.svl.by.sex,
hs.ancova,hs.full)
aic.summary #smallest AIC is preferred model
## df AIC
## hs.svl 3 1416.040
## hs.sex 3 1425.040
## hs.reg 4 1420.558
## hs.svl.reg 5 1413.114
## hs.svl.by.reg 7 1414.552
## hs.svl.sex 4 1417.871
## hs.svl.by.sex 5 1419.869
## hs.ancova 8 1417.028
## hs.full 13 1424.078
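These values are simply -2logL + 2(df), which can be confirmed for any one model directly from its log-likelihood:

```r
#confirm AIC = -2*logL + 2*df for one model
ll<-logLik(hs.svl)
-2*as.numeric(ll)+2*attr(ll,"df") #identical to AIC(hs.svl)
```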
Here, hs.svl.reg (hs~svl+region) is found to be the best model. Models within 2 (or 4) AIC units of this (i.e., deltaAIC < 2 or 4) are considered reasonably close to the best model.
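The deltaAIC values can be obtained directly from the table above (a quick sketch using the aic.summary object already created):

```r
delta<-aic.summary$AIC-min(aic.summary$AIC) #deltaAIC relative to the best model
names(delta)<-rownames(aic.summary)
sort(delta) #hs.svl.by.reg falls within 2 units of hs.svl.reg
```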
In cases where multiple candidate models fall within 2 (or 4) AIC units of the best model, model averaging can be considered. This is accomplished by obtaining AIC weights for each model and averaging the weighted parameters across the candidate models to obtain a weighted (average) model.
c1<-exp(-.5*0)
c2<-exp(-.5*(AIC(hs.svl.by.reg)-AIC(hs.svl.reg)))
w1<-c1/(c1+c2) #AIC weights
w2<-c2/(c1+c2)
w1
## [1] 0.6723665
w2
## [1] 0.3276335
beta1<-w1*(hs.svl.reg$coef)
beta2<-w2*(hs.svl.by.reg$coef)
beta.avg<-c((beta1[1]+beta2[1]),(beta1[2]+beta2[2]),(beta1[3]+beta2[3]),
(beta1[4]+beta2[4]),beta2[5],beta2[6])
beta.avg #model averaged coefficients
## (Intercept) svl regionEast regionWest svl:regionEast
## 432.698370 3.701737 -13.601998 -14.739733 -0.570554
## svl:regionWest
## -1.128373
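The two weights above are a special case of the general Akaike weight formula, wi = exp(-deltai/2) / sum(exp(-deltaj/2)). Applied to the full candidate set (a sketch using aic.summary from above):

```r
delta<-aic.summary$AIC-min(aic.summary$AIC)
w<-exp(-0.5*delta)/sum(exp(-0.5*delta)) #Akaike weights; they sum to 1
names(w)<-rownames(aic.summary)
round(w,3)
```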
NOTE: the average model is NOT guaranteed to be a better model. In fact, for the example above, it is a worse model than either of the models used to obtain it.
#FIND AIC for averaged model (need likelihood and its parameters)
Terms<-terms(formula(hs.svl.by.reg)) #set up X matrix
X<-model.matrix(Terms)
resid<-hs-X%*%beta.avg
E<-t(resid)%*%resid #Note: univariate, so this IS the det(E)
n<-nrow(resid); p<-1 #NOTE: p<-ncol(resid) for multivariate
k<-hs.svl.by.reg$rank # same as df in "AIC" except "AIC" adds one for variance
LLik<- (-2*((-.5*n*(log(E/n^p)+p))-(0.5*(n*log(2*pi)))))
pen<-2*(p*k+0.5*p*(p+1))
AIC.mdl.avg<-LLik+pen
AIC(hs.svl.by.reg)
## [1] 1414.552
AIC(hs.svl.reg)
## [1] 1413.114
AIC.mdl.avg #NOTE: AIC of average model is WORSE than the 2 input models
## [,1]
## [1,] 1415.718
NOTE: because these are nested models, we can also use LRT to evaluate them:
anova(hs.svl.by.reg)
## Analysis of Variance Table
##
## Response: hs
## Df Sum Sq Mean Sq F value Pr(>F)
## svl 1 293048 293048 9.7668 0.00232 **
## region 2 207559 103779 3.4588 0.03524 *
## svl:region 2 73442 36721 1.2238 0.29842
## Residuals 101 3030455 30005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(hs.svl.reg,hs.svl.by.reg)
## Analysis of Variance Table
##
## Model 1: hs ~ svl + region
## Model 2: hs ~ svl * region
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 103 3103897
## 2 101 3030455 2 73442 1.2238 0.2984
The LRT demonstrates that the svl:region interaction should not be in the model; thus, despite its ‘close’ AIC, model averaging should not be used here.
One can compute multivariate AIC, but it requires a special function (the AIC function in R is univariate only):
maic<-function(x,...){ # each x is a fitted lm model
y<-resid(x)
E<-t(y)%*%y # error SSCP matrix
d<-det(E)
if(length(E)>1) n<-nrow(y) else n<-length(y)
if(length(E)>1) p<-ncol(y) else p<-1
k<-x$rank # same as df in "AIC" except "AIC" adds one for variance
LLik<- (-2*((-.5*n*(log(d/n^p)+p))-(0.5*(n*log(2*pi))))) # -2*log-likelihood
pen<-2*(p*k+0.5*p*(p+1)) # penalty: p*k coefficients plus p(p+1)/2 (co)variances
c(k,LLik+pen) # return rank and AIC
}
maic.svl<-maic(Y.svl)
maic.sex<-maic(Y.sex)
maic.reg<-maic(Y.reg)
maic.svl.reg<-maic(Y.svl.reg)
maic.svl.by.reg<-maic(Y.svl.by.reg)
maic.svl.sex<-maic(Y.svl.sex)
maic.svl.by.sex<-maic(Y.svl.by.sex)
maic.mancova<-maic(Y.mancova)
maic.full<-maic(Y.full)
rbind(maic.svl,maic.sex,maic.reg,maic.svl.reg,maic.svl.by.reg,maic.svl.sex,
maic.svl.by.sex,maic.mancova,maic.full)
## [,1] [,2]
## maic.svl 2 -13168.85
## maic.sex 2 -13139.84
## maic.reg 3 -13163.98
## maic.svl.reg 4 -13179.39
## maic.svl.by.reg 6 -13161.66
## maic.svl.sex 3 -13159.18
## maic.svl.by.sex 4 -13141.50
## maic.mancova 7 -13144.65
## maic.full 12 -13088.92
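As in the univariate case, deltaAIC values make the comparison concrete (the second element of each maic result holds the AIC value):

```r
#deltaAIC for the multivariate candidate set
m.tab<-rbind(maic.svl,maic.sex,maic.reg,maic.svl.reg,maic.svl.by.reg,maic.svl.sex,
maic.svl.by.sex,maic.mancova,maic.full)
sort(m.tab[,2]-min(m.tab[,2])) #Y.svl.reg is preferred; all other models are >2 units away
```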
Model selection can identify the best model among a candidate set, but even the best model may be a poor descriptor of the data patterns. Cross-validation can assess this. Here, one fits the model with part of the data and then obtains predicted values (and residuals) for the remaining portion. The variation in those residuals is a measure of model fit. Iterating this process for multiple models informs us as to which provides the better description of the patterns in the data.
Here is some data:
x<-1:10
y<-c(2,2,3,4,4,7,7,7,8,8)
plot(x,y)
anova(lm(y~x))
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 49.648 49.648 100.52 8.327e-06 ***
## Residuals 8 3.952 0.494
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here is a cross-validation procedure:
iter=1000
rep<-lapply(1:iter, function(j) sample(1:10,5))
diff<-lapply(1:iter, function(j) setdiff(1:10,rep[[j]]))
#full model
model.x<- lapply(1:iter, function(j) lm(y[rep[[j]]]~x[rep[[j]]]))
resid.y<-lapply(1:iter, function(j) resid(model.x[[j]],newdata=data.frame(c(x[diff[[j]]]))))
ss.x<-unlist(lapply(1:iter, function(j) crossprod(resid.y[[j]])))
#reduced model
model.1<- lapply(1:iter, function(j) lm(y[rep[[j]]]~1))
resid.1<-lapply(1:iter, function(j) resid(model.1[[j]],newdata=data.frame(c(x[diff[[j]]]))))
ss.1<-unlist(lapply(1:iter, function(j) crossprod(resid.1[[j]])))
c(mean(ss.1),var(ss.1))
## [1] 23.8832 45.4257
c(mean(ss.x),var(ss.x))
## [1] 1.5367753 0.8878436
Here we are using the residual SS from each model as a measure of model adequacy; smaller values are better. Clearly, the model Y~X is a better description of the trend than Y~1.
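One caveat on the procedure above: resid() ignores a newdata argument, so ss.1 and ss.x are actually sums of squares of the training (in-sample) residuals. A version that scores the held-out half with predict() might look like this (a sketch reusing iter, rep, and diff from above; the models are refit with explicit data frames so predict() can locate the variables):

```r
#cross-validation using true holdout prediction error
pred.ss.x<-sapply(1:iter, function(j){
fit<-lm(y~x, data=data.frame(x=x[rep[[j]]], y=y[rep[[j]]]))
e<-y[diff[[j]]]-predict(fit, newdata=data.frame(x=x[diff[[j]]]))
crossprod(e)
})
pred.ss.1<-sapply(1:iter, function(j){
fit<-lm(y~1, data=data.frame(y=y[rep[[j]]]))
e<-y[diff[[j]]]-predict(fit, newdata=data.frame(x=x[diff[[j]]]))
crossprod(e)
})
c(mean(pred.ss.1),mean(pred.ss.x)) #smaller prediction SS identifies the better out-of-sample model
```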