Today we explore three ways to evaluate alternative explanatory models: likelihood ratio tests (LRT), information criteria (AIC), and cross-validation. All three methods work with both univariate and multivariate data.
Let’s first read in some data and prepare various models for investigation:
# Read data from Smith and Collyer 2008
snake<-read.csv("data/Lab-11-snake.csv",header=T)
site<-as.factor(snake[,1]);region<-as.factor(snake[,2]);sex<-as.factor(snake[,3])
svl<-snake[,4]; hs<-snake[,5] # hs= head size, calculated as centroid size from lm
Y<-as.matrix(snake[,-(1:5)])
##Setup: Univariate and multivariate models
#Univariate
hs.svl<-lm(hs~svl, model=T,x=T)
hs.sex<-lm(hs~sex, model=T,x=T)
hs.reg<-lm(hs~region, model=T,x=T)
hs.svl.reg<-lm(hs~svl + region, model=T,x=T)
hs.svl.by.reg<-lm(hs~svl*region, model=T,x=T)
hs.svl.sex<-lm(hs~svl + sex, model=T,x=T)
hs.svl.by.sex<-lm(hs~svl*sex, model=T,x=T)
hs.ancova<-lm(hs~svl + sex*region, model=T,x=T)
hs.full<-lm(hs~svl*sex*region, model=T,x=T)
#multivariate
Y.svl<-lm(Y~svl, model=T,x=T)
Y.sex<-lm(Y~sex, model=T,x=T)
Y.reg<-lm(Y~region, model=T,x=T)
Y.svl.reg<-lm(Y~svl+region, model=T,x=T)
Y.svl.by.reg<-lm(Y~svl*region, model=T,x=T)
Y.svl.sex<-lm(Y~svl + sex, model=T,x=T)
Y.svl.by.sex<-lm(Y~svl*sex, model=T,x=T)
Y.mancova<-lm(Y~svl + sex*region, model=T,x=T)
Y.full<-lm(Y~svl*sex*region, model=T,x=T)
LRT are useful for comparing nested models. This approach is familiar: we have used it throughout the semester when examining linear models, which by definition compare null and alternative models. Such comparisons may be implemented using ‘anova’.
Now let’s compare some models. First, we will run a factorial model, which (with type I SS) is a series of sequential LRTs.
anova(hs.full)
## Analysis of Variance Table
##
## Response: hs
## Df Sum Sq Mean Sq F value Pr(>F)
## svl 1 293048 293048 9.4015 0.002823 **
## sex 1 5224 5224 0.1676 0.683177
## region 2 209225 104612 3.3562 0.039051 *
## svl:sex 1 887 887 0.0285 0.866375
## svl:region 2 73702 36851 1.1822 0.311059
## sex:region 2 28115 14058 0.4510 0.638354
## svl:sex:region 2 33126 16563 0.5314 0.589537
## Residuals 95 2961177 31170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, we interpret the significance of higher-order effects relative to the main effects already present in the model. To see more clearly that this is in fact a series of LRTs, try the following:
anova(hs.svl.reg)
## Analysis of Variance Table
##
## Response: hs
## Df Sum Sq Mean Sq F value Pr(>F)
## svl 1 293048 293048 9.7245 0.002358 **
## region 2 207559 103779 3.4438 0.035666 *
## Residuals 103 3103897 30135
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(hs.svl,hs.svl.reg)
## Analysis of Variance Table
##
## Model 1: hs ~ svl
## Model 2: hs ~ svl + region
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 105 3311456
## 2 103 3103897 2 207559 3.4438 0.03567 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that the SS, F, and P-value for region in the sequential ANOVA table are identical to those from the direct comparison of the two models. Thus, for nested models, evaluating effects in a model is accomplished using LRT.
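The likelihood connection can be made explicit. For the nested pair just compared, the LR statistic is -2(logL.reduced - logL.full), which is asymptotically chi-squared with df equal to the difference in parameters (a quick sketch using R’s logLik):

```r
#LR statistic for hs~svl vs. hs~svl+region (asymptotically chi-squared, df=2)
LRstat<- -2*(as.numeric(logLik(hs.svl))-as.numeric(logLik(hs.svl.reg)))
LRstat
pchisq(LRstat,df=2,lower.tail=FALSE) #asymptotic P-value; the F-test above is the small-sample version
```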
The approach works identically for multivariate data:
anova(Y.full)
## Analysis of Variance Table
##
## Df Pillai approx F num Df den Df Pr(>F)
## (Intercept) 1 0.00000 0.0000 15 81 1.0000000
## svl 1 0.39072 3.4629 15 81 0.0001482 ***
## sex 1 0.21061 1.4407 15 81 0.1487897
## region 2 0.57957 2.2305 30 164 0.0007565 ***
## svl:sex 1 0.11365 0.6924 15 81 0.7847519
## svl:region 2 0.34241 1.1292 30 164 0.3077880
## sex:region 2 0.33155 1.0863 30 164 0.3589074
## svl:sex:region 2 0.32038 1.0428 30 164 0.4152491
## Residuals 95
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Y.svl.reg,Y.svl)
## Analysis of Variance Table
##
## Model 1: Y ~ svl + region
## Model 2: Y ~ svl
## Res.Df Df Gen.var. Pillai approx F num Df den Df Pr(>F)
## 1 103 7.3356e-05
## 2 105 2 7.5192e-05 0.55423 2.3001 30 180 0.00042 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Y.svl.reg,Y.reg)
## Analysis of Variance Table
##
## Model 1: Y ~ svl + region
## Model 2: Y ~ region
## Res.Df Df Gen.var. Pillai approx F num Df den Df Pr(>F)
## 1 103 7.3356e-05
## 2 104 1 7.4736e-05 0.34579 3.1361 15 89 0.0003966 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Y.mancova,Y.svl.reg)
## Analysis of Variance Table
##
## Model 1: Y ~ svl + sex * region
## Model 2: Y ~ svl + region
## Res.Df Df Gen.var. Pillai approx F num Df den Df Pr(>F)
## 1 100 7.3000e-05
## 2 103 3 7.3356e-05 0.4598 1.0619 45 264 0.375
A second useful way of comparing models is with AIC (or related) values. These methods penalize the log-likelihood of the data given the model (logL) by the number of parameters in the model, thereby favoring the best description of the data achieved with the fewest parameters. For the univariate models above:
aic.summary<-AIC(hs.svl,hs.sex,hs.reg,hs.svl.reg,hs.svl.by.reg,hs.svl.sex,hs.svl.by.sex,
hs.ancova,hs.full)
aic.summary #smallest AIC is preferred model
## df AIC
## hs.svl 3 1416.040
## hs.sex 3 1425.040
## hs.reg 4 1420.558
## hs.svl.reg 5 1413.114
## hs.svl.by.reg 7 1414.552
## hs.svl.sex 4 1417.871
## hs.svl.by.sex 5 1419.869
## hs.ancova 8 1417.028
## hs.full 13 1424.078
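These values are simply -2logL + 2(df), which can be confirmed for any one model directly from its log-likelihood:

```r
#confirm AIC = -2*logL + 2*df for one model
ll<-logLik(hs.svl)
-2*as.numeric(ll)+2*attr(ll,"df") #identical to AIC(hs.svl)
```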
Here, hs.svl.reg (hs~svl+region) is found to be the best model. Models within 2 (or 4) AIC units of this (i.e., deltaAIC < 2 or 4) are considered reasonably close to the best model.
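The deltaAIC values can be obtained directly from the table above (a quick sketch using the aic.summary object already created):

```r
delta<-aic.summary$AIC-min(aic.summary$AIC) #deltaAIC relative to the best model
names(delta)<-rownames(aic.summary)
sort(delta) #hs.svl.by.reg falls within 2 units of hs.svl.reg
```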
In cases where multiple candidate models fall within 2 (or 4) AIC units of the best model, model averaging can be considered. This is accomplished by obtaining AIC weights for each model and averaging the weighted parameters across the candidate models to obtain a weighted (average) model.
c1<-exp(-.5*0)
c2<-exp(-.5*(AIC(hs.svl.by.reg)-AIC(hs.svl.reg)))
w1<-c1/(c1+c2) #AIC weights
w2<-c2/(c1+c2)
w1
## [1] 0.6723665
w2
## [1] 0.3276335
beta1<-w1*(hs.svl.reg$coef)
beta2<-w2*(hs.svl.by.reg$coef)
beta.avg<-c((beta1[1]+beta2[1]),(beta1[2]+beta2[2]),(beta1[3]+beta2[3]),
(beta1[4]+beta2[4]),beta2[5],beta2[6])
beta.avg #model averaged coefficients
## (Intercept) svl regionEast regionWest svl:regionEast
## 432.698370 3.701737 -13.601998 -14.739733 -0.570554
## svl:regionWest
## -1.128373
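The two weights above are a special case of the general Akaike weight formula, wi = exp(-deltai/2) / sum(exp(-deltaj/2)). Applied to the full candidate set (a sketch using aic.summary from above):

```r
delta<-aic.summary$AIC-min(aic.summary$AIC)
w<-exp(-0.5*delta)/sum(exp(-0.5*delta)) #Akaike weights; they sum to 1
names(w)<-rownames(aic.summary)
round(w,3)
```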
NOTE: the average model is NOT guaranteed to be a better model. In fact, for the example above, it is a worse model than either of the models used to obtain it.
#FIND AIC for averaged model (need likelihood and its parameters)
Terms<-terms(formula(hs.svl.by.reg)) #set up X matrix
X<-model.matrix(Terms)
resid<-hs-X%*%beta.avg
E<-t(resid)%*%resid #Note: univariate, so this IS the det(E)
n<-nrow(resid); p<-1 #NOTE: p<-ncol(resid) for multivariate
k<-hs.svl.by.reg$rank # same as df in "AIC" except "AIC" adds one for variance
LLik<- (-2*((-.5*n*(log(E/n^p)+p))-(0.5*(n*log(2*pi)))))
pen<-2*(p*k+0.5*p*(p+1))
AIC.mdl.avg<-LLik+pen
AIC(hs.svl.by.reg)
## [1] 1414.552
AIC(hs.svl.reg)
## [1] 1413.114
AIC.mdl.avg #NOTE: AIC of average model is WORSE than the 2 input models
## [,1]
## [1,] 1415.718
NOTE: because these are nested models, we can also use LRT to evaluate them:
anova(hs.svl.by.reg)
## Analysis of Variance Table
##
## Response: hs
## Df Sum Sq Mean Sq F value Pr(>F)
## svl 1 293048 293048 9.7668 0.00232 **
## region 2 207559 103779 3.4588 0.03524 *
## svl:region 2 73442 36721 1.2238 0.29842
## Residuals 101 3030455 30005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(hs.svl.reg,hs.svl.by.reg)
## Analysis of Variance Table
##
## Model 1: hs ~ svl + region
## Model 2: hs ~ svl * region
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 103 3103897
## 2 101 3030455 2 73442 1.2238 0.2984
The LRT demonstrates that the svl:region interaction should not be in the model; thus, despite its ‘close’ AIC, model averaging should not be used here.
One can compute multivariate AIC, but it requires a special function (the AIC function in R is univariate only):
maic<-function(x,...){ # each x is a fitted lm model
y<-resid(x)
E<-t(y)%*%y # error SSCP matrix
d<-det(E)
if(length(E)>1) n<-nrow(y) else n<-length(y)
if(length(E)>1) p<-ncol(y) else p<-1
k<-x$rank # same as df in "AIC" except "AIC" adds one for variance
LLik<- (-2*((-.5*n*(log(d/n^p)+p))-(0.5*(n*log(2*pi))))) # -2*log-likelihood
pen<-2*(p*k+0.5*p*(p+1)) # penalty: p*k coefficients plus p(p+1)/2 (co)variances
c(k,LLik+pen) # return rank and AIC
}
maic.svl<-maic(Y.svl)
maic.sex<-maic(Y.sex)
maic.reg<-maic(Y.reg)
maic.svl.reg<-maic(Y.svl.reg)
maic.svl.by.reg<-maic(Y.svl.by.reg)
maic.svl.sex<-maic(Y.svl.sex)
maic.svl.by.sex<-maic(Y.svl.by.sex)
maic.mancova<-maic(Y.mancova)
maic.full<-maic(Y.full)
rbind(maic.svl,maic.sex,maic.reg,maic.svl.reg,maic.svl.by.reg,maic.svl.sex,
maic.svl.by.sex,maic.mancova,maic.full)
## [,1] [,2]
## maic.svl 2 -13168.85
## maic.sex 2 -13139.84
## maic.reg 3 -13163.98
## maic.svl.reg 4 -13179.39
## maic.svl.by.reg 6 -13161.66
## maic.svl.sex 3 -13159.18
## maic.svl.by.sex 4 -13141.50
## maic.mancova 7 -13144.65
## maic.full 12 -13088.92
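As in the univariate case, deltaAIC values make the comparison concrete (the second element of each maic result holds the AIC value):

```r
#deltaAIC for the multivariate candidate set
m.tab<-rbind(maic.svl,maic.sex,maic.reg,maic.svl.reg,maic.svl.by.reg,maic.svl.sex,
maic.svl.by.sex,maic.mancova,maic.full)
sort(m.tab[,2]-min(m.tab[,2])) #Y.svl.reg is preferred; all other models are >2 units away
```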
Model selection can identify the best model among a candidate set, but even the best model may be a poor descriptor of the data patterns. Cross-validation can assess this. Here, one fits the model with part of the data and then obtains predicted values (and residuals) for the remaining portion. The variation in those residuals is a measure of model fit. Iterating this process for multiple models informs us as to which provides the better description of the patterns in the data.
Here is some data:
x<-1:10
y<-c(2,2,3,4,4,7,7,7,8,8)
plot(x,y)
anova(lm(y~x))
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 49.648 49.648 100.52 8.327e-06 ***
## Residuals 8 3.952 0.494
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here is a cross-validation procedure:
iter=1000
rep<-lapply(1:iter, function(j) sample(1:10,5))
diff<-lapply(1:iter, function(j) setdiff(1:10,rep[[j]]))
#full model
model.x<- lapply(1:iter, function(j) lm(y[rep[[j]]]~x[rep[[j]]]))
resid.y<-lapply(1:iter, function(j) resid(model.x[[j]],newdata=data.frame(c(x[diff[[j]]]))))
ss.x<-unlist(lapply(1:iter, function(j) crossprod(resid.y[[j]])))
#reduced model
model.1<- lapply(1:iter, function(j) lm(y[rep[[j]]]~1))
resid.1<-lapply(1:iter, function(j) resid(model.1[[j]],newdata=data.frame(c(x[diff[[j]]]))))
ss.1<-unlist(lapply(1:iter, function(j) crossprod(resid.1[[j]])))
c(mean(ss.1),var(ss.1))
## [1] 23.8832 45.4257
c(mean(ss.x),var(ss.x))
## [1] 1.5367753 0.8878436
Here we are using the residual SS from each model as a measure of model adequacy; smaller values are better. Clearly, the model Y~X is a better description of the trend than Y~1.
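One caveat on the procedure above: resid() ignores a newdata argument, so ss.1 and ss.x are actually sums of squares of the training (in-sample) residuals. A version that scores the held-out half with predict() might look like this (a sketch reusing iter, rep, and diff from above; the models are refit with explicit data frames so predict() can locate the variables):

```r
#cross-validation using true holdout prediction error
pred.ss.x<-sapply(1:iter, function(j){
fit<-lm(y~x, data=data.frame(x=x[rep[[j]]], y=y[rep[[j]]]))
e<-y[diff[[j]]]-predict(fit, newdata=data.frame(x=x[diff[[j]]]))
crossprod(e)
})
pred.ss.1<-sapply(1:iter, function(j){
fit<-lm(y~1, data=data.frame(y=y[rep[[j]]]))
e<-y[diff[[j]]]-predict(fit, newdata=data.frame(x=x[diff[[j]]]))
crossprod(e)
})
c(mean(pred.ss.1),mean(pred.ss.x)) #smaller prediction SS identifies the better out-of-sample model
```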