Model Comparison

Likelihood Ratio Tests and Information Criteria

Dean Adams, Iowa State University

Model Comparison: Goals and Aims

Likelihood Ratio Tests

\[\small{L}\mathbf{(\hat{\beta}|Y)}\]

\[\small\log{L} = -\frac{n}{2}ln{2\pi}-\frac{n}{2}ln(\hat{\sigma}^2)-\frac{1}{2\hat{\sigma}^{2}}{\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\left(\mathbf{Y}-\mathbf{X\beta}\right)}\]

\[\small\log{L}=-\frac{np}{2}ln{2\pi}-\frac{n}{2}ln{\begin{vmatrix}\mathbf{\hat{\Sigma}}\end{vmatrix}}-\frac{1}{2}tr{\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)}\]
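This expression can be checked numerically in R; a minimal sketch, assuming a fitted lm() object (possibly with a multivariate response). The function name logL.manual is illustrative, not from the original slides:

logL.manual <- function(fit) {
  R <- as.matrix(resid(fit))      # n x p matrix of residuals
  n <- nrow(R); p <- ncol(R)
  Sig <- crossprod(R) / n         # ML estimate of Sigma-hat
  -(n * p / 2) * log(2 * pi) - (n / 2) * log(det(Sig)) -
    0.5 * sum(diag(crossprod(R) %*% solve(Sig)))   # the trace term
}

For univariate fits this agrees with logLik(fit).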

What’s In a Likelihood?

\[\small\log{L}=-\frac{np}{2}ln{2\pi}-\frac{n}{2}ln{\begin{vmatrix}\mathbf{\hat{\Sigma}}\end{vmatrix}}-\frac{1}{2}tr{\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)}\]

Element | Component | Meaning
------- | --------- | -------
1 | \(\small\frac{np}{2}ln{2\pi}\) | A constant
2 | \(\small\frac{1}{2}tr{\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)}\) | Represents summed-squared deviations from the predicted values in some fashion
3 | \(\small\frac{n}{2}ln{\begin{vmatrix}\mathbf{\hat{\Sigma}}\end{vmatrix}}\) | The error covariance of the model

Likelihood: Component Parts

Let’s look at the \(\small\frac{1}{2}tr{\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)}\) component across two models for the same data:

\(\tiny\mathbf{Y}=\begin{bmatrix} 6\\ 7\\ 4\\ 5\\ 9\\ 3\\ 2\\ 5 \end{bmatrix}\) \(\tiny\mathbf{X_{0}}=\begin{bmatrix} 1\\ 1\\ 1\\ 1\\ 1\\ 1\\ 1\\ 1 \end{bmatrix}\) \(\tiny\mathbf{X_{1}}=\begin{bmatrix} 1 & 1\\ 1& 3\\ 1& 3\\ 1& 4\\ 1& 5\\ 1& 6\\ 1& 7\\ 1& 8 \end{bmatrix}\)
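To follow along in R, these can be entered directly (the names y, x0, x1, and N are chosen to match the code below):

y  <- c(6, 7, 4, 5, 9, 3, 2, 5)
x0 <- matrix(1, nrow = 8)                   # intercept-only design matrix
x1 <- cbind(1, c(1, 3, 3, 4, 5, 6, 7, 8))   # intercept + slope design matrix
N  <- length(y)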

beta <- solve(t(x0) %*% x0) %*% t(x0) %*% y        # OLS coefficients, intercept-only model
sigma <- (t(y - x0 %*% beta) %*% (y - x0 %*% beta)) / N    # ML variance estimate
beta1 <- solve(t(x1) %*% x1) %*% t(x1) %*% y       # OLS coefficients, intercept + slope model
sigma1 <- (t(y - x1 %*% beta1) %*% (y - x1 %*% beta1)) / N

(t(y - x0 %*% beta) %*% (y - x0 %*% beta)) / (2 * sigma)
##      [,1]
## [1,]    4
(t(y - x1 %*% beta1) %*% (y - x1 %*% beta1)) / (2 * sigma1)
##      [,1]
## [1,]    4

Wait! What? For two different models this component is IDENTICAL!

ALGEBRA, please provide us with wisdom…

Likelihood: Component Parts

And algebra answers…“Yes I will!”

\(\small\frac{1}{2}tr{\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)}\) must be a constant, specifically \(\small\frac{np}{2}\)

First, recall that: \(\small\mathbf{\hat{\Sigma}} = \frac{\left(\mathbf{Y} -\mathbf{X\hat{\beta}}\right)^T \left(\mathbf{Y} -\mathbf{X\hat{\beta}}\right)}{n}\)
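Substituting this estimate into the component above, and using the cyclic property of the trace, completes the argument:

\[\small\frac{1}{2}tr\left[\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\hat{\beta}}\right)^{T}\left(\mathbf{Y}-\mathbf{X\hat{\beta}}\right)\right]=\frac{1}{2}tr\left[n\mathbf{I}_{p}\right]=\frac{np}{2}\]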

Revisiting the logL Components

Ok, so for OLS models, what does the likelihood really contain?

\[\small\log{L}=-\frac{np}{2}ln{2\pi}-\frac{n}{2}ln{\begin{vmatrix}\mathbf{\hat{\Sigma}}\end{vmatrix}}-\frac{1}{2}tr{\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)}\]

Element | Component | Meaning
------- | --------- | -------
1 | \(\small\frac{np}{2}ln{2\pi}\) | A constant
2 | \(\small\frac{1}{2}tr{\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)}\) | A constant
3 | \(\small\frac{n}{2}ln{\begin{vmatrix}\mathbf{\hat{\Sigma}}\end{vmatrix}}\) | The error covariance of the model

This exploration reveals that the only component of \(\small\log{L}\) that is free to vary is \(\small\frac{n}{2}ln{\begin{vmatrix}\mathbf{\hat{\Sigma}}\end{vmatrix}}\)

By extension, this implies that ALL parametric statistical hypothesis testing becomes a comparison of the error covariance of the model: lower error covariance means a better fitting model

As such, model comparisons (via LRT, AIC, etc.) are also based on this one component: the error covariance of the model.

Now let’s get back to model comparisons and see how this is done…

Likelihood Ratio Tests: Nested Models

The formal method for comparing nested models (e.g., Y~X1+X2 vs. Y~X1) is a likelihood ratio test (LRT):

\[\small\log \left( \frac{\mathcal{L}\left(\mathbf{Y}| \mathbf{X}_f\right)} {\mathcal{L}\left(\mathbf{Y}| \mathbf{X}_r\right)} \right)=\]

\[\small-\frac{1}{2} \left( np + n\log|\mathbf{\hat{\Sigma}}_{p_f}| + np\log\left(2\pi \right ) \right) + \frac{1}{2} \left( np + n\log|\mathbf{\hat{\Sigma}}_{p_r}| + np\log\left(2\pi \right ) \right) =\]

Eliminating the constants, one has:

\[\small\frac{n}{2}\left(\log|\mathbf{\hat{\Sigma}}_{p_r}| - \log|\mathbf{\hat{\Sigma}}_{p_f}|\right)\]

Doubling this gives the test statistic, \[\small{D} = 2\log\left(\frac{\mathcal{L}_f}{\mathcal{L}_r}\right) = n\left(\log|\mathbf{\hat{\Sigma}}_{p_r}| - \log|\mathbf{\hat{\Sigma}}_{p_f}|\right)\] which for univariate data reduces to

\[\small{n}\left(\log \hat{\sigma}_{r}^2 - \log \hat{\sigma}_{f}^2\right) \sim\chi^2(k_f-k_r)\]

where \(k_f\) and \(k_r\) correspond to the ranks of the two model design matrices, \(\mathbf{X}\).
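In R, this test is easy to carry out by hand for univariate models; a minimal sketch (the function name lrt and the fit objects fit.f and fit.r are illustrative):

lrt <- function(fit.f, fit.r) {
  n <- nobs(fit.f)
  s2.f <- sum(resid(fit.f)^2) / n     # ML variance estimate, full model
  s2.r <- sum(resid(fit.r)^2) / n     # ML variance estimate, reduced model
  D <- n * (log(s2.r) - log(s2.f))    # deviance statistic
  df <- fit.f$rank - fit.r$rank       # difference in design-matrix ranks
  c(D = D, p.value = pchisq(D, df, lower.tail = FALSE))
}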

Likelihood Ratio Hypothesis Tests

\[\small{LRT}={2}\log\left(\frac{\mathcal{L}_f}{\mathcal{L}_r}\right) = n\log\left(\frac{|\mathbf{\hat{\Sigma}}_{p_r}|}{|\mathbf{\hat{\Sigma}}_{p_f}|}\right)=-n\log\Lambda_{Wilks}\]

where \(\small\Lambda_{Wilks}=|\mathbf{\hat{\Sigma}}_{p_f}|/|\mathbf{\hat{\Sigma}}_{p_r}|\) is Wilks' lambda.

LRT: Example

LRT: Example (Cont.)

Models considered:

Full: Y ~ svl + sex * region
Red1: Y ~ svl + region
Red2: Y ~ svl
Red3: Y ~ region

LRTs: Full vs. Red1, Red1 vs. Red2, and Red1 vs. Red3 (as fit below)

LRT: Example (Cont.)

Results: Full model MANOVA

Full <- lm(Y ~ svl + sex * region, model = TRUE, x = TRUE)
Red1 <- lm(Y ~ svl + region, model = TRUE, x = TRUE)
Red2 <- lm(Y ~ svl, model = TRUE, x = TRUE)
Red3 <- lm(Y ~ region, model = TRUE, x = TRUE)
anova(Full)
## Analysis of Variance Table
## 
##              Df  Pillai approx F num Df den Df    Pr(>F)    
## (Intercept)   1 0.00000   0.0000     15     86 1.0000000    
## svl           1 0.38713   3.6215     15     86 7.359e-05 ***
## sex           1 0.19622   1.3996     15     86 0.1659931    
## region        2 0.56941   2.3085     30    174 0.0004171 ***
## sex:region    2 0.29323   0.9965     30    174 0.4787552    
## Residuals   100                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Full, Red1)
## Analysis of Variance Table
## 
## Model 1: Y ~ svl + sex * region
## Model 2: Y ~ svl + region
##   Res.Df Df   Gen.var. Pillai approx F num Df den Df Pr(>F)
## 1    100    7.3000e-05                                     
## 2    103  3 7.3356e-05 0.4598   1.0619     45    264  0.375

LRT: Example (Cont.)

anova(Red1,Red2)
## Analysis of Variance Table
## 
## Model 1: Y ~ svl + region
## Model 2: Y ~ svl
##   Res.Df Df   Gen.var.  Pillai approx F num Df den Df  Pr(>F)    
## 1    103    7.3356e-05                                           
## 2    105  2 7.5192e-05 0.55423   2.3001     30    180 0.00042 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Red1,Red3)
## Analysis of Variance Table
## 
## Model 1: Y ~ svl + region
## Model 2: Y ~ region
##   Res.Df Df   Gen.var.  Pillai approx F num Df den Df    Pr(>F)    
## 1    103    7.3356e-05                                             
## 2    104  1 7.4736e-05 0.34579   3.1361     15     89 0.0003966 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusions:

Sexual dimorphism does not contribute to phenotypic variation, but region and SVL are important

LRTs demonstrate that adding parameters for Sex and Sex × Region does not offer a significant improvement over a simpler model!

Likelihood Ratio Tests: Thoughts

Advantages

Provides a statistical assessment of model difference, based on importance of parameters that differ between models – ELEGANT!

Can be performed on any level of hierarchical structure (e.g., a model with parameters A, B, C, D, E, F can be compared to one with parameters A and E, one with just F, or one with B, C, and D in the same manner).

Disadvantages

As long as the n:kp ratio (observations relative to parameters) is large, significant differences between models are likely, even for small effects.

Models must be nested

\[\small{LRT}={2}\log\left(\frac{\mathcal{L}_f}{\mathcal{L}_r}\right) = n\log\left(\frac{|\mathbf{\hat{\Sigma}}_{p_r}|}{|\mathbf{\hat{\Sigma}}_{p_f}|}\right)=-n\log\Lambda_{Wilks}\]

Really? What about this equation implies models must be nested?

LRT for Non-Nested Models

See Lewis et al. 2011. Methods Ecol. Evol.

Information Theory Criteria

Approach compares models by ranking them based on some score: typically AIC (Akaike’s information criterion)

Model that explains the most information with the fewest parameters (a ‘parsimony’ criterion) is then chosen as the preferred model

IT scores “reward” an increase in likelihood, but “penalize” the increase in number of parameters

Does NOT use significance testing!

Instead, the best (lowest-AIC) model is chosen, and is strongly preferred when the difference between models is \(\small\Delta_{AIC} > 4\)

Information Theory: Parsimony Principle

Goal of model comparison is to select the model that explains the most information with the fewest parameters

Akaike Information Criterion (AIC)

The most commonly used criterion to score and compare candidate models is:

\[\small\mathrm{AIC} = -2\mathcal{L} + 2K\]

where \(\mathcal{L}\) is the model log-likelihood, and \(\small{2K}\) is the parameter penalty, with \(K=k+1\) the number of model parameters: \(\small{k}\) coefficients (the rank of the design matrix) plus \(\small{1}\) for the residual model variance \(\small\sigma^2\).

Actually, this is the univariate version of AIC. The general multivariate form is (Bedrick and Tsai 1994):

\[\small\mathrm{AIC} = -2\mathcal{L} + 2[pk+0.5p(p+1)]\]

where the number of parameters for the matrix of coefficients for multivariate data is \(\small{pk}\) and the number of parameters for the residual covariance matrix is \(\small\frac{1}{2}p(p + 1)\).
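A minimal R sketch of this computation (the function name mAIC is hypothetical; it accepts any lm() fit, univariate or multivariate):

mAIC <- function(fit) {
  R <- as.matrix(resid(fit))             # n x p residual matrix
  n <- nrow(R); p <- ncol(R)
  Sig <- crossprod(R) / n                # ML estimate of the error covariance
  logL <- -(n * p / 2) * log(2 * pi) - (n / 2) * log(det(Sig)) - n * p / 2
  K <- p * fit$rank + 0.5 * p * (p + 1)  # pk coefficients + p(p+1)/2 covariances
  -2 * logL + 2 * K
}

For p = 1 this reduces to the univariate formula above, with K = k + 1.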

AIC Dissected

AIC Modifications

AIC is essentially the model log-likelihood (\(\mathcal{L}\), multiplied by \(-2\)) plus a parameter penalty for model complexity. The quantification of this parameter penalty is somewhat arbitrary, and various penalties have been proposed:

\(\small\mathrm{AIC} = -2\mathcal{L} + 2K\)

\(\small\mathrm{AIC}_c = -2\mathcal{L} + 2K\frac{n}{n-K-1}\) (often considered a ‘sample size correction’)

\(\small\mathrm{QAIC} = -2\mathcal{L}/c + 2K\) where \(\small{c}\) is a dispersion index based on the \(\small\chi^2\) distribution, to account for overdispersion

\(\small\mathrm{BIC} = -2\mathcal{L} + K\log{(n)}\) the Bayesian information criterion, whose penalty grows with sample size

And there are many, many others (see Burnham and Anderson. 2002)
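As one example, the small-sample version is easy to obtain for a univariate lm() fit; a sketch (the function name AICc is hypothetical), using the identity \(2K\frac{n}{n-K-1} = 2K + \frac{2K(K+1)}{n-K-1}\):

AICc <- function(fit) {
  K <- fit$rank + 1                     # coefficients plus sigma^2
  n <- nobs(fit)
  AIC(fit) + 2 * K * (K + 1) / (n - K - 1)
}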

AIC: Example

res
##           df       AIC     dAIC
## maic.Full  7 -13144.65 34.73799
## maic.Red1  4 -13179.39  0.00000
## maic.Red2  2 -13168.85 10.53920
## maic.Red3  3 -13163.98 15.40285

Best model (as found by \(\small\Delta{AIC}\)) is Red1: Y~SVL + Region. No other model is within 4 \(\small\Delta{AIC}\) units, so this model is strongly preferred.
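The objects behind this table are not shown in the slides; one plausible reconstruction uses a multivariate AIC function such as the mAIC() sketch above (the df column is omitted here):

aic.vals <- c(maic.Full = mAIC(Full), maic.Red1 = mAIC(Red1),
              maic.Red2 = mAIC(Red2), maic.Red3 = mAIC(Red3))
res <- data.frame(AIC = aic.vals, dAIC = aic.vals - min(aic.vals))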

Nested Models: LRT or AIC?

\[\small\Delta{AIC}=(n\log|\mathbf{\hat{\Sigma}}_{f}| + 2[pk_f+0.5p(p+1)])-(n\log|\mathbf{\hat{\Sigma}}_{r}| + 2[pk_r+0.5p(p+1)])=\]

\[\small{n}\log\frac{|\mathbf{\hat{\Sigma}}_{f}|}{|\mathbf{\hat{\Sigma}}_{r}|} + 2p(k_f-k_r) = -LRT + 2p(k_f-k_r)\]

Thus, for nested models, comparing AIC scores is equivalent to the LRT, with a fixed parameter penalty standing in for a \(\small\chi^2\) critical value.

Model Uncertainty

Model | \(\small\Delta{AIC}\)
----- | -----
A | 3
A + B | 0
C | 2

Model | \(\small\Delta{AIC}\) | \(\small{w}_i\)
----- | ----- | -----
A | 3 | 0.122
A + B | 0 | 0.547
C | 2 | 0.331
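The \(\small{w}_i\) values are Akaike weights; the standard definition (Burnham and Anderson 2002) rescales the \(\small\Delta_{AIC}\) scores to sum to 1:

\[\small{w}_i=\frac{\exp(-\Delta_i/2)}{\sum_{j}\exp(-\Delta_j/2)}\]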

Model Averaging

Model | \(\small\Delta{AIC}\) | \(\small{w}_i\)
----- | ----- | -----
A | 3 | 0.122
A + B | 0 | 0.547
C | 2 | 0.331

\[\small\beta^*=(0.122+0.547)\beta_A+0.547\beta_B+0.331\beta_C\]

Each coefficient is weighted by the summed weights of the models containing it: predictor A appears in models A and A + B, so \(\beta_A\) receives weight \(0.122+0.547\).

Model Averaging: Example

Returning to the snake data (univariate this time), consider two models:

Y~SVL+Region+SVL:Region

Y~SVL+Region

AIC(hs.svl.by.reg, hs.svl.reg)
##               df      AIC
## hs.svl.by.reg  7 1414.552
## hs.svl.reg     5 1413.114

Models have very similar AIC values. Obtain \(\Delta{AIC}\), \(\small{w}_i\), and model-averaged coefficients:

c(w1,w2)
## [1] 0.6723665 0.3276335
beta.avg
##    (Intercept)            svl     regionEast     regionWest svl:regionEast 
##     432.698370       3.701737     -13.601998     -14.739733      -0.570554 
## svl:regionWest 
##      -1.128373
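These weights follow directly from the two AIC values above; a minimal sketch (here w[1] corresponds to hs.svl.reg, the lower-AIC model):

aics <- c(AIC(hs.svl.reg), AIC(hs.svl.by.reg))
dAIC <- aics - min(aics)
w <- exp(-0.5 * dAIC) / sum(exp(-0.5 * dAIC))   # w = (0.672, 0.328)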

Model Averaging: Example (Cont.)

Now fit averaged model and obtain AIC:

c(AIC(hs.svl.reg), AIC(hs.svl.by.reg), AIC.mdl.avg )
## [1] 1413.114 1414.552 1415.718

Here the averaged model attains a WORSE AIC than the models from which it was averaged! This makes no sense, as the point of model averaging is to arrive at a better model than those in the candidate set!

Finally, for this particular example (of 2 nested models), we know the preferred model. It is actually the simpler model:

anova(hs.svl.by.reg)
## Analysis of Variance Table
## 
## Response: hs
##             Df  Sum Sq Mean Sq F value  Pr(>F)   
## svl          1  293048  293048  9.7668 0.00232 **
## region       2  207559  103779  3.4588 0.03524 * 
## svl:region   2   73442   36721  1.2238 0.29842   
## Residuals  101 3030455   30005                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Don’t forget basic statistical logic. Here, the more complex model containing the SVL:Region interaction was not a significant improvement in fit over the main-effects model. Thus, the interaction term should not be included!

Exclusive reliance on AIC-based philosophy (and its extensions to model averaging) can cloud one’s judgement and get one into trouble! Thinking is always encouraged in statistical endeavors!

AIC: A Comment on Multivariate Data

Information Criteria: Thoughts

Advantages

Can consider non-nested models

Penalizes model likelihood by the number of parameters needed in the model

Limitations

Choice of parameter penalty is arbitrary. Although there are sound statistical reasons to choose one type of information criterion (IC) over another, there are no statistical distributions for ICs, and ICs do not incorporate biological reasoning.

Indices are univariate even if data are multivariate. ICs only measure magnitude of deviation of models, but not direction in the parameter space.

Disadvantages

Not really a statistical test, and proponents of information-theoretic model selection (ITMS) have argued strongly that statistical tests are useless. However, the statistical theory of the LRT is well founded and does not necessarily differ from IC comparisons for nested models.

Model averaging – proceed with extreme caution IF AT ALL

Cross-Validation

Cross-Validation: Example

Here is some data:

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## x          1 49.648  49.648  100.52 8.327e-06 ***
## Residuals  8  3.952   0.494                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Cross-Validation (Cont.)

\[\tiny{Y}=\beta_0+\beta_1X\]

\[\tiny{Y}=\beta_0\]
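The quantities ss.1 and ss.x below hold cross-validated squared prediction errors for the two models. The exact partitioning used for these slides is not shown; a leave-one-out sketch of the idea, assuming vectors x and y from the data above:

n <- length(y)
ss.1 <- ss.x <- numeric(n)
for (i in 1:n) {
  b0 <- mean(y[-i])                            # intercept-only fit without case i
  b <- coef(lm(y[-i] ~ x[-i]))                 # regression fit without case i
  ss.1[i] <- (y[i] - b0)^2                     # held-out error, intercept model
  ss.x[i] <- (y[i] - (b[1] + b[2] * x[i]))^2   # held-out error, regression model
}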

c(mean(ss.1),var(ss.1))
## [1] 24.14600 43.36493
c(mean(ss.x),var(ss.x))
## [1] 1.5381977 0.8830653

Cross-validation demonstrates that \(\tiny{Y}=\beta_0+\beta_1X\) is the more robust model for these data: its out-of-sample prediction error is far smaller in both mean and variance.

Model Comparisons: Summary