Dean Adams, Iowa State University
Often (especially in observational studies), it is unknown which ‘explanatory’ variables are important sources of variation.
How do we compare models and ‘select’ the optimal one?
We can always ‘dump’ any or all explanatory variables into a model, but there are some problems with this.
Increasing the number of model parameters (\(\small{k}\)) always improves fit (the likelihood cannot decrease), but at the cost of added model complexity and potential over-fitting.
The goal is to evaluate two or more models with different sets of parameters to determine if one is ‘better’ (i.e., explains more variation in the response given certain criteria).
Today we discuss: likelihood ratio tests for nested models, information-theoretic model comparison (AIC and related criteria), model averaging, and cross-validation.
Likelihood ratio tests are used to statistically compare one model (e.g., full model) with an alternative model that is nested within the full model (e.g., reduced model).
But first, what is the likelihood of a model? This describes the likelihood of a set of model parameters (\(\small\beta\)), given the data (\(\small\mathbf{Y}\)). It is equal to the probability of the observed outcomes given the parameters:
\[\small{L}\mathbf{(\hat{\beta}|Y)}\]
For univariate data modeled with normal (OLS) error, the log-likelihood is:

\[\small\log{L} = -\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\hat{\sigma}^2)-\frac{1}{2\hat{\sigma}^{2}}{\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\left(\mathbf{Y}-\mathbf{X\beta}\right)}\]
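As a quick sanity check (with made-up numbers; nothing here comes from the source), this expression evaluated at the OLS estimates matches R’s logLik():

y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
x <- 1:5
fit <- lm(y ~ x)
n <- length(y)
s2 <- sum(resid(fit)^2) / n                     # ML estimate of sigma^2 (divide by n, not n - k)
-n/2 * log(2 * pi) - n/2 * log(s2) - sum(resid(fit)^2) / (2 * s2)
logLik(fit)                                     # same value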
For multivariate data, the log-likelihood generalizes to:

\[\small\log{L}=-\frac{np}{2}\log(2\pi)-\frac{n}{2}\log|\mathbf{\hat{\Sigma}}|-\frac{1}{2}\mathrm{tr}\left[\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\left(\mathbf{Y}-\mathbf{X\beta}\right)\right]\]

| Element | Component | Meaning |
|---|---|---|
| 1 | \(\small\frac{np}{2}\log(2\pi)\) | A constant |
| 2 | \(\small\frac{1}{2}\mathrm{tr}\left[\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\left(\mathbf{Y}-\mathbf{X\beta}\right)\right]\) | The summed squared deviations of the observations from their predicted values, scaled by the inverse error covariance (a generalized residual sum of squares) |
| 3 | \(\small\frac{n}{2}\log|\mathbf{\hat{\Sigma}}|\) | The (log-determinant of the) error covariance of the model |
Let’s look at the \(\small\frac{1}{2}\mathrm{tr}\left[\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\left(\mathbf{Y}-\mathbf{X\beta}\right)\right]\) component across two models for the same data:
\(\tiny\mathbf{Y}=\begin{bmatrix} 6\\ 7\\ 4\\ 5\\ 9\\ 3\\ 2\\ 5 \end{bmatrix}\) \(\tiny\mathbf{X_{0}}=\begin{bmatrix} 1\\ 1\\ 1\\ 1\\ 1\\ 1\\ 1\\ 1 \end{bmatrix}\) \(\tiny\mathbf{X_{1}}=\begin{bmatrix} 1 & 1\\ 1& 3\\ 1& 3\\ 1& 4\\ 1& 5\\ 1& 6\\ 1& 7\\ 1& 8 \end{bmatrix}\)
y <- c(6, 7, 4, 5, 9, 3, 2, 5)
n <- length(y)
x0 <- matrix(1, n, 1)                                         # intercept-only design matrix
x1 <- cbind(1, c(1, 3, 3, 4, 5, 6, 7, 8))                     # intercept + predictor
beta0 <- solve(t(x0) %*% x0) %*% t(x0) %*% y
sigma0 <- (t(y - x0 %*% beta0) %*% (y - x0 %*% beta0)) / n    # ML variance estimate
beta1 <- solve(t(x1) %*% x1) %*% t(x1) %*% y
sigma1 <- (t(y - x1 %*% beta1) %*% (y - x1 %*% beta1)) / n
(t(y - x0 %*% beta0) %*% (y - x0 %*% beta0)) / (2 * sigma0)

## [,1]
## [1,] 4

(t(y - x1 %*% beta1) %*% (y - x1 %*% beta1)) / (2 * sigma1)

## [,1]
## [1,] 4
Wait! What? For two different models this component is IDENTICAL!
ALGEBRA, please provide us with wisdom…
And algebra answers…“Yes I will!”
\(\small\frac{1}{2}\mathrm{tr}\left[\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\left(\mathbf{Y}-\mathbf{X\beta}\right)\right]\) must be a constant, specifically \(\small\frac{np}{2}\)
First, recall that: \(\small\mathbf{\hat{\Sigma}} = \frac{\left(\mathbf{Y} -\mathbf{X\hat{\beta}}\right)^T \left(\mathbf{Y} -\mathbf{X\hat{\beta}}\right)}{n}\)
Substituting this into the trace term gives \(\small\mathrm{tr}\left[\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\hat{\beta}}\right)^{T}\left(\mathbf{Y}-\mathbf{X\hat{\beta}}\right)\right] = \mathrm{tr}\left(n\mathbf{I}_p\right) = np\), so the component equals \(\small\frac{np}{2}\) for any model fit to the same data
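A quick numerical check of this identity on simulated data (all names and values here are illustrative, not from the source):

set.seed(2)
n <- 20; p <- 3
Y <- matrix(rnorm(n * p), n, p)          # simulated multivariate response
X <- cbind(1, rnorm(n))                  # intercept plus one predictor
B <- solve(t(X) %*% X) %*% t(X) %*% Y    # OLS coefficients
E <- Y - X %*% B                         # residual matrix
Sig <- t(E) %*% E / n                    # ML estimate of the error covariance
sum(diag(solve(Sig) %*% t(E) %*% E))     # equals n * p = 60, regardless of X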
Ok, so for OLS models, what does the likelihood really contain?
\[\small\log{L}=-\frac{np}{2}\log(2\pi)-\frac{n}{2}\log|\mathbf{\hat{\Sigma}}|-\frac{1}{2}\mathrm{tr}\left[\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\left(\mathbf{Y}-\mathbf{X\beta}\right)\right]\]

| Element | Component | Meaning |
|---|---|---|
| 1 | \(\small\frac{np}{2}\log(2\pi)\) | A constant |
| 2 | \(\small\frac{1}{2}\mathrm{tr}\left[\mathbf{\hat{\Sigma}}^{-1}\left(\mathbf{Y}-\mathbf{X\beta}\right)^{T}\left(\mathbf{Y}-\mathbf{X\beta}\right)\right]\) | A constant (\(\small\frac{np}{2}\)) |
| 3 | \(\small\frac{n}{2}\log|\mathbf{\hat{\Sigma}}|\) | The error covariance of the model |
This exploration reveals that the only component of \(\small\log{L}\) that is free to vary across models is \(\small\frac{n}{2}\log|\mathbf{\hat{\Sigma}}|\)
By extension, this implies that ALL parametric statistical hypothesis testing becomes a comparison of the error covariance of the model: lower error covariance means a better-fitting model.
As such, model comparisons (via LRT, AIC, etc.) are also based on this one component: the error covariance of the model.
Now let’s get back to model comparisons and see how this is done…
The formal method for comparing nested models (e.g., Y~X1+X2 vs. Y~X1) is a likelihood ratio test (LRT):
\[\small\log \left( \frac{\mathcal{L}\left(\mathbf{Y}| \mathbf{X}_f\right)} {\mathcal{L}\left(\mathbf{Y}| \mathbf{X}_r\right)} \right)=\]
\[\small-\frac{1}{2} \left( np + n\log|\mathbf{\hat{\Sigma}}_{f}| + np\log\left(2\pi \right ) \right) + \frac{1}{2} \left( np + n\log|\mathbf{\hat{\Sigma}}_{r}| + np\log\left(2\pi \right ) \right) =\]
Eliminate the constants and one has:
\[\small\frac{n}{2}\left(\log|\mathbf{\hat{\Sigma}}_{r}| - \log|\mathbf{\hat{\Sigma}}_{f}|\right)\]
For univariate data, we could say, \[\small{D} = 2\log\left(\frac{\mathcal{L}_f}{\mathcal{L}_r}\right) = n\left(\log|\mathbf{\hat{\Sigma}}_{r}| - \log|\mathbf{\hat{\Sigma}}_{f}|\right) = n\left(\log \hat{\sigma}_{r}^2 - \log \hat{\sigma}_{f}^2\right) \sim\chi^2(k_f-k_r)\]
where \(k_f\) and \(k_r\) correspond to the ranks of the two model design matrices, \(\mathbf{X}_f\) and \(\mathbf{X}_r\).
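In R, this test can be sketched for two nested univariate fits as follows (the object names fit.full and fit.red are hypothetical):

D <- as.numeric(2 * (logLik(fit.full) - logLik(fit.red)))        # LRT statistic
df <- attr(logLik(fit.full), "df") - attr(logLik(fit.red), "df") # difference in parameters
pchisq(D, df, lower.tail = FALSE)                                # p-value from chi-square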
\[\small{LRT}={2}\log\left(\frac{\mathcal{L}_f}{\mathcal{L}_r}\right) = n\log\left(\frac{|\mathbf{\hat{\Sigma}}_{r}|}{|\mathbf{\hat{\Sigma}}_{f}|}\right)=-n\log\Lambda_{Wilks}\]
where \(\small\Lambda_{Wilks}={|\mathbf{\hat{\Sigma}}_{f}|}/{|\mathbf{\hat{\Sigma}}_{r}|}\).
Thus, LRTs typically involve statistically comparing one model (e.g., a full model) with an alternative model that is nested within it (e.g., a reduced model).
The null hypothesis of an LRT is that the additional parameters in the full model offer no improvement over the simpler set of parameters in the reduced model. A significant LRT (rejection of the null hypothesis) implies that significantly more variation of the dependent variable was described by the additional parameters of the full model. Therefore, the full model is considered a significant improvement over the reduced model.
Patterns of phenotypic variation in the heads of LIVE rattlesnakes quantified
Compared several existing hypotheses which seek to describe phenotypic variation
Models considered:
Head ~ SVL + Sex + Region + Sex × Region
Head ~ SVL + Region
Head ~ SVL
Head ~ Region

LRTs:
Results: Full model MANOVA
Full <- lm(Y ~ svl + sex * region, model = TRUE, x = TRUE)
Red1 <- lm(Y ~ svl + region, model = TRUE, x = TRUE)
Red2 <- lm(Y ~ svl, model = TRUE, x = TRUE)
Red3 <- lm(Y ~ region, model = TRUE, x = TRUE)
anova(Full)

## Analysis of Variance Table
##
## Df Pillai approx F num Df den Df Pr(>F)
## (Intercept) 1 0.00000 0.0000 15 86 1.0000000
## svl 1 0.38713 3.6215 15 86 7.359e-05 ***
## sex 1 0.19622 1.3996 15 86 0.1659931
## region 2 0.56941 2.3085 30 174 0.0004171 ***
## sex:region 2 0.29323 0.9965 30 174 0.4787552
## Residuals 100
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
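The three nested comparisons that follow appear to have been produced by calls of this form, in this order (a sketch inferred from the model formulas printed in each table; the original calls are not shown in the source):

anova(Full, Red1)   # does adding Sex and Sex x Region improve on SVL + Region?
anova(Red1, Red2)   # does adding Region improve on SVL alone?
anova(Red1, Red3)   # does adding SVL improve on Region alone?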
## Analysis of Variance Table
##
## Model 1: Y ~ svl + sex * region
## Model 2: Y ~ svl + region
## Res.Df Df Gen.var. Pillai approx F num Df den Df Pr(>F)
## 1 100 7.3000e-05
## 2 103 3 7.3356e-05 0.4598 1.0619 45 264 0.375
## Analysis of Variance Table
##
## Model 1: Y ~ svl + region
## Model 2: Y ~ svl
## Res.Df Df Gen.var. Pillai approx F num Df den Df Pr(>F)
## 1 103 7.3356e-05
## 2 105 2 7.5192e-05 0.55423 2.3001 30 180 0.00042 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: Y ~ svl + region
## Model 2: Y ~ region
## Res.Df Df Gen.var. Pillai approx F num Df den Df Pr(>F)
## 1 103 7.3356e-05
## 2 104 1 7.4736e-05 0.34579 3.1361 15 89 0.0003966 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusions:
Sexual dimorphism does not contribute to phenotypic variation, but region and SVL are important
LRTs demonstrate that adding parameters for Sex and Sex × Region does not offer a significant improvement over a simpler model!
Advantages
Provides a statistical assessment of model difference, based on importance of parameters that differ between models – ELEGANT!
Can be performed on any level of hierarchical structure (e.g., a model with parameters A, B, C, D, E, F can be compared to one with parameters A and E, one with just F, or one with B, C, and D in the same manner).
Disadvantages
As long as the ratio of observations to parameters (n:kp) is large, significant differences between models are likely, even for small effects.
Models compared must be nested
\[\small{LRT}={2}\log\left(\frac{\mathcal{L}_f}{\mathcal{L}_r}\right) = n\log\left(\frac{|\mathbf{\hat{\Sigma}}_{r}|}{|\mathbf{\hat{\Sigma}}_{f}|}\right)=-n\log\Lambda_{Wilks}\]
Really? What about this equation implies models must be nested?
For nested models, \(\small{Q}={2}\log\left(\frac{\mathcal{L}_A}{\mathcal{L}_B}\right)\) is always positive.
For non-nested models, it can be negative, depending on which model is assigned as A and which as B, so direct comparison to a \(\small\chi^2\) distribution is not possible.
Procedure for LRT of non-nested models:
Simulate data under model A and generate a null distribution of \(\small{Q}_{rand}\)
Simulate data under model B and repeat

Possible outcomes:
\(\small{Q}_{obs}\) is significant when A is the null, but not when B is the null: B is preferred
\(\small{Q}_{obs}\) is significant when B is the null, but not when A is the null: A is preferred
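A minimal parametric-bootstrap sketch of this procedure, using made-up data and hypothetical fit names (nothing here comes from the source):

set.seed(1)
n <- 50
xA <- rnorm(n); xB <- rnorm(n)             # two non-nested predictors
y <- 1 + 0.5 * xA + rnorm(n)
fitA <- lm(y ~ xA); fitB <- lm(y ~ xB)
Q.obs <- as.numeric(2 * (logLik(fitA) - logLik(fitB)))
Q.rand <- replicate(999, {                 # null distribution of Q when A generates the data
  y.sim <- simulate(fitA)[[1]]
  as.numeric(2 * (logLik(lm(y.sim ~ xA)) - logLik(lm(y.sim ~ xB))))
})
mean(Q.rand <= Q.obs)                      # where Q.obs falls under the A-null; swap roles and repeat for B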
This approach compares models by ranking them based on some score: typically AIC (Akaike’s information criterion)
Model that explains the most information with the fewest parameters (a ‘parsimony’ criterion) is then chosen as the preferred model
IT scores “reward” an increase in likelihood, but “penalize” the increase in number of parameters
Does NOT use significance testing!
Instead, the best model is chosen when its difference from competing models is \(\small\Delta_{AIC} > 4\)
Goal of model comparison is to select the model that explains the most information with the fewest parameters
The most commonly used criterion to score and compare candidate models is:
\[\small\mathrm{AIC} = -2\log\mathcal{L} + 2K\]
where \(\mathcal{L}\) is the model likelihood, and \(\small{2K}\) is the parameter penalty, with \(K=k+1\) (the number of model parameters: \(\small{k}\) = rank of the design matrix, plus \(\small{1}\) for the residual model variance \(\small\sigma^2\)).
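As a quick check of this definition (using R’s built-in cars data, not the snake data), AIC can be computed by hand and compared with stats::AIC():

fit <- lm(dist ~ speed, data = cars)   # any univariate OLS fit will do
K <- length(coef(fit)) + 1             # k coefficients plus 1 for sigma^2
-2 * as.numeric(logLik(fit)) + 2 * K
AIC(fit)                               # same value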
Actually, this is the univariate version of AIC. The general multivariate form is (Bedrick and Tsai 1994):
\[\small\mathrm{AIC} = -2\log\mathcal{L} + 2[pk+0.5p(p+1)]\]
where the number of parameters for the matrix of coefficients for multivariate data is \(\small{pk}\) and the number of parameters for the residual covariance matrix is \(\small\frac{1}{2}p(p + 1)\).
AIC is essentially \(-2\) times the model log-likelihood plus a parameter penalty for model complexity. The quantification of this penalty is somewhat arbitrary, and various alternatives have been proposed:
\(\small\mathrm{AIC} = -2\log\mathcal{L} + 2K\)
\(\small\mathrm{AIC}_c = -2\log\mathcal{L} + 2K\frac{n}{n-K-1}\) (often considered a ‘sample-size correction’)
\(\small\mathrm{QAIC} = -2\log\mathcal{L}/c + 2K\) where \(\small{c}\) is a dispersion index based on the \(\small\chi^2\) distribution, to account for overdispersion
\(\small\mathrm{BIC} = -2\log\mathcal{L} + K\log{(n)}\) the Bayesian information criterion
And there are many, many others (see Burnham and Anderson 2002).
Revisit the snake data (patterns of phenotypic variation in rattlesnake heads)
Models
Head ~ SVL + Sex + Region + Sex × Region
Head ~ SVL + Region
Head ~ SVL
Head ~ Region

## df AIC dAIC
## maic.Full 7 -13144.65 34.73799
## maic.Red1 4 -13179.39 0.00000
## maic.Red2 2 -13168.85 10.53920
## maic.Red3 3 -13163.98 15.40285
Best model (as found by \(\small\Delta{AIC}\)) is Red1: Y~SVL + Region. No other model is within 4 \(\small\Delta{AIC}\) units, so this model is strongly preferred.
In the previous example, models were nested and LRT and IC gave the same results. That brings up the question: in such cases, which should be used (LRT or AIC)?
Burnham and Anderson (1998) argue that information-theoretic model selection (ITMS) should be performed in place of LRT.
From a mathematical perspective, AIC offers no advantage:
\[\small\Delta{AIC}=\left(n\log|\mathbf{\hat{\Sigma}}_{f}| + 2[pk_f+0.5p(p+1)]\right)-\left(n\log|\mathbf{\hat{\Sigma}}_{r}| + 2[pk_r+0.5p(p+1)]\right)= n\log\left(\frac{|\mathbf{\hat{\Sigma}}_{f}|}{|\mathbf{\hat{\Sigma}}_{r}|}\right) + 2p(k_f-k_r)\]
That is, \(\small\Delta{AIC}\) is the LRT statistic (with its sign reversed) plus a fixed parameter penalty; both are driven by the same comparison of error covariances.
Sometimes, a single best model cannot be identified (i.e., \(\small\Delta{AIC}\) are ‘close’)
For example:
| Model | \(\small\Delta{AIC}\) |
|---|---|
| A | 3 |
| A + B | 0 |
| C | 2 |
One way of looking at the support for different models is by using Akaike weights: \(\small{w}_i=\frac{\exp(-\frac{1}{2}\Delta{AIC}_i)}{\sum_{j=1}^{C}\exp(-\frac{1}{2}\Delta{AIC}_j)}\)
Model weight is akin to the probability of model \(\small{i}\) given the set of \(\small{C}\) candidate models
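A minimal sketch of this calculation (the function name is mine, not from the source):

# Akaike weights from a vector of delta-AIC values
akaike.weights <- function(dAIC) {
  rel <- exp(-0.5 * dAIC)   # relative likelihood of each model
  rel / sum(rel)            # normalize so the weights sum to 1
}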
| Model | \(\small\Delta{AIC}\) | \(\small{w}_i\) |
|---|---|---|
| A | 3 | 0.122 |
| A + B | 0 | 0.547 |
| C | 2 | 0.331 |
Model weights are often used as a means of performing model averaging:
Model averaging obtains a new model by combining parameters from all ‘close’ models weighted by their respective \(\small{w}_i\)
Each coefficient receives the summed weights of the models in which it appears: \[\small\beta^*=(0.122+0.547)\beta_A+0.547\beta_B+0.331\beta_C\]
Returning to the snake data (univariate this time), consider two models:
Y~SVL+Region+SVL:Region
Y~SVL+Region
## df AIC
## hs.svl.by.reg 7 1414.552
## hs.svl.reg 5 1413.114
Models have very similar AIC values. Obtain \(\Delta{AIC}\) , \(\small{w}_i\), and model averaged coefficients:
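The weights printed below can be reproduced with a few lines (a sketch; the object names are taken from the AIC table above):

aic <- c(AIC(hs.svl.reg), AIC(hs.svl.by.reg))    # simpler model first, matching the output order
dAIC <- aic - min(aic)
w <- exp(-0.5 * dAIC) / sum(exp(-0.5 * dAIC))
w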
## [1] 0.6723665 0.3276335
## (Intercept) svl regionEast regionWest svl:regionEast
## 432.698370 3.701737 -13.601998 -14.739733 -0.570554
## svl:regionWest
## -1.128373
Now fit averaged model and obtain AIC:
## [1] 1413.114 1414.552 1415.718
Here the averaged model attains a WORSE AIC than the models from which it was averaged! This makes no sense, as the point of model averaging is to arrive at a better model than those in the candidate set!
Finally, for this particular example (of 2 nested models), we know the preferred model. It is actually the simpler model:
## Analysis of Variance Table
##
## Response: hs
## Df Sum Sq Mean Sq F value Pr(>F)
## svl 1 293048 293048 9.7668 0.00232 **
## region 2 207559 103779 3.4588 0.03524 *
## svl:region 2 73442 36721 1.2238 0.29842
## Residuals 101 3030455 30005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The SVL:Region interaction was not a significant improvement in fit over the main-effects model. Thus, the interaction term should not be included!

Advantages
Can consider non-nested models
Penalizes model likelihood by the number of parameters needed in the model
Limitations
Choice of parameter penalty is arbitrary. Although there are sound statistical reasons to choose one type of information criterion (IC) over another, there are no statistical distributions for ICs, and ICs do not incorporate biological reasoning.
Indices are univariate even if the data are multivariate: ICs measure only the magnitude of deviation between models, not its direction in parameter space.
Disadvantages
Not really a statistical test, and proponents of ITMS have argued strongly that statistical tests are useless. However, the statistical theory of LRT is well-founded and does not necessarily differ from IC for nested models.
Model averaging – proceed with extreme caution IF AT ALL
The goal of model comparison is to select the best model from a set of candidate models. But what if all the models are garbage?
Cross-validation evaluates the robustness of a particular model
It is related to model comparison in that the model that is most robust (from a set of candidates) may be considered the best.
Procedure:
Divide the data into a training set and a hold-out (validation) set
Fit the candidate model to the training set
Predict the hold-out observations and record the prediction error
Repeat across many partitions of the data

A robust model is one where prediction error is minimized
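A minimal leave-one-out sketch of this procedure, on made-up data (none of the names or values below come from the source):

set.seed(1)
x <- runif(10, 0, 10)
y <- 2 + 1.5 * x + rnorm(10)                     # illustrative data only
cv.err <- sapply(seq_along(y), function(i) {
  m1 <- lm(y ~ x, subset = -i)                   # slope model, fit without observation i
  m0 <- lm(y ~ 1, subset = -i)                   # intercept-only model
  new <- data.frame(x = x[i])
  c(slope = as.numeric((y[i] - predict(m1, new))^2),
    intercept = as.numeric((y[i] - predict(m0, new))^2))
})
rowMeans(cv.err)                                 # mean squared prediction error per model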
Here is some data:
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 49.648 49.648 100.52 8.327e-06 ***
## Residuals 8 3.952 0.494
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[\tiny{Y}=\beta_0+\beta_1X\]
\[\tiny{Y}=\beta_0\]
## [1] 24.14600 43.36493
## [1] 1.5381977 0.8830653
Cross-validation demonstrates that \(\tiny{Y}=\beta_0+\beta_1X\) is a more robust model for these data.