Dean Adams, Iowa State University
A linear model where the response variable \(\small\mathbf{Y}\) is continuous, and \(\small\mathbf{X}\) contains one or more categorical factors (predictor variables). This is a comparison of groups
\[\mathbf{Y}=\mathbf{X}\mathbf{\beta } +\mathbf{E}\]
\(\small{H}_{0}\): No difference among groups. i.e., variation in \(\small\mathbf{Y}\) is not explained by \(\small\mathbf{X}\): i.e., \(\small{H}_{0}\): \(\small{SS}_{X}\sim{0}\)
\(\small{H}_{1}\): Differences exist among groups (i.e., group means differ from one another). i.e., some variation in \(\small\mathbf{Y}\) is explained by \(\small\mathbf{X}\): i.e., \(\small{H}_{1}\): \(\small{SS}_{X}>0\)
Parameters: model coefficients \(\small\hat\beta\) represent components of the group means relative to the overall mean
Note: Model also written as: \(\small{Y}_{ij}=\mu+\alpha_{i}+\epsilon_{ij}\) (\(\small{\mu}\) is the grand mean, \(\small{\alpha}_{i}\) is a component of the mean for the \(\small{i}^{th}\) group, and \(\small\epsilon_{ij}\) is the residual error)
1: Independence: \(\small\epsilon_{ij}\) of objects must be independent
2: Normality: requires normally distributed \(\small\epsilon_{ij}\)
3: Homoscedasticity: equal variance
\[\small{SST}=\sum_{i=1}^{a}\sum_{j=1}^{n_i}\left({Y}_{ij}-\overline{\overline{Y}}\right)^{2}\]
\[\small{MS}=\hat{\sigma}^{2}=\frac{SS}{df}\quad\left(\text{e.g., } df=n-1 \text{ for } SST\right)\]
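As a quick sketch (with made-up data), the total \(SS\) above can be computed directly, and it partitions into the group and residual \(SS\) returned by `anova()`:

```r
# Hypothetical toy data: two groups of three observations
y <- c(4, 5, 3, 7, 8, 6)
g <- factor(c("a", "a", "a", "b", "b", "b"))

SST <- sum((y - mean(y))^2)      # total SS about the grand mean
tab <- anova(lm(y ~ g))          # one-way ANOVA table
SSX <- tab$"Sum Sq"[1]           # SS explained by groups
SSE <- tab$"Sum Sq"[2]           # residual SS
all.equal(SST, SSX + SSE)        # TRUE: SST = SS_X + SS_E
```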
Model contains a single grouping variable (e.g., species or sex)
Procedure
Do male and female sparrows differ in total length?
## Analysis of Variance Table
##
## Response: bumpus$TL
## Df Sum Sq Mean Sq F value Pr(>F)
## bumpus$sex 1 187.49 187.491 16.483 8.304e-05 ***
## Residuals 134 1524.24 11.375
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Welch Two Sample t-test
##
## data: Y by Group
## t = -4.4721, df = 8, p-value = 0.002077
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -3.0312764 -0.9687236
## sample estimates:
## mean in group 1 mean in group 2
## 4 6
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 10 10.0 20 0.002077 **
## Residuals 8 4 0.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[\tiny{t}^{2}=\frac{\left(\overline{Y}_{1}-\overline{Y}_{2}\right)^{2}}{s_{p}^{2}\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)}=\frac{n\left(\overline{Y}_{1}-\overline{Y}_{2}\right)^{2}}{2s_{p}^{2}}\]
- Numerator (including the factor 2 from the denominator) is equivalent to the numerator in ANOVA:
\[\tiny\frac{n}{2}\left(\overline{Y}_{1}-\overline{Y}_{2}\right)^{2}=n\left(\overline{Y}_{1}-\overline{\overline{Y}}\right)^{2}+n\left(\overline{Y}_{2}-\overline{\overline{Y}}\right)^{2}=\sum_{i=1}^{2}n\left({\overline{Y}_{i}-\overline{\overline{Y}}}\right)^{2}\]
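A minimal sketch (hypothetical data mirroring the output above, with group means 4 and 6 and \(n=5\) per group) confirms the equivalence of \(t^{2}\) and \(F\):

```r
# Toy two-group data; group means are 4 and 6, pooled variance 0.5
y   <- c(3, 4, 5, 4, 4, 5, 6, 7, 6, 6)
grp <- factor(rep(c("1", "2"), each = 5))

t_stat <- t.test(y ~ grp, var.equal = TRUE)$statistic  # pooled-SD t-test
F_stat <- anova(lm(y ~ grp))$"F value"[1]              # one-way ANOVA F
unname(t_stat)^2                                       # 20
F_stat                                                 # 20: t^2 = F
```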
a groups (levels) must be represented by a series of a-1 columns that code the groups in binary fashion
## [1] 1 1 1 2 2 2 3 3 3
## Levels: 1 2 3
## (Intercept) x12 x13
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 1 0
## 5 1 1 0
## 6 1 1 0
## 7 1 0 1
## 8 1 0 1
## 9 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$x1
## [1] "contr.treatment"
## x11 x12 x13
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 0 1 0
## 5 0 1 0
## 6 0 1 0
## 7 0 0 1
## 8 0 0 1
## 9 0 0 1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$x1
## [1] "contr.treatment"
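The two codings shown above can be reproduced for a hypothetical 3-level factor:

```r
# Hypothetical factor with a = 3 levels, n = 3 per level
x1 <- factor(rep(1:3, each = 3))

model.matrix(~ x1)       # treatment (dummy) coding: intercept + a-1 columns
model.matrix(~ x1 + 0)   # cell-means coding: one indicator column per group
```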
What do the parameters \(\beta\) represent in ANOVA?
\(\beta\) contains the first group mean, and differences from this mean to the others
Example:
## (Intercept) gp2 gp3 gp4
## 5.0 -4.0 -2.0 0.5
## 1 2 3 4
## 5.0 1.0 3.0 5.5
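To see this concretely, here is a hypothetical data set whose four group means are 5, 1, 3, and 5.5, matching the coefficients shown above:

```r
# Two observations per group; group means are 5, 1, 3, 5.5
gp <- factor(rep(1:4, each = 2))
y  <- c(4.5, 5.5, 0.5, 1.5, 2.5, 3.5, 5.0, 6.0)

b <- coef(lm(y ~ gp))
b                         # (Intercept) gp2 gp3 gp4:  5.0 -4.0 -2.0 0.5
c(b[1], b[1] + b[-1])     # recover the group means:  5.0  1.0  3.0 5.5
```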
Multiple testing of data increases chance of ‘finding’ significance, so adjust \(\small\alpha_{critical}\) for number of comparisons (k)
Various approaches
Bonferroni is considered quite conservative; sequential Bonferroni (Holm) much less so
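As a sketch with hypothetical p-values for k = 4 comparisons, R's `p.adjust()` implements both approaches:

```r
# Four hypothetical unadjusted p-values
p <- c(0.001, 0.012, 0.030, 0.210)

p.adjust(p, method = "bonferroni")  # each p multiplied by k (capped at 1)
p.adjust(p, method = "holm")        # sequential Bonferroni: less conservative
```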
## Analysis of Variance Table
##
## Response: bumpus$TL
## Df Sum Sq Mean Sq F value Pr(>F)
## gp 3 369.49 123.163 12.112 4.712e-07 ***
## Residuals 132 1342.25 10.169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Pairwise comparisons using t tests with pooled SD
##
## data: bumpus$TL and gp
##
## f FALSE f TRUE m FALSE
## f TRUE 0.257 - -
## m FALSE 1.2e-05 3.4e-07 -
## m TRUE 0.273 0.025 7.9e-05
##
## P value adjustment method: none
res <- summary(pairwise(lm.rrpp(TL ~ gp, data = mydat, print.progress = FALSE), groups = gp), test.type = "dist", stat.table = FALSE)
res$pairwise.tables$D # Distance matrix
## f FALSE f TRUE m FALSE m TRUE
## f FALSE 0.0000000 1.047619 3.654762 0.8263305
## f TRUE 1.0476190 0.000000 4.702381 1.8739496
## m FALSE 3.6547619 4.702381 0.000000 2.8284314
## m TRUE 0.8263305 1.873950 2.828431 0.0000000
## f FALSE f TRUE m FALSE m TRUE
## f FALSE 0.0000000 0.5913741 3.128898 0.5217306
## f TRUE 0.5913741 0.0000000 3.505043 1.6195253
## m FALSE 3.1288979 3.5050431 0.000000 2.7630346
## m TRUE 0.5217306 1.6195253 2.763035 0.0000000
## f FALSE f TRUE m FALSE m TRUE
## f FALSE 1.0000 0.3115 0.001 0.331
## f TRUE 0.3115 1.0000 0.001 0.043
## m FALSE 0.0010 0.0010 1.000 0.001
## m TRUE 0.3310 0.0430 0.001 1.000
ANOVA with multiple \(\small{X}\) variables (e.g., species AND sex)
R-code: Y ~ A+B
2 factor Model: \(\small{Y}=\beta_{0}+X_{1}\beta_{1}+X_{2}\beta_{2}+\epsilon\)
For the two factors (A and B), \(\small{SS}_{A}\) & \(\small{SS}_{B}\) are independent of one another, and are thus calculated as before: e.g. \[\tiny{SSA}=\sum_{i=1}^{a}{n}_{i}\left({\overline{Y}_{i}-\overline{\overline{Y}}}\right)^{2}\]
Find means of rows, columns, and individual cells; comparison of these yields tests for Factors A & B
NOTE: This ignores any interaction between A & B (A:B)
Interactions measure the joint effect of main effects A & B
R-code: Y ~ A+B + A:B implying: \(\small{Y}=\beta_{0}+X_{1}\beta_{1}+X_{2}\beta_{2}+X_{1:2}\beta_{1:2}+\epsilon\)
A:B identifies whether response to A is dependent on level of B, found as: \(\small{SS}_{A:B}=\sum_{i=1}^{a}\sum_{j=1}^{b}{n}_{ij}\left(\overline{Y}_{AB}-\overline{Y}_{A_{i}}-\overline{Y}_{B_{j}}+\overline{\overline{Y}}\right)^{2}\)
Significant interaction: main effects not interpretable without clarification (e.g., species 2 has a higher growth rate than species 1 ONLY in wet environments…)
Interactions are VERY common in biology
\[\tiny{Y}=\beta_{0}+X_{1}\beta_{1}+X_{2}\beta_{2}+X_{3}\beta_{3}+X_{1:2}\beta_{1:2}+X_{1:3}\beta_{1:3}+X_{2:3}\beta_{2:3}+X_{1:2:3}\beta_{1:2:3}+\epsilon\]
Adding factors also means adding many interactions
Inclusion of additional factors allows one to ‘account for’ their variation in the analysis (e.g., evaluate variation in sex while taking species into consideration), and allows assessment of interactions between factors, but requires larger \(\small{N}\)
In general, assess significance of interactions, make decisions regarding whether to pool non-significant interaction \(\small{MS}\) with \(\small{MSE}\), then interpret main effects whenever possible
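A minimal simulated sketch of the workflow (hypothetical factors A and B standing in for species and sex):

```r
# Simulate a balanced two-factor design and fit the factorial model
set.seed(1)
A <- factor(rep(c("sp1", "sp2"), each = 20))
B <- factor(rep(c("f", "m"), times = 20))
y <- 10 + 2 * (A == "sp2") + 1 * (B == "m") + rnorm(40)

anova(lm(y ~ A * B))   # rows: A, B, A:B, Residuals (sequential SS)
```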
## Analysis of Variance Table
##
## Response: TL
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 187.49 187.491 18.4384 3.374e-05 ***
## surv 1 157.74 157.738 15.5123 0.0001321 ***
## sex:surv 1 24.26 24.260 2.3858 0.1248339
## Residuals 132 1342.25 10.169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Pairwise comparisons using t tests with pooled SD
##
## data: TL and gp
##
## f FALSE f TRUE m FALSE
## f TRUE 0.257 - -
## m FALSE 1.2e-05 3.4e-07 -
## m TRUE 0.273 0.025 7.9e-05
##
## P value adjustment method: none
## Df SS MS Rsq F Z Pr(>F)
## sex 1 187.49 187.491 0.10953 18.4384 3.0977 0.001 **
## surv 1 157.74 157.738 0.09215 15.5123 2.9439 0.001 **
## sex:surv 1 24.26 24.260 0.01417 2.3858 1.2237 0.114
## Residuals 132 1342.25 10.169 0.78414
## Total 135 1711.74
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(pairwise(lm.rrpp(TL ~ gp, data = mydat, print.progress = FALSE), groups = gp), test.type = "dist")
##
## Pairwise comparisons
##
## Groups: f FALSE f TRUE m FALSE m TRUE
##
## RRPP: 1000 permutations
##
## LS means:
## Vectors hidden (use show.vectors = TRUE to view)
##
## Pairwise distances between means, plus statistics
## d UCL (95%) Z Pr > d
## f FALSE:f TRUE 1.0476190 1.893452 0.5913741 0.3115
## f FALSE:m FALSE 3.6547619 1.738294 3.1288979 0.0010
## f FALSE:m TRUE 0.8263305 1.730462 0.5217306 0.3310
## f TRUE:m FALSE 4.7023810 1.904762 3.5050431 0.0010
## f TRUE:m TRUE 1.8739496 1.827311 1.6195253 0.0430
## m FALSE:m TRUE 2.8284314 1.568873 2.7630346 0.0010
Fixed effects: Factors containing a fixed set of levels whose treatment effects have been specified (e.g., male/female). THIS IS WHAT WE’VE CONSIDERED THUS FAR.
Random effects: Factors whose levels have been chosen at random to represent variation attributed to the factor. Here, variation among groups is due to specified treatment effects and a random component NOT fixed (modeled) by the experiment or statistical design.
Groups are chosen as a random sample from a larger population of possible groups
In biology, subject or family is often a random effect (e.g., we select a set of individuals to characterize variation that may be typical among individuals)
Evaluating model effects requires determination of whether they are fixed or random effects, as this impacts the Expected Mean Squares against which they should be compared
EMS describe the sources of variation for the effects in the model, and the sources against which effects are evaluated (i.e., Error \(\small{MS}\) for each term).
This relates to the REDUCED model \(\small{H}_{R}\) against which each term is compared (recall, we are evaluating variation of factor effect versus variation not including that factor)
Factorial model, all fixed effects
A is fixed, B is random
One common mixed model contains nested factors (where levels of one factor are nested within levels of another factor)
Here factors are not independent, but are hierarchical: levels of B are ‘nested’ within levels of A (e.g., habitats (A) measured from multiple trees (B))
An example
## Df SS MS Rsq F Z Pr(>F)
## sex 1 187.49 187.491 0.10953 2.0604 0.4877 0.309
## sex:surv 2 182.00 90.999 0.10632 8.9491 3.2517 0.001 **
## Residuals 132 1342.25 10.169 0.78414
## Total 135 1711.74
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compared with the same data as a factorial model
## Df SS MS Rsq F Z Pr(>F)
## sex 1 187.49 187.491 0.10953 18.4384 3.0977 0.001 **
## surv 1 157.74 157.738 0.09215 15.5123 2.9439 0.001 **
## sex:surv 1 24.26 24.260 0.01417 2.3858 1.2237 0.114
## Residuals 132 1342.25 10.169 0.78414
## Total 135 1711.74
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
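In R's formula syntax, nesting is specified with `/`; a sketch with hypothetical factors shows how the model terms differ from the factorial case:

```r
# A/B expands to A + A:B (no separate main effect of B)
set.seed(2)
A <- factor(rep(c("a1", "a2"), each = 10))
B <- factor(rep(c("b1", "b2"), times = 10))
y <- rnorm(20)

rownames(anova(lm(y ~ A / B)))  # "A"  "A:B"  "Residuals"
rownames(anova(lm(y ~ A * B)))  # "A"  "B"  "A:B"  "Residuals"
```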
Is sex an independent factor, or nested within species?
- Levels of sex correspond across species
- sex is a fixed effect
- sex:species interaction term is meaningful

Is tree nested within habitat (for the CO2 experiment), or is it an independent factor?
- Levels of tree are not common across habitats
- tree is a random effect (trees selected as representatives of the sample)
- tree:habitat interaction is not biologically meaningful

Linear mixed models contain both fixed and random effects
In biology, they are especially useful to account for subject-level variation, e.g., when subjects differ by slope and intercept (fit a random slope/intercept model)
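The output below is from the sleepstudy example in the lme4 package; assuming lme4 is installed, a call like this produces it:

```r
library(lme4)   # mixed-model package (assumed installed)
data(sleepstudy)

# Fixed effect of Days; random intercept and slope for Days within Subject
m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
anova(m)        # fixed-effect ANOVA table
summary(m)      # REML fit, random-effect variances, fixed effects
```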
## Analysis of Variance Table
## npar Sum Sq Mean Sq F value
## Days 1 30031 30031 45.853
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
## Data: sleepstudy
##
## REML criterion at convergence: 1743.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9536 -0.4634 0.0231 0.4634 5.1793
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Subject (Intercept) 612.10 24.741
## Days 35.07 5.922 0.07
## Residual 654.94 25.592
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.405 6.825 36.838
## Days 10.467 1.546 6.771
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.138
## Analysis of Variance Table
##
## Response: bumpus$TL
## Df Sum Sq Mean Sq F value Pr(>F)
## bumpus$sex 1 187.49 187.491 18.4384 3.374e-05 ***
## bumpus$surv 1 157.74 157.738 15.5123 0.0001321 ***
## bumpus$sex:bumpus$surv 1 24.26 24.260 2.3858 0.1248339
## Residuals 132 1342.25 10.169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: bumpus$TL
## Df Sum Sq Mean Sq F value Pr(>F)
## bumpus$surv 1 106.88 106.876 10.5105 0.001503 **
## bumpus$sex 1 238.35 238.353 23.4403 3.545e-06 ***
## bumpus$surv:bumpus$sex 1 24.26 24.260 2.3858 0.124834
## Residuals 132 1342.25 10.169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For a model containing two factors (sex, and population), consider the following model comparisons (\(\small{H}_{F}\) vs. \(\small{H}_{R}\)):
| Effect | Reduced | Full | Purpose |
|---|---|---|---|
| Sex | Null | Null + Sex | Determine if Sex mean difference is “significant” compared to an overall mean |
| Pop | Null + Sex | Null + Sex + Pop | Determine if population differences are significant, after accounting for general difference between males and females |
| Sex:Pop | Null + Sex + Pop | Null + Sex + Pop + Sex:Pop | Determine if sexual dimorphism varies between populations, after accounting for general difference in sex and population. |
| Full model | Null | Null + Sex + Pop + Sex:Pop | Ascertain if the model is significant (if not, the other parts should not be considered significant either) |
Estimating \(SS\) between models for the first three model comparisons is called Sequential (Type I) \(SS\) estimation. We start with a null model and build up to the fullest model.
Alternatively, model comparisons (\(\small{H}_{F}\) vs. \(\small{H}_{R}\)) could be constructed as follows:
| Effect | Reduced | Full | Purpose |
|---|---|---|---|
| Sex | Null + Pop | Null + Pop + Sex | Determine if Sex mean difference is “significant”, after accounting for population mean difference |
| Pop | Null + Sex | Null + Sex + Pop | Determine if population differences are significant, after accounting for general difference between males and females |
| Sex:Pop | Null + Sex + Pop | Null + Sex + Pop + Sex:Pop | Determine if sexual dimorphism varies between populations, after accounting for general difference in sex and population. |
| Full model | Null | Null + Sex + Pop + Sex:Pop | Ascertain if the model is significant (if not, the other parts should not be considered significant either) |
Estimating \(SS\) between models for the first three model comparisons is called Conditional (Type II) \(SS\) estimation. We analyze effects with respect to other effects, but do not include interactions with targeted effects.
Or, model comparisons (\(\small{H}_{F}\) vs. \(\small{H}_{R}\)) could be constructed as follows:
| Effect | Reduced | Full | Purpose |
|---|---|---|---|
| Sex | Null + Pop + Pop:Sex | Null + Pop + Sex + Pop:Sex | Determine if Sex mean difference is “significant”, after accounting for all other effects |
| Pop | Null + Sex + Pop:Sex | Null + Sex + Pop + Pop:Sex | Determine if population differences are significant, after accounting for all other effects. |
| Sex:Pop | Null + Sex + Pop | Null + Sex + Pop + Sex:Pop | Determine if sexual dimorphism varies between populations, after accounting for general difference in sex and population. |
| Full model | Null | Null + Sex + Pop + Sex:Pop | Ascertain if the model is significant (if not, the other parts should not be considered significant either) |
Estimating \(SS\) between models for the first three model comparisons is called Marginal (Type III) \(SS\) estimation. Whereas Type I \(SS\) evaluates the importance of effects through inclusion, Type III \(SS\) evaluates the importance of effects through exclusion.
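A simulated sketch (hypothetical unbalanced factors) shows why the distinction matters: sequential SS depend on the order of terms, while conditional estimates do not:

```r
# Unbalanced (non-proportional) design so that A and B are non-orthogonal
set.seed(3)
A <- factor(c(rep("a1", 6), rep("a2", 14)))
B <- factor(c(rep("b1", 4), rep("b2", 2), rep("b1", 4), rep("b2", 10)))
y <- rnorm(20) + (A == "a2") + (B == "b2")

anova(lm(y ~ A + B))              # Type I: SS for A ignoring B, then B given A
anova(lm(y ~ B + A))              # Type I in reverse order: SS for A differ
drop1(lm(y ~ A + B), test = "F")  # each main effect given the other
```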