Dean Adams, Iowa State University
A linear model where the response variable \(\small\mathbf{Y}\) is continuous, and \(\small\mathbf{X}\) contains one or more continuous covariates (predictor variables)
\[\mathbf{Y}=\mathbf{X}\mathbf{\beta } +\mathbf{E}\]
\(\small{H}_{0}\): No covariation between \(\small\mathbf{Y}\) & \(\small\mathbf{X}\). More formally, variation in \(\small\mathbf{Y}\) is not explained by \(\small\mathbf{X}\): i.e., \(\small{H}_{0}\): \(\small{SS}_{X}\sim{0}\)
\(\small{H}_{1}\): Covariation is present between \(\small\mathbf{Y}\) & \(\small\mathbf{X}\). More formally, some variation in \(\small\mathbf{Y}\) is explained by \(\small\mathbf{X}\): i.e., \(\small{H}_{1}\): \(\small{SS}_{X}>0\)
Parameters: model coefficients \(\small\hat\beta\) represent slopes describing the relationship between \(\small\mathbf{Y}\) & \(\small\mathbf{X}\)
Model written as: \(\small{Y}=\beta_{0}+X_{1}\beta_{1}+\epsilon\)
1: Independence: residuals \(\small\epsilon_{ij}\) must be independent across objects
2: Normality: requires normally distributed \(\small\epsilon_{ij}\)
3: Homoscedasticity: residuals have equal variance across the range of \(\small{X}\)
4: \(\small{X}\) values are independent and measured without error
Fit the line that minimizes the sum of squared deviations (LS fit) from the \(\small{Y}\)-variates to the line (vertical deviations, because \(\small{X}\) is assumed to have no error)
Slope calculated as: \(\small{\beta}_{1}=\beta_{YX}=\frac{\sum\left(X-\overline{X}\right)\left(Y-\overline{Y}\right)}{\sum\left(X-\overline{X}\right)^{2}}\)
Regression line always crosses \(\small{\left(\overline{X},\overline{Y}\right)}\) so intercept is: \(\small{\beta}_{0}=\overline{Y}-\beta_{1}\overline{X}\)
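A minimal R sketch of these two formulas, using hypothetical data vectors x and y:

# LS slope and intercept computed directly from the formulas above
x <- c(2, 4, 5, 7, 9)                         # hypothetical covariate values
y <- c(1.1, 2.3, 2.8, 4.2, 5.1)               # hypothetical responses
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)                  # line passes through (x-bar, y-bar)
c(b0, b1)                                     # matches coef(lm(y ~ x))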
SS explained by the model: \(\small{SSM}=\sum\left(\hat{Y}-\overline{Y}\right)^{2}\)
SS residual error: \(\small{SSE}=\sum\left(Y_{i}-\hat{Y}\right)^{2}\)
Convert to variances (mean squares), calculate \(\small{F}=MSM/MSE\), and assess significance with \(\small\left(df_{1},df_{2}\right)\) degrees of freedom (here \(\small{1}\) and \(\small{n-2}\))
\(\small{F}\)-test assesses variance explained, but not how much \(\small{Y}\) changes with \(\small{X}\)
Can evaluate the model parameters separately
Slope test (\(\small{\beta}_{1}\neq0\)): \(\small{t}=\frac{\beta_{1}-0}{s_{\beta_{1}}}\) with \(\tiny{s}_{\beta_{1}}=\sqrt{\frac{MSE}{\sum\left(X_{i}-\overline{X}\right)^{2}}}\) & \(\tiny{n-2}\) \(\small{df}\)
Intercept test (\(\small{\beta}_{0}\neq0\)): \(\small{t}=\frac{\beta_{0}-0}{s_{\beta_{0}}}\) with \(\tiny{s}_{\beta_{0}}=\sqrt{MSE\Big[\frac{1}{n}+\frac{\overline{X}^{2}}{\sum\left(X_{i}-\overline{X}\right)^{2}}\Big]}\) & \(\tiny{n-2}\) \(\small{df}\)
Can also test against particular values (e.g., isometry: is \(\small\beta_{1}=1\)?)
Very useful for certain biological hypotheses
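A minimal sketch of these tests, assuming a hypothetical data frame dat with response Y and covariate X1; the permutation-based table below appears to come from lm.rrpp in the RRPP package:

# OLS fit: anova() gives the F-table, summary() the t-tests, coef() the estimates
fit <- lm(Y ~ X1, data = dat)
anova(fit)                 # F-test of variance explained
summary(fit)               # t-tests of beta_0 and beta_1 against 0
coef(fit)

# test the slope against a particular value (e.g., isometry: beta_1 = 1)
b1 <- coef(fit)[2]
se.b1 <- summary(fit)$coefficients[2, 2]
t.iso <- (b1 - 1) / se.b1
2 * pt(abs(t.iso), df = fit$df.residual, lower.tail = FALSE)

# residual-randomization (RRPP) version of the same model
library(RRPP)
rdf <- rrpp.data.frame(Y = dat$Y, X1 = dat$X1)
fit.r <- lm.rrpp(Y ~ X1, data = rdf, iter = 999, print.progress = FALSE)
anova(fit.r)
coef(fit.r)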
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X1 1 0.032412 0.032412 124.01 < 2.2e-16 ***
## Residuals 134 0.035023 0.000261
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Intercept) X1
## 1.3044593 0.6847984
## Df SS MS Rsq F Z Pr(>F)
## X1 1 0.032412 0.032412 0.48063 124.01 5.9368 0.001 **
## Residuals 134 0.035023 0.000261 0.51937
## Total 135 0.067435
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [,1]
## (Intercept) 1.3044593
## X1 0.6847984
When both \(\small{X}\) & \(\small{Y}\) contain measurement error, model I regression underestimates the slope
Model II regression accounts for this by minimizing deviations perpendicular to the regression line (not in the \(\small{Y}\) direction only)
Different types of model II regression exist, depending on the data (major axis, reduced major axis, etc.)
\(\small{X}\) & \(\small{Y}\) in ‘same’ units/scale: major axis (MA) regression (equivalent to PCA)
\(\small{X}\) & \(\small{Y}\) not in same units/scale: reduced (standardized) major axis regression (RMA/SMA)
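The output below appears to be from the lmodel2 package; a minimal sketch, assuming the same hypothetical dat, Y, and X1 as above:

library(lmodel2)
# reports OLS, MA, and SMA fits; nperm > 0 adds the one-tailed permutation test
m2 <- lmodel2(Y ~ X1, data = dat, nperm = 999)
m2          # regression results and confidence intervals for each method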
## RMA was not requested: it will not be computed.
## Method Intercept Slope Angle (degrees) P-perm (1-tailed)
## 1 OLS 1.3044593 0.6847984 34.40328 0.001
## 2 MA -0.3329159 0.9824064 44.49152 0.001
## 3 SMA -0.3624212 0.9877692 44.64746 NA
## Method 2.5%-Intercept 97.5%-Intercept 2.5%-Slope 97.5%-Slope
## 1 OLS 0.6352913 1.9736273 0.5631720 0.8064248
## 2 MA -1.3892010 0.5532596 0.8213358 1.1743959
## 3 SMA -1.0726264 0.2656984 0.8736027 1.1168556
Justification for using model II regression has been overplayed in biology
What matters is NOT whether \(\small{X}\) has measurement error: \(\small{s^{2}_{\epsilon_{X}}}\)
Instead, Model I regression is only problematic when the measurement-error variance is large relative to the variance in \(\small{X}\): i.e., when \(\small{s^{2}_{\epsilon_{X}}}/s^{2}_{X}\) is large
Predict \(\small{Y}\) using several \(\small{X}\) variables simultaneously
Model: \(\small{Y}=\beta_{0}+X_{1}\beta_{1}+X_{2}\beta_{2}+\dots+\epsilon\)
\(\small{\beta}_{i}\) are partial regression coefficients (effect of \(\small{X}_{i}\) while holding effects of other \(\small{X}\) constant)
For 2 \(\small{X}\) variables, think of fitting a plane to the data
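A minimal sketch, adding a second hypothetical covariate X2 to the same dat; in lm.rrpp, SS.type switches between sequential (type I) and marginal (type II) sums of squares, as in the two tables below:

# OLS multiple regression: sequential (type I) SS
fit2 <- lm(Y ~ X1 + X2, data = dat)
anova(fit2)
summary(fit2)$r.squared         # proportion of variation explained

# RRPP versions with type I vs type II SS
library(RRPP)
rdf2 <- rrpp.data.frame(Y = dat$Y, X1 = dat$X1, X2 = dat$X2)
anova(lm.rrpp(Y ~ X1 + X2, data = rdf2, SS.type = "I", print.progress = FALSE))
anova(lm.rrpp(Y ~ X1 + X2, data = rdf2, SS.type = "II", print.progress = FALSE))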
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X1 1 0.032412 0.032412 137.118 < 2.2e-16 ***
## X2 1 0.003585 0.003585 15.168 0.0001551 ***
## Residuals 133 0.031438 0.000236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df SS MS Rsq F Z Pr(>F)
## X1 1 0.032412 0.032412 0.48063 137.118 6.1079 0.001 **
## X2 1 0.003585 0.003585 0.05317 15.168 2.9586 0.001 **
## Residuals 133 0.031438 0.000236 0.46620
## Total 135 0.067435
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 0.573966
## Df SS MS Rsq F Z Pr(>F)
## X1 1 0.012782 0.0127819 0.18954 54.074 4.4652 0.001 **
## X2 1 0.003585 0.0035853 0.05317 15.168 2.9586 0.001 **
## Residuals 133 0.031438 0.0002364 0.46620
## Total 135 0.067435
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For all models, \(\small{R}^{2}\) is the proportion of variation explained by the model
Should I add another variable to my model?
Important related topic: model comparison. Which explanatory model best explains variation in \(\small{Y}\)? We discuss this topic later in the semester
Say one has two regressions (e.g., allometry patterns in each of two species). Can these trends be compared?
To compare the slopes of two regression lines, calculate an \(\small{F}\)-ratio as:
\[\tiny{F}=\frac{\left(\beta_{1}-\beta_{2}\right)^{2}}{\overline{s}^{2}_{YX}\frac{\sum\left(X_{1}-\overline{X}_{1}\right)^{2}+\sum\left(X_{2}-\overline{X}_{2}\right)^{2}}{\sum\left(X_{1}-\overline{X}_{1}\right)^{2}\sum\left(X_{2}-\overline{X}_{2}\right)^{2}}}\]
Where \(\small{\overline{s}^{2}_{YX}}\) is the pooled (weighted average) of the two residual variances \(\small{s}^{2}_{YX}\), and \(\small{df}=1,(n_{1}+n_{2}-4)\)
Procedure can be generalized to compare > 2 regression lines (see Biometry)
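A minimal sketch implementing the F-ratio above (hypothetical helper; y1, x1 and y2, x2 hold the data for the two groups):

# F-test for equality of two regression slopes
compare.slopes <- function(y1, x1, y2, x2) {
  f1 <- lm(y1 ~ x1); f2 <- lm(y2 ~ x2)
  ssx1 <- sum((x1 - mean(x1))^2)
  ssx2 <- sum((x2 - mean(x2))^2)
  df2 <- length(y1) + length(y2) - 4
  # pooled (weighted average) residual variance
  s2 <- (sum(resid(f1)^2) + sum(resid(f2)^2)) / df2
  Fval <- (coef(f1)[2] - coef(f2)[2])^2 / (s2 * (1 / ssx1 + 1 / ssx2))
  c(F = unname(Fval), df1 = 1, df2 = df2,
    P = unname(pf(Fval, 1, df2, lower.tail = FALSE)))
}

Equivalently, fitting lm(y ~ x * group) to the combined data and testing the x:group interaction gives the same hypothesis test.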
ANCOVA is a linear model containing both a categorical and a continuous \(\small{X}\) explanatory variable (i.e., a ‘combination’ of ANOVA and regression)
\(\small{H}_{0}\): no differences among slopes, no differences among groups
Must first compare slopes, then compare groups (ANOVA) while holding the effect of the covariate constant
Several possible outcomes to ANCOVA
\(\beta\) contains components of adjusted least-squares (LS) means and group slopes
y <- c(6, 4, 0, 2, 3, 3, 4, 7)
x <- c(7, 8, 2, 3, 5, 4, 3, 6)
gp <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))
df <- data.frame(x = x, y = y, gp = gp)
fit <- lm(y ~ x * gp, data = df)
coef(fit)
## (Intercept)           x         gp2       x:gp2
##  -0.8461538   0.7692308   1.0461538   0.1307692
Fitting group 2 alone recovers its own intercept and slope (\(\small-0.8461538+1.0461538=0.2\); \(\small0.7692308+0.1307692=0.9\)):
## (Intercept)           x
##         0.2         0.9
Per-group coefficients side by side:
##                      1    2
## (Intercept) -0.8461538  0.2
## x            0.7692308  0.9
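A quick way to produce the per-group fits shown above (a sketch using the same df):

# separate OLS fits within each group match the interaction-model coefficients
coef(lm(y ~ x, data = subset(df, gp == 2)))
sapply(levels(df$gp), function(g) coef(lm(y ~ x, data = subset(df, gp == g))))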
ANCOVA explicitly evaluates a series of sequential hypotheses
1: Are slopes for groups different? (\(\small{SS}_{cov:group} > 0\)?)
2: If the cov:group interaction is NOT significant, we remove that effect and fit a common-slopes model: ‘Y ~ cov + group’. This evaluates whether the adjusted LS-means of the groups differ while accounting for a common covariate (see the sketch below)
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X2 1 0.023215 0.0232149 77.3890 8.139e-15 ***
## SexBySurv 3 0.005172 0.0017239 5.7467 0.001014 **
## X2:SexBySurv 3 0.000651 0.0002172 0.7239 0.539496
## Residuals 128 0.038397 0.0003000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X2 1 0.023215 0.0232149 77.8814 6.015e-15 ***
## SexBySurv 3 0.005172 0.0017239 5.7833 0.0009591 ***
## Residuals 131 0.039048 0.0002981
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Pairwise comparisons using t tests with pooled SD
##
## data: model.ancova$fitted.values and SexBySurv
##
## f FALSE f TRUE m FALSE
## f TRUE 0.028 - -
## m FALSE 4.2e-15 < 2e-16 -
## m TRUE 0.029 1.6e-05 1.0e-12
##
## P value adjustment method: none
## Df SS MS Rsq F Z Pr(>F)
## X2 1 0.023215 0.0232149 0.34426 77.3890 5.2846 0.001 **
## SexBySurv 3 0.005172 0.0017239 0.07669 5.7467 3.0048 0.002 **
## X2:SexBySurv 3 0.000651 0.0002172 0.00966 0.7239 -0.0862 0.530
## Residuals 128 0.038397 0.0003000 0.56939
## Total 135 0.067435
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df SS MS Rsq F Z Pr(>F)
## X2 1 0.023215 0.0232149 0.34426 77.8814 5.2836 0.001 **
## SexBySurv 3 0.005172 0.0017239 0.07669 5.7833 3.0176 0.002 **
## Residuals 131 0.039048 0.0002981 0.57905
## Total 135 0.067435
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## f FALSE f TRUE m FALSE m TRUE
## f FALSE 0.000000000 0.001283828 0.01595601 0.004128217
## f TRUE 0.001283828 0.000000000 0.01723984 0.005412045
## m FALSE 0.015956012 0.017239840 0.00000000 0.011827794
## m TRUE 0.004128217 0.005412045 0.01182779 0.000000000
## f FALSE f TRUE m FALSE m TRUE
## f FALSE 0.0000000 -0.9612166 2.772571 0.4670408
## f TRUE -0.9612166 0.0000000 2.578038 0.6999764
## m FALSE 2.7725709 2.5780381 0.000000 2.3954073
## m TRUE 0.4670408 0.6999764 2.395407 0.0000000
## f FALSE f TRUE m FALSE m TRUE
## f FALSE 1.000 0.819 0.002 0.337
## f TRUE 0.819 1.000 0.002 0.256
## m FALSE 0.002 0.002 1.000 0.002
## m TRUE 0.337 0.256 0.002 1.000
If the cov:group interaction term IS significant, compare slopes among groups directly:
##
## Pairwise comparisons
##
## Groups: f FALSE f TRUE m FALSE m TRUE
##
## RRPP: 1000 permutations
##
## LS means:
## Vectors hidden (use show.vectors = TRUE to view)
##
## Pairwise statistics based on mean vector correlations
## r angle UCL (95%) Z Pr > angle
## f FALSE:f TRUE 1 0 8.537736e-07 -0.4814688 0.5955
## f FALSE:m FALSE 1 0 8.537736e-07 -0.4969422 0.6005
## f FALSE:m TRUE 1 0 8.537736e-07 -0.4889152 0.5980
## f TRUE:m FALSE 1 0 8.537736e-07 -0.4727927 0.5930
## f TRUE:m TRUE 1 0 8.537736e-07 -0.4735833 0.5930
## m FALSE:m TRUE 1 0 8.537736e-07 -0.5034108 0.6020
1: Perform ANOVA on regression residuals: NOT the same as ANCOVA (different \(\small{df}\), different pooled \(\small{\beta}\), etc.). Also, one loses the test of slopes, which is important (see J. Anim. Ecol. 2001. 70:708-711; sketch below)
2: Significant cov:group interaction, but still compare groups: not useful, as the answer depends upon where along the regression you compare
3: “Size may be a covariate, so I’ll use a small size range to ‘standardize’ for it”: choosing animals of similar sizes eliminates the covariate, but also discards potentially important biological information (e.g., what if male head width grows relatively faster than female head width, i.e., a sex:size interaction?)
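A quick illustration of point 1, a sketch using the same hypothetical Y, X2, and SexBySurv as above:

# ANOVA on OLS residuals uses a single pooled slope and the wrong df,
# and provides no test of slope homogeneity
res <- resid(lm(Y ~ X2, data = dat))
anova(lm(res ~ SexBySurv, data = dat))      # NOT equivalent to...
anova(lm(Y ~ X2 + SexBySurv, data = dat))   # ...the ANCOVA fit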