Dean Adams, Iowa State University
Independent Variable: specifies the hypothesis; a predictor for other variables (e.g., sex, age). (X-matrix)
Dependent Variable: the response variable; its variation depends on other variables. This is the ‘data’ (Y-matrix)
MANY ways of classifying statistical methods, I prefer…
Inferential Statistics: test for specific patterns in data using independent variables to generate hypotheses (Y vs. X)
Exploratory Statistics: describe patterns in data without independent variables or specific hypotheses (patterns in Y)
Some examples
Parametric Methods: Parameters estimated from the data are evaluated relative to theoretical distributions of those parameters. Implementations include:
Evaluate patterns in data relative to independent variables (e.g., \(\small\mathbf{Y}=\mathbf{X}\mathbf{\beta } +\mathbf{E}\))
This is all fine and dandy, but where do the expected values come from?
Parametric statistical theory has generated numerous expected distributions for various parameters, which were derived from theory by considering how those parameters behave under repeated sampling
Over decades, many parametric distributions have been derived from theory
Each is used to evaluate statistical summary parameters from particular hypothesis tests
Sampling distributions may be obtained in other ways (e.g., permutation)
Example: Do male and female birds differ in size?
Calculate test value (\(D_{obs}=\mu_M-\mu_F\))
Generate empirical sampling distribution by shuffling specimens many times relative to groups (i.e., under \(H_0\) of no group difference) and calculating \(D_{rand}\) each time
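A minimal R sketch of this permutation procedure, assuming a hypothetical data frame `birds` with a numeric column `size` and a factor `sex` (levels `"M"` and `"F"`):

```r
# Observed difference in mean size between the sexes ('birds' is a hypothetical data frame)
D.obs <- mean(birds$size[birds$sex == "M"]) - mean(birds$size[birds$sex == "F"])

# Empirical sampling distribution under H0: shuffle specimens relative to groups
set.seed(1)
n.perm <- 999
D.rand <- replicate(n.perm, {
  sex.shuffled <- sample(birds$sex)   # randomize group labels
  mean(birds$size[sex.shuffled == "M"]) - mean(birds$size[sex.shuffled == "F"])
})

# Significance: proportion of differences (random + observed) at least as extreme as D.obs
p.value <- (sum(abs(D.rand) >= abs(D.obs)) + 1) / (n.perm + 1)
```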
Parametric statistical hypothesis testing consists of two distinct steps:
Parameter Estimation: Here we fit the data to the model, and estimate parameters that summarize that fit. These are commonly in the form of model coefficients, which for linear models are regression parameters.
Model Evaluation: Here we use statistical summary measures that summarize the fit of the data to the model.
Don’t forget there is error in model evaluation & hypothesis testing!!
Power \(\left(1-\beta\right)\): the ability to detect a significant effect when one is present (a function of effect size, N, \(\sigma^2\))
The power of a test can be empirically determined in many instances (a sketch follows below)
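A minimal illustration of the analytical route, using base R's `power.t.test()` for a two-sample \(t\)-test; the sample size, effect size, and standard deviation below are arbitrary example values:

```r
# Power of a two-sample t-test from n, effect size (delta), and sd (all arbitrary here)
power.t.test(n = 20, delta = 2, sd = 3, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")
```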
Parametric statistics: estimate statistical parameters from data, and compare to a theoretical distribution of these parameters
Significance based upon how ‘extreme’ the observed value is relative to the distribution of values (under the null hypothesis of no pattern)
Different distributions used for different statistical parameters and tests
Distribution for binary events, calculate probabilities directly
Determine probability of obtaining x outcomes in n total tries
\[\small{Pr}=\binom{n}{x}p^{x}q^{n-x}\] where \(n\) is the total # of events, \(x\) is the # of successes, and \(p\) & \(q\) are the probabilities of success and failure
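The same probability can be computed directly in R with `choose()` or `dbinom()`; the values of \(n\), \(x\), and \(p\) below are arbitrary:

```r
# P(x = 7 successes in n = 10 trials with p = 0.5), computed two equivalent ways
choose(10, 7) * 0.5^7 * 0.5^(10 - 7)   # from the binomial formula
dbinom(7, size = 10, prob = 0.5)       # built-in binomial density
```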
Models the probability of rare, independent events
For Poisson: \(\mu_x = \sigma^2_x\)
Multiple Poisson distributions exist (for different means)
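A quick R check of these properties, with an arbitrary rate \(\lambda = 2\):

```r
# Poisson probabilities, and the mean = variance property (lambda = 2 is arbitrary)
dpois(0:5, lambda = 2)            # P(X = 0), ..., P(X = 5)
x <- rpois(10000, lambda = 2)     # simulated counts
c(mean = mean(x), var = var(x))   # both should be close to 2
```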
Ranges from 0 to + \(\infty\): Multiple \(\chi^2\) distributions exist (for different df)
Many statistics (particularly for categorical data) are compared to \(\chi^2\) distributions (e.g., the \(\chi^2\) test, the G-test)
The ‘bell curve’: data are symmetrical about the mean
Range: \(-\infty\) to \(+\infty\): \(\pm1\sigma\) = 68% of data, \(\pm2\sigma\) = 95% of data, \(\pm3\sigma\) = 99% of the data
The \(t\)-distribution is a family of approximately normal distributions for finite sample sizes (used to compare two groups)
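The normal tail areas quoted above are easy to verify with `pnorm()`:

```r
# Proportion of a normal distribution within 1, 2, and 3 standard deviations of the mean
pnorm(1) - pnorm(-1)   # ~0.68
pnorm(2) - pnorm(-2)   # ~0.95
pnorm(3) - pnorm(-3)   # ~0.997
```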
Models the ratio of variances: range: 0 to \(+\infty\)
One of the most commonly used distributions (used for Linear Models [LM])
Arises as the ratio of two scaled \(\small{\chi^2}\) distributions (and thus has 2 \(\small{df}\) parameters)
Multiple \(\small{F}\)-distributions for various \(\small{df}\)
Sometimes, our data and hypotheses match a type of analysis (e.g., ANOVA), but violate the assumptions (e.g., normality)
One solution: transform the data so that they more closely match the assumptions of the test (meeting the test assumptions is important, so that results can be attributed to true differences in the data vs. violations of the properties of the test)
Some common transformations for biological data are:
There are MANY other possible transformations
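A few common transformations, sketched in R for a hypothetical vector `y` of positive measurements and a hypothetical vector `p` of proportions:

```r
# Common transformations for biological data (y and p are hypothetical vectors)
y.log    <- log(y)         # log transform (right-skewed measurements)
y.sqrt   <- sqrt(y)        # square-root transform (counts)
p.arcsin <- asin(sqrt(p))  # arcsine-square-root transform (proportions)
```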
A statistic summarizing the ‘typical’ location for a sample on the number line
Moment Statistics: deviations around the mean, raised to powers
\(1^{st}\) moment: sum of deviates (equals zero): \(\small{M_1=\frac{1}{n}\sum{\left(Y_i-\bar{Y}\right)^1}=0}\)
\(2^{nd}\) moment (variance): sum of squared deviates, measures dispersion around mean: \(\small{\sigma^2=\frac{1}{n}\sum{\left(Y_i-\bar{Y}\right)^2}}\) (NOTE: \(\small{\sigma^2}\) for a sample is calculated using \(n-1\))
\(3^{rd}\) moment (skewness): describes the direction of asymmetry (skew) of the distribution
Standard deviation: the square root of the variance \[\sigma=\sqrt{\frac{1}{n-1}\sum{\left(Y_i-\bar{Y}\right)^2}}\]
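These quantities in R, for a hypothetical numeric vector `Y` (base R has no built-in skewness function, so one common moment-based formulation is written out):

```r
# Moment statistics for a hypothetical numeric vector Y
m1 <- mean(Y - mean(Y))                 # 1st moment: equals 0 (up to rounding)
v  <- var(Y)                            # 2nd moment: sample variance (uses n - 1)
s  <- sd(Y)                             # standard deviation: square root of the variance
sk <- mean((Y - mean(Y))^3) / sd(Y)^3   # 3rd moment (skewness), one common formulation
```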
Degrees of Freedom (\(\small{df}\)): the number of values in the data that are free to vary after we’ve calculated some parameter (e.g., if you know the mean and all but 1 value from the data, you can figure out the remaining variate)
Becomes important when determining whether your sample size is sufficient for a particular test (each test has associated df based on how many parameters are estimated)
Main categories of inferential models: Linear Models (LM) and Log-linear models
Maximum likelihood (ML) used to calculate all parameters
LM (ANOVA, regression): Used when \(\small{Y}\) is continuous. Fitting procedure is Least Squares (minimize the sum of squared deviations). This is equivalent to ML when error is normally distributed
Log-Linear Models (logistic regression, contingency tables): Used when \(\small{Y}\) is categorical. Called log-linear because ML estimate of logs of variables is linear
If \(\small\mathbf{X}\) contains one or more categorical factors, the LM exemplifies a comparison of groups
\[\mathbf{Y}=\mathbf{X}\mathbf{\beta } +\mathbf{E}\]
\(\small{H_{0}}\): No difference among groups. More formally, variation in \(\small\mathbf{Y}\) is not explained by \(\small\mathbf{X}\): i.e., \(\small{H_{0}}\): \(\small{SS}_{X}\sim{0}\)
\(\small{H_{1}}\): Differences exist among groups (i.e., group means differ from one another). More formally, some variation in \(\small\mathbf{Y}\) is explained by \(\small\mathbf{X}\): i.e., \(\small{H_{1}}\): \(\small{SS}_{X}>0\)
Parameters: model coefficients \(\small\hat\beta\) represent components of the group means relative to the overall mean
Do male and female sparrows differ in total length?
## Df SS MS Rsq F Z Pr(>F)
## bumpus$sex 1 187.49 187.491 0.10953 16.483 2.9719 0.001 **
## Residuals 134 1524.24 11.375 0.89047
## Total 135 1711.74
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Intercept) bumpus$sexm
## 157.979592 2.445696
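A hedged base-R sketch of this group comparison, assuming a `bumpus` data frame with total length `TL` and `sex`; note that the table above also reports Z and permutation-based P-values, which base `anova()` does not produce:

```r
# Linear model with a categorical factor: do the sexes differ in total length?
fit <- lm(TL ~ sex, data = bumpus)
anova(fit)   # ANOVA table: Df, SS, MS, F, and parametric P-value
coef(fit)    # with R's default treatment contrasts: intercept = mean of the first
             # (reference) group; 'sexm' = difference of males from that group
```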
If \(\small\mathbf{X}\) contains one or more continuous variables, the LM exemplifies a regression analysis
\[\mathbf{Y}=\mathbf{X}\mathbf{\beta } +\mathbf{E}\]
\(\small{H_{0}}\): No covariation between \(\small\mathbf{Y}\) & \(\small\mathbf{X}\). More formally, variation in \(\small\mathbf{Y}\) is not explained by \(\small\mathbf{X}\): i.e., \(\small{H_{0}}\): \(\small{SS}_{X}\sim{0}\)
\(\small{H_{1}}\): Covariation is present between \(\small\mathbf{Y}\) & \(\small\mathbf{X}\). More formally, some variation in \(\small\mathbf{Y}\) is explained by \(\small\mathbf{X}\): i.e., \(\small{H_{1}}\): \(\small{SS}_{X}>0\)
Parameters: model coefficients \(\small\hat\beta\) represent slopes describing the covariation between \(\small\mathbf{Y}\) & \(\small\mathbf{X}\).
Does sparrow wingspan (alar extent) covary with total length?
## Df SS MS Rsq F Z Pr(>F)
## bumpus$TL 1 1964.7 1964.68 0.47744 122.43 6.1954 0.001 **
## Residuals 134 2150.3 16.05 0.52256
## Total 135 4115.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Intercept) bumpus$TL
## 74.264953 1.071341
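The analogous regression sketch in base R, again assuming the `bumpus` data frame (with alar extent `AE` and total length `TL`):

```r
# Linear model with a continuous predictor: does wingspan covary with total length?
fit <- lm(AE ~ TL, data = bumpus)
anova(fit)   # ANOVA table for the regression
coef(fit)    # intercept and slope (change in AE per unit change in TL)
```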
If \(\small\mathbf{X}\) contains one or more continuous variables and the response variable is binary, then the log-LM exemplifies a logistic regression analysis
\(\small{H_{0}}\): No covariation between \(\small\mathbf{Y}\) & \(\small\mathbf{X}\)
\(\small\mathbf{Y}\) is derived from the categorical response variable (e.g., %males)
\[\ln{\left[\frac{p}{1-p}\right]}=\mathbf{X}\mathbf{\beta } +\mathbf{E}\]
Note: the above formulation is a linear regression of the logits of the proportions
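A minimal logistic-regression sketch in R via `glm()`, assuming a hypothetical binary (0/1) response column `surv` in the `bumpus` data frame:

```r
# Logistic regression: binary response modeled on the logit (log-odds) scale
fit <- glm(surv ~ TL, data = bumpus, family = binomial)   # 'surv' is hypothetical
summary(fit)     # coefficients are on the log-odds scale
exp(coef(fit))   # exponentiate to obtain odds ratios
```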
If both the predictor \(\small\mathbf{X}\) variable and response \(\small\mathbf{Y}\) variable are categorical, then our log-LM is a contingency table analysis (R x C test)
\(\small{H_{0}}\): No association between \(\small\mathbf{Y}\) & \(\small\mathbf{X}\) (rows and columns are independent)
\(\chi^2\) tests and G-tests fit this type of model
VERY common in medical studies (e.g., Is smoking associated with cancer rates?)
Compare single sample to a known value, or compare 2 samples
Determine whether means are significantly different
For a single sample calculate: \(t=\frac{\bar{Y}-\mu}{s_{\bar{Y}}}\)
For two samples calculate: \(t=\frac{\bar{Y}_{1}-\bar{Y}_{2}}{s_{p}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\)
where: \(s_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}\)
Compare value to a \(\small{t}\)-distribution (or perform resampling)
Other variations exist (e.g., for paired data)
Do male and female sparrows differ in total length?
##
## Welch Two Sample t-test
##
## data: bumpus$TL by bumpus$sex
## t = -3.9134, df = 89.245, p-value = 0.0001775
## alternative hypothesis: true difference in means between group f and group m is not equal to 0
## 95 percent confidence interval:
## -3.687434 -1.203957
## sample estimates:
## mean in group f mean in group m
## 157.9796 160.4253
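The output above corresponds to base R's `t.test()`, assuming the same `bumpus` data frame:

```r
# Welch two-sample t-test of total length between the sexes (Welch is the default)
t.test(bumpus$TL ~ bumpus$sex)
```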
Determine amount of association (covariation) between two variables (\(\small{H_{0}}\): no association)
Range: -1 to +1 (more extreme values = higher correlation)
\[\small{r}_{ij}=\frac{cov_{ij}}{s_is_j}=\frac{\frac{1}{n-1}\sum(Y_i-\bar{Y}_i)(Y_j-\bar{Y}_j)}{\sqrt{\frac{1}{n-1}\sum(Y_i-\bar{Y}_i)^2\frac{1}{n-1}\sum(Y_j-\bar{Y}_j)^2}}=\frac{\sum(Y_i-\bar{Y}_i)(Y_j-\bar{Y}_j)}{\sqrt{\sum(Y_i-\bar{Y}_i)^2\sum(Y_j-\bar{Y}_j)^2}}\]
Measures ‘tightness’ of scatter of one variable relative to the other
Assess significance by converting \(\small{r}\) to \(\small{t}\) (or resampling)
##
## Pearson's product-moment correlation
##
## data: bumpus$AE and bumpus$TL
## t = 11.065, df = 134, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5914289 0.7697695
## sample estimates:
## cor
## 0.6909709
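This output corresponds to base R's `cor.test()`, which converts \(r\) to \(t\) for the test:

```r
# Pearson correlation between alar extent and total length, with the t-based test
cor.test(bumpus$AE, bumpus$TL)
```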
Numerator of \(\small{r_{ij}}\) is covariance: describes deviations in 1 variable as they change with deviations in another variable (similar to variance, but for 2 variables)
Note the similarity between:
\[\small{var}_i=\frac{\sum(Y_i-\bar{Y}_i)^2}{n-1}=\frac{\sum(Y_i-\bar{Y}_i)(Y_i-\bar{Y}_i)}{n-1}\]
\[\small{cov}_{ij}=\frac{\sum(Y_i-\bar{Y}_i)(Y_j-\bar{Y}_j)}{n-1}\]
Variance is just a covariance between a variable and itself (very important to see connection here!)
Can also think of correlation as angle between vectors i and j in variable space: the tighter the angle, the higher the correlation (parallel vectors have \(r=1.0\) while orthogonal [perpendicular] vectors have \(r=0.0\))
Thus, correlations are cosines of angles between vectors: \(r= cos{\theta}\) (1 of MANY ways to visualize correlations)
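A quick numerical check of this geometric view, for two hypothetical numeric vectors `x` and `y`:

```r
# Correlation equals the cosine of the angle between mean-centered vectors
xc <- x - mean(x)
yc <- y - mean(y)
cos.theta <- sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))
all.equal(cos.theta, cor(x, y))   # TRUE (up to numerical precision)
```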
Often used to summarize categorical data from contingency tables
Tests for independence of values in cells (i.e. between rows and columns)
For 2 x 2 table calculate: \(\chi^2=\sum{\frac{(O-E)^2}{E}}\)
Compare \(\chi^2_{obs}\) to a \(\chi^2\) distribution with the appropriate df [for an r x c table, df = (r-1)(c-1)]
Other derivations exist for particular data types, but this is general concept
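A hedged R sketch of this test on a 2 x 2 contingency table with made-up counts:

```r
# Chi-squared test of independence on a hypothetical 2 x 2 contingency table
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(exposure = c("yes", "no"),
                              outcome  = c("disease", "healthy")))
chisq.test(tab, correct = FALSE)   # compare X-squared to chi^2 with (r-1)(c-1) = 1 df
```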