Increasingly, quantitative analyses of biological data involve resampling procedures: randomization/permutation, bootstrap, jackknife, MCMC, and Monte Carlo simulations are all widely used approaches for evaluating the significance of patterns in data, or for assessing how well an observed pattern corresponds with expectations generated from some model-based process.
These approaches may all be implemented in R. The tools required to do so include the generation (simulation) of data under some model, and the selection (resampling) of values from a vector of observations.
Here are some ways in which data may be simulated in R. Note that more specialized functions exist for particular tasks (e.g., simulating data along the branches of a phylogeny under a Brownian motion model of evolution using ‘sim.char’ in ‘geiger’), but such a function is simply a more elaborate wrapper that simulates random normal deviates in a particular fashion: under the hood, the simulation generator is a basic random-number function like those below (or a multivariate analog such as ‘mvrnorm’).
x <- rnorm(5000)   #5000 standard normal deviates (mean = 0, sd = 1)
hist(x,20,freq=T)
x2<-runif(5000)    #5000 uniform deviates on [0,1]
hist(x2,20,freq=T)
z1<-rnorm(500, mean = 1, sd = 0.3)   #normal deviates with specified mean and sd
hist(z1,20,freq=T)
z2<-rnorm(500, mean = 3, sd = 0.3)
plot(z1,z2)        #two independent normal vectors: no correlation
z3<-runif(500, 0, 1)   #uniform deviates with specified bounds
z4<-runif(500, 0,1)
plot(z3,z4)
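The same approach extends to other distributions built into R; a minimal sketch (the parameter values here are arbitrary illustrations):
y1<-rbinom(500, size=10, prob=0.3)   #binomial counts
y2<-rpois(500, lambda=4)             #Poisson counts
y3<-rexp(500, rate=2)                #exponential waiting times
hist(y2,20,freq=T)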
##Setting the 'seed' for random number generation
#IMPORTANT for checking simulations prior to full run
set.seed(2) #sets the seed
rnorm(10)
## [1] -0.89691455 0.18484918 1.58784533 -1.13037567 -0.08025176 0.13242028
## [7] 0.70795473 -0.23969802 1.98447394 -0.13878701
set.seed(2) #sets the seed
rnorm(10)
## [1] -0.89691455 0.18484918 1.58784533 -1.13037567 -0.08025176 0.13242028
## [7] 0.70795473 -0.23969802 1.98447394 -0.13878701
rnorm(10) #note that the seed is not held indefinitely in memory
## [1] 0.41765075 0.98175278 -0.39269536 -1.03966898 1.78222896 -2.31106908
## [7] 0.87860458 0.03580672 1.01282869 0.43226515
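One practical use of ‘set.seed’ is to seed the generator before a small trial run, so that the entire simulation can be reproduced exactly while debugging. A minimal sketch (the simulation itself is arbitrary):
set.seed(2); sim1 <- replicate(5, mean(rnorm(100)))
set.seed(2); sim2 <- replicate(5, mean(rnorm(100)))
identical(sim1, sim2)   #TRUE: the full simulation run is reproducible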
## Simulate CORRELATED Data Using MVTNORM
##basic idea: simulate rnorm vectors, & multiply by decomposition of a covariance matrix
## covariance matrix specifies correlation structure
library(mvtnorm)
corr.val<-0.7
a <- rmvnorm(n=500,mean=c(0,0),sigma=matrix(c(1,corr.val,corr.val,1),2,2))
cor(a)
## [,1] [,2]
## [1,] 1.0000000 0.6997185
## [2,] 0.6997185 1.0000000
plot(a)
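To make the ‘basic idea’ above explicit, here is a by-hand sketch of the same procedure: simulate independent rnorm vectors and multiply by the Cholesky decomposition of the covariance matrix (‘mvrnorm’ in MASS uses a similar decomposition-based approach internally). Object names here are arbitrary:
sigma<-matrix(c(1,corr.val,corr.val,1),2,2)   #desired covariance matrix
z<-matrix(rnorm(500*2),500,2)                 #independent standard normal deviates
a.byhand<-z%*%chol(sigma)                     #impose the correlation structure
cor(a.byhand)                                 #correlation approximates corr.val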
A second crucial skill is the ability to resample data in some way. Permutation and bootstrap approaches are accomplished in R using the function ‘sample’:
x <- 1:10
#dim(x)<-c(10,1) #force dimensions to make column vector
x
## [1] 1 2 3 4 5 6 7 8 9 10
sample(x) # Randomize the order of locations
## [1] 3 6 5 7 10 4 9 2 8 1
sample(x,replace=FALSE) #more explicit
## [1] 1 9 6 7 3 5 2 4 10 8
sample(x,(length(x)-2),replace=FALSE) #sub-sample
## [1] 2 3 1 6 4 5 8 10
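For completeness, jackknife-style resampling (mentioned above) simply leaves each observation out in turn; this is done with negative indexing rather than ‘sample’. A minimal sketch:
sapply(seq_along(x), function(i) mean(x[-i]))   #leave-one-out (jackknife) means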
Note that if used incorrectly, ‘sample’ can get one into trouble. Imagine the case where the rows of a matrix are to be shuffled:
x <- 1:10
y<-cbind(x,x,x)
y
## x x x
## [1,] 1 1 1
## [2,] 2 2 2
## [3,] 3 3 3
## [4,] 4 4 4
## [5,] 5 5 5
## [6,] 6 6 6
## [7,] 7 7 7
## [8,] 8 8 8
## [9,] 9 9 9
## [10,] 10 10 10
sample(y,replace=FALSE) #We did not tell it to preserve rows!!
## [1] 8 6 2 9 8 6 2 1 3 10 5 4 9 4 8 6 4 1 1 10 3 5 7 3 2
## [26] 9 5 7 10 7
#THIS IS NOT CORRECT. It shuffles all values
y[sample(nrow(y)),] # ROWS Preserved (needed for resampling multivariate data)
## x x x
## [1,] 2 2 2
## [2,] 3 3 3
## [3,] 5 5 5
## [4,] 7 7 7
## [5,] 8 8 8
## [6,] 4 4 4
## [7,] 9 9 9
## [8,] 10 10 10
## [9,] 1 1 1
## [10,] 6 6 6
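The same row-indexing approach is what one would use to bootstrap multivariate data: resample row indices with replacement so that complete rows stay together. A minimal sketch:
y[sample(nrow(y),replace=TRUE),]   #bootstrap rows: some repeated, some omitted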
Here is a simple randomization test. All randomization tests have three main steps:
1) Obtain test parameters and summary measures from the observed data.
2) Permute the data; obtain summary measures from the permuted data and generate a histogram of them.
3) Compare the observed value to the permuted values to assess significance.
The code below is NOT optimized for speed or coding efficiency, but rather is written in ‘readable’ form based on these steps (in future weeks we will see how to improve this approach for computational efficiency):
bumpus<-read.csv("Data/bumpus.csv",header=T)
bumpus.data<-log(bumpus[,(5:13)]) # set of log-linear measurements
sex<-as.factor(bumpus[,2])
TL<-bumpus.data$TL
#Observed data
t.test(formula=bumpus.data$TL~sex)
##
## Welch Two Sample t-test
##
## data: bumpus.data$TL by sex
## t = -3.9205, df = 88.257, p-value = 0.0001742
## alternative hypothesis: true difference in means between group f and group m is not equal to 0
## 95 percent confidence interval:
## -0.023246093 -0.007607292
## sample estimates:
## mean in group f mean in group m
## 5.062204 5.077631
plot(sex,bumpus.data$TL,ylab="Total Length")
t.obs<-t.test(formula=bumpus.data$TL~sex)$statistic[[1]] #grab t-statistic from frame
t.obs
## [1] -3.920514
#Randomization Test
permute<-999
P.1tailed<-P.2tailed<-1
t.rand.vec<-array(NA,(permute+1))
t.rand.vec[permute+1]<-t.obs
for(i in 1:permute){
  ###Shuffle Data
  y.rand<-sample(bumpus.data$TL) #NOTE: notation works ONLY for single variable
  ###Run analysis on random data
  t.rand.vec[i]<-t.test(formula=y.rand~sex)$statistic[[1]]
} #end permute
##Significance assessment
P.1tailed<-length(which(t.rand.vec<=t.rand.vec[permute+1])) / (permute+1) #because observed is negative
P.2tailed<-length(which(abs(t.rand.vec)>=abs(t.rand.vec[permute+1]))) / (permute+1)
P.1tailed
## [1] 0.001
P.2tailed
## [1] 0.001
####Plot
hist(t.rand.vec,20,freq=T,col="gray")
segments(t.obs, 0, t.obs, 50) ##Plot Observed value
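As an aside, the permutation loop above can be written more compactly with ‘replicate’; this is just a sketch of the same test using the objects already defined, not the optimized versions we will develop later:
t.rand.vec2<-c(replicate(permute, t.test(sample(bumpus.data$TL)~sex)$statistic[[1]]), t.obs)
mean(abs(t.rand.vec2)>=abs(t.obs))   #two-tailed P-value (observed value included)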
Bootstrapping is sampling with replacement. Thus, in any bootstrap replicate some observations are represented more than once while others are left out. Such methods are particularly useful for obtaining confidence intervals. Below are examples of several types of bootstrap CI, coded in various ways:
x <- 1:10
sample(x,replace=TRUE)
## [1] 3 9 10 4 9 1 6 8 3 4
#Observed mean and CI
mean.TL<-mean(TL)
mean.TL
## [1] 5.072073
int<-1.96*sqrt(var(TL)/length(TL))   #half-width of normal-theory 95% CI (1.96 * standard error)
CIlow<-mean.TL-int
CIhi<-mean.TL+int
CIlow
## [1] 5.068316
CIhi
## [1] 5.075829
#Bootstrap data
boot.mean<-numeric(1000)
for (i in 1:1000){
  boot.mean[i]<-mean(sample(TL,replace=T))
}
### PERCENTILE BOOTSTRAP CI
quantile(boot.mean,c(0.025,0.975))
## 2.5% 97.5%
## 5.068100 5.075713
###Standard Bootstrap CI
int.boot<-1.96*sd(boot.mean)
CIlow.boot<-mean.TL-int.boot
CIhi.boot<-mean.TL+int.boot
CIlow.boot
## [1] 5.068319
CIhi.boot
## [1] 5.075827
#Using the package 'boot'
library(boot)
mymean<-function(TL,i) mean(TL[i])   #statistic function: 'boot' supplies the data and a vector of resampled indices i
myboot<-boot(TL,mymean,R=1000)
myboot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = TL, statistic = mymean, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 5.072073 1.883144e-05 0.001910687
quantile(myboot$t,c(0.025,0.975))
## 2.5% 97.5%
## 5.068267 5.075668
#Various Bootstrap CI
boot.ci(myboot)
## Warning in boot.ci(myboot): bootstrap variances needed for studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = myboot)
##
## Intervals :
## Level Normal Basic
## 95% ( 5.068, 5.076 ) ( 5.068, 5.076 )
##
## Level Percentile BCa
## 95% ( 5.068, 5.076 ) ( 5.068, 5.075 )
## Calculations and Intervals on Original Scale
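The warning above simply notes that studentized intervals require bootstrap variances, which were not supplied. If only particular interval types are needed, they can be requested through the ‘type’ argument of ‘boot.ci’:
boot.ci(myboot, conf=0.95, type=c("perc","bca"))   #percentile and BCa intervals only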