In R there are often multiple ways of doing the same thing. Typically, we don’t consider which implementation is preferable; instead we ask: ‘Does it work?’ This is usually sufficient for our needs, as for simple one-off analyses of reasonably-sized datasets all we are concerned with is obtaining a consistent numerical answer.
However, while multiple implementations may yield the same numerical answer, they may not be identical in terms of the time required to obtain it. This difference can matter a great deal, especially in the era of big data, in which 21st-century biology clearly plays a part.
Additionally, biologists do not only use simple analyses. Increasingly, they are using computationally-intensive approaches such as permutation, bootstrapping, or Monte Carlo simulations to evaluate patterns in their data. Here also, computational efficiency matters a great deal.
So the question is: how do different implementations of some set of operations compare in terms of computational speed?
For example, the mean can be obtained in various ways: mean(), apply(), sum(x)/length(x), etc. How do they compare? We can use the microbenchmark package to find out.
mymean<-function(x){  #Create my own function for the mean
  n<-length(x)
  tmp<-0
  for (i in 1:n){
    tmp<-tmp+x[i]
  }
  mn<-tmp/n
  return(mn)
}
## simulate some data
x<-matrix(rnorm(1000))
n<-length(x)
#Compare implementations
library(microbenchmark)
library(ggplot2)
microbenchmark(mean(x),apply(x,2,mean), sum(x)/length(x),mymean(x),colSums(x)/length(x))
## Unit: nanoseconds
## expr min lq mean median uq max neval cld
## mean(x) 3282 3693 6182.80 4103.0 4513.0 143180 100 a
## apply(x, 2, mean) 29128 30359 38913.12 32205.5 39385.0 151795 100 ab
## sum(x)/length(x) 820 821 1173.62 821.0 1231.0 14360 100 a
## mymean(x) 27077 27488 68730.60 27898.0 36102.5 3330468 100 b
## colSums(x)/length(x) 3282 3693 5185.99 4103.0 5744.0 22975 100 a
Note that the base function mean() is NOT the fastest, that apply() was rather slow for this particular task, and that my hand-written function was the worst!
So why are some implementations slow and others fast? There are many reasons, but two big ones are:
Inefficient Code. The code works but is computationally inefficient.
Computational Overhead. The code uses functions that do (or evaluate) many things. For example, base functions are usually quite robust and very general, and are thus incredibly useful. However, they also perform considerable data checks to ensure the proper data type is provided, they often calculate, obtain, or pass many additional components beyond the main values of interest, and they have many options (print, labels, names, etc.) that must be evaluated at run-time. The result is that these functions are often slower than more streamlined algorithms designed to do the same thing. A quick way to see some of this overhead is shown below.
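For instance, before mean() computes anything it must dispatch on the class of its argument, and the default method then checks the input and handles the trim and na.rm options:
mean                #a generic: UseMethod("mean") dispatches on the class of x
body(mean.default)  #input checks and trim/na.rm handling precede the actual computation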
As an end-user, this generates a trade-off between coding efficiency (how long does it take me to write this?) and computational efficiency (how long does it take to run?). Improving the latter often comes at the expense of the former. The trick is to improve the speed of one's code where it matters most (i.e., find the slowest parts of the code and speed them up).
There are several ways to find the slow portions of your code. Both microbenchmark and system.time can be used creatively to identify choke points. See H. Wickham's Advanced R book, especially the chapter on optimizing code: http://adv-r.had.co.nz/Profiling.html. One approach I'll illustrate here uses the aprof package.
First, let’s look at a function.
NOTE: This function is easy to read but, as it turns out, is rather poorly written: as we will see, there are numerous ways of improving its performance.
#Perform multivariate correlation assessed via permutation
pls.slow<-function(x,y,iter=999){
  data.all<-cbind(x,y)
  data.all<-scale(data.all,scale=FALSE)
  XY.vcv<-cov(data.all)
  S12<-XY.vcv[1:dim(x)[2],(dim(x)[2]+1):(dim(x)[2]+dim(y)[2])]
  pls<-svd(S12)
  U<-pls$u; V<-pls$v
  XScores<-x%*%U[,1]; YScores<-y%*%V[,1]
  PLS.obs<-cor(XScores,YScores)
  P.val<-1
  PLS.rand<-NULL
  for(i in 1:iter){
    y.r<-y[sample(nrow(y)),]
    data.r<-cbind(x,y.r)
    data.r<-scale(data.r,scale=FALSE)
    XY.vcv.r<-cov(data.r)
    S12.r<-XY.vcv.r[1:dim(x)[2],(dim(x)[2]+1):(dim(x)[2]+dim(y.r)[2])]
    pls.r<-svd(S12.r)
    U.r<-pls.r$u; V.r<-pls.r$v
    XScores.r<-x%*%U.r[,1]; YScores.r<-y.r%*%V.r[,1]
    corr.rand<-cor(XScores.r,YScores.r)
    PLS.rand<-rbind(PLS.rand,corr.rand)
    P.val<-ifelse(corr.rand>=PLS.obs, P.val+1,P.val)
  }
  P.val<-P.val/(iter+1)
  return(list(PLS.corr=PLS.obs,pvalue=P.val))
}
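Before turning to aprof, here is a quick sketch of the simpler built-in tools mentioned above: system.time() for a single overall timing, and Rprof()/summaryRprof() for a function-by-function breakdown.
x<-matrix(rnorm(1000),ncol=10)
y<-matrix(rnorm(1000),ncol=10)
system.time(pls.slow(x,y))    #total elapsed time for one call
Rprof()                       #start the profiler
pls.slow(x,y)
Rprof(NULL)                   #stop the profiler
head(summaryRprof()$by.self)  #functions consuming the most time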
Using the aprof package, we can give this function some data, and profile its performance to find choke-points in the code:
library(aprof)
source("07-pls.slow.r")  #the function above, saved in its own source file
tmp<-tempfile()  #create tmp file for saving profiler output
Rprof(tmp,line.profiling=TRUE)  #profile the function line-by-line
x<-matrix(rnorm(1000),ncol=10)
y<-matrix(rnorm(1000),ncol=10)
pls.slow(x,y)
Rprof(append=FALSE)  #stop writing to tmp (Rprof(NULL) also works)
fooaprof<-aprof("07-pls.slow.r",tmp) #Create aprof object
plot(fooaprof)
# new code to create plot
png("Aprof.out.png")
plot(fooaprof)
dev.off()
The output looks something like this:
[Figure: Aprof output, showing execution time by line of 07-pls.slow.r]
Here we can see which lines of the function consume the most time. With this, we can determine which parts would provide the best speed-up if they are altered. In this case, all are within the permutation loop:
Obtaining the covariance matrix cov
Performing the singular-value decomposition svd
Scaling the data scale
Obtaining the correlation cor.
Obtaining the cross-covariance component of the covariance matrix
Now we will discuss some general principles for speeding up R code.
Don’t grow objects within loops or functions. Instead, pre-allocate them and fill the values in as they are calculated.
iter<-99
SS<-array(NA,iter) #pre-allocate
newSS<-function(iter){  #grow object on the fly
  SS<-NULL
  for (i in 1:iter){SS<-rbind(SS,NA)}
  return(SS)
}
microbenchmark(SS<-array(NA,99), x<-newSS(99))
## Unit: nanoseconds
## expr min lq mean median uq max neval cld
## SS <- array(NA, 99) 0 410 792.08 411 821.0 4103 100 a
## x <- newSS(99) 92719 95180 133649.69 99898 121846.5 2476723 100 b
microbenchmark(SS<-array(NA,9999), x<-newSS(9999),times=10)
## Unit: microseconds
## expr min lq mean median uq
## SS <- array(NA, 9999) 3.282 3.283 7.4261 7.3855 10.667
## x <- newSS(9999) 86242.603 86453.066 97829.1659 90099.8400 99744.985
## max neval cld
## 13.129 10 a
## 143216.646 10 b
As we can see, pre-allocating memory can be a huge time savings, particularly for large objects or many iterations.
Components used repeatedly in a function can be calculated first, then re-used.
#Linear regression using matrix algebra
x<-cbind(1,matrix(rnorm(1000),ncol=10))
y<-matrix(rnorm(100))
all.calc<-function(x,y){
  coef.r<-array(NA,dim=c(999,ncol(x)))
  for (i in 1:999){
    y.r<-y[sample(nrow(y)),]
    coef.r[i,]<-solve(t(x)%*%x)%*%t(x)%*%y.r  #recomputes solve(t(x)%*%x)%*%t(x) every iteration
  }
  return(coef.r)
}
#notice that solve(t(x)%*%x)%*%t(x) is constant across iterations. Pull it out of the loop
hat.calc<-function(x,y){
  hat<-solve(t(x)%*%x)%*%t(x)  #calculated once
  coef.r<-array(NA,dim=c(999,ncol(x)))
  for (i in 1:999){
    y.r<-y[sample(nrow(y)),]
    coef.r[i,]<-hat%*%y.r
  }
  return(coef.r)
}
microbenchmark(all.calc(x,y),hat.calc(x,y),times=10)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## all.calc(x, y) 61.81836 63.56482 66.58850 66.25385 70.97119 71.68053 10 b
## hat.calc(x, y) 12.72084 13.65951 16.39145 16.28721 18.89521 22.40291 10 a
Pre-calculating components can be a huge time savings; particularly if parts of calculations are repeated in numerous places within some function.
Base R functions are robust, but can be slow. Instead, use lower-level functions or alternative algebraic formulations that accomplish the same task.
#Some ways of performing linear models
lm(y~x) #Common method
solve(t(xf)%*%xf)%*%t(xf)%*%y
crossprod(solve(crossprod(xf)),crossprod(xf,y))
lm.fit(xf,y)$coefficients
.lm.fit(xf,y)$coefficients # NOTE: a very low-level function (cannot use in packages submitted to CRAN)
qr.coef(qr(xf),y)
Now compare them:
x<-matrix(rnorm(10000),ncol=2)
xf<-cbind(1,x)
y<-matrix(rnorm(nrow(x)))
microbenchmark(
lm(y~x),
solve(t(xf)%*%xf)%*%t(xf)%*%y,
crossprod(solve(crossprod(xf)),crossprod(xf,y)),
lm.fit(xf,y),.lm.fit(xf,y),
qr.coef(qr(xf),y)
)
## Unit: microseconds
## expr min lq mean
## lm(y ~ x) 936.617 1050.8735 1513.4510
## solve(t(xf) %*% xf) %*% t(xf) %*% y 177.642 196.9235 271.0407
## crossprod(solve(crossprod(xf)), crossprod(xf, y)) 82.872 94.3595 123.9348
## lm.fit(xf, y) 259.283 286.7700 469.8923
## .lm.fit(xf, y) 186.257 194.8725 314.9711
## qr.coef(qr(xf), y) 175.180 201.6410 338.6346
## median uq max neval cld
## 1242.2590 1664.0030 4960.008 100 d
## 230.9750 334.5650 694.156 100 ab
## 109.1285 126.3595 531.283 100 a
## 326.5650 494.9755 4688.419 100 c
## 220.7185 315.2830 3291.903 100 a c
## 241.6415 326.7705 5892.933 100 bc
Note that the base function lm is not the most computationally efficient.
Be aware that the size of the data matrices, and which dimensions (or objects) are larger or smaller, can make a difference in which implementation is fastest.
#Large X univ. Y
x<-matrix(rnorm(10000),ncol=50)
xf<-cbind(1,x)
y<-matrix(rnorm(nrow(x)))
microbenchmark(
lm(y~x),
solve(t(xf)%*%xf)%*%t(xf)%*%y,
crossprod(solve(crossprod(xf)),crossprod(xf,y)),
lm.fit(xf,y),.lm.fit(xf,y),
qr.coef(qr(xf),y)
)
## Unit: microseconds
## expr min lq mean
## lm(y ~ x) 1026.874 1125.7455 1353.6230
## solve(t(xf) %*% xf) %*% t(xf) %*% y 732.719 786.8735 927.1444
## crossprod(solve(crossprod(xf)), crossprod(xf, y)) 392.206 417.8470 484.6205
## lm.fit(xf, y) 464.001 489.6420 548.0217
## .lm.fit(xf, y) 424.206 432.4110 477.7734
## qr.coef(qr(xf), y) 443.078 482.6675 584.4525
## median uq max neval cld
## 1221.5410 1416.0025 7405.961 100 c
## 824.4120 940.1045 5108.522 100 b
## 438.9755 487.3860 1329.234 100 a
## 520.8215 582.1550 963.694 100 a
## 460.3095 493.7450 788.104 100 a
## 508.9245 584.8215 5051.906 100 a
##Large Y univ. X
y<-matrix(rnorm(10000),ncol=100)
x<-matrix(rnorm(nrow(y)))
xf<-cbind(1,x)
microbenchmark(
lm(y~x),
solve(t(xf)%*%xf)%*%t(xf)%*%y,
crossprod(solve(crossprod(xf)),crossprod(xf,y)),
lm.fit(xf,y),.lm.fit(xf,y),
qr.coef(qr(xf),y)
)
## Unit: microseconds
## expr min lq mean
## lm(y ~ x) 730.668 800.412 1068.07584
## solve(t(xf) %*% xf) %*% t(xf) %*% y 46.769 55.180 70.38808
## crossprod(solve(crossprod(xf)), crossprod(xf, y)) 37.333 43.693 58.11731
## lm.fit(xf, y) 155.488 169.847 202.83958
## .lm.fit(xf, y) 117.744 122.667 138.15026
## qr.coef(qr(xf), y) 85.334 96.616 124.42714
## median uq max neval cld
## 851.6935 1045.5405 5707.087 100 c
## 61.9495 74.2565 230.975 100 a
## 51.2825 65.8465 146.051 100 a
## 178.4620 208.8215 379.488 100 b
## 127.1805 147.8980 221.540 100 ab
## 108.7185 133.1290 329.026 100 ab
#large Y and X
y<-matrix(rnorm(20000),ncol=100)
x<-matrix(rnorm(10000),ncol=50)
xf<-cbind(1,x)
microbenchmark(
lm(y~x),
solve(t(xf)%*%xf)%*%t(xf)%*%y,
crossprod(solve(crossprod(xf)),crossprod(xf,y)),
lm.fit(xf,y),.lm.fit(xf,y),
qr.coef(qr(xf),y)
)
## Unit: milliseconds
## expr min lq mean
## lm(y ~ x) 4.178058 4.415187 4.954002
## solve(t(xf) %*% xf) %*% t(xf) %*% y 1.302566 1.344003 2.526392
## crossprod(solve(crossprod(xf)), crossprod(xf, y)) 1.460925 1.530054 1.624614
## lm.fit(xf, y) 3.262364 3.416006 3.765409
## .lm.fit(xf, y) 3.200005 3.285954 3.595206
## qr.coef(qr(xf), y) 1.849029 1.937029 2.131634
## median uq max neval cld
## 4.643905 5.099906 9.940940 100 c
## 1.406566 1.518156 107.916075 100 ab
## 1.575182 1.675080 2.160003 100 a
## 3.578057 3.874264 6.759396 100 bc
## 3.376621 3.658057 6.818884 100 bc
## 2.016618 2.206568 5.345650 100 ab
This exercise demonstrates that care must be taken when writing functions. Knowing the size of the input objects can make a difference in which implementation is preferable. Also, when writing general functions for yourself and others (e.g., to go in an R package), one can consider first checking the dimensions (rows and columns) of the input matrices, then shunting the analysis to the pipeline that is most efficient for that shape; a sketch of this idea is shown below.
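Here is a purely structural sketch of that dispatch idea (the cutoffs and the method chosen for each branch are hypothetical; in practice they would be set by benchmarking candidate implementations on representative data sizes):
coef.dispatch<-function(xf,y){
  if(nrow(xf)>5000 || ncol(y)>50){  #hypothetical cutoff defining a 'large' problem
    crossprod(solve(crossprod(xf)),crossprod(xf,y))
  } else {
    qr.coef(qr(xf),y)  #a QR-based fit for smaller problems
  }
}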
When writing a function, one often thinks in terms of an algorithm: "Do this, then do that, etc." For iterative tasks these algorithms take the form of loops. Loops are very useful in programming and are straightforward to understand. However, one must remember that in R the primary data type is a vector, and many R functions are 'vectorized'. This means that they perform a given operation over the set of elements in the vector automatically. Essentially, these functions already contain a loop that has been programmed in a lower-level compiled language (e.g., C). As a result, vectorized R functions are often faster than performing the same task with a for-loop. NOTE: the apply family (apply, sapply, lapply, tapply, mapply, Map, etc.) is designed precisely to streamline these sorts of 'looping' tasks. USE THEM!!
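For instance (a small simulated example), compare squaring every element with an explicit loop versus letting R's vectorized arithmetic do the work:
sq.loop<-function(x){       #element-by-element loop
  out<-numeric(length(x))   #pre-allocated
  for (i in seq_along(x)){out[i]<-x[i]^2}
  return(out)
}
x<-rnorm(100000)
microbenchmark(sq.loop(x), x^2, times=10)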
Looping in R is especially inefficient when indexing rows and columns in an array or matrix. Let’s see an example:
fn1<-function(x){
  means<-array(0,ncol(x))
  for(i in 1:ncol(x)){
    for(j in 1:nrow(x)){
      means[i]<-means[i]+x[j,i]
    }
  }
  means<-means/nrow(x)
  return(means)
}
x<-matrix(rnorm(1000*1000),ncol=1000)
microbenchmark(fn1(x),apply(x,2,mean),colMeans(x),times=10)
## Unit: microseconds
## expr min lq mean median uq max
## fn1(x) 73492.633 74396.430 76997.7879 76442.792 78662.694 84748.037
## apply(x, 2, mean) 18004.542 18446.800 20093.2434 19827.930 21490.497 23100.756
## colMeans(x) 743.386 786.053 888.2478 824.412 935.796 1242.259
## neval cld
## 10 c
## 10 b
## 10 a
Notice that the loop is roughly 4X slower than apply for this task. Additionally, note that colMeans is even more efficient: about 20X faster than apply and roughly 85X faster than the loop. Both apply and colMeans use internal vectorized code, but for this particular task colMeans is far superior. NOTE: Be aware that not all loops in R are slow. If the loop does not index individual elements, or uses vectorized operations within each iteration, it may be as fast as a vectorized function for the same task. But as the example above illustrates, repeatedly indexing individual elements within nested loops is certainly very slow!
If the above tips still do not result in sufficient speed gains, there are additional options.
Parallelization. One can use parallel processing within R to distribute the job to multiple processors. This is particularly useful for permutation, bootstrap and Monte Carlo procedures where the iterations can be divided and then recombined. The package parallel provides some tools.
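A minimal sketch of the idea using the parallel package (the permuted-correlation task here is purely illustrative): the iterations are farmed out to a cluster of worker processes and the results gathered back. On Unix-alikes, mclapply() offers a simpler forking-based alternative.
library(parallel)
x<-matrix(rnorm(10000),ncol=10)
y<-matrix(rnorm(10000),ncol=10)
cl<-makeCluster(detectCores()-1)  #set up the worker processes
clusterExport(cl,c("x","y"))      #ship the data to the workers
r.rand<-unlist(parLapply(cl,1:999,function(i) cor(x[,1],y[sample(nrow(y)),1])))
stopCluster(cl)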
Byte-compiled R code. One can also byte-compile portions of R code to improve performance. The base compiler package (e.g., cmpfun()) provides the tools, although recent versions of R already byte-compile functions automatically on first use.
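A brief sketch using the compiler package: byte-compile the loop-based mymean() from the start of this section and compare (because of the automatic just-in-time compilation just mentioned, the gain on a current R installation may be small).
library(compiler)
mymean.cmp<-cmpfun(mymean)  #byte-compiled version of the earlier loop-based function
x<-rnorm(1000)
microbenchmark(mymean(x), mymean.cmp(x))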
Incorporate C or Fortran code. One can also write code in lower-level languages such as C and Fortran and call it from within one's R script. In fact, the lowest levels of R are written in these languages. The interfaces .C and .Call allow one to work directly with code written in C, and the Rcpp package makes interfacing with C++ considerably easier. So long as the code is written efficiently, this will provide the best improvement in computational performance. Note, however, that considerably greater programming skill is required to write C or Fortran code (and to interface with it properly), so this option is generally not recommended for the novice user.
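A sketch of the Rcpp route (this assumes the Rcpp package and a working C++ compiler are installed): the column-sum loop is written in C++, compiled at run time with cppFunction(), and then benchmarked against base colSums().
library(Rcpp)
cppFunction('NumericVector colSumsC(NumericMatrix x){
  int nrow=x.nrow(), ncol=x.ncol();
  NumericVector out(ncol);
  for(int j=0; j<ncol; j++){
    double total=0;
    for(int i=0; i<nrow; i++){ total+=x(i,j); }
    out[j]=total;
  }
  return out;
}')
x<-matrix(rnorm(1000000),ncol=1000)
microbenchmark(colSumsC(x), colSums(x), times=10)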
Let’s revisit our original PLS function:
#Perform multivariate correlation assessed via permutation
pls.slow<-function(x,y,iter=999){
  data.all<-cbind(x,y)
  data.all<-scale(data.all,scale=FALSE)
  XY.vcv<-cov(data.all)
  S12<-XY.vcv[1:dim(x)[2],(dim(x)[2]+1):(dim(x)[2]+dim(y)[2])]
  pls<-svd(S12)
  U<-pls$u; V<-pls$v
  XScores<-x%*%U[,1]; YScores<-y%*%V[,1]
  PLS.obs<-cor(XScores,YScores)
  P.val<-1
  PLS.rand<-NULL
  for(i in 1:iter){
    y.r<-y[sample(nrow(y)),]
    data.r<-cbind(x,y.r)
    data.r<-scale(data.r,scale=FALSE)
    XY.vcv.r<-cov(data.r)
    S12.r<-XY.vcv.r[1:dim(x)[2],(dim(x)[2]+1):(dim(x)[2]+dim(y.r)[2])]
    pls.r<-svd(S12.r)
    U.r<-pls.r$u; V.r<-pls.r$v
    XScores.r<-x%*%U.r[,1]; YScores.r<-y.r%*%V.r[,1]
    corr.rand<-cor(XScores.r,YScores.r)
    PLS.rand<-rbind(PLS.rand,corr.rand)
    P.val<-ifelse(corr.rand>=PLS.obs, P.val+1,P.val)
  }
  P.val<-P.val/(iter+1)
  return(list(PLS.corr=PLS.obs,pvalue=P.val))
}
Using the tips we’ve learned, here is a modified version:
#Perform multivariate correlation assessed via permutation
pls.fast<-function(x,y,iter=999){
  n<-nrow(x)
  x<-scale(x,scale=FALSE); y<-scale(y,scale=FALSE)
  ind <- c(list(1:n), (Map(function(x) sample.int(n, n), 1:iter)))
  y.rand <- lapply(1:(iter+1), function(i) y[ind[[i]], ])
  S12.r<-lapply(1:(iter+1), function(i) crossprod(x, y.rand[[i]])/(n - 1))
  pls.r<-lapply(1:(iter+1), function(i) La.svd(S12.r[[i]], 1, 1))
  r.rand<-sapply(1:(iter+1), function(i) cor(x %*% pls.r[[i]]$u, y.rand[[i]] %*% t(pls.r[[i]]$vt)))
  p.val<- 1-(rank(r.rand)[1])/(iter+1)
  return(list(PLS.corr=r.rand[[1]],pvalue=p.val))
}
In this modified version, we've pre-calculated components once (centering x and y with scale), used lower-level functions (crossprod and La.svd), avoided growing the results as they are generated, and eliminated the explicit loop by using lapply, sapply, and Map.
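As a quick numerical check of the substitution used above: the cross-covariance block that pls.slow extracts from cov(cbind(x, y)) is identical to crossprod() of the centered matrices divided by (n - 1).
x<-matrix(rnorm(100),ncol=5)
y<-matrix(rnorm(100),ncol=5)
xc<-scale(x,scale=FALSE); yc<-scale(y,scale=FALSE)
S12.cov<-cov(cbind(x,y))[1:5,6:10]       #block of the full covariance matrix
S12.cross<-crossprod(xc,yc)/(nrow(x)-1)  #lower-level equivalent used in pls.fast
all.equal(S12.cov,S12.cross,check.attributes=FALSE)  #TRUE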
Now let’s compare them:
x<-matrix(rnorm(10000),ncol=10)
y<-matrix(rnorm(20000),ncol=20)
microbenchmark(pls.slow(x,y),pls.fast(x,y),times=10)
## Unit: milliseconds
## expr min lq mean median uq max
## pls.slow(x, y) 1503.4199 1571.3006 1654.6739 1698.2766 1709.4633 1759.859
## pls.fast(x, y) 598.3419 605.0677 658.2071 607.3961 618.0383 1036.834
## neval cld
## 10 b
## 10 a
For this example, we’ve sped up the function by a factor of 2-3.