Motivation

In R there are often multiple ways of doing the same thing. Typically, we don’t consider which implementation is preferable; instead we ask: ‘Does it work?’ This is usually sufficient for our needs, as for simple one-off analyses of reasonably-sized datasets all we are concerned with is obtaining a consistent numerical answer.

However, while multiple implementations may yield the same numerical answer, they may not be identical in terms of the time they require to obtain it. This difference can matter a great deal, especially in this era of big data, of which 21st-century biology is clearly a part.

Additionally, biologists do not only use simple analyses. Increasingly, they are using computationally-intensive approaches such as permutation, bootstrapping, or Monte Carlo simulations to evaluate patterns in their data. Here also, computational efficiency matters a great deal.

So the question is: how do different implementations of some set of operations compare in terms of computational speed?

Example: Find the mean of a sample

The mean can be obtained in various ways: mean(), apply(), sum(x)/length(x), etc. How do they compare? We can use the microbenchmark package to evaluate them.

mymean<-function(x){  #Create my own function for the mean
  n<-length(x)
  tmp<-0
  for (i in 1:n){
    tmp<-tmp+x[i]
  }
  mn<-tmp/n
  return(mn)
}
## simulate some data
x<-matrix(rnorm(1000))
n<-length(x)

#Compare implementations
library(microbenchmark)
library(ggplot2)
microbenchmark(mean(x),apply(x,2,mean), sum(x)/length(x),mymean(x),colSums(x)/length(x))
## Unit: nanoseconds
##                  expr   min    lq     mean  median      uq     max neval cld
##               mean(x)  3282  3693  6182.80  4103.0  4513.0  143180   100  a 
##     apply(x, 2, mean) 29128 30359 38913.12 32205.5 39385.0  151795   100  ab
##      sum(x)/length(x)   820   821  1173.62   821.0  1231.0   14360   100  a 
##             mymean(x) 27077 27488 68730.60 27898.0 36102.5 3330468   100   b
##  colSums(x)/length(x)  3282  3693  5185.99  4103.0  5744.0   22975   100  a

Note that the base function mean() is NOT the fastest, apply() was rather slow for this particular task, and my own function was the worst!

Why R can be Slow

So why are some implementations slow and others fast? There are many reasons, but two big ones are:

  1. Inefficient Code. The code works but is computationally inefficient.

  2. Computational Overhead. The code uses functions that do (or evaluate) many things. For example, base functions are usually quite robust and very general, and are thus incredibly useful. However, they also perform considerable data checks to ensure the proper data type is provided, often calculate or pass along many additional components beyond the main values of interest, and have many options (print, labels, names, etc.) that must be evaluated at run-time. The result is that these functions are often slower than more streamlined algorithms designed to do the same thing (a quick illustration follows below).
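
One quick way to see this overhead is simply to inspect, in an interactive R session, how much machinery sits behind mean() compared with a bare arithmetic primitive:

mean            #an S3 generic: calling it first dispatches on the class of x
mean.default    #checks the argument type, handles na.rm and trim, then computes the mean internally
sum             #a primitive that drops straight into compiled code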

The Challenge

For the end-user, this generates a trade-off between coding efficiency (how long does it take me to write this?) and computational efficiency (how long does it take to run?). Improving the latter often comes at the expense of the former. The trick is to improve the speed of one's code where it matters most (i.e., find the slowest parts of the code and speed them up).

Speeding up R: Finding the Slow Bits

There are several ways to find the slow portions of your code. Both microbenchmark and system.time can be used creatively to identify choke points. See H. Wickham's Advanced R book, especially the chapter on optimizing code: http://adv-r.had.co.nz/Profiling.html. One approach I'll illustrate here uses the aprof package.
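
Before turning to aprof, note that system.time() gives a quick, coarse timing of a single expression, which is often a useful first pass:

x<-matrix(rnorm(1e6),ncol=100)
system.time(apply(x,2,mean))  #reports user, system, and elapsed time for one call
system.time(colMeans(x))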

First, let’s look at a function.

NOTE: This function is easy to read, but it turns out to be rather poorly written: as we will see, there are numerous ways of improving its performance.

#Perform multivariate correlation assessed via permutation
pls.slow<-function(x,y,iter=999){
  data.all<-cbind(x,y)
  data.all<-scale(data.all,scale=FALSE)
  XY.vcv<-cov(data.all)
  S12<-XY.vcv[1:dim(x)[2],(dim(x)[2]+1):(dim(x)[2]+dim(y)[2])]
  pls<-svd(S12)
  U<-pls$u; V<-pls$v
  XScores<-x%*%U[,1]; YScores<-y%*%V[,1]
  PLS.obs<-cor(XScores,YScores)
  P.val<-1
  PLS.rand<-NULL
  for(i in 1:iter){
    y.r<-y[sample(nrow(y)),]    
    data.r<-cbind(x,y.r)
    data.r<-scale(data.r,scale=FALSE)
    XY.vcv.r<-cov(data.r)
    S12.r<-XY.vcv.r[1:dim(x)[2],(dim(x)[2]+1):(dim(x)[2]+dim(y.r)[2])]
    pls.r<-svd(S12.r)
    U.r<-pls.r$u; V.r<-pls.r$v
    XScores.r<-x%*%U.r[,1]; YScores.r<-y.r%*%V.r[,1]
    corr.rand<-cor(XScores.r,YScores.r)
    PLS.rand<-rbind(PLS.rand,corr.rand)
    P.val<-ifelse(corr.rand>=PLS.obs, P.val+1,P.val) 
  }  
  P.val<-P.val/(iter+1)
  return(list(PLS.corr=PLS.obs,pvalue=P.val))
}

Using the aprof package, we can give this function some data, and profile its performance to find choke-points in the code:

library(aprof)
source("07-pls.slow.r")
tmp<-tempfile() #create tmp file for saving profiler output
Rprof(tmp,line.profiling=TRUE)  #profile the function
x<-matrix(rnorm(1000),ncol=10)
y<-matrix(rnorm(1000),ncol=10)
pls.slow(x,y)
Rprof(NULL)  #stop the profiler
fooaprof<-aprof("07-pls.slow.r",tmp) #Create aprof object
plot(fooaprof)

#save the same plot to a png file
png("Aprof.out.png")
plot(fooaprof)
dev.off()

The output looks something like this:

[Figure: aprof output, showing execution time by line of pls.slow]

Here we can see which lines of the function consume the most time. With this, we can determine which parts would provide the best speed-up if they are altered. In this case, all are within the permutation loop:

  1. Obtaining the covariance matrix cov

  2. Performing the singular-value decomposition svd

  3. Scaling the data scale

  4. Obtaining the correlation cor.

  5. Obtaining the cross-covariance component of the covariance matrix
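
The aprof object can also be summarized, which projects (via Amdahl's law) the largest overall speed-up attainable by improving each line, and so helps confirm which of these targets are worth the effort:

summary(fooaprof)  #table of projected (theoretical) speed-ups per line of 07-pls.slow.r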

Now we will discuss some general principles for speeding up R code.

1: Pre-allocate Memory

Don’t grow objects within loops or functions. Instead, pre-allocate them and fill the values in as they are calculated.

iter<-99
SS<-array(NA,iter) #pre-allocate
newSS<-function(iter){ #grow object on the fly
  SS<-NULL
  for (i in 1:iter){SS<-rbind(SS,NA)}
  return(SS)
}

microbenchmark(SS<-array(NA,99), x<-newSS(99))
## Unit: nanoseconds
##                 expr   min    lq      mean median       uq     max neval cld
##  SS <- array(NA, 99)     0   410    792.08    411    821.0    4103   100  a 
##       x <- newSS(99) 92719 95180 133649.69  99898 121846.5 2476723   100   b
microbenchmark(SS<-array(NA,9999), x<-newSS(9999),times=10)
## Unit: microseconds
##                   expr       min        lq       mean     median        uq
##  SS <- array(NA, 9999)     3.282     3.283     7.4261     7.3855    10.667
##       x <- newSS(9999) 86242.603 86453.066 97829.1659 90099.8400 99744.985
##         max neval cld
##      13.129    10  a 
##  143216.646    10   b

As we can see, pre-allocating memory can yield huge time savings, particularly for large objects or many iterations.
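
For completeness, the full pre-allocate-then-fill pattern looks like this (a minimal sketch; the statistic computed inside the loop is just a placeholder):

iter<-999
out<-numeric(iter)          #allocate the full results vector once
for (i in 1:iter){
  out[i]<-mean(rnorm(100))  #fill in place; nothing is copied or grown
}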

2: Pre-Calculate Components

Components used repeatedly in a function can be calculated first, then re-used.

#Linear regression using matrix algebra
x<-cbind(1,matrix(rnorm(1000),ncol=10))
y<-matrix(rnorm(100))

all.calc<-function(x,y){
  coef.r<-array(NA,dim=c(999,ncol(x)))
  for (i in 1:999){
    y.r<-y[sample(nrow(y)),]    
    coef.r[i,]<-solve(t(x)%*%x)%*%t(x)%*%y.r
  }
}

#notice that solve(t(x)%*%x)%*%t(x) is constant across iterations. Pull it out of the loop
hat.calc<-function(x,y){
  hat<-solve(t(x)%*%x)%*%t(x)
  coef.r<-array(NA,dim=c(999,ncol(x)))
  for (i in 1:999){
    y.r<-y[sample(nrow(y)),]    
    coef.r[i,]<-hat%*%y.r
  }  
}

microbenchmark(all.calc(x,y),hat.calc(x,y),times=10)
## Unit: milliseconds
##            expr      min       lq     mean   median       uq      max neval cld
##  all.calc(x, y) 61.81836 63.56482 66.58850 66.25385 70.97119 71.68053    10   b
##  hat.calc(x, y) 12.72084 13.65951 16.39145 16.28721 18.89521 22.40291    10  a

Pre-calculating components can yield huge time savings, particularly if parts of a calculation are repeated in numerous places within a function.

3: Use Lower-Level R Functions and Algebra

Base R functions are robust, but can be slow. Instead, use lower-level functions, or alternative algebraic formulations that accomplish the same task.

#Some ways of performing linear models
lm(y~x)  #Common method
solve(t(xf)%*%xf)%*%t(xf)%*%y
crossprod(solve(crossprod(xf)),crossprod(xf,y))
lm.fit(xf,y)$coefficients
.lm.fit(xf,y)$coefficients  # NOTE: a very low-level function (cannot use in packages submitted to CRAN)
qr.coef(qr(xf),y)

Now compare them:

x<-matrix(rnorm(10000),ncol=2)
xf<-cbind(1,x)
y<-matrix(rnorm(nrow(x)))

microbenchmark(
  lm(y~x),
  solve(t(xf)%*%xf)%*%t(xf)%*%y,
  crossprod(solve(crossprod(xf)),crossprod(xf,y)),
  lm.fit(xf,y),.lm.fit(xf,y),
  qr.coef(qr(xf),y)
)
## Unit: microseconds
##                                               expr     min        lq      mean
##                                          lm(y ~ x) 936.617 1050.8735 1513.4510
##                solve(t(xf) %*% xf) %*% t(xf) %*% y 177.642  196.9235  271.0407
##  crossprod(solve(crossprod(xf)), crossprod(xf, y))  82.872   94.3595  123.9348
##                                      lm.fit(xf, y) 259.283  286.7700  469.8923
##                                     .lm.fit(xf, y) 186.257  194.8725  314.9711
##                                 qr.coef(qr(xf), y) 175.180  201.6410  338.6346
##     median        uq      max neval  cld
##  1242.2590 1664.0030 4960.008   100    d
##   230.9750  334.5650  694.156   100 ab  
##   109.1285  126.3595  531.283   100 a   
##   326.5650  494.9755 4688.419   100   c 
##   220.7185  315.2830 3291.903   100 a c 
##   241.6415  326.7705 5892.933   100  bc

Note that the base function lm is not the most computationally efficient.

Be aware that the size of the data matrices, and which dimensions (or objects) are larger or smaller, can make a difference.

#Large X univ. Y
x<-matrix(rnorm(10000),ncol=50)
xf<-cbind(1,x)
y<-matrix(rnorm(nrow(x)))
microbenchmark(
  lm(y~x),
  solve(t(xf)%*%xf)%*%t(xf)%*%y,
  crossprod(solve(crossprod(xf)),crossprod(xf,y)),
  lm.fit(xf,y),.lm.fit(xf,y),
  qr.coef(qr(xf),y)
)
## Unit: microseconds
##                                               expr      min        lq      mean
##                                          lm(y ~ x) 1026.874 1125.7455 1353.6230
##                solve(t(xf) %*% xf) %*% t(xf) %*% y  732.719  786.8735  927.1444
##  crossprod(solve(crossprod(xf)), crossprod(xf, y))  392.206  417.8470  484.6205
##                                      lm.fit(xf, y)  464.001  489.6420  548.0217
##                                     .lm.fit(xf, y)  424.206  432.4110  477.7734
##                                 qr.coef(qr(xf), y)  443.078  482.6675  584.4525
##     median        uq      max neval cld
##  1221.5410 1416.0025 7405.961   100   c
##   824.4120  940.1045 5108.522   100  b 
##   438.9755  487.3860 1329.234   100 a  
##   520.8215  582.1550  963.694   100 a  
##   460.3095  493.7450  788.104   100 a  
##   508.9245  584.8215 5051.906   100 a
#Large Y univ. X
y<-matrix(rnorm(10000),ncol=100)
x<-matrix(rnorm(nrow(y)))
xf<-cbind(1,x)
microbenchmark(
  lm(y~x),
  solve(t(xf)%*%xf)%*%t(xf)%*%y,
  crossprod(solve(crossprod(xf)),crossprod(xf,y)),
  lm.fit(xf,y),.lm.fit(xf,y),
  qr.coef(qr(xf),y)
)
## Unit: microseconds
##                                               expr     min      lq       mean
##                                          lm(y ~ x) 730.668 800.412 1068.07584
##                solve(t(xf) %*% xf) %*% t(xf) %*% y  46.769  55.180   70.38808
##  crossprod(solve(crossprod(xf)), crossprod(xf, y))  37.333  43.693   58.11731
##                                      lm.fit(xf, y) 155.488 169.847  202.83958
##                                     .lm.fit(xf, y) 117.744 122.667  138.15026
##                                 qr.coef(qr(xf), y)  85.334  96.616  124.42714
##    median        uq      max neval cld
##  851.6935 1045.5405 5707.087   100   c
##   61.9495   74.2565  230.975   100 a  
##   51.2825   65.8465  146.051   100 a  
##  178.4620  208.8215  379.488   100  b 
##  127.1805  147.8980  221.540   100 ab 
##  108.7185  133.1290  329.026   100 ab
#large Y and X
y<-matrix(rnorm(20000),ncol=100)
x<-matrix(rnorm(10000),ncol=50)
xf<-cbind(1,x)
microbenchmark(
  lm(y~x),
  solve(t(xf)%*%xf)%*%t(xf)%*%y,
  crossprod(solve(crossprod(xf)),crossprod(xf,y)),
  lm.fit(xf,y),.lm.fit(xf,y),
  qr.coef(qr(xf),y)
)
## Unit: milliseconds
##                                               expr      min       lq     mean
##                                          lm(y ~ x) 4.178058 4.415187 4.954002
##                solve(t(xf) %*% xf) %*% t(xf) %*% y 1.302566 1.344003 2.526392
##  crossprod(solve(crossprod(xf)), crossprod(xf, y)) 1.460925 1.530054 1.624614
##                                      lm.fit(xf, y) 3.262364 3.416006 3.765409
##                                     .lm.fit(xf, y) 3.200005 3.285954 3.595206
##                                 qr.coef(qr(xf), y) 1.849029 1.937029 2.131634
##    median       uq        max neval cld
##  4.643905 5.099906   9.940940   100   c
##  1.406566 1.518156 107.916075   100 ab 
##  1.575182 1.675080   2.160003   100 a  
##  3.578057 3.874264   6.759396   100  bc
##  3.376621 3.658057   6.818884   100  bc
##  2.016618 2.206568   5.345650   100 ab

This exercise demonstrates that care must be taken when writing functions: knowing the size of the input objects can make a difference in which implementation is preferable. Also, when writing general functions for yourself and others (e.g., to go in an R package), one can first check the dimensions (rows and columns) of the input matrices and then shunt the analysis to the appropriate pipeline for speed efficiency (see the sketch below).
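
A minimal sketch of the idea is below; the function name, branches, and cut-off are purely illustrative, and the pipeline used in each case should be chosen by benchmarking, as above:

coef.dispatch<-function(xf,y){  #hypothetical dimension-based dispatcher
  if (ncol(xf) < 0.5*nrow(xf)){
    crossprod(solve(crossprod(xf)),crossprod(xf,y))  #e.g., the crossprod pipeline
  } else {
    qr.coef(qr(xf),y)  #e.g., a QR-based pipeline
  }
}

The dimension check itself costs essentially nothing relative to the fitting.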

4: Vectorize When Possible

Or: ‘Don’t speak R with a C accent’.

When writing a function, one often thinks in terms of an algorithm: “Do this, then do that, etc.” For iterative tasks, these algorithms take the form of loops. Loops are very useful in programming and are straightforward to understand. However, one must remember that in R the primary data type is a vector, and many R functions are ‘vectorized’. This means that they perform a given operation over all elements of a vector automatically. Essentially, these functions already contain a loop that has been programmed in a lower-level compiled language (e.g., C). As a result, vectorized R functions are often faster than performing the same task with a for-loop. NOTE: the apply family (apply, sapply, lapply, tapply, mapply, Map, etc.) is designed precisely for these sorts of ‘looping’ tasks. USE THEM!!
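
As a minimal illustration of what ‘vectorized’ means in practice:

x<-rnorm(1e5)
out1<-numeric(length(x))  #loop version: the per-element work happens in interpreted R
for (i in seq_along(x)){
  out1[i]<-x[i]^2 + 2*x[i]
}
out2<-x^2 + 2*x           #vectorized version: one expression; the loop runs in compiled code
all.equal(out1,out2)      #same answer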

Looping in R is especially inefficient when indexing rows and columns in an array or matrix. Let’s see an example:

fn1<-function(x){
  means<-array(0,ncol(x))
  for(i in 1:ncol(x)){
    for(j in 1:nrow(x)){
      means[i]<-means[i]+x[j,i]
    }
  }
  means<-means/nrow(x)
  return(means)
}

x<-matrix(rnorm(1000*1000),ncol=1000)
microbenchmark(fn1(x),apply(x,2,mean),colMeans(x),times=10)
## Unit: microseconds
##               expr       min        lq       mean    median        uq       max
##             fn1(x) 73492.633 74396.430 76997.7879 76442.792 78662.694 84748.037
##  apply(x, 2, mean) 18004.542 18446.800 20093.2434 19827.930 21490.497 23100.756
##        colMeans(x)   743.386   786.053   888.2478   824.412   935.796  1242.259
##  neval cld
##     10   c
##     10  b 
##     10 a

Notice that the loop is roughly 4 times slower than apply. Additionally, note that colMeans is even more efficient: about 20X faster than apply and roughly 90X faster than the loop. apply avoids element-by-element indexing (it passes whole columns to mean), while colMeans is implemented entirely in compiled code; for this particular task, colMeans is far superior. NOTE: Be aware that not all loops in R are slow. If the loop does not index individual elements, or uses vectorized operations within its body, it may be nearly as fast as a vectorized function for the same task. But as the example above illustrates, indexing individual elements within loops is certainly very slow!
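
For instance, a loop whose body is itself a vectorized operation spends nearly all of its time in compiled code (a minimal sketch, not benchmarked here):

loop.colmeans<-function(x){
  means<-numeric(ncol(x))
  for (i in seq_len(ncol(x))){
    means[i]<-sum(x[,i])/nrow(x)  #one vectorized sum per column
  }
  means
}

Because sum() does the element-wise work in compiled code, this loop avoids most of the penalty seen in fn1(), even though it still loops over columns.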

5: When More Speed is Needed

If the above tips still do not result in sufficient speed gains, there are additional options.

  1. Parallelization. One can use parallel processing within R to distribute the job across multiple processors. This is particularly useful for permutation, bootstrap, and Monte Carlo procedures, where the iterations can be divided and then recombined. The package parallel provides some tools (a brief sketch follows this list).

  2. Byte-compiled R code. One can also byte-compile functions to reduce interpreter overhead. The base package compiler (e.g., cmpfun) provides the tools, and recent versions of R byte-compile most functions automatically.

  3. Incorporate C, C++, or Fortran code. One can also write code in lower-level languages such as C, C++, and Fortran and call it from within one’s R script. In fact, the lowest levels of R are written in these languages. Using .C and .Call allows one to interface directly with code written in C, and the Rcpp package streamlines interfacing with C++. So long as the code is written efficiently, this provides the best improvement in computational performance. Note, however, that considerably greater programming skill is required to write C, C++, or Fortran code (and to interface with it properly), so this option is generally not recommended for the novice user.
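
Below is a brief sketch of the first two options. The permutation statistic is just a placeholder, mc.cores=2 is an arbitrary choice, and mclapply relies on forking, so on Windows parLapply with a cluster is the usual substitute.

library(parallel)
library(compiler)

#(1) Parallelization: split permutation iterations across cores, then recombine
perm.cor<-function(i,x,y) cor(x,sample(y))  #placeholder permutation statistic
x<-rnorm(100); y<-rnorm(100)
r.rand<-unlist(mclapply(1:999, perm.cor, x=x, y=y, mc.cores=2))

#(2) Byte compilation: cmpfun() returns a byte-compiled copy of a function,
#which can reduce interpreter overhead for loop-heavy code such as mymean() above
mymean.c<-cmpfun(mymean)
microbenchmark(mymean(x), mymean.c(x))  #compare the two (gains may be modest on recent R versions)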

Using What We’ve Learned

Let’s revisit our original PLS function:

#Perform multivariate correlation assessed via permutation
pls.slow<-function(x,y,iter=999){
  data.all<-cbind(x,y)
  data.all<-scale(data.all,scale=FALSE)
  XY.vcv<-cov(data.all)
  S12<-XY.vcv[1:dim(x)[2],(dim(x)[2]+1):(dim(x)[2]+dim(y)[2])]
  pls<-svd(S12)
  U<-pls$u; V<-pls$v
  XScores<-x%*%U[,1]; YScores<-y%*%V[,1]
  PLS.obs<-cor(XScores,YScores)
  P.val<-1
  PLS.rand<-NULL
  for(i in 1:iter){
    y.r<-y[sample(nrow(y)),]    
    data.r<-cbind(x,y.r)
    data.r<-scale(data.r,scale=FALSE)
    XY.vcv.r<-cov(data.r)
    S12.r<-XY.vcv.r[1:dim(x)[2],(dim(x)[2]+1):(dim(x)[2]+dim(y.r)[2])]
    pls.r<-svd(S12.r)
    U.r<-pls.r$u; V.r<-pls.r$v
    XScores.r<-x%*%U.r[,1]; YScores.r<-y.r%*%V.r[,1]
    corr.rand<-cor(XScores.r,YScores.r)
    PLS.rand<-rbind(PLS.rand,corr.rand)
    P.val<-ifelse(corr.rand>=PLS.obs, P.val+1,P.val) 
  }  
  P.val<-P.val/(iter+1)
  return(list(PLS.corr=PLS.obs,pvalue=P.val))
}

Using the tips we’ve learned, here is a modified version:

#Perform multivariate correlation assessed via permutation
pls.fast<-function(x,y,iter=999){
  n<-nrow(x)
  x<-scale(x,scale=FALSE); y<-scale(y,scale=FALSE) 
  ind <- c(list(1:n), (Map(function(x) sample.int(n, n), 1:iter)))  
  y.rand <- lapply(1:(iter+1), function(i) y[ind[[i]], ])
  S12.r<-lapply(1:(iter+1), function(i) crossprod(x, y.rand[[i]])/(n - 1)) 
  pls.r<-lapply(1:(iter+1), function(i) La.svd(S12.r[[i]], 1, 1))
  r.rand<-sapply(1:(iter+1), function(i) cor(x %*% pls.r[[i]]$u, y.rand[[i]] %*% t(pls.r[[i]]$vt)))
  p.val<- 1-(rank(r.rand)[1])/(iter+1)
  return(list(PLS.corr=r.rand[[1]],pvalue=p.val))
}

In this modified version, we’ve pre-calculated components once (centering with scale), used lower-level functions (crossprod and La.svd), avoided growing a results object, and replaced the explicit loop by vectorizing over the permutations with lapply, sapply, and Map.

Now let’s compare them:

x<-matrix(rnorm(10000),ncol=10)
y<-matrix(rnorm(20000),ncol=20)
microbenchmark(pls.slow(x,y),pls.fast(x,y),times=10)
## Unit: milliseconds
##            expr       min        lq      mean    median        uq      max
##  pls.slow(x, y) 1503.4199 1571.3006 1654.6739 1698.2766 1709.4633 1759.859
##  pls.fast(x, y)  598.3419  605.0677  658.2071  607.3961  618.0383 1036.834
##  neval cld
##     10   b
##     10  a

For this example, we’ve sped up the function by a factor of 2-3.