SC2

Statistical Computing 2

2. RcppArmadillo


Introduction

Many standard statistical models/algorithms (e.g., linear regression, principal component analysis, …) require numerical linear algebra routines, hence this section explains how to perform such computations efficiently using the RcppArmadillo R package. RcppArmadillo provides an interface to the Armadillo C++ numerical linear algebra library, which is highly useful in practice because it strikes an attractive balance between performance and ease of use. The syntax of Armadillo resembles that of Matlab, which is not too different from R’s syntax. Part of the performance speed-up of Armadillo, relative to base R, is obtained via automatic pooling of several linear algebra operations into one at compile time.

Motivating example 1: matrix-matrix-vector product

To demonstrate the capabilities of Armadillo, we start by considering a simple example on matrix multiplication. In particular, consider computing the product \({\bf A} {\bf B} {\bf y}\), where \({\bf A}\) and \({\bf B}\) are \(d \times d\) matrices and \({\bf y}\) is a \(d\)-dimensional vector. In R we would compute the product simply by doing A %*% B %*% y. Could we get some speed-ups by using Rcpp? To verify this, we try to create a function that computes the matrix-matrix-vector product in Rcpp:

library(Rcpp)
sourceCpp(code = '
 #include <Rcpp.h>
 using namespace Rcpp;

 // [[Rcpp::export(name = "MMv_wrong")]]
 NumericVector MMv_I(NumericMatrix A, NumericMatrix B, NumericVector y) {
   return A * B * y;
 }'
)

The problem with this function is that it is not equivalent to A %*% B %*% y. In fact, it does not compute A %*% B %*% y, but something quite different:

d <- 3
A <- matrix(1, d, d)
B <- matrix(1, d, d)
y <- 1:d

MMv_wrong(A, B, y)
## [1]  1.000000e+00  2.000000e+00  3.000000e+00  0.000000e+00 2.917744e-314
## [6] 4.672873e-310 4.672877e-310 4.672877e-310 1.482197e-323

In particular, A * B in MMv_I is equivalent to as.vector(A * B) in R, that is, it computes the element-wise product of the two matrices and then converts the output to a vector. One might then expect that A * B * y in MMv_I should be equivalent to:

as.vector(A * B) * y
## [1] 1 2 3 1 2 3 1 2 3

in R, but this is also wrong because, when multiplying vectors of different lengths, Rcpp does not perform recycling as R does. For example:

addDiffLengh <- cppFunction("
 NumericVector addDiffLengh(NumericVector x1, NumericVector x2) {
     return x1 + x2;
}")
addDiffLengh(1:3, 1)
## [1] 2 2 3
addDiffLengh(1, 1:3)
## [1] 2

does not behave in the way we would expect in R. In the first case we are adding 1 to the vector 1:3, and the output shows that no recycling is going on (compare this with 1:3 + 1 in R). In the second case, we are adding 1:3 to a vector of length 1, and Rcpp does not do any recycling either.
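For comparison, recall how recycling works in base R:

1:3 + 1
## [1] 2 3 4
1 + 1:3
## [1] 2 3 4

In both cases R recycles the shorter vector to the length of the longer one, which is exactly what the Rcpp code above does not do.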

The bottom line is that we should not expect Rcpp code to behave like R code when it comes to recycling, and that Rcpp does not provide a direct equivalent of the %*% operator for matrix-matrix and matrix-vector multiplication. Hence, when doing linear algebra in Rcpp, we are better off using the methods offered by the RcppArmadillo package, as done in the following function:

sourceCpp(code = '
 // [[Rcpp::depends(RcppArmadillo)]]
 #include <RcppArmadillo.h>
 using namespace Rcpp;

 // [[Rcpp::export(name = "MMv_arma")]]
 arma::vec MMv_arma_I(arma::mat& A, arma::mat& B, arma::vec& y) {
   return A * B * y;
}', verbose = TRUE)
## 
## Generated extern "C" functions 
## --------------------------------------------------------
## 
## 
## #include <Rcpp.h>
## #ifdef RCPP_USE_GLOBAL_ROSTREAM
## Rcpp::Rostream<true>&  Rcpp::Rcout = Rcpp::Rcpp_cout_get();
## Rcpp::Rostream<false>& Rcpp::Rcerr = Rcpp::Rcpp_cerr_get();
## #endif
## 
## // MMv_arma_I
## arma::vec MMv_arma_I(arma::mat& A, arma::mat& B, arma::vec& y);
## RcppExport SEXP sourceCpp_5_MMv_arma_I(SEXP ASEXP, SEXP BSEXP, SEXP ySEXP) {
## BEGIN_RCPP
##     Rcpp::RObject rcpp_result_gen;
##     Rcpp::RNGScope rcpp_rngScope_gen;
##     Rcpp::traits::input_parameter< arma::mat& >::type A(ASEXP);
##     Rcpp::traits::input_parameter< arma::mat& >::type B(BSEXP);
##     Rcpp::traits::input_parameter< arma::vec& >::type y(ySEXP);
##     rcpp_result_gen = Rcpp::wrap(MMv_arma_I(A, B, y));
##     return rcpp_result_gen;
## END_RCPP
## }
## 
## Generated R functions 
## -------------------------------------------------------
## 
## `.sourceCpp_5_DLLInfo` <- dyn.load('/tmp/RtmpywOfWB/sourceCpp-x86_64-pc-linux-gnu-1.0.10/sourcecpp_b65e236439e7/sourceCpp_6.so')
## 
## MMv_arma <- Rcpp:::sourceCppFunction(function(A, B, y) {}, FALSE, `.sourceCpp_5_DLLInfo`, 'sourceCpp_5_MMv_arma_I')
## 
## rm(`.sourceCpp_5_DLLInfo`)
## 
## Building shared library
## --------------------------------------------------------
## 
## DIR: /tmp/RtmpywOfWB/sourceCpp-x86_64-pc-linux-gnu-1.0.10/sourcecpp_b65e236439e7
## 
## /opt/R/4.2.2/lib/R/bin/R CMD SHLIB -o 'sourceCpp_6.so' 'fileb65e728181f3.cpp'

Before illustrating the performance of this function, let’s explain what is going on in the code above:

  1. the // [[Rcpp::depends(RcppArmadillo)]] attribute tells Rcpp that our code depends on the RcppArmadillo package, so that the relevant headers and libraries are used during compilation;
  2. we include RcppArmadillo.h rather than Rcpp.h (the former includes the latter);
  3. the // [[Rcpp::export(name = "MMv_arma")]] attribute makes the C++ function MMv_arma_I available in R under the name MMv_arma, as can be seen in the verbose output above;
  4. the arguments and the return value are Armadillo matrices (arma::mat) and vectors (arma::vec), and in Armadillo the * operator performs standard matrix multiplication, hence A * B * y computes exactly the product we want.

Let us now compare our Armadillo implementation with the plain R version:

d <- 1e3
A <- matrix(rnorm(d^2), d, d)
B <- matrix(rnorm(d^2), d, d)
y <- rnorm(d)

max(abs(A %*% B %*% y - MMv_arma(A, B, y)))
## [1] 4.547474e-12
library(microbenchmark)
microbenchmark(R = A %*% B %*% y, 
               Arma = MMv_arma(A, B, y))
## Unit: microseconds
##  expr     min       lq      mean   median       uq     max neval
##     R 20495.2 20913.95 21430.818 21103.85 21466.55 25317.7   100
##  Arma   329.8   394.30   484.931   489.90   542.30  1835.2   100

The performance gains are substantial! It might be tempting to conclude that the Armadillo library is just so much better than R’s default linear algebra routines, but we need to understand how this performance gain was achieved. To do so, consider the following comparison:

microbenchmark(R = A %*% (B %*% y), 
               Arma = MMv_arma(A, B, y))
## Unit: microseconds
##  expr    min      lq     mean  median     uq    max neval
##     R 1196.2 1213.45 1231.914 1219.30 1228.1 2045.4   100
##  Arma  318.9  328.55  335.958  332.25  335.6  638.1   100

It seems that simply adding a pair of brackets made our R code much more efficient, and closer to Armadillo’s performance! The reason is simply that A %*% B %*% y computes the matrix-matrix product first, which is an \(O(d^3)\) operation, and then multiplies the resulting matrix by the vector y, which is an \(O(d^2)\) operation. Instead, A %*% (B %*% y) computes two matrix-vector multiplications, thus avoiding the \(O(d^3)\) computation. The question is whether the speed-up we are getting from Armadillo is due to the fact that this library automatically computes the matrix-matrix-vector product using this more efficient order of operations. To verify this, we write the following:

sourceCpp(code = '
 // [[Rcpp::depends(RcppArmadillo)]]
 #include <RcppArmadillo.h>
 using namespace Rcpp;

 // [[Rcpp::export(name = "MMv_arma_slow")]]
 arma::vec MMv_arma_I(arma::mat& A, arma::mat& B, arma::vec& y) {
   arma::mat C = A * B;
   return C * y;
}')

And we compare it with the inefficient R implementation:

microbenchmark(R = A %*% B %*% y,
               Arma_slow = MMv_arma_slow(A, B, y))
## Unit: milliseconds
##       expr     min       lq     mean   median      uq     max neval
##          R 18.8147 19.00590 20.74272 20.96715 21.2787 25.9407   100
##  Arma_slow 16.7753 16.90205 18.10158 18.79905 19.0159 20.4416   100

Now the performance is comparable, hence the efficiency of the Armadillo implementation is mostly due to the order in which the computations are performed. The fact that we can achieve similar performance using base R and a pair of (well-placed) brackets might be disappointing, but it is remarkable that Armadillo is able to automatically determine the order of operations so as to optimize efficiency. In reality, Armadillo does more than just reordering operations: it can combine several operations into one and reduce the need for temporary objects, via delayed evaluation. This can result in faster computation and lower memory usage, relative to naive evaluation. Delayed evaluation is achieved via template meta-programming, which is beyond the scope of this course. From a user’s point of view, what matters is that template meta-programming allows Armadillo to reason about linear algebra expressions at compile time, with the aim of producing code that is tailored to each mathematical expression. See Armadillo’s website for more details.

Before describing some of the main functions and structures provided by Armadillo and RcppArmadillo, we cover another basic example in the next section.

Motivating example 2: sums of matrices

In the previous section we considered the R code A %*% B %*% y and we showed how it can be made much more efficient simply by writing it as A %*% (B %*% y). Here we consider a simple linear algebra example where improving the efficiency of a base R implementation is not as straightforward. In particular, consider calculating \({\bf A} + {\bf B} + {\bf C} + {\bf D} + {\bf E}\), where \({\bf A}, \dots, {\bf E}\) are \(d \times d\) matrices (the shape of the matrices is not important here). A simple Armadillo function for computing such a sum is:

sourceCpp(code = '
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export(name = "mSum_arma")]]
mat mSum_i(mat& A, mat& B, mat& C, mat& D, mat& E) {
  return A + B + C + D + E;
}')

Let us compare it with plain R code:

A <- B <- C <- D <- E <- matrix(rnorm(1e6), 1e3, 1e3)

microbenchmark(R = Z <- A + B + C + D + E,
               arma = Z <- mSum_arma(A, B, C, D, E))
## Unit: milliseconds
##  expr    min      lq      mean  median       uq     max neval
##     R 5.2717 7.40915 13.692592 8.93920 10.76755 58.4846   100
##  arma 2.2029 2.55870  5.912126 4.49205  6.34125 52.4831   100

As we might have expected, the Armadillo version of the sum is faster. How was this better performance achieved? Note that, when evaluating A + B + C + D + E, R will compute A + B and store the result in a temporary object, which we call ApB. Then it will compute ApB + C and store the temporary result in another matrix, and so on. Hence, the four matrix sums will result in the allocation of four matrices: three of them are temporary and will be discarded, while the last one will be stored as Z. An efficient implementation would avoid allocating memory for the three temporary matrices, and would instead allocate memory only for Z and then do:

Z <- matrix(0, d, d) # allocate the result once; no temporary matrices
for(i in 1:d)
  for(j in 1:d)
    Z[i, j] <- A[i, j] + B[i, j] + C[i, j] + D[i, j] + E[i, j]  

but performing only a few floating point operations per iteration of a nested loop is of course hopelessly inefficient in R. The question is whether Armadillo has been able to work out that this is the best way of framing the computation. To investigate whether the performance of Armadillo can be attributed to such compile-time optimization, we write a version of mSum_arma which explicitly creates temporary matrices:

sourceCpp(code = '
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export(name = "mSum_arma_slow")]]
mat mSum_i(mat& A, mat& B, mat& C, mat& D, mat& E) {
  mat T1 = A + B;
  mat T2 = T1 + C;
  mat T3 = T2 + D;
  return T3 + E;
}')

We then add this version to the comparison:

microbenchmark(R = Z <- A + B + C + D + E,
               arma = Z <- mSum_arma(A, B, C, D, E), 
               arma_slow = Z <- mSum_arma_slow(A, B, C, D, E))
## Unit: milliseconds
##       expr    min      lq      mean  median       uq     max neval
##          R 5.0734 6.50125 10.415961 7.21800  8.45595 55.6120   100
##       arma 1.8495 2.31500  6.775004 2.61480  4.19805 51.6637   100
##  arma_slow 5.2380 6.61635 11.047487 8.88785 12.71660 61.1350   100

The performance of the second Armadillo version is close to that of the R code, hence we can conclude that the efficiency of the first Armadillo version is attributable to the fact that it avoids creating temporary matrices.

To conclude this example, consider its extension to the problem of computing \({\bf A} + {\bf B} + \cdots\), where the number of matrices involved in the sum is arbitrary (that is, not fixed to five as above). If the matrices are contained in a list, for example:

nmat <- 10
mats <- lapply(1:nmat, function(.nouse) matrix(rnorm(1e6), 1e3, 1e3))

then one compact way of calculating their sum in R is Reduce("+", mats). Can we modify our RcppArmadillo code to handle any number of matrices? One way of doing this, so that the compile-time efficiency tricks of Armadillo are exploited, is based on some text editing. In particular, consider the following function:

getMatSumFun <- function(n){
  args <- paste0("mat& ", paste0("A", 1:n), ",", collapse = ' ')
  args <- substr(args, 1, nchar(args)-1)

  ret <- paste0(paste0("A", 1:n), " +", collapse = ' ')
  ret <- substr(ret, 1, nchar(ret)-2)

  funBody <- paste0("
   // [[Rcpp::depends(RcppArmadillo)]]
   #include <RcppArmadillo.h>
   using namespace arma;

   // [[Rcpp::export(name = \"mSum_arma_list\")]]
   mat mSum_i(", args, "){
   return ", ret, ";
  }")

  return(funBody)
}

Given n, this function creates the text for an RcppArmadillo function which sums up n matrices, in particular:

armaCode <- getMatSumFun(nmat)
cat(armaCode)
## 
##    // [[Rcpp::depends(RcppArmadillo)]]
##    #include <RcppArmadillo.h>
##    using namespace arma;
## 
##    // [[Rcpp::export(name = "mSum_arma_list")]]
##    mat mSum_i(mat& A1, mat& A2, mat& A3, mat& A4, mat& A5, mat& A6, mat& A7, mat& A8, mat& A9, mat& A10){
##    return A1 + A2 + A3 + A4 + A5 + A6 + A7 + A8 + A9 + A10;
##   }

Writing down the sum explicitly guarantees that Armadillo will avoid creating temporary matrices. We can now compare the R and Armadillo solutions:

sourceCpp(code = armaCode)

microbenchmark(R = Reduce("+", mats), arma = do.call("mSum_arma_list", mats))
## Unit: milliseconds
##  expr     min       lq     mean   median       uq     max neval
##     R 15.7547 16.34720 20.58238 17.55995 19.65135 69.2895   100
##  arma  5.9786  6.42275  7.65936  6.77025  8.78680 11.4526   100
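Code generation is not the only way of handling an arbitrary number of matrices. A simpler alternative is to pass the matrices in an Rcpp::List and accumulate them in place via +=, which avoids temporary matrices at the cost of copying each list element when converting it with Rcpp::as. The following is a minimal sketch (mSum_arma_list2 and mSum_list_i are our own hypothetical names, not part of the code above):

sourceCpp(code = '
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export(name = "mSum_arma_list2")]]
mat mSum_list_i(Rcpp::List M) {
  mat Z = Rcpp::as<mat>(M[0]);   // initialise Z with (a copy of) the first matrix
  for(int k = 1; k < M.size(); k++) {
    Z += Rcpp::as<mat>(M[k]);    // in-place accumulation: no temporary matrices
  }
  return Z;
}')

This version would be called directly as mSum_arma_list2(mats), with no need for do.call.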

Basic RcppArmadillo usage

Here we describe some commonly used features of RcppArmadillo and Armadillo. The Armadillo library is quite extensive, hence we refer to its documentation for more details. The most commonly used object classes are:

  1. Mat<type>, the class of dense matrices, where type can be, e.g., double or int; mat is a shorthand for Mat<double>;
  2. Col<type>, the class of column vectors, with vec and colvec being shorthands for Col<double>;
  3. Row<type>, the class of row vectors, with rowvec a shorthand for Row<double>.

Armadillo also provides 3D matrices, sparse matrices and fields (matrices whose elements can be arbitrary objects, such as matrices), which we will not use here. In order to use an Armadillo object class effectively, we need to know (or look up in the documentation):

  1. the constructor types available for that class. For example, a matrix of doubles A could be constructed using mat A(10, 5) or mat A(10, 5, fill::zeros), the difference being that in the first case the entries of the \(10 \times 5\) matrix A are arbitrary, while in the second they are set to 0.
  2. The member functions and variables of that class. These are accessed using object.member_function() or object.variable. For example, if y is a Col<float> vector, then y.n_elem gives its length, while y.ones() sets all its elements to 1. The docs clearly explain which member functions are available for each object class.
  3. How the elements of the objects can be accessed. This is documented as part of the “Member Functions & Variables” section of the Armadillo docs. Element \(ij\) of a matrix A of class Mat can be accessed via A(i-1, j-1), as indexing starts at zero. It can also be accessed via A.at(i-1, j-1), which is faster but more dangerous, as no bounds check is performed (hence we could read/write outside the memory allocated to A).
  4. The available operators and functions. Operators are overloaded, so if, for example, A and B are matrices while s is a scalar, then A * B will calculate the standard matrix product, while A * s will multiply each element of A by s. Several element-wise functions are provided (e.g., exp(A), sqrt(A), …), as well as standard decompositions (e.g., Cholesky, LU, …) and solvers for linear systems. These points are illustrated by the short sketch below.
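The following minimal sketch (armaBasics and basics_i are our own hypothetical names) touches on all four points: a constructor with a fill argument, member functions, zero-based element access and overloaded operators:

sourceCpp(code = '
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export(name = "armaBasics")]]
vec basics_i(int d) {
  mat A(d, d, fill::zeros);  // constructor: d x d matrix with entries set to 0
  A.eye();                   // member function: set A to the identity matrix
  vec y(d, fill::ones);      // column vector of ones
  y(0) = 2.0;                // element access, indexing starts at zero
  return sqrt(A) * y;        // element-wise sqrt, then matrix-vector product
}')

armaBasics(3)
##      [,1]
## [1,]    2
## [2,]    1
## [3,]    1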

Some guidance on how standard linear algebra operations in R can be translated to RcppArmadillo is provided here. Before moving to a more involved example, we clarify the relation between Armadillo objects and Rcpp matrices and vectors. Conversion from an Armadillo object to an Rcpp object is performed using Rcpp::wrap, for example:

sourceCpp(code = '
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export(name = "getIntMat")]]
Mat<int> tmpFun_i(int nr, int nc) {
  Mat<int> MA(nr, nc);
  Rcpp::IntegerMatrix MR = Rcpp::wrap(MA);
  MR(0, 0) = 1;
  return MA;
}')

getIntMat(3, 3)
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    0    0    0

Here we are creating a \(3 \times 3\) un-initialized integer matrix in Armadillo and converting it to an Rcpp::IntegerMatrix using wrap. It is interesting to note that here Rcpp::wrap is copying MA, as demonstrated by the fact that changing the top-left element of MR does not change MA. The following version avoids any copy:

sourceCpp(code = '
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export(name = "getIntMat_noCopy")]]
Rcpp::IntegerMatrix tmpFun_i(int nr, int nc) {
  Rcpp::IntegerMatrix MR(nr, nc);
  Mat<int> MA(MR.begin(), MR.nrow(), MR.ncol(), false);
  MA(0, 0) = 1;
  return MR;
}')

getIntMat_noCopy(3, 3)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    0    0
## [3,]    0    0    0

Let’s see how this worked step by step:

  1. Rcpp::IntegerMatrix MR(nr, nc) allocates an \(nr \times nc\) integer matrix using R memory;
  2. Mat<int> MA(MR.begin(), MR.nrow(), MR.ncol(), false) uses one of Armadillo’s advanced constructors, which builds MA around the memory pointed to by MR.begin(); the last argument (copy_aux_mem = false) specifies that this memory should be reused rather than copied, hence MA and MR share the same memory;
  3. for this reason, setting MA(0, 0) = 1 modifies MR as well;
  4. we return MR, an Rcpp object, so that no further copy is taken by Rcpp::wrap in the wrapper generated by sourceCpp.

In summary, getIntMat allocates memory outside R via Mat<int> MA(nr, nc), then takes one copy when we do Rcpp::wrap(MA) and another copy when Rcpp::wrap is called within the wrapper written by sourceCpp. In contrast, getIntMat_noCopy allocates R memory via Rcpp::IntegerMatrix MR(nr, nc) and does not take any copy. This makes the second version much more efficient, in fact:

d <- 1e3
microbenchmark(copy = getIntMat(d, d), noCopy = getIntMat_noCopy(d, d))
## Unit: microseconds
##    expr    min      lq     mean  median      uq    max neval
##    copy 1314.8 1977.30 3622.988 3599.15 4959.35 7240.9   100
##  noCopy  170.6  194.35  965.220  359.50 1505.05 3680.7   100

The bottom line is that, when we want to manipulate large matrices or vectors in RcppArmadillo, it is often more efficient to:

  1. allocate memory in R and then pass the objects by reference (e.g., via mat&, vec&, …) to the Rcpp function, or allocate R memory directly in Rcpp via one of its data structures (e.g., NumericMatrix, NumericVector, …);
  2. wrap the R objects into Armadillo objects using an advanced constructor, such as the one we have seen above;
  3. manipulate the Armadillo objects using the operators and functions provided by Armadillo;
  4. return an Rcpp data structure to avoid copies being made by Rcpp::wrap (see the sketch below).
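As a concrete illustration of these four steps, here is a minimal sketch (scaleCols and scaleCols_i are our own hypothetical names) which scales the columns of a matrix by the elements of a vector:

sourceCpp(code = '
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export(name = "scaleCols")]]
Rcpp::NumericMatrix scaleCols_i(Rcpp::NumericMatrix X, Rcpp::NumericVector w) {
  // step 2: wrap the R memory in Armadillo objects, without copying
  mat Xa(X.begin(), X.nrow(), X.ncol(), false);
  vec wa(w.begin(), w.size(), false);
  // step 3: manipulate via Armadillo (multiply column j of X by w[j])
  Xa.each_row() %= wa.t();
  // step 4: return the Rcpp object, so that Rcpp::wrap takes no copy
  return X;
}')

scaleCols(matrix(1, 3, 3), c(1, 2, 3))
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    1    2    3
## [3,]    1    2    3

Here step 1 corresponds to allocating X and w in R. Note that, because the advanced constructor reuses the R memory, the matrix passed from R is modified in place: a side effect to be aware of (it occurs only when a numeric matrix is passed, since passing, e.g., an integer matrix triggers a coercion copy).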

The advanced constructor provides a way of converting Rcpp objects into Armadillo objects without taking a copy. Often taking a copy is acceptable, in which case it is more convenient to perform the conversion using the Rcpp::as<type>() function, where the template argument type can be an Armadillo object class such as Mat<int>, Row<double> and so on. An example is:

sourceCpp(code = '
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export(name = "matVettArma")]]
Col<double> tmpFun_i(Rcpp::NumericMatrix X, Rcpp::NumericVector y) {
  Mat<double> Xa = Rcpp::as<Mat<double>>(X);
  Col<double> ya = Rcpp::as<Col<double>>(y);
  return Xa*ya;
}')

matVettArma(matrix(1, 3, 3), 1:3)
##      [,1]
## [1,]    6
## [2,]    6
## [3,]    6

which performs matrix-vector multiplication in Armadillo. Note that Rcpp::as generally takes a copy of the object being converted.
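As a quick check of this copying behaviour, consider the following sketch (asCopies is our own hypothetical name), where modifying the Armadillo copy leaves the original R vector untouched:

sourceCpp(code = '
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export(name = "asCopies")]]
Rcpp::NumericVector tmpFun_i(Rcpp::NumericVector y) {
  vec ya = Rcpp::as<vec>(y);  // Rcpp::as copies the content of y into ya
  ya(0) = 100;                // modifying the copy ...
  return y;                   // ... leaves the original R vector unchanged
}')

asCopies(c(1, 2, 3))
## [1] 1 2 3

If the copy is too expensive, the advanced constructors described above are the way to avoid it.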
