The bigmemory package offers a set of tools for R that allow manipulation of larger-than-memory objects within R. It has some basic functions but is certainly not comprehensive. Eigen is a highly efficient C++ library for numerical linear algebra and can be interfaced to R through RcppEigen by Douglas Bates and Dirk Eddelbuettel. If bigmemory and Eigen can be linked, then one can do highly efficient linear algebra on data that is too big for memory (exactly what you thought R couldn’t do).

Since bigmemory works with pointers to C++ objects, it’s natural to link bigmemory objects to Eigen matrix objects. I’m not going to go too much into the details of this from the bigmemory/Rcpp side of things, as it’s well explained here.

In this post I’ll create a `colSums()` function and a `crossprod()` function for `big.matrix` objects. All of the code posted below can be found in my rfunctions R package on GitHub. `big.matrix` objects can have one of 4 types (1, 2, 4, 8), corresponding to (char, short, int, double), so we need to define extra Eigen matrix types like the following (`MatrixXi`/`VectorXi` for ints and `MatrixXd`/`VectorXd` for doubles are already defined):

Then “reading” in a `big.matrix` object from R to C++ and getting its data type looks like the following:
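The type code is only known at run time, so typed C++ code has to be selected by branching on it. Below is a minimal standalone sketch of that dispatch in plain C++ (no Rcpp, bigmemory, or Eigen headers), assuming the codes 1, 2, 4, 8 listed above. The names `sum_typed` and `sum_as_double` are hypothetical, and the raw `void*` merely stands in for the data pointer that a `big.matrix` exposes; this is not the listing from the rfunctions package.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Typed worker: reinterpret the untyped buffer as T and accumulate in double.
// (Hypothetical helper, for illustration only.)
template <typename T>
double sum_typed(const void* data, std::size_t n) {
    const T* p = static_cast<const T*>(data);
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += static_cast<double>(p[i]);
    }
    return total;
}

// Branch once on the run-time type code, then run fully typed code.
// Codes assumed from the text: 1 = char, 2 = short, 4 = int, 8 = double.
double sum_as_double(const void* data, std::size_t n, int type_code) {
    assert(data != nullptr || n == 0);
    switch (type_code) {
        case 1: return sum_typed<char>(data, n);
        case 2: return sum_typed<short>(data, n);
        case 4: return sum_typed<int>(data, n);
        case 8: return sum_typed<double>(data, n);
        default: throw std::invalid_argument("unknown big.matrix type code");
    }
}
```

The same switch-then-template pattern is what lets a single exported function serve all four storage types: the branch happens once per call, and the inner loop stays typed.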

Then, in order to associate the data from `xpMat` with an Eigen matrix object, we use Eigen’s `Map` functionality to map the big.matrix data into an Eigen object (without copying it, and hence without loading it into memory). For data with the double type, this looks like:
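To make the “no copy” point concrete, here is a small sketch of what a map amounts to: an object that stores a pointer and the dimensions, and translates `(i, j)` into an offset. It is written in plain C++ so it compiles on its own; `ColMajorView` is a hypothetical stand-in, not an Eigen class, and it assumes column-major storage (Eigen’s default layout, and the layout R matrices use). With Eigen itself, the corresponding object would be an `Eigen::Map<Eigen::MatrixXd>` constructed from a `double*`, a row count, and a column count.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning view over a column-major buffer.
// It holds a pointer plus dimensions and never copies or frees the data.
template <typename T>
class ColMajorView {
public:
    ColMajorView(T* data, std::size_t nrow, std::size_t ncol)
        : data_(data), nrow_(nrow), ncol_(ncol) {
        assert(data != nullptr || nrow * ncol == 0);
    }

    std::size_t rows() const { return nrow_; }
    std::size_t cols() const { return ncol_; }

    // Column-major indexing: element (i, j) lives at offset i + j * nrow.
    T& operator()(std::size_t i, std::size_t j) const {
        assert(i < nrow_ && j < ncol_);
        return data_[i + j * nrow_];
    }

private:
    T* data_;
    std::size_t nrow_;
    std::size_t ncol_;
};
```

Because the view only aliases the buffer, writing through it changes the underlying data, and for a file-backed buffer only the pages that are actually touched need to be brought into memory.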

where `bM` is the new Eigen object pointing to the big.matrix data located on disk. Now we are basically done. Performing the column-wise sum in Eigen is straightforward:
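In Eigen, a column-wise reduction can be written as `colwise().sum()` on the mapped object. As a hedged illustration of what that computes, the sketch below does the same reduction by hand over a column-major `double` buffer, in plain C++ so that it stands alone. `col_sums` is a hypothetical name, and the raw pointer and dimensions stand in for the mapped `big.matrix` data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column sums of an nrow x ncol column-major matrix.
// Each column is a contiguous run of nrow values, so the buffer is read
// front to back exactly once.
std::vector<double> col_sums(const double* data, std::size_t nrow, std::size_t ncol) {
    assert(data != nullptr || nrow * ncol == 0);
    std::vector<double> sums(ncol, 0.0);
    for (std::size_t j = 0; j < ncol; ++j) {
        const double* col = data + j * nrow;  // start of column j
        double s = 0.0;
        for (std::size_t i = 0; i < nrow; ++i) {
            s += col[i];
        }
        sums[j] = s;
    }
    return sums;
}
```

The sequential access pattern is the reason this operation suits file-backed data: only the result vector of length `ncol` has to live in memory for the whole computation.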

Putting it all together:

If we want to make a `crossprod` function for `big.matrix` objects (i.e. computing $X^TX$), then we would do this with the following:
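Since $X^TX$ is symmetric, only one triangle needs to be computed. Eigen exposes this idea through `selfadjointView<Eigen::Upper>().rankUpdate(...)`. The sketch below spells out the same idea by hand in plain C++: entry $(i, j)$ is the dot product of columns $i$ and $j$, computed for $i \le j$ and mirrored. `cross_prod` is a hypothetical name, the input is assumed to be a column-major `double` buffer with `n` rows and `p` columns, and no claim is made that this matches the performance of Eigen’s routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns X^T X as a p x p column-major buffer, for an n x p column-major X.
std::vector<double> cross_prod(const double* x, std::size_t n, std::size_t p) {
    assert(x != nullptr || n * p == 0);
    std::vector<double> out(p * p, 0.0);
    for (std::size_t j = 0; j < p; ++j) {
        const double* xj = x + j * n;          // column j of X
        for (std::size_t i = 0; i <= j; ++i) {
            const double* xi = x + i * n;      // column i of X
            double dot = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                dot += xi[k] * xj[k];
            }
            out[i + j * p] = dot;              // upper triangle, entry (i, j)
            out[j + i * p] = dot;              // mirrored entry (j, i)
        }
    }
    return out;
}
```

For the 3 x 2 matrix with columns (1, 2, 3) and (4, 5, 6), this gives the 2 x 2 result with 14 and 77 on the diagonal and 32 off the diagonal. The output is only `p * p` values, so it fits in memory even when `n` is very large.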

Now let’s run a big example to demonstrate the performance. The R function which calls `colsums_big` is called `big.colSums()` and the corresponding crossprod function is called `big.crossprod()`. If we have a `big.matrix` object `big_mat`, then the data can be loaded into memory as a matrix as `big_mat[,]`, so we can compare with the standard R functions `colSums` and `crossprod`.

Memory usage is also much lower when we don’t load the `big.matrix` object into memory.

In a following post I’ll investigate fitting linear models via Eigen and bigmemory `big.matrix` objects and see how the speed compares with the biglm package.