The bigmemory package offers a set of tools for manipulating larger-than-memory objects within R. It has some basic functions but is certainly not comprehensive. Eigen is a highly efficient C++ library for numerical linear algebra and can be interfaced to R through RcppEigen by Douglas Bates and Dirk Eddelbuettel. If bigmemory and Eigen can be linked, then one would be able to do highly efficient linear algebra computation on data that is too big for memory (exactly what you thought R couldn’t do).
Since bigmemory works with pointers to C++ objects, it’s natural to link bigmemory objects to Eigen matrix objects. I’m not going to go too much into the details of this from the bigmemory/Rcpp side of things, as it’s well explained here.
In this post I’ll create a colSums() function and a crossprod() function for big.matrix objects. All of the code posted below can be found in my rfunctions R package on github. big.matrix objects can have one of four types (1, 2, 4, 8), corresponding to (char, short, int, double), so we need to define extra Eigen matrix types for char and short (MatrixXi/VectorXi for ints and MatrixXd/VectorXd for doubles are already defined).
The next step is “reading” a big.matrix object from R into C++ and getting its data type. I’ll refer to the resulting pointer object as xpMat.
Then, to associate the data from xpMat with an Eigen matrix object, we use Eigen’s Map functionality to map the big.matrix data into an Eigen object without copying it (and hence without loading it into memory). For data of type double, the mapped object is a Map over MatrixXd.
Call the new Eigen object bM; it points to the big.matrix data located on disk. Now we are basically done. Performing the column-wise sum in Eigen is straightforward.
Putting it all together gives the colsums_big function that is called from R.
If we want to make a crossprod function for big.matrix objects (i.e. computing $X^TX$), the same mapping approach applies.
Now let’s run a big example to demonstrate the performance. The R function that calls colsums_big is called big.colSums(), and the corresponding crossprod function is called big.crossprod(). If we have a big.matrix object big_mat, then the data can be loaded into memory as an ordinary matrix with big_mat[,], so we can compare with the standard R functions colSums and crossprod.
Memory usage is also, unsurprisingly, much lower when we don’t load the big.matrix object into memory.
In a following post I’ll investigate fitting linear models via Eigen and bigmemory big.matrix objects and see how the speed compares with the biglm package.