The Bigmemory Project

Scalable Data Processing in R

Michael Kane

Assistant Professor, Yale University

bigmemory

bigmemory is used to store, manipulate, and process big matrices that may be larger than a computer's RAM

big.matrix

  • Create
  • Retrieve
  • Subset
  • Summarize (see the sketch below)
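
A minimal sketch of these four operations, assuming bigmemory is installed; the dimensions and file names here are illustrative:

library(bigmemory)

# Create: a file-backed big.matrix
x <- big.matrix(nrow = 5, ncol = 2, type = "double", init = 0,
                backingfile = "example.bin",
                descriptorfile = "example.desc")

# Retrieve: pull the contents into an ordinary R matrix
x[,]

# Subset: standard bracket indexing
x[1:2, 1]

# Summarize: subsets are ordinary R vectors, so base functions apply
sum(x[, 1])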

What does "out-of-core" mean?

  • R objects are kept in RAM

  • When you run out of RAM

    • Things get moved to disk
    • Programs keep running (slowly) or crash

You are better off moving data to RAM only when the data are needed for processing.

When to use a big.matrix?

  • Data are at least 20% of the size of RAM
  • Data are represented as a dense matrix

An Overview of bigmemory

  • bigmemory implements the big.matrix data type, which is used to create, store, access, and manipulate matrices stored on disk

  • Data are kept on disk and moved to RAM implicitly, as they are needed

An Overview of bigmemory

A big.matrix object:

  • Only needs to be imported once (see the sketch below)
  • Is stored on disk in a "backing" file
  • Can be re-attached in later sessions via a "descriptor" file

An example using bigmemory

library(bigmemory)

# Create a new big.matrix object
x <- big.matrix(nrow = 1, ncol = 3, type = "double", init = 0,
                backingfile = "hello_big_matrix.bin",
                descriptorfile = "hello_big_matrix.desc")

backing and descriptor files

  • backing file: the binary representation of the matrix on disk
  • descriptor file: holds metadata such as the number of rows and columns, names, and the element type (inspected below)
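
A short sketch inspecting these files, assuming the big.matrix created above exists in the current working directory:

# Both files are created in the working directory
file.exists("hello_big_matrix.bin")   # TRUE
file.exists("hello_big_matrix.desc")  # TRUE

# The descriptor is plain text, so its metadata can be viewed directly
cat(readLines("hello_big_matrix.desc"), sep = "\n")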

An example using bigmemory

# See what's in it
x[,]
[1] 0 0 0
x
An object of class "big.matrix"
Slot "address":
<pointer: 0x108e2a9a0>
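
The "address" slot shows that x is only a thin handle to the on-disk data. A small sketch of examining that handle, using is.big.matrix() and describe() from bigmemory:

# x holds an external pointer to the mapped data, not the data itself
is.big.matrix(x)  # TRUE

# describe() recovers the metadata used to re-attach the matrix later
describe(x)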

Similarities with matrices

# Change the value in the first row and column
x[1, 1] <- 3
# Verify the change has been made
x[,]
[1] 3 0 0
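
A brief sketch of other matrix-style operations that also work on a big.matrix (dim(), nrow(), and ncol() have big.matrix methods):

# Dimension queries behave like base R matrices
dim(x)    # 1 3
nrow(x)   # 1
ncol(x)   # 3

# Row subsetting also works
x[1, ]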

Let's practice!
