The Bigmemory Suite of Packages

Scalable Data Processing in R

Michael Kane

Assistant Professor, Yale University

So far...

  • Import
  • Subset
  • Assign values to big.matrix objects
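
As a quick recap of these three steps, here is a minimal sketch. The file and its columns `a` to `d` are hypothetical stand-ins written to a temporary location, not the course data; only `read.big.matrix()` and standard bracket indexing from bigmemory are used.

```r
library(bigmemory)

# Hypothetical toy data written to a temporary CSV file (not the course data)
m <- matrix(1:20, nrow = 5, dimnames = list(NULL, c("a", "b", "c", "d")))
f <- tempfile(fileext = ".csv")
write.table(m, f, sep = ",", row.names = FALSE, col.names = TRUE, quote = FALSE)

# Import: read the file into a big.matrix
x <- read.big.matrix(f, sep = ",", header = TRUE, type = "integer")

# Subset: bracket indexing returns an ordinary R matrix
first_rows <- x[1:2, ]

# Assign: overwrite a single element in place
x[1, 1] <- 99L
```

Because a big.matrix is modified in place, the assignment changes `x` itself rather than a copy.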

Associated Packages

Tables and summaries
  • biganalytics
  • bigtabulate
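
To give a feel for the summary side, here is a small sketch using column summaries from biganalytics. The matrix `toy` and its columns `u` and `v` are made up for illustration.

```r
library(bigmemory)
library(biganalytics)

# Hypothetical toy data: two numeric columns
toy <- as.big.matrix(
  matrix(c(1, 2, 3, 4, 5, 6), ncol = 2, dimnames = list(NULL, c("u", "v")))
)

# Column-wise summaries computed on the big.matrix
means <- colmean(toy)
mins  <- colmin(toy)
maxs  <- colmax(toy)
```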

Associated Packages

Linear algebra
  • bigalgebra

Associated Packages

Fit Models
  • bigpca
  • bigFastlm
  • biglasso
  • bigrf

The FHFA's Mortgage Data Set

  • Mortgages that were held or securitized by both the Federal National Mortgage Association (Fannie Mae) and the Federal Home Loan Mortgage Corporation (Freddie Mac) from 2009-2015
  • The FHFA mortgage data is available online
  • We will focus on a random subset of 70,000 loans

1st example: using bigtabulate with bigmemory

library(bigtabulate)

# How many samples do we have per year?
bigtable(mort, "year")
 2008  2009  2010  2011  2012  2013  2014  2015 
 8468 11101  8836  7996 10935 10216  5714  6734
# Create nested tables
bigtable(mort, c("msa", "year"))
  2008 2009 2010 2011 2012 2013 2014 2015
0 1064 1343  998  851 1066 1005  504  564
1 7404 9758 7838 7145 9869 9211 5210 6170
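
The `mort` object above comes from the course data. To try the same calls without it, here is a sketch on simulated data. The object `toy_big` and its values are hypothetical; only the column names `msa` and `year` mirror the slide.

```r
library(bigmemory)
library(bigtabulate)

# Simulated stand-in for the mortgage data (values are random, not real)
set.seed(1)
n <- 1000
toy <- cbind(
  msa  = sample(0:1, n, replace = TRUE),
  year = sample(2008:2015, n, replace = TRUE)
)
toy_big <- as.big.matrix(toy, type = "integer")

# One-way table: number of rows per year
year_counts <- bigtable(toy_big, "year")

# Two-way (nested) table: msa by year
msa_by_year <- bigtable(toy_big, c("msa", "year"))
```

Both tables count every row exactly once, so each should sum to `n`.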

Let's practice!
