A first look at iotools: Importing data

Scalable Data Processing in R

Simon Urbanek

Member of R-Core, Lead Inventive Scientist, AT&T Labs Research

Chunk-wise processing

  1. Load pieces of data
  2. Convert them into native objects
  3. Perform computation and store the results

Repeat 1 to 3 until all data is processed
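The three steps above can be sketched in base R with an ordinary file connection. This is an illustrative example (the file name, chunk size, and the column-sum computation are made up for the demo): it sums one numeric column of a CSV without ever holding the whole file in memory.

```r
# Chunk-wise processing in base R: sum a column of a CSV file
# 1000 lines at a time, never loading the whole file at once.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10000), path, row.names = FALSE)

con <- file(path, "r")
invisible(readLines(con, n = 1))       # skip the header line
total <- 0
repeat {
  lines <- readLines(con, n = 1000)    # 1. load a piece of data
  if (length(lines) == 0) break        #    stop when the file is exhausted
  x <- as.numeric(lines)               # 2. convert to native R objects
  total <- total + sum(x)              # 3. compute and store the result
}
close(con)
total
```

Each pass through the loop touches only 1000 rows, so peak memory use stays roughly constant no matter how large the file is.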


Importing data

  • Loading data often takes more time than processing it, and it happens in two steps:
    • Retrieving the raw data from disk, which is a relatively slow operation
    • Converting the raw data into native R objects

Importing data using iotools

In the iotools package, the physical loading of data and parsing of input into R objects are separated for better flexibility and performance.


iotools: Importing data

  • readAsRaw() reads an entire file into a single raw vector
  • read.chunk() reads the next chunk of the data into a raw vector
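A small sketch of both loading functions, assuming the iotools package is installed. Note that read.chunk() operates on a chunk reader created with chunk.reader(), a helper not mentioned above; the tiny CSV file is made up for the demo.

```r
library(iotools)

# A small throwaway file to load
path <- tempfile(fileext = ".csv")
writeLines(c("1,2", "3,4", "5,6"), path)

# Option 1: read the entire file into one raw vector
r <- readAsRaw(path)

# Option 2: read the file chunk by chunk via a chunk reader
cr <- chunk.reader(file(path, "rb"))
chunk <- read.chunk(cr)   # a raw vector containing complete lines
```

Either way the result is raw bytes, not an R data structure; parsing is a separate step handled by mstrsplit() or dstrsplit().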

iotools: Parsing data

  • mstrsplit() converts raw data into a matrix
  • dstrsplit() converts raw data into a data frame
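A minimal sketch of the two parsers, assuming iotools is installed; the raw input and the sep/type/col_types arguments shown are illustrative choices for a comma-separated example.

```r
library(iotools)

# Raw bytes as they might come from readAsRaw() or read.chunk()
raw_data <- charToRaw("1,2,3\n4,5,6\n")

# mstrsplit(): raw data -> matrix (one type for all columns)
m <- mstrsplit(raw_data, sep = ",", type = "numeric")

# dstrsplit(): raw data -> data frame (one type per column)
d <- dstrsplit(raw_data, col_types = c("numeric", "numeric", "numeric"),
               sep = ",")
```

mstrsplit() is the faster of the two when every column shares a type; dstrsplit() pays a little more to give each column its own type.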

iotools: Loading and parsing data

read.delim.raw() = readAsRaw() + dstrsplit()
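A short usage sketch, assuming iotools is installed; the file contents and the sep argument are made up for a comma-separated example.

```r
library(iotools)

# A small throwaway CSV with a header row
path <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,x", "2,y"), path)

# read.delim.raw() loads the file as raw bytes and parses it
# into a data frame in one call
df <- read.delim.raw(path, sep = ",", header = TRUE)
```

This is the convenient path when the whole file fits in memory; for larger inputs, the chunk-wise functions below avoid loading everything at once.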


Chunk-wise processing

  • It is not necessary to import all of the data at once
  • Read a "chunk" of rows at a time from the data source
  • No large intermediate structure is held in memory
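Putting the pieces together, a chunk-wise loop might look like the following sketch (assuming iotools is installed; the file of numbers, the use of chunk.reader(), and the running-sum computation are illustrative).

```r
library(iotools)

# A throwaway file with one number per line
path <- tempfile(fileext = ".csv")
writeLines(as.character(1:100), path)

cr <- chunk.reader(file(path, "rb"))
total <- 0
repeat {
  chunk <- read.chunk(cr)              # load a chunk as raw bytes
  if (length(chunk) == 0) break        # stop at end of input
  m <- mstrsplit(chunk, type = "numeric")  # parse into a numeric matrix
  total <- total + sum(m)              # compute and accumulate
}
total
```

Only one chunk's worth of rows is ever materialized as an R object, which is what keeps the approach scalable.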

File connections

# Open a file connection
fc <- file("data-file.csv", "rb")
# Read the first line if the data has a header
readLines(fc, n = 1)
# ... code to import and parse the data ...
# Close the file connection
close(fc)
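One way the middle of that template might be filled in, shown as a hedged sketch (assumes iotools is installed; the file, its two columns, the chunk.reader() helper, and the row-counting computation are all illustrative):

```r
library(iotools)

# A throwaway two-column CSV with a header
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b", "3,c"), path)

fc <- file(path, "rb")
header <- readLines(fc, n = 1)         # consume the header line
cr <- chunk.reader(fc)                 # read remaining lines in chunks
rows <- 0
repeat {
  chunk <- read.chunk(cr)
  if (length(chunk) == 0) break
  d <- dstrsplit(chunk, col_types = c("numeric", "character"), sep = ",")
  rows <- rows + nrow(d)               # process each chunk's data frame
}
close(fc)
```

Reading the header first matters: otherwise the header line would be parsed as data by dstrsplit().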

Let's practice!
