A first look at iotools: Importing data

Scalable Data Processing in R

Simon Urbanek

Member of R-Core, Lead Inventive Scientist, AT&T Labs Research

Chunk-wise processing

  1. Load pieces of data
  2. Convert them into native objects
  3. Perform computation and store the results

Repeat steps 1 to 3 until all the data has been processed.
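
The loop above can be sketched in base R. This is only an illustration of the pattern, not the iotools approach itself; the temporary file, the chunk size of 4 lines, and the running sum are all assumptions chosen for the example.

```r
# Hypothetical input: a temporary file holding the numbers 1 to 10, one per line
path <- tempfile(fileext = ".txt")
writeLines(as.character(1:10), path)

con <- file(path, "r")
total <- 0
repeat {
  lines <- readLines(con, n = 4)   # 1. load a piece of the data
  if (length(lines) == 0) break    # stop once the input is exhausted
  values <- as.numeric(lines)      # 2. convert it into a native R object
  total <- total + sum(values)     # 3. compute and store the result
}
close(con)

total
```

Only one chunk of lines is held in memory at a time; the running total is the only state carried from one iteration to the next.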

Importing data

  • Loading data often takes longer than processing it, and it happens in two steps:
    • Retrieving the data from disk, which is a relatively slow operation
    • Converting the raw data into native R objects
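
To make the two steps concrete, here is a minimal base R sketch that performs them separately. The file name and its contents are hypothetical, and `readBin()` and `read.csv()` are used only to illustrate the split; they are not the iotools functions.

```r
# Hypothetical sample file, created so the sketch is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,2.5", "2,3.5"), path)

# Step 1: retrieve the bytes from disk as a raw vector
raw_data <- readBin(path, what = "raw", n = file.size(path))

# Step 2: convert the raw bytes into a native R object (a data frame)
df <- read.csv(text = rawToChar(raw_data))

str(df)
```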

Importing data using iotools

In the iotools package, the physical loading of data and parsing of input into R objects are separated for better flexibility and performance.

iotools: Importing data

  • readAsRaw() reads the entire input into a raw vector
  • read.chunk() reads the input one chunk at a time, each into a raw vector
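
A short sketch of both functions, assuming the iotools package is installed. The sample file is hypothetical, and `chunk.reader()` is used here to wrap the connection before calling `read.chunk()`.

```r
library(iotools)

# Hypothetical sample file, created so the sketch is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("1,2.5,a", "2,3.5,b", "3,4.5,c"), path)

# readAsRaw(): the whole file as a single raw vector
whole <- readAsRaw(path)
class(whole)

# read.chunk(): one chunk per call, read through a chunk reader
fc <- file(path, "rb")
reader <- chunk.reader(fc)
chunk <- read.chunk(reader)   # default max.size; this small file fits in one chunk
close(fc)
class(chunk)
```

In both cases the result is still unparsed bytes; no R data structure has been built yet.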

iotools: Parsing data

  • mstrsplit() converts raw data into a matrix
  • dstrsplit() converts raw data into a data frame
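
As an illustration, the same raw vector can be parsed either way. The data, the comma separator, and the column types below are assumptions made for this sketch.

```r
library(iotools)

# Hypothetical raw input: three comma-separated records
raw_data <- charToRaw("1,2.5\n2,3.5\n3,4.5\n")

# mstrsplit(): every column shares one type, so the result is a matrix
m <- mstrsplit(raw_data, sep = ",", type = "numeric")
dim(m)

# dstrsplit(): each column gets its own type, so the result is a data frame
df <- dstrsplit(raw_data, col_types = c("integer", "numeric"), sep = ",")
str(df)
```

A matrix suits data where all columns have the same type; a data frame is needed when column types differ.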

iotools: Loading and parsing data

read.delim.raw() = readAsRaw() + dstrsplit()
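
The equivalence can be checked with a small sketch. The tab-separated sample file is hypothetical, and in the two-step version the column types and the skipped header line are supplied by hand, which is an assumption about this particular file.

```r
library(iotools)

# Hypothetical tab-separated file with a header line
path <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "1\t2.5", "2\t3.5"), path)

# One step: load and parse together
one_step <- read.delim.raw(path, header = TRUE, sep = "\t")

# Two steps: load the bytes, then parse them, skipping the header line
raw_data <- readAsRaw(path)
two_step <- dstrsplit(raw_data, col_types = c("integer", "numeric"),
                      sep = "\t", skip = 1L)
names(two_step) <- c("id", "value")

# Compare the parsed values column by column
all.equal(one_step[[2]], two_step[[2]])
```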

Chunk-wise processing

  • Not necessary to import all the data at once
  • Read a "chunk" of rows at a time from the data source
  • No intermediate structure

File connections

# Open a file connection
fc <- file("data-file.csv", "rb")
# Read the first line if the data has a header
readLines(fc, n = 1)
# ...
# Code to import and parse the data
# ...
# Close the file connection
close(fc)
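
One way the import-and-parse step might look, as a sketch rather than the course's own code: the sample file, the column types, and the running sum are assumptions, and `chunk.reader()`, `read.chunk()`, and `dstrsplit()` come from iotools.

```r
library(iotools)

# Hypothetical sample file with a header line
path <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,2.5", "2,3.5", "3,4.5"), path)

# Open a file connection in binary mode
fc <- file(path, "rb")
# Read the header line so it is not parsed as data
header <- readLines(fc, n = 1)

reader <- chunk.reader(fc)
total <- 0
repeat {
  chunk <- read.chunk(reader)          # load the next chunk as a raw vector
  if (length(chunk) == 0) break        # an empty chunk marks the end of the input
  df <- dstrsplit(chunk, col_types = c("integer", "numeric"), sep = ",")
  total <- total + sum(df[[2]])        # process the chunk and keep only the result
}

# Close the file connection
close(fc)

total
```

Reading the header with `readLines()` first advances the connection, so every chunk that follows contains data rows only.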

Let's practice!
