Advanced file reading

Data Manipulation with data.table in R

Matt Dowle, Arun Srinivasan

Instructors, DataCamp

Reading big integers using integer64 type

  • R's native integer type is 32-bit, so it can only represent whole numbers up to 2^31 - 1 = 2147483647 in magnitude
  • fread() automatically reads larger integers as integer64 type, provided by the bit64 package
ans <- fread("id,name\n1234567890123,Jane\n5284782381811,John\n")
ans
           id name
1234567890123 Jane
5284782381811 John
class(ans$id)
"integer64"
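
The default can be overridden through fread()'s integer64 argument. The sketch below is a minimal illustration, not part of the original slides: it reuses the two-row input from above (stored in a hypothetical variable csv) and reads the id column as double or as character, neither of which requires the bit64 package.

```r
library(data.table)

# Largest value the native 32-bit integer type can hold
.Machine$integer.max   # 2147483647

# Same input as in the slide; `csv` is an illustrative name
csv <- "id,name\n1234567890123,Jane\n5284782381811,John\n"

# Read large ids as doubles (precision may be lost for very large values)
as_double <- fread(csv, integer64 = "double")

# Read large ids as text, which preserves every digit
as_text <- fread(csv, integer64 = "character")

class(as_double$id)    # "numeric"
class(as_text$id)      # "character"
```

Reading ids as character is often the safer choice when they are labels rather than quantities, since no arithmetic is expected on them.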

Specifying column class types with colClasses

str <- "x1,x2,x3,x4,x5\n1,2,1.5,true,cc\n3,4,2.5,false,ff"

ans <- fread(str, colClasses = c(x5 = "factor"))
str(ans)
Classes ‘data.table’ and 'data.frame':    2 obs. of  5 variables:
 $ x1: int  1 3
 $ x2: int  2 4
 $ x3: num  1.5 2.5
 $ x4: logi  TRUE FALSE
 $ x5: Factor w/ 2 levels "cc","ff": 1 2
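
A named vector only overrides the columns it mentions; every other column keeps the type fread() guesses. As a small sketch (the variable names csv and ans2 are illustrative, not from the slides), the example below overrides two columns and checks all classes at once with sapply():

```r
library(data.table)

# Same input as in the slide; `csv` avoids masking the str() function
csv <- "x1,x2,x3,x4,x5\n1,2,1.5,true,cc\n3,4,2.5,false,ff"

# Override only x1 and x5; x2, x3 and x4 are still auto-detected
ans2 <- fread(csv, colClasses = c(x1 = "character", x5 = "factor"))

# One class per column, as a named character vector
sapply(ans2, class)
```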

Specifying column class types with colClasses

ans <- fread(str, colClasses = c("integer", "integer", 
                                 "numeric", "logical", "factor"))
str(ans)
Classes ‘data.table’ and 'data.frame':    2 obs. of  5 variables:
 $ x1: int  1 3
 $ x2: int  2 4
 $ x3: num  1.5 2.5
 $ x4: logi  TRUE FALSE
 $ x5: Factor w/ 2 levels "cc","ff": 1 2

Specifying column class types with colClasses

str <- "x1,x2,x3,x4,x5,x6\n1,2,1.5,2.5,aa,bb\n3,4,5.5,6.5,cc,dd"
ans <- fread(str, colClasses = list(numeric = 1:4, factor = c("x5", "x6")))
str(ans)
Classes ‘data.table’ and 'data.frame': 2 obs. of 6 variables:
 $ x1: num  1 3
 $ x2: num  2 4
 $ x3: num  1.5 5.5
 $ x4: num  2.5 6.5
 $ x5: Factor w/ 2 levels "aa","cc": 1 2
 $ x6: Factor w/ 2 levels "bb","dd": 1 2
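
The positional vector and the named list are two ways of stating the same thing. The following sketch, assuming the five-column input used earlier (stored in an illustrative variable csv), reads it both ways and compares the resulting column classes:

```r
library(data.table)

csv <- "x1,x2,x3,x4,x5\n1,2,1.5,true,cc\n3,4,2.5,false,ff"

# Positional form: one class per column, in file order
by_position <- fread(csv, colClasses = c("integer", "integer",
                                         "numeric", "logical", "factor"))

# List form: class names map to column numbers or column names
by_list <- fread(csv, colClasses = list(integer = 1:2,
                                        numeric = "x3",
                                        logical = "x4",
                                        factor  = "x5"))

# Both calls should produce the same class for every column
identical(sapply(by_position, class), sapply(by_list, class))
```

The list form tends to be shorter when many columns share a class, because each class name is written only once.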

The fill argument

str <- "1,2\n3,4,a\n5,6\n7,8,b"
fread(str) 
V1 5 6
 7 8 b
Warning message:
In fread(str) :
  Detected 2 column names but the data has 3 columns (i.e. invalid file). 
  Added 1 extra default column name for the first column which is guessed to 
  be row names or an index. 
  Use setnames() afterwards if this guess is not correct, 
  or fix the file write command that created the file to create a valid file.

The fill argument

fread(str, fill = TRUE)
V1 V2 V3
 1  2
 3  4  a
 5  6
 7  8  b
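
With fill = TRUE every row is kept and short rows are padded out to the widest row. A minimal sketch of checking the result and then renaming the default V1, V2, V3 columns with setnames(), as the warning above suggests (the new column names are hypothetical, chosen only for illustration):

```r
library(data.table)

# Rows 1 and 3 have two fields, rows 2 and 4 have three
csv <- "1,2\n3,4,a\n5,6\n7,8,b"

filled <- fread(csv, fill = TRUE)

# All four rows survive, padded to three columns
dim(filled)

# Replace the default names by reference (names are illustrative)
setnames(filled, c("first", "second", "label"))
names(filled)
```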

The na.strings argument

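Missing values are commonly encoded with placeholder strings such as "999", "##NA" or "N/A". List the placeholders used in your file in na.strings so that fread() reads them as NA.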

str <- "x,y,z\n1,###,3\n2,4,###\n#N/A,7,9"
ans <- fread(str, na.strings = c("###", "#N/A"))
ans
 x  y  z
 1 NA  3
 2  4 NA
NA  7  9
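
One way to confirm that the placeholders were recognised is to count the NA values per column. This sketch reuses the input from above under an illustrative variable name (csv):

```r
library(data.table)

csv <- "x,y,z\n1,###,3\n2,4,###\n#N/A,7,9"

ans <- fread(csv, na.strings = c("###", "#N/A"))

# Number of missing values in each column; each has exactly one here
colSums(is.na(ans))

# Once the placeholders are read as NA, the columns stay numeric
sapply(ans, class)
```

Without na.strings, a column that contains a placeholder such as "###" would generally be read as character, so a quick class check is a cheap safeguard.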

Let's practice!
