Advanced file reading

Data Manipulation with data.table in R

Matt Dowle, Arun Srinivasan

Instructors, DataCamp

Reading big integers using integer64 type

  • R's native integer type is 32-bit, so it can only represent whole numbers up to 2^31 - 1 = 2147483647 in absolute value
  • fread() automatically reads larger integers as type integer64, which is provided by the bit64 package
ans <- fread("id,name\n1234567890123,Jane\n5284782381811,John\n")
ans
           id name
1234567890123 Jane
5284782381811 John
class(ans$id)
"integer64"
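
When bit64 is not installed, or downstream code cannot work with integer64, fread()'s integer64 argument offers two alternatives. The following is a minimal sketch that reuses the data from the example above; the variable names txt, as_chr, and as_dbl are illustrative only.

```r
library(data.table)

# Same input as in the example above; `txt` is an illustrative name.
txt <- "id,name\n1234567890123,Jane\n5284782381811,John\n"

# The largest value R's native integer type can hold.
.Machine$integer.max

# Read big integers as character strings (exact, but not numeric).
as_chr <- fread(txt, integer64 = "character")
class(as_chr$id)

# Read big integers as doubles (numeric, but exact only up to 2^53).
as_dbl <- fread(txt, integer64 = "double")
class(as_dbl$id)
```

Reading as character is the safer choice for identifiers that are never used in arithmetic; reading as double is convenient when the values are known to stay well below 2^53.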

Specifying column class types with colClasses

str <- "x1,x2,x3,x4,x5\n1,2,1.5,true,cc\n3,4,2.5,false,ff"

ans <- fread(str, colClasses = c(x5 = "factor"))
str(ans)
Classes ‘data.table’ and 'data.frame':    2 obs. of  5 variables:
 $ x1: int  1 3
 $ x2: int  2 4
 $ x3: num  1.5 2.5
 $ x4: logi  TRUE FALSE
 $ x5: Factor w/ 2 levels "cc","ff": 1 2

Specifying column class types with colClasses

ans <- fread(str, colClasses = c("integer", "integer", 
                                 "numeric", "logical", "factor"))
str(ans)
Classes ‘data.table’ and 'data.frame':    2 obs. of  5 variables:
 $ x1: int  1 3
 $ x2: int  2 4
 $ x3: num  1.5 2.5
 $ x4: logi  TRUE FALSE
 $ x5: Factor w/ 2 levels "cc","ff": 1 2

Specifying column class types with colClasses

str <- "x1,x2,x3,x4,x5,x6\n1,2,1.5,2.5,aa,bb\n3,4,5.5,6.5,cc,dd"
ans <- fread(str, colClasses = list(numeric = 1:4, factor = c("x5", "x6")))
str(ans)
Classes ‘data.table’ and 'data.frame': 2 obs. of 6 variables:
 $ x1: num  1 3
 $ x2: num  2 4
 $ x3: num  1.5 5.5
 $ x4: num  2.5 6.5
 $ x5: Factor w/ 2 levels "aa","cc": 1 2
 $ x6: Factor w/ 2 levels "bb","dd": 1 2
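
A common reason to override the guessed type is an identifier column that only looks numeric. The sketch below uses hypothetical ZIP-code data (not from the course). With default settings, fread() would typically parse such a column as an integer and drop the leading zero, so the column is forced to character by name.

```r
library(data.table)

# Hypothetical data: ZIP codes, one of which starts with a zero.
txt <- "zip,city\n02134,Boston\n80202,Denver"

# Forcing the column to character keeps the codes exactly as written.
ans <- fread(txt, colClasses = c(zip = "character"))
ans$zip
```

The named-vector form is used here because only one column needs overriding; the remaining columns keep their automatically detected types.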

The fill argument

str <- "1,2\n3,4,a\n5,6\n7,8,b"
fread(str) 
V1 5 6
 7 8 b
Warning message:
In fread(str) :
  Detected 2 column names but the data has 3 columns (i.e. invalid file). 
  Added 1 extra default column name for the first column which is guessed to 
  be row names or an index. 
  Use setnames() afterwards if this guess is not correct, 
  or fix the file write command that created the file to create a valid file.

The fill argument

fread(str, fill = TRUE)
V1 V2 V3
 1  2
 3  4  a
 5  6
 7  8  b
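
Because this input has no header row, the filled result gets the default names V1, V2, and V3. As a small sketch, setnames() (mentioned in the warning shown earlier) can rename them in place; the new names here are illustrative assumptions, not part of the course material.

```r
library(data.table)

# Same ragged input as above; `txt` is an illustrative name.
txt <- "1,2\n3,4,a\n5,6\n7,8,b"

# fill = TRUE pads the short rows so that all four rows are kept.
ans <- fread(txt, fill = TRUE)

# Replace the default names by reference (new names are illustrative).
setnames(ans, c("first", "second", "label"))
dim(ans)
names(ans)
```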

The na.strings argument

Missing values are commonly encoded with placeholder strings such as "999", "##NA", or "N/A"

str <- "x,y,z\n1,###,3\n2,4,###\n#N/A,7,9"
ans <- fread(str, na.strings = c("###", "#N/A"))
ans
 x  y  z
 1 NA  3
 2  4 NA
NA  7  9
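
To see why na.strings matters for column types as well as for the NA values themselves, the following sketch reads the same input with and without it. The names txt, raw, and clean are illustrative, and header = TRUE is passed explicitly in the first call only to avoid relying on header detection.

```r
library(data.table)

# Same input as above; `txt` is an illustrative name.
txt <- "x,y,z\n1,###,3\n2,4,###\n#N/A,7,9"

# Without na.strings, the placeholder text forces each column to character.
raw <- fread(txt, header = TRUE)
sapply(raw, class)

# With na.strings, the placeholders become NA and the columns stay integer.
clean <- fread(txt, na.strings = c("###", "#N/A"))
sapply(clean, class)
```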

Let's practice!
