Advanced file reading

Data Manipulation with data.table in R

Matt Dowle, Arun Srinivasan

Instructors, DataCamp

Reading big integers using integer64 type

  • By default, R's integer type is 32-bit, so it can only represent integers up to 2^31 - 1 = 2147483647
  • Large integers are automatically read in as integer64 type, provided by the bit64 package
ans <- fread("id,name\n1234567890123,Jane\n5284782381811,John\n")
ans
           id name
1234567890123 Jane
5284782381811 John
class(ans$id)
"integer64"
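
The limit and one alternative to integer64 can be sketched as follows. This is a minimal illustration, assuming the data.table package is installed; it relies on fread's integer64 argument, which the data.table documentation describes as accepting "character" among other values. The input string is the one from the slide, and the name ans_chr is purely illustrative.

```r
library(data.table)

# Base R's integer type is 32-bit; this is its largest value (2^31 - 1).
.Machine$integer.max

# Going one past the limit overflows: the result is NA, with a warning.
suppressWarnings(.Machine$integer.max + 1L)

txt <- "id,name\n1234567890123,Jane\n5284782381811,John\n"

# Instead of integer64, keep the large ids as text so no digits are lost
# and the bit64 package is not needed for printing.
ans_chr <- fread(txt, integer64 = "character")
class(ans_chr$id)
```

Reading ids as character is often reasonable when they are labels rather than quantities, since no arithmetic is ever done on them.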

Specifying column class types with colClasses

str <- "x1,x2,x3,x4,x5\n1,2,1.5,true,cc\n3,4,2.5,false,ff"

ans <- fread(str, colClasses = c(x5 = "factor"))
str(ans)
Classes ‘data.table’ and 'data.frame':    2 obs. of  5 variables:
 $ x1: int  1 3
 $ x2: int  2 4
 $ x3: num  1.5 2.5
 $ x4: logi  TRUE FALSE
 $ x5: Factor w/ 2 levels "cc","ff": 1 2
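
A named colClasses vector is also handy when the guessed type would lose information. The sketch below uses made-up toy data (the column names code and label are hypothetical) to show a column of zero-padded codes: left to itself, fread should guess integer and drop the leading zeros, while naming that one column as "character" keeps them.

```r
library(data.table)

# Toy input, invented for illustration: zero-padded codes.
txt <- "code,label\n007,a\n042,b\n"

# Default type guessing: code is read as a number, so the padding is lost.
guessed <- fread(txt)
class(guessed$code)

# Override only the named column; label is still guessed as usual.
kept <- fread(txt, colClasses = c(code = "character"))
kept$code
```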

Specifying column class types with colClasses

ans <- fread(str, colClasses = c("integer", "integer", 
                                 "numeric", "logical", "factor"))
str(ans)
Classes ‘data.table’ and 'data.frame':    2 obs. of  5 variables:
 $ x1: int  1 3
 $ x2: int  2 4
 $ x3: num  1.5 2.5
 $ x4: logi  TRUE FALSE
 $ x5: Factor w/ 2 levels "cc","ff": 1 2

Specifying column class types with colClasses

str <- "x1,x2,x3,x4,x5,x6\n1,2,1.5,2.5,aa,bb\n3,4,5.5,6.5,cc,dd"
ans <- fread(str, colClasses = list(numeric = 1:4, factor = c("x5", "x6")))
str(ans)
Classes ‘data.table’ and 'data.frame': 2 obs. of 6 variables:
 $ x1: num  1 3
 $ x2: num  2 4
 $ x3: num  1.5 5.5
 $ x4: num  2.5 6.5
 $ x5: Factor w/ 2 levels "aa","cc": 1 2
 $ x6: Factor w/ 2 levels "bb","dd": 1 2
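
The positional and list forms of colClasses are two ways of writing the same request. As a quick check, this sketch reads the six-column string from the slide both ways and compares the resulting column classes; the names by_position and by_type are illustrative only.

```r
library(data.table)

txt <- "x1,x2,x3,x4,x5,x6\n1,2,1.5,2.5,aa,bb\n3,4,5.5,6.5,cc,dd"

# Positional form: one class per column, in column order.
by_position <- fread(txt, colClasses = rep(c("numeric", "factor"), c(4, 2)))

# List form: one entry per class, holding column numbers or names.
by_type <- fread(txt, colClasses = list(numeric = 1:4, factor = c("x5", "x6")))

# Both calls should produce the same class for every column.
identical(sapply(by_position, class), sapply(by_type, class))
```

The list form tends to scale better for wide files, because columns that share a class are listed once rather than repeated position by position.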

The fill argument

str <- "1,2\n3,4,a\n5,6\n7,8,b"
fread(str) 
V1 5 6
 7 8 b
Warning message:
In fread(str) :
  Detected 2 column names but the data has 3 columns (i.e. invalid file). 
  Added 1 extra default column name for the first column which is guessed to 
  be row names or an index. 
  Use setnames() afterwards if this guess is not correct, 
  or fix the file write command that created the file to create a valid file.
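
Before reaching for a fix, it can help to confirm that the rows really are ragged. The sketch below uses only base R and a naive comma split (an assumption that holds here because the input has no quoted fields) to count the fields on each line of the same string.

```r
txt <- "1,2\n3,4,a\n5,6\n7,8,b"

# Split the text into lines, then each line into comma-separated fields.
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
n_fields <- lengths(strsplit(lines, ",", fixed = TRUE))

n_fields          # 2 3 2 3: rows alternate between two and three fields
unique(n_fields)  # more than one distinct count means ragged rows
```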

The fill argument

fread(str, fill = TRUE)
V1 V2 V3
 1  2
 3  4  a
 5  6
 7  8  b
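
The padded cells print as blanks, which suggests they are empty strings rather than NA in the character column. Under that assumption, a possible follow-up step is to convert them to NA explicitly so that functions such as is.na() see them as missing. This is a sketch, not the only way to do it.

```r
library(data.table)

txt <- "1,2\n3,4,a\n5,6\n7,8,b"
ans <- fread(txt, fill = TRUE)

# Assumption: short rows were padded with "" in the character column V3.
# Replace those empty strings with a proper missing value, by reference.
ans[V3 == "", V3 := NA_character_]

# Count the missing values per column after the conversion.
colSums(is.na(ans))
```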

The na.strings argument

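Missing values are commonly encoded with placeholder strings such as "999", "##NA" or "N/A". Pass every such placeholder to na.strings so that fread reads it as NA.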

str <- "x,y,z\n1,###,3\n2,4,###\n#N/A,7,9"
ans <- fread(str, na.strings = c("###", "#N/A"))
ans
 x  y  z
 1 NA  3
 2  4 NA
NA  7  9
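
The effect of na.strings shows up in the column types as well as in the printed values. The sketch below reads the same string with and without the argument. Without it, the placeholders are expected to be ordinary text, which should force the affected columns to character; the names raw and clean are illustrative.

```r
library(data.table)

txt <- "x,y,z\n1,###,3\n2,4,###\n#N/A,7,9"

# Without na.strings the placeholders are data, so columns cannot be numeric.
raw <- fread(txt)
sapply(raw, class)

# With na.strings the placeholders become NA and the columns stay integer.
clean <- fread(txt, na.strings = c("###", "#N/A"))
sapply(clean, class)

# One missing value per column.
colSums(is.na(clean))
```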

Let's practice!
