Data type constraints

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

Course outline

Diagnosing dirty data, side effects of dirty data, and cleaning dirty data

Chapter 1 - Common data problems

Why do we need clean data?

Data science workflow: access data, explore and process data, extract insights, report insights

Human and technical error

Errors propagate through the workflow

Data type constraints

Data type   Example                                   R data type
Text        First name, last name, address, ...       character
Integer     Subscriber count, # products sold, ...    integer
Decimal     Temperature, exchange rate, ...           numeric
Binary      Is married, new customer, yes/no, ...     logical
Category    Marriage status, color, ...               factor
Date        Order dates, date of birth, ...           Date
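
As a minimal sketch, the mapping in the table can be checked with class(). The variable names and values below are made up for illustration and are not part of the course data.

```r
# Hypothetical example values, one per data type in the table
first_name  <- "Maggie"                # text
subscribers <- 1200L                   # integer (the L suffix creates an integer)
temperature <- 21.5                    # decimal
is_married  <- TRUE                    # binary
status      <- factor("single", levels = c("single", "married"))  # category
order_date  <- as.Date("2020-01-31")   # date

# class() reports the R data type of each value
class(first_name)
class(subscribers)
class(temperature)
class(is_married)
class(status)
class(order_date)
```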

Glimpsing at data types

sales <- read.csv("sales.csv")
head(sales)
  order_id revenue quantity
1     7432   5,454      494
2     7808   5,668      334
3     4893   4,062      259
4     6107   3,936       15
5     7661   1,067      307
6     5908   6,635      235
library(dplyr)
glimpse(sales)
Observations: 100
Variables: 3
$ order_id <dbl> 7432, 7808, ...
$ revenue  <chr> "5,454", "5,668", ...
$ quantity <dbl> 494, 334, ...

Checking data types

is.numeric(sales$revenue)
FALSE
library(assertive)
assert_is_numeric(sales$revenue)
Error: is_numeric : sales$revenue is not of class 'numeric'; it has class 'character'.
assert_is_numeric(sales$quantity)


Checking data types

Logical checking - returns TRUE/FALSE

  • is.character()
  • is.numeric()
  • is.logical()
  • is.factor()
  • is.Date() (from lubridate)
  • ...

assertive checking - errors when FALSE

  • assert_is_character()
  • assert_is_numeric()
  • assert_is_logical()
  • assert_is_factor()
  • assert_is_date()
  • ...
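
Where the assertive package is not installed, base R's stopifnot() behaves in a similar way: it stays silent when the check passes and raises an error when it fails. The sketch below assumes a small stand-in data frame built from the first rows shown earlier; it is a base R alternative, not the assertive API.

```r
# Small stand-in for the sales data (first three rows shown earlier)
sales <- data.frame(
  order_id = c(7432, 7808, 4893),
  revenue  = c("5,454", "5,668", "4,062"),
  quantity = c(494, 334, 259),
  stringsAsFactors = FALSE
)

# Logical checks: return TRUE or FALSE
is.numeric(sales$quantity)
is.numeric(sales$revenue)

# Assertion-style check: silent on success
stopifnot(is.numeric(sales$quantity))

# Assertion-style check: errors on failure.
# The error is captured here so the script keeps running.
result <- tryCatch(
  stopifnot(is.numeric(sales$revenue)),
  error = function(e) conditionMessage(e)
)
result
```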

Why does data type matter?

class(sales$revenue)
"character"
mean(sales$revenue)
NA
Warning message:
In mean.default(sales$revenue) :
  argument is not numeric or logical: returning NA

Comma problems

sales$revenue
"5,454" "5,668" "4,062" "3,936" "1,067" ...

Character to number

library(stringr)
revenue_trimmed <- str_remove(sales$revenue, ",")

revenue_trimmed
"5454" "5668" "4062" "3936" "1067" ...
as.numeric(revenue_trimmed)
5454 5668 4062 3936 1067 ...
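
One caveat worth sketching: str_remove() removes only the first match, so a value with two thousands separators would keep its second comma; str_remove_all() removes every match. The example below illustrates the same distinction with the base R equivalents sub() and gsub(), so it runs without stringr. The revenue strings are hypothetical.

```r
# Hypothetical revenue strings, including one with two thousands separators
revenue_raw <- c("5,454", "1,234,567")

# sub() replaces only the first comma, like stringr::str_remove()
first_only <- sub(",", "", revenue_raw, fixed = TRUE)

# gsub() replaces every comma, like stringr::str_remove_all()
all_removed <- gsub(",", "", revenue_raw, fixed = TRUE)

# A leftover comma makes as.numeric() return NA (with a warning)
suppressWarnings(as.numeric(first_only))

# With every comma removed, the conversion succeeds for both values
as.numeric(all_removed)
```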

Putting it together

sales %>%
  mutate(revenue_usd = as.numeric(str_remove(revenue, ",")))
# A tibble: 100 x 4
   order_id revenue quantity revenue_usd
      <dbl> <chr>      <dbl>       <dbl>
 1     7432 5,454        494        5454
 2     7808 5,668        334        5668
 3     4893 4,062        259        4062
 4     6107 3,936         15        3936
 5     7661 1,067        307        1067
# ... with 95 more rows
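
After a conversion like this, it can be useful to confirm that the new column has the expected type and that no values were silently turned into NA. The following is a rough base R sketch of such a check, assuming a three-row stand-in for the sales data; the checks themselves are a suggestion, not part of the course material.

```r
# Three-row stand-in for the sales data
sales <- data.frame(
  revenue = c("5,454", "5,668", "4,062"),
  stringsAsFactors = FALSE
)

# Same conversion as above, written in base R
sales$revenue_usd <- as.numeric(gsub(",", "", sales$revenue, fixed = TRUE))

# The new column should be numeric
stopifnot(is.numeric(sales$revenue_usd))

# The conversion should not introduce NAs that were absent from the original
stopifnot(sum(is.na(sales$revenue_usd)) == sum(is.na(sales$revenue)))

# Numeric summaries now work
mean(sales$revenue_usd)
```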

Same function, different outcomes

mean(sales$revenue)
NA
Warning message:
In mean.default(sales$revenue) :
  argument is not numeric or logical: returning NA
mean(sales$revenue_usd)
5361.4

Converting data types

  • as.character()
  • as.numeric()
  • as.logical()
  • as.factor()
  • as.Date()
  • ...
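
A brief sketch of these conversion functions on made-up values. Note that a conversion that cannot be interpreted typically yields NA rather than an error, so results are worth inspecting.

```r
# Number to text
as_text <- as.character(42)

# Text to number
as_number <- as.numeric("3.14")

# Text to logical: "TRUE" and "false" are recognised, "yes" is not and becomes NA
as_flag <- as.logical(c("TRUE", "false", "yes"))

# Text to factor: levels are the sorted unique values
as_category <- as.factor(c("single", "married", "single"))

# Text to Date: ISO-style dates parse by default,
# other layouts need an explicit format string
iso_date   <- as.Date("2020-01-31")
other_date <- as.Date("31/01/2020", format = "%d/%m/%Y")

as_flag
levels(as_category)
iso_date == other_date
```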

Watch out: factor to numeric

product_type
1000 1000 3000 2000 3000
Levels: 1000 2000 3000
class(product_type)
"factor"
as.numeric(product_type)
1 1 3 2 3
as.numeric(as.character(product_type))
1000 1000 3000 2000 3000
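
The reason for the surprise is that a factor stores integer codes that index into its levels, and as.numeric() returns those codes. The sketch below rebuilds the product_type example and also shows an alternative idiom that converts the levels once and then indexes by the factor; both approaches should give the same result.

```r
# Rebuild the factor from the example
product_type <- factor(c("1000", "1000", "3000", "2000", "3000"))

# as.numeric() on a factor returns the internal integer codes
codes <- as.numeric(product_type)

# Going through character recovers the original values
via_character <- as.numeric(as.character(product_type))

# Alternative: convert the levels once, then index by the factor
via_levels <- as.numeric(levels(product_type))[product_type]

codes
via_character
identical(via_character, via_levels)
```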

Let's practice!
