Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp







| Data type | Example | R data type |
|---|---|---|
| Text | First name, last name, address, ... | character |
| Integer | Subscriber count, # products sold, ... | integer |
| Decimal | Temperature, exchange rate, ... | numeric |
| Binary | Is married, new customer, yes/no, ... | logical |
| Category | Marriage status, color, ... | factor |
| Date | Order dates, date of birth, ... | Date |
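
The mapping between data types and R data types can be checked with `class()`. The following is a minimal sketch; the values in `examples` are hypothetical and chosen only to give one instance of each type.

```r
# Hypothetical values, one per data type listed in the table.
examples <- list(
  text     = "Maggie",
  integer  = 42L,
  decimal  = 98.6,
  binary   = TRUE,
  category = factor("married", levels = c("single", "married")),
  date     = as.Date("2020-01-31")
)

# class() reports the R data type of each value.
classes <- vapply(examples, class, character(1))
print(classes)
```

Note the `L` suffix on `42L`: without it, R stores a whole number as numeric rather than integer.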
sales <- read.csv("sales.csv")
head(sales)
  order_id revenue quantity
1     7432   5,454      494
2     7808   5,668      334
3     4893   4,062      259
4     6107   3,936       15
5     7661   1,067      307
6     5908   6,635      235
library(dplyr)
glimpse(sales)
Observations: 100
Variables: 3
$ order_id <dbl> 7432, 7808, ...
$ revenue  <chr> "5,454", "5,668", ...
$ quantity <dbl> 494, 334, ...
is.numeric(sales$revenue)
FALSE
library(assertive)
assert_is_numeric(sales$revenue)
Error: is_numeric : sales$revenue is not of class 'numeric'; it has class 'character'.
assert_is_numeric(sales$quantity)
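
The same stop-on-failure pattern can be sketched in base R with `stopifnot()`, which is silent when the check passes and raises an error when it fails. This is an illustrative alternative, not the assertive API; `sales_demo` is a hypothetical stand-in for `sales`, built from the first rows shown earlier.

```r
# Stand-in for the first rows of `sales` (values copied from the head() output).
sales_demo <- data.frame(
  order_id = c(7432, 7808, 4893),
  revenue  = c("5,454", "5,668", "4,062"),
  quantity = c(494, 334, 259),
  stringsAsFactors = FALSE
)

# Passes silently: quantity is numeric.
stopifnot(is.numeric(sales_demo$quantity))

# Fails: revenue is character. tryCatch() captures the error
# so the rest of the script keeps running.
result <- tryCatch(
  {
    stopifnot(is.numeric(sales_demo$revenue))
    "passed"
  },
  error = function(e) "failed"
)
print(result)
```
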
| Logical checking - returns TRUE/FALSE | assertive checking - errors when FALSE |
|---|---|
| is.character() | assert_is_character() |
| is.numeric() | assert_is_numeric() |
| is.logical() | assert_is_logical() |
| is.factor() | assert_is_factor() |
| is.Date() | assert_is_date() |

class(sales$revenue)
"character"
mean(sales$revenue)
NA
Warning message:
In mean.default(sales$revenue) :
  argument is not numeric or logical: returning NA
sales$revenue
"5,454" "5,668" "4,062" "3,936" "1,067" ...
library(stringr)
revenue_trimmed <- str_remove(sales$revenue, ",")
revenue_trimmed
"5454" "5668" "4062" "3936" "1067" ...
as.numeric(revenue_trimmed)
5454 5668 4062 3936 1067 ...
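
`str_remove()` removes only the first match in each string, which is enough here because every revenue value has a single comma. The sketch below shows what happens with a larger value. It uses base R so that it runs without extra packages: `sub()` behaves like `str_remove()` and `gsub()` behaves like `str_remove_all()`. The vector `revenue_raw` is hypothetical.

```r
# Hypothetical revenue strings; the last one has two thousands separators.
revenue_raw <- c("5,454", "1,067", "1,234,567")

# sub() drops only the first comma (same behavior as stringr::str_remove()).
first_only <- sub(",", "", revenue_raw, fixed = TRUE)

# gsub() drops every comma (same behavior as stringr::str_remove_all()).
all_commas <- gsub(",", "", revenue_raw, fixed = TRUE)

# as.numeric() returns NA, with a warning, for strings that still contain a comma.
converted_first <- suppressWarnings(as.numeric(first_only))
converted_all   <- as.numeric(all_commas)

print(converted_first)
print(converted_all)
```

If values of one million or more are possible, removing all commas is the safer choice.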
sales %>%
  mutate(revenue_usd = as.numeric(str_remove(revenue, ",")))
# A tibble: 100 x 4
   order_id revenue quantity revenue_usd
      <dbl> <chr>      <dbl>       <dbl>
 1     7432 5,454        494        5454
 2     7808 5,668        334        5668
 3     4893 4,062        259        4062
 4     6107 3,936         15        3936
 5     7661 1,067        307        1067
# ... with 95 more rows
mean(sales$revenue)
NA
Warning message:
In mean.default(sales$revenue) :
  argument is not numeric or logical: returning NA
mean(sales$revenue_usd)
5361.4
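
For `mean(sales$revenue_usd)` to work, the new column has to be stored in the data frame; a `mutate()` call whose result is only printed leaves `sales` unchanged, so with dplyr the result would be assigned back (`sales <- sales %>% mutate(...)`). A minimal base R sketch of the same idea follows, assuming a hypothetical `sales_demo` built from the five revenue values shown above.

```r
# Stand-in for `sales`, using the five revenue values shown above.
sales_demo <- data.frame(
  revenue = c("5,454", "5,668", "4,062", "3,936", "1,067"),
  stringsAsFactors = FALSE
)

# Assigning the converted values stores the new column in the data frame.
sales_demo$revenue_usd <- as.numeric(
  gsub(",", "", sales_demo$revenue, fixed = TRUE)
)

# The mean now works because revenue_usd is numeric.
avg <- mean(sales_demo$revenue_usd)
print(avg)
```

The mean of these five rows differs from the 5361.4 reported above, which is computed over all 100 rows.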
as.character()
as.numeric()
as.logical()
as.factor()
as.Date()

product_type
1000 1000 3000 2000 3000
Levels: 1000 2000 3000
class(product_type)
"factor"
as.numeric(product_type)
1 1 3 2 3
as.numeric(as.character(product_type))
1000 1000 3000 2000 3000
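
The difference arises because a factor stores integer codes that point into its levels, and `as.numeric()` returns those codes. The sketch below reproduces the example and adds one alternative, indexing the converted levels by the factor, which gives the same result as going through `as.character()`.

```r
# Same values as product_type above.
product_type <- factor(c(1000, 1000, 3000, 2000, 3000))

# Direct conversion returns the internal level codes, not the labels.
codes <- as.numeric(product_type)

# Converting through character recovers the original numbers.
via_character <- as.numeric(as.character(product_type))

# Alternative: convert the levels once, then index by the factor.
via_levels <- as.numeric(levels(product_type))[product_type]

print(codes)
print(via_character)
print(via_levels)
```
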