Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
Data type | Example | R data type |
---|---|---|
Text | First name, last name, address, ... | character |
Integer | Subscriber count, # products sold, ... | integer |
Decimal | Temperature, exchange rate, ... | numeric |
Binary | Is married, new customer, yes/no, ... | logical |
Category | Marriage status, color, ... | factor |
Date | Order dates, date of birth, ... | Date |
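As a minimal sketch of the mapping above, the snippet below creates one value of each data type and inspects it with class(). All variable names and values are invented for illustration.

```r
# One illustrative value per data type (all values are invented examples)
first_name  <- "Ada"                      # text
subscribers <- 42L                        # integer (the L suffix makes it an integer)
temperature <- 21.5                       # decimal
is_married  <- TRUE                       # binary
color       <- factor("red", levels = c("red", "green", "blue"))  # category
order_date  <- as.Date("2020-01-31")      # date

# class() reports the R data type of each value
types <- sapply(
  list(first_name, subscribers, temperature, is_married, color, order_date),
  class
)
print(types)
```

Note that a whole number typed without the L suffix is stored as numeric, which is why columns read from a file often show up as <dbl> rather than <int>.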
sales <- read.csv("sales.csv")
head(sales)
order_id revenue quantity
1 7432 5,454 494
2 7808 5,668 334
3 4893 4,062 259
4 6107 3,936 15
5 7661 1,067 307
6 5908 6,635 235
library(dplyr)
glimpse(sales)
Observations: 100
Variables: 3
$ order_id <dbl> 7432, 7808, ...
$ revenue <chr> "5,454", "5,668", ...
$ quantity <dbl> 494, 334, ...
is.numeric(sales$revenue)
FALSE
library(assertive)
assert_is_numeric(sales$revenue)
Error: is_numeric : sales$revenue is not of class 'numeric'; it has class 'character'.
assert_is_numeric(sales$quantity)
Logical checking - returns TRUE/FALSE | assertive checking - errors when FALSE |
---|---|
is.character() | assert_is_character() |
is.numeric() | assert_is_numeric() |
is.logical() | assert_is_logical() |
is.factor() | assert_is_factor() |
is.Date() | assert_is_date() |
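The difference between the two styles of check can be sketched in base R alone. This is an assumption-laden stand-in, not the assertive package itself: stopifnot() plays the role of the assert_is_*() functions, and the small sales data frame is invented to mirror the columns shown earlier.

```r
# Invented sample data mirroring the sales columns above
sales <- data.frame(
  order_id = c(7432, 7808, 4893),
  revenue  = c("5,454", "5,668", "4,062"),
  quantity = c(494, 334, 259),
  stringsAsFactors = FALSE
)

# Logical check: returns TRUE or FALSE and lets the script continue
revenue_is_numeric <- is.numeric(sales$revenue)
print(revenue_is_numeric)

# Assertion-style check in base R: stopifnot() raises an error when the
# condition is FALSE; tryCatch() captures the message so this sketch keeps running
check_result <- tryCatch(
  {
    stopifnot(is.numeric(sales$revenue))
    "passed"
  },
  error = function(e) conditionMessage(e)
)
print(check_result)

# A passing assertion is silent
stopifnot(is.numeric(sales$quantity))
```

A logical check suits branching code (if/else), while an assertion suits the top of a pipeline, where a wrong type should halt everything that follows.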
class(sales$revenue)
"character"
mean(sales$revenue)
NA
Warning message:
In mean.default(sales$revenue) :
argument is not numeric or logical: returning NA
sales$revenue
"5,454" "5,668" "4,062" "3,936" "1,067" ...
library(stringr)
revenue_trimmed <- str_remove(sales$revenue, ",")
revenue_trimmed
"5454" "5668" "4062" "3936" "1067" ...
as.numeric(revenue_trimmed)
5454 5668 4062 3936 1067 ...
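One caveat worth sketching: str_remove() removes only the first match in each string, so a value with two thousands separators would keep a comma and become NA under as.numeric(); stringr's str_remove_all() removes every match. The example below shows the same first-versus-all distinction using base R's sub() and gsub() as stand-ins, on invented revenue strings.

```r
# Invented revenue strings, including one with two thousands separators
revenue_raw <- c("5,454", "1,234,567", "980")

# sub() replaces only the first comma in each string
first_only <- sub(",", "", revenue_raw, fixed = TRUE)

# gsub() replaces every comma in each string
all_removed <- gsub(",", "", revenue_raw, fixed = TRUE)

print(first_only)
print(as.numeric(all_removed))
```

For the sales data shown here every value is below one million, so removing the first comma is enough; removing all matches is the safer default when the range of values is unknown.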
sales %>%
mutate(revenue_usd = as.numeric(str_remove(revenue, ",")))
# A tibble: 100 x 4
order_id revenue quantity revenue_usd
<dbl> <chr> <dbl> <dbl>
1 7432 5,454 494 5454
2 7808 5,668 334 5668
3 4893 4,062 259 4062
4 6107 3,936 15 3936
5 7661 1,067 307 1067
# ... with 95 more rows
mean(sales$revenue)
NA
Warning message:
In mean.default(sales$revenue) :
argument is not numeric or logical: returning NA
mean(sales$revenue_usd)
5361.4
as.character()
as.numeric()
as.logical()
as.factor()
as.Date()
product_type
1000 1000 3000 2000 3000
Levels: 1000 2000 3000
class(product_type)
"factor"
as.numeric(product_type)
1 1 3 2 3
as.numeric(as.character(product_type))
1000 1000 3000 2000 3000
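The factor pitfall above can be reproduced in a few lines. This sketch rebuilds product_type from the values shown and compares the direct conversion with two ways of recovering the labels; the variable names codes, via_character, and via_levels are introduced here for illustration only.

```r
# Factor whose labels look like numbers, matching the product_type shown above
product_type <- factor(c(1000, 1000, 3000, 2000, 3000))

# Direct conversion returns the internal level codes, not the labels
codes <- as.numeric(product_type)

# Converting through character recovers the labels as numbers
via_character <- as.numeric(as.character(product_type))

# Alternative: convert the levels once, then index by the factor
via_levels <- as.numeric(levels(product_type))[product_type]

print(codes)
print(via_character)
print(via_levels)
```

Both alternatives give the same result; the levels-based form converts each distinct label only once, which can matter for long factors with few levels.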