Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
Cross-field validation = a sanity check
Does this value make sense based on other values?
head(credit_cards)
  date_opened dining_cb groceries_cb  gas_cb total_cb acct_age
1  2018-07-05     26.08        83.43   78.90   188.41        1
2  2016-01-23   1309.33         4.46 1072.25  2386.04        4
3  2016-03-25    205.84       119.20  800.62  1125.66        4
4  2018-06-20     14.00        16.37   18.41    48.78        1
5  2017-02-08     98.50       283.68  281.70   788.33        3
6  2014-11-18    889.28      2626.34 2973.62  6489.24        5
credit_cards %>%
  select(dining_cb:total_cb)
  dining_cb groceries_cb  gas_cb total_cb
1     26.08        83.43   78.90   188.41
2   1309.33         4.46 1072.25  2386.04
3    205.84       119.20  800.62  1125.66
4     14.00        16.37   18.41    48.78
5     98.50       283.68  281.70   788.33
6    889.28      2626.34 2973.62  6489.24
credit_cards %>%
  mutate(theoretical_total = dining_cb + groceries_cb + gas_cb) %>%
  filter(theoretical_total != total_cb) %>%
  select(dining_cb:theoretical_total)
  dining_cb groceries_cb  gas_cb total_cb theoretical_total
1     98.50       283.68  281.70   788.33            663.88
2   3387.53       363.85 2706.42  4502.94           6457.80
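Comparing sums of decimals with != can flag rows that differ only by floating-point rounding. A minimal sketch of a tolerance-based version of the same check, using dplyr's near(); the cashback data frame here is hypothetical, made up for illustration, and is not the course's credit_cards data:

```r
library(dplyr)

# Hypothetical rows for illustration; not the course's credit_cards data
cashback <- data.frame(
  dining_cb    = c(0.1, 98.50),
  groceries_cb = c(0.2, 283.68),
  gas_cb       = c(0.0, 281.70),
  total_cb     = c(0.3, 788.33)
)

mismatched <- cashback %>%
  mutate(theoretical_total = dining_cb + groceries_cb + gas_cb) %>%
  # near() compares within a small tolerance instead of bit-for-bit,
  # so 0.1 + 0.2 is treated as equal to 0.3
  filter(!near(theoretical_total, total_cb))

mismatched
```

Only the second row should be flagged: its components sum to 663.88, not 788.33. The first row differs from its total only by rounding error, which a plain != would report as a mismatch.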
credit_cards %>%
  select(date_opened, acct_age)
  date_opened acct_age
1  2018-07-05        1
2  2016-01-23        4
3  2016-03-25        4
4  2018-06-20        1
5  2017-02-08        3
6  2014-11-18        5
library(lubridate)
date_difference <- as.Date("2015-09-04") %--% today()
date_difference
2015-09-04 UTC--2020-03-09 UTC
as.numeric(date_difference, "years")
4.511978
floor(as.numeric(date_difference, "years"))
4
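The output above (4.511978) is consistent with a year being counted as 365.25 days. Under that convention, an interval that ends exactly on an anniversary in a non-leap span may come out just short of a whole number, so floor() could report one year too few. A hedged sketch of a calendar-aware alternative using lubridate's time_length(); the two dates are hypothetical, chosen to land exactly on a one-year anniversary:

```r
library(lubridate)

# Hypothetical dates chosen to land exactly on a one-year anniversary
opened <- as.Date("2017-03-09")
as_of  <- as.Date("2018-03-09")

# Duration-based years, as in the code above
approx_years <- as.numeric(opened %--% as_of, "years")

# Calendar-aware years: time_length() on an interval accounts for
# years having different lengths
exact_years <- time_length(opened %--% as_of, "years")

floor(approx_years)
floor(exact_years)
```

Whether the difference matters depends on how acct_age was recorded in the first place; for accounts checked on or near their anniversary, the calendar-aware version is the safer assumption.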
credit_cards %>%
  mutate(theor_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
  filter(theor_age != acct_age)
  date_opened acct_age dining_cb groceries_cb  gas_cb total_cb theor_age
1  2016-03-25        4    814.34       471.58 3167.41  4453.33         3
2  2018-03-06        3    238.48       186.05  213.84   638.37         2
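Because today() moves forward, the same check flags different rows when it is rerun later. A minimal sketch that pins the comparison to a fixed snapshot date instead; the accounts data frame and the as_of value are assumptions for illustration (the date matches the 2020-03-09 shown in the interval output earlier), not part of the course data:

```r
library(dplyr)
library(lubridate)

# Hypothetical rows for illustration; not the course's credit_cards data
accounts <- data.frame(
  date_opened = as.Date(c("2018-07-05", "2016-03-25", "2014-11-18")),
  acct_age    = c(1, 4, 5)
)

# Assumed snapshot date: the day the acct_age column was recorded
as_of <- as.Date("2020-03-09")

inconsistent_ages <- accounts %>%
  mutate(theor_age = floor(as.numeric(date_opened %--% as_of, "years"))) %>%
  filter(theor_age != acct_age)

inconsistent_ages
```

With this snapshot date, only the account opened on 2016-03-25 is flagged: it is recorded as 4 years old but had not yet reached its fourth anniversary.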