Cross field validation

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

What is cross field validation?

  • Cross field validation = a sanity check

  • Does this value make sense based on other values?

A screenshot from a news channel showing the results of a poll asking, "Should Scotland be independent?" Yes is 52% of the answers and No is 58%, which adds up to 110%.

1 https://www.buzzfeednews.com/article/katienotopoulos/graphs-that-lied-to-us
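
The same sanity check can be written in a couple of lines of R. This is only a sketch: the `poll` data frame is a hypothetical stand-in for the numbers in the screenshot, and the rule that the shares should total 100 is an assumption about how the poll was reported.

```r
# Hypothetical data frame mirroring the poll in the screenshot
poll <- data.frame(answer = c("Yes", "No"), pct = c(52, 58))

# Cross field check: the shares for one question should add up to 100
total_pct <- sum(poll$pct)
is_consistent <- total_pct == 100

total_pct
is_consistent
```

Neither value looks wrong on its own; only the comparison across fields shows the problem.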

Credit card data

head(credit_cards)
  date_opened dining_cb groceries_cb  gas_cb total_cb acct_age
1  2018-07-05     26.08        83.43   78.90   188.41        1
2  2016-01-23   1309.33         4.46 1072.25  2386.04        4
3  2016-03-25    205.84       119.20  800.62  1125.66        4
4  2018-06-20     14.00        16.37   18.41    48.78        1
5  2017-02-08     98.50       283.68  281.70   788.33        3
6  2014-11-18    889.28      2626.34 2973.62  6489.24        5

Validating numbers

credit_cards %>% 
  select(dining_cb:total_cb)
   dining_cb groceries_cb  gas_cb total_cb
1      26.08        83.43   78.90   188.41
2    1309.33         4.46 1072.25  2386.04
3     205.84       119.20  800.62  1125.66
4      14.00        16.37   18.41    48.78
5      98.50       283.68  281.70   788.33
6     889.28      2626.34 2973.62  6489.24

Validating numbers

credit_cards %>%
  mutate(theoretical_total = dining_cb + groceries_cb + gas_cb) %>%
  filter(theoretical_total != total_cb) %>%
  select(dining_cb:theoretical_total)
  dining_cb groceries_cb  gas_cb total_cb theoretical_total
1     98.50       283.68  281.70   788.33            663.88
2   3387.53       363.85 2706.42  4502.94           6457.80
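
One caveat with `!=` on decimal numbers: sums of doubles can differ from the stored total by a tiny rounding error, so exact comparison may flag rows that are actually fine. A possible variation is sketched below using `near()` from dplyr. The `cb` data frame is made up for illustration, and the tolerance of 0.005 (half a cent) is an assumption, not something taken from the course.

```r
library(dplyr)

# Made-up sample: row 1 is correct but trips exact comparison, row 2 is truly wrong
cb <- data.frame(
  dining_cb    = c(0.1, 98.50),
  groceries_cb = c(0.2, 283.68),
  gas_cb       = c(0.0, 281.70),
  total_cb     = c(0.3, 788.33)
)

# Exact comparison is fooled by floating point rounding
exact_mismatch <- (0.1 + 0.2) != 0.3

flagged <- cb %>%
  mutate(theoretical_total = dining_cb + groceries_cb + gas_cb) %>%
  # near() treats values within the tolerance as equal
  filter(!near(theoretical_total, total_cb, tol = 0.005))

flagged
```

Only the row whose total is genuinely off remains in `flagged`.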

Validating date and age

credit_cards %>%
  select(date_opened, acct_age)
   date_opened acct_age
1   2018-07-05        1
2   2016-01-23        4
3   2016-03-25        4
4   2018-06-20        1
5   2017-02-08        3
6   2014-11-18        5

Calculating age

library(lubridate)
date_difference <- as.Date("2015-09-04") %--% today()
date_difference
2015-09-04 UTC--2020-03-09 UTC
as.numeric(date_difference, "years")
4.511978
floor(as.numeric(date_difference, "years"))
4
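
Because `today()` changes every day, the result above will drift when the code is rerun. A minimal sketch of a reproducible version, assuming a fixed reference date: `reference_date` is a hypothetical name, and its value is simply the end date shown in the printed interval.

```r
library(lubridate)

# Assumed fixed reference date, matching the interval printed on the slide
reference_date <- as.Date("2020-03-09")

# Same calculation as on the slide, with today() swapped for the fixed date
date_difference <- as.Date("2015-09-04") %--% reference_date
age_years <- floor(as.numeric(date_difference, "years"))

age_years
```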

Validating age

credit_cards %>%
  mutate(theor_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
  filter(theor_age != acct_age)
  date_opened acct_age dining_cb groceries_cb  gas_cb total_cb theor_age
1  2016-03-25        4    814.34       471.58 3167.41  4453.33         3
2  2018-03-06        3    238.48       186.05  213.84   638.37         2

What next?

On the left, a trash can to represent dropping data. In the middle, a question mark to represent set to missing and impute. On the right, some squares linked by lines containing check marks and x's to represent applying rules from domain knowledge.
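
The three options can be sketched roughly as follows. The two rows are borrowed from the credit card output shown earlier; the `inconsistent` flag, the 0.005 tolerance, and the rule used in option 3 (trust the component columns and recompute the total) are all assumptions for illustration, and the right choice depends on the data.

```r
library(dplyr)

# Two rows from the credit card example; the second total is inconsistent
cb <- data.frame(
  dining_cb    = c(26.08, 98.50),
  groceries_cb = c(83.43, 283.68),
  gas_cb       = c(78.90, 281.70),
  total_cb     = c(188.41, 788.33)
)

checked <- cb %>%
  mutate(
    theoretical_total = dining_cb + groceries_cb + gas_cb,
    inconsistent = !near(theoretical_total, total_cb, tol = 0.005)
  )

# Option 1: drop the inconsistent rows
dropped <- checked %>% filter(!inconsistent)

# Option 2: set the suspect value to missing so it can be imputed later
set_missing <- checked %>%
  mutate(total_cb = if_else(inconsistent, NA_real_, total_cb))

# Option 3: apply a domain rule (assumed here: the components are trusted)
recomputed <- checked %>%
  mutate(total_cb = if_else(inconsistent, theoretical_total, total_cb))
```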

Let's practice!
