Cross field validation

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

What is cross field validation?

  • Cross field validation = a sanity check

  • Does this value make sense based on other values?

A screenshot from a news channel showing the results of a poll asking, "Should Scotland be independent?" Yes is 52% of the answers and No is 58% of the answers, which sums to 110%.

1 https://www.buzzfeednews.com/article/katienotopoulos/graphs-that-lied-to-us
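
As a minimal sketch of the sanity check the screenshot fails, the shares of a single-choice poll can be summed and compared with 100. The `poll` data frame and its column names are hypothetical, built only to mirror the numbers in the image.

```r
# Hypothetical poll results mirroring the screenshot (names are illustrative)
poll <- data.frame(
  answer = c("Yes", "No"),
  pct    = c(52, 58)
)

# Cross field check: shares of a single-choice question should sum to 100
total_pct     <- sum(poll$pct)
is_consistent <- total_pct == 100

total_pct      # 110
is_consistent  # FALSE
```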

Credit card data

head(credit_cards)
  date_opened dining_cb groceries_cb  gas_cb total_cb acct_age
1  2018-07-05     26.08        83.43   78.90   188.41        1
2  2016-01-23   1309.33         4.46 1072.25  2386.04        4
3  2016-03-25    205.84       119.20  800.62  1125.66        4
4  2018-06-20     14.00        16.37   18.41    48.78        1
5  2017-02-08     98.50       283.68  281.70   788.33        3
6  2014-11-18    889.28      2626.34 2973.62  6489.24        5

Validating numbers

credit_cards %>% 
  select(dining_cb:total_cb)
   dining_cb groceries_cb  gas_cb total_cb
1      26.08        83.43   78.90   188.41
2    1309.33         4.46 1072.25  2386.04
3     205.84       119.20  800.62  1125.66
4      14.00        16.37   18.41    48.78
5      98.50       283.68  281.70   788.33
6     889.28      2626.34 2973.62  6489.24

Validating numbers

credit_cards %>%
  mutate(theoretical_total = dining_cb + groceries_cb + gas_cb) %>%
  filter(theoretical_total != total_cb) %>%
  select(dining_cb:theoretical_total)
  dining_cb groceries_cb  gas_cb total_cb theoretical_total
1     98.50       283.68  281.70   788.33            663.88
2   3387.53       363.85 2706.42  4502.94           6457.80
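
One caveat with the exact `!=` comparison: sums of decimal amounts can differ from the stored total by a tiny floating-point error, so consistent rows may be flagged by mistake. A possible variant, sketched here with `dplyr::near()` and an assumed tolerance of half a cent, is shown below. The `cb` data frame is made up for illustration; only its third row is genuinely inconsistent.

```r
library(dplyr)

# Hypothetical rows: the first two totals are consistent, the third is not.
# In row 2, 0.10 + 0.20 is not exactly 0.30 in floating point,
# so an exact != comparison would flag it.
cb <- data.frame(
  dining_cb    = c(26.08, 0.10, 98.50),
  groceries_cb = c(83.43, 0.20, 283.68),
  gas_cb       = c(78.90, 0.00, 281.70),
  total_cb     = c(188.41, 0.30, 788.33)
)

mismatches <- cb %>%
  mutate(theoretical_total = dining_cb + groceries_cb + gas_cb) %>%
  # near() compares within a tolerance instead of requiring exact equality;
  # tol = 0.005 (half a cent) is an assumption, not a value from the slides
  filter(!near(theoretical_total, total_cb, tol = 0.005))

mismatches
```

The tolerance should be chosen to suit the precision of the data; half a cent is only a plausible default for amounts stored to two decimals.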

Validating date and age

credit_cards %>%
  select(date_opened, acct_age)
   date_opened acct_age
1   2018-07-05        1
2   2016-01-23        4
3   2016-03-25        4
4   2018-06-20        1
5   2017-02-08        3
6   2014-11-18        5

Calculating age

library(lubridate)
date_difference <- as.Date("2015-09-04") %--% today()
date_difference
2015-09-04 UTC--2020-03-09 UTC
as.numeric(date_difference, "years")
4.511978
floor(as.numeric(date_difference, "years"))
4

Validating age

credit_cards %>%
  mutate(theor_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
  filter(theor_age != acct_age)
  date_opened acct_age dining_cb groceries_cb  gas_cb total_cb theor_age
1  2016-03-25        4    814.34       471.58 3167.41  4453.33         3
2  2018-03-06        3    238.48       186.05  213.84   638.37         2
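
Because `today()` changes every day, the rows flagged above depend on when the code runs. If the ages were recorded on a known date, a reproducible variant could compare against that date instead. The sketch below assumes a snapshot date of 2020-03-09 (the date visible in the earlier interval output) and uses a small made-up `accounts` data frame; `as_of` is an illustrative name.

```r
library(dplyr)
library(lubridate)

# Hypothetical accounts (values are illustrative)
accounts <- data.frame(
  date_opened = as.Date(c("2018-07-05", "2016-03-25")),
  acct_age    = c(1, 4)
)

# Assumed date on which acct_age was recorded
as_of <- as.Date("2020-03-09")

age_mismatches <- accounts %>%
  # Same calculation as above, with a fixed reference date instead of today()
  mutate(theor_age = floor(as.numeric(date_opened %--% as_of, "years"))) %>%
  filter(theor_age != acct_age)

age_mismatches
```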

What next?

On the left, a trash can to represent dropping data. In the middle, a question mark to represent set to missing and impute. On the right, some squares linked by lines containing check marks and x's to represent applying rules from domain knowledge.
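
The three options pictured above could look roughly like this in dplyr. This is a sketch on a made-up `cb` data frame, not a recommendation of one approach. The rule used in option 3 (trust the component columns and recompute the total) is an assumption for illustration; the right rule depends on domain knowledge about the data.

```r
library(dplyr)

# Hypothetical data: row 2 has a total that does not match its components
cb <- data.frame(
  dining_cb    = c(26.08, 98.50),
  groceries_cb = c(83.43, 283.68),
  gas_cb       = c(78.90, 281.70),
  total_cb     = c(188.41, 788.33)
) %>%
  mutate(
    theoretical_total = dining_cb + groceries_cb + gas_cb,
    # tol = 0.005 is an assumed tolerance for two-decimal amounts
    is_valid = near(theoretical_total, total_cb, tol = 0.005)
  )

# Option 1: drop the inconsistent rows
dropped <- cb %>%
  filter(is_valid)

# Option 2: set the inconsistent value to missing, to be imputed later
set_missing <- cb %>%
  mutate(total_cb = if_else(is_valid, total_cb, NA_real_))

# Option 3: apply a domain rule (ASSUMED here: components are trustworthy)
recomputed <- cb %>%
  mutate(total_cb = if_else(is_valid, total_cb, theoretical_total))
```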


Let's practice!
