Cross field validation

Cleaning Data in Python

Adel Nehme

Content Developer @ DataCamp

Motivation

import pandas as pd

flights = pd.read_csv('flights.csv')
flights.head()
  flight_number  economy_class  business_class  first_class  total_passengers
0         DL140            100              60           40               200
1         BA248            130             100           70               300
2        MEA124            100              50           50               200
3        AFR939            140              70           90               300
4        TKA101            130             100           20               250
Cleaning Data in Python

Cross field validation

The use of multiple fields in a dataset to sanity check data integrity

  flight_number  economy_class  business_class  first_class  total_passengers
0         DL140            100       +      60      +    40        =      200
1         BA248            130       +     100      +    70        =      300
2        MEA124            100       +      50      +    50        =      200
3        AFR939            140       +      70      +    90        =      300
4        TKA101            130       +     100      +    20        =      250
sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis = 1)

passenger_equ = sum_classes == flights['total_passengers']
# Find and filter out rows with inconsistent passenger totals inconsistent_pass = flights[~passenger_equ] consistent_pass = flights[passenger_equ]
Cleaning Data in Python

Cross field validation

users.head()
   user_id  Age   Birthday
0    32985   22 1998-03-02
1    94387   27 1993-12-04
2    34236   42 1978-11-24
3    12551   31 1989-01-03
4    55212   18 2002-07-02
Cleaning Data in Python

Cross field validation

import pandas as pd
import datetime as dt

# Convert to datetime and get today's date
users['Birthday'] = pd.to_datetime(users['Birthday'])

today = dt.date.today()
# For each row in the Birthday column, calculate year difference age_manual = today.year - users['Birthday'].dt.year
# Find instances where ages match age_equ = age_manual == users['Age']
# Find and filter out rows with inconsistent age inconsistent_age = users[~age_equ] consistent_age = users[age_equ]
Cleaning Data in Python

What to do when we catch inconsistencies?

Cleaning Data in Python

Let's practice!

Cleaning Data in Python

Preparing Video For Download...