Cleaning Data in Python
Adel Nehme
Content Developer @ DataCamp
import pandas as pd
flights = pd.read_csv('flights.csv')
flights.head()
flight_number economy_class business_class first_class total_passengers
0 DL140 100 60 40 200
1 BA248 130 100 70 300
2 MEA124 100 50 50 200
3 AFR939 140 70 90 300
4 TKA101 130 100 20 250
The use of multiple fields in a dataset to sanity check data integrity
flight_number economy_class business_class first_class total_passengers
0 DL140 100 + 60 + 40 = 200
1 BA248 130 + 100 + 70 = 300
2 MEA124 100 + 50 + 50 = 200
3 AFR939 140 + 70 + 90 = 300
4 TKA101 130 + 100 + 20 = 250
sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis = 1)
passenger_equ = sum_classes == flights['total_passengers']
# Find and filter out rows with inconsistent passenger totals inconsistent_pass = flights[~passenger_equ] consistent_pass = flights[passenger_equ]
users.head()
user_id Age Birthday
0 32985 22 1998-03-02
1 94387 27 1993-12-04
2 34236 42 1978-11-24
3 12551 31 1989-01-03
4 55212 18 2002-07-02
import pandas as pd import datetime as dt # Convert to datetime and get today's date users['Birthday'] = pd.to_datetime(users['Birthday'])
today = dt.date.today()
# For each row in the Birthday column, calculate year difference age_manual = today.year - users['Birthday'].dt.year
# Find instances where ages match age_equ = age_manual == users['Age']
# Find and filter out rows with inconsistent age inconsistent_age = users[~age_equ] consistent_age = users[age_equ]
Cleaning Data in Python