Cross field validation

Cleaning Data in Python

Adel Nehme

VP of AI Curriculum, DataCamp

Motivation

import pandas as pd

flights = pd.read_csv('flights.csv')
flights.head()
  flight_number  economy_class  business_class  first_class  total_passengers
0         DL140            100              60           40               200
1         BA248            130             100           70               300
2        MEA124            100              50           50               200
3        AFR939            140              70           90               300
4        TKA101            130             100           20               250

Cross field validation

Using multiple fields in a dataset to sanity-check data integrity.

  flight_number  economy_class  business_class  first_class  total_passengers
0         DL140            100       +      60      +    40        =      200
1         BA248            130       +     100      +    70        =      300
2        MEA124            100       +      50      +    50        =      200
3        AFR939            140       +      70      +    90        =      300
4        TKA101            130       +     100      +    20        =      250
# Add up the three class columns row by row
sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis=1)

# True where the class counts add up to the recorded total
passenger_equ = sum_classes == flights['total_passengers']

# Find and filter out rows with inconsistent passenger totals
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]
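
The flights shown above all pass this check, so the filter would return an empty `inconsistent_pass`. As a minimal sketch of what a failing row looks like, the example below rebuilds three of the rows in memory and adds one hypothetical flight (`XX000`, not part of the original data) whose total does not match its class counts.

```python
import pandas as pd

# First three rows are copied from the flights table above; the last row is a
# hypothetical record added only so that one total fails the check.
flights = pd.DataFrame({
    'flight_number': ['DL140', 'BA248', 'MEA124', 'XX000'],
    'economy_class': [100, 130, 100, 120],
    'business_class': [60, 100, 50, 30],
    'first_class': [40, 70, 50, 10],
    'total_passengers': [200, 300, 200, 150],
})

class_cols = ['economy_class', 'business_class', 'first_class']

# Row-wise sum of the class columns, compared against the recorded total
sum_classes = flights[class_cols].sum(axis=1)
passenger_equ = sum_classes == flights['total_passengers']

inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]

print(inconsistent_pass['flight_number'].tolist())  # ['XX000'] (120 + 30 + 10 = 160, not 150)
```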

Cross field validation

users.head()
   user_id  Age   Birthday
0    32985   22 1998-03-02
1    94387   27 1993-12-04
2    34236   42 1978-11-24
3    12551   31 1989-01-03
4    55212   18 2002-07-02

Cross field validation

import pandas as pd
import datetime as dt

# Convert to datetime and get today's date
users['Birthday'] = pd.to_datetime(users['Birthday'])

today = dt.date.today()
# For each row in the Birthday column, calculate year difference
# (year difference only: it ignores whether this year's birthday has passed)
age_manual = today.year - users['Birthday'].dt.year

# Find instances where ages match
age_equ = age_manual == users['Age']

# Find and filter out rows with inconsistent age
inconsistent_age = users[~age_equ]
consistent_age = users[age_equ]
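
The year difference used above can be one year too high for anyone whose birthday has not yet occurred in the current year, which would flag correct ages as inconsistent. The sketch below is one possible refinement, not the course's method: `age_on` is a hypothetical helper, and the reference date is fixed (an arbitrary choice) so the result is reproducible.

```python
import pandas as pd

def age_on(birthdays: pd.Series, ref: pd.Timestamp) -> pd.Series:
    """Age in completed years on the reference date (hypothetical helper)."""
    years = ref.year - birthdays.dt.year
    # True where the birthday falls later in the year than the reference date
    not_yet = (birthdays.dt.month > ref.month) | (
        (birthdays.dt.month == ref.month) & (birthdays.dt.day > ref.day)
    )
    # Subtract one year for birthdays that have not happened yet
    return years - not_yet.astype(int)

# Birthdays taken from the first three rows of the users table above
birthdays = pd.to_datetime(pd.Series(['1998-03-02', '1993-12-04', '1978-11-24']))

# Fixed reference date, chosen only for illustration
ref = pd.Timestamp('2020-06-15')

print((ref.year - birthdays.dt.year).tolist())  # [22, 27, 42] (plain year difference)
print(age_on(birthdays, ref).tolist())          # [22, 26, 41] (completed years)
```

Which of the two definitions is right depends on how the `Age` column was originally recorded, so it is worth confirming that before discarding rows.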

What to do when we catch inconsistencies?
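
The text of this slide does not list the answers, so the sketch below is an assumption rather than the course's own list. It shows three commonly used options on the flights example: dropping the inconsistent rows, setting the suspect value to missing, or recomputing it from the other fields. The `XX000` row is hypothetical, and the third option assumes the class counts are more trustworthy than the recorded total.

```python
import pandas as pd

# Rows from the flights table above plus one hypothetical inconsistent row
flights = pd.DataFrame({
    'flight_number': ['DL140', 'BA248', 'MEA124', 'XX000'],
    'economy_class': [100, 130, 100, 120],
    'business_class': [60, 100, 50, 30],
    'first_class': [40, 70, 50, 10],
    'total_passengers': [200, 300, 200, 150],
})

class_cols = ['economy_class', 'business_class', 'first_class']
sum_classes = flights[class_cols].sum(axis=1)
passenger_equ = sum_classes == flights['total_passengers']

# Option 1: drop the inconsistent rows
dropped = flights[passenger_equ]

# Option 2: keep the rows but mark the suspect total as missing
as_missing = flights.copy()
as_missing['total_passengers'] = as_missing['total_passengers'].where(passenger_equ)

# Option 3: recompute the total from the class counts
# (assumes the class counts are the reliable fields)
recomputed = flights.copy()
recomputed.loc[~passenger_equ, 'total_passengers'] = sum_classes[~passenger_equ]
```

Which option fits depends on how the data was collected and which field is more likely to be wrong.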

Let's practice!
