Cross field validation

Cleaning Data in Python

Adel Nehme

VP of AI Curriculum, DataCamp

Motivation

import pandas as pd

flights = pd.read_csv('flights.csv')
flights.head()
  flight_number  economy_class  business_class  first_class  total_passengers
0         DL140            100              60           40               200
1         BA248            130             100           70               300
2        MEA124            100              50           50               200
3        AFR939            140              70           90               300
4        TKA101            130             100           20               250

Cross field validation

Using multiple fields in a dataset to check that the data is consistent

  flight_number  economy_class  business_class  first_class  total_passengers
0         DL140            100       +      60      +    40        =      200
1         BA248            130       +     100      +    70        =      300
2        MEA124            100       +      50      +    50        =      200
3        AFR939            140       +      70      +    90        =      300
4        TKA101            130       +     100      +    20        =      250
sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis = 1)

passenger_equ = sum_classes == flights['total_passengers']
# Find and filter out rows with inconsistent passenger totals
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]
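
The five rows shown above all add up, so the check would flag nothing on them. The following is a minimal, self-contained sketch of the same check on a small in-memory DataFrame. The rows `XX001` and `XX002` are hypothetical and made inconsistent on purpose; they are not taken from flights.csv.

```python
import pandas as pd

# Hypothetical sample; XX001 and XX002 are deliberately inconsistent
flights = pd.DataFrame({
    'flight_number': ['DL140', 'BA248', 'XX001', 'XX002'],
    'economy_class': [100, 130, 100, 140],
    'business_class': [60, 100, 50, 70],
    'first_class': [40, 70, 50, 90],
    'total_passengers': [200, 300, 210, 250],
})

class_cols = ['economy_class', 'business_class', 'first_class']

# Row-wise sum of the three class columns
sum_classes = flights[class_cols].sum(axis=1)

# Boolean mask: True where the parts add up to the reported total
passenger_equ = sum_classes == flights['total_passengers']

# Split the data on the mask
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]

print(inconsistent_pass['flight_number'].tolist())
```

Keeping the mask in its own variable makes it easy to reuse for both the consistent and the inconsistent subset.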

Cross field validation

users.head()
   user_id  Age   Birthday
0    32985   22 1998-03-02
1    94387   27 1993-12-04
2    34236   42 1978-11-24
3    12551   31 1989-01-03
4    55212   18 2002-07-02

Cross field validation

import pandas as pd
import datetime as dt

# Convert to datetime and get today's date
users['Birthday'] = pd.to_datetime(users['Birthday'])

today = dt.date.today()
# For each row in the Birthday column, calculate the year difference
# (approximate: ignores whether this year's birthday has already passed)
age_manual = today.year - users['Birthday'].dt.year
# Find instances where ages match
age_equ = age_manual == users['Age']
# Find and filter out rows with inconsistent age
inconsistent_age = users[~age_equ]
consistent_age = users[age_equ]
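
A plain year difference overstates the age of anyone whose birthday has not yet occurred in the current year. The sketch below shows one possible stricter check. It uses the first three rows from `users.head()` and a fixed, hypothetical `reference_date` instead of today's date so that the result is reproducible. Whether the `Age` column is meant to follow this stricter convention is an assumption, not something the slides state.

```python
import pandas as pd

# First three rows from users.head(); reference_date is a hypothetical fixed date
users = pd.DataFrame({
    'user_id': [32985, 94387, 34236],
    'Age': [22, 27, 42],
    'Birthday': pd.to_datetime(['1998-03-02', '1993-12-04', '1978-11-24']),
})
reference_date = pd.Timestamp('2020-06-01')

# Plain year difference
year_diff = reference_date.year - users['Birthday'].dt.year

# True where the birthday has not yet occurred in the reference year
birthday_pending = (
    (users['Birthday'].dt.month > reference_date.month)
    | ((users['Birthday'].dt.month == reference_date.month)
       & (users['Birthday'].dt.day > reference_date.day))
)

# Subtract one year for users whose birthday is still to come
age_exact = year_diff - birthday_pending.astype(int)

# Compare against the recorded age and keep the mismatches
age_equ = age_exact == users['Age']
inconsistent_age = users[~age_equ]

print(inconsistent_age['user_id'].tolist())
```

Under this stricter rule, the two users with birthdays late in the year are flagged, while the plain year difference would accept all three rows. Which rule is appropriate depends on how `Age` was recorded.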

What to do when we catch inconsistencies?
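
The transcript does not preserve an answer to this question, so the following is only a sketch of three commonly used options: dropping the inconsistent rows, setting the suspect value to missing, or recomputing it from the other fields. The sample rows are hypothetical, and the third option assumes the class counts are more trustworthy than the reported total.

```python
import pandas as pd

# Hypothetical sample; XX001 has an inconsistent total
flights = pd.DataFrame({
    'flight_number': ['DL140', 'XX001', 'BA248'],
    'economy_class': [100, 100, 130],
    'business_class': [60, 50, 100],
    'first_class': [40, 50, 70],
    'total_passengers': [200, 210, 300],
})
class_cols = ['economy_class', 'business_class', 'first_class']
passenger_equ = flights[class_cols].sum(axis=1) == flights['total_passengers']

# Option 1: drop the inconsistent rows
dropped = flights[passenger_equ]

# Option 2: set the suspect total to missing (NaN) for later imputation
as_missing = flights.assign(
    total_passengers=flights['total_passengers'].where(passenger_equ)
)

# Option 3: recompute the total from the class columns
# (assumes the class counts are the more reliable fields)
recomputed = flights.assign(
    total_passengers=flights[class_cols].sum(axis=1)
)
```

Which option fits best depends on how well the dataset and its sources are understood; none of them is correct in every case.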

Let's practice!
