Cleaning Data in Python
Adel Nehme
Content Developer @DataCamp
Predefined finite set of categories
| Type of data | Example values | Numeric representation |
|---|---|---|
| Marital status | unmarried, married | 0, 1 |
| Household income category | 0-20K, 20-40K, … | 0, 1, … |
| Loan status | default, payed, no_loan | 0, 1, 2 |
Marital status can only be unmarried _or_ married
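As a minimal sketch of how such a numeric representation can arise, the example below uses the pandas categorical dtype. The sample values and variable names are assumptions made for illustration; they do not come from the course data.

```python
import pandas as pd

# Hypothetical sample values; the order of `categories` determines the integer codes.
marital_dtype = pd.CategoricalDtype(categories=["unmarried", "married"])
marital_status = pd.Series(
    ["unmarried", "married", "married", "unmarried"],
    dtype=marital_dtype,
)

# Integer code backing each value: unmarried -> 0, married -> 1
codes = marital_status.cat.codes.tolist()
print(codes)

# A value outside the predefined set becomes missing, and its code is -1
with_invalid = pd.Series(["married", "divorced"]).astype(marital_dtype)
invalid_codes = with_invalid.cat.codes.tolist()
print(invalid_codes)
```

Because values outside the declared set are turned into missing values, converting to a categorical dtype is one possible way to surface membership violations, though it discards the offending value rather than reporting it.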


# Read study data and print it
import pandas as pd
study_data = pd.read_csv('study.csv')
study_data
       name    birthday blood_type
1      Beth  2019-10-20         B-
2  Ignatius  2020-07-08         A-
3      Paul  2019-08-12         O+
4     Helen  2019-03-17         O-
5  Jennifer  2019-12-17         Z+  <--
6   Kennedy  2020-04-27         A+
7     Keith  2019-04-19        AB+
# Correct possible blood types
categories
  blood_type
1         O-
2         O+
3         A-
4         A+
5         B+
6         B-
7        AB+
8        AB-



# Find blood types in study_data that are not in the reference categories
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)
{'Z+'}
# Get and print rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
study_data[inconsistent_rows]
name birthday blood_type
5 Jennifer 2019-12-17 Z+
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]
# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]
name birthday blood_type
1 Beth 2019-10-20 B-
2 Ignatius 2020-07-08 A-
3 Paul 2019-08-12 O+
4 Helen 2019-03-17 O-
... ... ... ...
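The steps above can be tried without `study.csv` using the self-contained sketch below. It assumes inline DataFrames rebuilt from the printed rows as stand-ins for the real file and for the `categories` table; how those objects are actually loaded in the course is not shown here.

```python
import pandas as pd

# Inline stand-in for study.csv, copied from the printed rows (an assumption)
study_data = pd.DataFrame(
    {
        "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
        "birthday": ["2019-10-20", "2020-07-08", "2019-08-12", "2019-03-17",
                     "2019-12-17", "2020-04-27", "2019-04-19"],
        "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
    },
    index=range(1, 8),
)

# Inline stand-in for the reference table of valid blood types
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]},
    index=range(1, 9),
)

# Values present in the data but absent from the reference categories
inconsistent_categories = set(study_data["blood_type"]).difference(categories["blood_type"])

# Boolean mask marking the rows that hold an inconsistent value
inconsistent_rows = study_data["blood_type"].isin(inconsistent_categories)

# Split the data into inconsistent and consistent parts
inconsistent_data = study_data[inconsistent_rows]
consistent_data = study_data[~inconsistent_rows]

print(inconsistent_categories)
print(consistent_data)
```

Dropping rows is only one way to handle the mismatch; whether to drop, remap, or investigate a value such as `Z+` depends on the data set.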