Cleaning Data in Python
Adel Nehme
Content Developer @DataCamp
Predefined finite set of categories
| Type of data | Example values | Numeric representation |
|---|---|---|
| Marriage Status | unmarried, married | 0, 1 |
| Household Income Category | 0-20K, 20-40K, ... | 0, 1, ... |
| Loan Status | default, paid, no_loan | 0, 1, 2 |
Marriage status can only be unmarried _or_ married
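The mapping from category labels to integers can be reproduced with a pandas categorical. The following is a minimal sketch, assuming made-up sample values: the `marriage_status` series and the `allowed` list are illustrative names, not part of the course data.

```python
import pandas as pd

# Hypothetical sample values, for illustration only
marriage_status = pd.Series(["unmarried", "married", "married", "divorced"])

# Declare the allowed categories explicitly; their order fixes the integer codes
allowed = ["unmarried", "married"]
as_category = pd.Categorical(marriage_status, categories=allowed)

# Each value maps to the position of its category in `allowed`;
# a value outside the predefined set (here "divorced") gets the code -1
codes = as_category.codes.tolist()
print(codes)
```

A code of -1 is one quick signal that a value falls outside the predefined set of categories.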
# Read study data and print it
import pandas as pd
study_data = pd.read_csv('study.csv')
study_data
name birthday blood_type
1 Beth 2019-10-20 B-
2 Ignatius 2020-07-08 A-
3 Paul 2019-08-12 O+
4 Helen 2019-03-17 O-
5 Jennifer 2019-12-17 Z+
6 Kennedy 2020-04-27 A+
7 Keith 2019-04-19 AB+
# Correct possible blood types
categories
blood_type
1 O-
2 O+
3 A-
4 A+
5 B+
6 B-
7 AB+
8 AB-
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)
{'Z+'}
# Get and print rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
study_data[inconsistent_rows]
name birthday blood_type
5 Jennifer 2019-12-17 Z+
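Since `study.csv` is not available here, the same steps can be tried on an in-memory copy of the rows shown above. This is a sketch under that assumption: the two DataFrames are rebuilt by hand from the printed output, and the birthdays are kept as plain strings.

```python
import pandas as pd

# Rebuilt by hand from the printed study data (stand-in for study.csv)
study_data = pd.DataFrame(
    {
        "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
        "birthday": ["2019-10-20", "2020-07-08", "2019-08-12", "2019-03-17",
                     "2019-12-17", "2020-04-27", "2019-04-19"],
        "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
    },
    index=range(1, 8),
)

# Reference table of valid blood types
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]},
    index=range(1, 9),
)

# Values present in the data but absent from the reference table
inconsistent_categories = set(study_data["blood_type"]).difference(categories["blood_type"])

# Boolean mask selecting the rows that carry one of those values
inconsistent_rows = study_data["blood_type"].isin(inconsistent_categories)
print(study_data[inconsistent_rows])
```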
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]
# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]
name birthday blood_type
1 Beth 2019-10-20 B-
2 Ignatius 2020-07-08 A-
3 Paul 2019-08-12 O+
4 Helen 2019-03-17 O-
... ... ... ...
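When the inconsistent rows themselves are not needed, the filter can also be written in one step by keeping only rows whose value appears in the allowed set. The sketch below assumes a small helper, `keep_known_categories`, which is a hypothetical name introduced for illustration and not a pandas function. The data is again rebuilt by hand from the rows printed earlier.

```python
import pandas as pd


def keep_known_categories(df, column, allowed):
    """Return only the rows of df whose `column` value appears in `allowed`."""
    return df[df[column].isin(allowed)]


# Hand-built stand-in for the study data (names and blood types only)
study_data = pd.DataFrame(
    {
        "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
        "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
    },
    index=range(1, 8),
)
allowed_blood_types = ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]

consistent_data = keep_known_categories(study_data, "blood_type", allowed_blood_types)
print(consistent_data)
```

Note that this variant also drops rows with a missing value in the column, because a missing value is not in the allowed list. That may or may not be the desired behavior.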