Membership constraints

Cleaning Data in Python

Adel Nehme

Content Developer @DataCamp

Chapter 2 - Text and categorical data problems

Categories and membership constraints

Predefined finite set of categories

Type of data               Example values            Numeric representation
Marriage Status            unmarried, married        0, 1
Household Income Category  0-20K, 20-40K, ...        0, 1, ...
Loan Status                default, payed, no_loan   0, 1, 2

 

Marriage status can only be unmarried _or_ married
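
The mapping from a fixed set of categories to numeric codes can be sketched with pandas. This is a minimal illustration, not part of the original slides: the sample values and the variable names `marriage_status` and `allowed` are assumptions made up for the example. A value outside the allowed set receives the code -1.

```python
import pandas as pd

# Hypothetical sample values; "single" is deliberately not an allowed category
marriage_status = pd.Series(["married", "unmarried", "married", "single"])

# The predefined, finite set of categories
allowed = ["unmarried", "married"]

# Encode the values against the allowed set
as_category = pd.Categorical(marriage_status, categories=allowed)

# Numeric representation: position in `allowed`, or -1 for non-members
codes = as_category.codes.tolist()
print(codes)
```

Encoding against an explicit category list is one way to surface membership violations early, since non-members become missing values instead of silently forming a new category.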

Why could we have these problems?

[Figure: categorical_issues]

How do we treat these problems?

[Figure: categories]

An example

import pandas as pd

# Read study data and print it
study_data = pd.read_csv('study.csv')
study_data
      name   birthday blood_type
1     Beth 2019-10-20         B-
2 Ignatius 2020-07-08         A-
3     Paul 2019-08-12         O+
4    Helen 2019-03-17         O-
5 Jennifer 2019-12-17         Z+  <-- not in categories
6  Kennedy 2020-04-27         A+
7    Keith 2019-04-19        AB+
# Correct possible blood types
categories
  blood_type
1         O-
2         O+
3         A-
4         A+
5         B+
6         B-
7        AB+
8        AB-

A note on joins

A left anti join on blood types
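
A left anti join keeps the rows of the study data whose blood type has no match in the categories table. The sketch below is one way to express it in pandas, using `merge` with `indicator=True`; it is not taken from the slides. The two DataFrames are rebuilt by hand from the tables shown earlier (the `birthday` column is omitted for brevity), and the names `merged` and `anti_joined` are assumptions.

```python
import pandas as pd

# Study data rebuilt from the example table (birthday column omitted)
study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
    "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
})

# Correct possible blood types
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]}
)

# Left join with an indicator column recording where each row matched
merged = study_data.merge(categories, on="blood_type", how="left", indicator=True)

# Rows found only on the left side have no valid category
anti_joined = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(anti_joined)
```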

An inner join on blood types
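
An inner join keeps only the rows whose blood type appears in both tables, that is, the consistent rows. A minimal sketch with pandas `merge`, again with the DataFrames rebuilt by hand from the tables shown earlier and with `inner_joined` as an assumed name:

```python
import pandas as pd

# Study data rebuilt from the example table (birthday column omitted)
study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
    "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
})

# Correct possible blood types
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]}
)

# Keep only rows whose blood_type exists in both tables
inner_joined = study_data.merge(categories, on="blood_type", how="inner")
print(inner_joined)
```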

Finding inconsistent categories

# Find blood types that appear in study_data but not in categories
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)
{'Z+'}
# Get and print rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)

study_data[inconsistent_rows]
      name   birthday blood_type
5 Jennifer 2019-12-17         Z+

Dropping inconsistent categories

inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]

# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]
      name   birthday blood_type
1     Beth 2019-10-20         B-
2 Ignatius 2020-07-08         A-
3     Paul 2019-08-12         O+
4    Helen 2019-03-17         O-
...    ...      ...          ...
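
As an alternative to building the set difference first, the consistent rows can also be selected directly by testing membership against the categories column. This is a sketch of a variation, not the method shown in the slides; the DataFrames are rebuilt by hand from the example tables, and `is_valid` is an assumed name.

```python
import pandas as pd

# Study data rebuilt from the example table (birthday column omitted)
study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
    "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
})

# Correct possible blood types
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]}
)

# Boolean mask: True where the blood type is one of the allowed categories
is_valid = study_data["blood_type"].isin(categories["blood_type"])

# Split the data into consistent and inconsistent rows
consistent_data = study_data[is_valid]
inconsistent_data = study_data[~is_valid]
print(consistent_data)
```

Both approaches give the same split here; the set difference has the advantage of also reporting which category values are invalid.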

Let's practice!
