Membership constraints

Cleaning Data in Python

Adel Nehme

Content Developer @DataCamp

Chapter 2 - Text and categorical data problems

Categories and membership constraints

Predefined finite set of categories

Type of data	Example values	Numeric representation
Marriage Status	`unmarried`, `married`	`0`, `1`
Household Income Category	`0-20K`, `20-40K`, ...	`0`, `1`, ...
Loan Status	`default`, `paid`, `no_loan`	`0`, `1`, `2`

Marriage status can only be unmarried _or_ married
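The mapping from category labels to integers in the table above can be sketched with the pandas categorical dtype. This is only an illustration: the sample values and the variable names (`marriage_status`, `status_dtype`) are made up here and do not come from the course data.

```python
import pandas as pd

# Hypothetical sample values, not taken from the course data
marriage_status = pd.Series(["unmarried", "married", "married", "unmarried"])

# Declare the allowed categories explicitly; their order fixes the integer codes
status_dtype = pd.CategoricalDtype(categories=["unmarried", "married"])
as_category = marriage_status.astype(status_dtype)

# Integer codes backing each value: unmarried -> 0, married -> 1
codes = as_category.cat.codes.tolist()
print(codes)  # [0, 1, 1, 0]

# A value outside the declared set becomes missing, with code -1
outside = pd.Series(["married", "single"]).astype(status_dtype)
outside_codes = outside.cat.codes.tolist()
print(outside_codes)  # [1, -1]
```

Note that casting to a fixed set of categories silently turns out-of-set values into missing values, so it is usually worth finding those values first.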

Why could we have these problems?

[Figure: categorical_issues]

How do we treat these problems?

An example

import pandas as pd

# Read study data and print it
study_data = pd.read_csv('study.csv')
study_data

      name   birthday blood_type
1     Beth 2019-10-20         B-
2 Ignatius 2020-07-08         A-
3     Paul 2019-08-12         O+
4    Helen 2019-03-17         O-
5 Jennifer 2019-12-17         Z+  <--
6  Kennedy 2020-04-27         A+
7    Keith 2019-04-19        AB+

# Correct possible blood types
categories

  blood_type
1         O-
2         O+
3         A-
4         A+
5         B+
6         B-
7        AB+
8        AB-
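Since `study.csv` and the source of `categories` are not shown, the two tables can be rebuilt inline to follow along. This is a stand-in that copies the printed rows, assuming the printed output reflects the full data; it is not how the course loads them.

```python
import pandas as pd

# Stand-in for pd.read_csv('study.csv'): same rows as the printed table
study_data = pd.DataFrame(
    {
        "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
        "birthday": ["2019-10-20", "2020-07-08", "2019-08-12", "2019-03-17",
                     "2019-12-17", "2020-04-27", "2019-04-19"],
        "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
    },
    index=range(1, 8),  # the printed table is indexed from 1
)
study_data["birthday"] = pd.to_datetime(study_data["birthday"])

# Stand-in for the `categories` DataFrame of valid blood types
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]},
    index=range(1, 9),
)

print(study_data)
print(categories)
```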

A note on joins

A left anti join on blood types
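A left anti join keeps the rows of the left table that have no match in the right table. One common way to express this in pandas is a left merge with `indicator=True`, filtered to `left_only`. The sketch below uses a shortened inline copy of the example data and is an illustration, not necessarily the approach the course uses.

```python
import pandas as pd

# Inline copy of the example data (names and blood types only)
study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
    "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
})
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]}
)

# Left merge; indicator=True adds a `_merge` column recording where each row matched
merged = study_data.merge(categories, on="blood_type", how="left", indicator=True)

# Keep rows present only in study_data, i.e. blood types missing from categories
anti_joined = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(anti_joined)
```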

An inner join on blood types
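An inner join keeps only the rows whose blood type appears in both tables, which here means the rows with valid categories. A minimal sketch with `DataFrame.merge`, again on a shortened inline copy of the example data:

```python
import pandas as pd

# Inline copy of the example data (names and blood types only)
study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
    "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
})
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]}
)

# Inner merge: rows survive only if their blood_type exists in both tables
inner_joined = study_data.merge(categories, on="blood_type", how="inner")
print(inner_joined)
```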

Finding inconsistent categories

inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)

{'Z+'}

# Get and print rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)

study_data[inconsistent_rows]

      name   birthday blood_type
5 Jennifer 2019-12-17         Z+
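The two steps above (set difference, then `isin`) can be wrapped into a small reusable helper. The function name `find_inconsistent_rows` is hypothetical and introduced only for illustration. It negates `isin` against the allowed values directly, which selects the same rows in this example.

```python
import pandas as pd

def find_inconsistent_rows(data, column, allowed_values):
    """Return the rows of `data` whose `column` value is not in `allowed_values`."""
    allowed = set(allowed_values)
    mask = ~data[column].isin(allowed)
    return data[mask]

# Inline copy of the example data (names and blood types only)
study_data = pd.DataFrame(
    {
        "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
        "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
    },
    index=range(1, 8),
)
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]}
)

bad_rows = find_inconsistent_rows(study_data, "blood_type", categories["blood_type"])
print(bad_rows)
```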

Dropping inconsistent categories

inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]

# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]
consistent_data

      name   birthday blood_type
1     Beth 2019-10-20         B-
2 Ignatius 2020-07-08         A-
3     Paul 2019-08-12         O+
4    Helen 2019-03-17         O-
...    ...      ...          ...
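After dropping, it can be worth confirming that every remaining value satisfies the membership constraint and checking how many rows were lost. A minimal sketch on an inline copy of the example data (the variable `n_dropped` is introduced here for illustration):

```python
import pandas as pd

# Inline copy of the example data (names and blood types only)
study_data = pd.DataFrame(
    {
        "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
        "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
    },
    index=range(1, 8),
)
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]}
)

# Same filtering as above: keep rows whose blood type is a valid category
inconsistent_categories = set(study_data["blood_type"]).difference(categories["blood_type"])
inconsistent_rows = study_data["blood_type"].isin(inconsistent_categories)
consistent_data = study_data[~inconsistent_rows]

# Sanity checks: every remaining value is valid, and count the dropped rows
all_valid = consistent_data["blood_type"].isin(categories["blood_type"]).all()
n_dropped = len(study_data) - len(consistent_data)
print(all_valid, n_dropped)
```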

Let's practice!
