Membership constraints

Cleaning Data in Python

Adel Nehme

Content Developer @DataCamp

Chapter 2 - Text and categorical data problems

Categories and membership constraints

Predefined finite set of categories

Type of data               Example values            Numeric representation
Marriage Status            unmarried, married        0, 1
Household Income Category  0-20K, 20-40K, ...        0, 1, ...
Loan Status                default, payed, no_loan   0, 1, 2

Marriage status can only be unmarried _or_ married
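
One way to see the numeric representation from the table is a pandas categorical dtype. The sketch below is illustrative only: the sample values, including the deliberately misspelled 'maried', are made up, and the variable names are not from the course. Values outside the declared set become missing and receive the code -1.

```python
import pandas as pd

# Hypothetical sample; "maried" is a deliberate typo outside the allowed set.
marriage_status = pd.Series(["unmarried", "married", "maried", "unmarried"])

# Declare the predefined finite set of categories, in the order of the table.
allowed = pd.CategoricalDtype(categories=["unmarried", "married"])
as_category = marriage_status.astype(allowed)

# Integer code of each value: 0 = unmarried, 1 = married, -1 = not a member.
codes = as_category.cat.codes.tolist()
print(codes)  # [0, 1, -1, 0]
```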

Why could we have these problems?

[Figure: categorical_issues]

How do we treat these problems?

[Figure: categories]

An example

# Read study data and print it
import pandas as pd

study_data = pd.read_csv('study.csv')
study_data
      name   birthday blood_type
1     Beth 2019-10-20         B-
2 Ignatius 2020-07-08         A-
3     Paul 2019-08-12         O+
4    Helen 2019-03-17         O-
5 Jennifer 2019-12-17         Z+  <-- not in the list of valid blood types
6  Kennedy 2020-04-27         A+
7    Keith 2019-04-19        AB+
# Correct possible blood types (categories is a DataFrame of valid values)
categories
  blood_type
1         O-
2         O+
3         A-
4         A+
5         B+
6         B-
7        AB+
8        AB-

A note on joins

A left anti join on blood types
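
A left anti join keeps the rows of the left table whose key has no match in the right table. pandas has no dedicated anti-join option, so the sketch below emulates one with a left merge and the `indicator` flag. The two DataFrames are rebuilt inline from the example slide (the birthday column is omitted for brevity), and the names `merged` and `anti_join` are illustrative.

```python
import pandas as pd

# Study data rebuilt inline from the example slide (birthday column omitted).
study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
    "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
})
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]}
)

# A left merge with indicator=True adds a "_merge" column telling whether
# each row was found in both tables or only in the left one.
merged = study_data.merge(categories, on="blood_type", how="left", indicator=True)

# Keep only the rows that exist in study_data but not in categories.
anti_join = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(anti_join)
```

With this data, the only row returned is Jennifer's, whose blood type is Z+.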

An inner join on blood types
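
An inner join keeps only the rows whose key appears in both tables, which here means the rows with a valid blood type. A minimal sketch using `DataFrame.merge`, again with the slide's data rebuilt inline (birthday column omitted) and an illustrative variable name `inner_join`:

```python
import pandas as pd

# Study data rebuilt inline from the example slide (birthday column omitted).
study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
    "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
})
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]}
)

# Rows whose blood_type has no match in categories are dropped.
inner_join = study_data.merge(categories, on="blood_type", how="inner")
print(inner_join)
```

Six of the seven rows survive; the Z+ row is excluded because it has no match in `categories`.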

Finding inconsistent categories

inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)
{'Z+'}
# Get and print rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)

study_data[inconsistent_rows]
      name   birthday blood_type
5 Jennifer 2019-12-17         Z+

Dropping inconsistent categories

inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]

# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]
consistent_data
      name   birthday blood_type
1     Beth 2019-10-20         B-
2 Ignatius 2020-07-08         A-
3     Paul 2019-08-12         O+
4    Helen 2019-03-17         O-
...    ...      ...          ...
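
The finding and dropping steps can be folded into one small helper. This is only a sketch: `split_by_membership` is a hypothetical name, not part of pandas or the course, and the data is rebuilt inline from the example slide (birthday column omitted). It checks membership directly with `isin`, so missing values count as inconsistent unless the allowed set contains them.

```python
import pandas as pd

def split_by_membership(df, column, allowed_values):
    """Return (consistent, inconsistent) rows of df, split on whether
    df[column] is a member of allowed_values."""
    is_member = df[column].isin(allowed_values)
    return df[is_member], df[~is_member]

# Study data rebuilt inline from the example slide (birthday column omitted).
study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
    "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
})
categories = pd.DataFrame(
    {"blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"]}
)

consistent_data, inconsistent_data = split_by_membership(
    study_data, "blood_type", categories["blood_type"]
)
print(inconsistent_data)
```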

Let's practice!
