Dealing with Categorical Variables

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Encoding categorical features

  • One-hot encoding
  • Dummy encoding

One-hot encoding

pd.get_dummies(df, columns=['Country'], 
               prefix='C')
   C_France  C_India  C_UK  C_USA
0         0        1     0      0
1         0        0     0      1
2         0        0     1      0
3         0        0     1      0
4         1        0     0      0
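
The call above assumes a DataFrame `df` with a `Country` column, which the slides do not define. Below is a minimal self-contained sketch using a made-up five-row frame chosen to match the output shown. `dtype=int` is passed because recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Hypothetical data, chosen to reproduce the table above
df = pd.DataFrame({'Country': ['India', 'USA', 'UK', 'UK', 'France']})

# One column per category; dtype=int gives 0/1 rather than True/False
one_hot = pd.get_dummies(df, columns=['Country'], prefix='C', dtype=int)
print(one_hot)
```

Each row contains exactly one 1, in the column for that row's country.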

Dummy encoding

pd.get_dummies(df, columns=['Country'],
               drop_first=True, prefix='C')
   C_India  C_UK  C_USA
0        1     0      0
1        0     0      1
2        0     1      0
3        0     1      0
4        0     0      0

One-hot vs. dummies

  • One-hot encoding: Explainable features
  • Dummy encoding: Necessary information without duplication
Original column:

Index  Sex
0      Male
1      Female
2      Male

One-hot encoding:

Index  Male  Female
0      1     0
1      0     1
2      1     0

Dummy encoding:

Index  Male
0      1
1      0
2      1
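
One way to see why dummy encoding loses no information is to check that the dropped column can be rebuilt from the remaining ones. The sketch below uses a made-up `Country` frame (an assumption, not data from the slides) and compares the two encodings.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'Country': ['India', 'USA', 'UK', 'UK', 'France']})

one_hot = pd.get_dummies(df, columns=['Country'], prefix='C', dtype=int)
dummy = pd.get_dummies(df, columns=['Country'], prefix='C',
                       drop_first=True, dtype=int)

# Every one-hot row sums to 1, so any single column is implied by the others
print(one_hot.sum(axis=1).tolist())

# The dropped level (France, first in sorted order) is the all-zeros row
recovered = 1 - dummy.sum(axis=1)
print((recovered == one_hot['C_France']).all())
```

The redundancy in the one-hot columns is what dummy encoding removes. The cost is that the base category no longer has a column of its own to inspect.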

Limiting your columns

counts = df['Country'].value_counts()
print(counts)
USA       8
UK        6
India     2
France    1
Name: Country, dtype: int64

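# Flag rows whose country appears fewer than 5 times
mask = df['Country'].isin(counts[counts < 5].index)

# Assign through .loc to avoid chained-assignment problems
df.loc[mask, 'Country'] = 'Other'
print(df['Country'].value_counts())
USA      8
UK       6
Other    3
Name: Country, dtype: int64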
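
The masking steps above can be wrapped in a small reusable helper. This is only a sketch: the function name `group_rare`, its parameters, and the sample data are all hypothetical and do not come from the course.

```python
import pandas as pd

def group_rare(series, min_count, other_label='Other'):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Series.where keeps values where the condition holds, replaces the rest
    return series.where(~series.isin(rare), other_label)

# Hypothetical data matching the counts shown above
countries = pd.Series(['USA'] * 8 + ['UK'] * 6 + ['India'] * 2 + ['France'])
grouped = group_rare(countries, min_count=5)
print(grouped.value_counts())
```

The helper returns a new Series rather than modifying its input, so the original column stays available for comparison.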

Now you deal with categorical variables
