Removing highly correlated features

Dimensionality Reduction in Python

Jeroen Boeye

Head of Machine Learning, Faktion

Highly correlated data

highly correlated pairplot

Highly correlated features

highly correlated matrix

Removing highly correlated features

# Create positive correlation matrix
corr_df = chest_df.corr().abs()

# Create and apply mask
mask = np.triu(np.ones_like(corr_df, dtype=bool))
tri_df = corr_df.mask(mask)
tri_df
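
To see what the mask does, here is a minimal sketch on a small made-up DataFrame (`toy_df` and its columns are hypothetical, not the course's `chest_df`). The upper triangle and the diagonal become NaN, so each pair of features appears only once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' is an exact multiple of 'a', 'c' is only weakly related
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Absolute correlations, so strong negative correlations count too
corr_df = toy_df.corr().abs()

# True on the diagonal and above it; those cells are replaced by NaN
mask = np.triu(np.ones_like(corr_df, dtype=bool))
tri_df = corr_df.mask(mask)

print(tri_df)
```

Only the cells below the diagonal keep their values, so the a/b correlation shows up once (row `b`, column `a`) instead of twice.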

Removing highly correlated features

# Find columns that meet threshold 
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.95)]

print(to_drop)
['Suprasternale height', 'Cervicale height']

# Drop those columns
reduced_df = chest_df.drop(to_drop, axis=1)
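
The steps above can be wrapped in one helper. This is a sketch only: the function name `drop_highly_correlated` and the toy data are hypothetical, and it assumes every column is numeric.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.95):
    """Drop one feature from each pair whose absolute correlation exceeds threshold."""
    corr_df = df.corr().abs()
    # Hide the diagonal and upper triangle so each pair is checked once
    mask = np.triu(np.ones_like(corr_df, dtype=bool))
    tri_df = corr_df.mask(mask)
    # NaN > threshold evaluates to False, so masked cells never trigger a drop
    to_drop = [c for c in tri_df.columns if (tri_df[c] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical toy data: 'a' and 'b' carry the same information
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

reduced_df, dropped = drop_highly_correlated(toy_df)
print(dropped)
print(list(reduced_df.columns))
```

Because only the lower triangle is kept, the column that comes first in the DataFrame is the one dropped from each correlated pair. Reorder the columns first if you would rather keep a specific feature.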

Feature selection

Feature selection schema

Feature extraction

Feature extraction schema
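
A rough sketch of the difference between the two approaches, on made-up data. The column names are hypothetical, and the principal-component projection via NumPy's SVD is just one possible extraction technique, chosen here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)

# Hypothetical data: two columns hold the same height in different units
toy_df = pd.DataFrame({
    "height_cm": height,
    "height_in": height / 2.54,
    "weight_kg": rng.normal(70, 8, 200),
})

# Feature selection: keep a subset of the original columns unchanged
selected_df = toy_df.drop(columns=["height_in"])

# Feature extraction: build new features as combinations of the originals
centered = (toy_df - toy_df.mean()).to_numpy()
_, _, vt = np.linalg.svd(centered, full_matrices=False)
extracted_df = pd.DataFrame(centered @ vt[:2].T, columns=["pc_1", "pc_2"])

print(list(selected_df.columns))
print(list(extracted_df.columns))
```

Both results have two columns, but the selected features are still original measurements, while the extracted ones are new and harder to interpret.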

Correlation caveats - Anscombe's quartet

Anscombe's quartet
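
A small check of the caveat, using the published Anscombe values typed in by hand (variable names are illustrative). All four datasets give nearly the same Pearson correlation even though their scatterplots look nothing alike.

```python
import numpy as np

# Anscombe's quartet: datasets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Pearson correlation coefficient for each dataset
correlations = {name: np.corrcoef(x, y)[0, 1] for name, (x, y) in quartet.items()}

for name, r in correlations.items():
    print(f"Dataset {name}: r = {r:.3f}")
```

The takeaway: a correlation coefficient summarizes only the linear relationship, so plot the data before dropping features based on it.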

Correlation caveats - causation

sns.scatterplot(x="N firetrucks sent to fire",
                y="N wounded by fire", data=fire_df)

firetrucks vs. wounded
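
A simulated illustration of how such a correlation can arise without one variable causing the other. The data here is synthetic, and the `fire size` column and all coefficients are assumptions made for the sketch: a bigger fire leads to both more firetrucks and more wounded.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_fires = 500

# Hypothetical confounder: the size of each fire
fire_size = rng.uniform(1, 10, n_fires)

# Both columns depend on fire size, not on each other
fire_df = pd.DataFrame({
    "fire size": fire_size,
    "N firetrucks sent to fire": np.round(fire_size + rng.normal(0, 1, n_fires)).clip(min=0),
    "N wounded by fire": np.round(2 * fire_size + rng.normal(0, 2, n_fires)).clip(min=0),
})

trucks = "N firetrucks sent to fire"
wounded = "N wounded by fire"

# Correlation across all fires
overall_r = fire_df[trucks].corr(fire_df[wounded])

# Correlation among fires of similar size
similar = fire_df[fire_df["fire size"].between(5, 6)]
band_r = similar[trucks].corr(similar[wounded])

print(f"All fires:          r = {overall_r:.2f}")
print(f"Similar-sized only: r = {band_r:.2f}")
```

Across all fires the two counts are strongly correlated, yet among fires of similar size the relationship largely disappears. Sending fewer trucks would not reduce the number of wounded.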

Let's practice!
