Transforming features for better clusterings

Unsupervised Learning in Python

Benjamin Wilson

Director of Research at lateral.io

Piedmont wines dataset

  • 178 samples from 3 distinct varieties of red wine: Barolo, Grignolino and Barbera

  • Features measure chemical composition, e.g. alcohol content

  • Some features capture visual properties, like "color intensity"

1 Source: https://archive.ics.uci.edu/ml/datasets/Wine

Clustering the wines

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
labels = model.fit_predict(samples)

Clusters vs. varieties

import pandas as pd

df = pd.DataFrame({'labels': labels,
                   'varieties': varieties})
ct = pd.crosstab(df['labels'], df['varieties'])

print(ct)
varieties  Barbera  Barolo  Grignolino
labels                                
0               29      13          20
1                0      46           1
2               19       0          50

Feature variances

  • The wine features have very different variances!

  • The variance of a feature measures the spread of its values

feature     variance
alcohol         0.65
malic_acid      1.24
...
od280           0.50
proline     99166.71
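The per-feature variances above can be computed directly from the samples array; a minimal sketch using a toy two-feature array (hypothetical values standing in for the wine data):

```python
import numpy as np

# Toy stand-in for the wine samples: two features with very different spreads
samples = np.array([[13.2, 1050.0],
                    [12.9,  735.0],
                    [14.1, 1450.0],
                    [13.7,  990.0]])

# Variance of each feature (column); ddof=1 gives the sample variance,
# matching the default used by pandas' DataFrame.var()
variances = samples.var(axis=0, ddof=1)
print(variances)
```

A large-scale feature like proline dominates such a comparison, which is exactly the problem the variance table illustrates.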

Scatter plot of the od280 variable vs. the malic_acid variable

Scatter plot of the od280 variable vs. observation number


StandardScaler

  • In k-means: feature variance = feature influence

  • StandardScaler transforms each feature to have mean 0 and variance 1

  • Features are said to be "standardized"

Standardized od280 vs standardized proline scatter plot


sklearn StandardScaler

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(samples)
StandardScaler(copy=True, with_mean=True, with_std=True)
samples_scaled = scaler.transform(samples)
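A quick sanity check of the transform; a sketch using synthetic data (an assumption, since the wine samples aren't loaded here):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for `samples`: two features on very different scales
rng = np.random.default_rng(0)
samples = rng.normal(loc=[13.0, 750.0], scale=[0.8, 300.0], size=(178, 2))

scaler = StandardScaler()
scaler.fit(samples)
samples_scaled = scaler.transform(samples)

# Each feature now has mean ~0 and variance ~1
print(samples_scaled.mean(axis=0))
print(samples_scaled.var(axis=0))
```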

Similar methods

  • StandardScaler and KMeans have similar methods

  • Use fit() / transform() with StandardScaler

  • Use fit() / predict() with KMeans


StandardScaler, then KMeans

  • Need to perform two steps: StandardScaler, then KMeans

  • Use sklearn pipeline to combine multiple steps

  • Data flows from one step into the next
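The two steps can also be done by hand, which is what the pipeline automates; a sketch using synthetic clusters from `make_blobs` (an assumption, since the wine samples aren't loaded here):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine samples: 60 points in 3 clusters
samples, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Step 1: standardize the features
scaler = StandardScaler()
samples_scaled = scaler.fit_transform(samples)

# Step 2: cluster the standardized samples
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(samples_scaled)
print(labels[:10])
```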


Pipelines combine multiple steps

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)

from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)
Pipeline(steps=...)
labels = pipeline.predict(samples)

Feature standardization improves clustering

With feature standardization:

varieties  Barbera  Barolo  Grignolino
labels                                
0                0      59           3
1               48       0           3
2                0       0          65

Without feature standardization, the clustering was much worse:

varieties  Barbera  Barolo  Grignolino
labels                                
0               29      13          20
1                0      46           1
2               19       0          50

sklearn preprocessing steps

  • StandardScaler is a "preprocessing" step

  • MaxAbsScaler and Normalizer are other examples
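As a quick contrast (a sketch, not from the slides): StandardScaler standardizes each feature (column), while Normalizer rescales each sample (row) to unit norm:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Small toy array: 3 samples, 2 features (hypothetical values)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# StandardScaler: each column gets mean 0 and variance 1
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))  # each entry ~0

# Normalizer: each row rescaled to unit Euclidean norm
X_norm = Normalizer().fit_transform(X)
print(np.linalg.norm(X_norm, axis=1))  # each entry ~1
```

Which preprocessing step is appropriate depends on whether the scales of the features or the magnitudes of the samples carry the unwanted variation.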


Let's practice!

