Unsupervised Learning in Python
Benjamin Wilson
Director of Research at lateral.io
178 samples from 3 distinct varieties of red wine: Barolo, Grignolino and Barbera
Features measure chemical composition e.g. alcohol content
Visual properties like "color intensity"
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
labels = model.fit_predict(samples)
df = pd.DataFrame({'labels': labels, 'varieties': varieties}) ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)
varieties Barbera Barolo Grignolino
labels
0 29 13 20
1 0 46 1
2 19 0 50
The wine features have very different variances!
Variance of a feature measures spread of its values
feature variance
alcohol 0.65
malic_acid 1.24
...
od280 0.50
proline 99166.71
The wine features have very different variances!
Variance of a feature measures spread of its values
feature variance
alcohol 0.65
malic_acid 1.24
...
od280 0.50
proline 99166.71
In kmeans: feature variance = feature influence
StandardScaler
transforms each feature to have mean 0 and variance 1
Features are said to be "standardized"
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(samples) StandardScaler(copy=True, with_mean=True, with_std=True)
samples_scaled = scaler.transform(samples)
StandardScaler
and KMeans
have similar methods
Use fit()
/ transform()
with StandardScaler
Use fit()
/ predict()
with KMeans
Need to perform two steps: StandardScaler
, then KMeans
Use sklearn
pipeline to combine multiple steps
Data flows from one step into the next
from sklearn.preprocessing import StandardScaler from sklearn.cluster import KMeans scaler = StandardScaler() kmeans = KMeans(n_clusters=3)
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)
Pipeline(steps=...)
labels = pipeline.predict(samples)
With feature standardization:
varieties Barbera Barolo Grignolino
labels
0 0 59 3
1 48 0 3
2 0 0 65
Without feature standardization was very bad:
varieties Barbera Barolo Grignolino
labels
0 29 13 20
1 0 46 1
2 19 0 50
StandardScaler
is a "preprocessing" step
MaxAbsScaler
and Normalizer
are other examples
Unsupervised Learning in Python