Data preparation for cluster analysis

Cluster Analysis in Python

Shaumik Daityari

Business Analyst

Why do we need to prepare data for clustering?

Variables have incomparable units (e.g., product dimensions in cm, price in $)
Variables with the same units can have vastly different scales and variances (e.g., expenditures on cereals vs. travel)
Data in raw form may lead to bias in clustering
Clusters may be heavily dependent on one variable
Solution: normalization of individual variables
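
To make the bias concrete, here is a minimal sketch assuming three made-up products described by length (cm) and price ($). The values and the `distance` helper are hypothetical and only for illustration. In raw form, the Euclidean distance is driven almost entirely by price; after dividing each column by its standard deviation, both variables contribute on a similar scale.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Hypothetical products: [length in cm, price in $] (made-up values)
products = np.array([
    [10.0, 1000.0],
    [50.0, 1010.0],   # differs from product 0 mainly in length
    [12.0, 2000.0],   # differs from product 0 mainly in price
])

def distance(a, b):
    """Euclidean distance between two observations (illustrative helper)."""
    return float(np.linalg.norm(a - b))

# Raw distances: the price column dominates because of its larger scale
raw_01 = distance(products[0], products[1])
raw_02 = distance(products[0], products[2])

# whiten rescales each column to a standard deviation of 1
scaled = whiten(products)
scaled_01 = distance(scaled[0], scaled[1])
scaled_02 = distance(scaled[0], scaled[2])

print("raw:   ", raw_01, raw_02)
print("scaled:", scaled_01, scaled_02)
```

With the raw values, the price difference swamps the length difference; after scaling, the two distances are of the same order, so neither variable decides the clustering on its own.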

Normalization of data

Normalization: process of rescaling each variable so that it has a standard deviation of 1

x_new = x / std_dev(x)

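# whiten divides each variable by its standard deviation
from scipy.cluster.vq import whiten

data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]

# Rescale the data to a standard deviation of 1
scaled_data = whiten(data)
print(scaled_data)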

[2.73, 0.55, 1.64, 1.64, 1.09, 1.64, 1.64, 4.36, 0.55, 1.09, 1.09, 1.64, 2.73]
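
As a quick check, the formula x_new = x / std_dev(x) can be applied by hand with NumPy and compared against `whiten`. This is a minimal sketch that assumes the population standard deviation (NumPy's default, `ddof=0`); the variable names `manual` and `library` are illustrative.

```python
import numpy as np
from scipy.cluster.vq import whiten

data = np.array([5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5], dtype=float)

# Apply the formula directly: divide by the standard deviation (ddof=0)
manual = data / data.std()

# Same rescaling through SciPy
library = whiten(data)

same = np.allclose(manual, library)   # do both approaches agree?
scaled_std = manual.std()             # standard deviation after scaling
scaled_mean = manual.mean()           # the mean is rescaled, not removed

print(same, scaled_std, scaled_mean)
```

Note that this rescaling only divides by the standard deviation; it does not subtract the mean, so the scaled data is not centered at zero.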

Illustration: normalization of data

# Import plotting library
from matplotlib import pyplot as plt

# Plot original and scaled data
plt.plot(data, 
         label="original")
plt.plot(scaled_data, 
         label="scaled")

# Show legend and display plot
plt.legend()
plt.show()

Next up: some DIY exercises
