Data preparation for cluster analysis

Cluster Analysis in Python

Shaumik Daityari

Business Analyst

Why do we need to prepare data for clustering?

  • Variables have incomparable units (product dimensions in cm, price in $)
  • Variables with the same units can still have vastly different scales and variances (e.g. expenditures on cereals vs. travel)
  • Raw data may bias the clustering
  • Clusters may depend heavily on a single variable
  • Solution: normalization of individual variables
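To make the points above concrete, here is a minimal sketch with made-up numbers (the `products` array is hypothetical and not from the course data). It shows how a variable measured in dollars can dominate one measured in centimetres, and how scaling each variable separately puts them on the same footing.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Hypothetical observations, one row per product: [length in cm, price in $]
products = np.array([
    [10.0, 1200.0],
    [12.0,  800.0],
    [ 9.0, 1500.0],
    [11.0,  950.0],
])

# Spread of each variable before scaling: price dwarfs length
raw_std = products.std(axis=0)
print(raw_std)

# whiten rescales every column by its own standard deviation
scaled = whiten(products)
scaled_std = scaled.std(axis=0)
print(scaled_std)
```

After scaling, both columns have a standard deviation of 1, so a distance-based clustering method no longer sees price differences as hundreds of times larger than length differences.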

Normalization of data

Normalization: the process of rescaling each variable so that its standard deviation is 1

x_new = x / std_dev(x)

# whiten divides each variable by its standard deviation
from scipy.cluster.vq import whiten

data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]
scaled_data = whiten(data)
print(scaled_data)

# Output (rounded to two decimals):
# [2.73, 0.55, 1.64, 1.64, 1.09, 1.64, 1.64, 4.36, 0.55, 1.09, 1.09, 1.64, 2.73]
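As a quick sanity check, the formula `x_new = x / std_dev(x)` can be applied by hand and compared with `whiten`. This is a sketch that assumes the population standard deviation (NumPy's default, `ddof=0`); the variable name `manual` is introduced here for illustration.

```python
import numpy as np
from scipy.cluster.vq import whiten

data = np.array([5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5], dtype=float)

# Apply the formula directly: divide by the population standard deviation
manual = data / data.std()

# Same rescaling through SciPy
scaled_data = whiten(data)

# The two results should agree, and the scaled data should have std 1
print(np.allclose(manual, scaled_data))
print(scaled_data.std())

# The mean is rescaled but not shifted to zero
print(scaled_data.mean())
```

Note that this rescaling only divides by the standard deviation. It does not subtract the mean, so the scaled values are not centered at zero.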

Illustration: normalization of data

# Import plotting library
from matplotlib import pyplot as plt

# Plot original and scaled data
plt.plot(data, label="original")
plt.plot(scaled_data, label="scaled")
# Show legend and display plot
plt.legend()
plt.show()


Next up: some DIY exercises
