Basics of k-means clustering

Cluster Analysis in Python

Shaumik Daityari

Business Analyst

Why k-means clustering?

  • A critical drawback of hierarchical clustering: runtime
  • k-means runs significantly faster on large datasets

Step 1: Generate cluster centers

kmeans(obs, k_or_guess, iter, thresh, check_finite)
  • obs: standardized observations
  • k_or_guess: number of clusters
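  • iter: number of times to run k-means; the run with the lowest distortion is returned (default: 20)
  • thresh: convergence threshold; a run stops once the change in distortion is at or below this value (default: 1e-05)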
  • check_finite: whether to check if observations contain only finite numbers (default: True)

Returns two objects: cluster centers, distortion
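
As a rough illustration of Step 1, the sketch below calls kmeans on synthetic data. The two blobs, the variable names and the use of whiten for standardization are assumptions made for this example, not part of the slides.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

# Hypothetical data: two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2)),
])

# Standardize each feature to unit variance before clustering
obs = whiten(points)

# Ask for 2 clusters; kmeans returns the centers and one distortion value
cluster_centers, distortion = kmeans(obs, 2)

print(cluster_centers.shape)  # one row per cluster, one column per feature
print(distortion)
```

Because the starting centers are chosen at random, the exact center coordinates can differ from run to run.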

How is distortion calculated?
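
In SciPy, the distortion reported by kmeans is the mean (non-squared) Euclidean distance between each observation and its nearest cluster center. Many textbooks instead define distortion as the sum of squared distances. The sketch below computes both by hand on a tiny made-up dataset; the observations and centers are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical observations and cluster centers
obs = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])

# Distance from every observation to every center, shape (4, 2)
dists = np.linalg.norm(obs[:, None, :] - centers[None, :, :], axis=2)

# Distance from each observation to its nearest center
nearest = dists.min(axis=1)

# SciPy-style distortion: mean of the nearest distances
mean_distance = nearest.mean()

# Textbook-style distortion: sum of squared nearest distances
sum_squared = (nearest ** 2).sum()

print(mean_distance)  # 1.0
print(sum_squared)    # 4.0
```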

Step 2: Generate cluster labels

vq(obs, code_book, check_finite=True)
  • obs: standardized observations
  • code_book: cluster centers
  • check_finite: whether to check if observations contain only finite numbers (default: True)

Returns two objects: a list of cluster labels, a list of distortions
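
A minimal sketch of Step 2 on made-up values: the observations and the code book below are assumptions picked so the labels and distances can be checked by hand.

```python
import numpy as np
from scipy.cluster.vq import vq

# Hypothetical observations and cluster centers (code book)
obs = np.array([[0.0, 0.0], [0.0, 3.0], [10.0, 0.0]])
code_book = np.array([[0.0, 0.0], [10.0, 4.0]])

# Assign each observation to its nearest center
labels, distortions = vq(obs, code_book)

print(labels)       # [0 0 1]
print(distortions)  # [0. 3. 4.]
```

Each label is the row index of the nearest center in code_book, and each distortion is the Euclidean distance to that center.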

A note on distortions

  • kmeans returns a single distortion value for the whole dataset
  • vq returns a list of distortions, one per observation
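
The two outputs are related: averaging the per-observation distortions from vq should give a value very close to the single distortion from kmeans. The sketch below checks this on synthetic data; the blobs and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Hypothetical data: two well-separated 2-D blobs of 40 points each
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(40, 2)),
    rng.normal(loc=[8.0, 8.0], scale=0.5, size=(40, 2)),
])
obs = whiten(points)

# kmeans: one distortion value for the whole dataset
cluster_centers, kmeans_distortion = kmeans(obs, 2)

# vq: one distortion value per observation
labels, vq_distortions = vq(obs, cluster_centers)

# The mean of the per-observation values should be close to the kmeans value
mean_vq_distortion = vq_distortions.mean()
print(kmeans_distortion, mean_vq_distortion)
```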

Running k-means

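# Import kmeans and vq functions, plus the plotting libraries
from scipy.cluster.vq import kmeans, vq
import matplotlib.pyplot as plt
import seaborn as sns
# df is assumed to be a pandas DataFrame with standardized columns scaled_x and scaled_y
# Generate cluster centers and labels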
cluster_centers, _ = kmeans(df[['scaled_x', 'scaled_y']], 3)
df['cluster_labels'], _ = vq(df[['scaled_x', 'scaled_y']], cluster_centers)
# Plot clusters
sns.scatterplot(x='scaled_x', y='scaled_y', hue='cluster_labels', data=df)
plt.show()

Next up: exercises!
