Practical implementation of k-means clustering

Customer Segmentation in Python

Karolis Urbonas

Head of Data Science, Amazon

Key steps

  • Data pre-processing
  • Choosing a number of clusters
  • Running k-means clustering on pre-processed data
  • Analyzing average RFM values of each cluster

Data pre-processing

We've completed the pre-processing steps and have these two objects:

  • datamart_rfm
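  • datamart_normalized

import numpy as np
from sklearn.preprocessing import StandardScaler

# Unskew the data with a log transformation (requires strictly positive values)
datamart_log = np.log(datamart_rfm)

# Fit the scaler, then center and scale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
scaler.fit(datamart_log)
datamart_normalized = scaler.transform(datamart_log)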
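
A quick way to confirm the pre-processing worked is to check the column means and standard deviations after scaling. The sketch below is only an illustration: `toy_rfm` is a hypothetical, randomly generated stand-in for `datamart_rfm`, not the course data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for datamart_rfm: random, strictly positive values
rng = np.random.default_rng(0)
toy_rfm = pd.DataFrame({
    'Recency': rng.integers(1, 365, size=200),
    'Frequency': rng.integers(1, 50, size=200),
    'MonetaryValue': rng.lognormal(mean=5, sigma=1, size=200),
})

# np.log is only defined for positive values, so check before transforming
assert (toy_rfm > 0).all().all()

toy_log = np.log(toy_rfm)
toy_normalized = StandardScaler().fit_transform(toy_log)

# After scaling, each column should have mean close to 0 and standard deviation close to 1
col_means = toy_normalized.mean(axis=0)
col_stds = toy_normalized.std(axis=0)
print(col_means.round(2), col_stds.round(2))
```

Note that `transform` returns a NumPy array, so the column names are lost unless the result is wrapped back into a DataFrame.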

Methods to define the number of clusters

  • Visual methods - elbow criterion
  • Mathematical methods - silhouette coefficient
  • Experimentation and interpretation
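
The first two methods can be sketched in a few lines. This is a minimal illustration, assuming synthetic data from `make_blobs` as a stand-in for `datamart_normalized`; the range of k values and the variable names `sse` and `silhouette` are arbitrary choices, not something taken from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for datamart_normalized: 300 rows, 3 features
X, _ = make_blobs(n_samples=300, centers=4, n_features=3,
                  cluster_std=0.5, random_state=1)

sse = {}         # within-cluster sum of squared errors per k (elbow criterion)
silhouette = {}  # mean silhouette coefficient per k

for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=1)
    labels = kmeans.fit_predict(X)
    sse[k] = kmeans.inertia_                     # stored by scikit-learn after fitting
    silhouette[k] = silhouette_score(X, labels)  # ranges from -1 to 1, higher is better

for k in sse:
    print(k, round(sse[k], 1), round(silhouette[k], 3))
```

For the elbow criterion, plot `sse` against k and look for the point where the decrease flattens out. For the silhouette coefficient, prefer the k with the highest score. Either result is a starting point for experimentation, not a final answer.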

Running k-means

# Import KMeans from sklearn
from sklearn.cluster import KMeans

# Initialize k-means with 2 clusters and a fixed seed for reproducibility
kmeans = KMeans(n_clusters=2, random_state=1)

# Compute k-means clustering on pre-processed data
kmeans.fit(datamart_normalized)

# Extract cluster labels from the labels_ attribute
cluster_labels = kmeans.labels_
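
It can help to inspect what the fitted model holds before moving on. The sketch below assumes a hypothetical random array `toy_normalized` in place of `datamart_normalized`; it only shows the shape of the outputs, not meaningful segments.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for datamart_normalized: 150 rows, 3 scaled features
rng = np.random.default_rng(1)
toy_normalized = rng.normal(size=(150, 3))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1)
kmeans.fit(toy_normalized)

# labels_ holds one integer label per input row, in the same row order
cluster_labels = kmeans.labels_

# cluster_centers_ holds one centroid per cluster, in the scaled feature space
centers = kmeans.cluster_centers_

# Number of rows assigned to each cluster
cluster_sizes = np.bincount(cluster_labels)
print(cluster_sizes, centers.shape)
```

Because the labels follow the row order of the input, they can be attached directly to the original DataFrame as a new column.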

Analyzing average RFM values of each cluster

# Create a cluster label column in the original DataFrame
datamart_rfm_k2 = datamart_rfm.assign(Cluster=cluster_labels)

# Calculate average RFM values and size for each cluster
datamart_rfm_k2.groupby(['Cluster']).agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count'],
}).round(0)
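
To see the shape of the summary table without the course data, the whole flow can be run on a toy dataset. This is a sketch under stated assumptions: `toy_rfm` is hypothetical random data, so the averages it prints carry no business meaning.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for datamart_rfm: 100 customers with positive RFM values
rng = np.random.default_rng(1)
toy_rfm = pd.DataFrame({
    'Recency': rng.integers(1, 365, size=100),
    'Frequency': rng.integers(1, 50, size=100),
    'MonetaryValue': rng.lognormal(mean=5, sigma=1, size=100),
})

# Same pre-processing as above: log transform, then standardize
toy_normalized = StandardScaler().fit_transform(np.log(toy_rfm))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1)
kmeans.fit(toy_normalized)

# Attach labels to the original (untransformed) values so averages stay interpretable
toy_k2 = toy_rfm.assign(Cluster=kmeans.labels_)

summary = toy_k2.groupby('Cluster').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count'],
}).round(0)
print(summary)
```

The result has one row per cluster and two-level column names such as `('MonetaryValue', 'count')`, where the count column gives the size of each cluster.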

Analyzing average RFM values of each cluster

The result of a simple 2-cluster solution:


Let's practice running k-means clustering!
