Centering and scaling variables

Customer Segmentation in Python

Karolis Urbonas

Head of Data Science, Amazon

Identifying an issue

  • Analyze key statistics of the dataset
  • Compare mean and standard deviation
datamart_rfm.describe()
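
To compare the two statistics side by side, the relevant rows can be pulled out of the describe() output. This is a minimal sketch: the small DataFrame below is a hypothetical stand-in for datamart_rfm, and its column names and values are assumptions, not the course data.

```python
import pandas as pd

# Hypothetical stand-in for datamart_rfm; column names and values are assumptions.
datamart_rfm = pd.DataFrame({
    "Recency": [10, 45, 200, 3, 90],
    "Frequency": [12, 3, 1, 25, 4],
    "MonetaryValue": [540.0, 80.5, 15.0, 1320.0, 95.0],
})

# Keep only the mean and std rows so the columns are easy to compare.
summary = datamart_rfm.describe().loc[["mean", "std"]].round(2)
print(summary)
```

If the values in each row differ a lot from column to column, the variables are on different scales.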

Centering variables with different means

  • K-means works best when all variables have the same mean
  • Centering is done by subtracting each variable's mean from every observation of that variable
datamart_centered = datamart_rfm - datamart_rfm.mean()
datamart_centered.describe().round(2)
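
A quick check that centering behaves as described: the column means should become zero while the standard deviations stay unchanged. This is a sketch on a hypothetical stand-in for datamart_rfm (column names and values are assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for datamart_rfm; column names and values are assumptions.
datamart_rfm = pd.DataFrame({
    "Recency": [10, 45, 200, 3, 90],
    "Frequency": [12, 3, 1, 25, 4],
    "MonetaryValue": [540.0, 80.5, 15.0, 1320.0, 95.0],
})

# Subtract each column's mean from every value in that column.
datamart_centered = datamart_rfm - datamart_rfm.mean()

# Means are zero up to floating-point error; the spread is untouched.
means_are_zero = np.allclose(datamart_centered.mean(), 0)
std_unchanged = np.allclose(datamart_centered.std(), datamart_rfm.std())
print(means_are_zero, std_unchanged)
```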

Scaling variables with different variance

  • K-means works better when all variables have the same variance (standard deviation)
  • Scaling is done by dividing each variable by its own standard deviation
datamart_scaled = datamart_rfm / datamart_rfm.std()
datamart_scaled.describe().round(2)
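
A quick check that scaling behaves as described: every column should end up with a standard deviation of 1, while the means are rescaled but not zeroed. This is a sketch on a hypothetical stand-in for datamart_rfm (column names and values are assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for datamart_rfm; column names and values are assumptions.
datamart_rfm = pd.DataFrame({
    "Recency": [10, 45, 200, 3, 90],
    "Frequency": [12, 3, 1, 25, 4],
    "MonetaryValue": [540.0, 80.5, 15.0, 1320.0, 95.0],
})

# Divide each column by its own standard deviation.
datamart_scaled = datamart_rfm / datamart_rfm.std()

# Every column now has unit standard deviation; the means are divided
# by the same factor, so they are generally still non-zero.
std_is_one = np.allclose(datamart_scaled.std(), 1)
means_rescaled = np.allclose(
    datamart_scaled.mean(), datamart_rfm.mean() / datamart_rfm.std()
)
print(std_is_one, means_rescaled)
```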

Combining centering and scaling

  • Subtract the mean and divide by the standard deviation manually
  • Or use a scaler from the scikit-learn library (returns a numpy.ndarray object)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(datamart_rfm)
datamart_normalized = scaler.transform(datamart_rfm)

print('mean: ', datamart_normalized.mean(axis=0).round(2))
print('std: ', datamart_normalized.std(axis=0).round(2))
mean:  [-0. -0.  0.]
std:  [1. 1. 1.]
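
The manual route and the scikit-learn route can be compared directly. One detail matters here: pandas' std() defaults to the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0), so the manual version below passes ddof=0 to match. This is a sketch on a hypothetical stand-in for datamart_rfm (column names and values are assumptions); it also wraps the returned ndarray back into a DataFrame to keep the index and column labels.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for datamart_rfm; column names and values are assumptions.
datamart_rfm = pd.DataFrame({
    "Recency": [10, 45, 200, 3, 90],
    "Frequency": [12, 3, 1, 25, 4],
    "MonetaryValue": [540.0, 80.5, 15.0, 1320.0, 95.0],
})

# Manual version: ddof=0 matches the population std used by StandardScaler.
manual = (datamart_rfm - datamart_rfm.mean()) / datamart_rfm.std(ddof=0)

# scikit-learn version: fit_transform combines fit() and transform().
scaler = StandardScaler()
normalized_array = scaler.fit_transform(datamart_rfm)

# Wrap the ndarray back into a DataFrame to keep the index and column names.
datamart_normalized = pd.DataFrame(
    normalized_array, index=datamart_rfm.index, columns=datamart_rfm.columns
)

same_result = np.allclose(manual.to_numpy(), datamart_normalized.to_numpy())
print(same_result)
```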

Test different approaches by yourself!
