Data preparation for segmentation

Machine Learning for Marketing in Python

Karolis Urbonas

Head of Analytics & Science, Amazon

Model assumptions

First we'll start with K-means
K-means clustering works best when the data is 1) approximately normally distributed (no skew) and 2) standardized (mean = 0, standard deviation = 1)
The second model, NMF (non-negative matrix factorization), can be used on the raw data as long as it is non-negative, and is especially suitable when the matrix is sparse
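
Before choosing a transformation, it can help to check how far the data is from these assumptions. The sketch below is only an illustration: `demo` is a hypothetical stand-in for the wholesale data, generated from a log-normal distribution so that it is right-skewed, and the column names are borrowed from the dataset for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wholesale data: two right-skewed spend columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Fresh": rng.lognormal(mean=8.0, sigma=1.0, size=500),
    "Milk": rng.lognormal(mean=7.0, sigma=1.2, size=500),
})

# Skewness near 0 suggests a roughly symmetric distribution;
# large positive values indicate a long right tail.
skewness = demo.skew()

# Mean and standard deviation show whether the columns are already standardized.
summary = demo.agg(["mean", "std"])

print(skewness)
print(summary)
```

Columns with a large positive skew are candidates for the unskewing step, and columns whose mean and standard deviation are far from 0 and 1 still need scaling.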

Unskewing data with log-transformation

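# First option: log transformation (requires strictly positive values)
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

wholesale_log = np.log(wholesale)
# Pairwise scatter plots, with density (KDE) curves on the diagonal
sns.pairplot(wholesale_log, diag_kind='kde')
plt.show()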
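
To see numerically what the log transformation does to skew, here is a minimal sketch on synthetic data. The `spend` DataFrame is a hypothetical example, not the wholesale dataset; it is drawn from a log-normal distribution, so its logarithm should be close to normal. If a real dataset contained zeros, `np.log` would return `-inf`, and `np.log1p` is one possible alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed spend column (log-normal, strictly positive).
rng = np.random.default_rng(1)
spend = pd.DataFrame({"Fresh": rng.lognormal(mean=8.0, sigma=1.0, size=1000)})

# np.log works element-wise and returns a DataFrame with the same labels.
spend_log = np.log(spend)

skew_before = spend["Fresh"].skew()
skew_after = spend_log["Fresh"].skew()

print(f"skew before log: {skew_before:.2f}")
print(f"skew after log:  {skew_after:.2f}")
```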

Explore log-transformed data

[Figure: pairplot of the log-transformed data]

Unskewing data with Box-Cox transformation

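# Second option: Box-Cox transformation (also requires strictly positive values)
from scipy import stats

def boxcox_df(x):
    # stats.boxcox returns the transformed values and the fitted lambda;
    # only the transformed values are needed here
    x_boxcox, _ = stats.boxcox(x)
    return x_boxcox

# Transform each column independently
wholesale_boxcox = wholesale.apply(boxcox_df, axis=0)

sns.pairplot(wholesale_boxcox, diag_kind='kde')
plt.show()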
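
The Box-Cox transformation is a family of power transformations indexed by a parameter lambda: for lambda = 0 it is the natural log, and otherwise it is (x**lambda - 1) / lambda. When no lambda is passed, `stats.boxcox` estimates it from the data. The sketch below checks these relationships on a hypothetical synthetic array `x`, not on the wholesale data.

```python
import numpy as np
from scipy import stats

# Hypothetical strictly positive, right-skewed sample.
rng = np.random.default_rng(2)
x = rng.lognormal(mean=5.0, sigma=0.8, size=1000)

# Without lmbda, boxcox fits lambda and returns (transformed values, lambda).
x_bc, fitted_lambda = stats.boxcox(x)

# With a fixed lmbda, only the transformed array is returned.
x_bc_zero = stats.boxcox(x, lmbda=0)    # lambda = 0 is the log transformation
x_bc_half = stats.boxcox(x, lmbda=0.5)  # general case: (x**lambda - 1) / lambda

log_matches = np.allclose(x_bc_zero, np.log(x))
power_matches = np.allclose(x_bc_half, (x**0.5 - 1) / 0.5)

print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skew before: {stats.skew(x):.2f}, after: {stats.skew(x_bc):.2f}")
```

Because the log is the special case lambda = 0, the two options in this section are closely related; Box-Cox simply lets the data pick the power.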

Explore Box-Cox transformed data

[Figure: pairplot of the Box-Cox transformed data]

Scale the data

Subtract column average from each column value
Divide each column value by column standard deviation
We'll use the StandardScaler class from sklearn.preprocessing

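import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Learn each column's mean and standard deviation, then apply the scaling
scaler.fit(wholesale_boxcox)
wholesale_scaled = scaler.transform(wholesale_boxcox)
# transform() returns a NumPy array by default, so rebuild the DataFrame with the original labels
wholesale_scaled_df = pd.DataFrame(data=wholesale_scaled,
                                   index=wholesale_boxcox.index,
                                   columns=wholesale_boxcox.columns)
# Check the result: means should be ~0 and standard deviations ~1
wholesale_scaled_df.agg(['mean','std']).round()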

      Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicassen
mean   -0.0   0.0      0.0     0.0              -0.0         0.0
std     1.0   1.0      1.0     1.0               1.0         1.0
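
The two scaling steps listed earlier (subtract the column mean, divide by the column standard deviation) can also be written by hand, which makes it clear what StandardScaler computes. This is a minimal sketch on a hypothetical two-column DataFrame `demo`. One detail worth noting: StandardScaler uses the population standard deviation (ddof=0), while pandas `.std()` defaults to the sample standard deviation (ddof=1), so the `std` row reported by pandas is very slightly above 1 before rounding.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two columns with different means and spreads.
rng = np.random.default_rng(3)
demo = pd.DataFrame(
    rng.normal(loc=[10.0, 50.0], scale=[2.0, 5.0], size=(200, 2)),
    columns=["a", "b"],
)

# Manual standardization: subtract the column mean, divide by the column std.
manual = (demo - demo.mean()) / demo.std(ddof=0)

# The same operation with scikit-learn.
scaled = StandardScaler().fit_transform(demo)

matches = np.allclose(manual.to_numpy(), scaled)
print("manual result matches StandardScaler:", matches)
print(manual.agg(["mean", "std"]).round(3))
```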

Let's practice!
