Data preparation for segmentation

Machine Learning for Marketing in Python

Karolis Urbonas

Head of Analytics & Science, Amazon

Model assumptions

  • First we'll start with K-means
  • K-means clustering works well when data is 1) ~normally distributed (no skew), and 2) standardized (mean = 0, standard deviation = 1)
  • The second model, NMF (non-negative matrix factorization), can be used on raw non-negative data, especially if the matrix is sparse
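One way to check the "no skew" assumption before choosing a transformation is to look at per-column skewness. The sketch below is illustrative only: `demo` is a hypothetical, synthetic stand-in for the wholesale data (the column names are borrowed from it, the values are not), and the choice of a lognormal distribution is an assumption made to produce a right-skewed example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wholesale data: right-skewed, strictly positive values
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Fresh": rng.lognormal(mean=9.0, sigma=1.0, size=500),
    "Milk": rng.lognormal(mean=8.0, sigma=1.0, size=500),
})

# Sample skewness per column; values far above 0 indicate a long right tail
skew_raw = demo.skew()

# Skewness after a log transformation, for comparison
skew_log = np.log(demo).skew()

print(skew_raw.round(2))
print(skew_log.round(2))
```

Skewness close to 0 suggests a roughly symmetric column; a large positive value suggests that a transformation is worth trying before K-means.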

Unskewing data with log-transformation

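# First option - log transformation
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# np.log requires strictly positive values in every column
wholesale_log = np.log(wholesale)
sns.pairplot(wholesale_log, diag_kind='kde')
plt.show()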
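The log transformation is only defined for strictly positive values, so it can help to check the data first. The sketch below uses a small hypothetical frame, `demo`, that contains a zero. Falling back to `np.log1p` (which computes log(1 + x)) is a swapped-in alternative and not something the slides use; it is shown only as one possible way to handle zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a zero in one column
demo = pd.DataFrame({
    "Fresh": [3.0, 120.0, 4500.0],
    "Frozen": [0.0, 25.0, 900.0],
})

# True only if every value in every column is strictly positive
all_positive = bool((demo > 0).all().all())

# Use the plain log when it is safe; otherwise fall back to log(1 + x),
# which is a different transformation that maps 0 to 0
demo_log = np.log(demo) if all_positive else np.log1p(demo)

print(all_positive)
print(demo_log.round(2))
```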

Explore log-transformed data

Pairplot log-transformed


Unskewing data with Box-Cox transformation

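# Second option - Box-Cox transformation
from scipy import stats

def boxcox_df(x):
    # stats.boxcox returns the transformed values and the fitted lambda;
    # only the transformed values are kept here
    x_boxcox, _ = stats.boxcox(x)
    return x_boxcox

# Apply column by column; Box-Cox requires strictly positive values
wholesale_boxcox = wholesale.apply(boxcox_df, axis=0)
sns.pairplot(wholesale_boxcox, diag_kind='kde')
plt.show()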
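The helper above discards the second value returned by `stats.boxcox`, which is the fitted lambda. Inspecting it can be informative: a lambda near 0 means the Box-Cox result is close to a plain log transformation. The sketch below keeps the lambdas, using a hypothetical synthetic frame `demo` in place of the wholesale data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the wholesale data: strictly positive, right-skewed
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Fresh": rng.lognormal(mean=9.0, sigma=1.0, size=500),
    "Milk": rng.lognormal(mean=8.0, sigma=0.8, size=500),
})

lambdas = {}
transformed = {}
for col in demo.columns:
    # With no lambda supplied, boxcox estimates it by maximum likelihood
    values, lmbda = stats.boxcox(demo[col].to_numpy())
    transformed[col] = values
    lambdas[col] = lmbda

demo_boxcox = pd.DataFrame(transformed, index=demo.index)

print(pd.Series(lambdas).round(2))      # fitted lambda per column
print(demo_boxcox.skew().round(2))      # skewness after the transformation
```

Because the synthetic columns are lognormal, the fitted lambdas should land near 0 here; real data will generally give a different lambda for each column.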

Explore Box-Cox transformed data

Pairplot Box-Cox


Scale the data

  • Subtract column average from each column value
  • Divide each column value by column standard deviation
  • Will use the StandardScaler class from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(wholesale_boxcox)
wholesale_scaled = scaler.transform(wholesale_boxcox)
wholesale_scaled_df = pd.DataFrame(data=wholesale_scaled,
                                   index=wholesale_boxcox.index,
                                   columns=wholesale_boxcox.columns)
wholesale_scaled_df.agg(['mean','std']).round()
      Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicassen
mean   -0.0   0.0      0.0     0.0              -0.0         0.0
std     1.0   1.0      1.0     1.0               1.0         1.0
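The two scaling steps listed above (subtract the column mean, divide by the column standard deviation) can be written out by hand and compared with `StandardScaler`. The sketch below uses a small hypothetical frame, `demo`; the one detail to watch is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample version (`ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data
demo = pd.DataFrame({
    "Fresh": [1.0, 2.0, 3.0, 6.0],
    "Milk": [10.0, 20.0, 40.0, 50.0],
})

# Manual standardization: subtract the column mean, divide by the column
# standard deviation (population version, ddof=0, to match StandardScaler)
manual = (demo - demo.mean()) / demo.std(ddof=0)

# The same operation with scikit-learn
scaled = StandardScaler().fit_transform(demo)

# Check that both approaches agree up to floating-point error
print(np.allclose(manual.to_numpy(), scaled))
```

This difference in `ddof` is why the `std` row computed with pandas on scaled data can sit slightly above 1 before rounding, especially on small samples.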

Let's practice!
