Predicting customer transactions

Machine Learning for Marketing in Python

Karolis Urbonas

Head of Analytics & Science, Amazon

Modeling approach

Linear regression to predict next month's transactions.
Same modeling steps as with logistic regression.

Modeling steps

Split data into training and testing sets
Initialize the model
Fit the model on the training data
Predict values on the testing data
Measure model performance on testing data
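
The five steps above can be sketched end to end as follows. This is a minimal illustration, not the course dataset: the synthetic features, the coefficients used to generate the target, the 25% test size, and the random seeds are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the course data):
# 200 customers, 3 features, target is a noisy linear combination.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
Y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# 1. Split data into training and testing sets
train_X, test_X, train_Y, test_Y = train_test_split(
    X, Y, test_size=0.25, random_state=99)

# 2. Initialize the model
linreg = LinearRegression()

# 3. Fit the model on the training data
linreg.fit(train_X, train_Y)

# 4. Predict values on the testing data
test_pred_Y = linreg.predict(test_X)

# 5. Measure model performance on testing data
rmse_test = np.sqrt(mean_squared_error(test_Y, test_pred_Y))
mae_test = mean_absolute_error(test_Y, test_pred_Y)
print('RMSE test: {:.3f}; MAE test: {:.3f}'.format(rmse_test, mae_test))
```

Fixing random_state makes the split reproducible, which helps when comparing several models on the same testing data.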

Regression performance metrics

Key metrics:

Root mean squared error (RMSE) - Square root of the average squared difference between predictions and actuals
Mean absolute error (MAE) - Average absolute difference between predictions and actuals
Mean absolute percentage error (MAPE) - Average absolute percentage difference between predictions and actuals (undefined when any actual value is zero)
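
To make the three definitions concrete, here is a small sketch that computes them by hand with NumPy. The four actual and predicted values are made up for illustration.

```python
import numpy as np

# Hypothetical actuals and predictions, chosen only for illustration
actual = np.array([2.0, 4.0, 5.0, 10.0])
predicted = np.array([3.0, 4.0, 3.0, 8.0])

errors = predicted - actual

# RMSE: square root of the average squared difference
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: average absolute difference
mae = np.mean(np.abs(errors))

# MAPE: average absolute difference relative to the actuals
# (breaks down if any actual value is zero)
mape = np.mean(np.abs(errors / actual))

print('RMSE: {:.3f}; MAE: {:.3f}; MAPE: {:.1%}'.format(rmse, mae, mape))
# RMSE: 1.500; MAE: 1.250; MAPE: 27.5%
```

RMSE comes out above MAE here because squaring gives the two larger errors more weight than the small one.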

Additional regression and supervised learning metrics

R-squared - proportion of the variance in the target that is explained by the model, often quoted as a percentage. Only applicable to regression, not classification. Higher is better.
Coefficient p-values - probability of observing a regression (or classification) coefficient at least as extreme as the estimated one if the true coefficient were zero. Lower is better. Typical thresholds are 5% and 10%.
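
As a rough sketch of what R-squared measures, it can be computed by hand as one minus the ratio of unexplained variation to total variation. The values below are made up for illustration.

```python
import numpy as np

# Hypothetical actuals and predictions, chosen only for illustration
actual = np.array([2.0, 4.0, 5.0, 10.0])
predicted = np.array([3.0, 4.0, 3.0, 8.0])

# Residual sum of squares: variation the model leaves unexplained
ss_res = np.sum((actual - predicted) ** 2)

# Total sum of squares: variation of the actuals around their mean
ss_tot = np.sum((actual - actual.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print('R-squared: {:.3f}'.format(r_squared))
# R-squared: 0.741
```

A value of 0.741 would mean that roughly 74% of the variance in the actuals is accounted for by the predictions.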

Fitting the model

# Import the linear regression module
from sklearn.linear_model import LinearRegression

# Initialize the regression instance
linreg = LinearRegression()

# Fit model on the training data
linreg.fit(train_X, train_Y)

# Predict values on both training and testing data
train_pred_Y = linreg.predict(train_X)
test_pred_Y = linreg.predict(test_X)

Measuring model performance

# Import NumPy and the performance measurement functions
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# Calculate metrics for training data
rmse_train = np.sqrt(mean_squared_error(train_Y, train_pred_Y))
mae_train = mean_absolute_error(train_Y, train_pred_Y)

# Calculate metrics for testing data
rmse_test = np.sqrt(mean_squared_error(test_Y, test_pred_Y))
mae_test = mean_absolute_error(test_Y, test_pred_Y)

# Print performance metrics
print('RMSE train: {:.3f}; RMSE test: {:.3f}\nMAE train: {:.3f}, MAE test: {:.3f}'.format(
                            rmse_train, rmse_test, mae_train, mae_test))

RMSE train: 0.717; RMSE test: 1.216
MAE train: 0.514, MAE test: 0.555

Interpreting coefficients

Need to assess statistical significance
Introduction to statsmodels library
Gives in-depth model summary

Build regression model with statsmodels

# Import the libraries
import numpy as np
import statsmodels.api as sm

# Convert target variable to `numpy` array
train_Y = np.array(train_Y)

# Initialize and fit the model
# (sm.OLS fits no intercept unless train_X includes a constant column)
olsreg = sm.OLS(train_Y, train_X)
olsreg = olsreg.fit()

# Print model summary
print(olsreg.summary())

Regression summary table

[Figure: OLS summary]

Interpreting R-squared

[Figure: R-squared]

Interpreting coefficient p-values

[Figure: Coefficient p-values]

Let's build some regression models!
