Machine Learning for Marketing Analytics in R
Verena Pflieger
Data Scientist at INWT Statistics
1) Divide the dataset into training and test data
# Generating random index for training and test set
# set.seed ensures reproducibility of random components
set.seed(534381)
churnData$isTrain <- rbinom(nrow(churnData), 1, 0.66)
train <- subset(churnData, churnData$isTrain == 1)
test <- subset(churnData, churnData$isTrain == 0)
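A minimal sketch of what this split does, run on a small simulated data frame because churnData itself is not available here. The name demoData and its id column are hypothetical stand-ins. It illustrates that rbinom() assigns each row independently, so the training share is only approximately 0.66.

```r
# Hypothetical stand-in for churnData; only the number of rows matters here
set.seed(534381)
demoData <- data.frame(id = 1:1000)

# Same idea as above: one Bernoulli draw per row with P(train) = 0.66
demoData$isTrain <- rbinom(nrow(demoData), 1, 0.66)
train <- subset(demoData, isTrain == 1)
test  <- subset(demoData, isTrain == 0)

# Realized training share: close to 0.66, but not exactly 0.66
mean(demoData$isTrain)

# Every row ends up in exactly one of the two sets
nrow(train) + nrow(test) == nrow(demoData)
```

If an exact split size is required, drawing row indices with sample() is one possible alternative.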
2) Build a model based on training data
# Modeling logitTrainNew
logitTrainNew <- glm(returnCustomer ~ title + newsletter +
                       websiteDesign + paymentMethod + couponDiscount +
                       purchaseValue + throughAffiliate +
                       shippingFees + dvd + blueray + vinyl +
                       videogameDownload + prodOthers + prodRemitted,
                     family = binomial, data = train)
# Out-of-sample prediction for logitTrainNew
test$predNew <- predict(logitTrainNew, type = "response",
                        newdata = test)
# Calculating the confusion matrix
# confusion.matrix() comes from the SDMTools package
library(SDMTools)
confMatrixNew <- confusion.matrix(test$returnCustomer, test$predNew,
                                  threshold = 0.3)
confMatrixNew
# Calculating the accuracy
accuracyNew <- sum(diag(confMatrixNew)) / sum(confMatrixNew)
accuracyNew
    obs
pred     0    1
   0 11939 2449
   1   716  350

0.7951987
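The accuracy value can be reproduced by hand from the printed confusion matrix. The sketch below rebuilds that matrix in base R (the object name cm is chosen here for illustration) and divides the correctly classified cases on the diagonal by all cases.

```r
# Confusion matrix as printed above (rows: predicted, columns: observed).
# matrix() fills column by column, so the first two values form the obs = 0 column.
cm <- matrix(c(11939, 716, 2449, 350), nrow = 2,
             dimnames = list(pred = c("0", "1"), obs = c("0", "1")))
cm

# Correct predictions sit on the diagonal: 11939 + 350 = 12289
correct <- sum(diag(cm))

# All test cases: 11939 + 2449 + 716 + 350 = 15454
total <- sum(cm)

# Accuracy = correct / total, which matches the 0.7951987 shown above
accuracy <- correct / total
accuracy
```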
Calculation of cross-validated accuracy
library(boot)
# Accuracy function with threshold = 0.3
Acc03 <- function(r, pi = 0) {
  cm <- confusion.matrix(r, pi, threshold = 0.3)
  acc <- sum(diag(cm)) / sum(cm)
  return(acc)
}
# 6-fold cross-validated accuracy
set.seed(534381)
cv.glm(churnData, logitModelNew, cost = Acc03, K = 6)$delta
0.7943894
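To try the cross-validation step without the churn data, here is a self-contained sketch on simulated data. The names demoData, demoModel, x, y and acc03 are hypothetical. The cost function is written in base R instead of using confusion.matrix(), and it assumes that predicted probabilities at or above the threshold count as class 1.

```r
library(boot)

# Simulated stand-in for churnData (hypothetical variables x and y)
set.seed(534381)
n <- 600
demoData <- data.frame(x = rnorm(n))
demoData$y <- rbinom(n, 1, plogis(-1 + demoData$x))

# Logistic regression on the full simulated data set
demoModel <- glm(y ~ x, family = binomial, data = demoData)

# Cost function: share of correct classifications at threshold 0.3
# r = observed 0/1 outcome, pi = predicted probability
acc03 <- function(r, pi = 0) {
  mean((pi >= 0.3) == (r == 1))
}

# cv.glm() refits the model on K - 1 folds and evaluates it on the held-out fold
cvResult <- cv.glm(demoData, demoModel, cost = acc03, K = 6)

# delta holds two values: the raw cross-validation estimate
# and a bias-adjusted version of it
cvResult$delta
```

Because cv.glm() refits the model within each fold, the accuracy it reports is an out-of-sample estimate even though the model object passed in was fitted on the full data.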
| Learnings Logistic Regression | |
|---|---|
| You have learned... | how to predict which customers of an online shop are likely to churn |
| | how to use a binary logistic regression to calculate probabilities |
| | that the choice of the threshold is crucial |
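
Why the threshold matters can be illustrated with a small simulation. The data below are simulated, not the churn data, and the names prob, obs and evalThreshold are hypothetical. The sketch compares accuracy and the share of actual positives that are detected at three thresholds.

```r
# Simulated predicted probabilities and outcomes (hypothetical, not the churn data)
set.seed(534381)
n <- 1000
prob <- plogis(rnorm(n, mean = -1.5))
obs  <- rbinom(n, 1, prob)

# Accuracy and share of detected positives for one threshold
evalThreshold <- function(threshold) {
  pred <- as.integer(prob >= threshold)
  c(threshold = threshold,
    accuracy  = mean(pred == obs),
    detected  = sum(pred == 1 & obs == 1) / sum(obs == 1))
}

# One row per threshold
thresholds <- c(0.1, 0.3, 0.5)
results <- t(sapply(thresholds, evalThreshold))
results
```

Raising the threshold can never increase the share of detected positives, so accuracy alone does not show what a given threshold costs in missed cases.
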
| Learnings from the Model | |
|---|---|
| You have learned... | that customers who sign up for a newsletter are more likely to return |
| | that customers who use a coupon are less likely to return |
| | that customers who pay no shipping fees are more likely to return |
| | etc. |