Out-of-sample validation and cross validation

Machine Learning for Marketing Analytics in R

Verena Pflieger

Data Scientist at INWT Statistics

Out-of-sample fit: training and test data

1) Divide the dataset into training and test data

# Generating random index for training and test set
# set.seed ensures reproducibility of random components
set.seed(534381)

# Each row is assigned to the training set with probability 0.66
churnData$isTrain <- rbinom(nrow(churnData), 1, 0.66)
train <- subset(churnData, isTrain == 1)
test <- subset(churnData, isTrain == 0)
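
Because `rbinom` draws each row independently, the training share is only approximately 66% and changes with the seed. If an exact split size is preferred, one option is to sample row indices instead. The following is a minimal sketch on a simulated stand-in data frame; `toyData`, `toyTrain`, and `toyTest` are hypothetical names and not part of the course data.

```r
# Hypothetical stand-in for churnData (simulated, not the course data)
set.seed(534381)
toyData <- data.frame(id = 1:1000,
                      returnCustomer = rbinom(1000, 1, 0.2))

# Draw exactly 66% of the row indices without replacement
nTrain <- round(0.66 * nrow(toyData))
trainIdx <- sample(seq_len(nrow(toyData)), size = nTrain)

# Rows in trainIdx form the training set, all remaining rows the test set
toyTrain <- toyData[trainIdx, ]
toyTest  <- toyData[-trainIdx, ]

nrow(toyTrain)
nrow(toyTest)
```

Both approaches give a random split; the index-based version simply fixes the number of training rows in advance.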

Out-of-sample fit: building model

2) Build a model based on training data

# Modeling logitTrainNew
logitTrainNew <- glm( returnCustomer ~ title + newsletter + 
                     websiteDesign + paymentMethod + couponDiscount +
                     purchaseValue + throughAffiliate + 
                     shippingFees + dvd + blueray + vinyl + 
                     videogameDownload + prodOthers + prodRemitted,
                     family = binomial, data = train)

# Out-of-sample prediction for logitTrainNew
test$predNew <- predict(logitTrainNew, type = "response",
                        newdata = test)

Out-of-sample accuracy

# Calculating the confusion matrix
# (confusion.matrix() comes from the SDMTools package)
library(SDMTools)
confMatrixNew <- confusion.matrix(test$returnCustomer, test$predNew,
                                  threshold = 0.3)
confMatrixNew

# Calculating the accuracy 
accuracyNew <- sum(diag(confMatrixNew)) / sum(confMatrixNew)
accuracyNew

    obs
pred     0    1
   0 11939 2449
   1   716  350

0.7951987
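
The accuracy is the share of cases on the diagonal of the confusion matrix. As a rough illustration, the sketch below builds a confusion matrix in base R with `table()` on a few made-up values and then applies the same formula to the counts reported above. The vectors `obs` and `prob` are hypothetical, and the sketch assumes that cases at or above the threshold are classified as 1.

```r
# Hypothetical observed classes and predicted probabilities (illustration only)
obs  <- c(0, 0, 1, 1, 0, 1, 0, 0)
prob <- c(0.10, 0.35, 0.40, 0.20, 0.05, 0.80, 0.25, 0.50)

# Classify as 1 when the predicted probability reaches the threshold
predClass <- as.integer(prob >= 0.3)

# Rows = predicted class, columns = observed class
cmToy <- table(pred = factor(predClass, levels = 0:1),
               obs  = factor(obs, levels = 0:1))
accToy <- sum(diag(cmToy)) / sum(cmToy)

# The same formula applied to the counts reported above
cm <- matrix(c(11939, 716, 2449, 350), nrow = 2,
             dimnames = list(pred = c("0", "1"), obs = c("0", "1")))
accSlide <- sum(diag(cm)) / sum(cm)
accSlide
```

For the reported counts this is (11939 + 350) / 15454, which matches the accuracy shown above.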

Cross-validation: setup

Cross-validation: accuracy

Calculation of cross-validated accuracy

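library(boot)
# Accuracy function with threshold = 0.3
# (r = observed classes, pi = predicted probabilities, as passed by cv.glm)
Acc03 <- function(r, pi = 0) {
  cm <- confusion.matrix(r, pi, threshold = 0.3)
  acc <- sum(diag(cm)) / sum(cm)
  return(acc)
}
# Cross-validated accuracy with K = 6 folds
# (delta[1] is the raw cross-validation estimate)
set.seed(534381)
cv.glm(churnData, logitModelNew, cost = Acc03, K = 6)$delta[1]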

0.7943894
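
To make the idea behind K-fold cross-validation more concrete, the loop can also be written by hand. The sketch below is not the `cv.glm` implementation; it uses simulated data, and `simData`, `fold`, and `acc03` are hypothetical names introduced only for illustration. It again assumes that cases at or above the threshold are classified as 1.

```r
# Simulated stand-in data (hypothetical, not the course data)
set.seed(534381)
n <- 600
simData <- data.frame(x = rnorm(n))
simData$y <- rbinom(n, 1, plogis(-1 + simData$x))

# Assign each row to one of K equally sized folds at random
K <- 6
fold <- sample(rep(1:K, length.out = n))

# Accuracy at threshold 0.3
acc03 <- function(obs, prob) mean(as.integer(prob >= 0.3) == obs)

foldAcc <- numeric(K)
for (k in 1:K) {
  # Fit on all folds except k, then predict the held-out fold k
  fit <- glm(y ~ x, family = binomial, data = simData[fold != k, ])
  prob <- predict(fit, newdata = simData[fold == k, ], type = "response")
  foldAcc[k] <- acc03(simData$y[fold == k], prob)
}

# Cross-validated accuracy: average over the K folds
cvAcc <- mean(foldAcc)
cvAcc
```

Every observation is used for testing exactly once, so the estimate depends less on a single random split than the train/test approach.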

Learnings: logistic regression

You have learned...
- how to predict which customers of an online shop are likely to churn
- how to use a binary logistic regression to calculate probabilities
- that the choice of the threshold is crucial

Learnings from the model

You have learned...
- that customers who sign up for a newsletter are more likely to return
- that customers who use a coupon are less likely to return
- that customers who pay no shipping fees are more likely to return
- etc.

Last exercise!
