Fraud Detection in R
Sebastiaan Höppner
PhD researcher in Data Science at KU Leuven
prop.table(table(train$Class))
0 1
0.98 0.02
prop.table(table(test$Class))
0 1
0.98 0.02
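With only 2% fraud in both sets, a rule that never flags anything already scores about 98% accuracy, which is worth keeping in mind when reading the accuracy figures in this deck. A minimal sketch on made-up labels (the vectors `toy_class` and `always_legit` are hypothetical, not the course data):

```r
# Hypothetical labels with the same 98/2 split as above (not the course data)
toy_class <- factor(c(rep(0, 980), rep(1, 20)), levels = c(0, 1))

# Trivial "model": predict the majority class (no fraud) for every case
always_legit <- factor(rep(0, length(toy_class)), levels = c(0, 1))

# Accuracy looks high even though not a single fraud case is caught
baseline_accuracy <- mean(always_legit == toy_class)
frauds_caught <- sum(always_legit == 1 & toy_class == 1)
print(baseline_accuracy)
print(frauds_caught)
```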
Decision tree with the rpart package
library(rpart)
model1 = rpart(Class ~ ., data = train)
library(partykit)
plot(as.party(model1))
## Predict fraud probability of test set
scores1 = predict(model1, newdata = test, type = "prob")[, 2]
## Predict class (fraud or not) of test set
predicted_class1 = factor(ifelse(scores1 > 0.5, 1, 0))
## Confusion matrix & accuracy
library(caret)
CM1 = confusionMatrix(data = predicted_class1, reference = test$Class)
          Reference
Prediction     0     1
         0 12046    55
         1     8   191
Accuracy : 0.994878
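The accuracy above hides how many frauds slip through. As a rough sketch, the counts from this confusion matrix can be re-entered by hand to derive recall and precision for the fraud class; the object names `cm1`, `recall1` and `precision1` are illustrative only. Note that caret's `confusionMatrix` treats the first factor level (here `0`) as the positive class unless `positive` is set, so its own "Sensitivity" line would refer to the non-fraud class.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm1 <- matrix(c(12046, 8, 55, 191), nrow = 2,
              dimnames = list(Prediction = c("0", "1"),
                              Reference  = c("0", "1")))

# Share of all cases classified correctly
accuracy1 <- sum(diag(cm1)) / sum(cm1)

# Recall for fraud: caught frauds divided by all true frauds (191 / 246)
recall1 <- cm1["1", "1"] / sum(cm1[, "1"])

# Precision for fraud: caught frauds divided by all flagged cases (191 / 199)
precision1 <- cm1["1", "1"] / sum(cm1["1", ])

print(c(accuracy = accuracy1, recall = recall1, precision = precision1))
```

Under these counts roughly 78% of the fraud cases are caught, even though accuracy is above 99%.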
library(pROC)
auc(roc(response = test$Class, predictor = scores1)) ## Area Under ROC Curve (AUC)
Area under the curve: 0.8938
library(smotefamily)
set.seed(123)
smote_result = SMOTE(X = train[, -17], target = train$Class, K = 5, dup_size = 10)
train_oversampled = smote_result$data
colnames(train_oversampled)[17] = "Class"
prop.table(table(train_oversampled$Class))
0 1
0.8166667 0.1833333
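The new class proportions can be checked with simple arithmetic. The sketch below assumes that `dup_size = 10` makes SMOTE add ten synthetic cases per original fraud case, and it uses made-up class counts (`n_legit`, `n_fraud`) that only reproduce the 98/2 split, not the actual training set size:

```r
# Hypothetical class counts with a 98/2 split (not the course data)
n_legit  <- 9800
n_fraud  <- 200
dup_size <- 10

# Assumption: each original fraud case yields dup_size synthetic cases,
# so the minority class grows by a factor of (1 + dup_size)
n_fraud_after <- n_fraud * (1 + dup_size)

# Class proportions after oversampling
prop_after <- c("0" = n_legit, "1" = n_fraud_after) / (n_legit + n_fraud_after)
print(round(prop_after, 7))
```

Under that assumption the fraud share rises from 2% to 0.22 / 1.20, which matches the 0.1833333 shown above.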
library(rpart)
model2 = rpart(Class ~ ., data = train_oversampled)
## Predict fraud probability of test set
scores2 = predict(model2, newdata = test, type = "prob")[, 2]
## Predict class (fraud or not) of test set
predicted_class2 = factor(ifelse(scores2 > 0.5, 1, 0))
## Confusion matrix & accuracy
library(caret)
CM2 = confusionMatrix(data = predicted_class2, reference = test$Class)
          Reference
Prediction     0     1
         0 11967    34
         1    87   212
Accuracy : 0.9901626
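Accuracy is slightly lower with SMOTE, yet fewer frauds are missed. A small sketch that puts the counts from the two confusion matrices in this deck side by side (the data frame `comparison` and its column names are illustrative, not part of the course code):

```r
# Counts copied from the two confusion matrices shown in this deck
comparison <- data.frame(
  model         = c("without SMOTE", "with SMOTE"),
  missed_frauds = c(55, 34),    # predicted 0, reference 1
  false_alarms  = c(8, 87),     # predicted 1, reference 0
  caught_frauds = c(191, 212)   # predicted 1, reference 1
)

# Recall for fraud: caught frauds divided by all true frauds in the test set
comparison$recall <- with(comparison,
                          caught_frauds / (caught_frauds + missed_frauds))

print(comparison)
```

The trade-off is visible directly: 21 fewer missed frauds in exchange for 79 more false alarms.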
library(pROC)
auc(roc(response = test$Class, predictor = scores2)) ## Area Under ROC Curve (AUC)
Area under the curve: 0.9538
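cost_model = function(predicted.classes, true.classes, amounts, fixedcost) {
  ## Convert factor labels "0"/"1" to numeric 0/1 so the arithmetic below works
  predicted.classes = as.numeric(as.character(predicted.classes))
  true.classes = as.numeric(as.character(true.classes))
  ## Missed frauds cost the transferred amount; each flagged case costs fixedcost
  cost = sum(true.classes * (1 - predicted.classes) * amounts +
             predicted.classes * fixedcost)
  return(cost)
}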
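To see what the cost formula charges, it can help to run it by hand on a few cases. The sketch below uses a hypothetical helper `toy_cost` that mirrors the formula and expects numeric 0/1 vectors; the four transfers and their amounts are made up for illustration:

```r
# Hypothetical helper mirroring the cost formula above; expects numeric 0/1 vectors
toy_cost <- function(predicted, true, amounts, fixedcost) {
  sum(true * (1 - predicted) * amounts + predicted * fixedcost)
}

# Made-up transfers: cases 2 and 4 are fraud
true_class      <- c(0, 1, 0, 1)
predicted_class <- c(0, 1, 1, 0)
amounts         <- c(50, 200, 30, 400)

# Case 1: legitimate, not flagged -> 0
# Case 2: fraud, flagged          -> fixed cost 10
# Case 3: legitimate, flagged     -> fixed cost 10 (false alarm)
# Case 4: fraud, missed           -> full amount 400
total <- toy_cost(predicted_class, true_class, amounts, fixedcost = 10)
print(total)  # 420
```

A missed fraud costs its full amount, while every flagged case, correct or not, costs only the fixed investigation fee. That is why a model with more false alarms can still come out cheaper.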
## Total cost without using SMOTE:
cost_model(predicted_class1, test$Class, test$Amount, fixedcost = 10)
10061.8
## Total cost when using SMOTE:
cost_model(predicted_class2, test$Class, test$Amount, fixedcost = 10)
7431.93