Introduction & Motivation

Fraud Detection in R

Bart Baesens

Professor of Data Science at KU Leuven

Instructors

picture_BartTimSebastiaan.png

What is fraud?

Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types and forms.

wolf_in_sheeps_clothing

Impact of fraud

  • Fraud is very rare, but the cost of not detecting it can be huge!
  • Examples:
    • A typical organization loses an estimated 5% of its yearly revenue to fraud
    • Businesses lose more than $3.5 trillion to fraud each year
    • Credit card companies lose approximately 7 cents per $100 of transactions due to fraud
    • Fraud takes up 5-10% of the claim amounts paid for non-life insurance

Types of fraud

  • Anti-money laundering
  • Check fraud
  • Credit card fraud
  • Customs fraud
  • Counterfeiting
  • Identity theft
  • Insurance fraud
  • Mortgage fraud
  • Non-delivery fraud
  • Online fraud
  • Product warranty fraud
  • Tax evasion
  • Telecommunication fraud
  • Theft of inventory
  • Ticket fraud
  • Transit fraud
  • Wire fraud
  • Workers' compensation fraud

Key characteristics of successful fraud analytics models

  • Statistical accuracy
  • Interpretability
  • Regulatory compliance
  • Economic impact
  • Complement expert-based approaches with data-driven techniques

manmachine

Challenges of a fraud detection model

  • Imbalance
    • e.g. in credit card data, typically fewer than 0.5% of transactions are fraudulent
  • Operational efficiency
    • e.g. a credit card transaction must typically be decided on in under 8 seconds
  • Avoid harassing good customers

goodcustomer

Imbalanced data

  • After a major storm, an insurance company received many claims
    • Fraudulent claims are labeled with 1 and legitimate claims with 0
  • The proportion of fraud cases in the data can be determined by combining two functions:
    • table() counts the cases per label
    • prop.table() converts those counts into proportions

prop.table(table(fraud_label))
     0      1
0.9911 0.0089
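
The same calculation can be tried on data you construct yourself. The snippet below is a minimal sketch: the `fraud_label` vector is a hypothetical stand-in rather than the storm claims data, and the 1% fraud rate is an assumption chosen only for illustration.

```r
# Hypothetical stand-in for the claim labels: 1 = fraud, 0 = legitimate.
# 990 legitimate and 10 fraudulent claims (assumed numbers, illustration only).
fraud_label <- c(rep(0, 990), rep(1, 10))

# Absolute number of claims per class
counts <- table(fraud_label)
print(counts)

# Relative frequencies: each count divided by the total number of claims
proportions <- prop.table(counts)

# Express the proportions as percentages, rounded to two decimals
print(round(100 * proportions, 2))
```

Calling prop.table() on the counts rather than on the raw labels matters: it needs a table (or numeric array) of counts as input.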

Visualize imbalance with pie chart

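# Build labels such as "no fraud 99.11%" from the class proportions
labels <- c("no fraud", "fraud")
labels <- paste0(labels, " ", round(100 * prop.table(table(fraud_label)), 2), "%")
# Blue slice for legitimate claims, red slice for fraudulent claims
pie(table(fraud_label), labels, col = c("blue", "red"),
    main = "Pie chart of storm claims")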

stormclaims

Confusion matrix

Used to evaluate a fraud detection model:

confusionmatrix
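
A confusion matrix cross-tabulates predicted labels against true labels. As a rough sketch, it can be built in base R with table(); the `actual` and `predicted` vectors below are hypothetical toy data, not output from any model in this course.

```r
# Hypothetical true labels and model predictions (1 = fraud, 0 = legitimate)
actual    <- factor(c(0, 0, 0, 0, 1, 1, 0, 1, 0, 0), levels = c(0, 1))
predicted <- factor(c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0), levels = c(0, 1))

# Rows hold the predictions, columns hold the true labels
cm <- table(Prediction = predicted, Reference = actual)
print(cm)

# The four cells, treating fraud (1) as the positive class
tp <- cm["1", "1"]  # fraud correctly flagged
fp <- cm["1", "0"]  # legitimate case wrongly flagged as fraud
fn <- cm["0", "1"]  # fraud case that was missed
tn <- cm["0", "0"]  # legitimate case correctly passed
```

Fixing the factor levels to c(0, 1) on both vectors keeps the table 2 x 2 even when one class never appears among the predictions.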

  • Suppose no detection model is used, so all claims are considered legitimate:
# Predict 0 (legitimate) for every claim
predictions <- rep.int(0, times = nrow(claims))
# Use the same levels as fraud_label (0 = legitimate, 1 = fraud)
predictions <- factor(predictions, levels = c(0, 1))
  • Function confusionMatrix() from package caret:
library(caret)
# fraud_label must be a factor with the same levels as predictions
confusionMatrix(data = predictions, reference = fraud_label)
          Reference
Prediction   0   1
         0 614  14
         1   0   0
Accuracy : 0.9777
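
An accuracy of 97.77% is reached here without catching a single fraud case, which is why accuracy alone is a poor yardstick on imbalanced data. The sketch below recomputes it from the four counts in the confusion matrix above and contrasts it with sensitivity (the share of fraud cases that are detected); the variable names are illustrative.

```r
# Counts taken from the confusion matrix above (positive class = fraud = 1)
tn <- 614  # legitimate claims predicted as legitimate
fn <- 14   # fraudulent claims predicted as legitimate (missed)
fp <- 0    # legitimate claims predicted as fraud
tp <- 0    # fraudulent claims predicted as fraud

# Share of all claims that were classified correctly
accuracy <- (tp + tn) / (tp + tn + fp + fn)

# Share of fraudulent claims that were actually detected
sensitivity <- tp / (tp + fn)

print(round(accuracy, 4))
print(sensitivity)
```

Note that confusionMatrix() treats the first factor level as the positive class by default, so its `positive` argument would need to be set to "1" for the reported sensitivity to refer to fraud.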

Total cost of not detecting fraud: claims example

  • Total cost of fraud is defined as the sum of the fraudulent claim amounts
  • Total cost if no fraud is detected:
# Sum the amounts of all claims labeled as fraud (1)
total_cost <- sum(claim_amount[fraud_label == 1])
print(total_cost)
[1] 2301508
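
The same cost definition can be used to compare "no detection" with a model that catches some fraud. The following is a simplified sketch on hypothetical amounts, labels, and predictions; it assumes a detected fraud costs nothing and ignores investigation costs.

```r
# Hypothetical claim amounts and labels (1 = fraud, 0 = legitimate)
claim_amount <- c(1200, 850, 43000, 300, 9700)
fraud_label  <- c(0, 0, 1, 0, 1)

# Hypothetical model output: only the first fraudulent claim is flagged
predictions <- c(0, 0, 1, 0, 0)

# Without detection, every fraudulent amount is lost
cost_no_detection <- sum(claim_amount[fraud_label == 1])

# With the model, only the fraud that slips through is lost
cost_with_model <- sum(claim_amount[fraud_label == 1 & predictions == 0])

print(cost_no_detection)
print(cost_with_model)
```

Weighting errors by amount in this way shows that missing one large fraudulent claim can cost more than missing several small ones.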

Let's practice!
