Introduction & Motivation

Fraud Detection in R

Bart Baesens

Professor of Data Science at KU Leuven

Instructors

What is fraud?

Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types and forms.

Impact of fraud

Fraud is very rare, but the cost of not detecting it can be huge!
Examples:
- Organizations lose 5% of their yearly revenues to fraud
- Businesses lose more than $3.5 trillion to fraud each year
- Credit card companies lose approximately 7 cents per $100 of transactions due to fraud
- Fraud accounts for 5-10% of the claim amounts paid in non-life insurance

Types of fraud

Anti-money laundering
Check fraud
Credit card fraud
Customs fraud
Counterfeit
Identity theft
Insurance fraud
Mortgage fraud
Non-delivery fraud
Online fraud
Product warranty fraud
Tax evasion
Telecommunication fraud
Theft of inventory
Ticket fraud
Transit fraud
Wire fraud
Workers' compensation fraud

Key characteristics of successful fraud analytics models

Statistical accuracy
Interpretability
Regulatory compliance
Economic impact
Complementing expert-based approaches with data-driven techniques

Challenges of fraud detection models

Imbalance
- e.g. in credit card fraud, typically < 0.5% of transactions are fraudulent
Operational efficiency
- e.g. in credit card fraud, a decision is needed in < 8 seconds
Avoid harassing good customers

Imbalanced data

After a major storm, an insurance company received many claims.
- Fraudulent claims are labeled with 1 and legitimate claims with 0
The proportion of fraud cases in the data can be determined by combining the functions table() and prop.table():

prop.table(table(fraud_label))

     0      1
0.9911 0.0089
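
The same computation can be tried on a small stand-alone example. This is only a sketch: the vector fraud_label_sim is simulated for illustration and is not the storm-claims data, so its proportions differ from the output above.

```r
# Simulated labels, for illustration only: 1 = fraud, 0 = legitimate
fraud_label_sim <- c(rep(0, 990), rep(1, 10))

# Absolute number of claims per class
counts <- table(fraud_label_sim)
print(counts)

# Relative frequency of each class
proportions <- prop.table(counts)
print(proportions)

# Percentage of fraud cases as a plain number
fraud_pct <- 100 * as.numeric(proportions["1"])
print(fraud_pct)
```

Indexing the result by the class name ("1") rather than by position keeps the code correct even if the order of the classes changes.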

Visualize imbalance with pie chart

labels <- c("no fraud", "fraud")
labels <- paste(labels, round(100 * prop.table(table(fraud_label)), 2), "%")
pie(table(fraud_label), labels, col = c("blue", "red"),
      main = "Pie chart of storm claims")

Confusion matrix

Used for evaluating a fraud detection model: it cross-tabulates the predicted class labels against the actual class labels.

Suppose no detection model is used, so all claims are considered legitimate:

# Predict 0 (legitimate) for every claim; keep both class levels (0 and 1) in the factor
predictions <- factor(rep.int(0, times = nrow(claims)), levels = c(0, 1))

Function confusionMatrix() from package caret:

library(caret)
# data and reference must be factors with the same levels
confusionMatrix(data = predictions, reference = factor(fraud_label, levels = c(0, 1)))

          Reference
Prediction   0   1
         0 614  14
         1   0   0

Accuracy : 0.9777
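
To see why this high accuracy is misleading, the confusion matrix can also be rebuilt in base R. The sketch below makes two assumptions: the labels are reconstructed from the counts shown above (614 legitimate, 14 fraudulent) rather than taken from the original data, and the recall on the fraud class is an extra metric added here for illustration.

```r
# Labels reconstructed from the counts above: 614 legitimate (0), 14 fraud (1)
actual <- factor(c(rep(0, 614), rep(1, 14)), levels = c(0, 1))

# Naive "model": every claim is predicted to be legitimate
predicted <- factor(rep(0, length(actual)), levels = c(0, 1))

# Cross-tabulate predictions (rows) against actual labels (columns)
cm <- table(Prediction = predicted, Reference = actual)
print(cm)

# Accuracy: share of claims that are classified correctly
accuracy <- sum(diag(cm)) / sum(cm)
print(round(accuracy, 4))

# Recall on the fraud class: share of actual frauds that were flagged
fraud_recall <- cm["1", "1"] / sum(cm[, "1"])
print(fraud_recall)
```

The accuracy is high only because legitimate claims dominate the data; none of the 14 fraudulent claims is caught.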

Total cost of not detecting fraud: claims example

The total cost of fraud is defined as the sum of the fraudulent claim amounts.

Total cost if no fraud is detected:

# Fraudulent claims are labeled with 1
total_cost <- sum(claim_amount[fraud_label == 1])
print(total_cost)
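
A minimal worked example of this sum, assuming made-up amounts: claim_amount_sim and fraud_label_sim are hypothetical vectors introduced only for illustration and do not come from the storm-claims data.

```r
# Hypothetical data, for illustration only
claim_amount_sim <- c(1200, 530, 8000, 250, 4300)
fraud_label_sim  <- c(0, 0, 1, 0, 1)   # 1 = fraud, 0 = legitimate

# Keep only the amounts of the fraudulent claims and add them up
total_cost_sim <- sum(claim_amount_sim[fraud_label_sim == 1])
print(total_cost_sim)
```

Here the third and fifth claims are fraudulent, so the total cost is 8000 + 4300.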

Let's practice!
