Dealing with imbalanced datasets

Fraud Detection in R

Bart Baesens

Professor of Data Science at KU Leuven

Imbalanced data sets

Key challenge: label events as fraud or not
- Major challenge for classification methods & anomaly detection techniques
Classifiers tend to favour the majority class (= no fraud)
- Large classification error on the fraud cases
Classifiers learn better from a balanced distribution
Possible solution: change the class distribution with sampling methods

haystack_and_needle_v1
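
To illustrate why favouring the majority class is a problem, here is a minimal sketch in base R. The labels are hypothetical toy data (980 legitimate and 20 fraudulent cases, not the course dataset), and the "classifier" is simply assumed to predict no-fraud for every case.

```r
# Hypothetical labels: 980 legitimate (0) and 20 fraudulent (1) cases
actual <- factor(rep(c(0, 1), times = c(980, 20)), levels = c(0, 1))

# A naive "classifier" that always predicts the majority class (no fraud)
predicted <- factor(rep(0, length(actual)), levels = c(0, 1))

# Confusion matrix: rows = actual class, columns = predicted class
conf_mat <- table(actual = actual, predicted = predicted)
print(conf_mat)

# Share of all cases labeled correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Share of fraud cases that were actually detected
fraud_recall <- conf_mat["1", "1"] / sum(conf_mat["1", ])

print(accuracy)      # 0.98
print(fraud_recall)  # 0
```

Overall accuracy looks high, yet every fraud case is misclassified, which is the "large classification error on the fraud cases" mentioned above.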

Original imbalance

Over-sampling minority class...

class_barplot_over

... or under-sampling majority class ...

class_barplot_under

... or both!

class_barplot_both

Result after sampling...

class_barplot_result2

... or like this

class_barplot_result1

Random over-sampling (ROS)

orig_data_v0

orig_data_train_test

random_oversampling_v0

random_oversampling_v1
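
The idea behind random over-sampling can be sketched in a few lines of base R. This is only an illustration on hypothetical toy data (the data frame `toy` and its column `x` are made up), and it assumes, as the train/test figure name suggests, that only the training set is resampled while the test set keeps its original class distribution.

```r
set.seed(2018)

# Hypothetical toy data: 980 legitimate (0) and 20 fraudulent (1) cases
toy <- data.frame(x = rnorm(1000),
                  Class = rep(c(0, 1), times = c(980, 20)))

# Split first, so the test set is left untouched by the resampling
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Row positions of each class within the training set
legit_rows <- which(train$Class == 0)
fraud_rows <- which(train$Class == 1)

# Draw fraud rows with replacement until both classes are equally large
n_extra <- length(legit_rows) - length(fraud_rows)
extra_rows <- fraud_rows[sample.int(length(fraud_rows), size = n_extra,
                                    replace = TRUE)]

# Append the duplicated fraud cases to the training set
train_balanced <- rbind(train, train[extra_rows, ])

print(table(train$Class))           # imbalanced
print(table(train_balanced$Class))  # balanced
```

Because the extra fraud rows are exact copies of existing ones, no new information is added; the classifier simply sees the minority class more often during training.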

Random over-sampling in practice

Credit Card Fraud Detection dataset on Kaggle
- About 300K anonymized credit card transactions labeled as fraudulent or genuine
About the data...
- Numerical (anonymized) variables: V1, V2, ... , V28
- Time = seconds elapsed between each transaction and the first transaction in the dataset
- Amount = transaction amount
- Class = response variable: 1 in case of fraud and 0 otherwise

creditcard_V2vsV1

Check the imbalance

head(creditcard)

  Time         V1         V2  ...            V27         V28 Amount Class
1    0  1.1918571  0.2661507  ...  -0.0089830991  0.01472417   2.69     0
2   10  0.3849782  0.6161095  ...   0.0424724419 -0.05433739   9.99     0
3   12 -0.7524170  0.3454854  ...  -0.1809975001  0.12939406  15.99     0
4   17  0.9624961  0.3284610  ...   0.0163706433 -0.01460533  34.09     0
5   34  0.2016859  0.4974832  ...   0.1427572469  0.21923761   9.99     0

prop.table(table(creditcard$Class))

   0    1 
0.98 0.02

ovun.sample from ROSE package

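# Number of legitimate cases in the original data
n_legit <- 24108
# Desired fraction of legitimate cases after over-sampling
new_frac_legit <- 0.50
# Required total size: n_legit / new_frac_legit = 24108 / 0.50 = 48216
new_n_total <- n_legit / new_frac_legit

library(ROSE)
# Randomly duplicate fraud cases until the dataset has new_n_total rows
oversampling_result <- ovun.sample(formula = Class ~ ., data = creditcard,
                                   method = "over",
                                   N = new_n_total,
                                   seed = 2018)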

oversampled_credit <- oversampling_result$data
prop.table(table(oversampled_credit$Class))

  0   1 
0.5 0.5

creditcard_V2vsV1_oversampled
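
The same function also covers the other two strategies shown earlier, via `method = "under"` and `method = "both"`. The sketch below is a hedged illustration on hypothetical toy data (the data frame `toy` and its column `x` are made up, not the credit card data). For under-sampling, the target size follows the same logic as above but starts from the fraud count: N = n_fraud / new_frac_fraud.

```r
library(ROSE)

# Hypothetical toy data: 980 legitimate (0) and 20 fraudulent (1) cases
set.seed(1)
toy <- data.frame(x = rnorm(1000),
                  Class = factor(rep(c(0, 1), times = c(980, 20))))

n_fraud <- sum(toy$Class == 1)

# Under-sampling: keep all fraud cases and randomly drop legitimate ones
# until fraud makes up the desired fraction of the data
new_frac_fraud <- 0.50
new_n_under <- n_fraud / new_frac_fraud   # 20 / 0.50 = 40
under_result <- ovun.sample(Class ~ ., data = toy, method = "under",
                            N = new_n_under, seed = 2018)
print(table(under_result$data$Class))

# Both: over-sample fraud and under-sample legitimate cases at once;
# N is the size of the new dataset, p the probability of drawing a fraud case
both_result <- ovun.sample(Class ~ ., data = toy, method = "both",
                           N = nrow(toy), p = 0.5, seed = 2018)
print(table(both_result$data$Class))
```

With `method = "both"` the class counts are only approximately balanced, since each case is drawn from the fraud class with probability `p`.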

Let's practice!
