Dealing with imbalanced datasets

Fraud Detection in R

Bart Baesens

Professor of Data Science at KU Leuven

Imbalanced data sets

  • Key challenge: label events as fraud or not
    • Major challenge for classification methods & anomaly detection techniques
  • Classifier tends to favour the majority class (= no fraud)
    • Large classification error on the fraud cases
  • Classifiers learn better from a balanced distribution
  • Possible solution: change the class distribution with sampling methods

haystack_and_needle_v1

Original imbalance

class_barplot_orig.png

Over-sampling minority class...

class_barplot_over

... or under-sampling majority class ...

class_barplot_under

... or both!

class_barplot_both

Result after sampling...

class_barplot_result2

... or like this

class_barplot_result1
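
The under-sampling option from the plots above can be sketched in base R. This is only a minimal illustration on made-up data: the `toy` data frame, its 95/5 class split, and the choice to balance the classes exactly are assumptions for the example, not part of the course material.

```r
# Toy data (hypothetical): 95 legitimate cases (Class = 0), 5 fraud cases (Class = 1)
set.seed(1)
toy <- data.frame(x = rnorm(100), Class = rep(c(0, 1), times = c(95, 5)))

# Row indices per class
idx_legit <- which(toy$Class == 0)
idx_fraud <- which(toy$Class == 1)

# Under-sampling: keep every fraud case and draw an equally sized random
# subset of the legitimate cases, without replacement
keep_legit <- sample(idx_legit, size = length(idx_fraud), replace = FALSE)
undersampled <- toy[c(keep_legit, idx_fraud), ]

# Class counts after under-sampling
table(undersampled$Class)
```

Under-sampling discards majority-class rows, so the balanced data set is much smaller than the original; over-sampling keeps all rows and adds duplicates of the minority class instead.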

Random over-sampling (ROS)

orig_data_v0

orig_data_train_test

random_oversampling_v0

random_oversampling_v1
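
Random over-sampling amounts to drawing minority-class rows with replacement and appending the copies to the data. The sketch below shows the idea in base R on made-up data; the `toy` data frame and its 95/5 split are assumptions for illustration, and this is not necessarily how the ROSE package implements it internally.

```r
# Toy data (hypothetical): 95 legitimate cases (Class = 0), 5 fraud cases (Class = 1)
set.seed(2018)
toy <- data.frame(x = rnorm(100), Class = rep(c(0, 1), times = c(95, 5)))

# Row indices per class
idx_legit <- which(toy$Class == 0)
idx_fraud <- which(toy$Class == 1)

# Number of extra fraud rows needed to match the number of legitimate rows
n_extra <- length(idx_legit) - length(idx_fraud)

# Random over-sampling: draw existing fraud rows with replacement
extra_fraud <- sample(idx_fraud, size = n_extra, replace = TRUE)

# Original rows plus the duplicated fraud rows
oversampled <- toy[c(idx_legit, idx_fraud, extra_fraud), ]

# Class proportions after over-sampling
prop.table(table(oversampled$Class))
```

No new information is created: every added fraud row is an exact copy of an existing one. In practice, resampling is usually applied to the training set only, so that the test set keeps the original class distribution.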

Random over-sampling in practice

  • Credit Card Fraud Detection dataset on Kaggle
    • ~300K anonymized credit card transfers labeled as fraudulent or genuine
  • About the data...
    • Numerical (anonymized) variables: V1, V2, ... , V28
    • Time = seconds elapsed between each transfer and first transfer in the dataset
    • Amount = transaction amount
    • Class = response variable: value 1 in case of fraud and 0 otherwise

creditcard_V2vsV1

Check the imbalance

head(creditcard)
  Time         V1         V2  ...            V27         V28 Amount Class
1    0  1.1918571  0.2661507  ...  -0.0089830991  0.01472417   2.69     0
2   10  0.3849782  0.6161095  ...   0.0424724419 -0.05433739   9.99     0
3   12 -0.7524170  0.3454854  ...  -0.1809975001  0.12939406  15.99     0
4   17  0.9624961  0.3284610  ...   0.0163706433 -0.01460533  34.09     0
5   34  0.2016859  0.4974832  ...   0.1427572469  0.21923761   9.99     0
prop.table(table(creditcard$Class))
   0    1 
0.98 0.02

ovun.sample from ROSE package

# Number of legitimate (Class = 0) cases in the data
n_legit <- 24108
# Desired fraction of legitimate cases after over-sampling
new_frac_legit <- 0.50
new_n_total <- n_legit / new_frac_legit ## = 24108 / 0.50 = 48216

library(ROSE)
oversampling_result <- ovun.sample(formula = Class ~ ., data = creditcard,
                                   method = "over", N = new_n_total, seed = 2018)
oversampled_credit <- oversampling_result$data
prop.table(table(oversampled_credit$Class))
  0   1 
0.5 0.5
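
The value of N above follows from the fact that over-sampling leaves the number of legitimate cases unchanged: if they should make up a fraction f of the new data set, then N = n_legit / f. A small sketch of that arithmetic, where `n_total_over` is a hypothetical helper name and the 60% target is only an example:

```r
# Hypothetical helper: total size N of the over-sampled data set when all
# n_legit legitimate cases are kept and should make up frac_legit of it
n_total_over <- function(n_legit, frac_legit) {
  round(n_legit / frac_legit)
}

n_total_over(24108, 0.50)  # 48216, the value used above
n_total_over(24108, 0.60)  # 40180, i.e. 60% legitimate and 40% fraud

# Number of fraud cases in the over-sampled data for the 50/50 target
n_total_over(24108, 0.50) - 24108  # 24108
```

The result can be passed as `N` to `ovun.sample` with `method = "over"`, as in the call above.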

creditcard_V2vsV1_oversampled

Let's practice!
