Fraud Detection in R
Tim Verdonck
Professor Data Science at KU Leuven
An outlier is an observation that deviates from the pattern of the majority of the data.
An outlier can be a warning for fraud.
A popular tool for outlier detection is the z-score.
The z-score $z_i$ for observation $x_i$ is calculated as:
$$z_i=\frac{x_i-\hat{\mu}}{\hat{\sigma}} = \frac{x_i-\overline{x}}{s}$$
Dataset loginc contains the monthly incomes of 10 persons after log transformation:
loginc: 7.876 7.681 7.628 ... 7.764 9.912 # <-- last income is clearly an outlier!
Mean <- mean(loginc)
Sd <- sd(loginc)
zscore <- (loginc - Mean) / Sd
abs(zscore) > 3
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
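The check above flags nothing because the outlier itself inflates both the mean and the standard deviation. For a sample of size $n$, no classical z-score can exceed $(n-1)/\sqrt{n}$ in absolute value, which is about 2.85 for $n = 10$, so the threshold of 3 is out of reach. A minimal sketch of this masking effect, using hypothetical values (not the loginc data):

```r
# Hypothetical data: nine identical values and one extreme value
x <- c(rep(7.7, 9), 1000)
n <- length(x)

# Classical z-scores based on the sample mean and standard deviation
z <- (x - mean(x)) / sd(x)
max_abs_z <- max(abs(z))

# Upper bound on |z| for any sample of size n
bound <- (n - 1) / sqrt(n)

max_abs_z   # reaches the bound, yet stays below 3
bound < 3   # TRUE for n = 10
```

However extreme the tenth value is made, its classical z-score stays below 3 in a sample of 10 observations.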
Sample mean: $$\overline{x}=\frac{1}{n}\sum_i x_i$$
mean(loginc)
7.986447
mean(loginc9)
7.772392
loginc9 contains the same observations as loginc, except for the outlier.
Order the $n$ observations from small to large. The sample median, $Med(X_n)$, is then the $(n+1)/2$th observation (if $n$ is odd) or the average of the $n/2$th and $(n/2+1)$th observations (if $n$ is even).
median(loginc)
7.816658
median(loginc9)
7.764296
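The definition of the sample median can be spelled out by hand. The sketch below is illustrative only: `sample_median` is a hypothetical helper name and the input vectors are made up. In practice the built-in `median()` does the same job.

```r
# Hypothetical helper: sample median computed from its definition
sample_median <- function(x) {
  x_sorted <- sort(x)            # order observations from small to large
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    x_sorted[(n + 1) / 2]        # odd n: the middle observation
  } else {
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2  # even n: average of the two middle ones
  }
}

sample_median(c(5, 1, 3))        # odd number of observations
sample_median(c(4, 1, 3, 2))     # even number of observations
```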
(1) Sample standard deviation: $$s= \sqrt{\frac{1}{n-1}\sum_i (x_i-\hat{\mu})^2}$$
sd(loginc)
0.6976615
sd(loginc9)
0.1791729
(2) Median absolute deviation: $$Mad(X_n)=1.4826Med(|x_i-Med(X_n)|)$$
(3) Interquartile range (normalized): $$IQR(X_n) = 0.7413(Q_3-Q_1)$$ where $Q_1$ and $Q_3$ are the first and third quartiles of the data.
IQR(loginc)/1.349
0.2056784
mad(loginc)
0.2396159
mad(loginc9)
0.201305
IQR(loginc9)/1.349
0.1839295
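The two robust scale estimators can also be computed directly from their formulas. The sketch below uses a hypothetical vector `x` (not the loginc data) and compares the manual results with R's built-in `mad()` and `IQR()`. Dividing by 1.349 is the same normalization as multiplying by 0.7413.

```r
# Hypothetical data with one large value
x <- c(2, 4, 4, 5, 7, 9, 30)

# Median absolute deviation, scaled by 1.4826 (the default constant in mad())
mad_manual <- 1.4826 * median(abs(x - median(x)))

# Normalized interquartile range: (Q3 - Q1) / 1.349
q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr_manual <- (q[2] - q[1]) / 1.349

c(mad_manual, mad(x))            # both entries agree
c(iqr_manual, IQR(x) / 1.349)    # both entries agree
```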
Plug in the robust estimators to compute robust z-scores:
$$z_i=\frac{x_i-\hat{\mu}}{\hat{\sigma}} =\frac{x_i-Med(X_n)}{Mad(X_n)}$$
robzscore <- (loginc - median(loginc)) / mad(loginc)
abs(robzscore) > 3 ## Check for outliers
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
robzscore[10] ## Robust z-score of the outlier
8.748523
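For reuse, the robust z-score can be wrapped in a small function. This is a sketch under stated assumptions: `rob_zscore` is a hypothetical helper name, and `x` holds made-up values that only loosely resemble log incomes (they are not the loginc data).

```r
# Hypothetical helper: robust z-scores based on median and MAD
rob_zscore <- function(x) (x - median(x)) / mad(x)

# Hypothetical values: nine similar observations and one large one
x <- c(7.9, 7.7, 7.6, 7.8, 7.7, 7.8, 7.6, 7.9, 7.8, 9.9)

# Classical z-scores fail to flag the large value ...
classical_flags <- abs((x - mean(x)) / sd(x)) > 3

# ... while robust z-scores flag it
robust_flags <- abs(rob_zscore(x)) > 3

any(classical_flags)   # FALSE
which(robust_flags)    # 10
```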
boxplot(los, col = "blue", ylab = "Length of Stay (LOS)")$out
59 33 42 67 35 47 102 36 27 31 27 30 29 32 37 27 38
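The values in `$out` come from the standard boxplot rule: points more than 1.5 times the box length beyond the hinges are marked as outliers. Because the fences are symmetric around the box, right-skewed data tends to produce many flagged points in the upper tail. The sketch below reproduces the rule by hand on hypothetical data (not the `los` data) and checks it against `boxplot.stats()`.

```r
# Hypothetical right-skewed data
x <- c(1, 2, 2, 3, 3, 4, 5, 6, 8, 40)

# Lower and upper hinges (box edges) as used by boxplot()
hinges <- fivenum(x)[c(2, 4)]
box_length <- diff(hinges)

# Fences at 1.5 box lengths beyond the hinges
fence_lo <- hinges[1] - 1.5 * box_length
fence_hi <- hinges[2] + 1.5 * box_length

# Points outside the fences are reported as outliers
manual_out <- x[x < fence_lo | x > fence_hi]

manual_out              # 40
boxplot.stats(x)$out    # same result from the built-in rule
```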
Outliers according to adjusted boxplot:
library(robustbase)
adjbox(los)$out
59 67 102
Statistics computed by adjusted boxplot:
adjboxStats(los)$stats
2 4 8 13 47