Fraud Detection in R
Tim Verdonck
Professor of Data Science at KU Leuven
Animals
The Animals dataset in package MASS contains average brain and body weights for 28 animals (body weight in kg, brain weight in g)
library(MASS)
data("Animals")
                  body brain
Mountain beaver   1.35   8.1
Cow             465.00 423.0
Grey wolf        36.33 119.5
Goat             27.66 115.0
X <- data.frame(log_body = log(Animals$body), log_brain = log(Animals$brain))
Boxplot of logarithms of body weight and brain weight
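The boxplots can be reproduced with base R graphics. This is a minimal sketch, not necessarily the code behind the original figure: it rebuilds X so that it runs on its own, and the axis label, title and the name bp are illustrative choices.

```r
library(MASS)
data("Animals")

# Log-transform both variables, as in the definition of X above
X <- data.frame(log_body = log(Animals$body), log_brain = log(Animals$brain))

# Side-by-side boxplots of the two log-transformed variables
bp <- boxplot(X, ylab = "log(weight)",
              main = "Animals: log body and log brain weight")

# Values flagged by the univariate boxplot rule (beyond 1.5 * IQR from the box)
bp$out
```

A boxplot looks at each variable separately, so it cannot flag an animal whose brain weight is unusual only relative to its body weight. That is what the multivariate distances are for.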
The Mahalanobis (or generalized) distance of an observation is its distance to the center of the data, taking the covariance matrix into account
Classical Mahalanobis distances: the sample mean is used as the estimate of location and the sample covariance matrix as the estimate of scatter
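In symbols, the distance of observation x to center mu with scatter matrix Sigma is sqrt((x - mu)' Sigma^{-1} (x - mu)). The sketch below computes the classical version for the Animals data. It rebuilds X so that it runs on its own, and the names md.cl and d1 are introduced only for illustration. Note that stats::mahalanobis() returns squared distances, so the square root is taken explicitly.

```r
library(MASS)
data("Animals")
X <- data.frame(log_body = log(Animals$body), log_brain = log(Animals$brain))

# Classical estimates: sample mean and sample covariance matrix
animals.clcenter <- colMeans(X)
animals.clcov <- cov(X)

# mahalanobis() returns SQUARED distances; take the square root
md.cl <- sqrt(mahalanobis(X, center = animals.clcenter, cov = animals.clcov))
names(md.cl) <- rownames(Animals)

# The same computation written out by hand for the first observation
d1 <- as.numeric(X[1, ]) - animals.clcenter
md.first <- sqrt(drop(t(d1) %*% solve(animals.clcov) %*% d1))

# Animals with the largest classical distances
head(sort(md.cl, decreasing = TRUE))
```

Because the sample mean and covariance are computed from all observations, outliers can pull both estimates towards themselves and so shrink their own distances.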
To detect multivariate outliers, the Mahalanobis distance is compared with a cut-off value derived from the chi-square distribution: the square root of its 0.975 quantile, with as many degrees of freedom as there are variables
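The comparison with the cut-off can be sketched as follows. This assumes the same log-transformed data X (rebuilt here so that the snippet runs on its own). The names md.cl, cutoff and outlier.cl are illustrative.

```r
library(MASS)
data("Animals")
X <- data.frame(log_body = log(Animals$body), log_brain = log(Animals$brain))

# Classical Mahalanobis distances (square root of the squared distances)
md.cl <- sqrt(mahalanobis(X, center = colMeans(X), cov = cov(X)))
names(md.cl) <- rownames(Animals)

# Cut-off: square root of the 0.975 chi-square quantile with p = 2 degrees of freedom
cutoff <- sqrt(qchisq(0.975, df = ncol(X)))

# Observations whose distance exceeds the cut-off are flagged as outliers
outlier.cl <- md.cl > cutoff
names(md.cl)[outlier.cl]
```

With two variables the cut-off is about 2.72. The distances are compared on the unsquared scale here, which is why the square root of the quantile is used.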
In two dimensions we can construct the corresponding $97.5\%$ tolerance ellipse, which is the set of points whose Mahalanobis distance does not exceed the cut-off value
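# Classical estimates of location and scatter, and the radius of the tolerance ellipse
animals.clcenter <- colMeans(X)
animals.clcov <- cov(X)
rad <- sqrt(qchisq(0.975, df = ncol(X)))

# Points on the boundary of the classical tolerance ellipse
library(car)
ellipse.cl <- data.frame(ellipse(center = animals.clcenter, shape = animals.clcov,
                                 radius = rad, segments = 100, draw = FALSE))
colnames(ellipse.cl) <- colnames(X)

# Assumption: fig is a ggplot2 scatterplot of X; its definition is not shown in the
# source, so the next two lines are an illustrative reconstruction
library(ggplot2)
fig <- ggplot(X, aes(x = log_body, y = log_brain)) + geom_point()

# Add the classical ellipse and its center to the scatterplot
fig <- fig +
  geom_polygon(data = ellipse.cl, color = "dodgerblue", fill = "dodgerblue", alpha = 0.2) +
  geom_point(aes(x = animals.clcenter[1], y = animals.clcenter[2]), color = "blue", size = 6)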
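The Minimum Covariance Determinant (MCD) estimator of Rousseeuw is a popular robust estimator of multivariate location and scatter. It searches for the subset of observations whose classical covariance matrix has the smallest determinant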
Robust estimates of location and scatter using MCD
library(robustbase)
animals.mcd <- covMcd(X)
# Robust estimate of location
animals.mcd$center
# Robust estimate of scatter
animals.mcd$cov
By plugging in these robust estimates of location and scatter in the definition of the Mahalanobis distances, we obtain robust distances and can create a robust tolerance ellipsoid.
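The robust distances can be sketched in the same way as the classical ones, with the MCD estimates plugged in. The snippet rebuilds X so that it runs on its own. The names rd and cutoff and the seed value are illustrative assumptions.

```r
library(MASS)
library(robustbase)
data("Animals")
X <- data.frame(log_body = log(Animals$body), log_brain = log(Animals$brain))

# covMcd() draws random starting subsets by default; fix the seed for reproducibility
set.seed(1)
animals.mcd <- covMcd(X)

# Robust distances: Mahalanobis distances based on the MCD center and scatter
rd <- sqrt(mahalanobis(X, center = animals.mcd$center, cov = animals.mcd$cov))
names(rd) <- rownames(Animals)

# Same chi-square cut-off as for the classical distances
cutoff <- sqrt(qchisq(0.975, df = ncol(X)))

# Observations flagged by the robust distances
names(rd)[rd > cutoff]
```

Because the MCD estimates are hardly influenced by outlying observations, the robust distances of those observations are not masked in the way classical distances can be.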
library(robustbase)
animals.mcd <- covMcd(X)

# Points on the boundary of the robust tolerance ellipse
ellipse.mcd <- data.frame(ellipse(center = animals.mcd$center, shape = animals.mcd$cov,
                                  radius = rad, segments = 100, draw = FALSE))
colnames(ellipse.mcd) <- colnames(X)

# Add the robust ellipse and its center to the existing plot
fig2 <- fig +
  geom_polygon(data = ellipse.mcd, color = "red", fill = "red", alpha = 0.3) +
  geom_point(aes(x = animals.mcd$center[1], y = animals.mcd$center[2]), color = "red", size = 6)
MCD object
# Distance-distance plot: robust distances against classical Mahalanobis distances
plot(animals.mcd, which = "dd")