Detecting multivariate outliers

Fraud Detection in R

Tim Verdonck

Professor Data Science at KU Leuven

Animals data

  • The Animals dataset in the MASS package contains average brain and body weights for 28 species of land animals (body weight in kg, brain weight in g)
library(MASS)
data("Animals")
                    body brain
Mountain beaver     1.35   8.1
Cow               465.00 423.0
Grey wolf          36.33 119.5
Goat               27.66 115.0
  • Apply a logarithmic transformation to both body and brain weight
X <- data.frame(log_body = log(Animals$body), log_brain = log(Animals$brain))

Animals data: univariate outlier detection

Boxplot of logarithms of body weight and brain weight

[Figure: boxplots of log body weight and log brain weight]
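The boxplot check can be reproduced roughly as follows. This is a minimal sketch in base R, not necessarily the code behind the slide's figure; the names `out_body` and `out_brain` are introduced here for illustration.

```r
library(MASS)
data("Animals")
X <- data.frame(log_body = log(Animals$body), log_brain = log(Animals$brain))

# Draw one boxplot per variable
boxplot(X, main = "Log body and brain weight")

# boxplot.stats() returns the points lying beyond the whiskers
# (default 1.5 * IQR rule) in its $out component
out_body  <- boxplot.stats(X$log_body)$out
out_brain <- boxplot.stats(X$log_brain)$out
```

Each variable is inspected on its own here, so an observation with an unusual combination of body and brain weight can go unnoticed.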


Animals data: scatterplot

[Figure: scatterplot of log brain weight versus log body weight]
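Later slides add layers to a plot object called `fig` that the slides never define. A minimal sketch, assuming `fig` is a ggplot2 scatterplot of the two log-transformed variables (the point size and axis labels are assumptions):

```r
library(MASS)
library(ggplot2)
data("Animals")
X <- data.frame(log_body = log(Animals$body), log_brain = log(Animals$brain))

# Assumed definition of `fig`: a scatterplot of log brain weight
# against log body weight
fig <- ggplot(X, aes(x = log_body, y = log_brain)) +
  geom_point(size = 3) +
  labs(x = "log(body)", y = "log(brain)")
```

Because the aesthetics are set in `ggplot()`, later layers such as `geom_polygon()` inherit them as long as their data has the same column names as `X`.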


Mahalanobis distance

The Mahalanobis (or generalized) distance of an observation is its distance to the center of the data, taking the covariance matrix into account

[Figure: Mahalanobis distance compared with Euclidean distance]
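In symbols, writing $\mu$ for the center and $\Sigma$ for the covariance matrix (notation assumed here, it does not appear on the slides), the distance of an observation $x_i$ is commonly written as:

$$MD(x_i) = \sqrt{(x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)}$$

When $\Sigma$ is the identity matrix this reduces to the Euclidean distance $\lVert x_i - \mu \rVert$.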


Mahalanobis distance to detect multivariate outliers

  • Classical Mahalanobis distances: the sample mean is used as the estimate of location and the sample covariance matrix as the estimate of scatter

  • To detect multivariate outliers, the Mahalanobis distance is compared with a cut-off value derived from the chi-square distribution

  • In two dimensions we can construct the corresponding $97.5\%$ tolerance ellipsoid, which is defined by the points whose Mahalanobis distance does not exceed the cut-off value
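The comparison with the cut-off can be sketched as follows, using `mahalanobis()` from base R. The variable names `md`, `cutoff` and `flagged` are chosen here for illustration; the cut-off is the square root of the 0.975 chi-square quantile with as many degrees of freedom as there are variables.

```r
library(MASS)
data("Animals")
X <- data.frame(log_body = log(Animals$body), log_brain = log(Animals$brain))

# Classical estimates of location and scatter
center  <- colMeans(X)
scatter <- cov(X)

# mahalanobis() returns squared distances, so take the square root
md <- sqrt(mahalanobis(X, center = center, cov = scatter))

# Cut-off: about 2.72 for two variables
cutoff <- sqrt(qchisq(0.975, df = ncol(X)))

# Animals whose classical distance exceeds the cut-off
flagged <- rownames(Animals)[md > cutoff]
```

The distance and the cut-off must be on the same scale: either compare `md` with `sqrt(qchisq(...))` as above, or compare the squared distances with `qchisq(...)` directly.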


Tolerance ellipsoid based on Mahalanobis distance

animals.clcenter <- colMeans(X)
animals.clcov <- cov(X)
rad <- sqrt(qchisq(0.975, df = ncol(X)))

library(car)
ellipse.cl <- data.frame(ellipse(center = animals.clcenter, shape = animals.clcov,
                                 radius = rad, segments = 100, draw = FALSE))
colnames(ellipse.cl) <- colnames(X)

# Add the classical tolerance ellipse and its center to the existing plot object fig
fig <- fig +
  geom_polygon(data = ellipse.cl, color = "dodgerblue", fill = "dodgerblue", alpha = 0.2) +
  geom_point(aes(x = animals.clcenter[1], y = animals.clcenter[2]), color = "blue", size = 6)

[Figure: classical tolerance ellipsoid for the Animals data]


Robust estimates of location and scatter

The Minimum Covariance Determinant (MCD) estimator of Rousseeuw is a popular robust estimator of multivariate location and scatter

  • MCD looks for the $h$ observations whose classical covariance matrix has the lowest possible determinant
  • The MCD estimate of location is the mean of these $h$ observations
  • The MCD estimate of scatter is the sample covariance matrix of these $h$ observations (multiplied by a consistency factor)
  • A reweighting step is applied to improve efficiency for normally distributed data
  • Computing the MCD exactly is difficult, but several fast algorithms have been proposed
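To make the search for the $h$ observations concrete, here is a simplified sketch based on concentration steps (C-steps), the idea used by fast MCD algorithms. It is not the implementation in `covMcd()`: the choice of `h`, the 50 random starts and the helper `c_step()` are assumptions for illustration, and the consistency factor and reweighting step are left out.

```r
library(MASS)
data("Animals")
X <- as.matrix(log(Animals[, c("body", "brain")]))
n <- nrow(X)
p <- ncol(X)
h <- floor((n + p + 1) / 2)  # assumed subset size (15 here)

# One C-step: estimate center and scatter from the current subset,
# then keep the h observations closest to that center
c_step <- function(X, idx, h) {
  center  <- colMeans(X[idx, , drop = FALSE])
  scatter <- cov(X[idx, , drop = FALSE])
  d2 <- mahalanobis(X, center, scatter)
  sort(order(d2)[seq_len(h)])
}

set.seed(1)
best_det <- Inf
best_idx <- NULL
for (start in 1:50) {
  idx <- sort(sample(n, h))          # random initial subset
  for (iter in 1:100) {              # iterate until the subset is stable
    new_idx <- c_step(X, idx, h)
    if (identical(new_idx, idx)) break
    idx <- new_idx
  }
  det_h <- det(cov(X[idx, , drop = FALSE]))
  if (det_h < best_det) {            # keep the subset with the lowest determinant
    best_det <- det_h
    best_idx <- idx
  }
}

# Raw (unscaled, unreweighted) estimates of location and scatter
raw_center  <- colMeans(X[best_idx, , drop = FALSE])
raw_scatter <- cov(X[best_idx, , drop = FALSE])
```

Each C-step cannot increase the determinant, which is why iterating from several random starts is a practical substitute for trying every possible subset.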

Robust distance

Robust estimates of location and scatter using MCD

library(robustbase)
animals.mcd <- covMcd(X)

# Robust estimate of location
animals.mcd$center 

# Robust estimate of scatter
animals.mcd$cov

By plugging these robust estimates of location and scatter into the definition of the Mahalanobis distance, we obtain robust distances and can create a robust tolerance ellipsoid.
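A minimal sketch of that plug-in step, reusing `mahalanobis()` with the MCD estimates. The names `rd`, `cutoff` and `flagged_robust` are introduced here for illustration, and the seed is set only because `covMcd()` uses random starting subsets.

```r
library(MASS)
library(robustbase)
data("Animals")
X <- data.frame(log_body = log(Animals$body), log_brain = log(Animals$brain))

set.seed(1)  # covMcd() starts from random subsets
animals.mcd <- covMcd(X)

# Robust distances: same formula as before, with robust center and scatter
rd <- sqrt(mahalanobis(X, center = animals.mcd$center, cov = animals.mcd$cov))

# Same chi-square based cut-off as for the classical distances
cutoff <- sqrt(qchisq(0.975, df = ncol(X)))

# Animals flagged by the robust distances
flagged_robust <- rownames(Animals)[rd > cutoff]
```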


Animals: robust tolerance ellipsoid

library(robustbase)
animals.mcd <- covMcd(X)
ellipse.mcd <- data.frame(ellipse(center = animals.mcd$center, 
                                   shape = animals.mcd$cov,
                                   radius = rad, segments = 100, draw = FALSE))
colnames(ellipse.mcd) <- colnames(X)

fig2 <- fig +
  geom_polygon(data = ellipse.mcd, color = "red", fill = "red", alpha = 0.3) +
  geom_point(aes(x = animals.mcd$center[1], y = animals.mcd$center[2]), color = "red", size = 6)

[Figure: robust tolerance ellipsoid for the Animals data]


Distance-distance plot

  • When $p>3$ it is not possible to visualize the tolerance ellipsoid.
  • The distance-distance plot shows the robust distance of each observation versus its classical Mahalanobis distance. It can be obtained directly from the MCD object:
    plot(animals.mcd, which = "dd")
    
    [Figure: distance-distance plot for the Animals data]
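The same kind of plot can also be assembled by hand, which makes it easy to list the observations behind each point. This is a sketch under the assumption that both axes use the same chi-square cut-off; the data frame `dd` and its column names are hypothetical.

```r
library(MASS)
library(robustbase)
data("Animals")
X <- data.frame(log_body = log(Animals$body), log_brain = log(Animals$brain))

set.seed(1)  # covMcd() starts from random subsets
animals.mcd <- covMcd(X)

# Classical and robust distances for every animal
md <- sqrt(mahalanobis(X, colMeans(X), cov(X)))
rd <- sqrt(mahalanobis(X, animals.mcd$center, animals.mcd$cov))
cutoff <- sqrt(qchisq(0.975, df = ncol(X)))

# Robust distance against classical distance, with the cut-off on both axes
plot(md, rd, xlab = "Mahalanobis distance", ylab = "Robust distance")
abline(h = cutoff, v = cutoff, lty = 2)

# Table of distances and flags per animal
dd <- data.frame(animal = rownames(Animals), md = md, rd = rd,
                 classical_outlier = md > cutoff,
                 robust_outlier = rd > cutoff)
dd[dd$robust_outlier, ]
```

Points above the horizontal line but left of the vertical line are flagged only by the robust distances.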

Animals: check outliers

[Figure: outliers detected in the Animals data]


Let's practice!
