Other clustering fraud detection methods

Fraud Detection in Python

Charlotte Werger

Data Scientist

There are many different clustering methods

And different ways of flagging fraud: using smallest clusters

In reality it looks more like this

DBSCAN versus K-means

No need to predefine the number of clusters
Adjust the maximum distance between neighboring points within clusters (eps)
Set the minimum number of samples in a neighborhood for a point to count as a core point (min_samples)
Better performance on weirdly shaped data
But: higher computational costs
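
The comparison above can be sketched on synthetic data. This is a minimal illustration, assuming scikit-learn's two-moons toy dataset rather than the fraud data used in this course; the names X_moons, km_labels and db_labels, and the values eps=0.3 and min_samples=5, are choices made for this sketch only.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic, crescent-shaped data (an assumption, not the course's fraud data)
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
X_moons = StandardScaler().fit_transform(X)

# K-means needs the number of clusters up front
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# DBSCAN derives the number of clusters from eps and min_samples
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)
n_clusters_db = len(set(db_labels)) - (1 if -1 in db_labels else 0)

# Adjusted Rand index: agreement between found clusters and the true moons
ari_km = adjusted_rand_score(y_true, km_labels)
ari_db = adjusted_rand_score(y_true, db_labels)

print("DBSCAN clusters found: %d" % n_clusters_db)
print("K-means ARI: %0.3f" % ari_km)
print("DBSCAN ARI: %0.3f" % ari_db)
```

On data like this, K-means tends to cut each crescent in half, while DBSCAN follows the dense regions. The eps and min_samples values usually need tuning for each dataset.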

Implementing DBSCAN

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.5, min_samples=10, n_jobs=-1).fit(X_scaled)

# Get the cluster labels (aka numbers)
pred_labels = db.labels_

# Count the total number of clusters
n_clusters_ = len(set(pred_labels)) - (1 if -1 in pred_labels else 0)
# Print model results
print('Estimated number of clusters: %d' % n_clusters_)

Estimated number of clusters: 31
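
Why the cluster count subtracts one: DBSCAN labels noise points as -1, and that label should not be counted as a cluster. A small sketch with hand-made labels (the array below is hypothetical and only stands in for db.labels_):

```python
import numpy as np

# Hypothetical labels standing in for db.labels_; -1 marks noise points
pred_labels = np.array([0, 0, 1, -1, 2, 1, -1, 0])

# Distinct labels, minus the noise label if it is present
n_clusters_ = len(set(pred_labels)) - (1 if -1 in pred_labels else 0)

# Number of points DBSCAN left unassigned
n_noise_ = int(np.sum(pred_labels == -1))

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
```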

Checking the size of the clusters

from sklearn import metrics

# Print the silhouette score of the clustering
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X_scaled, pred_labels))

Silhouette Coefficient: 0.359
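
Note that silhouette_score treats the noise label -1 like any other cluster label. A minimal sketch, assuming synthetic blob data and the hypothetical names X_demo, labels and mask, of scoring with and without the noise points:

```python
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic blobs (an assumption, not the course's fraud data)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.4, random_state=0)
X_demo = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=10).fit(X_demo).labels_

# Keep only the points that were assigned to a cluster
mask = labels != -1

# Score with the noise points included as if they formed a cluster
score_all = metrics.silhouette_score(X_demo, labels)
# Score on the clustered points only
score_clustered = metrics.silhouette_score(X_demo[mask], labels[mask])

print("Silhouette (all points): %0.3f" % score_all)
print("Silhouette (noise removed): %0.3f" % score_clustered)
```

Which of the two scores is more useful depends on whether the noise points are themselves of interest, as they may be in fraud detection.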

import numpy as np

# Get sample counts in each cluster (noise points, labeled -1, are excluded)
counts = np.bincount(pred_labels[pred_labels >= 0])
print(counts)

[ 763  496  840  355 1086  676   63  306  560  134   28   18  262  128  332  22  
   22   13   31   38   36   28   14   12   30   10   11   10   21   10    5]

Let's practice!
