Clustering analysis: selecting the right clustering algorithm

Practicing Machine Learning Interview Questions in Python

Lisa Stuart

Data Scientist

Clustering algorithms

  • Features >> observations (far more features than observations)
  • Model training more challenging
  • Rely on distance calculations
  • Most commonly used unsupervised technique

Practical applications of clustering

  • Customer segmentation
  • Document classification
  • Insurance/transaction fraud detection
  • Image segmentation
  • Anomaly detection
  • Many more...

Distance metrics: Manhattan (taxicab) distance

Manhattan plot

1 https://en.wikipedia.org/wiki/Taxicab_geometry
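
As a small illustration of the metric named on this slide, Manhattan distance sums the absolute coordinate differences between two points. The sketch below is a minimal example; the helper name `manhattan_distance` and the points `p` and `q` are made up for illustration and do not come from the course.

```python
import numpy as np

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))

# Illustrative points, not taken from the slides
p = [1, 2]
q = [4, 6]
d_manhattan = manhattan_distance(p, q)  # |1 - 4| + |2 - 6| = 7
print(d_manhattan)
```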

Distance metrics: Euclidean distance

Euclidean plot

1 http://rosalind.info/glossary/euclidean-distance/
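
Euclidean distance is the straight-line distance between two points: the square root of the summed squared coordinate differences. A minimal sketch follows; the helper name `euclidean_distance` and the two points are illustrative assumptions, not part of the course material.

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the summed squared differences (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Illustrative points, not taken from the slides
d_euclidean = euclidean_distance([1, 2], [4, 6])  # sqrt(3**2 + 4**2) = 5
print(d_euclidean)
```

For the same pair of points the Manhattan distance is 7, so the two metrics can rank neighbours differently.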

K-means

K-means steps

  1. Initial centroids
  2. Assign each observation to nearest centroid
  3. Create new centroids
  4. Repeat steps 2 and 3
1 http://sherrytowers.com/2013/10/24/k-means-clustering/
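
The four steps above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the function name `kmeans_sketch`, the toy array `X`, random selection of the initial centroids, and the stop-when-centroids-stop-moving rule are all choices made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy K-means following the four steps listed above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each observation to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: new centroid = mean of the observations assigned to it
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: repeat steps 2 and 3 until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
```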

Hierarchical agglomerative clustering

Hierarchical clustering dendrogram

1 https://www.datanovia.com/en/lessons/agglomerative-hierarchical-clustering/
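
A dendrogram like the one pictured can be built with `scipy.cluster.hierarchy`. The sketch below assumes a small toy array `X` and Ward linkage; `no_plot=True` makes `dendrogram()` return the tree layout as a dictionary without drawing it (drop that argument, with matplotlib installed, to draw the figure). Cutting the tree into two flat clusters with `fcluster` is an extra step added here for illustration.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Two well-separated toy groups (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Agglomerative merge history: one row per merge, n - 1 rows in total
Z = sch.linkage(X, method="ward")

# Compute the dendrogram layout without drawing it
dendro = sch.dendrogram(Z, no_plot=True)

# Cut the tree so that at most two flat clusters remain
flat_labels = sch.fcluster(Z, t=2, criterion="maxclust")
print(dendro["leaves"], flat_labels)
```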

Agglomerative clustering linkage

  • Ward linkage
  • Maximum/complete linkage
  • Average linkage
  • Single linkage
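
The four linkage options listed above correspond to the `linkage` argument of `sklearn.cluster.AgglomerativeClustering`. The following sketch fits one model per linkage on toy data; the array `X` and the choice of two clusters are assumptions for illustration. On data this cleanly separated all four linkages should agree, whereas on noisier data they often differ.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated toy groups (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

results = {}
for linkage in ("ward", "complete", "average", "single"):
    # Euclidean distance is the default metric; Ward requires it
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    results[linkage] = model.fit_predict(X)

for linkage, labels in results.items():
    print(linkage, labels)
```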

Selecting a clustering algorithm

Cluster distances

  • Cluster stability assessment
  • K-means and hierarchical clustering (HC) use Euclidean distance
  • Inter- and intra-cluster distances

"An appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm." - from Elements of Statistical Learning

1 https://slideplayer.com/slide/8363774/
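
One way to make the inter- and intra-cluster distances above concrete is sketched below. The definitions are assumptions chosen for this example (intra-cluster distance as the mean pairwise distance within a cluster, inter-cluster distance as the distance between cluster centroids), and the helper names, toy data, and labels are hypothetical. The `metric` argument shows how changing the dissimilarity measure changes the numbers.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def intra_cluster_distance(X, labels, metric="euclidean"):
    """Mean pairwise distance inside each cluster (needs >= 2 points per cluster)."""
    return {int(c): float(pdist(X[labels == c], metric=metric).mean())
            for c in np.unique(labels)}

def inter_cluster_distance(X, labels, metric="euclidean"):
    """Pairwise distances between cluster centroids."""
    centroids = np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])
    return cdist(centroids, centroids, metric=metric)

# Toy data and cluster labels (illustrative)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

intra = intra_cluster_distance(X, labels)
inter_euclidean = inter_cluster_distance(X, labels, metric="euclidean")
inter_manhattan = inter_cluster_distance(X, labels, metric="cityblock")
print(intra, inter_euclidean[0, 1], inter_manhattan[0, 1])
```

Small intra-cluster distances combined with large inter-cluster distances suggest compact, well-separated clusters.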

Clustering functions

Function/method                              Returns
sklearn.cluster.KMeans                       K-means clustering algorithm
sklearn.cluster.AgglomerativeClustering      Agglomerative clustering algorithm
kmeans.inertia_                              Sum of squared distances of observations to their closest cluster center
scipy.cluster.hierarchy (imported as sch)    Hierarchical clustering for dendrograms
sch.dendrogram()                             Dendrogram function
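
As a hedged usage sketch of `KMeans` and its `inertia_` attribute, the loop below fits models for a few values of k and records the inertia of each. The toy array `X`, the range of k values, and the settings `n_init=10` and `random_state=0` are assumptions chosen for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

inertias = {}
for k in (1, 2, 3):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_: sum of squared distances to the closest cluster center
    inertias[k] = kmeans.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))
```

Inertia keeps falling as k grows, so it is usually read for where the drop levels off (an "elbow") rather than for its minimum.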

Let's practice!
