Clustering analysis: choosing the optimal number of clusters

Practicing Machine Learning Interview Questions in Python

Lisa Stuart

Data Scientist

Methods for optimal k

  • Silhouette method
  • Elbow method

Silhouette coefficient

  • Composed of 2 scores
    • Mean distance between each observation and all others:
      • in the same cluster
      • in the nearest cluster
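
The two scores above are usually combined per observation as follows. This is the standard definition rather than something shown on the slide; the symbols a(i), b(i), and s(i) are notation introduced here, with a(i) the mean distance to the other observations in the same cluster and b(i) the mean distance to the observations in the nearest other cluster.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

When a(i) is much smaller than b(i), s(i) approaches 1; when a(i) is much larger than b(i), it approaches -1.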

Silhouette coefficient values

  • Between -1 and 1
    • 1
      • near others in same cluster
      • very far from others in other clusters
    • -1
      • not near others in same cluster
      • close to others in other clusters
    • 0
      • denotes overlapping clusters
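
To make these values concrete, here is a minimal sketch that computes the coefficient by hand for a toy dataset and compares it with `sklearn.metrics.silhouette_samples`. The data, the labels, and the helper `silhouette_by_hand` are made up for illustration and are not part of the course material.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1-D data: two tight, well-separated groups (illustrative values only)
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels, i):
    """Hypothetical helper: silhouette coefficient of observation i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the observation itself
    a = dists[same].mean()  # mean distance within its own cluster
    # mean distance to the nearest other cluster
    b = min(dists[labels == other].mean()
            for other in set(labels) - {labels[i]})
    return (b - a) / max(a, b)

manual = np.array([silhouette_by_hand(X, labels, i) for i in range(len(X))])
library = silhouette_samples(X, labels)
```

Because the two groups are tight and far apart, every coefficient here should land close to 1. The helper assumes every cluster has at least two observations.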

Silhouette score

Silhouette score plot

Source: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
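
A minimal sketch of using the silhouette score to pick k, assuming synthetic data from `make_blobs` with four well-separated centers. The dataset, the candidate range of k, and the names `sil_scores` and `best_k` are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four clearly separated groups (assumed for illustration)
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers,
                  cluster_std=0.6, random_state=42)

sil_scores = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)
    sil_scores[k] = silhouette_score(X, labels)

# Candidate k with the highest mean silhouette coefficient
best_k = max(sil_scores, key=sil_scores.get)
```

On data this cleanly separated one would expect the highest score at k = 4; on real data the maximum is often less pronounced and is best read alongside the per-cluster silhouette plot.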

Elbow method

Elbow method plot

Source: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/
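
A minimal sketch of the elbow method: fit K-Means for a range of k, record the inertia for each fit, and look for the k after which inertia stops dropping sharply. The synthetic data and the variable names `k_values` and `inertias` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four clearly separated groups (assumed for illustration)
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers,
                  cluster_std=0.6, random_state=0)

k_values = range(1, 9)
inertias = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans.fit(X)
    # inertia_: sum of squared distances to the closest cluster center
    inertias.append(kmeans.inertia_)
```

Plotting `inertias` against `k_values` (for example with matplotlib) gives the elbow plot; with this data the bend should appear at k = 4, since adding a fifth cluster reduces inertia only slightly.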

Optimal k selection functions

Function/method                    Returns
sklearn.cluster.KMeans             K-Means clustering estimator
sklearn.metrics.silhouette_score   mean silhouette coefficient over all observations, between -1 and 1; measures cluster cohesion and separation
kmeans.inertia_                    sum of squared distances of observations to their closest cluster center
range(start, stop)                 sequence of integers beginning with start, up to but not including stop
list.append(kmeans.inertia_)       appends the inertia value to a list (returns None)
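
As a quick check on two of the entries above, this sketch recomputes `inertia_` by hand and shows what `range(start, stop)` produces. The random data and the names `manual_inertia` and `k_candidates` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random 2-D data, used only to have something to cluster
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Center assigned to each observation, then the sum of squared distances
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centers) ** 2)

# range(start, stop) excludes stop; wrap it in list() to see the values
k_candidates = list(range(2, 5))
```

`manual_inertia` should agree with `kmeans.inertia_` up to floating-point error, and `k_candidates` holds 2, 3, and 4.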

Let's practice!
