How many clusters?

Cluster Analysis in Python

Shaumik Daityari

Business Analyst

How to find the right k?

  • No absolute method to find the right number of clusters (k) in k-means clustering
  • Elbow method


Distortions revisited

  • Distortion: sum of squared distances of points from cluster centers
  • Decreases with an increasing number of clusters
  • Becomes zero when the number of clusters equals the number of points
  • Elbow plot: line plot of distortion against the number of clusters
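
The first three bullets can be checked numerically. The sketch below is illustrative only: the synthetic points and the helper name `sum_of_squared_distances` are assumptions, not part of the course material. Note that SciPy's `kmeans` returns the mean (non-squared) distance as its distortion value, so the sum of squared distances is computed here from the distances that `vq` returns.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Synthetic data (an assumption for illustration): three tight blobs of 5 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(5, 2)) for c in centers])


def sum_of_squared_distances(data, k):
    """Fit k-means with k clusters and return the sum of squared distances
    from each point to its nearest cluster center (hypothetical helper)."""
    centroids, _ = kmeans(data, k)
    # vq returns, for each point, the distance to its closest centroid
    _, dists = vq(data, centroids)
    return float((dists ** 2).sum())


n_points = len(points)
sse = {k: sum_of_squared_distances(points, k) for k in (1, 3, n_points)}
for k, value in sse.items():
    print(f"k={k:2d}  sum of squared distances={value:.3f}")
```

With one cluster the value is largest, with three it drops sharply, and with as many clusters as points it is effectively zero because every point is its own center.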


Elbow method

  • Elbow plot: plot of distortion against the number of clusters
  • The elbow, where distortion stops dropping sharply, indicates the number of clusters present in the data
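
The elbow is usually read off the plot by eye. As a rough complement, one possible heuristic (an assumption here, not the course's method) is to pick the k where the curve bends most, that is, where the second difference of the distortions is largest. The distortion values below are made up purely for illustration, and `find_elbow` is a hypothetical helper.

```python
import numpy as np


def find_elbow(ks, distortions):
    """Return the k at which the distortion curve bends most sharply,
    measured by the largest second difference (a simple heuristic)."""
    second_diff = np.diff(np.asarray(distortions, dtype=float), n=2)
    # second_diff[i] is centered on position i + 1 of the original sequence
    return ks[int(np.argmax(second_diff)) + 1]


# Made-up distortions for k = 1..6 (illustrative values, not real results)
ks = [1, 2, 3, 4, 5, 6]
example_distortions = [10.0, 6.0, 1.0, 0.8, 0.7, 0.6]

elbow_k = find_elbow(ks, example_distortions)
print(elbow_k)
```

On a smooth curve with no clear bend this heuristic still returns some k, so it is best treated as a hint to check against the plot.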

Elbow method in Python

# Imports needed by this snippet
from scipy.cluster.vq import kmeans
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to already hold the columns 'scaled_x' and 'scaled_y'
# Declaring variables for use
distortions = []
num_clusters = range(2, 7)

# Populating distortions for various clusters
for i in num_clusters:
    centroids, distortion = kmeans(df[['scaled_x', 'scaled_y']], i)
    distortions.append(distortion)

# Plotting elbow plot data
elbow_plot_data = pd.DataFrame({'num_clusters': num_clusters,
                                'distortions': distortions})

sns.lineplot(x='num_clusters', y='distortions',
             data=elbow_plot_data)
plt.show()

Final thoughts on using the elbow method

  • Only gives an indication of the optimal k (number of clusters)
  • Does not always pinpoint a single value of k
  • Other methods: average silhouette and gap statistic
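
As a rough illustration of the average silhouette alternative, the sketch below scores several values of k and keeps the one with the highest mean silhouette. It assumes scikit-learn is installed for `silhouette_score` (the slides themselves only use SciPy), and the synthetic three-blob data is made up for the example.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption for illustration): three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

silhouette_by_k = {}
for k in range(2, 7):
    centroids, _ = kmeans(points, k)
    # Assign each point to its nearest centroid to obtain cluster labels
    labels, _ = vq(points, centroids)
    # Mean silhouette over all points; values closer to 1 mean tighter, better-separated clusters
    silhouette_by_k[k] = silhouette_score(points, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k)
```

Unlike the elbow plot, this gives a single score per k that can be compared directly, although it cannot evaluate k = 1.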

Next up: exercises
