Basics of hierarchical clustering

Cluster Analysis in Python

Shaumik Daityari

Business Analyst

Creating a distance matrix using linkage

scipy.cluster.hierarchy.linkage(observations, 
                                method='single', 
                                metric='euclidean', 
                                optimal_ordering=False
)
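  • method: how to calculate the proximity of two clusters
  • metric: distance metric between individual observations
  • optimal_ordering: whether to reorder the data points so that successive leaves are as close as possible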
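
As a minimal sketch of the call above, the snippet below runs linkage() on a small made-up dataset. The array observations and its values are assumptions chosen only for illustration; they are not from the course.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: six 2-D observations in two loose groups
observations = np.array([
    [1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
    [8.0, 8.0], [8.5, 8.3], [7.9, 8.4],
])

# Build the hierarchy; each row of Z describes one merge:
# [index of cluster a, index of cluster b, distance, size of merged cluster]
Z = linkage(observations, method='ward', metric='euclidean')

# n observations always produce n - 1 merges
print(Z.shape)
```

The last row of Z is the final merge, so its fourth column equals the total number of observations.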

Which method should you use?

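  • 'single': distance between the two closest objects of the clusters
  • 'complete': distance between the two farthest objects of the clusters
  • 'average': arithmetic mean of the distances between all pairs of objects
  • 'centroid': distance between the cluster centroids (the mean of each cluster's objects)
  • 'median': distance between cluster centroids, where a merged cluster's centroid is the midpoint of the two merged centroids
  • 'ward': merges the pair of clusters that gives the smallest increase in the within-cluster sum of squares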
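
To make the difference between methods concrete, here is a rough sketch that computes the proximity of two tiny clusters by hand under four of the methods. The clusters a and b are invented for illustration, and the calculation uses cdist and NumPy directly rather than linkage() itself.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters with two points each
a = np.array([[0.0, 0.0], [0.0, 2.0]])
b = np.array([[3.0, 0.0], [5.0, 0.0]])

# Euclidean distance between every point in a and every point in b
pairwise = cdist(a, b)

single = pairwise.min()     # two closest objects
complete = pairwise.max()   # two farthest objects
average = pairwise.mean()   # mean of all pairwise distances
# distance between the two cluster means
centroid = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

print(single, complete, average, centroid)
```

For the same pair of clusters, the single value is never larger than the average, and the average is never larger than the complete value, which is one reason the methods can merge clusters in a different order.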

Create cluster labels with fcluster

scipy.cluster.hierarchy.fcluster(distance_matrix, 
                                 num_clusters,
                                 criterion
)
  • distance_matrix: output of the linkage() function
  • num_clusters: number of clusters (interpreted as the maximum number of clusters when criterion='maxclust')
  • criterion: how to decide thresholds to form clusters
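
A short sketch of the two steps together, again on made-up data: linkage() builds the hierarchy and fcluster() cuts it into flat labels. The variable names and values are assumptions for illustration; criterion='maxclust' is used so that the second argument is read as a cluster count.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: six 2-D observations in two loose groups
observations = np.array([
    [1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
    [8.0, 8.0], [8.5, 8.3], [7.9, 8.4],
])

distance_matrix = linkage(observations, method='ward', metric='euclidean')

# Cut the hierarchy into at most 2 flat clusters; labels start at 1
labels = fcluster(distance_matrix, 2, criterion='maxclust')
print(labels)
```

The result is one integer label per observation, in the same order as the rows of observations.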

Hierarchical clustering with ward method


Hierarchical clustering with single method


Hierarchical clustering with complete method


Final thoughts on selecting a method

  • No one right method for all
  • Need to carefully understand the distribution of data
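
One way to explore this in practice is to run several methods on the same data and compare the resulting cluster sizes. The sketch below assumes synthetic data drawn from two Gaussian blobs; the seed, blob centers, and the dictionary cluster_sizes are illustrative choices, not part of the course material.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two Gaussian blobs of 30 points each
rng = np.random.default_rng(0)
observations = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(30, 2)),
    rng.normal(loc=[5, 5], scale=1.0, size=(30, 2)),
])

cluster_sizes = {}
for method in ('ward', 'single', 'complete'):
    Z = linkage(observations, method=method, metric='euclidean')
    labels = fcluster(Z, 2, criterion='maxclust')
    # Labels start at 1, so drop the count for label 0
    cluster_sizes[method] = np.bincount(labels)[1:].tolist()
    print(method, cluster_sizes[method])
```

If the sizes differ sharply between methods, that is a hint to plot the data and check which grouping matches its actual shape.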

Let's try some exercises

