Basics of hierarchical clustering

Cluster Analysis in Python

Shaumik Daityari

Business Analyst

Creating a distance matrix using linkage

scipy.cluster.hierarchy.linkage(observations, 
                                method='single', 
                                metric='euclidean', 
                                optimal_ordering=False
)
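  • method: how to calculate the proximity of two clusters
  • metric: distance metric used between observations
  • optimal_ordering: whether to reorder the result so that distances between successive leaves are minimal (slower on large datasets)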
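
A minimal sketch of the call above, assuming a small made-up array of 2-D points (the `observations` values are hypothetical and not from the course data). For n observations, the result has n - 1 rows, one per merge, holding the two merged cluster indices, the merge distance, and the size of the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: two well-separated groups of 2-D points
observations = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
    [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],
])

# Same arguments as in the signature above, using the defaults
Z = linkage(observations,
            method='single',
            metric='euclidean',
            optimal_ordering=False)

# One row per merge: [cluster_a, cluster_b, distance, points_in_new_cluster]
print(Z.shape)  # (5, 4) for 6 observations
print(Z)
```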

Which method should you use?

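  • 'single': based on the two closest objects
  • 'complete': based on the two farthest objects
  • 'average': based on the arithmetic mean of the distances between all pairs of objects
  • 'centroid': based on the distance between the cluster centroids
  • 'median': like 'centroid', but the centroid of a merged cluster is the average of the two centroids it was formed from
  • 'ward': based on the increase in the within-cluster sum of squares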
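
To see how the choice of method changes what counts as "proximity", the sketch below runs several methods on the same made-up points and reads off the distance at which the last two clusters merge. The data is hypothetical; with two clearly separated groups, the closest-pair distance ('single') cannot exceed the mean pairwise distance ('average'), which cannot exceed the farthest-pair distance ('complete').

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: two well-separated groups of 2-D points
observations = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
    [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],
])

final_heights = {}
for method in ('single', 'average', 'complete', 'ward'):
    Z = linkage(observations, method=method, metric='euclidean')
    # Column 2 of the last row is the distance of the final merge
    final_heights[method] = Z[-1, 2]
    print(method, round(final_heights[method], 3))
```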

Create cluster labels with fcluster

scipy.cluster.hierarchy.fcluster(distance_matrix, 
                                 num_clusters,
                                 criterion
)
  • distance_matrix: output of the linkage() function
  • num_clusters: maximum number of clusters to form (when criterion is 'maxclust')
  • criterion: how to decide thresholds to form clusters, for example 'maxclust'
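
A short sketch of the two steps together, assuming made-up data and the 'maxclust' criterion so that the second argument is read as a number of clusters. In SciPy itself the first two parameters are named `Z` and `t`; `distance_matrix` and `num_clusters` are used here only to match the slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two well-separated groups of 2-D points
observations = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
    [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],
])

# Step 1: build the hierarchy
distance_matrix = linkage(observations, method='ward', metric='euclidean')

# Step 2: cut the hierarchy into at most num_clusters flat clusters
num_clusters = 2
labels = fcluster(distance_matrix, num_clusters, criterion='maxclust')

# One label per observation; labels start at 1
print(labels)
```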

Hierarchical clustering with ward method

Hierarchical clustering with single method

Hierarchical clustering with complete method

Final thoughts on selecting a method

  • No single method is right for every dataset
  • You need to understand the distribution of the data carefully before choosing one
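
As an illustration of why the shape of the data matters, the sketch below uses a made-up dataset of two long, parallel rows of points. Neighbours within a row are closer than the gap between rows, so 'single' linkage recovers the two rows; the other methods are printed for comparison and may group the points differently. The data and the variable names are assumptions for this example only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two elongated rows, 21 points each.
# Spacing within a row is 0.5; the rows are 1.0 apart.
x = np.linspace(0, 10, 21)
row_a = np.column_stack([x, np.zeros_like(x)])
row_b = np.column_stack([x, np.ones_like(x)])
observations = np.vstack([row_a, row_b])

labels_by_method = {}
for method in ('single', 'complete', 'ward'):
    Z = linkage(observations, method=method, metric='euclidean')
    labels_by_method[method] = fcluster(Z, 2, criterion='maxclust')
    print(method)
    print(' row a:', labels_by_method[method][:21])
    print(' row b:', labels_by_method[method][21:])
```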

Let's try some exercises
