Cluster labels in hierarchical clustering

Unsupervised Learning in Python

Benjamin Wilson

Director of Research at lateral.io

Cluster labels in hierarchical clustering

  • Not only a visualization tool!
  • Cluster labels at any intermediate stage can be recovered
  • For use in e.g. cross-tabulations

Eurovision hierarchical clustering

Intermediate clusterings & height on dendrogram

  • E.g. at height 15:
    • Bulgaria, Cyprus, Greece are one cluster
    • Russia and Moldova are another
    • Armenia in a cluster on its own

Zoomed in cluster with horizontal line at height 15

Dendrograms show cluster distances

  • Height on dendrogram = distance between merging clusters
  • E.g. clusters with only Cyprus and Greece had distance approx. 6

Zoomed in cluster with Cyprus/Greece cluster highlighted

Dendrograms show cluster distances

  • Height on dendrogram = distance between merging clusters
  • E.g. clusters with only Cyprus and Greece had distance approx. 6
  • This new cluster was at distance approx. 12 from the cluster with only Bulgaria

Zoomed in cluster with Cyprus/Greece cluster and Cyprus/Greece/Bulgaria cluster highlighted
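
The merge heights read off a dendrogram are also stored in the array returned by linkage(). A minimal sketch, assuming a made-up set of four 1-D samples (not the Eurovision data): column 2 of the linkage matrix holds the distance at which each merge happened.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy 1-D samples, invented for illustration only
samples = np.array([[0.0], [1.0], [5.0], [12.0]])

mergings = linkage(samples, method='complete')

# Each row of the linkage matrix describes one merge;
# column 2 is the distance between the two clusters being merged,
# i.e. the height at which the merge is drawn on the dendrogram.
merge_heights = mergings[:, 2]
print(merge_heights)
```

With these toy samples the first merge joins the points at 0 and 1, so the lowest height is their distance of 1.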

Intermediate clusterings & height on dendrogram

  • Height on dendrogram specifies max. distance between merging clusters
  • Don't merge clusters further apart than this (e.g. 15)

Zoomed in cluster with horizontal line at height 15

Distance between clusters

  • Defined by a "linkage method"
  • In "complete" linkage: distance between clusters is max. distance between their samples
  • Specified via method parameter, e.g. linkage(samples, method="complete")
  • Different linkage method, different hierarchical clustering!
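
To illustrate how the linkage method changes the result, here is a small sketch on made-up 1-D samples. It compares "complete" linkage with "single" linkage (another method available in SciPy, which uses the minimum distance between samples and is not covered on these slides).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy 1-D samples, invented for illustration only
samples = np.array([[0.0], [1.0], [5.0], [12.0]])

# Merge heights (column 2 of the linkage matrix) under each method
complete_heights = linkage(samples, method='complete')[:, 2]
single_heights = linkage(samples, method='single')[:, 2]

# Complete linkage: distance between clusters = max distance between their samples
print("complete:", complete_heights)
# Single linkage: distance between clusters = min distance between their samples
print("single:  ", single_heights)
```

The last merge joins {0, 1, 5} with {12}: under complete linkage its height is the largest pairwise distance (12), under single linkage the smallest (7).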

Extracting cluster labels

  • Use the fcluster() function
  • Returns a NumPy array of cluster labels

Zoomed in cluster with horizontal line at height 15

Extracting cluster labels using fcluster

from scipy.cluster.hierarchy import linkage, fcluster

# Build the hierarchy, then cut it at height 15
mergings = linkage(samples, method='complete')
labels = fcluster(mergings, 15, criterion='distance')
print(labels)
[ 9  8 11 20  2  1 17 14 ... ]
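
The snippet above assumes samples is already defined. A self-contained sketch on made-up 1-D samples (not the Eurovision data) shows how the chosen height controls the number of clusters that fcluster() returns; the heights tried here are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 1-D samples, invented for illustration only
samples = np.array([[0.0], [1.0], [5.0], [12.0]])
mergings = linkage(samples, method='complete')

# Cut the hierarchy at several heights and count the resulting clusters
n_clusters = {}
for height in (0.5, 3, 6, 15):
    labels = fcluster(mergings, height, criterion='distance')
    n_clusters[height] = len(np.unique(labels))
    print(height, labels)

print(n_clusters)
```

A lower height forbids more merges, so it yields more (and smaller) clusters; a height above the final merge puts every sample in one cluster.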

Aligning cluster labels with country names

Given a list of strings country_names:

import pandas as pd
pairs = pd.DataFrame({'labels': labels, 'countries': country_names})
print(pairs.sort_values('labels'))
               countries  labels
5                Belarus       1
40               Ukraine       1
...
36                 Spain       5
8               Bulgaria       6
19                Greece       6
10                Cyprus       6
28               Moldova       7
...
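
Once labels are aligned with names, they can feed a cross-tabulation. A minimal sketch using pd.crosstab: the countries and cluster labels are the ones visible in the output above, while the group column is a hypothetical category invented purely for illustration.

```python
import pandas as pd

# Countries and cluster labels taken from the rows shown above
country_names = ['Belarus', 'Ukraine', 'Bulgaria', 'Greece', 'Cyprus', 'Moldova']
labels = [1, 1, 6, 6, 6, 7]

# Hypothetical second category per country (not from the source data)
group = ['A', 'A', 'B', 'B', 'B', 'A']

pairs = pd.DataFrame({'labels': labels, 'countries': country_names, 'group': group})

# Count how many countries fall into each (cluster label, group) combination
ct = pd.crosstab(pairs['labels'], pairs['group'])
print(ct)
```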

Let's practice!
