Clustering linkage and practical matters

Unsupervised Learning in R

Hank Roark

Senior Data Scientist at Boeing

Linking clusters in hierarchical clustering

  • How is the distance between two clusters determined? A linkage rule is needed.
  • Four methods to determine which clusters should be linked
    • Complete: computes the pairwise dissimilarity between all observations in cluster 1 and cluster 2, and uses the largest of the dissimilarities
    • Single: same as above but uses the smallest of the dissimilarities
    • Average: same as above but uses the average of the dissimilarities
    • Centroid: finds the centroid of cluster 1 and the centroid of cluster 2, and uses the dissimilarity between the two centroids
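
The four rules can be checked by hand on a tiny example. The sketch below uses two made-up clusters of 2-D points (the values and the names `cluster1`, `cluster2`, and `pairwise` are illustrative assumptions, not part of the course data) and Euclidean distance as the dissimilarity.

```r
# Two small hypothetical clusters of 2-D points (made-up values)
cluster1 <- matrix(c(0, 0,
                     1, 0), ncol = 2, byrow = TRUE)
cluster2 <- matrix(c(4, 0,
                     6, 0), ncol = 2, byrow = TRUE)

# Euclidean distances between every point in cluster1 and every point in cluster2
n1 <- nrow(cluster1)
all_d <- as.matrix(dist(rbind(cluster1, cluster2)))
pairwise <- all_d[1:n1, (n1 + 1):nrow(all_d)]

complete_link <- max(pairwise)    # largest pairwise distance
single_link   <- min(pairwise)    # smallest pairwise distance
average_link  <- mean(pairwise)   # average pairwise distance

# Distance between the two cluster centroids (column means)
centroid_link <- sqrt(sum((colMeans(cluster1) - colMeans(cluster2))^2))

c(complete = complete_link, single = single_link,
  average = average_link, centroid = centroid_link)
```

Here the pairwise distances are 4, 6, 3, and 5, so complete linkage gives 6, single gives 3, and average gives 4.5; the centroids sit at (0.5, 0) and (5, 0), so centroid linkage also gives 4.5.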

Linking methods: complete and average

[Figure: dendrograms for complete and average linkage]

Linking method: single

[Figure: dendrogram for single linkage]

Linking method: centroid

[Figure: dendrogram for centroid linkage]

Linkage in R

# Fitting hierarchical clustering models using different linkage methods
# (d is a distance matrix, e.g. the result of dist(x))
hclust.complete <- hclust(d, method = "complete")
hclust.average <- hclust(d, method = "average")
hclust.single <- hclust(d, method = "single")
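
The slide fits three of the four linkage methods. As a minimal sketch, the fourth can be fit the same way with `method = "centroid"`, and the resulting trees compared by cutting them into the same number of clusters. The data `x`, the choice `k = 3`, and the `clusters.*` names below are assumptions made for illustration.

```r
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)   # hypothetical data: 20 observations, 2 features
d <- dist(x)                       # Euclidean distance matrix

# Fit the two models to compare, including the centroid method
hclust.complete <- hclust(d, method = "complete")
hclust.centroid <- hclust(d, method = "centroid")

# Cut each tree into 3 clusters (k = 3 is an arbitrary choice here)
clusters.complete <- cutree(hclust.complete, k = 3)
clusters.centroid <- cutree(hclust.centroid, k = 3)

# Cross-tabulate the assignments to see where the two linkages disagree
table(complete = clusters.complete, centroid = clusters.centroid)

# Dendrograms side by side
par(mfrow = c(1, 2))
plot(hclust.complete, main = "Complete")
plot(hclust.centroid, main = "Centroid")
```

Centroid linkage can produce inversions in the dendrogram, where a merge happens at a lower height than an earlier one, which makes the tree harder to read than the complete or average versions.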

Practical matters

  • Features measured on different scales can cause undesirable results in clustering methods, because features with larger values dominate the distance calculation
  • Solution is to scale the data so that all features have the same mean and standard deviation
    • Subtract the mean of a feature from all of its observations
    • Divide each centered feature by the standard deviation of the feature
    • Normalized features have a mean of zero and a standard deviation of one
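
The two steps above can be sketched by hand and compared against base R's `scale()`. The data and the names `centered` and `manual_scaled` are hypothetical, chosen only to illustrate the arithmetic.

```r
set.seed(42)
# Hypothetical data: two features on very different scales
x <- cbind(rnorm(50, mean = 10, sd = 5),
           rnorm(50, mean = -3, sd = 0.5))

# Step 1: subtract each column's mean from that column
centered <- sweep(x, 2, colMeans(x), "-")

# Step 2: divide each centered column by that column's standard deviation
manual_scaled <- sweep(centered, 2, apply(x, 2, sd), "/")

# The manual result should match scale(), ignoring the attributes scale() attaches
all.equal(manual_scaled, scale(x), check.attributes = FALSE)
```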

Practical matters

# Check if scaling is necessary
colMeans(x)
-0.1337828  0.0594019
apply(x, 2, sd)
1.974376 2.112357

Practical matters

# Produce new matrix with columns of mean of 0 and sd of 1
scaled_x <- scale(x)
colMeans(scaled_x)
2.775558e-17 3.330669e-17
apply(scaled_x, 2, sd)
1 1
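
To see why scaling matters for clustering, one option is to fit the same model on raw and scaled data and compare the cluster assignments. This is a sketch on made-up data in which one feature is deliberately on a much larger scale; the names `hclust.raw`, `hclust.scaled`, `clusters_raw`, `clusters_scaled` and the choice `k = 3` are assumptions for illustration.

```r
set.seed(7)
# Hypothetical data: feature 1 in the thousands, feature 2 on a unit scale
x <- cbind(rnorm(30, mean = 5000, sd = 1000),
           rnorm(30, mean = 0, sd = 1))
scaled_x <- scale(x)

# Same linkage method, raw versus scaled input
hclust.raw    <- hclust(dist(x), method = "complete")
hclust.scaled <- hclust(dist(scaled_x), method = "complete")

# Cut both trees into 3 clusters (an arbitrary choice)
clusters_raw    <- cutree(hclust.raw, k = 3)
clusters_scaled <- cutree(hclust.scaled, k = 3)

# Cross-tabulate to see how far the two sets of assignments agree
table(raw = clusters_raw, scaled = clusters_scaled)
```

On the raw data the distances are driven almost entirely by the first feature, so the second feature has little influence on which observations are grouped together.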

Let's practice!
