Silhouette analysis: observation-level performance

Cluster Analysis in R

Dmitriy (Dima) Gorenshteyn

Lead Data Scientist, Memorial Sloan Kettering Cancer Center

Soccer lineup with K = 3

Silhouette width

Within Cluster Distance: C(i)

Closest Neighbor Distance: N(i)
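
The two distances can be written out explicitly. The notation below is an assumption added for illustration and does not come from the slides: A is the cluster that contains observation i, B ranges over the other clusters, and d(i, j) is the dissimilarity between observations i and j. This follows the standard silhouette definition, where C(i) averages over the other members of i's own cluster and N(i) is the smallest average distance to any other cluster.

```latex
C(i) = \frac{1}{|A| - 1} \sum_{j \in A,\; j \neq i} d(i, j)
\qquad
N(i) = \min_{B \neq A} \; \frac{1}{|B|} \sum_{j \in B} d(i, j)
```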

Silhouette width: S(i)

  • 1: Well matched to cluster
  • 0: On border between two clusters
  • -1: Better fit in neighboring cluster
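
As a rough sketch of how these values arise, the standard silhouette definition combines the two distances as S(i) = (N(i) - C(i)) / max(C(i), N(i)). The code below computes that by hand and compares it with `silhouette()` from the cluster package. The `toy` data and the `manual_sil()` helper are hypothetical and only stand in for the `lineup` data, which is not reproduced here.

```r
library(cluster)

# Hypothetical toy data: three loose groups of 2-D points (not the lineup data)
set.seed(42)
toy <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 5), ncol = 2),
  matrix(rnorm(20, mean = 10), ncol = 2)
)

model <- pam(toy, k = 3)
clusters <- model$clustering
dist_mat <- as.matrix(dist(toy))

# Hypothetical helper: silhouette width of observation i from the definitions
manual_sil <- function(i) {
  own <- clusters[i]
  same <- setdiff(which(clusters == own), i)
  if (length(same) == 0) return(0)  # convention for single-member clusters
  # C(i): average distance to the other members of i's own cluster
  c_i <- mean(dist_mat[i, same])
  # N(i): smallest average distance to the members of any other cluster
  other <- setdiff(unique(clusters), own)
  n_i <- min(sapply(other, function(cl) mean(dist_mat[i, clusters == cl])))
  (n_i - c_i) / max(c_i, n_i)
}

manual <- sapply(seq_len(nrow(toy)), manual_sil)

# Reference values in the original row order
reference <- as.numeric(silhouette(clusters, dist(toy))[, "sil_width"])

all.equal(manual, reference)
```

Passing the cluster vector and a distance object to `silhouette()` keeps the rows in their original order, which makes the element-wise comparison straightforward.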

Calculating S(i)

library(cluster)
pam_k3 <- pam(lineup, k = 3)

pam_k3$silinfo$widths
   cluster neighbor   sil_width
4        1        2 0.465320054
2        1        3 0.321729341
10       1        2 0.311385893
1        1        3 0.271890169
9        2        1 0.443606497
...    ...      ...         ...

Silhouette plot

sil_plot <- silhouette(pam_k3)
plot(sil_plot)

Average silhouette width

pam_k3$silinfo$avg.width
[1] 0.353414
  • 1: Well matched to each cluster
  • 0: On border between clusters
  • -1: Poorly matched to each cluster
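
The average silhouette width is simply the mean of the per-observation widths. A minimal sketch, assuming hypothetical `toy` data in place of `lineup`, shows the relationship and also looks at the per-cluster averages that `pam()` stores alongside it:

```r
library(cluster)

# Hypothetical toy data standing in for the lineup data
set.seed(1)
toy <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 6), ncol = 2)
)

model <- pam(toy, k = 2)

# Per-observation widths and their plain average
widths <- model$silinfo$widths[, "sil_width"]
avg_by_hand <- mean(widths)

# Per-cluster averages are stored on the model as well
cluster_avgs <- model$silinfo$clus.avg.widths

# The hand-computed mean should match the stored overall average
all.equal(avg_by_hand, model$silinfo$avg.width)
```

Comparing the per-cluster averages with the overall value can hint at whether one weak cluster is pulling the average down.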

Highest average silhouette width

library(purrr)

# Fit a PAM model for each k and keep its average silhouette width
sil_width <- map_dbl(2:10, function(k) {
  model <- pam(x = lineup, k = k)
  model$silinfo$avg.width
})
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)
print(sil_df)
     k    sil_width
1    2    0.4164141
2    3    0.3534140
3    4    0.3535534
4    5    0.3724115
...  ...        ...

Choosing K using average silhouette width

library(ggplot2)

# Plot average silhouette width against k; the peak suggests a value for k
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
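
The peak of this curve can also be read off programmatically. The sketch below is one possible way to do it, using base R's `vapply()` in place of `map_dbl()` so that it needs only the cluster package. The `toy` data and the range `2:6` are hypothetical choices for illustration and are not the `lineup` data.

```r
library(cluster)

# Hypothetical toy data: three loose groups of 2-D points
set.seed(7)
toy <- rbind(
  matrix(rnorm(30, mean = 0), ncol = 2),
  matrix(rnorm(30, mean = 6), ncol = 2),
  matrix(rnorm(30, mean = 12), ncol = 2)
)

# Average silhouette width for each candidate k
k_values <- 2:6
sil_width <- vapply(k_values, function(k) {
  pam(toy, k = k)$silinfo$avg.width
}, numeric(1))

sil_df <- data.frame(k = k_values, sil_width = sil_width)

# k with the highest average silhouette width
best_k <- sil_df$k[which.max(sil_df$sil_width)]
best_k
```

Note that `which.max()` returns only the first maximum, so near-ties between neighboring values of k are still worth checking on the plot.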

Let's practice!
