How k-means works and practical matters

Unsupervised Learning in R

Hank Roark

Senior Data Scientist at Boeing

Objectives

  • Explain visually how the k-means algorithm works
  • Model selection: determining the number of clusters

observations

random cluster assignment

cluster centers calculated

after reassignment

iteration 2

iteration 3

iteration 4

iteration 5
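
The sequence pictured above (random assignment, center calculation, reassignment, repeat) can be sketched in a few lines of R. This is only an illustrative sketch: the data `x`, the choice `k <- 2`, and the cap of 20 iterations are assumptions made for the example, it is not how `kmeans()` is implemented internally, and it assumes no cluster ever becomes empty.

```r
# Minimal sketch of the k-means loop (illustrative only)
set.seed(1)

# Hypothetical data: two well-separated groups of 2-D points
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
k <- 2

# Step 1: randomly assign each observation to one of k clusters
assignment <- sample(rep_len(seq_len(k), nrow(x)))

for (iter in 1:20) {
  # Step 2: each cluster center is the mean of the points assigned to it
  centers <- t(sapply(seq_len(k), function(j) {
    colMeans(x[assignment == j, , drop = FALSE])
  }))

  # Step 3: squared distance from every observation to every center,
  # then reassign each observation to its nearest center
  dists <- sapply(seq_len(k), function(j) {
    rowSums(sweep(x, 2, centers[j, ])^2)
  })
  new_assignment <- max.col(-dists, ties.method = "first")

  # Stop once no observation changes cluster
  if (all(new_assignment == assignment)) break
  assignment <- new_assignment
}

table(assignment)
```

Each pass through the loop corresponds to one "iteration" slide: the centers move, some points switch clusters, and the process stops when assignments no longer change.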

Model selection

  • Recall that k-means has a random component, so different runs can give different results
  • The best outcome is the run with the lowest total within-cluster sum of squares:
    • For each cluster
      • For each observation in the cluster
        • Determine the squared distance from the observation to the cluster center
      • Sum all of the squared distances together
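
As a rough check of the recipe above, the total within-cluster sum of squares can be computed by hand and compared with the `tot.withinss` component that `kmeans()` returns. The data `x` and the choice of 3 centers are assumptions made only for this sketch.

```r
set.seed(1)

# Hypothetical data: 100 observations with 2 features
x <- matrix(rnorm(200), ncol = 2)
km <- kmeans(x, centers = 3)

total_wss <- 0
for (j in seq_len(nrow(km$centers))) {
  # Observations assigned to cluster j
  pts <- x[km$cluster == j, , drop = FALSE]
  # Squared distance from each observation to the center of cluster j
  sq_dist <- rowSums(sweep(pts, 2, km$centers[j, ])^2)
  # Add this cluster's contribution to the running total
  total_wss <- total_wss + sum(sq_dist)
}

# The hand-computed value should agree with the one reported by kmeans()
all.equal(total_wss, km$tot.withinss)
```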

Model selection

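# k-means with 5 centers, using 20 random starts; the best run is kept
kmeans(x, centers = 5, nstart = 20)
  • Running the algorithm from several random starts and keeping the best run makes it more likely, though not certain, to reach the global minimum total within-cluster sum of squares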
  • You'll see an example in the exercises
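
One way to see the effect of the random start is to compare several single-start runs with one multi-start run. The sketch below uses made-up data and 3 centers purely for illustration; how much the single-start values differ depends on the data and the seed.

```r
set.seed(1)

# Hypothetical data: three groups of 2-D points
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

# Six independent runs, each from a single random start
single_runs <- sapply(1:6, function(i) {
  kmeans(x, centers = 3, nstart = 1)$tot.withinss
})

# One call with 20 random starts; kmeans() keeps the best of them
best_of_20 <- kmeans(x, centers = 3, nstart = 20)$tot.withinss

print(single_runs)
print(best_of_20)
```

If any single-start value is noticeably larger than the others, that run ended in a local minimum, which is the situation `nstart` is meant to guard against.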

running k-means multiple times

Determining the best number of clusters

  • Trial and error is not the best approach

determining the best number of clusters with an elbow plot
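
An elbow plot like the one pictured can be produced by fitting k-means for a range of cluster counts and plotting the total within-cluster sum of squares against the number of clusters. This is a minimal sketch: the data `x` and the range 1 to 10 are assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical data: three groups of 2-D points
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 10
wss <- numeric(10)
for (k in 1:10) {
  wss[k] <- kmeans(x, centers = k, nstart = 20)$tot.withinss
}

# Look for the "elbow": the point where adding clusters stops
# reducing the total within-cluster sum of squares by much
plot(1:10, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```

Reading the elbow is a judgment call, and on real data the bend is often less sharp than on simulated groups.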

Let's practice!
