How k-means works and practical matters

Unsupervised Learning in R

Hank Roark

Senior Data Scientist at Boeing

Objectives

observations

random cluster assignment

cluster centers calculated

after reassignment

iteration 2

iteration 3

iteration 4

iteration 5
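The sequence of plots above (random assignment, centers calculated, reassignment, repeat) can be sketched in a few lines of base R. This is a minimal illustration on made-up data, not the internals of `kmeans()`, which uses the Hartigan-Wong algorithm by default. The data `x`, the choice `k <- 2`, and the cap of 10 iterations are all assumptions for the example, and empty clusters are not handled.

```r
# Hypothetical toy data: two 2-D blobs (not the data shown in the slides)
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
k <- 2

# Step 1: random cluster assignment (balanced labels, shuffled)
assignment <- sample(rep_len(1:k, nrow(x)))

for (iter in 1:10) {
  # Step 2: each cluster center is the mean of the points assigned to it
  centers <- t(sapply(1:k, function(j) {
    colMeans(x[assignment == j, , drop = FALSE])
  }))

  # Step 3: squared distance from every observation to every center,
  # then reassign each observation to its nearest center
  d2 <- sapply(1:k, function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  new_assignment <- max.col(-d2, ties.method = "first")

  # Stop once no observation changes cluster
  if (all(new_assignment == assignment)) break
  assignment <- new_assignment
}

table(assignment)
```

Each pass through the loop corresponds to one "iteration" panel: the centers move, some observations switch cluster, and the process stops when the assignments no longer change.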

Recall that k-means has a random component, so different runs can give different results
The best outcome is the one with the lowest total within-cluster sum of squares:
- For each cluster
  - For each observation in the cluster
    - Determine squared distance from observation to cluster center
  - Sum all of them together
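The recipe above can be checked by hand against the value that `kmeans()` reports in its `tot.withinss` component. The sketch below uses made-up data and 3 centers purely for illustration.

```r
# Hypothetical data: 100 random 2-D points
set.seed(42)
x <- matrix(rnorm(200), ncol = 2)
km <- kmeans(x, centers = 3)

# For each cluster: squared distance from each observation to the
# cluster center, summed over the observations in that cluster
wss_per_cluster <- sapply(1:3, function(j) {
  pts <- x[km$cluster == j, , drop = FALSE]
  sum(sweep(pts, 2, km$centers[j, ])^2)
})

# Sum over all clusters
total_wss <- sum(wss_per_cluster)

# The manual total should agree with the value kmeans() reports
all.equal(total_wss, km$tot.withinss)
```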

# k-means algorithm with 5 centers, run 20 times
kmeans(x, centers = 5, nstart = 20)

Running the algorithm multiple times and keeping the best run makes it more likely, though not guaranteed, that you find the global minimum of the total within-cluster sum of squares
You'll see an example in the exercises

running k-means multiple times
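A small experiment along these lines, assuming made-up data with three blobs: several single-start runs are compared with one call that uses `nstart = 20`. The data and the number of runs are illustrative choices, not taken from the course exercises.

```r
# Hypothetical data: three 2-D blobs
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

# Six independent runs, each from a single random start
single_runs <- sapply(1:6, function(i) {
  kmeans(x, centers = 3, nstart = 1)$tot.withinss
})

# One call with 20 random starts; kmeans() keeps the run with the
# lowest total within-cluster sum of squares
best <- kmeans(x, centers = 3, nstart = 20)

print(single_runs)
print(best$tot.withinss)
```

If the single-start values differ from one another, some of those runs stopped at a local minimum; the `nstart = 20` result is the best of its 20 starts.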

determining the best number of clusters with an elbow plot
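An elbow plot of this kind can be produced by fitting k-means for a range of cluster counts and plotting the total within-cluster sum of squares against the number of clusters. The sketch below assumes made-up data and a range of 1 to 10 clusters; both are illustrative choices.

```r
# Hypothetical data: three 2-D blobs
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

# Total within-cluster sum of squares for 1 to 10 clusters
wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})

# Look for the "elbow": the point after which adding clusters
# reduces the sum of squares only slightly
plot(1:10, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```

The elbow is a heuristic; on real data the bend is often less sharp than on well-separated toy data.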

