Unsupervised Learning

Unsupervised Learning in Python

Benjamin Wilson

Director of Research at lateral.io

Unsupervised learning

  • Unsupervised learning finds patterns in data
  • E.g., clustering customers by their purchases
  • E.g., compressing customer data using purchase patterns (dimension reduction)

Supervised vs unsupervised learning

  • Supervised learning finds patterns for a prediction task
  • E.g., classifying tumors as benign or cancerous (the labels)
  • Unsupervised learning finds patterns in data
  • ... but without a specific prediction task in mind
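The contrast can be sketched in scikit-learn. This is a minimal illustration, not part of the original slides: the choice of LogisticRegression as the supervised model and the use of the iris data (instead of the tumor example) are assumptions made only to keep the example runnable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target  # y holds the species labels

# Supervised: the labels y are passed to fit()
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Unsupervised: only the measurements X are passed to fit()
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)

print(clf.predict(X[:5]))  # predicted species labels
print(km.labels_[:5])      # cluster numbers, with no species meaning attached
```

The only difference in the calls is whether labels are supplied to fit().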

Iris dataset

  • Measurements of many iris plants
  • Three species of iris:
    • setosa
    • versicolor
    • virginica
  • Sepal length, sepal width, petal length, petal width (the features of the dataset, in column order)

[Figure: Iris]

1 https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html
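One way to obtain this dataset is the load_iris function documented at the URL above. The sketch below assumes scikit-learn is installed; the variable name samples is chosen to match the later slides.

```python
from sklearn.datasets import load_iris

iris = load_iris()
samples = iris.data  # measurements only, no species labels

print(iris.feature_names)  # names of the four measurement columns
print(iris.target_names)   # names of the three species
print(samples.shape)       # (number of plants, number of features)
```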

Arrays, features & samples

  • 2D NumPy array
  • Columns are measurements (the features)
  • Rows represent iris plants (the samples)
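A small sketch of this layout, using three rows taken from the sample output printed elsewhere in these slides. The variable names are illustrative.

```python
import numpy as np

# Three iris plants (rows), four measurements each (columns)
samples = np.array([
    [5.0, 3.3, 1.4, 0.2],
    [5.0, 3.5, 1.3, 0.3],
    [7.2, 3.2, 6.0, 1.8],
])

n_samples, n_features = samples.shape
first_plant = samples[0]       # one row: all measurements of one plant
first_feature = samples[:, 0]  # one column: one measurement across all plants

print(n_samples, n_features)
print(first_plant)
print(first_feature)
```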

Iris data is 4-dimensional

  • Iris samples are points in 4-dimensional space
  • Dimension = number of features
  • Dimension too high to visualize!
  • ... but unsupervised learning gives insight
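A brief sketch of what "dimension" means here, assuming an array with the iris shape of 150 rows and 4 columns. The array has two axes, but each sample is a point with four coordinates. Any single 2D scatter plot can show only one pair of features at a time.

```python
from itertools import combinations

import numpy as np

samples = np.zeros((150, 4))  # placeholder with the same shape as the iris array

array_axes = samples.ndim          # 2: rows and columns of the array
data_dimension = samples.shape[1]  # 4: one coordinate per feature

feature_names = ["sepal length", "sepal width", "petal length", "petal width"]
pairs = list(combinations(feature_names, 2))  # every possible 2D view

print(array_axes, data_dimension)
print(len(pairs))  # number of distinct feature pairs
```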

k-means clustering

  • Finds clusters of samples
  • Number of clusters must be specified
  • Implemented in sklearn ("scikit-learn")
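# samples: 2D NumPy array, one row per iris plant, one column per feature
print(samples)
# [[ 5.   3.3  1.4  0.2]
#  [ 5.   3.5  1.3  0.3]
#  ...
#  [ 7.2  3.2  6.   1.8]]

from sklearn.cluster import KMeans

# Fit a k-means model with 3 clusters
model = KMeans(n_clusters=3)
model.fit(samples)  # returns the fitted model: KMeans(n_clusters=3)

# Assign each sample to a cluster
labels = model.predict(samples)

print(labels)
# [0 0 1 1 0 1 2 1 0 1 ...]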
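The snippet above assumes that samples already exists. A self-contained sketch is given here, loading the data with load_iris. The random_state and n_init values are assumptions added so that repeated runs give the same result. The cluster numbers themselves are arbitrary and need not match the labels printed on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

samples = load_iris().data

model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(samples)  # fits the model and returns one label per sample

print(labels[:10])
print(np.bincount(labels))  # number of samples in each cluster
```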

Cluster labels for new samples

  • New samples can be assigned to existing clusters
  • k-means remembers the mean of each cluster (the "centroids")
  • Finds the nearest centroid to each new sample

Cluster labels for new samples

print(new_samples)
# [[ 5.7  4.4  1.5  0.4]
#  [ 6.5  3.   5.5  1.8]
#  [ 5.8  2.7  5.1  1.9]]

# Assign each new sample to the cluster with the nearest centroid
new_labels = model.predict(new_samples)

print(new_labels)
# [0 2 1]
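The nearest-centroid rule can be checked by hand. The sketch below refits a model on the iris data (random_state and n_init are assumptions for reproducibility), computes the Euclidean distance from each new sample to each centroid, and picks the closest one. The result should agree with predict, although the label numbers may differ from those on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

samples = load_iris().data
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(samples)

new_samples = np.array([
    [5.7, 4.4, 1.5, 0.4],
    [6.5, 3.0, 5.5, 1.8],
    [5.8, 2.7, 5.1, 1.9],
])

centroids = model.cluster_centers_  # one row per cluster, shape (3, 4)

# Distance from every new sample to every centroid, shape (3, 3)
diffs = new_samples[:, None, :] - centroids[None, :, :]
distances = np.linalg.norm(diffs, axis=2)

manual_labels = distances.argmin(axis=1)  # index of the nearest centroid

print(manual_labels)
print(model.predict(new_samples))
```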

Scatter plots

  • Scatter plot of sepal length vs. petal length
  • Each point represents an iris sample
  • Color points by cluster labels
  • PyPlot (matplotlib.pyplot)

[Figure: scatter plot of sepal length vs. petal length, points colored by cluster label]

Scatter plots

import matplotlib.pyplot as plt

xs = samples[:,0]  # sepal length
ys = samples[:,2]  # petal length
plt.scatter(xs, ys, c=labels)
plt.show()
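A self-contained version of this plot might look as follows. Several details are assumptions not shown on the slides: the Agg backend and the output file name iris_clusters.png (used so the script also runs without a display), the axis labels, and the extra scatter call that marks the centroids.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show() instead
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

samples = load_iris().data
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(samples)

xs = samples[:, 0]  # sepal length
ys = samples[:, 2]  # petal length
points = plt.scatter(xs, ys, c=labels)

# Mark each cluster centroid with a black diamond
centroids = model.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 2], marker="D", s=60, c="black")

plt.xlabel("sepal length (cm)")
plt.ylabel("petal length (cm)")
plt.savefig("iris_clusters.png")  # hypothetical output file name
```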

Let's practice!
