Dimension reduction with PCA

Unsupervised Learning in Python

Benjamin Wilson

Director of Research at lateral.io

Dimension reduction

  • Represents the same data, using fewer features
  • Important part of machine-learning pipelines
  • Can be performed using PCA

Dimension reduction with PCA

  • PCA features are in decreasing order of variance
  • Assumes the low variance features are "noise"
  • ... and high variance features are informative; see the plot sketched below

Bar plot of variance against PCA feature number, with a vertical line between features 1 and 2; features to the left are labeled "informative", those to the right "noisy"
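
A plot like this can be produced from a PCA model fitted without limiting the number of components. The following is a minimal sketch, assuming samples is the 2D NumPy array of iris measurements used later in this section:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Fit PCA keeping all components (assumes samples is a 2D NumPy array)
pca_all = PCA()
pca_all.fit(samples)

# Bar plot of the variance of each PCA feature, in decreasing order
features = range(pca_all.n_components_)
plt.bar(features, pca_all.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.show()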


Dimension reduction with PCA

  • Specify how many features to keep
  • E.g. PCA(n_components=2)
  • Keeps the first 2 PCA features
  • Intrinsic dimension is a good choice

Dimension reduction of iris dataset

  • samples = array of iris measurements (4 features)
  • species = list of iris species numbers
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(samples)  # fit() returns the fitted estimator, displayed as: PCA(n_components=2)
transformed = pca.transform(samples)
print(transformed.shape)  # 150 samples, 2 PCA features
(150, 2)

Iris dataset in 2 dimensions

  • PCA has reduced the dimension to 2
  • Retained the 2 PCA features with highest variance
  • Important information preserved: species remain distinct
import matplotlib.pyplot as plt
xs = transformed[:,0]  # first PCA feature
ys = transformed[:,1]  # second PCA feature
plt.scatter(xs, ys, c=species)  # color each sample by its species number
plt.show()

Scatter plot of PCA performed on Iris dataset


Dimension reduction with PCA

  • Discards low variance PCA features
  • Assumes the high variance features are informative
  • Assumption typically holds in practice (e.g. for iris; see the check sketched below)
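
One way to check this for the iris example is the fraction of total variance captured by the retained features, available as explained_variance_ratio_. A minimal sketch, reusing the pca model fitted with n_components=2 above:

# Fraction of the total variance captured by each retained PCA feature
print(pca.explained_variance_ratio_)

# Their sum is the fraction of variance preserved by keeping 2 features;
# for the iris measurements this sum is close to 1
print(pca.explained_variance_ratio_.sum())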

Word frequency arrays

  • Rows represent documents, columns represent words
  • Entries measure presence of each word in each document
  • ... measured using "tf-idf" (more on this later); see the sketch below

Word frequency array
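
As a rough illustration, a word frequency array like this could be built with scikit-learn's TfidfVectorizer. The toy corpus and variable names below are illustrative, not from the course:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: each string is one document
docs = ['cats say meow', 'dogs say woof', 'dogs chase cats']

# Build the word frequency array; the result is a sparse csr_matrix
tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())  # the words (one per column)
print(csr_mat.toarray())              # rows: documents, columns: tf-idf weights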


Sparse arrays and csr_matrix

  • "Sparse": most entries are zero
  • Can use scipy.sparse.csr_matrix instead of NumPy array
  • csr_matrix remembers only the non-zero entries (saves space!)

Word frequency array
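
A minimal sketch of the space saving, using a small, mostly-zero array with illustrative values:

import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero array standing in for a word frequency array
dense = np.array([[0.0, 0.0, 3.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0]])

sparse = csr_matrix(dense)   # stores only the non-zero entries and their positions
print(sparse.nnz)            # 3 non-zero entries are stored
print(sparse.toarray())      # convert back to a dense NumPy array when needed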


TruncatedSVD and csr_matrix

  • scikit-learn PCA doesn't support csr_matrix
  • Use scikit-learn TruncatedSVD instead
  • Performs same transformation
from sklearn.decomposition import TruncatedSVD
model = TruncatedSVD(n_components=3)
model.fit(documents)  # documents is csr_matrix
transformed = model.transform(documents)
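
For example, applied to the toy tf-idf matrix csr_mat sketched earlier (the variable names are assumptions, not from the course):

model = TruncatedSVD(n_components=2)    # 2 is fewer than the number of words in the toy corpus
reduced = model.fit_transform(csr_mat)  # fit and transform in one step
print(reduced.shape)                    # (3, 2): one row per document, 2 features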

Let's practice!

