Dimension reduction with PCA

Unsupervised Learning in Python

Benjamin Wilson

Director of Research at lateral.io

Dimension reduction

  • Represents the same data, using fewer features
  • Important part of machine-learning pipelines
  • Can be performed using PCA

Dimension reduction with PCA

  • PCA features are in decreasing order of variance
  • Assumes the low variance features are "noise"
  • ... and high variance features are informative; see the plot sketched below

Bar plot of variance against PCA feature number, with a vertical line between features 1 and 2; features to the left are labeled "informative", those to the right "noisy"
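
A plot like this can be produced from a PCA model fitted without limiting the number of components. The following is a minimal sketch, assuming samples is the 2D NumPy array of iris measurements used later in this section:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Fit PCA keeping all components (assumes samples is a 2D NumPy array)
pca_all = PCA()
pca_all.fit(samples)

# Bar plot of the variance of each PCA feature, in decreasing order
features = range(pca_all.n_components_)
plt.bar(features, pca_all.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.show()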


Dimension reduction with PCA

  • Specify how many features to keep
  • E.g. PCA(n_components=2)
  • Keeps the first 2 PCA features
  • Intrinsic dimension is a good choice

Dimension reduction of iris dataset

  • samples = array of iris measurements (4 features)
  • species = list of iris species numbers
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(samples)  # fit() returns the fitted estimator, displayed as: PCA(n_components=2)
transformed = pca.transform(samples)
print(transformed.shape)  # 150 samples, 2 PCA features
(150, 2)

Iris dataset in 2 dimensions

  • PCA has reduced the dimension to 2
  • Retained the 2 PCA features with highest variance
  • Important information preserved: species remain distinct
import matplotlib.pyplot as plt
xs = transformed[:,0]  # first PCA feature
ys = transformed[:,1]  # second PCA feature
plt.scatter(xs, ys, c=species)  # color each sample by its species number
plt.show()

Scatter plot of PCA performed on Iris dataset


Dimension reduction with PCA

  • Discards low variance PCA features
  • Assumes the high variance features are informative
  • Assumption typically holds in practice (e.g. for iris; see the check sketched below)
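
One way to check this for the iris example is the fraction of total variance captured by the retained features, available as explained_variance_ratio_. A minimal sketch, reusing the pca model fitted with n_components=2 above:

# Fraction of the total variance captured by each retained PCA feature
print(pca.explained_variance_ratio_)

# Their sum is the fraction of variance preserved by keeping 2 features;
# for the iris measurements this sum is close to 1
print(pca.explained_variance_ratio_.sum())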

Word frequency arrays

  • Rows represent documents, columns represent words
  • Entries measure presence of each word in each document
  • ... measured using "tf-idf" (more on this later); see the sketch below

Word frequency array
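
As a rough illustration, a word frequency array like this could be built with scikit-learn's TfidfVectorizer. The toy corpus and variable names below are illustrative, not from the course:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: each string is one document
docs = ['cats say meow', 'dogs say woof', 'dogs chase cats']

# Build the word frequency array; the result is a sparse csr_matrix
tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())  # the words (one per column)
print(csr_mat.toarray())              # rows: documents, columns: tf-idf weights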


Sparse arrays and csr_matrix

  • "Sparse": most entries are zero
  • Can use scipy.sparse.csr_matrix instead of NumPy array
  • csr_matrix remembers only the non-zero entries (saves space!)

Word frequency array
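
A minimal sketch of the space saving, using a small, mostly-zero array with illustrative values:

import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero array standing in for a word frequency array
dense = np.array([[0.0, 0.0, 3.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0]])

sparse = csr_matrix(dense)   # stores only the non-zero entries and their positions
print(sparse.nnz)            # 3 non-zero entries are stored
print(sparse.toarray())      # convert back to a dense NumPy array when needed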


TruncatedSVD and csr_matrix

  • scikit-learn PCA doesn't support csr_matrix
  • Use scikit-learn TruncatedSVD instead
  • Performs same transformation
from sklearn.decomposition import TruncatedSVD
model = TruncatedSVD(n_components=3)
model.fit(documents)  # documents is csr_matrix
transformed = model.transform(documents)
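
For example, applied to the toy tf-idf matrix csr_mat sketched earlier (the variable names are assumptions, not from the course):

model = TruncatedSVD(n_components=2)    # 2 is fewer than the number of words in the toy corpus
reduced = model.fit_transform(csr_mat)  # fit and transform in one step
print(reduced.shape)                    # (3, 2): one row per document, 2 features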

Let's practice!

