Visualizing the PCA transformation

Unsupervised Learning in Python

Benjamin Wilson

Director of Research at lateral.io

Dimension reduction

  • More efficient storage and computation
  • Remove less-informative "noise" features
  • ... which can cause problems for prediction tasks such as classification and regression

Principal Component Analysis

  • PCA = "Principal Component Analysis"
  • Fundamental dimension reduction technique
  • First step "decorrelation" (considered here)
  • Second step reduces dimension (considered later)

PCA aligns data with axes

  • Rotates data samples to be aligned with axes
  • Shifts data samples so they have mean 0
  • No information is lost
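The two claims above can be checked directly: after the transformation the data has mean 0 along each axis, and `inverse_transform` recovers the original samples exactly. This sketch uses hypothetical correlated 2D data in place of the wine dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 2D samples (a stand-in for the wine data) with correlated features
rng = np.random.default_rng(0)
x = rng.normal(size=200)
samples = np.column_stack([x, 0.8 * x + rng.normal(scale=0.3, size=200)])

model = PCA()
transformed = model.fit_transform(samples)

# Shifted to mean 0: each PCA feature has (numerically) zero mean
print(transformed.mean(axis=0).round(10))

# No information lost: the rotation/shift can be undone exactly
recovered = model.inverse_transform(transformed)
print(np.allclose(recovered, samples))  # True
```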

scatter plot of wines data with rotated axes


PCA follows the fit/transform pattern

  • PCA is a scikit-learn component like KMeans or StandardScaler
  • fit() learns the transformation from given data
  • transform() applies the learned transformation
  • transform() can also be applied to new data
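The fit/transform pattern above can be sketched with hypothetical data: `fit()` learns the transformation once, and the same learned transformation is then applied both to the training data and to unseen data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical training data and new data with the same two features
rng = np.random.default_rng(1)
train = rng.normal(size=(100, 2))
new_data = rng.normal(size=(5, 2))

model = PCA()
model.fit(train)                              # learn rotation and shift from train
transformed = model.transform(train)          # apply to the training data
new_transformed = model.transform(new_data)   # apply the SAME transformation to new data
print(new_transformed.shape)  # (5, 2)
```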

Using scikit-learn PCA

  • samples = array of two features (total_phenols and od280)
[[ 2.8   3.92]
 ...
 [ 2.05  1.6 ]]
from sklearn.decomposition import PCA

model = PCA()
model.fit(samples)
PCA()
transformed = model.transform(samples)

PCA features

  • Rows of transformed correspond to samples
  • Columns of transformed are the "PCA features"
  • Row gives PCA feature values of corresponding sample
print(transformed)
[[  1.32771994e+00   4.51396070e-01]
 [  8.32496068e-01   2.33099664e-01]
 ...
 [ -9.33526935e-01  -4.60559297e-01]]
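The row/column correspondence can be confirmed on the shapes: `transform()` returns one row per input sample and one column per PCA feature. This sketch uses hypothetical data with the same shape as the wine samples:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the wine samples: 178 samples, 2 features
rng = np.random.default_rng(5)
samples = rng.normal(size=(178, 2))

transformed = PCA().fit_transform(samples)

# One row per sample, one column per PCA feature
print(transformed.shape == samples.shape)  # True
print(transformed[0])  # PCA feature values of the first sample
```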

PCA features are not correlated

  • Features of dataset are often correlated, e.g. total_phenols and od280
  • PCA aligns the data with axes
  • Resulting PCA features are not linearly correlated ("decorrelation")

scatter plot of wines data with rotated axes
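Decorrelation can be verified numerically: the original features are clearly correlated, while the PCA features have (numerically) zero linear correlation. This sketch uses hypothetical correlated data standing in for total_phenols and od280:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated features standing in for total_phenols and od280
rng = np.random.default_rng(2)
a = rng.normal(size=300)
samples = np.column_stack([a, 0.7 * a + rng.normal(scale=0.5, size=300)])

# Original features are correlated
print(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])  # clearly nonzero

# PCA features are not linearly correlated
transformed = PCA().fit_transform(samples)
corr = np.corrcoef(transformed[:, 0], transformed[:, 1])[0, 1]
print(round(corr, 10))  # ~0
```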


Pearson correlation

  • Measures linear correlation of features
  • Value between -1 and 1
  • Value of 0 means no linear correlation

3 scatter plots with correlation 0.7, 0, and -0.7
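Pearson correlation can be computed with `scipy.stats.pearsonr`. A minimal sketch on hypothetical data, showing a strongly positively correlated pair and an uncorrelated pair:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y_pos = x + rng.normal(scale=0.5, size=100)  # positively correlated with x
y_none = rng.normal(size=100)                # independent of x

r_pos, _ = pearsonr(x, y_pos)    # near 1
r_none, _ = pearsonr(x, y_none)  # near 0
print(r_pos)
print(r_none)
```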


Principal components

  • "Principal components" = directions of variance
  • PCA aligns principal components with the axes

scatter plot of wines data with 2 red arrows showing direction of principal components (rotated axes)


Principal components

  • Available as components_ attribute of PCA object
  • Each row defines displacement from mean
print(model.components_)
[[ 0.64116665  0.76740167]
 [-0.76740167  0.64116665]]
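The rows of components_ form an orthonormal set: each is a unit vector, and distinct rows are perpendicular, as in the printed output above (the second row is the first rotated by 90 degrees). A sketch on hypothetical data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated 2D data
rng = np.random.default_rng(4)
x = rng.normal(size=200)
samples = np.column_stack([x, 0.6 * x + rng.normal(scale=0.4, size=200)])

model = PCA().fit(samples)
components = model.components_  # one row per principal component

# Each row has unit length, and the rows are mutually orthogonal
print(np.linalg.norm(components, axis=1))    # both norms are 1
print(np.dot(components[0], components[1]))  # ~0
```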

Let's practice!
