Visualizing the PCA transformation

Unsupervised Learning in Python

Benjamin Wilson

Director of Research at lateral.io

Dimension reduction

  • More efficient storage and computation
  • Remove less-informative "noise" features
  • ... which can cause problems for prediction tasks such as classification and regression

Principal Component Analysis

  • PCA = "Principal Component Analysis"
  • Fundamental dimension reduction technique
  • First step "decorrelation" (considered here)
  • Second step reduces dimension (considered later)

PCA aligns data with axes

  • Rotates data samples to be aligned with axes
  • Shifts data samples so they have mean 0
  • No information is lost
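The two claims above can be checked directly: after the transformation the data has mean 0 along each axis, and `inverse_transform` recovers the original samples exactly. This sketch uses hypothetical correlated 2D data in place of the wine dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 2D samples (a stand-in for the wine data) with correlated features
rng = np.random.default_rng(0)
x = rng.normal(size=200)
samples = np.column_stack([x, 0.8 * x + rng.normal(scale=0.3, size=200)])

model = PCA()
transformed = model.fit_transform(samples)

# Shifted to mean 0: each PCA feature has (numerically) zero mean
print(transformed.mean(axis=0).round(10))

# No information lost: the rotation/shift can be undone exactly
recovered = model.inverse_transform(transformed)
print(np.allclose(recovered, samples))  # True
```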

scatter plot of wines data with rotated axes


PCA follows the fit/transform pattern

  • PCA is a scikit-learn component like KMeans or StandardScaler
  • fit() learns the transformation from given data
  • transform() applies the learned transformation
  • transform() can also be applied to new data
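The fit/transform pattern above can be sketched with hypothetical data: `fit()` learns the transformation once, and the same learned transformation is then applied both to the training data and to unseen data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical training data and new data with the same two features
rng = np.random.default_rng(1)
train = rng.normal(size=(100, 2))
new_data = rng.normal(size=(5, 2))

model = PCA()
model.fit(train)                              # learn rotation and shift from train
transformed = model.transform(train)          # apply to the training data
new_transformed = model.transform(new_data)   # apply the SAME transformation to new data
print(new_transformed.shape)  # (5, 2)
```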

Using scikit-learn PCA

  • samples = array of two features (total_phenols and od280)
[[ 2.8   3.92]
 ...
 [ 2.05  1.6 ]]
from sklearn.decomposition import PCA

model = PCA()
model.fit(samples)
PCA()
transformed = model.transform(samples)

PCA features

  • Rows of transformed correspond to samples
  • Columns of transformed are the "PCA features"
  • Row gives PCA feature values of corresponding sample
print(transformed)
[[  1.32771994e+00   4.51396070e-01]
 [  8.32496068e-01   2.33099664e-01]
 ...
 [ -9.33526935e-01  -4.60559297e-01]]
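The row/column correspondence can be confirmed on the shapes: `transform()` returns one row per input sample and one column per PCA feature. This sketch uses hypothetical data with the same shape as the wine samples:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the wine samples: 178 samples, 2 features
rng = np.random.default_rng(5)
samples = rng.normal(size=(178, 2))

transformed = PCA().fit_transform(samples)

# One row per sample, one column per PCA feature
print(transformed.shape == samples.shape)  # True
print(transformed[0])  # PCA feature values of the first sample
```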

PCA features are not correlated

  • Features of dataset are often correlated, e.g. total_phenols and od280
  • PCA aligns the data with axes
  • Resulting PCA features are not linearly correlated ("decorrelation")

scatter plot of wines data with rotated axes
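Decorrelation can be verified numerically: the original features are clearly correlated, while the PCA features have (numerically) zero linear correlation. This sketch uses hypothetical correlated data standing in for total_phenols and od280:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated features standing in for total_phenols and od280
rng = np.random.default_rng(2)
a = rng.normal(size=300)
samples = np.column_stack([a, 0.7 * a + rng.normal(scale=0.5, size=300)])

# Original features are correlated
print(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])  # clearly nonzero

# PCA features are not linearly correlated
transformed = PCA().fit_transform(samples)
corr = np.corrcoef(transformed[:, 0], transformed[:, 1])[0, 1]
print(round(corr, 10))  # ~0
```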


Pearson correlation

  • Measures linear correlation of features
  • Value between -1 and 1
  • Value of 0 means no linear correlation

3 scatter plots with correlation 0.7, 0, and -0.7
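Pearson correlation can be computed with `scipy.stats.pearsonr`. A minimal sketch on hypothetical data, showing a strongly positively correlated pair and an uncorrelated pair:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y_pos = x + rng.normal(scale=0.5, size=100)  # positively correlated with x
y_none = rng.normal(size=100)                # independent of x

r_pos, _ = pearsonr(x, y_pos)    # near 1
r_none, _ = pearsonr(x, y_none)  # near 0
print(r_pos)
print(r_none)
```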


Principal components

  • "Principal components" = directions of variance
  • PCA aligns principal components with the axes

scatter plot of wines data with 2 red arrows showing direction of principal components (rotated axes)


Principal components

  • Available as components_ attribute of PCA object
  • Each row defines displacement from mean
print(model.components_)
[[ 0.64116665  0.76740167]
 [-0.76740167  0.64116665]]
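The rows of components_ form an orthonormal set: each is a unit vector, and distinct rows are perpendicular, as in the printed output above (the second row is the first rotated by 90 degrees). A sketch on hypothetical data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated 2D data
rng = np.random.default_rng(4)
x = rng.normal(size=200)
samples = np.column_stack([x, 0.6 * x + rng.normal(scale=0.4, size=200)])

model = PCA().fit(samples)
components = model.components_  # one row per principal component

# Each row has unit length, and the rows are mutually orthogonal
print(np.linalg.norm(components, axis=1))    # both norms are 1
print(np.dot(components[0], components[1]))  # ~0
```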

Let's practice!
