Introduction to PCA

Unsupervised Learning in R

Hank Roark

Senior Data Scientist at Boeing

Two methods of clustering

  • Covered so far: two clustering methods for finding groups of homogeneous items
  • Next up, dimensionality reduction
    • Find structure in features
    • Aid in visualization

Dimensionality reduction

  • A popular method is principal component analysis (PCA)
  • Three goals when finding a lower-dimensional representation of the features:
    • Find linear combinations of the variables to create principal components
    • Maintain as much of the variance in the data as possible
    • Keep the principal components uncorrelated (i.e., orthogonal to each other)
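
The three goals can be checked numerically. The following is a minimal sketch, assuming the built-in iris data and base R's prcomp(); the variable names (loadings, max_off_diag, and so on) are illustrative and not from the slides.

```r
# Fit PCA on the four numeric iris columns (column 5, Species, is dropped)
pr <- prcomp(iris[-5], center = TRUE, scale. = FALSE)

# Goal 1: each column of `rotation` holds the weights of one linear combination
loadings <- pr$rotation

# The loading vectors are orthonormal: t(V) %*% V is the identity matrix
orthonormal <- all.equal(crossprod(loadings), diag(4), check.attributes = FALSE)

# Goal 2: with all four components kept, total variance is preserved
total_var_original <- sum(apply(iris[-5], 2, var))
total_var_pcs <- sum(pr$sdev^2)

# Goal 3: component scores are uncorrelated (off-diagonal correlations near 0)
score_cor <- cor(pr$x)
max_off_diag <- max(abs(score_cor[upper.tri(score_cor)]))
```

Dropping the later components gives up only the variance they carry, which is why the first few components can stand in for the full data.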

PCA intuition

scatter plot

PCA intuition

regression line

PCA intuition

component scores projection
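
The projection idea can be sketched in code. This example assumes two iris columns (Petal.Length and Petal.Width) as a stand-in for the two-dimensional data in the figures, which the slides do not identify; the variable names are illustrative.

```r
# Two features only, to mirror a 2-D scatter plot
x <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])
pr <- prcomp(x, center = TRUE, scale. = FALSE)

# Center the data, then project each point onto the first loading vector
x_centered <- sweep(x, 2, pr$center)
pc1_scores <- x_centered %*% pr$rotation[, 1]

# The hand-computed scores should agree with the ones prcomp() stores in pr$x
scores_match <- all.equal(as.numeric(pc1_scores), as.numeric(pr$x[, 1]))

# Map the 1-D scores back into the original 2-D space:
# every point now lies on the line defined by the first principal component
x_projected <- sweep(pc1_scores %*% t(pr$rotation[, 1]), 2, pr$center, "+")
```

Each row of x_projected is the point on the first-component line closest to the original observation, and pc1_scores gives its position along that line.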

Visualization of high dimensional data

different number of dimensions for data

Visualization

PCA on iris dataset
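
A plot of this kind could be produced roughly as follows. This is a sketch using base R graphics; the colors, title, and legend placement are assumptions and may differ from the figure on the slide.

```r
# PCA on the four numeric iris columns
pr <- prcomp(iris[-5], center = TRUE, scale. = FALSE)

# Keep the scores on the first two principal components
scores <- pr$x[, 1:2]

# Scatter plot of PC1 vs. PC2, one color per species
species_id <- as.integer(iris$Species)
plot(scores, col = species_id, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "iris: first two principal components")
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
```

Four measurements per flower are reduced to two coordinates, so the whole dataset fits in a single scatter plot.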

PCA in R

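# Run PCA on the four numeric columns of iris (column 5, Species, is dropped)
pr.iris <- prcomp(x = iris[-5],
                  scale. = FALSE,  # keep original units; do not scale to unit variance
                  center = TRUE)   # subtract each column's mean

# Variance explained by each principal component
summary(pr.iris)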
Importance of components:
                          PC1     PC2    PC3     PC4
Standard deviation     2.0563 0.49262 0.2797 0.15439
Proportion of Variance 0.9246 0.05307 0.0171 0.00521
Cumulative Proportion  0.9246 0.97769 0.9948 1.00000
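
The rows of the summary can be recomputed from the standard deviations stored in the model. This is a minimal sketch; pr.var, pve, and cum_pve are illustrative names, not from the slides.

```r
pr.iris <- prcomp(x = iris[-5], scale. = FALSE, center = TRUE)

# Variance of each component is its standard deviation squared
pr.var <- pr.iris$sdev^2

# Proportion of variance explained by each component
pve <- pr.var / sum(pr.var)

# Running total, matching the "Cumulative Proportion" row
cum_pve <- cumsum(pve)

round(pve, 4)
round(cum_pve, 4)
```

Here the first component alone accounts for about 92% of the variance, consistent with the summary output above.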

Let's practice!
