Practical issues with PCA

Unsupervised Learning in R

Hank Roark

Senior Data Scientist at Boeing

  • Scaling the data
  • Missing values:
    • Drop observations with missing values
    • Impute / estimate missing values
  • Categorical data:
    • Exclude categorical features from the analysis
    • Encode categorical features as numbers
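
The options above can be sketched in a few lines of base R. This is only an illustration: the data frame `df` and its columns are made up for the example, mean imputation is just one simple way to estimate missing values, and one-hot encoding with `model.matrix()` is one of several possible numeric encodings.

```r
# Hypothetical toy data: one missing value and one categorical column
df <- data.frame(
  height = c(1.60, 1.75, NA, 1.82),
  weight = c(55, 72, 80, 90),
  group  = factor(c("a", "b", "a", "c"))
)

# Missing values, option 1: drop observations with missing values
df_complete <- na.omit(df)

# Missing values, option 2: impute with the column mean (a simple estimate)
df_imputed <- df
df_imputed$height[is.na(df_imputed$height)] <- mean(df$height, na.rm = TRUE)

# Categorical data, option 1: leave the categorical feature out
x_numeric <- df_imputed[, c("height", "weight")]

# Categorical data, option 2: encode it as numbers (one indicator column per level)
encoded <- model.matrix(~ group - 1, data = df_imputed)
x_encoded <- cbind(x_numeric, encoded)
```

Either `x_numeric` or `x_encoded` contains only numeric columns with no missing values, so both can be passed to `prcomp()`.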

mtcars dataset

data(mtcars)
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0
Valiant           18.1   6  225 105 2.76 3.460 20.22  1

Scaling

# Means and standard deviations vary a lot
round(colMeans(mtcars), 2)
   mpg    cyl   disp     hp   drat     wt   qsec     vs
 20.09   6.19 230.72 146.69   3.60   3.22  17.85   0.44
round(apply(mtcars, 2, sd), 2)
   mpg    cyl   disp     hp   drat     wt   qsec     vs
  6.03   1.79 123.94  68.56   0.53   0.98   1.79   0.50
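
One way to put the variables on a common footing is to standardize them with base R's `scale()`. The sketch below assumes that standardizing to mean 0 and standard deviation 1 is the scaling intended here. The name `mtcars_scaled` is introduced only for this example.

```r
data(mtcars)

# Center each column to mean 0 and scale it to standard deviation 1
mtcars_scaled <- scale(mtcars)

# After scaling, every column has (numerically) mean 0 and sd 1
round(colMeans(mtcars_scaled), 2)
round(apply(mtcars_scaled, 2, sd), 2)
```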

Importance of scaling data

[Figure: comparing feature importance before and after scaling]
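
A comparison along these lines can be reproduced roughly as follows. This is a sketch, not the code behind the original figure. It fits PCA on `mtcars` with and without scaling and compares the first-component loadings and the variance explained. The object names `pr_raw` and `pr_scaled` are chosen for this example.

```r
data(mtcars)

# PCA without scaling: variables with large variances dominate
pr_raw <- prcomp(mtcars, center = TRUE, scale. = FALSE)

# PCA with scaling: each variable is standardized to unit variance first
pr_scaled <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Loadings of the first principal component, side by side
round(cbind(raw = pr_raw$rotation[, 1], scaled = pr_scaled$rotation[, 1]), 2)

# Variable with the largest absolute loading on PC1 in the unscaled fit
names(which.max(abs(pr_raw$rotation[, 1])))

# Proportion of variance explained by the first three components
summary(pr_raw)$importance["Proportion of Variance", 1:3]
summary(pr_scaled)$importance["Proportion of Variance", 1:3]
```

Without scaling, the first component is driven mostly by `disp`, the variable with the largest variance. With scaling, the loadings are spread more evenly across variables. Calling `biplot(pr_raw)` and `biplot(pr_scaled)` shows the same contrast graphically.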

Scaling and PCA in R

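# The scaling argument is named scale. (with a trailing dot); scale = ... works only via partial matching
prcomp(x, center = TRUE, scale. = FALSE)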
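
As a small check of what the scaling argument does, the sketch below compares `prcomp()` with `scale. = TRUE` against running `prcomp()` on data standardized beforehand with `scale()`. The object names are chosen for this example, and `mtcars` stands in for `x`.

```r
data(mtcars)

# Let prcomp() standardize the variables itself
pr_builtin <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Standardize first, then run PCA on the already-scaled data
pr_manual <- prcomp(scale(mtcars), center = TRUE, scale. = FALSE)

# Both approaches yield the same component standard deviations
all.equal(pr_builtin$sdev, pr_manual$sdev)
```

Passing `scale. = TRUE` keeps the centering and scaling values inside the fitted object (`pr_builtin$center` and `pr_builtin$scale`), which is convenient when new observations are projected later with `predict()`.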

Let's practice!
