Linear Algebra for Data Science in R
Eric Eager
Data Scientist at Pro Football Focus
The matrix $A^T$, the transpose of $A$, is the matrix made by interchanging the rows and columns of $A$.
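In R, the transpose is computed with the built-in `t()` function. A quick sketch on a toy matrix:

```r
# Build a 2 x 3 matrix, filled column by column
A <- matrix(1:6, nrow = 2)
t(A)  # 3 x 2: rows and columns interchanged

# Element [i, j] of t(A) equals element [j, i] of A
stopifnot(t(A)[3, 2] == A[2, 3])
```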
If your data set is stored in a matrix $A$, and the mean of each column has been subtracted from every element of that column, then the $(i,j)$th element of the matrix
$$\frac{A^TA}{n - 1},$$
where $n$ is the number of rows of $A$, is the covariance between the variables in the $i$th and $j$th columns of the data matrix.
Hence, the $i$th element of the diagonal of $\frac{A^TA}{n - 1}$ is the variance of the $i$th column of the matrix.
print(A)
     [,1] [,2]
[1,]    1    2
[2,]    2    4
[3,]    3    6
[4,]    4    8
[5,]    5   10
A[, 1] <- A[, 1] - mean(A[, 1])
A[, 2] <- A[, 2] - mean(A[, 2])
print(A)
     [,1] [,2]
[1,]   -2   -4
[2,]   -1   -2
[3,]    0    0
[4,]    1    2
[5,]    2    4
t(A) %*% A / (nrow(A) - 1)
     [,1] [,2]
[1,]  2.5    5
[2,]  5.0   10
cov(A[, 1], A[, 2])
[1] 5
var(A[, 1])
[1] 2.5
var(A[, 2])
[1] 10
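As a sanity check, R's built-in `cov()` applied to the whole matrix reproduces $\frac{A^TA}{n-1}$ directly. Note that `cov()` centers the columns itself, so it gives the same answer whether or not the means have already been subtracted. A sketch with the toy data above:

```r
# The same toy data as above, before centering
A <- matrix(c(1:5, 2 * (1:5)), nrow = 5)

# Center each column (scale() with scale = FALSE subtracts column means)
A_c <- scale(A, center = TRUE, scale = FALSE)

# The covariance matrix two ways
manual  <- t(A_c) %*% A_c / (nrow(A_c) - 1)
builtin <- cov(A)  # cov() centers internally

stopifnot(all.equal(manual, builtin, check.attributes = FALSE))
```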
The eigenvalues $\lambda_1, \lambda_2, ... \lambda_n$ of $\frac{A^TA}{n - 1}$ are real and nonnegative, because the matrix is symmetric and positive semi-definite, and their corresponding eigenvectors are orthogonal, meaning they point in mutually perpendicular directions.
The total variance of the data set is the sum of the eigenvalues of $\frac{A^TA}{n - 1}$.
These eigenvectors $v_1, v_2, ..., v_n$ are called the principal components of the data set in the matrix $A$.
The direction that $v_j$ points in explains $\lambda_j$ of the total variance in the data set, that is, a proportion $\lambda_j / (\lambda_1 + \lambda_2 + \cdots + \lambda_n)$ of it. If a small subset of $\lambda_1, \lambda_2, ... \lambda_n$ explains most of the total variance, there is an opportunity for dimension reduction.
eigen(t(A) %*% A / (nrow(A) - 1))
eigen() decomposition
$values
[1] 12.5  0.0

$vectors
          [,1]       [,2]
[1,] 0.4472136 -0.8944272
[2,] 0.8944272  0.4472136
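The claims above can be checked numerically (a sketch; the variable names are my own). The eigenvalues sum to the total variance, the eigenvectors are orthogonal, and here the first component explains all of the variance, because the second column is exactly twice the first:

```r
# Centered toy data from above: column 2 is exactly twice column 1
A <- matrix(c(-2, -1, 0, 1, 2, -4, -2, 0, 2, 4), nrow = 5)
C <- t(A) %*% A / (nrow(A) - 1)
e <- eigen(C)

# Total variance equals the sum of the eigenvalues
stopifnot(all.equal(sum(e$values), var(A[, 1]) + var(A[, 2])))

# Eigenvectors are orthogonal: their dot product is (numerically) zero
stopifnot(abs(sum(e$vectors[, 1] * e$vectors[, 2])) < 1e-12)

# Proportion of total variance explained by each component
e$values / sum(e$values)  # the first component explains everything here
```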