Last lecture we discussed how the OLS estimates in linear regression can become unstable and highly variable when \(p >n\). Some ways to control the variance include:
1. Subset selection (computationally infeasible for large \(p\))
2. Shrinkage (Ridge Regression or LASSO)
3. Dimensionality Reduction
Motivation
Methods 1. and 2. are defined using the original predictors, \(X_1, X_2, \dots , X_p\).
Dimension reduction is an alternative class of approaches that transform the predictors and then fit a least squares model using the transformed variables.
These techniques are referred to as dimension reduction methods because they reduce the problem from one of \(p\) predictors to one of \(M\) predictors (where \(M < p\)).
Transformed Variables
Let \(Z_1, Z_2, \ldots, Z_M\) represent \(M\) linear combinations of our original \(p\) predictors \(X_1, X_2, \dots, X_p\). That is, \[Z_m = \sum_{j=1}^{p} \phi_{mj} X_j\] for some constants \(\phi_{m1}, \ldots, \phi_{mp}\).
We then fit the linear regression model \[y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i, \qquad i = 1, \ldots, n,\] where the regression coefficients are given by \(\theta_0, \theta_1, \ldots, \theta_M\).
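To see why this is still a linear model in the original predictors (and hence a constrained version of OLS), substitute the definition of \(Z_m\) and write \(\beta_j\) for the implied coefficient on \(X_j\): \[\sum_{m=1}^{M} \theta_m z_{im} = \sum_{m=1}^{M} \theta_m \sum_{j=1}^{p} \phi_{mj} x_{ij} = \sum_{j=1}^{p} \beta_j x_{ij}, \qquad \text{where } \beta_j = \sum_{m=1}^{M} \theta_m \phi_{mj}.\] Dimension reduction therefore constrains the \(\beta_j\)s to this particular form, which is where the reduction in variance comes from.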
Comment
If the constants \(\phi_{m1}, \ldots, \phi_{mp}\) are chosen wisely, then such dimension reduction approaches can often outperform OLS regression.
The selection of the \(\phi_{mj}\)s (or equivalently the \(Z_1, \dots, Z_M\)) can be achieved in different ways.
In this lecture, we will consider principal components for this task.
Principal Components Analysis
Principal components analysis (PCA) is a popular approach for deriving a low-dimensional set of features from a large set of variables.
PCA can be used as a dimension reduction technique for regression (supervised learning)
Note that PCA can also be used as a tool for clustering (unsupervised learning).
Order of topics
Section 6.3 of your textbook applies PCA (without description) to define the linear combinations of the predictors, for use in our regression.
They later go on to describe the method in the context of unsupervised learning in Section 12.2.
We will do it in the opposite order (PCA for clustering first – then return to PCA regression)
Recall Supervised vs. Unsupervised
In the supervised setting, we observe both a set of features \(X_1, X_2, ..., X_p\) for each object, as well as a response or outcome variable \(Y\). The goal is then to predict \(Y\) using \(X_1, X_2, ..., X_p\).
Here we instead focus on unsupervised learning, where we observe only the features \(X_1, X_2, ..., X_p\). The goal is to discover interesting things about these features.
Visualization Motivation
We saw how difficult it is to visualize even moderately sized data (recall the pairs plots for the wine data?)
PCA enables us to project our data into some lower dimensional space whilst preserving the maximum amount of information.
This in turn increases the interpretability of the data and makes PCA an extremely useful tool for data visualization.
Applications
“Big” data is quite common these days, so unsurprisingly, PCA is one of the most widely used tools in statistics and data analysis.
It has applications in many fields including image processing, population genetics, microbiome studies, medical physics, finance, and atmospheric science, among others (all usually having large \(p\)).
PCA allows us to go from processing something like this (Raman spectroscopy data) …
… to one of these gray-scale images (left: non-irradiated tumour; middle: mice irradiated to 5 Gy; right: mice irradiated to 10 Gy).
Example of three gray-scale Raman maps with pixel intensities equal to the scaled and discretized PC1 scores [Source]
What is PCA
PCA produces a low-dimensional representation of a dataset such that most of the information is preserved.
It produces linear combinations of the original variables to generate new axes, known as principal components, or PCs.
These principal components will be selected such that:
they are all uncorrelated with one another;
PC1 has the largest variance, PC2 the second largest variance, and so on.
Linear Combination
These linear combinations take the form \[
Z_m = \sum_{j=1}^p \phi_{mj}X_j
\] for some constants \(\phi_{m1}, \dots, \phi_{mp}\), such that \(\sum_{j=1}^p \phi_{mj}^2 = 1\).
Note
We want the \(Z\)s to be uncorrelated, i.e. we want \(\text{Cor}(Z_j, Z_k) = 0\) for all \(j \neq k\)
Simple Bivariate Example
Consider this bivariate data (notice the lack of \(Y\))
The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component, and the blue dashed line indicates the second principal component.
Left: The first principal component direction is shown in green. It is the dimension along which the data vary the most, and it also defines the line that is closest to all n of the observations. The distances from each observation to the principal component are represented using the black dashed line segments. The blue dot represents (\(\overline{\texttt{pop}}, \overline{\texttt{ad}}\)). Right: The left-hand panel has been rotated so that the first principal component direction coincides with the x-axis.
First Principal Component (visually)
To find PC1, we want to draw a line such that the data are as spread out as possible along that line.
Another interpretation is that the first principal component minimizes the sum of the squared perpendicular distances between each point and the line
This line will be our first PC, or PC1, which will form our new “Z1-axis” (to be used instead of our “X1-axis”).
Projection Animation
An animation of 2D points (blue) and their projection onto a 1D moving line (red). In the context of PCA, we want to choose PC1 to be the line such that the red projected points have the highest variance. Image Source: stackoverflow.
PC1 (mathematically)
We look for the linear combination of the sample feature values of the form \[
z_{i1} = \phi_{11}x_{i1} + \phi_{21} x_{i2} + \dots + \phi_{p1}x_{ip}, \qquad i = 1, \ldots, n,
\] that has the largest sample variance, subject to the constraint that \(\sum_{j=1}^{p} \phi_{j1}^2 = 1\).
Computation of PC1
The PC1 loading vector solves the optimization problem \[
\underset{\phi_{11}, \ldots, \phi_{p1}}{\text{maximize}} \;\; \frac{1}{n}\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^{2} \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^{2} = 1.
\]
This problem can be solved via an eigen-decomposition of the covariance matrix of \(X\) (or, equivalently, a singular value decomposition (SVD) of \(X\)), a standard technique in linear algebra.
Notation
We refer to \(Z_1\) as the first principal component, with realized values, or scores, \(z_{11}, \ldots, z_{n1}\).
We refer to the elements \(\phi_{11}, \dots, \phi_{p1}\) as the loadings of PC1 and the corresponding PC loading vector is given by \[\begin{align}
\phi_1 = (\phi_{11}, \phi_{21}, \cdots, \phi_{p1})^T
\end{align}\]
The loading vector \(\phi_1 = (\phi_{11}, \phi_{21}, \cdots, \phi_{p1})^T\) defines a direction in the original feature space along which the data vary the most.
When the variables are standardized, the loadings also tell us about the correlations between the original variables and the principal components (see the sketch below).
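A quick numerical check of this interpretation (a minimal sketch; `X` is a placeholder name for a numeric data matrix):

```r
X_std <- scale(X)                 # standardize: mean 0, standard deviation 1
pca   <- prcomp(X_std)

# Correlations between each (standardized) variable and each PC score
cor_X_Z <- cor(X_std, pca$x)

# For standardized data this equals loading * standard deviation of the PC
loadings_scaled <- sweep(pca$rotation, 2, pca$sdev, "*")
all.equal(unname(cor_X_Z), unname(loadings_scaled))   # should be TRUE
```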
Steps of PCA
Standardize data (mean-centered and scaled)
Compute the covariance matrix
Perform eigenvalue decomposition
Select \(M\) components
Transform the data
Assess the cumulative explained variance
Analyze the principal components (an R sketch of these steps follows below)
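A minimal R sketch of the steps above, assuming the data sit in a numeric matrix `X` (a placeholder name) and that we retain enough components for roughly 90% of the variance:

```r
# 1. Standardize the data (mean-centre and scale to unit variance)
X_std <- scale(X)

# 2. Compute the covariance matrix (equal to the correlation matrix after scaling)
S <- cov(X_std)

# 3. Eigenvalue decomposition
ev <- eigen(S)                       # ev$values = variances, ev$vectors = loadings

# 4. Select M components (here: enough to retain ~90% of the variance)
prop_var <- ev$values / sum(ev$values)
M <- which(cumsum(prop_var) >= 0.90)[1]

# 5. Transform the data: scores Z = X_std %*% loadings
Z <- X_std %*% ev$vectors[, 1:M, drop = FALSE]

# 6. Assess the cumulative explained variance
cumsum(prop_var)[1:M]

# 7. Analyze the principal components (e.g. inspect the loadings)
ev$vectors[, 1:M, drop = FALSE]
```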
Uncentered Data
Since we are only interested in variance, we assume that each of the variables in \(X\) has been centered to have mean zero.
Mean-centered Data
After centering, each variable in \(X\) has mean zero; the point cloud is shifted to be centered at the origin, but the variances are unchanged.
Centered and Scaled Data
To ensure that all variables contribute equally to the analysis, we also scale the variables to have a standard deviation of 1.
The new \(Z_1\) and \(Z_2\) variables that form the axes of our scatterplot are our principal components: PC1 and PC2.
Mathematically, the orientations of these new axes relative to the original axes are called eigenvectors, and the variances along these axes are called eigenvalues.
We can find the eigenvectors and eigenvalues by performing eigenvalue decomposition on the covariance matrix.
In R this can be accomplished using the eigen() function
eigen()
Eigenvalue decomposition on the covariance matrix of the scaled data:
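A minimal sketch of this step (here `dat` is a placeholder name for a data frame holding the two variables, e.g. pop and ad):

```r
dat_std <- scale(dat)        # centre and scale both variables
ev <- eigen(cov(dat_std))    # eigen-decomposition of the covariance matrix of the scaled data

ev$values    # eigenvalues:  the variances along the new axes (PC1, PC2)
ev$vectors   # eigenvectors: the orientations of the new axes (the loadings)
```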
The first eigenvector will span the direction of greatest variance, and each subsequent eigenvector will be perpendicular (and hence uncorrelated) to those calculated before it. Image Source: tds
Comment on PCs
PC1 and PC2 are uncorrelated, i.e. \(\text{Cor}(Z_1, Z_2) = 0\); equivalently, they are perpendicular to each other in our plots
The PCs are chosen to have maximal variance
Any other choice for the \(\phi_{j1}\)s would produce a \(Z_1\) having a lower variance than the one we obtained (similarly for \(Z_2\))
PCs are ranked in order of their importance; the first principal component (PC1) captures the most variance, the second principal component (PC2) captures the second most, and so on.
Comment on \(M\)
Here \(M = p = 2\); that is, we haven’t reduced the dimension.
The spatial relationships of the points are unchanged; this process has merely rotated the data.
In other words, there is NO information loss between the \(p\) PCs and the original \(X_1, \dots, X_p\) predictors; PCA is just a rotation of the data.
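A quick numerical check of this claim (a sketch; `dat` is again a placeholder data matrix):

```r
X_std <- scale(dat)
pca   <- prcomp(X_std)

Z      <- pca$x                      # the rotated data (all p PCs)
X_back <- Z %*% t(pca$rotation)      # undo the rotation (the rotation matrix is orthogonal)

all.equal(X_back, X_std, check.attributes = FALSE)   # TRUE: nothing is lost when M = p
```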
\(M<p\)
Typically when we are employing PCA, we would select \(M<p\) thereby reducing the dimension of our dataset.
Often we will only look at the first 2 PCs, or the first \(M\) PCs that contain a certain (high) proportion of the variability in the data (more on this in a moment).
PCA in R
While we could do PCA using the base eigen() function in R, going forward we will employ the prcomp() function.
As you will see, we can extract equivalent information from the output of the prcomp() function as we did from eigen().
Warning
The default for the scale. argument (a logical value indicating whether the variables should be scaled to have unit variance) is FALSE.
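A minimal sketch (again with `dat` as a placeholder data frame), making the scaling explicit and showing the correspondence with eigen():

```r
# scale. = FALSE is the default, so request scaling explicitly
pca <- prcomp(dat, scale. = TRUE)

pca$sdev^2     # variances of the PCs = eigenvalues of the correlation matrix
pca$rotation   # loadings             = eigenvectors
pca$x          # scores               = the transformed data (the Z's)

# the same eigenvalues via eigen()
eigen(cor(dat))$values
```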
prcomp ($rotation)
The loadings are stored in the columns of the rotation element of the prcomp() output
```
Importance of components:
                          PC1     PC2
Standard deviation     6.0361 0.48208
Proportion of Variance 0.9937 0.00634
Cumulative Proportion  0.9937 1.00000
```
Over 99% of the variance in our original features, \(X_1\) and \(X_2\), can be captured by \(Z_1\) alone
```r
# Notice:
var(x1) + var(x2)
#> [1] 36.66734
var(z1) + var(z2)
#> [1] 36.66734
```
Note: we do not lose any of the information when \(M = p\); we simply redistribute the variance across the new variables
How many components to keep
There are several common options for choosing \(M\) (R sketches of each follow the list below):
Cumulative proportion/percent of variance: Keep number of components such that, say, 90% (or 95%, or 80%, etc) of the variance from original data is retained.
Kaiser criterion: Keep all components with \(\lambda_j \geq \bar \lambda\), where \(\bar \lambda = \frac{\sum_{j=1}^p \lambda_j}{p}\) and \(\lambda_j\) is the eigenvalue associated with the \(j\)th eigenvector (i.e. the variance of \(Z_j\))
Scree plot: Look for an “elbow”, or plateauing of the variance explained in the scree plot
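Sketches of these three rules in R (assuming a fitted prcomp object called `pca`, a placeholder name):

```r
eig <- pca$sdev^2                    # eigenvalues (variances of the PCs)
prop_var <- eig / sum(eig)

# 1. Cumulative proportion of variance: smallest M retaining, say, 90%
M_cum <- which(cumsum(prop_var) >= 0.90)[1]

# 2. Kaiser criterion: keep components whose eigenvalue is at least the average eigenvalue
M_kaiser <- sum(eig >= mean(eig))

# 3. Scree plot: look for an "elbow" or plateau
plot(prop_var, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance explained")
```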
PCA and Scaling
PCA is NOT scale invariant
As we’ll see in an example, this has huge implications…notably, any variable with large variance (relative to the rest) will dominate the first principal component.
In most cases, this is undesirable.
Most commonly, you will want to scale your data to have mean 0 and variance 1.
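A sketch of the contrast (with `dat` a placeholder data frame; only the relative behaviour matters here):

```r
pca_raw    <- prcomp(dat, scale. = FALSE)  # unscaled: the high-variance variable dominates PC1
pca_scaled <- prcomp(dat, scale. = TRUE)   # scaled: each variable contributes on an equal footing

summary(pca_raw)$importance
summary(pca_scaled)$importance
```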
PCA on Heptathlon Data
Let's consider the results of the Olympic heptathlon competition, Seoul, 1988.
The heptathlon is a track and field combined event consisting of seven athletic events.
This data frame has 25 observations on 8 variables: hurdles, highjump, shot, run200m, longjump, javelin, run800m, and score.
Can PCA suggest a different (simpler?) scoring system that best separates the participants?
In more statistical lingo, our goal is to find a single variable (constructed from the original measures) that provides the bulk of the variation present in the data
So let’s remove the score column and work with the remaining variables in an unsupervised setting…
Run PCA
So let’s run PCA (be sure to remove the score first)
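A sketch of this in R, assuming the heptathlon data come from the HSAUR package (one common source of this dataset):

```r
data("heptathlon", package = "HSAUR")

hep     <- heptathlon[, names(heptathlon) != "score"]  # remove the score column
hep_pca <- prcomp(hep)   # unscaled here, to match the biplot shown below;
                         # prcomp(hep, scale. = TRUE) gives the scaled analysis

summary(hep_pca)     # variance explained by each PC
hep_pca$rotation     # loadings: how each event contributes to each PC
```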
A biplot merges the PCA scores from the first two principal components with the loadings of each feature
The horizontal axis is PC1 and the vertical axis is PC2
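Continuing the sketch above (using the hypothetical `hep_pca` object):

```r
biplot(hep_pca, cex = 0.6)   # observations plotted as PC scores, features as loading arrows
```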
Axes of a Biplot
The labeled points represent the original observed data \(x_1, \dots, x_n\), where \(x_i\) is a \(p\)-dimensional vector \((x_{i1}, x_{i2}, \dots, x_{ip})\), projected into this 2D space.
In other words, the labeled points represent the PC scores \((z_1, \dots, z_n)\), where \(z_i\) is now a 2-dimensional vector \((z_{i1}, z_{i2})\)
The top horizontal and right vertical axes represent the feature loadings on PC1 and PC2, respectively.
Arrows on the Biplot
Each of the original features (\(X_1, \dots, X_p\)) will be represented by an arrow in the new 2D space
The magnitude and direction of a given arrow tell us how much the corresponding feature contributed to the construction of that axis:
Since PC1 is made up mostly of run800m, its arrow \((\phi_{71}, \phi_{72})\) points heavily in the horizontal (i.e. PC1-axis) direction.
Since PC2 is made up mostly of javelin, its arrow \((\phi_{61}, \phi_{62})\) points heavily in the vertical (i.e. PC2-axis) direction.
Biplot (unscaled data)
PC1 is composed mostly of run800m; PC2 mostly of javelin; PC3 mostly of shot.