View/Print PDFPS
Module 6: Principal components regression
By Bent Jørgensen and Yuri Goegebeur


Table of Contents





6.1 The SVD factorization

Previous Section
Next Section

You're saying this only to make me go. [Ilsa Lund Laszlo, Casablanca, 1942]

In this module we turn to the Principal Components Regression (PCR) method, in which the PCA (Principal Components Analysis) method from the previous module is put to work in regression. To this end we consider the principal components of $ \boldsymbol{X}^{\top }\boldsymbol{X}$ , where $ \boldsymbol{X}$ is a centered $ n\times k$ data matrix.

There are several ways of finding the principal components of the $ \boldsymbol{X}^{\top }\boldsymbol{X}$ matrix. One possibility is to apply the SVD method to $ \boldsymbol{X}$ , writing the reduced form of SVD as follows:

$\displaystyle \boldsymbol{X}=\boldsymbol{UDP}^{\top },
$

where $ \boldsymbol{U}$ ($ n\times r$ ) and $ \boldsymbol{P}$ ($ k\times r$ ) are orthogonal matrices corresponding to $ r$ singular values, in the notation of Module 5.

Let the scores matrix be defined by

$\displaystyle \boldsymbol{T}=\boldsymbol{UD,}
$

a matrix with orthogonal, but not necessarily orthonormal columns. In fact
$\displaystyle \boldsymbol{T}^{\top }\boldsymbol{T}$ $\displaystyle =$ $\displaystyle \boldsymbol{DU}^{\top }\boldsymbol{UD}$  
  $\displaystyle =$ $\displaystyle \boldsymbol{D}^{2}$  
  $\displaystyle =$ $\displaystyle \boldsymbol{\Lambda }_{r}\boldsymbol{,}$  

where $ \boldsymbol{\Lambda }_{r}=\mathrm{diag}\left\{ \lambda _{1},\ldots
,\lambda _{r}\right\} $ contains the non-zero eigenvalues of $ \boldsymbol{X}^{\top }\boldsymbol{X}$ in its diagonal. We assume that the eigenvalues are in decreasing order, $ \lambda _{1}\geq \cdots \geq \lambda _{r}>0$ .

Since

$\displaystyle \boldsymbol{X}=\boldsymbol{TP}^{\top }$ (6.1)

we find that
$\displaystyle \boldsymbol{X}^{\top }\boldsymbol{X}$ $\displaystyle =$ $\displaystyle \boldsymbol{PT}^{\top }\boldsymbol{TP}^{\top }$  
  $\displaystyle =$ $\displaystyle \boldsymbol{P\Lambda }_{r}\boldsymbol{P}^{\top },$  

which is the spectral decomposition for $ \boldsymbol{X^{\top }\boldsymbol{X}}
$ , except that columns of $ \boldsymbol{P}$ corresponding to zero eigenvalues have been left out. By using that $ \boldsymbol{P}$ is orthogonal, we may also write (6.1) as follows:

$\displaystyle \boldsymbol{T=XP}$, (6.2)

which follows by noting that $ \boldsymbol{XP}=\boldsymbol{TP}^{\top }\boldsymbol{P=T}$ . Recall from Module 5 that the columns of $ \boldsymbol{T}$ are known as scores, and those of $ \boldsymbol{P}$ as loadings.




6.2 The NIPALS algorithm for PCA

Previous Section
Next Section

Now we consider the NIPALS (Nonlinear Iterative Partial Least Squares) algorithm for finding the principal components of $ \boldsymbol{X}^{\top }\boldsymbol{X}$ . We want to find the first $ g$ principal components of $ \boldsymbol{X}^{\top }\boldsymbol{X}$ , starting with the largest eigenvalue $ \lambda _{1}$ and down. $ g$ must be less than or equal to $ r.$




6.2.1 Iterations

Previous Section
Next Section

The NIPALS algorithm starts with the initialization $ j=1$ and $ \boldsymbol{X}_{1}=\boldsymbol{X}$ . The algorithm then iterates through the following steps:

  1. Choose $ \boldsymbol{t}_{j}$ as any column of $ \boldsymbol{X}_{j}$ .

  2. Let $ \boldsymbol{p}_{j}=\boldsymbol{X}_{j}^{\top }\boldsymbol{t}_{j}\,/\,\left\Vert \boldsymbol{X}_{j}^{\top }\boldsymbol{t}_{j}\right\Vert
. $

  3. Let $ \boldsymbol{t}_{j}=\boldsymbol{X}_{j}\boldsymbol{p}_{j}.$

  4. If $ \boldsymbol{t}_{j}$ is unchanged continue; otherwise return to Step 2.

  5. Let $ \boldsymbol{X}_{j+1}\boldsymbol{=X}_{j}-\boldsymbol{t}_{j}\boldsymbol{p}_{j}^{\top }$ .

  6. Stop if $ j=g$ ; otherwise let $ j=j+1$ and return to Step 1.

Assume first that $ g=r$ , so we have found all the principal components. Now form the matrices $ \boldsymbol{T}$ and $ \boldsymbol{P}$ with columns $ \boldsymbol{t}_{j}$ and $ \boldsymbol{p}_{j}$ , respectively; these matrices now satisfy (6.1).

It is possible to modify the NIPALS algorithm to take missing data into account, see Bro (1996), pp. 43-44.




6.2.2 Properties

Previous Section
Next Section

Let us consider some properties of the NIPALS algorithm, which also help understand the PCA method.

That the NIPALS algorithm gives PCA may be seen as follows. Let $ \lambda
_{j}=\left\Vert \boldsymbol{X}^{\top }\boldsymbol{t}_{j}\right\Vert $ and write Step 2 as follows:

$\displaystyle \boldsymbol{X}^{\top }\boldsymbol{t}_{j}=\lambda _{j}\boldsymbol{p}_{j}.
$

Now insert $ \boldsymbol{t}_{j}=\boldsymbol{Xp}_{j}$ from Step 3, giving

$\displaystyle \boldsymbol{X}^{\top }\boldsymbol{Xp}_{j}=\lambda _{j}\boldsymbol{p}_{j}.$ (6.3)

This equation is satisfied upon convergence of the loop 2-4. This shows that $ \lambda _{j}$ and $ \boldsymbol{p}_{j}$ are an eigenvalue and eigenvector of $ \boldsymbol{X}^{\top }\boldsymbol{X}$ , respectively. Also note that using $ \boldsymbol{t}_{j}=\boldsymbol{Xp}_{j}$ and (6.3) we obtain
$\displaystyle \boldsymbol{t}_{j}^{\top }\boldsymbol{t}_{j}$ $\displaystyle =$ $\displaystyle \boldsymbol{p}_{j}^{\top }\boldsymbol{X}^{\top }\boldsymbol{Xp}_{j}$  
  $\displaystyle =$ $\displaystyle \boldsymbol{p}_{j}^{\top }\left( \boldsymbol{X}^{\top }\boldsymbol{Xp}_{j}\right)$  
  $\displaystyle =$ $\displaystyle \lambda _{j}\boldsymbol{p}_{j}^{\top }\boldsymbol{p}_{j}$  
  $\displaystyle =$ $\displaystyle \lambda _{j},$ (6.4)

where in the last step we have used the fact that $ \boldsymbol{p}_{j}$ is a unit vector (see Step 2).

After the first run through the loop 1-5, Step 5 with $ j=1$ gives that

$\displaystyle \boldsymbol{X=t}_{1}\boldsymbol{p}_{1}^{\top }+\boldsymbol{X}_{2}.$ (6.5)

Let us show that $ \boldsymbol{t}_{1}$ and $ \boldsymbol{X}_{2}=\boldsymbol{X-t}_{1}\boldsymbol{p}_{1}^{\top }$ are orthogonal. In fact
$\displaystyle \left( \boldsymbol{X-t}_{1}\boldsymbol{p}_{1}^{\top }\right) ^{\top }\boldsymbol{t}_{1}$ $\displaystyle \boldsymbol{=}$ $\displaystyle \boldsymbol{X}^{\top }\boldsymbol{t}_{1}\boldsymbol{-p}_{1}\boldsymbol{t}_{1}^{\top }\boldsymbol{t}_{1}$  
  $\displaystyle =$ $\displaystyle \boldsymbol{X}^{\top }\boldsymbol{Xp}_{1}\boldsymbol{-p}_{1}\lambda _{1}$  
  $\displaystyle =$ $\displaystyle \boldsymbol{0},$  

as seen from (6.3) with $ j=1$ . Since $ \boldsymbol{t}_{2}$ was initially picked as a column of $ \boldsymbol{X}_{2}$ , it is hence orthogonal to $ \boldsymbol{t}_{1}$ and remains so to the end of the loop.

After the second run through the loop 1-5, we obtain

$\displaystyle \boldsymbol{X=t}_{1}\boldsymbol{p}_{1}^{\top }+\boldsymbol{t}_{2}\boldsymbol{p}_{2}^{\top }+\boldsymbol{X}_{3}.
$

After $ g$ runs through the loop, we similarly have

$\displaystyle \boldsymbol{X=t}_{1}\boldsymbol{p}_{1}^{\top }+\boldsymbol{t}_{2}...
...op }+\cdots +\boldsymbol{t}_{g}\boldsymbol{p}_{g}^{\top }+\boldsymbol{X}_{g+1},$ (6.6)

where $ \boldsymbol{X}_{g+1}=\boldsymbol{0}$ in the case $ g=r$ (compare with (6.1)).




6.3 How many components?

Previous Section
Next Section

As the example suggests, the essence of the PCA method is to decompose the $ \boldsymbol{X}$ matrix as in (6.6),


$\displaystyle \boldsymbol{X}$ $\displaystyle =$ $\displaystyle \boldsymbol{t}_{1}\boldsymbol{p}_{1}^{\top }+\boldsymbol{t}_{2}\b...
...top }+\cdots +\boldsymbol{t}_{g}\boldsymbol{p}_{g}^{\top }+\boldsymbol{X}_{g+1}$ (6.7)
  $\displaystyle =$ $\displaystyle \boldsymbol{T}_{g}\boldsymbol{P}_{g}^{\top }+\boldsymbol{X}_{g+1},$  

say, where $ \boldsymbol{T}_{g}$ and $ \boldsymbol{P}_{g}$ contain the first $ g $ columns of $ \boldsymbol{T}$ and $ \boldsymbol{P}$ , respectively. We want to choose $ g$ in such a way that $ \boldsymbol{X}_{g+1}$ is small and represents only noise, while the term $ \boldsymbol{T}_{g}\boldsymbol{P}_{g}^{\top }$ represents the salient features of $ \boldsymbol{X}$ . In order to accomplish this, $ g$ must be chosen in such a way that the $ r-g$ terms that are ignored correspond to zero or negligible eigenvalues.

In order to help rationalize the choice of $ g$ , the relative size of the eigenvalues are expressed as a percentage of the sum of all eigenvalues,

$\displaystyle \frac{\lambda _{1}}{\lambda _{1}+\cdots +\lambda _{r}}\times 100\
$

and this percentage is interpreted as the percent variation explained by the corresponding principal component. Often, the cumulated percentages are used, so that the percent variation explained by the first $ g$ components is

$\displaystyle \frac{\lambda _{1}+\cdots +\lambda _{g}}{\lambda _{1}+\cdots +\lambda _{r}}\times 100\
$

As a rule, $ g$ should be chosen so that at least about 80-90 percent of the variation is explained.

The justification for the above terminology is that the variance of the score vector $ \boldsymbol{t}_{j}$ is

$\displaystyle s^{2}\left( \boldsymbol{t}_{j}\right)$ $\displaystyle =$ $\displaystyle \frac{1}{n-1}\left\Vert \boldsymbol{t}_{j}\right\Vert ^{2}$  
  $\displaystyle =$ $\displaystyle \frac{1}{n-1}\boldsymbol{t}_{j}^{\top }\boldsymbol{t}_{j}$  
  $\displaystyle =$ $\displaystyle \frac{1}{n-1}\lambda _{j},$  

so that $ \lambda _{j}$ is proportional to the variance of the corresponding score. In particular, all components with $ \lambda _{j}=0$ should be left out. Also, since the covariance matrix of $ \boldsymbol{X}$ is
$\displaystyle \boldsymbol{V}_{X}$ $\displaystyle =$ $\displaystyle \frac{1}{n-1}\boldsymbol{X}^{\top }\boldsymbol{X}$  
  $\displaystyle =$ $\displaystyle \frac{1}{n-1}\sum_{j=1}^{k}\lambda _{j}\boldsymbol{p}_{j}\boldsymbol{p}_{j}^{\top }\boldsymbol{,}$  

we may interpret $ \lambda _{j}$ as the contribution of $ \boldsymbol{t}_{j}$ to the total (co-)variance for $ \boldsymbol{X}$ . Note also that the sum of the eigenvalues is equal to

$\displaystyle \mathrm{tr}\{\boldsymbol{X}^{\top }\boldsymbol{X}\}=(n-1)\left\{ s^{2}(\boldsymbol{x}_{1})+\cdots +s^{2}(\boldsymbol{x}_{k})\right\} ,
$

which is interpreted as the total variance in $ \boldsymbol{X}$ .




6.4 Principal components regression

Previous Section
Next Section

The basic idea in Principal Components Regression (PCR) is that after choosing a suitable value for $ g$ in (6.7), the important features of $ \boldsymbol{X}$ have been retained by $ \boldsymbol{T}_{g}$ . We then perform the MLR with $ \boldsymbol{T}_{g}$ in place of $ \boldsymbol{X}$ for an $ n\times m$ calibration data matrix $ \boldsymbol{Y}$ ,

$\displaystyle \boldsymbol{Y}=\boldsymbol{T}_{g}\boldsymbol{C}+\boldsymbol{F}.$ (6.8)

The least squares method then gives

$\displaystyle \widehat{\boldsymbol{C}}=\left( \boldsymbol{T}_{g}^{\top }\boldsymbol{T}_{g}\right) ^{-1}\boldsymbol{T}_{g}^{\top }\boldsymbol{Y,}
$

where $ \boldsymbol{T}_{g}^{\top }\boldsymbol{T}_{g}$ , being diagonal, is easy to invert. The fact that we have left out the loadings matrix $ \boldsymbol{P}_{g}$ in (6.8) is of no consequence for prediction, because the scores $ \boldsymbol{t}_{j}$ are linear combinations of the columns of $ \boldsymbol{X}$ , and the PCR method amounts to singling out those linear combinations that are best for predicting $ \boldsymbol{Y}$ .




6.4.1 Prediction

Previous Section
Next Section

For prediction with PCR, it is necessary to turn to $ \boldsymbol{X}$ again, and using (6.2) we may write the regression equation as follows:

$\displaystyle \boldsymbol{Y}$ $\displaystyle =$ $\displaystyle \boldsymbol{T}_{g}\boldsymbol{C}+\boldsymbol{F}.$  
  $\displaystyle =$ $\displaystyle \boldsymbol{XP}_{g}\boldsymbol{C}+\boldsymbol{F}.$ (6.9)

Consider a new sample spectrum $ \boldsymbol{z}$ and predicted value $ \widehat{\boldsymbol{y}}$ (both uncentered), and let $ \overline{\boldsymbol{x}}$ and $ \overline{\boldsymbol{y}}$ be the calibration sample averages. Then the prediction takes the form

$\displaystyle \widehat{{\boldsymbol{y}}}=\overline{\boldsymbol{y}}+\left( \bold...
...-\overline{\boldsymbol{x}}\right) \boldsymbol{P}_{g}\widehat{\boldsymbol{C}}.
$

The matrix $ \boldsymbol{P}_{g}\widehat{\boldsymbol{C}}$ is called the regression matrix, and may be compared with the $ \widehat{\boldsymbol{B}}$ matrix of MLR.




6.4.2 PCR and MLR

Previous Section
Next Section

Just as in MLR, the same prediction would be obtained if the columns of $ \boldsymbol{Y}$ were considered separately. The fact that $ \boldsymbol{P}_{g}
$ appears in the PCR Equation (6.9) may be seen as compensating for the fact that we left out $ \boldsymbol{P}_{g}$ in (6.8). When comparing with MLR, the role of the $ \boldsymbol{B}$ matrix is now played by $ \boldsymbol{P}_{g}\boldsymbol{C}$ .

In the case where $ \boldsymbol{X}$ has rank $ k$ and $ g=k$ , the two methods will give identical results. When $ g<k$ , and still $ \boldsymbol{X}$ has rank $ k$ , the results of PCR may differ somewhat from those of MLR, depending on the number and sizes of the components left out.

However, PCR has some major advantages over MLR, in that $ \boldsymbol{X}$ may be singular, and the case $ k\geq n$ may be dealt with. Note, however, that in the latter case $ \lambda _{n}=\cdots =\lambda _{k}=0$ , and so at most $ g=n-1$ components may be included.


HOME | Back

Last modified January 29, 2007. Webmaster
©2001-2005 Master Of Applied Statistics