Module 5: Principal components analysis
By Bent Jørgensen and Yuri Goegebeur


5.1 Conventions about centering of data matrices


We assume from now on that the data matrices $ \boldsymbol{X}$ and $ \boldsymbol{Y}$ are centered, unless otherwise stated. This will help simplify the notation. For example, $ \boldsymbol{X}^{\top }\boldsymbol{X}$ is then proportional to the covariance matrix for $ \boldsymbol{X}$ , $ \boldsymbol{V}_{X}$ .
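The centering convention can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `center_columns` and the simulated data are illustrative and not part of the module.

```python
import numpy as np

def center_columns(X):
    """Subtract each column mean so that every variable has mean zero."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)

# Simulated data matrix (an assumption): n = 5 observations, k = 3 variables.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(5, 3))
X = center_columns(X_raw)

n = X.shape[0]
# For centered X, the cross-product matrix divided by n - 1 is the covariance matrix.
V_X = X.T @ X / (n - 1)
# NumPy's own estimate centers internally and also divides by n - 1.
V_np = np.cov(X_raw, rowvar=False)
```

Here `V_X` and `V_np` should agree up to rounding error, which illustrates why $ \boldsymbol{X}^{\top }\boldsymbol{X}$ is proportional to the covariance matrix once $ \boldsymbol{X}$ is centered.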

You must remember this.

A kiss is just a kiss.

A sigh is just a sigh.

The fundamental things apply.

As time goes by.

[Casablanca, 1942]




5.2 Singular value decomposition (SVD)


In this module we consider Principal Components Analysis (PCA). This is a general method for the analysis of multivariate data, which will be applied in connection with principal components regression in the next module. But first we consider the singular value decomposition.

  • Consider an $ n\times k$ matrix $ \boldsymbol{X}$ and let $ t=\min \{n,k\}$. The singular value decomposition (SVD) of $ \boldsymbol{X}$ is the factorization

    $\displaystyle \boldsymbol{X}=\boldsymbol{UDP}^{\top },$ (5.1)

    where

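    1. $ \boldsymbol{U}$ is an $ n\times t$ matrix with orthonormal columns, so that $ \boldsymbol{U}^{\top }\boldsymbol{U}=\boldsymbol{I}_{t}$;

    2. $ \boldsymbol{P}$ is a $ k\times t$ matrix with orthonormal columns, so that $ \boldsymbol{P}^{\top }\boldsymbol{P}=\boldsymbol{I}_{t}$;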

    3. $ \boldsymbol{D}=\mathrm{diag}\left\{ d_{1},\ldots ,d_{t}\right\} $ is a $ t\times t$ diagonal matrix with elements $ d_{1}\geq \cdots \geq d_{t}\geq 0$.

    4. The $ r\leq t$ non-zero values $ d_{i}$ are the singular values of $ \boldsymbol{X}$ .

    5. The last $ t-r$ zero values $ d_{i}$ may be discarded together with the last $ t-r$ columns of $ \boldsymbol{U}$ and $ \boldsymbol{P}$, giving the SVD in reduced form, which is the form usually output by SVD algorithms.

  • The matrix $ \boldsymbol{P}$ is called the loadings matrix, and its columns $ \boldsymbol{p}_{1},\ldots ,\boldsymbol{p}_{t}$ are called the loadings.

  • The matrix

    $\displaystyle \boldsymbol{T}=\boldsymbol{UD}$

    is called the scores matrix, and its columns $ \boldsymbol{t}_{1},\ldots ,\boldsymbol{t}_{t}$ are called scores.
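The factorization (5.1), the loadings and the scores can be computed as follows. This is a minimal sketch, assuming NumPy and a small simulated data matrix; `np.linalg.svd` returns $ \boldsymbol{P}^{\top }$ rather than $ \boldsymbol{P}$, so the result is transposed to match the notation above.

```python
import numpy as np

# Simulated, centered data matrix (an assumption): n = 6, k = 3, so t = 3.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
X = X - X.mean(axis=0)

# full_matrices=False gives U (n x t), the singular values d (length t)
# and P transposed (t x k).
U, d, Pt = np.linalg.svd(X, full_matrices=False)
P = Pt.T            # loadings matrix, k x t
D = np.diag(d)      # t x t diagonal matrix, d_1 >= ... >= d_t >= 0
T = U @ D           # scores matrix, n x t

X_rebuilt = U @ D @ P.T   # should reproduce X up to rounding error
```

Because the columns of $ \boldsymbol{P}$ are orthonormal, the scores can equivalently be obtained as $ \boldsymbol{T}=\boldsymbol{XP}$, that is, as the coordinates of the observations along the loading directions.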




5.3 Principal components decomposition


Recall the eigenvalue decomposition from Module 2. Thus let $ \boldsymbol{A}$ be a square symmetric $ n\times n$ matrix, and consider its eigenvalue decomposition

$\displaystyle \boldsymbol{A}=\boldsymbol{P\Lambda P}^{\top }=\sum_{i=1}^{n}\lambda _{i}\boldsymbol{p}_{i}\boldsymbol{p}_{i}^{\top }.$

Here

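  1. $ \boldsymbol{P}$ is an orthogonal matrix whose columns $ \boldsymbol{p}_{1},\ldots ,\boldsymbol{p}_{n}$ are the eigenvectors of $ \boldsymbol{A}$, and $ \boldsymbol{\Lambda }=\mathrm{diag}\left\{ \lambda _{1},\ldots ,\lambda _{n}\right\} $ contains the corresponding eigenvalues.

  2. We may assume that the eigenvalues are ordered: $ \lambda _{1}\geq \cdots \geq \lambda _{n}.$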

  • Now consider the singular value decomposition for the centered $ n\times k$ data matrix $ \boldsymbol{X}$ , and let $ \boldsymbol{A}=\boldsymbol{X}^{\top }\boldsymbol{X}$ . Using (5.1) we find
    $\displaystyle \boldsymbol{A}=\boldsymbol{X}^{\top }\boldsymbol{X}=\boldsymbol{PDU}^{\top }\boldsymbol{UDP}^{\top }=\boldsymbol{PD}^{2}\boldsymbol{P}^{\top },$

    where we have used that $ \boldsymbol{U}^{\top }\boldsymbol{U}=\boldsymbol{I}$, since the columns of $ \boldsymbol{U}$ are orthonormal.

  • The squared singular values $ d_{1}^{2},\ldots ,d_{r}^{2}$ are the non-zero eigenvalues of both matrix $ \boldsymbol{X}^{\top }\boldsymbol{X}$ $ (k\times k)$ and matrix $ \boldsymbol{XX}^{\top }$ $ (n\times n),$ both of which are of rank $ r$ . The columns of $ \boldsymbol{P}$ are eigenvectors of $ \boldsymbol{X}^{\top }\boldsymbol{X}$ , corresponding to the non-zero eigenvalues.

When $ \boldsymbol{X}$ is a centered data matrix, then $ \boldsymbol{X}^{\top }\boldsymbol{X}/(n-1)$ is the covariance matrix of $ \boldsymbol{X}$ . The method of studying the eigenvalue decomposition of the covariance matrix is known as Principal Components Analysis (PCA). We consider the use of PCA as a multivariate data analysis tool below.
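The relation between the SVD of $ \boldsymbol{X}$ and the eigenvalue decomposition of $ \boldsymbol{X}^{\top }\boldsymbol{X}$ can be verified numerically. The sketch below assumes NumPy and simulated data; note that eigenvectors are only determined up to sign, so the comparison with the loadings uses absolute inner products.

```python
import numpy as np

# Simulated, centered data matrix (an assumption): n = 8, k = 3.
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))
X = X - X.mean(axis=0)
n = X.shape[0]

_, d, Pt = np.linalg.svd(X, full_matrices=False)
P = Pt.T

# eigh is intended for symmetric matrices and returns eigenvalues in
# ascending order, so both outputs are reversed to get decreasing order.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1]

# |p_i . v_i| equals 1 when loading i and eigenvector i agree up to sign.
alignment = np.abs(np.sum(P * eigvecs, axis=0))

# The largest eigenvalues of X X^T (n x n) are the same squared singular values.
top_eigs_XXt = np.linalg.eigvalsh(X @ X.T)[::-1][:3]

# Eigenvalues of the covariance matrix X^T X / (n - 1).
cov_eigs = d**2 / (n - 1)
```

In this sketch `eigvals` and `top_eigs_XXt` should both match `d**2`, and `cov_eigs` should match the eigenvalues of the sample covariance matrix.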




5.4 SVD and the eigenvalue decomposition


A second connection between the eigenvalue decomposition and the SVD arises when we apply the SVD directly to the symmetric matrix $ \boldsymbol{A}$ itself. We consider two cases. If $ \boldsymbol{A}$ is positive semi-definite, then its eigenvalue decomposition

$\displaystyle \boldsymbol{A}=\boldsymbol{P\Lambda P}^{\top }$ (5.2)

is the same as the SVD for $ \boldsymbol{A}$. In that case the matrices of the SVD for $ \boldsymbol{A}$ are

$\displaystyle \boldsymbol{U}=\boldsymbol{P},\qquad \boldsymbol{D}=\boldsymbol{\Lambda },$

and the $ \boldsymbol{P}$ of the SVD and of the eigenvalue decomposition are the same. The eigenvalues and singular values for $ \boldsymbol{A}$ are the same, and we normally arrange them in decreasing order, $ \lambda _{1}\geq \cdots \geq \lambda _{n}\geq 0.$

The second case is when $ \boldsymbol{A}$ is not positive semi-definite. In this case, a simple rearrangement of (5.2) leads to the SVD for $ \boldsymbol{A}$. For each negative eigenvalue $ \lambda _{i}$, we replace $ \lambda _{i}$ in $ \boldsymbol{\Lambda }$ by $ -\lambda _{i}$ to form $ \boldsymbol{D}$, and form $ \boldsymbol{U}$ from $ \boldsymbol{P}$ by multiplying the corresponding columns of $ \boldsymbol{P}$ by $ -1$. If necessary, the diagonal elements of $ \boldsymbol{D}$ and the corresponding columns of $ \boldsymbol{U}$ and $ \boldsymbol{P}$ are then reordered so that the elements of $ \boldsymbol{D}$ are in decreasing order.

In both cases we may easily find the eigenvalue decomposition for $ \boldsymbol{A}$ by applying the SVD to $ \boldsymbol{A}$ .
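The sign-flipping construction for a symmetric matrix that is not positive semi-definite can be sketched as follows, assuming NumPy. The $ 2\times 2$ matrix is a made-up example with eigenvalues $ 2$ and $ -3$.

```python
import numpy as np

# Symmetric but indefinite example matrix (an assumption); eigenvalues 2 and -3.
A = np.array([[1.0, 2.0],
              [2.0, -2.0]])

lam, P = np.linalg.eigh(A)            # A = P diag(lam) P^T
signs = np.where(lam < 0, -1.0, 1.0)

D = np.diag(lam * signs)              # replace each negative lambda_i by -lambda_i
U = P * signs                         # multiply the matching columns of P by -1

A_rebuilt = U @ D @ P.T               # should reproduce A

# The diagonal of D holds the singular values, possibly not yet in decreasing order.
singular_values = np.sort(np.diag(D))[::-1]
```

Here `singular_values` should agree with the output of `np.linalg.svd(A, compute_uv=False)`; the final sort plays the role of the reordering step.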

The SVD will serve as a tool for principal components regression (PCR), which is considered in the next module.




5.5 Principal components analysis (PCA)


PCA may be considered a tool for discovering structures in multivariate data, in particular for the purpose of reducing the dimensionality. PCA, in effect, takes your cloud of data points, and rotates and projects it onto a space of lower dimension, selecting the directions in the data space with maximum variability, which are taken to carry the most information.

For example, a spectral block $ \boldsymbol{X}$ contains a lot of redundant information, because absorbances for adjacent frequencies are highly correlated, and because features stemming from a given analyte are spread out over a range of different frequencies. We hence want to find out if there are one, two, or a few factors (directions) along which the spectra show high variability.

Figure 5.1 shows an example of loadings and scores plots for a set of simulated data.

Figure 5.1: Loadings plot (a) and scores plot (b) for simulated data.

The loadings plot shows the vectors $ \boldsymbol{p}_{i}$, giving the $ r$ directions with maximum variability. In higher dimensions, these plots must be made as spectral plots, as illustrated in the Examples section.

The first loadings vector is, in effect, the direction of the line through the centroid of the data that minimizes the sum of the squared distances from the points to the line. The line is thus as close as possible to all the data, and therefore shows the direction in the data with maximum variation. The second loadings vector is orthogonal to the first and, subject to that constraint, satisfies the same condition as the first, and so on.
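The least-squares property of the first loadings vector can be illustrated in two dimensions. The sketch below assumes NumPy and simulated data; the helper `squared_distance_to_line` is a hypothetical name introduced for illustration. It compares the first loading with a grid of candidate directions through the centroid.

```python
import numpy as np

def squared_distance_to_line(X, v):
    """Sum of squared perpendicular distances from the rows of the centered
    matrix X to the line through the origin with direction v."""
    v = v / np.linalg.norm(v)
    residual = X - np.outer(X @ v, v)   # remove the component along v
    return np.sum(residual**2)

# Simulated, correlated two-dimensional data (an assumption), then centered.
rng = np.random.default_rng(5)
X = rng.normal(size=(40, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

_, d, Pt = np.linalg.svd(X, full_matrices=False)
p1 = Pt[0]                              # first loadings vector

best = squared_distance_to_line(X, p1)
angles = np.linspace(0.0, np.pi, 181)
candidates = [
    squared_distance_to_line(X, np.array([np.cos(a), np.sin(a)]))
    for a in angles
]
```

No candidate direction should give a smaller value than `best`. The value itself equals the total sum of squares minus $ d_{1}^{2}$, which is why minimizing the distances and maximizing the variation along the line amount to the same thing.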

The size of each eigenvalue $ \lambda _{i}$ relative to the sum $ \lambda _{1}+\cdots +\lambda _{r}$ is used as a measure of the importance of the corresponding principal component. This is discussed in more detail in Module 6.
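The relative size of each eigenvalue can be computed directly from the singular values. This is a minimal sketch, assuming NumPy and simulated data whose columns have deliberately unequal spread; the variable names are illustrative.

```python
import numpy as np

# Simulated data (an assumption): 50 observations, 4 variables with unequal spread.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)
n = X.shape[0]

d = np.linalg.svd(X, compute_uv=False)
lam = d**2 / (n - 1)               # eigenvalues of the covariance matrix
explained = lam / lam.sum()        # share of total variance per component
cumulative = np.cumsum(explained)  # share captured by the first 1, 2, ... components
```

The sum of the eigenvalues equals the trace of the covariance matrix, that is, the total variance of the variables, so `explained` can be read as the proportion of total variance attributed to each component.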

The scores plots are traditionally made by plotting the first two scores against each other, in order to show the main features of the data, such as for example groupings or outliers. The first three scores may be plotted in a 3-d perspective plot, but if more than three scores are to be plotted, they must be plotted as spectra.

As noted earlier, in Section 3.2.3, scaling is usually not necessary for spectral data, because they all have the same unit. For other kinds of $ x$ -data, scaling may be necessary if the different $ x$ -variables have very different magnitudes, in which case $ \boldsymbol{X}$ should be autoscaled. This, in effect, corresponds to using PCA on the correlation matrix for $ \boldsymbol{X}$ instead of the covariance matrix $ \boldsymbol{X}^{\top }\boldsymbol{X}/(n-1)$ .
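The correspondence between autoscaling and the correlation matrix can be checked as follows. The sketch assumes NumPy and simulated variables of very different magnitudes; autoscaling is taken here to mean centering each column and dividing by its sample standard deviation (with divisor $ n-1$).

```python
import numpy as np

# Simulated data (an assumption): three variables on very different scales.
rng = np.random.default_rng(4)
X_raw = rng.normal(size=(30, 3)) * np.array([100.0, 1.0, 0.01])
n = X_raw.shape[0]

X_centered = X_raw - X_raw.mean(axis=0)
X_auto = X_centered / X_centered.std(axis=0, ddof=1)   # autoscaled data

# Cross-product matrix of the autoscaled data, divided by n - 1.
R_from_auto = X_auto.T @ X_auto / (n - 1)
# Correlation matrix computed directly from the raw data.
R_np = np.corrcoef(X_raw, rowvar=False)
```

The two matrices should coincide, so a PCA of the autoscaled data uses the eigenvalue decomposition of the correlation matrix.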


Last modified January 29, 2007.
©2001-2005 Master Of Applied Statistics