View/Print PDFPS
Module 7: Partial least squares regression I
By Bent Jørgensen and Yuri Goegebeur


Table of Contents





7.1 The PLS1 algorithm

Previous Section
Next Section

Rick: I congratulate you. Victor Laszlo: What for? Rick: Your work. Victor: I try. Rick: We all try. You succeed. [Casablanca, 1942]

The PCR method from the previous module represents a considerable improvement over MLR and CLS. By using latent variables (scores), it is possible to use a large number of variables (frequencies), just as in CLS, but without having to know about all interferences.

Problems may arise, however, if there is a lot of variation in $ \boldsymbol{X}$ that is not due to the analyte as such. PCR finds, somewhat uncritically, those latent variables that describe as much as possible of the variation in $ \boldsymbol{X}$ . But sometimes the analyte itself gives rise to only small variations in $ \boldsymbol{X}$ , and if the interferences vary a lot, then the latent variables found by PCR may not be particularly good at describing $ \boldsymbol{Y}$ . In the worst case important information may be hidden in directions in the $ \boldsymbol{X}$ -space that PCR interprets as noise, and therefore leaves out.

Partial Least Squares Regression (PLS) is able to cope better with this problem, by forming variables that are relevant for describing $ \boldsymbol{Y}$ . See Examples for a motivating example.




7.1.1 The steps of the PLS1 algorithm

Previous Section
Next Section

We now consider the general form of the PLS1 algorithm. We assume that $ \boldsymbol{X}$ is an $ n\times k$ centered data matrix and $ \boldsymbol{y}$ an $ n\times 1$ centered data vector. The so-called PLS2 algorithm considered in Module 8 may be used for the case of more than one column in $ \boldsymbol{Y}$ . The PLS2 algorithm, however, is more complicated than PLS1 and even when several columns are available in $ \boldsymbol{Y}$ , it may be preferable to apply PLS1 separately to each column of $ \boldsymbol{Y}$ . On the other hand, PLS2 may be better for initial, more exploratory investigations, or in cases where the different analytes show covariation.

The PLS1 algorithm starts with the initialization $ j=1$ , $ \boldsymbol{X}_{1}\boldsymbol{=X}$ and $ \boldsymbol{y}_{1}\boldsymbol{=y}$ . The algorithm then proceeds through the following steps to find the first $ g$ latent variables:

  1. Let $ \boldsymbol{w}_{j}=\boldsymbol{X}_{j}^{\top }\boldsymbol{y}_{j}/\left\Vert \boldsymbol{X}_{j}^{\top }\boldsymbol{y}_{j}\right\Vert $ .

  2. Let $ \boldsymbol{t}_{j}=\boldsymbol{X}_{j}\boldsymbol{w}_{j}$ .

  3. Let $ \widehat{c}_{j}=\boldsymbol{t}_{j}^{\top }\boldsymbol{y}_{j}\boldsymbol{/t}_{j}^{\top }\boldsymbol{t}_{j}$ .

  4. Let $ \boldsymbol{p}_{j}=\boldsymbol{X}_{j}^{\top }\boldsymbol{t}_{j}/\boldsymbol{t}_{j}^{\top }\boldsymbol{t}_{j}$ .

  5. Let $ \boldsymbol{X}_{j+1}\boldsymbol{=X}_{j}-\boldsymbol{t}_{j}\boldsymbol{p}_{j}^{\top }$ and $ \boldsymbol{y}_{j+1}\boldsymbol{=y}_{j}-\boldsymbol{t}_{j}\widehat{c}_{j}$ .

  6. Stop if $ j=g$ ; otherwise let $ j=j+1$ and return to Step 1.

Now form the two $ k\times g$ matrices $ \boldsymbol{W}$ and $ \boldsymbol{P}$ and $ n\times g$ matrix $ \boldsymbol{T}$ with columns $ \boldsymbol{w}_{j}$ , $ \boldsymbol{p}_{j}$ and $ \boldsymbol{t}_{j}$ , respectively, and form a column vector $ \widehat{\boldsymbol{c}}$ ($ g\times 1$ ) with elements $ \widehat{c}_{j}$ . Let

$\displaystyle \widehat{\boldsymbol{X}}$ $\displaystyle =$ $\displaystyle \boldsymbol{TP}^{\top }$  
  $\displaystyle =$ $\displaystyle \sum_{j=1}^{g}\boldsymbol{t}_{j}\boldsymbol{p}_{j}^{\top }$  

and
$\displaystyle \widehat{\boldsymbol{y}}$ $\displaystyle =$ $\displaystyle \boldsymbol{T}\widehat{\boldsymbol{c}}$  
  $\displaystyle =$ $\displaystyle \boldsymbol{XW}(\boldsymbol{P}^{\top }\boldsymbol{W})^{-1}\widehat{\boldsymbol{c}}$,  

which are the predicted values of $ \boldsymbol{X}$ and $ \boldsymbol{y}$ , respectively. The matrix $ \boldsymbol{W}$ is orthogonal, and $ \boldsymbol{T}$ has orthogonal columns.

The PLS1 algorithm is used here in order to define the method, although there are alternative ways of organizing the computations; see Bro (1996, p. 57). Note that, in spite of the similarities with the NIPALS algorithm, the PLS1 algorithm is recursive and requires exactly $ g$ steps, whereas the NIPALS algorithm is iterative, the number of iterations cannot be determined in advance, and is dependent on the choice of a stopping criterion. In this sense, the PLS1 algorithm is simpler than the NIPALS algorithm.




7.1.2 Comments on the PLS1 algorithm

Previous Section
Next Section

We now comment on each step of the PLS1 algorithm in turn. For simplicity, we explain the first run of the algorithm ($ j=1$ ), and then go on to explain the general case.

Step 1. In PLS, we seek the direction in the space of $ \boldsymbol{X}$ , which yields the biggest covariance between $ \boldsymbol{X}$ and $ \boldsymbol{y}$ . This direction is given by a unit vector $ \boldsymbol{w}$ , and is such that large variations in $ x$ -values are accompanied by large variations in the corresponding $ y$ -values. The unit vector $ \boldsymbol{w}_{1}$ $ (k\times 1)$ is thus formed by standardizing the covariance matrix for $ \boldsymbol{X}$ and $ \boldsymbol{y}$ . A further interpretation of $ \boldsymbol{w}_{1}$ is that its transpose $ \boldsymbol{w}_{1}^{\top }$ is proportional to the CLS regression coefficient

$\displaystyle \widehat{\boldsymbol{a}}=\frac{\boldsymbol{y}^{\boldsymbol{\top }}\boldsymbol{X}}{\boldsymbol{y}^{\top }\boldsymbol{y}}.
$

It may hence be useful, for diagnostic purposes, to compare $ \boldsymbol{w}_{1}$ with any prior knowledge about the spectrum of the pure analyte, although possible interferences may obscure this picture.

Step 2. The $ n\times 1$ score vector $ \boldsymbol{t}_{1}$ is formed as a linear combination of the columns of $ \boldsymbol{X}$ with weights $ \boldsymbol{w}_{1}$ . As explained above, the relative weights are given by the covariances between $ \boldsymbol{y}$ and each of the columns of $ \boldsymbol{X}$ , and $ \boldsymbol{t}_{1}$ may hence be understood as the best linear combination of the columns of $ \boldsymbol{X}$ for the purpose of predicting $ \boldsymbol{y}$ . The latent vectors $ \boldsymbol{t}_{j}$ are also called scores, similar to the terminology for PCA.

Step 3. The regression coefficient $ \widehat{c}_{1}$ is calculated by ordinary linear regression of $ \boldsymbol{y}$ on $ \boldsymbol{t}_{1}$ .

Step 4. The $ k\times 1$ vector $ \boldsymbol{p}_{1}$ is the transpose of the vector of regression coefficients obtained from simple linear regressions of the columns of $ \boldsymbol{X}$ on $ \boldsymbol{t}_{1}$ .

Step 5. The $ n\times k$ vector $ \boldsymbol{X}_{2}\boldsymbol{=X}-\boldsymbol{t}_{1}\boldsymbol{p}_{1}^{\top }$ represents the residuals after regressing $ \boldsymbol{X}$ on $ \boldsymbol{t}_{1}$ , and correspondingly, $ \boldsymbol{y}_{2}\boldsymbol{=y}-\boldsymbol{t}_{1}\widehat{c}_{1}$ are the residuals after regressing $ \boldsymbol{y}$ on $ \boldsymbol{t}_{1}$ . This step ensures that the $ \boldsymbol{t}_{j}$ -vectors become orthogonal (just as the corresponding $ \boldsymbol{t}_{j}$ are in PCR), and thus ensures that the multiple regression of $ \boldsymbol{y}$ on $ \boldsymbol{T}$ can be calculated one column at a time, as done in Step 3.

After the first run through Steps 1-5, the procedure is repeated using the residuals $ \boldsymbol{X}_{2}$ and $ \boldsymbol{y}_{2}$ . The algorithm then finds the best linear combination of the columns of $ \boldsymbol{X}_{2}$ for the purpose of predicting $ \boldsymbol{y}_{2}$ , thus picking up any further structure in the connection between $ \boldsymbol{X}$ and $ \boldsymbol{y}$ not accounted for by $ \boldsymbol{t}_{1}$ . This is repeated on and on, such that each run of the algorithm in principle reveals more and more information about the connection between $ \boldsymbol{X}$ and $ \boldsymbol{y}
$ . Just as for PCR the information accounted for by each step usually becomes less and less for each step taken.

After the $ g$ runs have been completed, the following relations hold:

$\displaystyle \boldsymbol{X}$ $\displaystyle =$ $\displaystyle \boldsymbol{TP}^{\top }+\boldsymbol{X}_{g+1}$ (7.1)
$\displaystyle \boldsymbol{y}$ $\displaystyle =$ $\displaystyle \boldsymbol{T}\widehat{\boldsymbol{c}}+\boldsymbol{y}_{g+1}.$  

The number of scores $ g$ should hence, in principle, be chosen such that $ \boldsymbol{X}_{g+1}$ contains no further information about $ \boldsymbol{y}_{g+1}$ , or in other words, such that $ \boldsymbol{X}_{g+1}$ and $ \boldsymbol{y}_{g+1}$ are approximately uncorrelated with each other. In the extreme case where $ \boldsymbol{X}_{j}^{\top }\boldsymbol{y}_{j}$ becomes zero, the algorithm is stopped prematurely. In summary, further scores should be extracted only as long as each new variable contributes significantly to the description of $ \boldsymbol{y}$ . Criteria for deciding when this is the case will be discussed later, in Module 13.




7.2 Prediction for PLS1

Previous Section
Next Section

Prediction for the PLS1 method is slightly more complicated, than for PCR, in spite of the algorithm being simpler. Consider a new prediction sample $ \boldsymbol{z}$ ($ 1\times k$ vector) and predicted value $ \overrightarrow{\boldsymbol{y}}$ (both uncentered). Note the new notation for the predicted value. Let $ \overline{\boldsymbol{x}}$ ($ k\times 1$ ) and $ \overline{y}$ be the calibration sample averages. The prediction is performed by essentially retracing the steps of the algorithm, letting the row vector $ \boldsymbol{z}-\overline{\boldsymbol{x}}$ follow the same steps as a row of the $ \boldsymbol{X}$ matrix.

Let $ \boldsymbol{W}$ , $ \boldsymbol{T}$ , $ \boldsymbol{P}$ and $ \widehat{\boldsymbol{c}}$ be the matrices and vector formed after applying the PLS1 algorithm to the calibration data. Initialize by taking $ j=1$ and $ \boldsymbol{x}_{j}=\boldsymbol{z}-\overline{\boldsymbol{x}}$ . Then proceed through the following steps:

  1. Let $ t_{j}=\boldsymbol{x}_{j}\boldsymbol{w}_{j}.$

  2. Let $ \boldsymbol{x}_{j+1}\boldsymbol{=x}_{j}-t_{j}\boldsymbol{p}_{j}^{\top }$ .

  3. Let $ j=j+1$ , and repeat Steps 1 to 3 until $ j=g$ .

Now form the row vector $ \widehat{\boldsymbol{t}}=(t_{1},\ldots ,t_{g})$ , and complete the prediction as follows:

$\displaystyle \overrightarrow{\boldsymbol{y}}=\overline{y}+\widehat{\boldsymbol{t}}\widehat{\boldsymbol{c}}$.

It is possible, though, to summarize the prediction in a matrix formula (Bro, 1996, pp. 62-63), as follows:

$\displaystyle \overrightarrow{\boldsymbol{y}}=\overline{y}+\left( \boldsymbol{z}-\overline{\boldsymbol{x}}\right) ^{\top }\widehat{\boldsymbol{b}},
$

where $ \widehat{\boldsymbol{b}}$ , the so-called regression vector, is

$\displaystyle \widehat{\boldsymbol{b}}=\boldsymbol{W}\left( \boldsymbol{P}^{\top }\boldsymbol{W}\right) ^{-1}\widehat{\boldsymbol{c}}.
$

There is, however, a slight disadvantage to this method, because useful information contained in the individual latent variables $ t_{1},\ldots ,t_{g}
$ is not available here. Like the regression matrix in the previous methods, the regression vector $ \widehat{\boldsymbol{b}}$ contains useful information about which areas (frequencies) contribute to the prediction.

HOME | Back

Last modified January 29, 2007. Webmaster
©2001-2005 Master Of Applied Statistics