Module 3: Statistics and initial data processing
By Bent Jørgensen and Yuri Goegebeur







3.1 Initial data preparation


Capt. Renault: I'm shocked, shocked to find there is gambling going on here. Waiter: Your winnings, sir. Capt. Renault: Oh, thank you very much. [Casablanca, 1942]

Consider calibration data matrices $ \boldsymbol{X}$ and $ \boldsymbol{Y}$ .

  • $ Y$ -block: $ n\times m$ matrix $ \boldsymbol{Y}$ :

    \begin{displaymath}
\boldsymbol{Y}=\{y_{i\ell }\}=\left[
\begin{array}{ccc}
y_{11} & \cdots & y_{1m} \\
\vdots & \ddots & \vdots \\
y_{n1} & \cdots & y_{nm}\end{array}\right] ,
\end{displaymath}

    $ n$ calibration samples (rows) and $ m$ analytes (columns).

  • $ X$ -block: $ n\times k$ matrix $ \boldsymbol{X}$ :

    \begin{displaymath}
\boldsymbol{X}=\{x_{ij}\}=\left[
\begin{array}{ccc}
x_{11} & \cdots & x_{1k} \\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{nk}\end{array}\right] ,
\end{displaymath}

    $ n$ calibration samples (rows) and $ k$ frequencies/wavelengths (columns).

  • We assume that there are no missing data (an important assumption, although methods for coping with missing data do exist).

  • We call $ n$ the sample size, meaning the number of calibration samples.

  • The columns of $ \boldsymbol{X}$ and $ \boldsymbol{Y}$ are referred to as variables, and typical columns are denoted $ \boldsymbol{x}$ and $ \boldsymbol{y}$ , respectively. We use the generic notation $ x$ and $ y$ to denote one or more variables from $ \boldsymbol{X}$ and $ \boldsymbol{Y}$ , respectively.




3.1.1 Data checking and cleaning


  • Descriptive statistics for each variable: histogram, min, max, mean, $ s$ (standard deviation) etc.

  • Spectral plot: Make a graph where each row of $ \boldsymbol{X}$ is plotted as a function of the wavelength. The following plot shows a spectral plot for the Pollution data.
    \includegraphics[width=0.98\textwidth]{fig/m3fig2a}
  • Composition plot: Make a graph where each row of $ \boldsymbol{Y}$ is plotted as a function of the variable number $ \ell $ . The following plot shows the composition plot for the Pollution data.
    \includegraphics[width=0.98\textwidth]{fig/m3fig2b}
    Note how the plot is dominated by the third variable. The following plot showing the standardized variables gives a better impression of the relative importance of the variables.
    \includegraphics[width=0.98\textwidth]{fig/m3fig2c}
  • Based on plots and descriptive statistics, check that the variables behave as expected.

  • Check that absorbances are strictly positive and all have reasonable values.

  • Check that concentrations are non-negative.

  • If some concentrations are negative, correct the problem, and make sure you understand the scale in which the concentrations were recorded.

  • If the scales in which concentrations were recorded vary dramatically between the different analytes, consider scaling the $ y$ -variables.

  • Remove or correct outliers (incorrect data that are much too low or much too high).
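The checks listed above can be sketched in a few lines of code. The following is a minimal illustration only, assuming the two blocks are stored as lists of rows; the function names `describe` and `check_blocks` and the small data set are hypothetical and are not part of the course material.

```python
import math

def describe(column):
    """Min, max, mean and sample standard deviation s of one variable."""
    n = len(column)
    mean = sum(column) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return {"min": min(column), "max": max(column), "mean": mean, "s": s}

def check_blocks(X, Y):
    """Return a list of warnings for the X-block (absorbances) and Y-block (concentrations).

    Row and column indices in the messages are 0-based.
    """
    problems = []
    if len(X) != len(Y):
        problems.append("X and Y have different numbers of rows")
    for name, block in (("X", X), ("Y", Y)):
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                if value is None or math.isnan(value):
                    problems.append(f"{name}[{i}][{j}] is missing")
                elif name == "X" and value <= 0:
                    problems.append(f"{name}[{i}][{j}] is not strictly positive")
                elif name == "Y" and value < 0:
                    problems.append(f"{name}[{i}][{j}] is negative")
    return problems

# Hypothetical data: 3 samples, 2 wavelengths, 2 analytes.
X = [[0.12, 0.30], [0.15, 0.28], [0.11, 0.35]]
Y = [[1.0, 20.0], [0.0, 25.0], [-0.5, 22.0]]

# Descriptive statistics for each y-variable (column of Y).
for j, column in enumerate(zip(*Y)):
    print("y-variable", j, describe(column))

# Any entries that violate the checks.
print(check_blocks(X, Y))
```

Such a routine only flags suspicious entries; deciding whether a flagged value should be corrected or removed still requires knowledge of how the data were recorded.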




3.2 Statistical concepts


We now review some basic statistical concepts.




3.2.1 Theoretical quantities


  • Random variable: A variable subject to uncertainty. Let $ X$ and $ Y$ denote random variables.

  • Constant (deterministic) variable: A variable not subject to uncertainty. Often denoted by $ c$ .

  • Mean (expectation) of random variable $ X$ : long-run average value of $ X$ , denoted $ \mathrm{E}(X)$ , or $ \mu =\mu _{X}$ .

  • Variance of random variable $ X$ : long-run average of squared deviation from mean, denoted $ \mathrm{Var}(X)$ or $ \sigma _{X}^{2}$ .

  • Standard deviation: square root of the variance, i.e. the root-mean-square deviation from the mean, often denoted $ \sigma =\sigma _{X}=\sqrt{\mathrm{Var}(X)}$ .

  • Covariance between two random variables $ X$ and $ Y$ : long-run average of product of deviations from means, denoted $ \mathrm{Cov}(X,Y)$ .

  • Correlation between two random variables: their covariance divided by the product of their standard deviations, denoted by $ \mathrm{Corr}(X,Y)$ , or $ \rho =\rho _{XY}$ .

  • The correlation is always between $ -1$ and $ 1$ .

  • A correlation of exactly $ \pm 1$ (completely correlated variables) means that the two random variables are exactly linearly dependent.

  • The two variables are highly positively correlated if the correlation is near $ 1$ .

  • The two variables are highly negatively correlated if the correlation is near $ -1$ .

  • Two random variables $ X$ and $ Y$ are said to be uncorrelated if $ \mathrm{Cov}(X,Y)=0$ . If both standard deviations are positive, this is equivalent to $ \mathrm{Corr}(X,Y)=0$ .

  • Independence of two random variables $ X$ and $ Y$ means, in effect, that one cannot be predicted from the other. Independence implies zero correlation, but zero correlation does not in general imply independence.
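To see that uncorrelated variables need not be independent, consider a small illustration (not taken from the course material): let $ X$ take the values $ -1$, $ 0$ and $ 1$ with equal probability, and let $ Y=X^{2}$. Then $ Y$ is completely determined by $ X$, yet the covariance is zero. The sketch below computes the theoretical quantities exactly with rational arithmetic; the helper name `expectation` is an assumption made for this example.

```python
from fractions import Fraction

# X takes the values -1, 0, 1 with equal probability; Y = X**2 is a function of X.
values = [-1, 0, 1]
prob = Fraction(1, 3)

def expectation(g):
    """Expectation of g(X) when X is uniform on `values`."""
    return sum(prob * g(x) for x in values)

mean_x = expectation(lambda x: x)        # E(X)
mean_y = expectation(lambda x: x ** 2)   # E(Y) with Y = X**2
var_x = expectation(lambda x: (x - mean_x) ** 2)
cov_xy = expectation(lambda x: (x - mean_x) * (x ** 2 - mean_y))

print("E(X) =", mean_x, " E(Y) =", mean_y)
print("Var(X) =", var_x, " Cov(X,Y) =", cov_xy)
```

The covariance vanishes because the contributions from $ x=-1$ and $ x=1$ cancel, even though knowing $ X$ determines $ Y$ exactly.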




3.2.2 Sample (empirical) quantities


The term sample quantity here refers to the variation between the $ n$ samples for a given variable, i.e. the variation between the $ n$ values of a column.

  • Let $ \boldsymbol{x}$ ($ n\times 1$) be a given column of $ \boldsymbol{X}$ .

  • Let $ \boldsymbol{y}$ ($ n\times 1$) be a given column of $ \boldsymbol{Y}$ .

  • Let $ \boldsymbol{1}$ be an $ n\times 1$ vector of 1's.

  • The (sample) means of $ \boldsymbol{x}$ and $ \boldsymbol{y}$ are
    $\displaystyle \overline{x}$ $\displaystyle =$ $\displaystyle \frac{1}{n}\boldsymbol{1}^{\top }\boldsymbol{x}=\frac{1}{n}(x_{1}+\cdots +x_{n})$  
    $\displaystyle \overline{y}$ $\displaystyle =$ $\displaystyle \frac{1}{n}\boldsymbol{1}^{\top }\boldsymbol{y}=\frac{1}{n}(y_{1}+\cdots +y_{n}).$

  • The (sample) variances of $ \boldsymbol{x}$ and $ \boldsymbol{y}$ are
    $\displaystyle s_{x}^{2}$ $\displaystyle =$ $\displaystyle \frac{1}{n-1}\left\Vert \boldsymbol{x}-\boldsymbol{1}\overline{x}\right\Vert ^{2}$  
      $\displaystyle =$ $\displaystyle \frac{1}{n-1}\left\{ (x_{1}-\overline{x})^{2}+\cdots +(x_{n}-\overline{x})^{2}\right\}$  
    $\displaystyle s_{y}^{2}$ $\displaystyle =$ $\displaystyle \frac{1}{n-1}\left\Vert \boldsymbol{y}-\boldsymbol{1}\overline{y}\right\Vert ^{2}$  
      $\displaystyle =$ $\displaystyle \frac{1}{n-1}\left\{ (y_{1}-\overline{y})^{2}+\cdots +(y_{n}-\overline{y})^{2}\right\} .$  

  • The (sample) standard deviations of $ \boldsymbol{x}$ and $ \boldsymbol{y}$ are the square-roots of the variances,
    $\displaystyle s_{x}$ $\displaystyle =$ $\displaystyle \sqrt{\frac{1}{n-1}\left\{ (x_{1}-\overline{x})^{2}+\cdots +(x_{n}-\overline{x})^{2}\right\} }$  
    $\displaystyle s_{y}$ $\displaystyle =$ $\displaystyle \sqrt{\frac{1}{n-1}\left\{ (y_{1}-\overline{y})^{2}+\cdots +(y_{n}-\overline{y})^{2}\right\} }$  

  • The (sample) covariance between $ \boldsymbol{x}$ and $ \boldsymbol{y}$ is
    $\displaystyle v_{xy}$ $\displaystyle =$ $\displaystyle \frac{1}{n-1}(\boldsymbol{x}-\boldsymbol{1}\overline{x})^{\top }(\boldsymbol{y}-\boldsymbol{1}\overline{y})$  
      $\displaystyle =$ $\displaystyle \frac{1}{n-1}\left\{ (x_{1}-\overline{x})(y_{1}-\overline{y})+\cdots +(x_{n}-\overline{x})(y_{n}-\overline{y})\right\} .$

  • Note that $ v_{xx}=s_{x}^{2}$ .

  • The (sample) correlation between $ \boldsymbol{x}$ and $ \boldsymbol{y}$ is

    $\displaystyle r_{xy}=\frac{v_{xy}}{s_{x}s_{y}},$

    which lies between $ -1$ and $ 1$ .

  • The vector of means for $ \boldsymbol{X}$ is the $ 1\times k$ row vector

    $\displaystyle \overline{\boldsymbol{x}}=\frac{1}{n}\boldsymbol{1}^{\top }\boldsymbol{X}=\left[ \overline{x}_{1},\ldots ,\overline{x}_{k}\right] ,$

    where $ \overline{x}_{j}$ is the mean of the $ j$ 'th column of $ \boldsymbol{X}$ .

  • Similarly, the vector of means for $ \boldsymbol{Y}$ is the $ 1\times m$ row vector

    $\displaystyle \overline{\boldsymbol{y}}=\frac{1}{n}\boldsymbol{1}^{\top }\boldsymbol{Y}=\left[ \overline{y}_{1},\ldots ,\overline{y}_{m}\right] .$
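The sample quantities defined above translate directly into code. The following is a minimal sketch using plain Python lists, with the divisor $ n-1$ as in the formulas; the function names and the small data set are illustrative assumptions, not part of the course material.

```python
import math

def mean(v):
    """Sample mean: (1/n) * sum of the values."""
    return sum(v) / len(v)

def variance(v):
    """Sample variance with divisor n - 1."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def std(v):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(variance(v))

def covariance(x, y):
    """Sample covariance v_xy with divisor n - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Sample correlation r_xy = v_xy / (s_x * s_y); requires s_x, s_y > 0."""
    return covariance(x, y) / (std(x) * std(y))

def column_means(M):
    """Row vector of column means for a matrix stored as a list of rows."""
    n = len(M)
    return [sum(col) / n for col in zip(*M)]

# Hypothetical columns x and y with n = 4 samples.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]

print("means:", mean(x), mean(y))
print("variances:", variance(x), variance(y))
print("covariance:", covariance(x, y))
print("correlation:", correlation(x, y))
print("column means:", column_means([[1.0, 10.0], [3.0, 30.0]]))
```

Note that `covariance(x, x)` reproduces `variance(x)`, in agreement with $ v_{xx}=s_{x}^{2}$.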




3.2.3 Standardization


Centering

  • We may center $ \boldsymbol{x}$ and $ \boldsymbol{y}$ by subtracting their means:
    $\displaystyle \boldsymbol{\dot{x}}$ $\displaystyle =$ $\displaystyle \boldsymbol{x}-\boldsymbol{1}\overline{x}$  
    $\displaystyle \boldsymbol{\dot{y}}$ $\displaystyle =$ $\displaystyle \boldsymbol{y}-\boldsymbol{1}\overline{y}.$  

  • Note that
    $\displaystyle \boldsymbol{x}$ $\displaystyle =$ $\displaystyle \boldsymbol{\dot{x}}+\boldsymbol{1}\overline{x}$  
    $\displaystyle \boldsymbol{y}$ $\displaystyle =$ $\displaystyle \boldsymbol{\dot{y}}+\boldsymbol{1}\overline{y}$  

  • The (columnwise) centered $ \boldsymbol{X}$ and $ \boldsymbol{Y}$ matrices are
    $\displaystyle \boldsymbol{\dot{X}}$ $\displaystyle =$ $\displaystyle \boldsymbol{X}-\boldsymbol{1}\overline{\boldsymbol{x}}$  
    $\displaystyle \boldsymbol{\dot{Y}}$ $\displaystyle =$ $\displaystyle \boldsymbol{Y}-\boldsymbol{1}\overline{\boldsymbol{y}}.$  

  • For a centered vector $ \boldsymbol{\dot{x}}$ :
    $\displaystyle \overline{\boldsymbol{\dot{x}}}$ $\displaystyle =$ 0  
    $\displaystyle s_{\dot{x}}^{2}$ $\displaystyle =$ $\displaystyle s_{x}^{2}=\frac{1}{n-1}\left\Vert \boldsymbol{\dot{x}}\right\Vert ^{2}$  
      $\displaystyle =$ $\displaystyle \frac{1}{n-1}\boldsymbol{\dot{x}}^{\top }\boldsymbol{\dot{x}}$  

  • Using the centering notation, we may write the covariance between vectors $ \boldsymbol{x}$ and $ \boldsymbol{y}$ as follows:
    $\displaystyle v_{xy}$ $\displaystyle =$ $\displaystyle \frac{1}{n-1}\boldsymbol{\dot{x}}^{\top }\boldsymbol{\dot{y}}$  
      $\displaystyle =$ $\displaystyle \frac{1}{n-1}\boldsymbol{\dot{x}}^{\top }\boldsymbol{y}$ (3.1)
      $\displaystyle =$ $\displaystyle \frac{1}{n-1}\boldsymbol{x}^{\top }\boldsymbol{\dot{y}}$  

    The last two formulae follow by noting that, for example,
    $\displaystyle \boldsymbol{\dot{x}}^{\top }\boldsymbol{\dot{y}}$ $\displaystyle =$ $\displaystyle \boldsymbol{\dot{x}}^{\top }\left( \boldsymbol{y}-\boldsymbol{1}\overline{y}\right)$  
      $\displaystyle =$ $\displaystyle \boldsymbol{\dot{x}}^{\top }\boldsymbol{y}-\boldsymbol{\dot{x}}^{\top }\boldsymbol{1}\overline{y}$  
      $\displaystyle =$ $\displaystyle \boldsymbol{\dot{x}}^{\top }\boldsymbol{y}$  

    because $ \boldsymbol{\dot{x}}^{\top }\boldsymbol{1}$ is the sum of $ x_{i}-\overline{x}$ , which is zero.

  • The covariance matrix for a matrix $ \boldsymbol{X}$ is the $ k\times k$ matrix

    $\displaystyle \boldsymbol{V}_{X}=\frac{1}{n-1}\boldsymbol{\dot{X}}^{\top }\boldsymbol{\dot{X}}$

    (sometimes called the variance-covariance matrix).

  • Similarly for matrix $ \boldsymbol{Y}$ we get the $ m\times m$ covariance matrix

    $\displaystyle \boldsymbol{V}_{Y}=\frac{1}{n-1}\boldsymbol{\dot{Y}}^{\top }\boldsymbol{\dot{Y}}.$

  • Interpretation: if $ \boldsymbol{x}_{1},\ldots ,\boldsymbol{x}_{k}$ are the columns of $ \boldsymbol{X}$ , then $ \boldsymbol{V}_{X}$ contains the variances and covariances of the $ k$ columns:

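    \begin{displaymath}
\boldsymbol{V}_{X}=\left[
\begin{array}{cccc}
s_{x_{1}}^{2} & v_{x_{1}x_{2}} & \ldots & v_{x_{1}x_{k}} \\
v_{x_{2}x_{1}} & s_{x_{2}}^{2} & \ldots & v_{x_{2}x_{k}} \\
\vdots & \vdots & \ddots & \vdots \\
v_{x_{k}x_{1}} & v_{x_{k}x_{2}} & \ldots & s_{x_{k}}^{2}\end{array}\right] .
\end{displaymath}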

    It is a symmetric matrix: $ \boldsymbol{V}_{X}=\boldsymbol{V}_{X}^{\top }$ .

  • Similarly we define the covariance between $ \boldsymbol{X}$ and $ \boldsymbol{Y}$ as the $ k\times m$ matrix

    $\displaystyle \boldsymbol{V}_{XY}=\frac{1}{n-1}\boldsymbol{\dot{X}}^{\top }\boldsymbol{\dot{Y}}$

    containing all possible covariances between a column of $ \boldsymbol{X}$ and a column of $ \boldsymbol{Y}.$

  • Note that $ \boldsymbol{V}_{XY}=\boldsymbol{V}_{YX}^{\top }.$

  • If we collect $ \boldsymbol{X}$ and $ \boldsymbol{Y}$ into a single $ n\times (m+k)$ matrix $ \left[ \boldsymbol{X},\boldsymbol{Y}\right] $ we may write

    \begin{displaymath}
\boldsymbol{V}_{\left[ X,Y\right] }=\left[
\begin{array}{cc}
\boldsymbol{V}_{X} & \boldsymbol{V}_{XY} \\
\boldsymbol{V}_{YX} & \boldsymbol{V}_{Y}\end{array}\right]
\end{displaymath}

    for the full covariance matrix for $ \left[ \boldsymbol{X},\boldsymbol{Y}\right] $ .

  • Special case: collecting the variances and covariances of vectors $ \boldsymbol{x}$ and $ \boldsymbol{y}$ into a $ 2\times 2$ matrix gives

    \begin{displaymath}
\boldsymbol{V}_{\left[ x,y\right] }=\left[
\begin{array}{cc}
s_{x}^{2} & v_{xy} \\
v_{yx} & s_{y}^{2}\end{array}\right]
\end{displaymath}

  • The correlation matrix for $ \boldsymbol{X}$ is the $ k\times k$ matrix

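    \begin{displaymath}
\boldsymbol{R}_{X}=\left[
\begin{array}{cccc}
1 & r_{x_{1}x_{2}} & \ldots & r_{x_{1}x_{k}} \\
r_{x_{2}x_{1}} & 1 & \ldots & r_{x_{2}x_{k}} \\
\vdots & \vdots & \ddots & \vdots \\
r_{x_{k}x_{1}} & r_{x_{k}x_{2}} & \ldots & 1\end{array}\right] .
\end{displaymath}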
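The centering formulas and the covariance and correlation matrices can be checked numerically. The sketch below assumes matrices stored as lists of rows; the helper names `center`, `transpose`, `cross_cov` and `corr_matrix` and the small data set are hypothetical choices made for this illustration.

```python
import math

def center(M):
    """Subtract each column mean from that column (columnwise centering)."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return [[v - m for v, m in zip(row, means)] for row in M]

def transpose(M):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def cross_cov(A, B):
    """(1/(n-1)) * A_dot^T B_dot for matrices A and B with the same n rows."""
    n = len(A)
    Ad, Bd = center(A), center(B)
    return [[sum(a * b for a, b in zip(col_a, col_b)) / (n - 1)
             for col_b in zip(*Bd)]
            for col_a in zip(*Ad)]

def corr_matrix(V):
    """Correlation matrix obtained from a covariance matrix with positive diagonal."""
    s = [math.sqrt(V[j][j]) for j in range(len(V))]
    return [[V[j][l] / (s[j] * s[l]) for l in range(len(V))]
            for j in range(len(V))]

# Hypothetical data: n = 3 samples, k = 2 columns in X, m = 1 column in Y.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 6.0]]
Y = [[1.0], [3.0], [2.0]]

V_X = cross_cov(X, X)    # k x k covariance matrix of X
V_XY = cross_cov(X, Y)   # k x m covariance between X and Y
V_YX = cross_cov(Y, X)   # m x k, the transpose of V_XY
R_X = corr_matrix(V_X)   # k x k correlation matrix of X

# Identity (3.1): x_dot^T y_dot equals x_dot^T y, so centering one vector suffices.
x_dot = [row[0] for row in center(X)]
y_dot = [row[0] for row in center(Y)]
y = [row[0] for row in Y]
lhs = sum(a * b for a, b in zip(x_dot, y_dot))
rhs = sum(a * b for a, b in zip(x_dot, y))

print("V_X =", V_X)
print("V_XY =", V_XY, " V_YX =", V_YX)
print("R_X =", R_X)
print("x_dot^T y_dot =", lhs, " x_dot^T y =", rhs)
```

Centering both matrices inside `cross_cov` keeps the function usable on raw data; by (3.1) centering only one of the two factors would give the same result.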

Scaling

  • Columns of $ \boldsymbol{Y}$ may need to be scaled, to make $ s_{y}$ about the same for each column.

  • Using the same measurement unit for all columns is often enough to achieve suitable scaling.

  • If the columns still differ widely in scale, we may correct this by replacing each column $ \boldsymbol{y}$ by

    $\displaystyle \boldsymbol{y}_{\mathrm{scaled}}=\frac{1}{s_{y}}\boldsymbol{y}$

    giving $ s_{y_{\mathrm{scaled}}}=1$ .

  • Autoscaling: when the column is both centered and scaled:

    $\displaystyle \boldsymbol{y}_{\mathrm{autoscaled}}=\frac{1}{s_{y}}\boldsymbol{\dot{y}}$

  • An (NIR) spectral block $ \boldsymbol{X}$ normally does not need to be scaled, because all columns of $ \boldsymbol{X}$ have the same unit (absorbance).

  • If the columns of $ \boldsymbol{X}$ represent unrelated variables with different units (e.g. ppm, %, km, kg etc.), then scaling is recommended, so that no variable has an undue influence on the calibration results merely because of its unit.
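Scaling and autoscaling can be sketched as follows, assuming a matrix stored as a list of rows. The function name `autoscale`, its `centering` flag and the example data are hypothetical; a column with $ s_{y}=0$ cannot be scaled, so the sketch raises an error in that case.

```python
import math

def autoscale(M, centering=True):
    """Divide each column by its sample standard deviation.

    With centering=True the column mean is subtracted first (autoscaling);
    with centering=False the column is only scaled.
    """
    n = len(M)
    columns = list(zip(*M))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
            for col, m in zip(columns, means)]
    if any(s == 0 for s in stds):
        raise ValueError("a constant column cannot be scaled")
    shifts = means if centering else [0.0] * len(means)
    return [[(v - c) / s for v, c, s in zip(row, shifts, stds)] for row in M]

# Hypothetical Y-block: the second analyte is recorded on a much larger scale.
Y = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]

print(autoscale(Y))                   # centered and scaled columns
print(autoscale(Y, centering=False))  # scaled only
```

After autoscaling, every column has mean 0 and standard deviation 1, so the two analytes contribute on an equal footing.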




3.3 Simple linear regression


  • The simple linear regression model is

    $\displaystyle y_{i}=c+x_{i}b+f_{i}$ (3.2)

    for $ i=1,\ldots ,n$ , where

    1. $ y_{i}$ is the $ i$ 'th value of the response (dependent) variable, normally the concentration in the $ i$ 'th sample;

    2. $ x_{i}$ is the $ i$ 'th value of the explanatory (independent) variable, normally the absorbance for the $ i$ 'th sample;

    3. $ n$ is the sample size, which must be at least 2;

    4. $ c$ and $ b$ are the intercept and slope of the regression line, respectively;

    5. $ f_{i}$ is the $ i$ 'th random noise term, assumed independent, with zero mean and common variance $ \sigma ^{2}$ .

    6. $ c$ , $ b$ and $ \sigma ^{2}$ are unknown parameters, to be estimated from the data.

  • We may stack the equations (3.2) and write them in vector form:
    $\displaystyle \boldsymbol{y}$ $\displaystyle =$ $\displaystyle \boldsymbol{1}c+\boldsymbol{x}b+\boldsymbol{f}$ (3.3)
      $\displaystyle =$ $\displaystyle \boldsymbol{1}c+\left( \boldsymbol{\dot{x}}+\boldsymbol{1}\overline{x}\right) b+\boldsymbol{f}$  
      $\displaystyle =$ $\displaystyle \boldsymbol{1}b_{0}+\boldsymbol{\dot{x}}b+\boldsymbol{f},$ (3.4)

    where

    $\displaystyle b_{0}=c+\overline{x}b$

    is the constant term of the regression after centering $ \boldsymbol{x}$ . The following three graphs illustrate the effect of centering, first without centering:
    \includegraphics[width=0.98\textwidth]{fig/m3fig1a}
    after centering $ x$ :
    \includegraphics[width=0.98\textwidth]{fig/m3fig1b}
    after centering both $ x$ and $ y$ :
    \includegraphics[width=0.98\textwidth]{fig/m3fig1c}
  • The least squares estimators of the unknown parameters $ b_{0}$ and $ b$ are

    $\displaystyle \widehat{b}_{0}=\overline{y}$

    and
    $\displaystyle \widehat{b}$ $\displaystyle =$ $\displaystyle \frac{\boldsymbol{\dot{x}}^{\top }\boldsymbol{\dot{y}}}{\boldsymbol{\dot{x}}^{\top }\boldsymbol{\dot{x}}}$  
      $\displaystyle =$ $\displaystyle \frac{v_{xy}}{s_{x}^{2}},$  

    provided $ s_{x}>0$ (note that the two $ n-1$ factors cancel out). This result is shown under Examples.

  • The case $ s_{x}=0$ is uninteresting, because then all $ x$ -values are the same, and such a variable is useless for prediction.

  • Another way to write $ \widehat{b}$ is
    $\displaystyle \widehat{b}$ $\displaystyle =$ $\displaystyle \frac{s_{y}}{s_{x}}\frac{v_{xy}}{s_{x}s_{y}}$  
      $\displaystyle =$ $\displaystyle \frac{s_{y}}{s_{x}}r_{xy}.$  

  • If $ s_{x}=s_{y}$ , in particular if $ \boldsymbol{x}$ and $ \boldsymbol{y}$ are both autoscaled, then $ \widehat{b}$ is the correlation between $ \boldsymbol{x}$ and $ \boldsymbol{y}$ .

  • Let us insert the value $ \widehat{b}_{0}=\overline{y}$ in (3.4), giving

    $\displaystyle \boldsymbol{y}=\boldsymbol{1}\overline{y}+\boldsymbol{\dot{x}}b+\boldsymbol{f}$

    or, equivalently,

    $\displaystyle \boldsymbol{\dot{y}}=\boldsymbol{\dot{x}}b+\boldsymbol{f}$ (3.5)

    where we have used the centering notation $ \boldsymbol{\dot{y}}=\boldsymbol{y}-\boldsymbol{1}\overline{y}$ .

  • Based on this equation, we often work with centered $ \boldsymbol{x}$ and $ \boldsymbol{y}$ , and leave out the constant term in the regression, but it is important to keep in mind that the statistical model we are using continues to be (3.2) or (3.4).

  • From (3.5) we see that the fitted regression line takes the form

    $\displaystyle y-\overline{y}=(x-\overline{x})\widehat{b}.$ (3.6)

    This is the line through $ (\overline{x},\overline{y})$ with slope $ \widehat{b}$ . The intercept is $ \widehat{c}=\overline{y}-\overline{x}\widehat{b}$ .

  • Variance estimate:
    $\displaystyle \widehat{\sigma }^{2}$ $\displaystyle =$ $\displaystyle \frac{1}{n-2}\left( \boldsymbol{\dot{y}}^{\top }\boldsymbol{\dot{y}}-\widehat{b}\boldsymbol{\dot{x}}^{\top }\boldsymbol{\dot{y}}\right)$  
      $\displaystyle =$ $\displaystyle \frac{n-1}{n-2}\left( s_{y}^{2}-\widehat{b}v_{xy}\right) .$  

    The proof of this result is given under Examples.

  • Prediction: Suppose we are observing the same system that gave rise to the calibration data and thus continues to follow the model (3.2). We are given test data, in the form of a value of the independent variable denoted $ z$ , but instead of observing the corresponding value $ y$ , we want to predict the value of $ y$ that we would have obtained had $ y$ also been observed. Note that we use the notation $ z$ instead of $ x$ , in order to distinguish the calibration data $ (x,y)$ from the test data $ z$ .

  • The predicted value, denoted $ \widehat{y}$ , is obtained by simply inserting $ z$ for $ x$ in the fitted regression line (3.6), giving

    $\displaystyle \widehat{y}=\overline{y}+(z-\overline{x})\widehat{b}.$ (3.7)

    Note that $ \widehat{y}$ depends on the calibration data via $ \overline{x}$ , $ \overline{y}$ and $ \widehat{b}$ , but does not involve the unknown parameters $ c$ , $ b$ and $ \sigma ^{2}$ as such.

  • We have thus obtained a practical prediction method, which will serve as the prototype for the more complicated methods of chemometrics that are the main topic of the course.
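The estimators and the prediction formula (3.7) can be sketched in code as follows. This is a minimal illustration under the model (3.2), assuming the data are given as plain lists; the function names `fit_line` and `predict`, the dictionary keys and the example data are hypothetical.

```python
def fit_line(x, y):
    """Least squares fit of y = c + x*b using centered sums, as in (3.4)-(3.6)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)                       # x_dot^T x_dot
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))   # x_dot^T y_dot
    syy = sum((b - y_bar) ** 2 for b in y)                       # y_dot^T y_dot
    if sxx == 0:
        raise ValueError("all x-values are equal, so the slope is undefined")
    b_hat = sxy / sxx
    c_hat = y_bar - x_bar * b_hat
    # The variance estimate divides by n - 2, so it needs at least 3 samples.
    sigma2_hat = (syy - b_hat * sxy) / (n - 2) if n > 2 else None
    return {"x_bar": x_bar, "y_bar": y_bar, "b_hat": b_hat,
            "c_hat": c_hat, "sigma2_hat": sigma2_hat}

def predict(fit, z):
    """Predicted y for a new x-value z, following (3.7)."""
    return fit["y_bar"] + (z - fit["x_bar"]) * fit["b_hat"]

# Hypothetical calibration data (x: absorbance, y: concentration).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]

fit = fit_line(x, y)
print(fit)
print("prediction at z = 5:", predict(fit, 5.0))
```

Working with the centered sums mirrors the derivation above: the slope is computed first, and the intercept is recovered afterwards from the two means.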



Last modified January 29, 2007.
©2001-2005 Master Of Applied Statistics