Capt. Renault: I'm shocked, shocked to find there is gambling
going on here. Waiter: Your winnings, sir. Capt. Renault: Oh,
thank you very much. [Casablanca, 1942]
Consider calibration data matrices $X$ and $Y$.
- Descriptive statistics for each variable: histogram, min, max, mean, $s$ (standard deviation), etc.
- Spectral plot: Make a graph where each row of $X$ is plotted as a function of the wavelength. The following plot shows a spectral plot for the Pollution data.
- Composition plot: Make a graph where each row of $Y$ is plotted as a function of the variable number $j$. The following plot shows the composition plot for the Pollution data.
Note how the plot is dominated by the third variable. The following plot
showing the standardized variables gives a better impression of the relative
importance of the variables.
- Based on plots and descriptive statistics, check that the variables
behave as expected.
- Check that absorbances are strictly positive and all have reasonable
values.
- Check that concentrations are non-negative.
- If some concentrations are negative, correct the problem, and make
sure you understand the scale in which the concentrations were recorded.
- If the scales in which the concentrations were recorded vary dramatically between the different analytes, consider scaling the $Y$-variables.
- Remove or correct outliers (incorrect data that are much too low or
much too high).
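The checks in this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `describe` and `check_calibration_data` and the toy numbers are hypothetical, and the data matrices are taken to be plain lists of rows (one row per sample).

```python
from statistics import mean, stdev


def describe(column):
    """Return min, max, mean and sample standard deviation of one variable."""
    return {"min": min(column), "max": max(column),
            "mean": mean(column), "sd": stdev(column)}


def check_calibration_data(X, Y):
    """Return a list of warnings for absorbances X and concentrations Y."""
    problems = []
    # Absorbances are expected to be strictly positive.
    if any(a <= 0 for row in X for a in row):
        problems.append("non-positive absorbance in X")
    # Concentrations are expected to be non-negative.
    if any(c < 0 for row in Y for c in row):
        problems.append("negative concentration in Y")
    return problems


# Hypothetical toy data: 3 samples, 2 wavelengths, 2 analytes.
X = [[0.12, 0.30], [0.15, 0.28], [0.11, 0.35]]
Y = [[1.0, 20.0], [2.0, -5.0], [3.0, 25.0]]

problems = check_calibration_data(X, Y)
first_analyte = describe([row[0] for row in Y])
```

Here the negative entry in the second column of `Y` is flagged, which is the cue to go back and check the scale in which that concentration was recorded.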
We now review some basic statistical concepts.
- Random variable: A variable subject to uncertainty. Let $x$ and $y$ denote random variables.
- Constant (deterministic) variable: A variable not subject to uncertainty. Often denoted by $c$.
- Mean (expectation) of random variable $x$: long-run average value of $x$, denoted $E(x)$, or $\mu$.
- Variance of random variable $x$: long-run average of squared deviation from mean, denoted $\mathrm{Var}(x)$ or $\sigma^2$.
- Standard deviation: square root of variance (RMSE), often denoted $\sigma$.
- Covariance between two random variables $x$ and $y$: long-run average of product of deviations from means, denoted $\mathrm{Cov}(x, y)$.
- Correlation between two random variables: Their covariance scaled by the standard deviations, denoted by $\mathrm{Corr}(x, y)$, or $\rho$.
- The correlation is always between $-1$ and $1$.
- A correlation of exactly $\pm 1$ (completely correlated variables) means that the two random variables are exactly linearly dependent.
- The two variables are highly positively correlated if the correlation is near $1$.
- The two variables are highly negatively correlated if the correlation is near $-1$.
- Two random variables $x$ and $y$ are said to be uncorrelated if $\mathrm{Cov}(x, y) = 0$. If both standard deviations are positive, this is equivalent to $\mathrm{Corr}(x, y) = 0$.
- Independence of two random variables $x$ and $y$ means, in effect, that one cannot be predicted from the other. Independence implies zero correlation; the converse does not hold in general.
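A small numerical sketch may help fix these ideas. It uses sample analogues of the long-run quantities (with divisor $n-1$, which is an assumption of this example), and the helper names `cov` and `corr` are hypothetical. The first pair of variables is exactly linearly dependent; the second pair is uncorrelated even though one variable is completely determined by the other.

```python
from math import sqrt


def cov(x, y):
    """Sample covariance: average product of deviations from the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def corr(x, y):
    """Sample correlation: covariance scaled by the two standard deviations."""
    return cov(x, y) / sqrt(cov(x, x) * cov(y, y))


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_linear = [3.0 - 2.0 * v for v in x]  # exact linear dependence, negative slope
y_square = [v * v for v in x]          # a function of x, yet uncorrelated with x

r_linear = corr(x, y_linear)
r_square = corr(x, y_square)
```

The second case shows why zero correlation is weaker than independence: correlation only measures linear association.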
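The term sample quantity here refers to the variation between the $n$ samples for a given variable, i.e. the variation between the $n$ values of a column.
- We may center $x$ and $y$ by subtracting their means:
$$\tilde{x} = x - \bar{x}\,\mathbf{1}, \qquad \tilde{y} = y - \bar{y}\,\mathbf{1},$$
where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, and $\mathbf{1}$ is the $n$-vector of ones.
- Note that $\mathbf{1}^\top\tilde{x} = \sum_{i=1}^{n}\tilde{x}_i = 0$, and similarly for $\tilde{y}$.
- The (columnwise) centered $X$ and $Y$ matrices are $\tilde{X}$ and $\tilde{Y}$, obtained by centering each column in this way.
- For a centered vector $\tilde{x}$, the sample variance is
$$s_x^2 = \frac{1}{n-1}\,\tilde{x}^\top\tilde{x} = \frac{1}{n-1}\sum_{i=1}^{n}\tilde{x}_i^2 .$$
- Using the centering notation, we may write the covariance between vectors $x$ and $y$ as follows:
$$s_{xy} = \frac{1}{n-1}\,\tilde{x}^\top\tilde{y} = \frac{1}{n-1}\,\tilde{x}^\top y = \frac{1}{n-1}\,x^\top\tilde{y} .$$
The last two formulae follow by noting that, for example, $\tilde{x}^\top\tilde{y} = \tilde{x}^\top y - \bar{y}\,\tilde{x}^\top\mathbf{1} = \tilde{x}^\top y$, because $\tilde{x}^\top\mathbf{1}$ is the sum of the $\tilde{x}_i$, which is zero.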
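The centering identities can be checked numerically. The following is a minimal sketch with hypothetical helper names (`center`, `dot`) and made-up data; it assumes the divisor $n-1$ for the sample covariance and verifies that centering only one of the two vectors gives the same covariance as centering both.

```python
def center(v):
    """Subtract the mean from every element of v."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 7.0, 8.0]
n = len(x)
xc, yc = center(x), center(y)

s_xy_both = dot(xc, yc) / (n - 1)   # both vectors centered
s_xy_left = dot(xc, y) / (n - 1)    # only x centered
s_xy_right = dot(x, yc) / (n - 1)   # only y centered
```

All three values agree because the centered vector sums to zero, so the mean of the uncentered vector contributes nothing to the inner product.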
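- The covariance matrix for a matrix $X$ with $n$ rows and $p$ columns is the $p \times p$ matrix
$$S_{XX} = \frac{1}{n-1}\,\tilde{X}^\top\tilde{X}$$
(sometimes called the variance-covariance matrix), where $\tilde{X}$ denotes the column-centered $X$.
- Similarly for a matrix $Y$ with $q$ columns we get the $q \times q$ covariance matrix
$$S_{YY} = \frac{1}{n-1}\,\tilde{Y}^\top\tilde{Y} .$$
- Interpretation: if $x_1, \dots, x_p$ are the columns of $X$, then $S_{XX}$ contains the variances and covariances of the $p$ columns:
$$S_{XX} = \begin{bmatrix} s_{x_1}^2 & s_{x_1 x_2} & \cdots & s_{x_1 x_p} \\ s_{x_2 x_1} & s_{x_2}^2 & \cdots & s_{x_2 x_p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{x_p x_1} & s_{x_p x_2} & \cdots & s_{x_p}^2 \end{bmatrix} .$$
It is a symmetric matrix: $S_{XX}^\top = S_{XX}$.
- Similarly we define the covariance between $X$ and $Y$ as the $p \times q$ matrix
$$S_{XY} = \frac{1}{n-1}\,\tilde{X}^\top\tilde{Y}$$
containing all possible covariances between a column of $X$ and a column of $Y$.
- Note that $S_{YX} = S_{XY}^\top$.
- If we collect $X$ and $Y$ into a single $n \times (p+q)$ matrix $Z = [X \;\; Y]$ we may write
$$S_{ZZ} = \begin{bmatrix} S_{XX} & S_{XY} \\ S_{YX} & S_{YY} \end{bmatrix}$$
for the full covariance matrix for $Z$.
- Special case: collecting the variances and covariances of vectors $x$ and $y$ into a $2 \times 2$ matrix gives
$$\begin{bmatrix} s_x^2 & s_{xy} \\ s_{xy} & s_y^2 \end{bmatrix} .$$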
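The block structure of the full covariance matrix can be illustrated with a short sketch. The function names `center_columns` and `cross_cov`, the variable `Z` for the combined matrix, and the toy data are all assumptions of this example; matrices are plain lists of rows and the divisor is taken to be $n-1$.

```python
def center_columns(M):
    """Subtract the column mean from every column of M (a list of rows)."""
    n = len(M)
    means = [sum(row[j] for row in M) / n for j in range(len(M[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in M]


def cross_cov(A, B):
    """Sample covariances between each column of A and each column of B."""
    n = len(A)
    Ac, Bc = center_columns(A), center_columns(B)
    return [[sum(Ac[i][j] * Bc[i][k] for i in range(n)) / (n - 1)
             for k in range(len(B[0]))]
            for j in range(len(A[0]))]


# Hypothetical data: n = 4 samples, p = 2 columns in X, q = 1 column in Y.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
Y = [[1.0], [3.0], [2.0], [6.0]]
Z = [rx + ry for rx, ry in zip(X, Y)]  # columns of X followed by columns of Y

S_XX = cross_cov(X, X)
S_XY = cross_cov(X, Y)
S_YY = cross_cov(Y, Y)
S_ZZ = cross_cov(Z, Z)  # 3 x 3, with S_XX, S_XY, S_YY as blocks
```

The upper-left $2 \times 2$ block of `S_ZZ` reproduces `S_XX`, the last column of its first two rows reproduces `S_XY`, and the bottom-right entry reproduces `S_YY`.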
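- The correlation matrix for $X$ is the $p \times p$ matrix $R_{XX}$ whose $(j, k)$ entry is the correlation between columns $x_j$ and $x_k$,
$$r_{jk} = \frac{s_{x_j x_k}}{s_{x_j}\, s_{x_k}} ,$$
so that all diagonal entries equal $1$.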
- Columns of $X$ may need to be scaled, to make the standard deviation $s_{x_j}$ about the same for each column.
- Using the same measurement unit for all columns is often enough to achieve suitable scaling.
- If scaling is a problem, we may correct this by replacing each column $x_j$ by $x_j / s_{x_j}$, giving a column with standard deviation $1$.
- Autoscaling: when the column is both centered and scaled:
$$\frac{x_j - \bar{x}_j\,\mathbf{1}}{s_{x_j}} ,$$
where $\bar{x}_j$ is the mean of column $x_j$ and $\mathbf{1}$ is the $n$-vector of ones.
- An (NIR) spectral block $X$ normally does not need to be scaled, because all columns of $X$ have the same unit (absorbance).
- If the columns of a data matrix represent unrelated variables with different units (e.g. ppm, %, km, kg), then scaling is recommended, so that no variable has an undue influence on the calibration results.
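Autoscaling a single column can be sketched as follows. The function name `autoscale` and the two example columns are hypothetical, and the sample standard deviation is assumed to use the divisor $n-1$. A constant column is rejected, since its standard deviation is zero and it cannot be scaled.

```python
from math import sqrt


def autoscale(column):
    """Center a column and divide by its sample standard deviation."""
    n = len(column)
    m = sum(column) / n
    s = sqrt(sum((v - m) ** 2 for v in column) / (n - 1))
    if s == 0:
        raise ValueError("a constant column cannot be scaled")
    return [(v - m) / s for v in column]


# Two hypothetical variables recorded in very different units.
ppm = [10.0, 20.0, 30.0, 40.0]
percent = [0.1, 0.4, 0.2, 0.3]

z_ppm = autoscale(ppm)
z_percent = autoscale(percent)
```

After autoscaling, both columns have mean zero and standard deviation one, so neither dominates merely because of its unit.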
- The simple linear regression model is
$$y_i = \alpha + \beta x_i + \varepsilon_i \tag{3.2}$$
for $i = 1, \dots, n$, where
$y_i$ is the $i$'th value of the response (dependent) variable, normally the concentration in the $i$'th sample;
$x_i$ is the $i$'th value of the explanatory (independent) variable, normally the absorbance for the $i$'th sample;
$n$ is the sample size, which must be at least 2;
$\alpha$ and $\beta$ are the intercept and slope of the regression line, respectively;
$\varepsilon_i$ is the $i$'th random noise term, assumed independent, with zero mean and common variance $\sigma^2$.
$\alpha$, $\beta$ and $\sigma^2$ are unknown parameters, to be estimated from the data.
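- We may stack the equations (3.2) and write them in vector form:
$$y = \alpha\,\mathbf{1} + \beta x + \varepsilon , \tag{3.3}$$
or, after centering $x$,
$$y = \alpha_0\,\mathbf{1} + \beta\tilde{x} + \varepsilon , \tag{3.4}$$
where $\mathbf{1}$ is the $n$-vector of ones, $\tilde{x} = x - \bar{x}\,\mathbf{1}$ is the centered $x$, and $\alpha_0 = \alpha + \beta\bar{x}$ is the constant term of the regression after centering $x$. The following three graphs illustrate the effect of centering: first without centering, then after centering $x$, and finally after centering both $x$ and $y$.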
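- The least squares estimators of the unknown parameters $\alpha_0$ and $\beta$ are
$$\hat{\alpha}_0 = \bar{y}$$
and
$$\hat{\beta} = \frac{s_{xy}}{s_x^2} = \frac{\tilde{x}^\top\tilde{y}}{\tilde{x}^\top\tilde{x}} ,$$
provided $s_x > 0$ (note that the two $1/(n-1)$ factors cancel out). Here $\tilde{x}$ and $\tilde{y}$ are the centered vectors, $s_{xy}$ is the sample covariance and $s_x$ the sample standard deviation of $x$. This result is shown under Examples.
- The case $s_x = 0$, for that matter, is uninteresting, because then all $x$-values are the same, and such a variable is useless for prediction.
- Another way to write $\hat{\beta}$ is
$$\hat{\beta} = r_{xy}\,\frac{s_y}{s_x} ,$$
where $r_{xy} = s_{xy} / (s_x s_y)$ is the sample correlation.
- If $s_x = s_y$, in particular if $x$ and $y$ are both autoscaled, then $\hat{\beta}$ is the correlation between $x$ and $y$.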
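The least squares estimators can be computed directly from the centered data. The sketch below is an illustration only: the function name `fit_centered_line` and the absorbance and concentration values are made up, and the common $1/(n-1)$ factors are omitted because they cancel in the slope.

```python
def fit_centered_line(x, y):
    """Return (alpha0_hat, beta_hat) for the regression on centered x."""
    n = len(x)
    if n < 2:
        raise ValueError("at least two samples are needed")
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)  # (n - 1) times the variance of x
    if sxx == 0:
        raise ValueError("all x-values are equal; the slope is undefined")
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    return ybar, sxy / sxx


x = [0.1, 0.2, 0.3, 0.4, 0.5]  # hypothetical absorbances
y = [1.1, 1.9, 3.2, 3.8, 5.0]  # hypothetical concentrations

alpha0_hat, beta_hat = fit_centered_line(x, y)
```

For these numbers the constant term after centering is the mean concentration, 3.0, and the slope is approximately 9.7 concentration units per unit of absorbance.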
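- Let us insert the value $\hat{\alpha}_0 = \bar{y}$ in (3.4), giving
$$y = \bar{y}\,\mathbf{1} + \beta\tilde{x} + \varepsilon$$
or, equivalently,
$$\tilde{y} = \beta\tilde{x} + \varepsilon , \tag{3.5}$$
where we have used the centering notation $\tilde{y} = y - \bar{y}\,\mathbf{1}$ and $\tilde{x} = x - \bar{x}\,\mathbf{1}$.
- Based on this equation, we often work with centered $x$ and $y$, and leave out the constant term in the regression, but it is important to keep in mind that the statistical model we are using continues to be (3.2) or (3.4).
- From (3.5) we see that the fitted regression line takes the form
$$y = \bar{y} + \hat{\beta}\,(x - \bar{x}) , \tag{3.6}$$
being the line through $(\bar{x}, \bar{y})$ with slope $\hat{\beta}$. The intercept is $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$.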
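- Variance estimate: the noise variance $\sigma^2$ is estimated from the residuals of the fitted line,
$$\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \bar{y} - \hat{\beta}\,(x_i - \bar{x})\right)^2 ,$$
which requires $n \ge 3$.
The proof of this result is given under Examples.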
- Prediction: Suppose we are observing the same system that gave rise to the calibration data and thus continues to follow the model (3.2). We are given test data, in the form of a value of the independent variable denoted $x_0$, but instead of observing the corresponding value $y_0$, we want to predict the value of $y_0$ that we would have obtained had $y_0$ also been observed. Note that we use the notation $(x_0, y_0)$ instead of $(x_{n+1}, y_{n+1})$, in order to distinguish the calibration data $(x_1, y_1), \dots, (x_n, y_n)$ from the test data $(x_0, y_0)$.
- The predicted value, denoted $\hat{y}_0$, is obtained by simply inserting $x_0$ for $x$ in the fitted regression line (3.6), giving
$$\hat{y}_0 = \bar{y} + \hat{\beta}\,(x_0 - \bar{x}) . \tag{3.7}$$
Note that $\hat{y}_0$ depends on the calibration data via $\bar{x}$, $\bar{y}$ and $\hat{\beta}$, but does not involve the unknown parameters $\alpha$, $\beta$ and $\sigma^2$ as such.
- We have thus obtained a practical prediction method, which will serve as the prototype for the more complicated methods of chemometrics that are the main topic of the course.
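The prediction step can be sketched as a single function that takes the calibration data and a new value of the independent variable. The name `predict` and the numbers are hypothetical; the function simply evaluates the fitted line through the point of means with the least squares slope.

```python
def predict(x_cal, y_cal, x0):
    """Predict the response at x0 from calibration data (x_cal, y_cal)."""
    n = len(x_cal)
    xbar, ybar = sum(x_cal) / n, sum(y_cal) / n
    sxx = sum((a - xbar) ** 2 for a in x_cal)
    if sxx == 0:
        raise ValueError("all calibration x-values are equal")
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x_cal, y_cal))
    beta_hat = sxy / sxx
    # Fitted line through (xbar, ybar) with slope beta_hat, evaluated at x0.
    return ybar + beta_hat * (x0 - xbar)


x_cal = [0.1, 0.2, 0.3, 0.4, 0.5]  # hypothetical calibration absorbances
y_cal = [1.1, 1.9, 3.2, 3.8, 5.0]  # hypothetical calibration concentrations

y0_hat = predict(x_cal, y_cal, 0.35)
```

Only the two means and the estimated slope enter the calculation, so the prediction uses nothing but the calibration data and the new absorbance.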