Capt. Renault: What in heaven's name brought you to Casablanca?
Rick: My health. I came to Casablanca for the waters.
Capt. Renault: The waters? What waters? We're in the desert.
Rick: I was misinformed.
[Casablanca, 1942]
The topic of optimization concerns a set of techniques designed to ensure
that you get the best out of your model and data, and that you avoid any
pitfalls on the way. The techniques allow you to take a critical look at
your model in order to spot anything that might lead to incorrect or
misleading results. We first discuss the choice of calibration method, and
then move on to discuss outlier detection, and the use of graphical displays.
In the previous modules, we have considered several competing calibration
methods, and discussed some of their advantages and disadvantages. We now
review the main considerations regarding the choice of calibration method.
Assume that the calibration data consist of centered data matrices
$X$ and $Y$, of dimensions $n \times p$ and $n \times q$, respectively.
Let us consider the four methods discussed until now:
- MLR--Multiple Linear Regression.
- CLS--Classical Least Squares.
- PCR--Principal Components Regression.
- PLS--Partial Least Squares.
First a note about linearity. As we have seen, prediction takes the
following linear form:
$\hat{Y} = XB$,
where $B$ is the so-called regression matrix.
In effect, the only difference between the four methods is the way in which
the regression matrix $B$ is calculated. In this
sense, all four methods are linear, and they are not suitable for genuinely
nonlinear problems. We comment below on nonlinearity and some other problems
that may occur with calibration models.
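The linear prediction form can be illustrated with a minimal NumPy sketch. The simulated data, the sizes and the names `B_true`, `B_mlr` and `predict` are assumptions made purely for illustration and do not come from the notes. The regression matrix is computed here by ordinary least squares (the MLR choice); the other methods would change only that one step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 4, 2                      # illustrative sizes: samples, x-variables, y-variables

# Simulated calibration data (an assumption made for illustration only)
X_raw = rng.normal(size=(n, p))
B_true = rng.normal(size=(p, q))
Y_raw = X_raw @ B_true + 0.1 * rng.normal(size=(n, q))

# Center both blocks, as assumed in the text
x_mean, y_mean = X_raw.mean(axis=0), Y_raw.mean(axis=0)
X, Y = X_raw - x_mean, Y_raw - y_mean

# MLR: least-squares estimate of the regression matrix (p x q)
B_mlr, *_ = np.linalg.lstsq(X, Y, rcond=None)

def predict(X_new_raw, B):
    """Center with the calibration means, multiply by B, add the y-mean back."""
    return (X_new_raw - x_mean) @ B + y_mean

Y_hat = predict(X_raw, B_mlr)
rmse = np.sqrt(np.mean((Y_raw - Y_hat) ** 2))
print("regression matrix shape:", B_mlr.shape)
print("calibration RMSE:", round(float(rmse), 3))
```

Because prediction is a single matrix product, any of the four methods can be plugged into `predict` once its regression matrix is available.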
There are two particular problems often met in calibration, namely interference and matrix effects. Interference is when either other
signal-emitting elements present in the sample or physical conditions in the
equipment influence the results for the compound(s) of interest. Matrix
effects are interferences originating in conditions or compounds in the
sample, such as temperature or pH, that do not emit signals as such, but may
change the actual signal measured for the compound(s) of interest.
The case of few $x$-variables is when $p < n$, that is, there are fewer
$x$-variables than calibration samples. Under this heading we consider MLR,
which is the only method among the four that limits the number of
$x$-variables in this way. This may be a severe restriction in a chemometric
setting, but it is important to recognize that statistically, a good
prediction method requires a large number of calibration samples ($n$
large), and no number of $x$-variables can compensate for too few
calibration samples.
Now let us summarize some of the pros and cons of MLR.
- Pros of MLR:
- Copes well with interferences.
- Copes well with matrix effects.
- Cons of MLR:
- Requires $p < n$.
- Does not cope well with nonlinearity.
- Does not cope well with collinearity (see below).
- Too many $x$-variables give less precise predictions.
The case of many $x$-variables refers to the case $p \ge n$, that is, there
are at least as many $x$-variables as calibration samples, which is often
the case in chemometrics. This case may in principle be handled by any of
the three methods CLS, PCR and PLS. We now summarize some of the pros and
cons of these methods.
- General pros of CLS, PCR and PLS:
- Eliminate noise in $X$ to a certain degree.
- Pros of CLS:
- Simple chemical interpretation (Beer-Lambert's law).
- Cons of CLS:
- Does not cope well with nonlinearity.
- Does not cope well with interferences, unless they are known in
advance.
- Does not cope well with matrix effects, unless they are linear.
The PCR and PLS methods are based on the principle of using as few latent
variables (scores) as possible, while at the same time including information
from the whole spectrum (all $x$-variables). The difference from the CLS
method lies in the fact that the latter uses the whole spectrum directly,
whereas PCR and PLS use only functions of the spectrum, namely the scores.
PCR and PLS can never use more than $n$ scores in the model, however.
- General pros of PCR and PLS:
- Cope with nonlinearities to some extent.
- Cope well with interferences.
- Cope well with collinearity.
- Pros of PCR:
- Simple expression for the prediction variance.
- Cons of PCR:
- $Y$ is not taken into account in the decomposition of $X$.
- Pros of PLS:
- Takes both $X$ and $Y$ into account in the decomposition of $X$.
- Cons of PLS:
- The prediction variance is difficult to calculate.
- PLS is inappropriate if
has too much variation.
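As an illustration of how a score-based method handles more $x$-variables than samples, the following is a minimal sketch of PCR using the singular value decomposition. It covers PCR only (PLS is not shown), and the function name `pcr_fit`, the simulated spectra and all sizes are assumptions made for this example.

```python
import numpy as np

def pcr_fit(X, Y, a):
    """Principal components regression with `a` scores on centered X (n x p) and Y (n x q)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:a].T                        # loadings (p x a)
    T = X @ P                           # scores (n x a)
    # Regress Y on the scores; T'T is diagonal with entries s**2
    Q = (T.T @ Y) / (s[:a, None] ** 2)
    B = P @ Q                           # regression matrix (p x q) in terms of the x-variables
    return B, T, P

# Simulated data with many x-variables (p > n), driven by three latent factors
rng = np.random.default_rng(1)
n, p, q, a = 30, 200, 1, 3
latent = rng.normal(size=(n, a))
spectra = latent @ rng.normal(size=(a, p)) + 0.01 * rng.normal(size=(n, p))
conc = latent @ rng.normal(size=(a, q)) + 0.01 * rng.normal(size=(n, q))
X = spectra - spectra.mean(axis=0)
Y = conc - conc.mean(axis=0)

B, T, P = pcr_fit(X, Y, a)
resid = Y - X @ B
print("rank of X:", np.linalg.matrix_rank(X), "with", p, "x-variables")
print("PCR residual standard deviation:", float(resid.std()))
```

MLR would fail on these data because the rank of the centered $X$ is below the number of $x$-variables, whereas regression on a few scores remains well defined.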
We now present some of the potential problems in calibration that have to be
solved in order to get the best out of your data and model.
- The effect of random error.
Calibration data always contain errors, either measurement errors or other
forms of noise. There are many potential sources of errors, such as sample
heterogeneity, thermal noise in electronic circuits, false light sources,
uncontrolled interferences and so on. Some may be due to human error or
hardware failure, such as incorrect labeling of test tubes, defective
sensors, contamination of material, or samples that do not originate from
the intended study population. Although some of these errors may actually be
deterministic, we often assume that the errors are randomly distributed,
because we may not understand or know about all of them, and cannot predict
them.
- Outliers and robustness.
Any sample or variable that somehow deviates from the majority is called an
outlier. As the discussion of random error above suggests, random errors and
outliers may to some extent have common origins, but pragmatically, outliers
are defined as those errors that stand out, either visually on a plot, or
according to a more specific criterion.
Robustness, in the statistical sense, is first a property of the
statistical method, such that the results obtained using the method are to
some extent insensitive to small deviations from the assumptions behind the
method. Second, robustness also means that results ought not depend
crucially on whether certain samples or variables are included in the
calibration set or not, so that, in particular, results are not unduly
distorted by one or more outliers in the data. Robustness is a desirable,
indeed crucial, property of a good calibration method. Methods for detection
and analysis of outliers, along with a discussion of robustness, will be
considered below.
- Collinearity (sometimes called multicollinearity).
Collinearity often arises for spectral data $X$, because absorbances for
neighboring frequencies are correlated. It may also be due to other causes,
in cases where the information carried by the $X$- or $Y$-block is smaller
than the number of variables ($p$ and $q$, respectively) might suggest. Both
PCR and PLS deal well with collinearity in the $X$- and $Y$-blocks, but if
collinearity in $Y$ is revealed, one should look for the underlying cause,
be it chemical or from other causes such as closed systems (where
concentrations add up to a constant).
- Spanning the calibration space.
If the chosen calibration samples do not fully represent the actual
conditions under which the method is to be used, bad predictions may result.
For example, if a calibration study is conducted with large concentrations
of the analyte, whereas the actual conditions under which the method is to
be used require small concentrations, the calibration may be misleading.
External conditions should also be representative of the real conditions
where the calibration model will be used. For example, if trial runs of a
production are conducted during summer holidays, the method may fail in cold
weather if the influence of temperature was not properly taken into account.
To obtain a set of representative samples, proper use of experimental design
should be coupled with sound chemical knowledge. The question of
experimental design, although important, is not studied as such here. Later
we present various graphical displays that can help determine if the
calibration set may be considered representative for the set of conditions
we want to describe.
- Nonlinearity.
This problem is dealt with in the next section.
- Under- and overfitting.
Under- and overfitting is mainly a question of selecting the correct number
of scores. This problem will be treated in the next module.
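The collinearity item above can be made concrete with a small numerical check. This is only a sketch under simulated data: the two-factor "spectra", the Dirichlet-distributed concentrations and all variable names are assumptions for illustration. Singular values and the condition number are used here as one common way of revealing collinearity; the notes do not prescribe a particular diagnostic.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 6

# Simulated x-block: six variables driven by only two underlying factors
latent = rng.normal(size=(n, 2))
X_raw = latent @ rng.normal(size=(2, p)) + 0.01 * rng.normal(size=(n, p))
X = X_raw - X_raw.mean(axis=0)

sv = np.linalg.svd(X, compute_uv=False)
print("singular values of X:", np.round(sv, 3))
print("condition number of X:", round(float(sv[0] / sv[-1]), 1))

# Closed system: three concentrations that add up to 1 in every sample
Y_raw = rng.dirichlet(alpha=[2.0, 3.0, 5.0], size=n)
Y = Y_raw - Y_raw.mean(axis=0)
print("rank of centered Y:", np.linalg.matrix_rank(Y), "with", Y.shape[1], "columns")
```

A few dominant singular values followed by very small ones indicate that the block carries less information than its number of variables suggests; in the closed system the centered block loses one rank exactly.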
The theory of nonlinear models is beyond the scope of the present notes, but
a few remarks about this topic are in order. The following considerations
show that the scope of linear models is wider than may appear at first sight.
Suppose that the relation between the spectrum $x$ and the vector of
concentrations $y$ ($p \times 1$ and $q \times 1$ vectors, respectively) is
given by a smooth function $y = f(x)$, possibly nonlinear, ignoring noise.
Then a Taylor expansion of $f$ around a point $x_0$ gives

$$ y^\top \approx f(x_0)^\top + (x - x_0)^\top \nabla f(x_0), \qquad (9.1) $$

where $\nabla$ is the gradient operator, so that $\nabla f(x_0)$ is the
$p \times q$ matrix of partial derivatives of $f$ at $x_0$. If we now have
$n$ calibration samples with $x$-values $x_1, \ldots, x_n$, we may form the
matrices $X$ ($n \times p$), $Y$ ($n \times q$) and $B = \nabla f(x_0)$
($p \times q$). Including the usual noise term $E$, and absorbing the error
in (9.1) into $E$, we hence obtain the following linear model:

$$ Y = XB + E. $$

Note that a small amount of nonlinearity (from the missing remainder term in
(9.1)) may be absorbed into the noise term $E$. In
this sense, any nonlinear smooth model may be considered to be approximately
linear. The centered data matrices $X$ and $Y$ do not involve $x_0$ or
$f(x_0)$, and since the value of $B$ may be estimated from $X$ and $Y$,
there is no need to specify the value of the expansion point $x_0$
explicitly, nor the actual form of $f$,
in order for this argument to work in a practical setting. Since the amount
of nonlinearity is always small for $x$ close enough to $x_0$, any model may
be considered approximately linear over ranges narrow enough that the
nonlinearity is small compared with the noise level.
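The local-linearity argument can be checked numerically. The sketch below fits a straight line to a smooth nonlinear function over intervals of increasing width around an expansion point and reports the lack of fit relative to the spread of the response. The exponential function, the expansion point 1.0 and the helper name `linear_fit_error` are arbitrary choices for illustration.

```python
import numpy as np

def f(x):
    """A smooth nonlinear response, chosen only for illustration."""
    return np.exp(x)

def linear_fit_error(half_width, x0=1.0, m=101):
    """Fit a straight line to f on [x0 - half_width, x0 + half_width].

    Returns the maximum absolute lack of fit divided by the range of f on the interval.
    """
    x = np.linspace(x0 - half_width, x0 + half_width, m)
    y = f(x)
    slope, intercept = np.polyfit(x, y, deg=1)
    err = np.max(np.abs(y - (slope * x + intercept)))
    return err / (y.max() - y.min())

for w in (0.05, 0.5, 2.0):
    print(f"half-width {w}: relative lack of fit {linear_fit_error(w):.4f}")
```

The relative lack of fit shrinks as the interval narrows, which is the sense in which a narrow calibration range hides nonlinearity within the noise.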
A good strategy for problems that may or may not be nonlinear is hence to
attempt to use a linear model, and then to validate and optimize the model
by means of the methods considered here and in the next module. If a
well-fitting linear model is found in this way, then fine. If the correct
model is truly nonlinear, and a wide enough range of samples is included in
the calibration for the nonlinearity to reveal itself, then the nonlinearity
must be taken into account.
An important point to keep in mind is that some nonlinear models may be
linearized by transforming either $x$, $y$ or both
by means of a suitable nonlinear function. Let us give two examples of this.
Consider the relation

$$ y = \alpha \exp(\beta x). $$

By taking logs on both sides of the equation and moving $\log \alpha$ to
the left-hand side of the equation, we obtain

$$ \log y - \log \alpha = \beta x, $$

a linear model in terms of the transformed variable $\log y$. Note in
particular that any constant term, such as $\log \alpha$, will disappear
when $\log y$ is centered. Such data may hence be analyzed by replacing $y$
by $\log y$ in the analysis. Similarly, if $y$ is a product of powers, for
example

$$ y = \alpha x_1^{\beta_1} x_2^{\beta_2}, $$

then taking logs and moving $\log \alpha$ to the left-hand side gives

$$ \log y - \log \alpha = \beta_1 \log x_1 + \beta_2 \log x_2. $$

This model may hence be analyzed by replacing both $y$ and the two
$x$-variables by their logs.
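The product-of-powers case can be tried out on simulated data. In this sketch the constants `alpha`, `beta1`, `beta2`, the sampling ranges and the multiplicative noise are all assumptions for illustration; the point is only that ordinary least squares on the centered logs recovers the exponents, while the constant factor drops out with centering.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
alpha, beta1, beta2 = 2.0, 1.5, -0.7          # illustrative "true" values

x1 = rng.uniform(0.5, 5.0, size=n)
x2 = rng.uniform(0.5, 5.0, size=n)
# Multiplicative noise keeps y positive, so that logs can be taken
y = alpha * x1**beta1 * x2**beta2 * np.exp(0.02 * rng.normal(size=n))

# Replace y and both x-variables by their logs, then center
L = np.column_stack([np.log(x1), np.log(x2)])
ly = np.log(y)
Lc, lyc = L - L.mean(axis=0), ly - ly.mean()

# Linear regression on the transformed, centered variables
b, *_ = np.linalg.lstsq(Lc, lyc, rcond=None)
print("estimated exponents:", np.round(b, 3))
```

As the text cautions, the transformation should still be checked with residual diagnostics rather than applied blindly.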
Such linearization techniques are discussed in many books on regression; see
for example Draper and Smith (1998) and Atkinson (1985). More generally,
suitable transformations of $x$ and/or $y$ may reduce the amount of
nonlinearity found in any given model. By far the most common
transformation is the log transformation for positive variables, as
already illustrated. It is hence advisable to always consider the
possibility of taking logs for positive variables, in order to see if
nonlinearity may be eliminated or reduced. This should not be done blindly,
however, so the suitability of any given transformation should always be
verified by the validation and optimization methods discussed in this module.
Residual analysis consists of a set of diagnostic tools designed to
detect outliers and other potential problems in a given calibration model.
We shall phrase the discussion of residual analysis in terms of the MLR
method. We hence assume a model of the form
$$ y = X\beta + \varepsilon, \qquad (9.2) $$

where $X$ has dimension $n \times p$ and rank $p$, and $p < n$.
Assume that $\varepsilon_1, \ldots, \varepsilon_n$ are independent and
$\varepsilon_i \sim N(0, \sigma^2)$.
The $X$-matrix is not assumed to be centered. The methods may,
however, also be applied to PCR and PLS. First of all, methods developed for
a single $y$-column may simply be applied to the columns of the $Y$-block
one column at a time. Second, when a model for $X$ is used, for
example

$$ X = TP^\top + E \qquad (9.3) $$

(both PCR and PLS), then $X$ in (9.2) is replaced by the
scores matrix $T$ from (9.3), letting $p = a$, the number of scores (recall
that $a$ must be less than $n$). In this sense, the methods are universally
applicable.
We shall not discuss the special problems of residual analysis encountered
in connection with CLS.
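- Crude residuals: $r_i = y_i - \hat{y}_i$, $i = 1, \ldots, n$.
- Variance estimator: $s^2 = \frac{1}{n - p} \sum_{i=1}^{n} r_i^2$.
Note that $n - p$ is known as the degrees of freedom.
- Hat matrix: $H = X (X^\top X)^{-1} X^\top$, with entries $h_{ij}$.
- Vector of fitted (predicted) values: $\hat{y} = X\hat{\beta} = Hy$,
so $\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j$.
- Interpretation: $h_{ij}$ is the weight with which $y_j$ enters $\hat{y}_i$.
- Define the $i$th leverage $h_i = h_{ii}$,
the weight with which $y_i$ enters in $\hat{y}_i$.
- Properties: $0 \le h_i \le 1$ and $\sum_{i=1}^{n} h_i = p$.
- Average value of $h_i$ is $p/n$.
- Large $h_i$ is a problem if it is well above this average; a common rule
of thumb flags $h_i > 2p/n$.
- Note $\mathrm{Var}(\hat{y}_i) = \sigma^2 h_i$ and
$\mathrm{Var}(r_i) = \sigma^2 (1 - h_i)$ (variances depend on $i$).
- Hence $h_i$ near 1 implies large variance for $\hat{y}_i$ and
small variance for $r_i$.
- Define standardized residuals $r_i' = r_i / (s \sqrt{1 - h_i})$;
all have approximately variance 1.
- $r_i'$ de-emphasizes outliers, because $s$ also is large when $r_i$
is large.
- Define Studentized (cross-validation) residuals
$r_i^* = r_i / (s_{(i)} \sqrt{1 - h_i})$,
where $s_{(i)}^2$ is the estimate for $\sigma^2$ leaving $y_i$ out,
$s_{(i)}^2 = \frac{(n - p) s^2 - r_i^2 / (1 - h_i)}{n - p - 1}$.
- Studentized residuals emphasize outliers.
- When no outliers are present, $r_i^*$ follows a $t(n - p - 1)$
distribution.
- When $n$ is large: outlier means either $|r_i'|$ or $|r_i^*|$ (much)
greater than 2.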
- If outliers come in pairs, triples or larger groups, the deletion of
one observation at a time does not change the fit much. This is called a masking effect.
- It may hence be useful to delete data in pairs, triplets etc. and
compare with the fit based on the remaining data.
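The residual quantities listed above can be computed directly with NumPy. This is a minimal sketch, not a reference implementation: the function name `residual_diagnostics`, the simulated data and the planted gross error are assumptions for illustration, and the leverage limit $2p/n$ is a common rule of thumb rather than a value taken from the notes. The leave-one-out variance uses the standard deletion identity, so no refitting is needed.

```python
import numpy as np

def residual_diagnostics(X, y):
    """Leverages, crude, standardized and Studentized residuals for y = X beta + eps."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta                                  # crude residuals
    s2 = r @ r / (n - p)                              # variance estimator, n - p degrees of freedom
    Q, _ = np.linalg.qr(X)                            # H = Q Q^T, so leverages are row sums of Q**2
    h = np.sum(Q**2, axis=1)
    r_std = r / np.sqrt(s2 * (1 - h))                 # standardized residuals
    s2_del = ((n - p) * s2 - r**2 / (1 - h)) / (n - p - 1)   # variance estimate leaving sample i out
    r_stud = r / np.sqrt(s2_del * (1 - h))            # Studentized (cross-validation) residuals
    return h, r, r_std, r_stud

rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # not centered: intercept column included
y = X @ np.array([1.0, 2.0, -1.0]) + 0.2 * rng.normal(size=n)
y[0] += 3.0                                                   # plant one gross error

h, r, r_std, r_stud = residual_diagnostics(X, y)
p = X.shape[1]
print("sum of leverages:", round(float(h.sum()), 6), "with p =", p)
print("flagged by |Studentized residual| > 2:", np.flatnonzero(np.abs(r_stud) > 2))
print("flagged by leverage > 2p/n:", np.flatnonzero(h > 2 * p / n))
```

For the planted outlier the Studentized residual is larger in absolute value than the standardized one, which illustrates why the former emphasizes outliers.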
Summary of residuals is shown in Table 9.1.
Summary of residual plots
- Plot the residuals $r_i$ against the fitted values $\hat{y}_i$, and check
for trumpet form (transform $y$ if necessary).
- Plot the residuals $r_i$ against $x_{ij}$ for all $j$, and check for
nonlinear form (transform $x_j$ if necessary).
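- Plot the Studentized residuals against the leverages, and check for points
outside the limits for the residuals or the leverages or both.
- Plot the leverages $h_i$ against the sample index $i$, and look for
$h_i$ much bigger than the average $p/n$.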
- A final model check is to plot the observed values $y_i$ against the
fitted values $\hat{y}_i$.
There are many robust methods available in statistics, such as
robust regression, where outliers are downweighted in order to limit their
potential influence on the results. The purpose of such 'automatic' robust
methods is to make the results more stable and thus more reliable.
The same goal can, however, also be achieved by simpler
procedures, where outliers are detected by inspection, and then treated
appropriately, which is the method chosen here. One should nevertheless be
aware that manual outlier detection and handling is time-consuming and
requires a certain amount of experience. Automatic outlier handling
therefore also has its place in the analyst's tool bag, especially in
connection with large-scale automated data handling.
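To show what downweighting of outliers can look like in practice, here is a minimal sketch of one 'automatic' robust method: iteratively reweighted least squares with Huber weights. The notes do not specify a particular robust estimator, so the Huber weight function, the MAD scale estimate, the conventional tuning constant 1.345 and the function name `huber_irls` are all assumptions made for this example.

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50):
    """Robust regression by iteratively reweighted least squares with Huber weights."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # start from ordinary least squares
    for _ in range(n_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745   # robust scale estimate (MAD)
        u = np.abs(r) / max(scale, 1e-12)
        w = np.minimum(1.0, c / np.maximum(u, 1e-12)) # weight 1 for small residuals, less for large ones
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        converged = np.allclose(beta_new, beta, atol=1e-10)
        beta = beta_new
        if converged:
            break
    return beta, w

rng = np.random.default_rng(5)
n = 40
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + 0.1 * rng.normal(size=n)
y[-3:] += 8.0                                         # three gross errors at the high end

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_rob, w = huber_irls(X, y)
print("OLS slope:   ", round(float(beta_ols[1]), 3))
print("robust slope:", round(float(beta_rob[1]), 3))
print("smallest weights at samples:", np.argsort(w)[:3])
```

The final weights also serve as a diagnostic: samples with very small weights are the ones an analyst would want to inspect by hand.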
When errors have occurred as a result of using samples from other
populations than the intended one, or using samples with erroneous
$x$- or $y$-values due to human or technical error, this may result in outliers.
Robustness against such errors may be achieved by identifying the samples or
variables in question and eliminating them from the calibration set. Besides
eliminating the problems that would have been caused by such an outlier,
this is a useful exercise in itself, because it leads one to identify
weaknesses in procedures and techniques, and may hence lead one to take a
critical view of the whole process under study. For this reason, manual
outlier detection and handling is an indispensable tool, especially in the
initial phases of a calibration project.
Robustness against measurement errors and abnormal distributions may be
achieved by including many samples and variables in the calibration. In this
way, each individual sample or measurement has less influence on the final
result, and parameter estimation becomes less sensitive to random noise.
Appropriate spanning of the calibration space, using methods of experimental
design and common sense, serves the same purpose.
If, for economic reasons or otherwise, one is unable to include many samples
in the calibration, one may in certain situations encounter problems. This
may happen if variables are heteroscedastic (when the variance is a function
of the mean, rather than constant) or non-normal (not from the normal
distribution). In any case, the amount of information contained in a
calibration set is, generally speaking, proportional to the number of
calibration samples. No amount of sophistication can make up for a lack of
proper and relevant calibration samples. Guidance about the appropriate
choice of the number of calibration samples needed in order to obtain a
certain accuracy and robustness is part of the methods of experimental
design. It may be useful to conduct a small pilot study under realistic
circumstances in order to obtain information necessary for proper
experimental design.
Graphical displays are very important in practical data analysis, and we
have already used them in the data examples in the previous modules. We now
review the main types of plots and their purposes.
Purpose: To obtain an overview of the full data set.
- Spectral plot (index plot for $X$). An index plot is a
plot of the profiles $(x_{i1}, \ldots, x_{ip})$ for $i = 1, \ldots, n$,
usually taking the abscissa to be either frequency or wavelength.
- Composition plot (index plot for $Y$).
- Scatterplot (plot of a $y$-column versus an $x$-column). Useful only
for small $p$ and $q$, such as simple linear regression.
Purpose: To summarize the information obtained from a PCR or PLS analysis,
and detect problems, such as nonlinearity.
- Scree plot (index plot of eigenvalues or component variability).
- Plot of percent variance explained (cumulative plot of eigenvalues or
component variability).
- Loading plots of the $p$-vectors (PCR) or the $w$-vectors (PLS).
- Score plots of the $t$-vectors (PCR or PLS).
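The numbers behind a scree plot and a plot of percent variance explained can be obtained from the singular value decomposition of the centered $X$-block. The sketch below covers the PCR (principal components) case only; the simulated data and variable names are assumptions for illustration, and the actual plotting is left to whatever plotting tool is at hand.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, a_true = 25, 100, 3

# Simulated spectra with three components of decreasing importance plus noise
scores = rng.normal(size=(n, a_true)) * np.array([5.0, 2.0, 1.0])
X_raw = scores @ rng.normal(size=(a_true, p)) / np.sqrt(p) + 0.05 * rng.normal(size=(n, p))
X = X_raw - X_raw.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals = s**2 / (n - 1)                       # eigenvalues of the sample covariance matrix
explained = 100 * eigvals / eigvals.sum()      # scree plot values, in percent
cumulative = np.cumsum(explained)              # percent variance explained

T = U * s                                      # score vectors (columns), for score plots
P = Vt.T                                       # loading vectors (columns), for loading plots

for k in range(5):
    print(f"component {k + 1}: {explained[k]:6.2f}%  cumulative {cumulative[k]:6.2f}%")
```

Plotting `explained` against the component index gives the scree plot, `cumulative` gives the percent-variance plot, and pairs of columns of `T` or `P` give score and loading plots.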
Purpose: To detect outliers, nonlinearity, lack of fit, or other problems
with the calibration model.
- Atkinson, A.C. (1985). Plots, Transformations, and Regression: An
Introduction to Graphical Methods of Diagnostic Regression Analysis.
Oxford: Clarendon Press.
- Draper, N.R. and Smith, H. (1998). Applied Regression
Analysis (3rd ed.). New York: John Wiley & Sons.