In complex regression situations, when there is a large number of
explanatory variables which may or may not be relevant for making
predictions about the response variable, it is useful to be able to reduce
the model to contain only the variables which provide important information
about the response variable. But deciding which explanatory variables to
include in the simpler model is not always trivial.
Example 8.1 Detoxification of malathion in chickens
It is known that one can alter the toxicity of various types of chemicals (e.g. drugs, pesticides or insecticides) in mammals by inducing liver
enzyme activity. This example relates to a study investigating the
relationship between detoxification of malathion (an insecticide containing
phosphorus) and induced enzyme activity in chickens. Five different enzyme
activities were considered. They were all induced using the enzyme inducer
3-methylcholanthrene (3-MC).
Here we have a response variable, detoxification of malathion (in per cent,
relative to a control, untreated, chicken), related to five explanatory
variables, corresponding to the five induced enzymes. Each explanatory
variable is the percentage of enzyme activity relative to the enzyme
activity in a control, untreated, chicken. There is no pre-knowledge about
which of the enzyme activities are most useful for providing important
information about the response variable. Thus, we need a general methodology
to select the `best' model for the response variable.
Further details on this dataset can be found here.
There are many different methods for selecting the best regression model,
but for each method, two key issues must always be taken into consideration:
`what do we mean by the "best" model?', and `how can we locate
the "best" model?' In technical terms, the first issue
refers to choosing a selection criterion (Section 8.2), the second
issue to choosing a selection procedure (Section 8.3).
Before going into details with the actual procedures for selecting models in
Section 8.3, there are a few general considerations which must be taken into
account. First of all, we need to define the maximum model, that
is, the model containing all explanatory variables which could
possibly be present in the final model. Note that this includes interaction
terms that might affect the response variable. Thus, any possible model for
the data is a restriction of the maximum model, in the sense that it
can be achieved by omitting a number of the explanatory variables from the
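maximum model. Let $k$ denote the maximum number of feasible explanatory
variables (including appropriate interaction terms). Then, the maximum model
is given by
$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where $x_{i1}, \ldots, x_{ik}$ are the explanatory variables, and
$\varepsilon_1, \ldots, \varepsilon_n$ are independent, normally distributed random error terms
with zero mean and common variance.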
Example 8.1 (continued) Detoxification of malathion in
chickens
Suppose that we have no reason to believe that interactions between the
different enzyme activities affect the detoxification of malathion. Hence,
the maximum model for these data is given by
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \varepsilon_i,$$
where $Y_i$ is the percentage of detoxification, $x_{ij}$ is the percentage of
activity of enzyme $j$ relative to the activity in a control chicken, and
$\varepsilon_i$ accounts for the random variation in the data.
When defining the maximum model, it is important to include all explanatory
variables which might have an effect on the response variable. However, one
has to be careful not to include too many unimportant explanatory variables.
If the model contains many explanatory variables compared to the number of
observations, the variation in the estimators of the regression parameters
can be very large, and thus lead to inaccurate parameter estimates. Further,
the more explanatory variables in a model, the greater the risk of
confounding or collinearity (that is, two or more variables are
linearly dependent). Confounding and collinearity can lead to omitting the
`wrong' explanatory variables. Finally, there is the issue of parsimony: if two models are equally good, one should prefer the simpler.
There are several reasons for this: firstly, complex models often confuse
interpretations of the results from the analysis; secondly, the larger the
model, the less precise the parameter estimates will be; and thirdly, if the
model is very large, analysing the data can take a very long time, and does not
necessarily lead to more useful results.
In general, the number of explanatory variables in the maximum model should
take into account the sample size of the data set that is to be analysed:
the smaller the sample size, the smaller the maximum model should be. There
are various rules of thumb for how big the sample size $n$ should be to support
a maximum model with $k$ explanatory variables. The most common ones are that the error
degrees of freedom should be at least ten, that is, $n - (k + 1) \geq 10$,
or that there should be at least 5 observations for each explanatory
variable, that is, $n \geq 5k$. Note that the second rule is the stronger
of the two whenever $k \geq 3$, since the first rule only requires
$n \geq k + 11$. There exist methods to reduce
the required sample size (e.g. using split-samples); however,
they are beyond the scope of this course.
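The two rules of thumb can be checked with a few lines of arithmetic. The sketch below assumes that $n$ is the sample size and $k$ the number of explanatory variables in the maximum model; the function name `min_sample_size` is only illustrative.

```python
def min_sample_size(k):
    """Smallest sample size n allowed by each rule of thumb for k explanatory variables."""
    # Rule 1: error degrees of freedom n - (k + 1) must be at least 10.
    error_df_rule = (k + 1) + 10
    # Rule 2: at least 5 observations per explanatory variable.
    per_variable_rule = 5 * k
    return {"error_df_rule": error_df_rule, "per_variable_rule": per_variable_rule}


for k in (2, 5, 10):
    print(k, min_sample_size(k))
```

For small $k$ the first rule can be the more demanding one, which is why it may be worth evaluating both.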
When the maximum model has been defined, the next point to consider is how
to determine whether one model is `better' than the rest: which criterion should we use to compare the possible models? A selection
criterion is a rule which orders all possible models from `best'
to `worst'. Many different criteria have been suggested over time; some
are better than others, but no single criterion is preferred
overall. In this section, we shall discuss some of the most common
selection criteria.
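Essentially, the purpose of a selection criterion is to compare the maximum
model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i \tag{8.1}$$
with a reduced model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad p < k, \tag{8.2}$$
which is a restriction of the maximum model. If the reduced model provides
(almost) as good a fit to the data as the maximum model, then we prefer the
reduced model.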
The $R^2$ criterion: We have already come across a
crude measure of how well a model accounts for the variation in the data:
the coefficient of determination $R^2$. Recall that $R^2$ is the
proportion of the total amount of variation in the data which can be
explained by the fitted model, that is,
$$R^2 = \frac{SSY - RSS}{SSY},$$
where $SSY = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the corrected sum
of squares of the responses, and $RSS$ is the residual sum of squares. In particular, the
$R^2$s corresponding to the maximum model and the reduced model are given by
$R^2_k$ and $R^2_p$, respectively, where
$$R^2_q = \frac{SSY - RSS_q}{SSY}$$
with
$$RSS_q = \sum_{i=1}^{n} \left( y_i - \hat\beta_0(q) - \hat\beta_1(q) x_{i1} - \cdots - \hat\beta_q(q) x_{iq} \right)^2, \tag{8.3}$$
where $\hat\beta_j(q)$ denotes the least squares estimator for the
regression parameter $\beta_j$ in the model with $q$ explanatory
variables. Recall that, the closer the model fits the data, the larger
$R^2$ will be. Thus, an intuitive, but crude, method to compare the two
models would be to compare the $R^2$s corresponding to the models: the
model with the highest $R^2$ provides the closest fit. However, the method
has a number of drawbacks. The most important is that, due to the way
$R^2$ is defined, the largest model (the one with most explanatory
variables) will always have the largest $R^2$, whether the extra variables
provide any important information about the response variable or not! A
common way to avoid this problem is to use an adjusted version of $R^2$
instead of $R^2$ itself. The adjusted $R^2$ statistic, for a model with $q$
explanatory variables, is given by
$$R^2_{adj} = 1 - \frac{n - 1}{n - q - 1} \left( 1 - R^2_q \right).$$
Note that $R^2_{adj}$ does not necessarily increase when the number of
explanatory variables increases. According to the $R^2$ (and
$R^2_{adj}$) criteria, one should choose the model which has the
largest $R^2$ (or $R^2_{adj}$, respectively).
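To see why the adjusted statistic can rank models differently from the coefficient of determination itself, here is a minimal sketch. It assumes the usual definitions, with `ssy` the corrected sum of squares of the responses, `rss` the residual sum of squares of a model with `q` explanatory variables, and `n` the sample size. The numbers are made up for illustration and are not taken from the malathion data.

```python
def r_squared(rss, ssy):
    """Proportion of the total variation explained by the fitted model."""
    return (ssy - rss) / ssy


def adjusted_r_squared(rss, ssy, n, q):
    """Adjusted R^2 for a model with q explanatory variables and an intercept."""
    return 1.0 - (n - 1) / (n - q - 1) * (rss / ssy)


# Hypothetical values: the fifth variable barely reduces the residual sum of squares.
n, ssy = 20, 100.0
reduced = {"q": 4, "rss": 21.0}
maximum = {"q": 5, "rss": 20.5}

for model in (reduced, maximum):
    print(model["q"],
          round(r_squared(model["rss"], ssy), 4),
          round(adjusted_r_squared(model["rss"], ssy, n, model["q"]), 4))
```

With these invented values the larger model has the larger unadjusted value, but the smaller model has the larger adjusted value.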
The $F$-test criterion: A different, but equally
intuitive, selection criterion is the $F$-test criterion. The idea
is to test significance of $k - p$ explanatory variables, say,
$x_{p+1}, \ldots, x_k$, in the maximum model (8.1), in order to get the reduced
model (8.2). That is, we need to test the null hypothesis
$$H_0: \beta_{p+1} = \cdots = \beta_k = 0.$$
In Module 7, we used ANOVA tables to test hypotheses of this form. (Remember
that we have to order the variables in such a way that the variables we wish
to omit are the last variables in the model.) Recall that the
$F$-test statistic for testing significance of $x_{p+1}, \ldots, x_k$ is given
by
$$F = \frac{(RSS_p - RSS_k)/(k - p)}{RSS_k/(n - k - 1)},$$
where $RSS_p$ and $RSS_k$, the residual sums of squares of the reduced and
the maximum model, are defined by (8.3). Under $H_0$,
the statistic $F$ has an $F(k - p,\, n - k - 1)$-distribution. If $H_0$
is not rejected, the reduced model (8.2) provides as
good a fit to the data as the maximum model, so we can use the reduced model
instead of the maximum model. The $F$-test criterion for
selecting variables finds the smallest subset of explanatory variables
(with $p$ as small as possible)
such that the test statistic $F$ is not significant.
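As a small worked illustration, the test statistic can be computed directly from the two residual sums of squares. The sketch assumes the standard partial F-test for nested models, with `rss_p` and `rss_k` the residual sums of squares of the reduced model (`p` explanatory variables) and the maximum model (`k` explanatory variables), and `n` observations. The input values are invented.

```python
def f_statistic(rss_p, rss_k, n, k, p):
    """Partial F statistic for dropping k - p variables from the maximum model."""
    numerator = (rss_p - rss_k) / (k - p)   # extra variation explained per dropped variable
    denominator = rss_k / (n - k - 1)       # error variance estimate from the maximum model
    return numerator / denominator


# Hypothetical values: dropping one of five variables with 20 observations.
f_value = f_statistic(rss_p=21.0, rss_k=20.5, n=20, k=5, p=4)
print(round(f_value, 4))
```

The value would then be compared with the upper tail of an F-distribution on (k - p, n - k - 1) degrees of freedom, for instance from tables or with `scipy.stats.f.sf` if SciPy is available.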
The $C_p$ criterion: A very useful, if not immediately
obvious, selection criterion uses Mallows' $C_p$
statistic
where
is the number of explanatory variables in the reduced model, and
is the unbiased estimate of the error
variance. Suppose that the reduced model is as good as (or better than) the
maximum model. In that case, the error variance for the reduced model,
is close to (or smaller than) the error
variance for the maximum model
. The
statistic becomes
Thus, if
is approximately equal to (or smaller than)
, the
reduced model fits the data at least as well as the maximum model (in the
sense that the error variation is smaller for the reduced model). The
criterion says to choose the reduced model with the smallest
value of
. Part of the popularity of the
criterion is due to
the simplification of the decision of the number of explanatory variables to
retain in the final model: if the correct model has
explanatory
variables, then
is approximately equal to (or less than)
.
Example 8.1 (continued) Detoxification of malathion in
chickens
The maximum model for the data on detoxification of malathion in chickens
relates the detoxification to five different induced enzyme activities, that
is,
Comparing the maximum model to the reduced model which leaves out
,
that is
using the
,
and
criteria, yield the following
results
Thus, the
and
criteria agree that the reduced model (not
containing
) should be preferred, while the
criterion prefers
the maximum model-simply because the maximum model contains more variables.
Note that the
-value for the reduced model is approximately 4, which
is the number of explanatory variables in the reduced model. Thus, the
reduced model seems to be a fairly good model for the data according to the
criterion.
There are many selection criteria other than the ones discussed here, for
example the mean square criterion (which compares the estimated error
variances of the models) and the PRESS criterion. Of the three
criteria discussed in this module, the $C_p$ criterion and the $F$-test
criterion are the most reliable.
The actual method for selecting variables depends on the chosen selection
criterion, but it also depends on the selection procedure (or selection
strategy). The traditional procedures, forward selection and
backward elimination, concentrate on deciding
whether each of the explanatory variables should, or should not, be included
in the final model. The procedures are quick to carry out, even in situations with
many possible explanatory variables. However, they do not always lead to the
best model! The two traditional procedures are reviewed in Subsection 8.3.2.
The stepwise regression procedure was developed from the traditional
procedures, in order to improve the chance of achieving the best model.
Stepwise regression is discussed in Subsection 8.3.3. The most recent, and
most sensible, procedure is the all possible models procedure. In this
procedure, all possible models are fitted and compared, and the best one is
chosen. Unless the number $k$
of possible explanatory variables is large,
this procedure should always be preferred. However, for large $k$, the all
possible models procedure demands a huge amount of calculation, and the
result can be difficult to interpret. In this case one might prefer to use
the stepwise procedure instead. The all possible models procedure is
considered in Subsection 8.3.1.
The most careful selection procedure is the all possible models
procedure in which all possible models are fitted to the data, and the
selection criterion is used on all the models in order to find the model
which is preferable to all others. Note that one has to choose the selection
criterion carefully, as different selection criteria can result in different
`best' models!
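The all possible models procedure can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the adjusted coefficient of determination as the selection criterion, fits each model by solving the normal equations in plain Python, and runs on a small invented data set (the variables `x1`, `x2`, `x3` and the response `y` are hypothetical, not the malathion data).

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss_of_fit(columns, y):
    """Residual sum of squares of the least squares fit of y on an intercept and the columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    d = len(design[0])
    xtx = [[sum(row[r] * row[c] for row in design) for c in range(d)] for r in range(d)]
    xty = [sum(row[r] * y[i] for i, row in enumerate(design)) for r in range(d)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def all_possible_models(data, y):
    """Return (adjusted R^2, variable names) for every non-empty subset, best first."""
    n = len(y)
    mean_y = sum(y) / n
    ssy = sum((v - mean_y) ** 2 for v in y)
    results = []
    names = sorted(data)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            rss = rss_of_fit([data[name] for name in subset], y)
            adj = 1.0 - (n - 1) / (n - size - 1) * rss / ssy
            results.append((adj, subset))
    return sorted(results, reverse=True)


# Invented data: y depends on x1 only, x2 and x3 are unrelated to y.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [5.0, 3.0, 8.0, 1.0, 7.0, 2.0, 6.0, 4.0],
    "x3": [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
}
noise = [0.1, -0.2, 0.05, 0.1, -0.1, 0.2, -0.15, 0.0]
y = [2.0 + 3.0 * a + e for a, e in zip(data["x1"], noise)]

ranking = all_possible_models(data, y)
for adj, subset in ranking:
    print(round(adj, 4), subset)
```

Replacing the line that computes `adj` by another criterion changes the ranking rule without changing the enumeration, which is the sense in which criterion and procedure are separate choices.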
Example 8.1 (continued) Detoxification of malathion in
chickens
All possible models have been fitted to the data on detoxification of
malathion in chickens, and for each model,
and
have been calculated. The results are given below.
The
criterion suggests the reduced model containing
and
(
), while the
criterion suggests the model
containing only
and
. As always,
the
criterion suggests the maximum model. Thus, the three criteria
suggest three different models!
Taking a closer look at the results of the different criteria, we observe
that the model preferred by the
criteria (that is, the model
containing
and
) has the second smallest value of
(
). That is, although this model is not the `best',
according to the
criterion, it is still a `good' model. Also, the
is fairly large for this model:
, which is almost the
same as the
for the maximum model (79.3%). Conversely, the model
suggested by the
criterion (the one containing only
and
), has reasonably large values of
(63.8%) and
(71.9%), so, according to these criteria, it is a reasonably good model.
Thus, it seems that both of the models in question are `good' according to
all three criteria, so we can choose either one. If simplicity of the model
is essential (e.g. if interpretation of the model is an important
issue), the simpler model
might be the preferred choice, whereas, if prediction is the main purpose of
the analysis, the larger model
may be better. The least squares models are given by, respectively,
Residual analyses for the two models suggest that there might be an outlier
in the data. This point should be investigated further before using the
models for further analyses. (The residual analyses are omitted here.)
In the above example, we had to fit 31 different models, and calculate
$R^2$, $R^2_{adj}$ and $C_p$ for each one, in order to find the best
model. The number of models to be fitted (and compared) increases rapidly
with the number of explanatory variables in the maximum model. For example,
if the maximum model contains 10 explanatory variables (which is not
unusual), a total of 1023 different models have to be fitted and compared!
In general, if the maximum model has $k$ explanatory variables, one has to
fit (and compare) $2^k - 1$ different models. Thus, in situations with many
explanatory variables in the maximum model, the all possible models
procedure becomes impractical. The best alternative is to use stepwise
regression, which will be considered in Subsection 8.3.3. Stepwise
regression is based on the two traditional selection procedures, forward selection and backward elimination. These are discussed in
Subsection 8.3.2.
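The counts 31 and 1023 quoted above are the numbers of non-empty subsets of 5 and 10 variables, respectively. A short sketch, with illustrative variable names `x1` to `x5`, confirms this by enumeration:

```python
from itertools import combinations


def count_models(k):
    """Number of non-empty subsets of k explanatory variables."""
    return 2 ** k - 1


def enumerate_models(names):
    """List every non-empty subset of the given variable names."""
    return [subset
            for size in range(1, len(names) + 1)
            for subset in combinations(names, size)]


models = enumerate_models(["x1", "x2", "x3", "x4", "x5"])
print(len(models), count_models(5), count_models(10))
```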
The backward elimination procedure is basically a sequence
of tests for significance of explanatory variables. Starting out with the
maximum model (8.1),
we remove (or eliminate) the variable with the highest $p$-value for
the test of significance of the variable, provided that the $p$-value is
bigger than some pre-determined level (say, 0.10). Next, we fit the reduced
model (having removed the variable from the maximum model), and remove from
the reduced model the variable with the highest $p$-value for the
test of significance of that variable (if $p > 0.10$). And so on. The
procedure ends when no more variables can be removed from the model at
significance level 10%. Note that we use the $F$-test criterion in this
procedure. (Here, we only describe the traditional version of the backward
elimination procedure, in order to present the general idea of the
procedure. There are various modified versions which may perform better.)
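The loop described above can be sketched as follows. To keep the sketch independent of any fitting library, it takes a caller-supplied function `p_values(current)` that is assumed to refit the model on the variables in `current` and return their p-values. Both this interface and the toy p-values are hypothetical.

```python
def backward_elimination(variables, p_values, threshold=0.10):
    """Traditional backward elimination.

    p_values(current) must return a dict mapping each variable in
    `current` to the p-value of its significance test in that model.
    """
    current = list(variables)
    while current:
        pvals = p_values(current)                      # refit on the current variables
        worst = max(current, key=lambda v: pvals[v])   # least significant variable
        if pvals[worst] <= threshold:
            break                                      # everything left is significant
        current.remove(worst)
    return current


# Toy stand-in for a real fit: fixed, invented p-values.
# In a real analysis the p-values change every time the model is refitted.
toy = {"x1": 0.01, "x2": 0.45, "x3": 0.03, "x4": 0.20}
selected = backward_elimination(toy, lambda current: {v: toy[v] for v in current})
print(selected)
```

Only the variables themselves are candidates for removal here, so the intercept never enters the comparison.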
Example 8.1 (continued) Detoxification of malathion in
chickens
Fitting maximum model to the data on detoxification of malathion in
chickens, we obtain the following table for the least squares estimators
The variable with the highest
-value for the
-test of significance is
(recall that the
-test for significance is equivalent to the
-test for significance of a single variable.) So we eliminate
from
the model, and fit a new multiple linear regression model to the data. The
table for the least squares estimators becomes
Now, the variable with the highest
-value is
, so we eliminate
from the model, and fit a new model, resulting in the following table
for the least squares estimators.
The highest
-value in the table corresponds to the intercept, but we are
interested in significance of the variables, so we look for the second
highest
-value instead. This corresponds to
, which we eliminate
from the model and fit a new model to the data. We get
Both
-values, corresponding to
and
, are less than 0.10,
so we cannot reduce the model any further. Thus, the least squares line for
the final model is given by
Once again, we have selected a different model! However, this method is very
crude and should not be compared to the all possible models
procedure. The method can be refined into stepwise regression (in
Subsection 8.3.3), which is a better alternative to the all possible models
procedure. Note that we only fitted 4 different models using the backward
elimination procedure, compared to 31 models in the all possible models
procedure. In general, the great advantage of this procedure is that we only
need to fit a maximum of $k$ different models if the maximum model has $k$
explanatory variables.
The forward selection procedure is a reversed version of the
backward elimination procedure. Instead of starting with the maximum model,
and eliminating variables one by one, we start with an `empty' model with no
explanatory variables, and add variables one by one until we cannot improve
the model significantly by adding another variable. (Only the traditional
version of the forward selection procedure is described here. There are
various modified versions which may perform better.)
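A minimal sketch of the traditional forward selection loop is given below. It assumes a caller-supplied function `p_value_if_added(selected, candidate)` returning the p-value of `candidate` when it is added to the model containing `selected`. This interface and the toy p-values are hypothetical.

```python
def forward_selection(variables, p_value_if_added, threshold=0.10):
    """Traditional forward selection: add the most significant variable until none qualifies."""
    selected = []
    remaining = list(variables)
    while remaining:
        pvals = {v: p_value_if_added(selected, v) for v in remaining}
        best = min(remaining, key=lambda v: pvals[v])   # most significant candidate
        if pvals[best] >= threshold:
            break                                       # no candidate improves the model
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy stand-in for real fits (invented numbers): x2 looks useful on its own,
# but adds nothing once x1 is in the model.
base = {"x1": 0.004, "x2": 0.03, "x3": 0.50}


def toy_p_value(selected, candidate):
    if candidate == "x2" and "x1" in selected:
        return 0.30
    return base[candidate]


chosen = forward_selection(["x1", "x2", "x3"], toy_p_value)
print(chosen)
```

Note that a variable, once added, is never reconsidered in this version.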
Example 8.1 (continued) Detoxification of malathion in
chickens
We start by fitting the five simple linear regression models, relating the
detoxification to each of the induced enzyme activities, one at a time.
That is, we fit the five models
$$Y_i = \beta_0 + \beta_j x_{ij} + \varepsilon_i, \qquad j = 1, \ldots, 5. \tag{8.4}$$
The table below contains, for each model, the $t$-statistic and the
corresponding $p$-value for testing significance of $x_j$ in (8.4). (Recall that the
$t$-test is equivalent to the $F$-test for significance
of a single variable.)
 |
(8.5) |
For example, if we fit the simple model
to the data, the table for the least squares
estimators is given by
where the emphasised numbers 2.55 and 0.034 are the same as in
(8.5). We can see from (8.5) that each of the variables
,
and
seem to provide useful information about the
response variable, at a 10% significance level. Thus, we would be inclined
to include these variables in the model. We start by including the `most'
significant variable, that is, the variable with the lowest
-value:
. In the next step, we fit the four models, relating the response to
and each of the four remaining variables, one by one. That is, we fit the
models
to find out whether any of the four remaining variables provide additional
information about the response, given that
is already
in the model. The table of
-tests corresponding to these models is given
by
 |
(8.6) |
The variable
has the lowest
-value (which is lower than 0.10), so
we add
to the model. We now have
and
in the model.
Next, we fit the three models containing
and
, and one
of the three remaining variables
or
, respectively.
The table of
-tests becomes
All
-values in the table are above 0.10, so the procedure terminates at
this step. That is, the selected model is the one containing the explanatory
variables
and
. This is the same model that was suggested by
the all possible models procedure using
criterion-but,
unfortunately, this is not always the case.
The two procedures discussed in this subsection have the
computational advantage that one only has to fit a small subset of the
possible models. The maximum number of models to be fitted is $k$ in the
backward elimination procedure, and $k(k+1)/2$ in the forward
selection procedure, in situations where the maximum model has $k$
explanatory variables. In both cases, this is a substantial reduction of the
models to be fitted in the all possible models procedure. However,
the main drawback of these two procedures is exactly the same as the
advantage: that we only consider a small subset of the possible models. The
risk of missing the best model increases rapidly as the number of
explanatory variables increases.
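The model counts can be checked by a short calculation, assuming $k$ denotes the number of explanatory variables in the maximum model. Backward elimination fits at most one model of each size from $k$ down to 1. Forward selection fits $k$ one-variable models in the first step, $k-1$ two-variable models in the second step, and so on. The all possible models procedure fits every non-empty subset.

```latex
\[
\underbrace{k}_{\text{backward elimination}}
\;\le\;
\underbrace{k + (k-1) + \cdots + 1 = \frac{k(k+1)}{2}}_{\text{forward selection}}
\;\le\;
\underbrace{2^k - 1}_{\text{all possible models}} .
\]
```

For $k = 5$ these upper bounds are 5, 15 and 31, and for $k = 10$ they are 10, 55 and 1023.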
The performance of the forward selection and backward elimination procedures, described in
Subsection 8.3.2, can be greatly improved by a modification known as the
stepwise regression procedure. Here, we shall consider stepwise
regression based on forward selection. (Stepwise regression based on
backward elimination can be defined in a similar way.) Recall that once a
variable is added to the model in forward selection, it stays in the
model, irrespective of which other variables are added later on. However, it
can easily happen that a variable entered early in the procedure becomes
superfluous because of its interrelationship with other variables added to
the model later on in the procedure.
The stepwise regression procedure modifies the forward selection
procedure in the following way. Each time a new variable is added to the
model, the significance of each of the variables already in the model is
re-examined. That is, at each step in the forward selection procedure, we
test for significance of each of the variables currently in the model, and
remove the one with the highest $p$-value (if the $p$-value is above some
threshold value, say 0.10). The model is then re-fitted without this
variable, before going to the next step in the forward selection procedure.
The stepwise regression procedure continues until no more variables can be
added or removed.
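The modification can be sketched by adding a re-examination step to a forward selection loop. The sketch assumes two caller-supplied functions: `p_value_if_added(selected, candidate)` for the entry test and `p_values(selected)` for the tests of the variables currently in the model. These interfaces, the `max_steps` safeguard and the toy p-values are all hypothetical.

```python
def stepwise_selection(variables, p_value_if_added, p_values,
                       enter=0.10, remove=0.10, max_steps=100):
    """Stepwise regression based on forward selection."""
    selected = []
    for _ in range(max_steps):              # guard against cycling add/remove steps
        remaining = [v for v in variables if v not in selected]
        if not remaining:
            break
        entry = {v: p_value_if_added(selected, v) for v in remaining}
        best = min(remaining, key=lambda v: entry[v])
        if entry[best] >= enter:
            break                           # nothing more can be added
        selected.append(best)
        # Re-examine every variable currently in the model.
        current = p_values(selected)
        worst = max(selected, key=lambda v: current[v])
        if current[worst] > remove:
            selected.remove(worst)          # an earlier variable became superfluous
    return selected


# Toy stand-in for real fits (invented numbers): x1 enters first,
# but becomes superfluous once x2 is in the model.
added = {
    (frozenset(), "x1"): 0.02, (frozenset(), "x2"): 0.05, (frozenset(), "x3"): 0.60,
    (frozenset({"x1"}), "x2"): 0.01, (frozenset({"x1"}), "x3"): 0.50,
    (frozenset({"x2"}), "x1"): 0.40, (frozenset({"x2"}), "x3"): 0.55,
}
in_model = {
    frozenset({"x1"}): {"x1": 0.02},
    frozenset({"x1", "x2"}): {"x1": 0.40, "x2": 0.01},
    frozenset({"x2"}): {"x2": 0.05},
}

final = stepwise_selection(
    ["x1", "x2", "x3"],
    lambda sel, v: added[(frozenset(sel), v)],
    lambda sel: in_model[frozenset(sel)],
)
print(final)
```

With these invented values, plain forward selection would have kept both `x1` and `x2`, whereas the re-examination step drops `x1`.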
Example 8.1 (continued) Detoxification of malathion in
chickens
As in the forward selection procedure, we start by fitting the five models
and find from (8.5) that
is the variable which is `most'
significant (it has the lowest
-value). As in forward selection, we add
to the model, and fit the four models, relating the response to
and each of the four remaining variables, one by one. As before, we
find that
is `most' significant of the four (see (8.6)), so
we add
to the model. But, before continuing to add more variables,
we fit the current model and test significance of
. The table for the
least squares estimators is given by
Since
is significant, we leave it in the model and continue to the
next step of forward selection, that is, we fit the three models containing
and
, and each of the three remaining variables
or
, one by one. The corresponding table of
-tests is the same as
in the forward selection procedure, that is,
Since all
-values are greater than 0.10, the procedure ends. Thus, the
stepwise regression procedure selects the model containing
and
, that is,
In this example, stepwise regression resulted in the same model as forward
selection, and as the all possible models procedure, using the
criterion. It seems that this model may be a good choice. In general, when
there are more explanatory variables in the maximum model, the three
procedures often result in different models.
Note that the stepwise regression procedure is better to use than the
forward selection and the backward elimination procedures, because it
considers more (relevant) models. At the same time, the procedure is much
quicker than the all possible models procedure, when the number of
explanatory variables in the maximum model is large. However, if the number
of explanatory variables is small, or if fitting vast numbers of models is
not a problem, it is recommended to use the all possible models procedure,
rather than stepwise regression.
Various methods for selecting the `best' regression model have been
discussed in this module. In order to simplify the maximum model, containing
all possible explanatory variables, to a reduced model that only contains
the explanatory variables which provide important information about the
response variable, one has to decide on a selection criterion (how to
compare the possible models), and a selection procedure (a strategy for
comparing relevant models). We have considered the $R^2$ and $R^2_{adj}$
criteria, which compare models by comparing the values of $R^2$ and
$R^2_{adj}$ (the adjusted $R^2$), respectively; the $F$-test criterion,
which is based on testing significance of the explanatory variables; and the
$C_p$ criterion, which is based on a comparison of estimates of the error
variances. As for selection procedures, the traditional forward selection
and backward elimination procedures have been reviewed. The stepwise
regression procedure is useful when the number of explanatory variables in
the maximum model is large; otherwise, the all possible models procedure is
the most sensible selection procedure to use.
Keywords: maximum model, collinearity, parsimony, selection
criterion, reduced model, $R^2$ criterion, adjusted $R^2$ statistic,
$R^2_{adj}$ criterion, $F$-test criterion, $C_p$ criterion, Mallows' $C_p$
statistic, all possible models procedure, backward elimination procedure,
forward selection procedure, stepwise regression.