Module 8: Selecting regression models
By Pia Veldt Larsen







8.1 Introduction


In complex regression situations, when there is a large number of explanatory variables which may or may not be relevant for making predictions about the response variable, it is useful to be able to reduce the model to contain only the variables which provide important information about the response variable. But deciding which explanatory variables to include in the simpler model is not always trivial.


Example 8.1        Detoxification of malathion in chickens

It is known that one can alter the toxicity of various types of chemicals (e.g. drugs, pesticides or insecticides) in mammals by inducing liver enzyme activity. This example relates to a study investigating the relationship between detoxification of malathion (an insecticide containing phosphorus) and induced enzyme activity in chickens. Five different enzyme activities were considered. They were all induced using the enzyme inducer 3-methylcholanthrene (3-MC).


Here we have a response variable, detoxification of malathion (in per cent, relative to a control, untreated, chicken), related to five explanatory variables, corresponding to the five induced enzymes. Each explanatory variable is the percentage of enzyme activity relative to the enzyme activity in a control, untreated, chicken. There is no prior knowledge about which of the enzyme activities provide the most useful information about the response variable. Thus, we need a general methodology to select the `best' model for the response variable.


Further details on this dataset can be found here.


$ \diamondsuit$


There are many different methods for selecting the best regression model, but for each method, two key issues must always be taken into consideration: `what do we mean by the ``best'' model?', and `how can we locate the ``best'' model?' In technical terms, the first issue refers to choosing a selection criterion (Section 8.2), the second issue to choosing a selection procedure (Section 8.3).




8.2 Selection criteria


Before going into details with the actual procedures for selecting models in Section 8.3, there are a few general considerations which must be taken into account. First of all, we need to define the maximum model, that is, the model containing all explanatory variables which could possibly be present in the final model. Note that this includes interaction terms that might affect the response variable. Thus, any possible model for the data is a restriction of the maximum model, in the sense that it can be achieved by omitting a number of the explanatory variables from the maximum model. Let $ k$ denote the maximum number of feasible explanatory variables (including appropriate interaction terms). Then, the maximum model is given by

$\displaystyle Y_{i}=\beta _{0}+\beta _{1}x_{i,1}+\cdots +\beta _{k}x_{i,k}+\varepsilon
_{i},
$

where $ x_{i,1},\ldots ,x_{i,k}$ are the explanatory variables, and $ \varepsilon _{i}$ are independent, normally distributed random error terms with zero mean and common variance.


Example 8.1 (continued) Detoxification of malathion in chickens

Suppose that we have no reason to believe that interactions between the different enzyme activities affect the detoxification of malathion. Then the maximum model for these data is given by

$\displaystyle Y=\beta _{0}+\beta _{1}x_{1}+\cdots +\beta _{5}x_{5}+\varepsilon ,
$

where $ Y$ is the percentage of detoxification, $ x_{i}$ is the percentage of enzyme activity relative to the activity in a control chicken, and $ \varepsilon $ accounts for the random variation in the data.

$ \diamondsuit$


When defining the maximum model, it is important to include all explanatory variables which might have an effect on the response variable; however, one has to be careful not to include too many unimportant explanatory variables. If the model contains many explanatory variables compared to the number of observations, the variance of the estimators of the regression parameters can be very large, leading to inaccurate parameter estimates. Further, the more explanatory variables in a model, the greater the risk of confounding or collinearity (that is, two or more explanatory variables being linearly, or almost linearly, dependent). Confounding and collinearity can lead to omitting the `wrong' explanatory variables. Finally, there is the issue of parsimony: if two models are equally good, one should prefer the simpler. There are several reasons for this: firstly, complex models often confuse interpretations of the results from the analysis; secondly, the larger the model, the less precise the parameter estimates will be; and thirdly, if the model is very large, analysing the data can take a very long time without necessarily leading to more useful results.


In general, the number of explanatory variables in the maximum model should take into account the sample size of the data set that is to be analysed: the smaller the sample size, the smaller the maximum model should be. There are various rules of thumb for how big the sample size should be to support a maximum model of a given size. The most common ones are that the error degrees of freedom should be at least ten, that is $ n-k-1\geq 10$ , or that there should be at least 5 observations for each explanatory variable, that is $ n\geq 5k$ . Note that for $ k\geq 3$ the second rule is the stronger of the two, e.g. if $ k=5$ , the first rule requires $ n\geq 16$ , while the second rule requires $ n\geq 25$ . There exist methods to reduce the required sample size (e.g. using split-samples); however, they are beyond the scope of this course.
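The two rules of thumb are easy to check mechanically. The following is a minimal sketch in Python; the function names `sample_size_rules` and `minimum_n` are illustrative and not part of the course material.

```python
def sample_size_rules(n: int, k: int) -> dict:
    """Check both rules of thumb for n observations and k explanatory variables."""
    return {
        "error_df_at_least_10": n - k - 1 >= 10,  # first rule: n - k - 1 >= 10
        "five_obs_per_variable": n >= 5 * k,      # second rule: n >= 5k
    }


def minimum_n(k: int) -> dict:
    """Smallest sample size that satisfies each rule for a maximum model of size k."""
    return {"error_df_rule": k + 11, "five_per_variable_rule": 5 * k}


print(minimum_n(5))              # {'error_df_rule': 16, 'five_per_variable_rule': 25}
print(sample_size_rules(16, 5))  # first rule holds, second does not
```

For $ k=5$ this reproduces the thresholds quoted above, $ n\geq 16$ and $ n\geq 25$ .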


When the maximum model has been defined, the next point to consider is how to determine whether one model is `better' than the rest: which criterion should we use to compare the possible models? A selection criterion is a rule which orders all possible models from `best' to `worst'. Many different criteria have been suggested over time; some are better than others, but no single criterion is preferred overall. In this section, we shall discuss some of the most common selection criteria.


Essentially, the purpose of a selection criterion is to compare the maximum model

$\displaystyle Y=\beta _{0}+\beta _{1}x_{1}+\cdots +\beta _{m}x_{m}+\beta _{m+1}x_{m+1}+\cdots +\beta _{k}x_{k}+\varepsilon$ (8.1)

with a reduced model

$\displaystyle Y=\beta _{0}+\beta _{1}x_{1}+\cdots +\beta _{m}x_{m}+\varepsilon ,$ (8.2)

which is a restriction of the maximum model. If the reduced model provides (almost) as good a fit to the data as the maximum model, then we prefer the reduced model.


The $ R_{a}^{2}$ criterion: We have already come across a crude measure of how well a model accounts for the variation in the data: the coefficient of determination $ R^{2}$ . Recall that $ R^{2}$ is the proportion of the total amount of variation in the data which can be explained by the fitted model, that is,

$\displaystyle R^{2}=\frac{S_{yy}-RSS}{S_{yy}},
$

where $ S_{yy}=\sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2}$ is the corrected sum of squares of the responses, and $ RSS=\sum_{i=1}^{n}\left( Y_{i}-\hat{Y}_{i}\right) ^{2}$ is the residual sum of squares. In particular, the $ R^{2}$ s corresponding to the maximum model and the reduced model, are given by $ R_{k}^{2}$ and $ R_{m}^{2}$ , respectively, where

$\displaystyle R_{j}^{2}=\frac{S_{yy}-RSS_{j}}{S_{yy}},
$

with

$\displaystyle RSS_{j}=\sum_{i=1}^{n}\left( Y_{i}-\hat{\beta}_{j,0}-\hat{\beta}_{j,1}x_{i,1}-\hat{\beta}_{j,2}x_{i,2}-\cdots -\hat{\beta}_{j,j}x_{i,j}\right) ^{2},$ (8.3)

where $ \hat{\beta}_{j,i}$ denotes the least squares estimator for the regression parameter $ \beta _{i}$ in the model with $ j$ explanatory variables. Recall that the closer the model fits the data, the larger $ R^{2} $ will be. Thus, an intuitive, but crude, method to compare the two models would be to compare the $ R^{2}$ s corresponding to the models: the model with the highest $ R^{2}$ provides the closest fit. However, the method has a number of drawbacks. The most important is that, due to the way $ R^{2}$ is defined, the largest model (the one with most explanatory variables) will always have the largest $ R^{2}$ , whether or not the extra variables provide any important information about the response variable! A common way to avoid this problem is to use an adjusted version of $ R^{2}$ instead of $ R^{2} $ itself. The adjusted $ R^{2}$ statistic, for a model with $ k$ explanatory variables, is given by

$\displaystyle R_{a}^{2}=1-\frac{n-1}{n-k-1}\left( 1-R^{2}\right) .
$

Note that $ R_{a}^{2}$ does not necessarily increase when the number of explanatory variables increases. According to the $ R_{a}^{2}$ (and $ R^{2}$ ) criteria, one should choose the model which has the largest $ R_{a}^{2}$ (or $ R^{2}$ , respectively).
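To see how the adjustment can reverse the ranking given by $ R^{2}$ , the formula can be evaluated directly. The sketch below is illustrative only: the function name is hypothetical, the two $ R^{2}$ values are those quoted for Example 8.1 later in this module, and the sample size $ n=10$ is an assumption that is not stated in the text.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k explanatory variables and n observations."""
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r2)


n = 10  # assumed sample size (not given in the text)

larger_model = adjusted_r_squared(0.793, n, k=5)   # five explanatory variables
smaller_model = adjusted_r_squared(0.780, n, k=4)  # four explanatory variables

# The larger model has the higher R^2, but the smaller one has the higher adjusted R^2.
print(round(larger_model, 3), round(smaller_model, 3))
```

Under these assumptions the extra variable raises $ R^{2}$ by about one percentage point, but the penalty factor $ (n-1)/(n-k-1)$ grows from $ 9/5$ to $ 9/4$ , so the adjusted value falls.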


The $ F$ -test criterion: A different, but equally intuitive, selection criterion is the $ F$ -test criterion. The idea is to test significance of $ k-m$ explanatory variables, say, $ x_{m+1},\ldots ,x_{k}$ , in the maximum model (8.1), in order to get the reduced model (8.2). That is, we need to test the null hypothesis

$\displaystyle H_{0}:\beta _{m+1}=\cdots =\beta _{k}=0.
$

In Module 7, we used ANOVA tables to test hypotheses of this form. (Remember that we have to order the variables in such a way that the variables we wish to omit are the last variables in the model.) Recall that the $ F$ -test statistic for testing significance of $ x_{m+1},\ldots ,x_{k}$ is given by

$\displaystyle F_{m}=\frac{\left( RSS_{m}-RSS_{k}\right) /\left( k-m\right) }{
RSS_{k}/\left( n-k-1\right) },
$

where $ RSS_{m}$ and $ RSS_{k}$ are defined by (8.3). Under $ H_{0}$ , the statistic $ F_{m}$ has an $ F\left( k-m,n-k-1\right) $ -distribution. If $ H_{0}$ is not rejected, the reduced model (8.2) provides as good a fit to the data as the maximum model, so we can use the reduced model instead of the maximum model. The $ F$ -test criterion for selecting variables finds the smallest subset of explanatory variables $ x_{1},x_{2},\ldots ,x_{m}$ (with $ m$ as small as possible) such that the test statistic $ F_{m}$ is not significant.
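The statistic $ F_{m}$ depends only on the two residual sums of squares and the three counts $ n$ , $ k$ and $ m$ . The following minimal sketch computes it; the function name and the numerical inputs are hypothetical and serve only to illustrate the formula.

```python
def f_statistic(rss_m: float, rss_k: float, n: int, k: int, m: int) -> float:
    """F statistic for testing beta_{m+1} = ... = beta_k = 0 in the maximum model."""
    if not 0 <= m < k < n - 1:
        raise ValueError("need 0 <= m < k < n - 1")
    numerator = (rss_m - rss_k) / (k - m)   # extra residual variation per omitted variable
    denominator = rss_k / (n - k - 1)       # error variance estimate from the maximum model
    return numerator / denominator


# Hypothetical residual sums of squares for a reduced (m = 4) and a maximum (k = 5) model.
f_m = f_statistic(rss_m=22.0, rss_k=20.7, n=10, k=5, m=4)
print(round(f_m, 3))
```

The value would then be compared with the $ F\left( k-m,n-k-1\right) $ -distribution, for instance using statistical tables or a routine such as `scipy.stats.f.sf` if SciPy is available.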


The $ C_{m}$ criterion: A very useful, if not immediately obvious, selection criterion uses Mallows' $ C_{m}$ statistic

$\displaystyle C_{m}=\frac{RSS_{m}}{S^{2}}+2\left( m+1\right) -n,
$

where $ m$ is the number of explanatory variables in the reduced model, and $ S^{2}=RSS_{k}/\left( n-k-1\right) $ is the unbiased estimate of the error variance from the maximum model. Suppose that the reduced model is as good as (or better than) the maximum model. In that case, the estimated error variance for the reduced model, $ RSS_{m}/\left( n-m-1\right) $ , is close to (or smaller than) the estimated error variance for the maximum model, $ RSS_{k}/\left( n-k-1\right) $ . The $ C_{m}$ statistic becomes
$\displaystyle \begin{aligned}
C_{m} &= \left( n-m-1\right) \frac{RSS_{m}/\left( n-m-1\right) }{RSS_{k}/\left( n-k-1\right) }+2\left( m+1\right) -n \\
&\leq n-m-1+2\left( m+1\right) -n \\
&= m+1.
\end{aligned}$

Thus, if $ C_{m}$ is approximately equal to (or smaller than) $ m+1$ , the reduced model fits the data at least as well as the maximum model (in the sense that the estimated error variance is no larger for the reduced model). The $ C_{m}$ criterion says to choose the reduced model with the smallest value of $ C_{m}$ . Part of the popularity of the $ C_{m}$ criterion is due to the simplification of the decision of the number of explanatory variables to retain in the final model: if the correct model has $ m$ explanatory variables, then $ C_{m}$ is approximately equal to (or less than) $ m+1$ .
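As a small illustration of the definition, the sketch below evaluates $ C_{m}$ from the two residual sums of squares. The function name and the inputs are hypothetical. Note that for the maximum model itself ( $ m=k$ ) the statistic equals $ k+1$ exactly, whatever the data.

```python
def mallows_cm(rss_m: float, rss_k: float, n: int, k: int, m: int) -> float:
    """Mallows' C_m for a reduced model with m of the k explanatory variables."""
    s2 = rss_k / (n - k - 1)            # error variance estimate from the maximum model
    return rss_m / s2 + 2 * (m + 1) - n


# Hypothetical residual sums of squares with n = 10 observations and k = 5 variables.
c_max = mallows_cm(rss_m=20.7, rss_k=20.7, n=10, k=5, m=5)      # maximum model
c_reduced = mallows_cm(rss_m=22.0, rss_k=20.7, n=10, k=5, m=4)  # one variable omitted

print(round(c_max, 2), round(c_reduced, 2))
```

In this hypothetical case the reduced model has $ C_{m}$ below $ m+1=5$ , so it would be judged at least as good as the maximum model.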


Example 8.1 (continued) Detoxification of malathion in chickens

The maximum model for the data on detoxification of malathion in chickens relates the detoxification to five different induced enzyme activities, that is,

$\displaystyle Y=\beta _{0}+\beta _{1}x_{1}+\cdots +\beta _{5}x_{5}+\varepsilon .
$

Comparing the maximum model to the reduced model which leaves out $ x_{3}$ , that is

$\displaystyle Y=\beta _{0}+\beta _{1}x_{1}+\beta _{2}x_{2}+\beta _{4}x_{4}+\beta
_{5}x_{5}+\varepsilon ,
$

using the $ R^{2}$ , $ R_{a}^{2}$ and $ C_{m}$ criteria, yields the following results

$\displaystyle \begin{tabular}{lccc}
Model & $R^{2}$\ & $R_{a}^{2}$\ & $C_{m}$\ \\
maximum model & 79.3\% & 53.3\% & 6.0 \\
reduced model & 78.0\% & 60.4\% & 4.25\end{tabular}\ .
$

Thus, the $ R_{a}^{2}$ and $ C_{m}$ criteria agree that the reduced model (not containing $ x_{3}$ ) should be preferred, while the $ R^{2}$ criterion prefers the maximum model, simply because the maximum model contains more variables. Note that the $ C_{m}$ -value for the reduced model, 4.25, is smaller than $ m+1=5$ , where $ m=4$ is the number of explanatory variables in the reduced model. Thus, the reduced model seems to be a fairly good model for the data according to the $ C_{m}$ criterion.
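As a rough check of these figures, note that $ S_{yy}$ cancels when $ C_{m}$ and $ R_{a}^{2}$ are written in terms of $ R^{2}$ . The calculation below uses $ R_{k}^{2}=79.3\%$ for the maximum model and $ R_{m}^{2}=78.0\%$ for the reduced model, and it assumes $ n=10$ observations. The sample size is not stated in this section; it is an assumption, chosen because it appears to reproduce the reported values.

$\displaystyle \begin{aligned}
C_{m} &= \left( n-k-1\right) \frac{1-R_{m}^{2}}{1-R_{k}^{2}}+2\left( m+1\right) -n = 4\cdot \frac{0.220}{0.207}+2\cdot 5-10\approx 4.25, \\
R_{a}^{2} &= 1-\frac{n-1}{n-m-1}\left( 1-R_{m}^{2}\right) = 1-\frac{9}{5}\cdot 0.220=0.604.
\end{aligned}$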

$ \diamondsuit$


There are many other selection criteria than the ones discussed here, for example the mean squares criterion (which compares the estimated error variances of the models) and the PRESS criterion. Of the criteria discussed in this module, the $ C_{m}$ criterion and the $ F$ -test criterion are the most reliable.


8.3 Selection procedures


The actual method for selecting variables depends on the chosen selection criterion, but it also depends on the selection procedure (or selection strategy). The traditional procedures (the forward selection procedure and the backward elimination procedure) concentrate on deciding whether each of the explanatory variables should, or should not, be included in the final model. The procedures are quick to do, even in situations with many possible explanatory variables. However, they do not always lead to the best model! The two traditional procedures are reviewed in Subsection 8.3.2. The stepwise regression procedure was developed from the traditional procedures, in order to improve the chance of achieving the best model. Stepwise regression is discussed in Subsection 8.3.3. The most recent, and most sensible, procedure is the all possible models procedure. In this procedure, all possible models are fitted and compared, and the best one is chosen. Unless the number $ k$ of possible explanatory variables is large, this procedure should always be preferred. However, for large $ k$ , the all possible models procedure demands a huge amount of calculation, and the result can be difficult to interpret. In this case one might prefer to use the stepwise procedure instead. The all possible models procedure is considered in Subsection 8.3.1.




8.3.1 All possible models procedure


The most careful selection procedure is the all possible models procedure in which all possible models are fitted to the data, and the selection criterion is used on all the models in order to find the model which is preferable to all others. Note that one has to choose the selection criterion carefully, as different selection criteria can result in different `best' models!


Example 8.1 (continued) Detoxification of malathion in chickens

All possible models have been fitted to the data on detoxification of malathion in chickens, and for each model, $ R^{2},$ $ R_{a}^{2}$ and $ C_{m}$ have been calculated. The results are given below.

$\displaystyle \begin{tabular}{lccc}
Explanatory variables & $R^{2}$\ & $R_{a}^{...
... & 60.3 & 40.4 & 5.7 \\
$x_{3},x_{4},x_{5}$\ & 39.7 & 9.5 & 9.6\end{tabular}\ $     $\displaystyle \begin{tabular}{lccc}
Explanatory variables & $R^{2}$\ & $R_{a}^{...
...& 34.7 & 26.6 & 6.6 \\
$x_{5}$\ & 0.3 & 0.0 & 13.2 \\
& & &
\end{tabular}\
$

The $ R_{a}^{2}$ criterion suggests the reduced model containing $ x_{1},x_{2}$ and $ x_{3}$ ( $ R_{a}^{2}=67.5\%$ ), while the $ C_{m}$ criterion suggests the model containing only $ x_{1}$ and $ x_{2}$ $ \left( C_{m}=1.4\right) $ . As always, the $ R^{2}$ criterion suggests the maximum model. Thus, the three criteria suggest three different models!
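The way each criterion picks its own `best' model can be sketched in a few lines. The snippet below uses only the values quoted in this example for three of the 31 models (the remaining rows of the table are not reproduced), and the variable names are illustrative.

```python
# (R^2 in %, adjusted R^2 in %, C_m) as quoted in the text for three of the fitted models.
quoted = {
    ("x1", "x2"): (71.9, 63.8, 1.4),
    ("x1", "x2", "x3"): (78.3, 67.5, 2.2),
    ("x1", "x2", "x3", "x4", "x5"): (79.3, 53.3, 6.0),
}

best_by_r2 = max(quoted, key=lambda model: quoted[model][0])      # largest R^2
best_by_adj_r2 = max(quoted, key=lambda model: quoted[model][1])  # largest adjusted R^2
best_by_cm = min(quoted, key=lambda model: quoted[model][2])      # smallest C_m

print(best_by_r2)      # ('x1', 'x2', 'x3', 'x4', 'x5')
print(best_by_adj_r2)  # ('x1', 'x2', 'x3')
print(best_by_cm)      # ('x1', 'x2')
```

Each criterion is simply a different ordering of the same set of fitted models, which is why the three choices need not coincide.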


Taking a closer look at the results of the different criteria, we observe that the model preferred by the $ R_{a}^{2}$ criterion (that is, the model containing $ x_{1},x_{2}$ and $ x_{3}$ ) has the second smallest value of $ C_{m} $ ($ C_{m}=2.2$ ). That is, although this model is not the `best' according to the $ C_{m}$ criterion, it is still a `good' model. Also, the $ R^{2}$ is fairly large for this model: $ R^{2}=78.3\%$ , which is almost the same as the $ R^{2}$ for the maximum model (79.3%). Conversely, the model suggested by the $ C_{m}$ criterion (the one containing only $ x_{1}$ and $ x_{2}$ ) has reasonably large values of $ R_{a}^{2}$ (63.8%) and $ R^{2}$ (71.9%), so, according to these criteria, it is a reasonably good model. Thus, it seems that both of the models in question are `good' according to all three criteria, so we can choose either one. If simplicity of the model is essential (e.g. if interpretation of the model is an important issue), the simpler model

$\displaystyle Y=\beta _{0}+\beta _{1}x_{1}+\beta _{2}x_{2}+\varepsilon ,
$

might be the preferred choice, whereas, if prediction is the main purpose of the analysis, the larger model

$\displaystyle Y=\beta _{0}+\beta _{1}x_{1}+\beta _{2}x_{2}+\beta _{3}x_{3}+\varepsilon ,
$

may be better. The least squares models are given by, respectively,
$\displaystyle \begin{aligned}
\hat{y} &= 30.6+0.159x_{1}+0.107x_{2}, \\
\hat{y} &= -16.5+0.135x_{1}+0.0899x_{2}+0.441x_{3}.
\end{aligned}$

Residual analyses for the two models suggest that there might be an outlier in the data. This point should be investigated further before using the models for further analyses. (The residual analyses are omitted here.)

$ \diamondsuit$


In the above example, we had to fit 31 different models, and calculate $ R^{2} $ , $ R_{a}^{2}$ and $ C_{m}$ for each one, in order to find the best model. The number of models to be fitted (and compared) increases rapidly with the number of explanatory variables in the maximum model; for example, if the maximum model contains 10 explanatory variables (which is not unusual), a total of 1023 different models have to be fitted and compared! In general, if the maximum model has $ k$ explanatory variables, one has to fit (and compare) $ 2^{k}-1$ different models. Thus, in situations with many explanatory variables in the maximum model, the all possible models procedure becomes impractical. The best alternative is to use stepwise regression, which will be considered in Subsection 8.3.3. Stepwise regression is based on the two traditional selection procedures, forward selection and backward elimination. These are discussed in Subsection 8.3.2.
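The enumeration behind the all possible models procedure can be sketched with the Python standard library. This is a minimal illustration, not a full implementation: the helper names are hypothetical, and `score` stands for any user-supplied function that fits a model on a subset and returns the value of the chosen selection criterion.

```python
from itertools import combinations


def all_subsets(variables):
    """Yield every non-empty subset of the candidate variables as a tuple."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)


def best_subset(variables, score, larger_is_better=True):
    """Score every subset with the supplied criterion and return the best one."""
    pick = max if larger_is_better else min
    return pick(all_subsets(variables), key=score)


five = ["x1", "x2", "x3", "x4", "x5"]
n_models_5 = sum(1 for _ in all_subsets(five))
n_models_10 = sum(1 for _ in all_subsets(range(10)))
print(n_models_5, n_models_10)  # 31 1023
```

With a criterion such as $ R_{a}^{2}$ one would call `best_subset` with `larger_is_better=True`, and with $ C_{m}$ with `larger_is_better=False`. The cost is one model fit per subset, which is what makes the procedure impractical for large $ k$ .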




8.3.2 Forward and backward procedures


The backward elimination procedure is basically a sequence of tests for significance of explanatory variables. Starting out with the maximum model

$\displaystyle Y=\beta _{0}+\beta _{1}x_{1}+\cdots +\beta _{k}x_{k}+\varepsilon ,
$

we remove (or eliminate) the variable with the highest $ p$ -value for the test of significance of the variable, provided that this $ p$ -value is bigger than some pre-determined level (say, 0.10). Next, we fit the reduced model (having removed the variable from the maximum model), and remove from the reduced model the variable with the highest $ p$ -value for the test of significance of that variable (if $ p>0.10$ ). And so on. The procedure ends when no more variables can be removed from the model at significance level 10%. Note that we use the $ F$ -test criterion in this procedure. (Here, we only describe the traditional version of the backward elimination procedure, in order to present the general idea of the procedure. There are various modified versions which may perform better.)
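The loop described above can be sketched as follows. This is only an outline of the traditional procedure under stated assumptions: `p_values` is a hypothetical callback that refits the model on the current variables and returns one $ p$ -value per variable, and the toy callback below uses fixed, made-up $ p$ -values (in a real analysis they change after every refit).

```python
from typing import Callable, Dict, List, Sequence


def backward_elimination(
    variables: Sequence[str],
    p_values: Callable[[Sequence[str]], Dict[str, float]],
    threshold: float = 0.10,
) -> List[str]:
    """Remove the least significant variable until every p-value is <= threshold."""
    current = list(variables)
    while current:
        pvals = p_values(current)                      # refit the model on `current`
        worst = max(current, key=lambda v: pvals[v])   # highest p-value
        if pvals[worst] <= threshold:                  # nothing left to remove
            break
        current.remove(worst)
    return current


# Hypothetical p-values, kept fixed across refits purely for illustration.
toy_table = {"x1": 0.02, "x2": 0.45, "x3": 0.30, "x4": 0.08}


def toy_p_values(subset):
    return {v: toy_table[v] for v in subset}


kept = backward_elimination(["x1", "x2", "x3", "x4"], toy_p_values)
print(kept)  # ['x1', 'x4']
```

Each pass through the loop corresponds to one fitted model, so at most $ k$ models are fitted.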


Example 8.1 (continued) Detoxification of malathion in chickens

Fitting the maximum model to the data on detoxification of malathion in chickens, we obtain the following table for the least squares estimators

$\displaystyle \begin{tabular}{\vert c\vert r\vert r\vert r\vert r\vert}
\hline
...
...782 \\
$\beta _{5}$\ & 0.725 & 1.740 & 0.42 & 0.699 \\ \hline
\end{tabular}\
$

The variable with the highest $ p$ -value for the $ t$ -test of significance is $ x_{2}$ (recall that the $ t$ -test for significance is equivalent to the $ F$ -test for significance of a single variable). So we eliminate $ x_{2}$ from the model, and fit a new multiple linear regression model to the data. The table for the least squares estimators becomes

$\displaystyle \begin{tabular}{\vert c\vert r\vert r\vert r\vert r\vert}
\hline
...
...5 \\
$\beta _{5}$\ & 1.0931 & 0.5605 & 1.95 & 0.109 \\ \hline
\end{tabular}\
$

Now, the variable with the highest $ p$ -value is $ x_{3}$ , so we eliminate $ x_{3}$ from the model, and fit a new model, resulting in the following table for the least squares estimators.

$\displaystyle \begin{tabular}{\vert c\vert r\vert r\vert r\vert r\vert}
\hline
...
...8 \\
$\beta _{5}$\ & 1.1028 & 0.5247 & 2.10 & 0.080 \\ \hline
\end{tabular}\
$

The highest $ p$ -value in the table corresponds to the intercept, but we are interested in significance of the variables, so we look for the second highest $ p$ -value instead. This corresponds to $ x_{4}$ , which we eliminate from the model and fit a new model to the data. We get

$\displaystyle \begin{tabular}{\vert c\vert r\vert r\vert r\vert r\vert}
\hline
...
...9 \\
$\beta _{5}$\ & 1.2111 & 0.6099 & 1.99 & 0.087 \\ \hline
\end{tabular}\
$

Both $ p$ -values, corresponding to $ x_{1}$ and $ x_{5}$ , are less than 0.10, so we cannot reduce the model any further. Thus, the least squares line for the final model is given by

$\displaystyle \hat{y}=-94.18+0.227x_{1}+1.211x_{5}.
$

Once again, we have selected a different model! However, this method is very crude and should not be compared to the all possible models procedure. The method can be refined into stepwise regression (in Subsection 8.3.3), which is a better alternative to the all possible models procedure. Note that we only fitted 4 different models using the backward elimination procedure, compared to 31 models in the all possible models procedure. In general, the great advantage of this procedure is that we only need to fit a maximum of $ k$ different models, if the maximum model has $ k$ explanatory variables.

$ \diamondsuit$


The forward selection procedure is a reversed version of the backward elimination procedure. Instead of starting with the maximum model, and eliminating variables one by one, we start with an `empty' model with no explanatory variables, and add variables one by one until we cannot improve the model significantly by adding another variable. (Only the traditional version of the forward selection procedure is described here. There are various modified versions which may perform better.)
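A minimal sketch of the traditional forward selection loop is given below. The callback `p_value_if_added` is hypothetical: it stands for fitting the model with the already selected variables plus one candidate and returning the $ p$ -value of that candidate. The toy lookup table contains made-up numbers for three variables `a`, `b` and `c`.

```python
from typing import Callable, List, Sequence


def forward_selection(
    variables: Sequence[str],
    p_value_if_added: Callable[[Sequence[str], str], float],
    threshold: float = 0.10,
) -> List[str]:
    """Add the most significant remaining variable while its p-value is below threshold."""
    selected: List[str] = []
    remaining = list(variables)
    while remaining:
        pvals = {v: p_value_if_added(selected, v) for v in remaining}
        best = min(remaining, key=lambda v: pvals[v])   # lowest p-value
        if pvals[best] >= threshold:                    # no significant improvement
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical p-values, indexed by the variables already in the model.
toy_table = {
    (): {"a": 0.03, "b": 0.20, "c": 0.07},
    ("a",): {"b": 0.04, "c": 0.30},
    ("a", "b"): {"c": 0.50},
}


def toy_p_value(selected, candidate):
    return toy_table[tuple(selected)][candidate]


chosen = forward_selection(["a", "b", "c"], toy_p_value)
print(chosen)  # ['a', 'b']
```

In the toy table, `b` is not significant on its own but becomes significant once `a` is in the model, which is why the candidates must be re-tested at every step.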


Example 8.1 (continued) Detoxification of malathion in chickens

We start by fitting the five simple linear regression models, relating the detoxification to each of the induced enzyme activities, one at a time. That is, we fit the five models

$\displaystyle Y=\beta _{0}+\beta _{i}x_{i}+\varepsilon ,$  $\displaystyle i=1,\ldots ,5.$ (8.4)

The table below contains, for each model, the $ t$ -statistic and the corresponding $ p$ -value for testing significance of $ x_{i}$ in (8.4). (Recall that the $ t$ -test is equivalent to the $ F$ -test for significance of a single variable.)

$\displaystyle \begin{tabular}{\vert c\vert c\vert c\vert} \hline Variable & $t$...
... \\ $x_{4}$\ & 2.06 & 0.073 \\ $x_{5}$\ & -0.15 & 0.883 \\ \hline \end{tabular}$ (8.5)

For example, if we fit the simple model $ Y=\beta _{0}+\beta _{1}x_{1}+\varepsilon $ to the data, the table for the least squares estimators is given by

$\displaystyle \begin{tabular}{\vert c\vert r\vert r\vert r\vert r\vert}
\hline
...
...}$\ & 0.1504 & 0.05898 & \emph{2.55} & \emph{0.034} \\ \hline
\end{tabular},\
$

where the emphasised numbers 2.55 and 0.034 are the same as in (8.5). We can see from (8.5) that each of the variables $ x_{1}$ , $ x_{3}$ and $ x_{4}$ seems to provide useful information about the response variable, at a 10% significance level. Thus, we would be inclined to include these variables in the model. We start by including the `most' significant variable, that is, the variable with the lowest $ p$ -value: $ x_{1}$ . In the next step, we fit the four models, relating the response to $ x_{1}$ and each of the four remaining variables, one by one. That is, we fit the models

$\displaystyle Y=\beta _{0}+\beta _{1}x_{1}+\beta _{i}x_{i}+\varepsilon ,$  $\displaystyle i=2,\ldots ,5,
$

to find out whether any of the four remaining variables provide additional information about the response, given that $ x_{1}$ is already in the model. The table of $ t$ -tests corresponding to these models is given by

$\displaystyle \begin{tabular}{\vert c\vert c\vert c\vert} \hline Variable & $t$...
...9 \\ $x_{4}$\ & 1.75 & 0.124 \\ $x_{5}$\ & 1.99 & 0.087 \\ \hline \end{tabular}$ (8.6)

The variable $ x_{2}$ has the lowest $ p$ -value (which is lower than 0.10), so we add $ x_{2}$ to the model. We now have $ x_{1}$ and $ x_{2}$ in the model. Next, we fit the three models containing $ x_{1}$ and $ x_{2}$ , and one of the three remaining variables $ x_{3},x_{4}$ or $ x_{5}$ , respectively. The table of $ t$ -tests becomes

$\displaystyle \begin{tabular}{\vert c\vert c\vert c\vert}
\hline
Variable & $t$...
...\
$x_{4}$\ & 1.09 & 0.316 \\
$x_{5}$\ & -0.17 & 0.872 \\ \hline
\end{tabular}$

All $ p$ -values in the table are above 0.10, so the procedure terminates at this step. That is, the selected model is the one containing the explanatory variables $ x_{1}$ and $ x_{2}$ . This is the same model that was suggested by the all possible models procedure using the $ C_{m}$ criterion, although, unfortunately, the two procedures do not always agree.

$ \diamondsuit$


The two procedures that have been discussed in this subsection have the computational advantage that one only has to fit a small subset of the possible models. In situations where the maximum model has $ k$ explanatory variables, the maximum number of models to be fitted is $ k$ in the backward elimination procedure, and $ k+(k-1)+\cdots +1=k\left( k+1\right) /2$ in the forward selection procedure (one fit for each remaining candidate variable at each step). In both cases, this is a substantial reduction of the $ 2^{k}-1$ models to be fitted in the all possible models procedure. However, the main drawback of these two procedures is exactly the same as the advantage: we only consider a small subset of the possible models. The risk of missing the best model increases rapidly as the number of explanatory variables increases.




8.3.3 Stepwise regression procedure


The performance of the forward and backward procedures, described in Subsection 8.3.2, can be greatly improved by a modification known as the stepwise regression procedure. Here, we shall consider stepwise regression based on forward selection. (Stepwise regression based on backward elimination can be defined in a similar way.) Recall that once a variable is added to the model in forward selection, it stays in the model, irrespective of which other variables are added later on. However, it can easily happen that a variable entered early in the procedure becomes superfluous because of its interrelationship with other variables added to the model later on in the procedure.


The stepwise regression procedure modifies the forward selection procedure in the following way. Each time a new variable is added to the model, the significance of each of the variables already in the model is re-examined. That is, at each step in the forward selection procedure, we test for significance of each of the variables currently in the model, and remove the one with the highest $ p$ -value (if the $ p$ -value is above some threshold value, say 0.10). The model is then re-fitted without this variable, before going to the next step in the forward selection procedure. The stepwise regression procedure continues until no more variables can be added or removed.


Example 8.1 (continued) Detoxification of malathion in chickens

As in the forward selection procedure, we start by fitting the five models

$\displaystyle Y=\beta _{0}+\beta _{i}x_{i}+\varepsilon ,$  $\displaystyle i=1,\ldots ,5,
$

and find from (8.5) that $ x_{1}$ is the variable which is `most' significant (it has the lowest $ p$ -value). As in forward selection, we add $ x_{1}$ to the model, and fit the four models, relating the response to $ x_{1} $ and each of the four remaining variables, one by one. As before, we find that $ x_{2}$ is `most' significant of the four (see (8.6)), so we add $ x_{2}$ to the model. But, before continuing to add more variables, we fit the current model and test significance of $ x_{1}$ . The table for the least squares estimators is given by

$\displaystyle \begin{tabular}{\vert crrrr\vert}
\hline
Parameter & Estimate & S...
...10 \\
$\beta _{2}$\ & 0.10727 & 0.04138 & 2.59 & 0.036 \\ \hline
\end{tabular}$

Since $ x_{1}$ is significant, we leave it in the model and continue to the next step of forward selection, that is, we fit the three models containing $ x_{1}$ and $ x_{2}$ , and one of the three remaining variables $ x_{3},x_{4}$ or $ x_{5}$ , one by one. The corresponding table of $ t$ -tests is the same as in the forward selection procedure, that is,

$\displaystyle \begin{tabular}{\vert c\vert c\vert c\vert}
\hline
Variable & $t$...
...\
$x_{4}$\ & 1.09 & 0.316 \\
$x_{5}$\ & -0.17 & 0.872 \\ \hline
\end{tabular}$

Since all $ p$ -values are greater than 0.10, the procedure ends. Thus, the stepwise regression procedure selects the model containing $ x_{1}$ and $ x_{2} $ , that is,

$\displaystyle Y=\beta _{0}+\beta _{1}x_{1}+\beta _{2}x_{2}+\varepsilon .
$

In this example, stepwise regression resulted in the same model as forward selection, and as the all possible models procedure, using the $ C_{m}$ criterion. It seems that this model may be a good choice. In general, when there are more explanatory variables in the maximum model, the three procedures often result in different models.

$ \diamondsuit$


Note that the stepwise regression procedure is better to use than the forward selection and the backward elimination procedures, because it considers more (relevant) models. At the same time, the procedure is much quicker than the all possible models procedure, when the number of explanatory variables in the maximum model is large. However, if the number of explanatory variables is small, or if fitting vast numbers of models is not a problem, it is recommended to use the all possible models procedure, rather than stepwise regression.




8.4 Summary


Various methods for selecting the `best' regression model have been discussed in this module. In order to simplify the maximum model, containing all possible explanatory variables, to a reduced model that only contains the explanatory variables which provide important information about the response variable, one has to decide on a selection criterion (how to compare the possible models), and a selection procedure (a strategy for comparing relevant models). We have considered the $ R^{2}$ and $ R_{a}^{2}$ criteria, which compare models by comparing the values of $ R^{2}$ and $ R_{a}^{2} $ (the adjusted $ R^{2}$ ), respectively; the $ F$ -test criterion, which is based on testing significance of the explanatory variables; and the $ C_{m}$ criterion, which is based on a comparison of estimates of the error variances. As for selection procedures, the traditional forward selection and backward elimination procedures have been reviewed. The stepwise regression procedure is useful when the number of explanatory variables in the maximum model is large; otherwise, the all possible models procedure is the most sensible selection procedure to use.


Keywords: maximum model, collinearity, parsimony, selection criterion, reduced model, $ R_{a}^{2}$ criterion, adjusted $ R^{2}$ statistic, $ R^{2}$ criterion, $ F$ -test criterion, $ C_{m}$ criterion, Mallows' $ C_{m}$ statistic, all possible models procedure, backward elimination procedure, forward selection procedure, stepwise regression.


Last modified February 12, 2008.