Module 4: Residual analysis
By Pia Veldt Larsen







4.1 Introduction


The principle of least squares provides a general methodology for fitting straight-line models to regression data. So far, we have fitted such models to any data for which scatterplots between the response variable and the explanatory variables displayed anything resembling straight-line relationships. But we have made no further effort to check the validity of the assumptions of the models. For a multiple linear regression model

$\displaystyle Y_{i}=\beta _{0}+\beta _{1}x_{i,1}+\beta _{2}x_{i,2}+\cdots +\beta _{k}x_{i,k}+\varepsilon _{i},\ \ \ i=1,\ldots ,n,$

we make the following four model assumptions:

(I)
Independence: The response variables $ Y_{i}$ are independent.

(II)
Normality: The response variables $ Y_{i}$ are normally distributed.

(III)
Homoscedasticity: The response variables $ Y_{i}$ all have the same variance $ \sigma ^{2}.$ (The term homoscedasticity is from Greek and means `same variance'.)

(IV)
Linearity: The true relationship between the mean of the response variable $ {\mathbb{E}}[Y]$ and the explanatory variables $ x_{1},\ldots ,x_{k}$ is a straight line.


Assumption (I) on independence of the response variables is subject to the design of the study and the way the data have been collected. In this course, we shall assume that all data have been collected independently; that is, we shall assume that Assumption (I) is satisfied.


In order to check the model assumptions, we shall need a new type of residuals: standardised residuals. These are introduced in Section 4.2. The remaining sections are concerned with methods for assessing the appropriateness of the model: Section 4.3 concerns the normality assumption, Section 4.4 the homoscedasticity and linearity assumptions, and Section 4.5 the linearity assumption in multiple regression. The module concludes with Section 4.6 which considers situations where a few points differ from the rest of the data.




4.2 Residuals


Rather than checking Assumptions (II)-(IV) on the response variables directly, it is convenient to re-express the assumptions in terms of the random errors

$\displaystyle \varepsilon _{i}=Y_{i}-\left( \beta _{0}+\beta _{1}x_{i,1}+\beta _{2}x_{i,2}+\cdots +\beta _{k}x_{i,k}\right) ,$  $\displaystyle i=1,\ldots ,n,$ (4.1)

and check the assumptions on the random errors instead.


The following four assumptions on the random errors are equivalent to the assumptions on the response variables.

(i)
The random errors $ \varepsilon _{i}$ are independent.

(ii)
The random errors $ \varepsilon _{i}$ are normally distributed.

(iii)
The random errors $ \varepsilon _{i}$ have constant variance $ \sigma ^{2}$ .

(iv)
The random errors $ \varepsilon _{i}$ have zero mean.


If assumptions (i)-(iv) are satisfied, the random errors $ \varepsilon _{i}$ are independent, identically distributed random variables with distributions:

$\displaystyle \varepsilon _{i}\sim N(0,\sigma ^{2}),$  $\displaystyle i=1,\ldots ,n.
$

Thus, the random errors $ \varepsilon _{i}$ can be regarded as a random sample from a $ N(0,\sigma ^{2})$ distribution. We can check the assumptions on the random errors (and thereby the assumptions on the response variables) by analysing an observed sample of the random errors. All we need are observations of the random errors.


The obvious candidates for observations of the random errors are the fitted residuals: the differences between the observed values $ y_{1},y_{2},\ldots
,y_{n}$ of $ Y,$ and the values $ \hat{y}_{1},\hat{y}_{2},\ldots ,\hat{y}_{n}$ fitted by the model, where

$\displaystyle \hat{y}_{i}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{i,1}+\hat{\beta}_{2}x_{i,2}+\cdots +\hat{\beta}_{k}x_{i,k},$  $\displaystyle i=1,2,\ldots ,n,$ (4.2)

with $ \hat{\beta}_{0},\hat{\beta}_{1},\ldots ,\hat{\beta}_{k}$ denoting the least squares estimates of the regression parameters. That is, the fitted residuals are given by
$\displaystyle \hat{\varepsilon}_{i}$ $\displaystyle =$ $\displaystyle y_{i}-\hat{y}_{i}$  
  $\displaystyle =$ $\displaystyle y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i,1}-\hat{\beta}_{2}x_{i,2}-\cdots -\hat{\beta}_{k}x_{i,k}.$  

However, as we shall see in Subsection 4.2.1, these residuals are observations of random variables (known as raw residuals) which are not independent, and which do not have the same variance. In Subsection 4.2.2, the raw residuals are transformed into standardised residuals, for which the issue of non-constant variance is overcome.




4.2.1 Raw residuals


The observed values $ r_{i}$ of the raw residuals are given by the fitted residuals

$\displaystyle r_{i}=\hat{\varepsilon}_{i}=y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i,1}-\hat{\beta}_{2}x_{i,2}-\cdots -\hat{\beta}_{k}x_{i,k},$  $\displaystyle i=1,\ldots
,n,
$

where $ \hat{\beta}_{0},\hat{\beta}_{1},\ldots ,\hat{\beta}_{k}$ are the least squares estimates of the regression parameters. The corresponding random variables, denoted by $ R_{i}$ , are obtained by substituting the observed $ y_{i}$ s with the random variables $ Y_{i}$ , and the least squares estimates of $ \beta _{0},\beta _{1},\ldots ,\beta _{k}$ with the corresponding random variables: the least squares estimators. That is, the raw residuals are given by

$\displaystyle R_{i}=Y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i,1}-\hat{\beta}_{2}x_{i,2}-\cdots -\hat{\beta}_{k}x_{i,k},$  $\displaystyle i=1,\ldots ,n,$ (4.3)

where $ \hat{\beta}_{0},\hat{\beta}_{1},\ldots ,\hat{\beta}_{k}$ are the least squares estimators of the regression parameters.
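As a minimal sketch of how the observed raw residuals $ r_{i}$ might be computed in practice, the following Python code fits a model with two explanatory variables by least squares. It uses NumPy, which the module itself does not use; the data values and the names `X`, `beta_hat`, `y_hat` and `r` are hypothetical and chosen purely for illustration.

```python
import numpy as np

# Hypothetical toy data (not from the module): n = 6 observations, k = 2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.0, 4.2, 7.9, 7.1, 11.6, 12.3])

# Design matrix: a column of ones followed by the explanatory variables.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares estimates of beta_0, beta_1, beta_2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values as in (4.2), and observed raw residuals r_i = y_i - y_hat_i.
y_hat = X @ beta_hat
r = y - y_hat
```

Because the model contains an intercept, the observed raw residuals computed this way sum to zero, up to rounding error.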


It can be shown (we shall not do it here) that the $ i$ th raw residual $ R_{i}$ has the distribution

$\displaystyle R_{i}\sim N\left( 0,\left( 1-h_{ii}\right) \times \,\sigma ^{2}\right) ,$  $\displaystyle i=1,\ldots ,n,$ (4.4)

where $ h_{ii}$ is the $ i$ th diagonal element of the hat-matrix $ \mathbf{h}$ given by
$\displaystyle \mathbf{h}=\left(
\begin{array}{cccc}
h_{11} & h_{12} & \cdots & h_{1n} \\
\vdots & \vdots & & \vdots \\
h_{n1} & \cdots & \cdots & h_{nn}
\end{array}\right) =\mathbf{x}\left( \mathbf{x}^{T}\,\mathbf{x}\right) ^{-1}\,\mathbf{x}^{T},$ (4.5)

where $ \mathbf{x}$ is the design matrix

$\displaystyle \mathbf{x}=\left(
\begin{array}{ccccc}
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,k} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n,1} & x_{n,2} & \cdots & x_{n,k}
\end{array}\right) .$

The matrix $ \mathbf{h}$ is called the hat-matrix, because it has the property that it `puts a hat on the $ y$ s', in the sense that the fitted values $ \hat{y}_{1},\hat{y}_{2},\ldots ,\hat{y}_{n}$ in (4.2) are found by matrix-multiplying the hat-matrix on the vector of observed values $ y_{1},y_{2},\ldots ,y_{n}$ :

$\displaystyle \mathbf{hy}=\mathbf{h}\left(
\begin{array}{c}
y_{1} \\
y_{2} \\
\vdots \\
y_{n}
\end{array}\right) =\left(
\begin{array}{c}
\hat{y}_{1} \\
\hat{y}_{2} \\
\vdots \\
\hat{y}_{n}
\end{array}\right) .$

Here $ \mathbf{y}$ denotes the column vector of response variables, as defined in Module 3.


You can see from (4.4) that the raw residuals have different variances. Also, notice that, in general, none of the raw residuals has the variance we are looking for: $ \sigma ^{2}$ . It can be shown that all the diagonal elements $ h_{ii}$ take values between 0 and 1: if $ h_{ii}$ is small, the variance $ \left( 1-h_{ii}\right) \times \,\sigma ^{2}$ is close to the `right' variance $ \sigma ^{2}$ ; however, if $ h_{ii}$ is close to one, the variance $ \left( 1-h_{ii}\right) \times \,\sigma ^{2}$ is much smaller than $ \sigma ^{2}$ .
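The hat-matrix in (4.5) is straightforward to compute for a small dataset. The sketch below, in Python with NumPy, is only an illustration under assumed inputs: the design matrix `X` and the response `y` are hypothetical toy values, and forming the inverse explicitly is done for readability rather than numerical efficiency.

```python
import numpy as np

# Hypothetical toy data (not from the module): n = 6 observations, k = 2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.0, 4.2, 7.9, 7.1, 11.6, 12.3])
X = np.column_stack([np.ones_like(x1), x1, x2])

# Hat-matrix h = x (x^T x)^{-1} x^T, as in (4.5).
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Diagonal elements h_ii, and the factors (1 - h_ii) multiplying sigma^2 in (4.4).
h = np.diag(H)
variance_factor = 1.0 - h

# The hat-matrix 'puts a hat on the ys': h y gives the fitted values.
y_hat = H @ y
```

Observations whose `variance_factor` is far below 1 are those for which the raw residual has a variance much smaller than $ \sigma ^{2}$ .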


A further problem with the raw residuals is that they are not independent. However, it can be shown that if the values of the diagonal elements $ h_{ii}$ of the hat-matrix $ \mathbf{h}$ are reasonably small, the raw residuals are `nearly' independent. We shall not go into further details with this problem.


In summary, the raw residuals are not suitable for checking the assumptions on the random errors: the random errors all have the same variance, whereas the raw residuals have different variances; the random errors have variance $ \sigma ^{2}$ , whereas, in general, none of the raw residuals have variance $ \sigma ^{2}$ ; and the random errors are independent, whereas the raw residuals are not.




4.2.2 Standardised residuals


The standardised residuals are designed to overcome the problem of different variances of the raw residuals. The problem is solved by dividing each of the raw residuals by an appropriate term.


Recall that the $ i$ th raw residual $ R_{i}$ has a $ N(0,\left( 1-h_{ii}\right)
\times \,\sigma ^{2})$ -distribution. A standard result on the normal distribution states that if $ X\sim N(\mu ,\sigma ^{2})$ , then

$\displaystyle aX\sim N(a\mu ,a^{2}\sigma ^{2}).
$

Therefore, if we multiply $ R_{i}$ by $ a_{i}=1/\sqrt{1-h_{ii}}$ , we get the standardised residual, $ S_{i}$ , with distribution

$\displaystyle S_{i}=\frac{R_{i}}{\sqrt{1-h_{ii}}}\sim N\left( \frac{0}{\sqrt{1-h_{ii}}},\,\frac{\left( 1-h_{ii}\right) \times \sigma ^{2}}{1-h_{ii}}\right) =N\left( 0,\,\sigma ^{2}\right) .$

That is, the standardised residuals $ S_{1},\ldots ,S_{n}$ are random variables with distributions

$\displaystyle S_{i}\sim N\left( 0,\,\,\sigma ^{2}\right) ,$  $\displaystyle i=1,\ldots ,n.$ (4.6)

The observed value $ s_{i}$ of the $ i$ th standardised residual is given by

$\displaystyle s_{i}=\frac{r_{i}}{\sqrt{1-h_{ii}}}.$ (4.7)
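Formula (4.7) can be sketched in a few lines of Python with NumPy. The function name `standardised_residuals` and the toy data are hypothetical; the function returns the un-scaled version defined in this module, not the scaled version produced by most statistical packages.

```python
import numpy as np

def standardised_residuals(X, y):
    """Return (r, h, s): raw residuals, hat-matrix diagonal, and s_i from (4.7)."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat-matrix (4.5)
    h = np.diag(H)
    r = y - H @ y                          # observed raw residuals
    s = r / np.sqrt(1.0 - h)               # standardised residuals (4.7)
    return r, h, s

# Hypothetical toy data (not from the module).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.0, 4.2, 7.9, 7.1, 11.6, 12.3])
X = np.column_stack([np.ones_like(x1), x1, x2])

r, h, s = standardised_residuals(X, y)
```

Since $ 0<1-h_{ii}\leq 1$ , each standardised residual is at least as large in magnitude as the corresponding raw residual, and the inflation is greatest where $ h_{ii}$ is close to one.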


The standardisation of the residuals has taken care of the issue of different variances, but nothing has changed with regard to dependence between the residuals. It can be shown that the dependence between the standardised residuals is exactly the same as the dependence between the raw residuals. We shall not go into further details with this problem.


In summary, the standardised residuals are better suited than the raw residuals for checking the assumptions on the random errors. The standardised residuals $ S_{i}$ have the same distributions as the random errors: $ N\left( 0,\sigma ^{2}\right) $ . However, the standardised residuals are not, in general, independent. But if the values of the diagonal elements $ h_{ii}$ of the hat-matrix $ \mathbf{h}$ are reasonably small, the standardised residuals are `nearly' independent.


Note that most statistical computer packages (including SAS) calculate the standardised residuals slightly differently from the standardised residuals defined in the module. In most packages, each of the standardised residuals is divided by an estimate of the standard error, to obtain variables which are approximately $ N\left( 0,1\right) $ -distributed, rather than $ N\left(
0,\sigma ^{2}\right) $ -distributed. However, since all the residuals are divided by the same value, the patterns in residual plots and normal probability plots are identical whether one uses the un-scaled version in (4.7) or the scaled version.




4.3 Normality


The first assumption we consider is Assumption (ii): the random errors $ \varepsilon _{i}$ are normally distributed. Since the random errors can be regarded as a random sample from a $ N(0,\sigma ^{2})$ distribution, we can check Assumption (ii) by checking whether the standardised residuals $ s_{i}$ might have come from a normal distribution. A normal probability plot of the standardised residuals will give an indication of whether or not the assumption of normality of the random errors is appropriate. Recall that a normal probability plot is found by plotting the quantiles of the observed sample against the corresponding quantiles of a standard normal distribution $ N(0,1)$ . If the normal probability plot shows a straight line, it is reasonable to assume that the observed sample comes from a normal distribution. If, on the other hand, the points deviate from a straight line, there is statistical evidence against the assumption that the random errors are an independent sample from a normal distribution.
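The coordinates of a normal probability plot can be sketched as follows in Python. This is an illustration under stated assumptions rather than the procedure of any particular package: the plotting positions $ (i-0.5)/n$ are one common convention among several, the function name `normal_plot_points` is hypothetical, and the residual values are made up.

```python
import numpy as np
from statistics import NormalDist

def normal_plot_points(s):
    """Return (theoretical N(0,1) quantiles, ordered sample) for a normal probability plot."""
    ordered = np.sort(np.asarray(s, dtype=float))
    n = len(ordered)
    # Plotting positions (i - 0.5)/n; other conventions exist (an assumption here).
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return theoretical, ordered

# Hypothetical standardised residuals (not from the module's datasets).
s = np.array([-1.3, 0.4, 0.1, 2.2, -0.6, -0.2, 0.9, -1.0])
theoretical, ordered = normal_plot_points(s)

# A rough numerical companion to the visual check: the correlation between the
# two sets of quantiles is close to 1 when the points lie near a straight line.
straightness = np.corrcoef(theoretical, ordered)[0, 1]
```

Plotting `ordered` against `theoretical` gives the normal probability plot; the correlation is only a rough summary and does not replace looking at the plot itself.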


Example 4.1        Holiday cottages

Recall from Module 3 the data on sales prices, ages and livable areas of holiday cottages in Odsherred, Denmark. It was suggested, in Module 3, that a multiple linear regression model might describe the variation in the data well. The least squares line for the relationship between sales price $ (Y)$ , age $ (x_{1})$ , and livable area $ (x_{2})$ , is given by

$\displaystyle \hat{y}=-281.43-7.611x_{1}+19.01x_{2}.
$

Figure 4.1 shows a normal probability plot of the standardised residuals.
Figure 4.1: Normal probability plot of standardised residuals for Odsherred data
\includegraphics[width=0.65\textwidth]{fig/odsherredp}
There are very few data points, so one should be careful in concluding too much from the plot. Nevertheless, the points deviate quite a bit from a straight line, so the normality assumption might not be satisfied for these data.




$ \diamondsuit$
        


Example 4.2        Ice cream consumption

In Module 3, we considered how the ice cream consumption $ \left( Y\right) $ is related to temperature $ \left( x_{1}\right) $ , ice cream price $ \left( x_{2}\right) $ , average annual family income $ \left( x_{3}\right) $ , and the year $ \left( x_{4}\right) $ . In Module 3, a possible outlier was removed from the dataset before we fitted a multiple linear regression model to the data. In this module, we consider the full dataset, including the outlying point. The least squares line, relating the ice cream consumption to the four explanatory variables, is given by

$\displaystyle \hat{y}=0.714+0.00315\,x_{1}-1.29\,x_{2}-0.00237\,x_{3}+0.0508x_{4}.
$

A normal probability plot of the standardised residuals is shown in Figure 4.2.
Figure 4.2: Normal probability plot of standardised residuals for ice cream data
\includegraphics[width=0.65\textwidth]{fig/icecreamp}
The normal probability plot is not too far from a straight line. (Although the line is not entirely convincing.) It seems that the normality assumption might be satisfied for these data.




$ \diamondsuit$


The two most common ways to deal with failure of the normality assumption are either to transform the data into a new set of data for which the assumption is satisfied (transforming data is discussed in Module 6), or to use a distribution different from the normal. A general framework for dealing with non-normal (and/or non-linear) models is that of generalised linear models. Generalised linear models are studied in ST112.


Note that the normal probability plot can also be affected if one or more of the other assumptions are broken; for instance, if the response variables are dependent, or if the variances of the response variables differ.




4.4 Homoscedasticity and linearity


Assumption (iii), that the random errors $ \varepsilon _{i}$ have constant variance, and Assumption (iv), that the random errors $ \varepsilon _{i}$ have zero mean, can be checked at the same time. To do this, we use a residual plot. A residual plot is a scatterplot of the standardised residuals $ s_{i}$ against the fitted values $ \hat{y}_{i}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{i,1}+\hat{\beta}_{2}x_{i,2}+\cdots +\hat{\beta}_{k}x_{i,k}$ . Recall that the (standardised) residuals are the deviations of the observations away from the fitted values. If Assumptions (iii) and (iv) are satisfied, we would expect the residuals to vary randomly around zero, and we would expect the spread of the residuals to be about the same throughout the plot.


Example 4.2 (continued)        Ice cream consumption

A residual plot for the data on the relationship between ice cream consumption and temperature, ice cream price, average annual family income, and the year is shown in Figure 4.3.

Figure 4.3: Residual plot for the ice cream data
\includegraphics[width=0.65\textwidth]{fig/icecreamr}
The points in the plot seem to be fluctuating randomly around zero in an un-patterned fashion. Thus, the plot does not suggest violations of the assumptions of zero means and constant variance of the random errors.

$ \diamondsuit$


In general, any systematic pattern in a residual plot suggests that one or more of Assumptions (i)-(iv) are violated. Since we have assumed independence of the random errors, and since a normal probability plot is better for assessing the assumption of normality, we shall concentrate on breaches of Assumptions (iii) and (iv). When looking for patterns in residual plots, there are three main features which are important. If the residuals seem to increase or decrease in average magnitude with the fitted values, it is an indication that the variance of the residuals is not constant. That is, Assumption (iii) is broken. If the points in the plot lie on a curve around zero, rather than fluctuating randomly, it is an indication that Assumption (iv) is broken. If a few points in the plot lie a long way from the rest of the points, they might be outliers, that is, data points for which the model is not appropriate. (Outliers are considered further in Section 4.6.) Figure 4.4 illustrates the most important features to look for in a residual plot.

Figure 4.4: Different features in residual plots
\includegraphics[width=0.75\textwidth]{fig/scatterplots}
Figure 4.4(a) shows a residual plot with no systematic pattern. It seems that Assumptions (iii) and (iv) are satisfied for the data associated with this residual plot. In Figure 4.4(b) there is a clear curved pattern: Assumption (iv) may be broken. In Figure 4.4(c) the random variation of the residuals increases as the fitted values increase. This pattern indicates that the variance $ \sigma
^{2}$ is not constant. Finally, in Figure 4.4(d) most of the residuals are randomly scattered around 0, but one observation has produced a residual which is much larger than any of the other residuals. The point may be an outlier.


In Module 6, we shall consider ways to analyse data for which Assumption (iii) and/or Assumption (iv) are broken.
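The curved pattern of Figure 4.4(b) can be reproduced numerically. The following Python sketch fits a straight line to a deliberately curved, noise-free response and looks at the signs of the standardised residuals in order of increasing fitted value. The data are synthetic (a logarithm of an evenly spaced $ x$ ), not any dataset from this course, and counting sign changes is only an informal aid to reading the plot, not a formal test.

```python
import numpy as np

# Synthetic, deterministic curved data (an assumption made for illustration).
x = np.linspace(1.0, 10.0, 30)
y = np.log(x)

# Fit a simple linear regression model by least squares.
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

# Standardised residuals s_i = r_i / sqrt(1 - h_ii), as in (4.7).
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s = (y - y_hat) / np.sqrt(1.0 - h)

# Order the residuals by fitted value, as they appear from left to right
# in a residual plot, and count how often the sign changes.
order = np.argsort(y_hat)
signs = np.sign(s[order])
sign_changes = int(np.sum(signs[1:] != signs[:-1]))
```

For residuals that fluctuate randomly around zero one would expect the sign to change frequently; here the residuals form long runs of one sign (negative, then positive, then negative again), which is the numerical counterpart of points lying on a curve around zero.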


Example 4.3        Wind power

In Module 1, we considered a study into how the direct current output from a wind power generator changes with wind speed. A scatterplot of the data is reproduced in Figure 4.5.

Figure 4.5: Direct current output against wind speed
\includegraphics[width=0.65\textwidth]{fig/windspeed}

The data points seem to lie along a slightly curved line, but it is not too far from a straight line, so perhaps a simple linear regression model might be a reasonable model for the data after all. The least squares line for the data is given by

$\displaystyle \hat{y}=0.131+0.241~x.
$

Figure 4.6 shows (a) a residual plot and (b) a normal probability plot for the data.
Figure 4.6: Wind power data: (a) residual plot, (b) probability plot.
\includegraphics[width=0.75\textwidth]{fig/windspeedpr}
The normal probability plot in Figure 4.6(b) is not very convincing: the residuals appear to come from a skewed distribution. However, it is the residual plot in Figure 4.6(a) that provides the strongest argument against using a simple linear regression model for these data. There is a very clear pattern in the residual plot: the residuals go from being negative to positive and then negative again. Thus, it seems that Assumption (iv) is broken. In Module 6, we shall return to this example and find a better model for the data.




$ \diamondsuit$


In the case of simple models (with only one explanatory variable), a residual plot is useful for assessing both the assumption of constant variance of the response variables, and the assumption that the relationship between the response variable and the explanatory variable is a straight line. For example, in Example 4.3 the residual plot gives a very explicit indication of how the model assumptions are broken: the relationship between wind speed and current output is not a straight line; it is curved. However, when there is more than one explanatory variable in the model, the residual plot is less informative regarding the linearity assumption. For instance, although the residual plot for the ice cream data in Figure 4.3 does not indicate violations of the assumption that the mean response is of the form $ {\mathbb{E}}[Y]=\beta _{0}+\beta _{1}x_{1}+\beta _{2}x_{2}+\beta _{3}x_{3}+\beta _{4}x_{4},$ it is still possible that one or two of the explanatory variables enter the true relationship in a non-linear fashion. In general, when there are several explanatory variables, a non-linear relationship between the response and one (or more) of the explanatory variables can easily be concealed in a residual plot, in particular if the explanatory variables are correlated. In order to check whether each of the explanatory variables enters the model linearly, we need a different type of plot: partial residual plots. These are discussed in the next section.




4.5 Linearity in multiple regression


In a multiple linear regression model, it is assumed that each of the explanatory variables $ x_{1},\ldots ,x_{k}$ affects the mean of the response in a linear way. That is, we assume that

$\displaystyle {\mathbb{E}}[Y]=\beta _{0}+\beta _{1}x_{1}+\beta _{2}x_{2}+\cdots +\beta _{k}x_{k}.$ (4.8)

How can we check this assumption? An obvious suggestion would be to look for straight-line relationships in scatterplots of the observed response variables against each of the explanatory variables, one at a time.


Example 4.2 (continued)        Ice cream consumption

Scatterplots of the ice cream consumption against the four explanatory variables temperature, ice cream price, average annual family income, and the year are displayed in Figure 4.7.

Figure 4.7: Scatterplots of the ice cream consumption against the four explanatory variables
\includegraphics[width=0.75\textwidth]{fig/icecream}
There seem to be straight-line relationships between the ice cream consumption and temperature, and between the ice cream consumption and the year. The two remaining plots (against price and income) have a lot of scatter. The relationships might be straight-line, but the plots are hardly convincing.

$ \diamondsuit$


When we investigate the scatterplots, we essentially consider each of the simple models

$\displaystyle {\mathbb{E}}[Y]$ $\displaystyle =$ $\displaystyle \beta _{0}+\beta _{1}x_{1},$  
$\displaystyle {\mathbb{E}}[Y]$ $\displaystyle =$ $\displaystyle \beta _{0}+\beta _{2}x_{2},$  
    $\displaystyle \vdots$  
$\displaystyle {\mathbb{E}}[Y]$ $\displaystyle =$ $\displaystyle \beta _{0}+\beta _{k}x_{k},$  

separately. For example, that the relationship between ice cream consumption and temperature (ignoring all other variables) is a straight line, and the relationship between ice cream consumption and average annual income (ignoring all other variables) is a straight line, etc.


But the assumption we wish to check is (4.8), rather than each of the simple models. That is, for each $ l=1,\ldots ,k$ we wish to check whether $ x_{l}$ enters the model linearly, taking all the other variables into account. If all the explanatory variables are uncorrelated, there is no difference between checking (4.8) and checking the simple models separately. However, it is usually the case that the explanatory variables are correlated. For instance, in Example 4.2 it is likely that the average annual income will increase from one year to the next; thus the variables `income' and `year' are likely to be correlated.


The idea behind the method for checking whether $ x_{l}$ enters linearly in (4.8), taking all the other variables into account, is the following. We want to know how $ x_{l}$ affects the response variable, $ Y$ , if all the other explanatory variables $ x_{1},\ldots ,x_{l-1},x_{l+1},\ldots
,x_{k}$ affect the response variable linearly. That is, we consider the following form of the response variables

$\displaystyle Y_{i}\approx \beta _{0}+\beta _{1}x_{i,1}+\cdots +\beta _{l-1}x_{i,l-1}+p_{l}\left( x_{i,l}\right) +\beta _{l+1}x_{i,l+1}+\cdots +\beta _{k}x_{i,k},$

for some function $ p_{l}\left( \cdot \right) $ which we wish to determine. (If we can show that $ p_{l}\left( \cdot \right) $ is linear, the assumption is satisfied for $ x_{l}$ .) Since the true regression parameters $ \beta _{0},$ $ \beta _{1},\ldots ,\beta _{l-1},\beta _{l+1},\ldots ,\beta _{k}$ are unknown, we substitute by the least squares estimators $ \hat{\beta}_{0},$ $ \hat{\beta}_{1},\ldots ,\hat{\beta}_{l-1},\hat{\beta}_{l+1},\ldots ,\hat{\beta}_{k}$ , obtaining

$\displaystyle Y_{i}\approx \hat{\beta}_{0}+\hat{\beta}_{1}x_{i,1}+\cdots +\hat{\beta}_{l-1}x_{i,l-1}+p_{l}\left( x_{i,l}\right) +\hat{\beta}_{l+1}x_{i,l+1}+\cdots +\hat{\beta}_{k}x_{i,k}.$ (4.9)

The next step is to use the definition of the raw residual $ R_{i}$ in (4.3): $ R_{i}=Y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i,1}-\hat{\beta}_{2}x_{i,2}-\cdots -\hat{\beta}_{k}x_{i,k}.$ We can rewrite this as

$\displaystyle Y_{i}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{i,1}+\cdots +\hat{\beta}_{l-1}x_{i,l-1}+\hat{\beta}_{l}x_{i,l}+\hat{\beta}_{l+1}x_{i,l+1}+\cdots +\hat{\beta}_{k}x_{i,k}+R_{i}.$

If we substitute this expression for $ Y_{i}$ into (4.9) and cancel out, we get

$\displaystyle p_{l}\left( x_{i,l}\right) \approx \hat{\beta}_{l}x_{i,l}+R_{i}.
$
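To spell out the cancellation: replacing $ Y_{i}$ on the left-hand side of (4.9) by its expression in terms of the raw residual $ R_{i}$ gives

$\displaystyle \hat{\beta}_{0}+\cdots +\hat{\beta}_{l-1}x_{i,l-1}+\hat{\beta}_{l}x_{i,l}+\hat{\beta}_{l+1}x_{i,l+1}+\cdots +\hat{\beta}_{k}x_{i,k}+R_{i}\approx \hat{\beta}_{0}+\cdots +\hat{\beta}_{l-1}x_{i,l-1}+p_{l}\left( x_{i,l}\right) +\hat{\beta}_{l+1}x_{i,l+1}+\cdots +\hat{\beta}_{k}x_{i,k}.$

Every term other than $ \hat{\beta}_{l}x_{i,l}$ , $ R_{i}$ and $ p_{l}\left( x_{i,l}\right) $ appears on both sides. Subtracting these common terms leaves $ \hat{\beta}_{l}x_{i,l}+R_{i}$ on the left and $ p_{l}\left( x_{i,l}\right) $ on the right.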

That is, the true function $ p_{l}\left( \cdot \right) $ for how $ x_{l}$ affects $ Y$ is approximately equal to

$\displaystyle p_{l}\left( x_{i,l}\right) \approx \hat{\beta}_{l}x_{i,l}+R_{i}=P_{i,l},\ \ i=1,\ldots ,n.$ (4.10)

The terms $ P_{i,l},$ $ i=1,\ldots ,n,$ are called the $ l$ th partial residuals. (Note that, for each explanatory variable, we get a new set of partial residuals: the 1st partial residuals refer to $ x_{1}$ , the 2nd to $ x_{2}$ , etc.) The partial residuals are random variables since both the least squares estimator $ \hat{\beta}_{l}$ and the raw residuals $ R_{i}$ are random variables. Observations of the partial residuals are given by

$\displaystyle p_{i,l}=\hat{\beta}_{l}x_{i,l}+r_{i},$  $\displaystyle i=1,\ldots ,n,
$

where $ \hat{\beta}_{l}$ is the least squares estimate of $ \beta _{l}$ .


We know from (4.10) that, for each $ i=1,\ldots ,n$ , we have that $ p_{l}\left( x_{i,l}\right) \approx P_{i,l}.$ Thus, if we plot the values of the $ l$ th explanatory variable, $ x_{1,l},x_{2,l},\ldots ,x_{n,l}$ , against the observed $ l$ th partial residuals $ p_{1,l},p_{2,l},\ldots ,p_{n,l}$ , the plot will indicate the true function $ p_{l}\left( \cdot \right) $ . This plot is called the $ l$ th partial residual plot. (Note that we get a different plot for each explanatory variable: the 1st partial residual plot refers to $ x_{1}$ , the 2nd to $ x_{2}$ , etc.) If the partial residual plot shows a straight line, it is an indication that the true relationship between the response variable and the $ l$ th explanatory variable $ x_{l}$ is straight-line, when all other variables are taken into account. If the plot shows a non-linear relationship, it is an indication that $ x_{l}$ affects the response variable in a non-linear fashion.
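The observed partial residuals $ p_{i,l}=\hat{\beta}_{l}x_{i,l}+r_{i}$ are simple to compute once the model has been fitted. The Python sketch below uses hypothetical toy data with two explanatory variables; the dictionary `partial` and the other names are illustrative only.

```python
import numpy as np

# Hypothetical toy data (not from the module): n = 6 observations, k = 2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.0, 4.2, 7.9, 7.1, 11.6, 12.3])
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares fit of the full model and observed raw residuals r_i.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_hat

# l-th partial residuals p_{i,l} = beta_hat_l * x_{i,l} + r_i, for l = 1, ..., k.
# Column l of X holds x_l because column 0 is the intercept column.
k = X.shape[1] - 1
partial = {l: beta_hat[l] * X[:, l] + r for l in range(1, k + 1)}

# The l-th partial residual plot is the scatterplot of X[:, l] and partial[l].
```

A useful property for reading these plots is that a least squares line fitted through the $ l$ th partial residual plot has slope exactly $ \hat{\beta}_{l}$ , so curvature shows up as systematic departure of the points from a line with that slope.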


Example 4.2 (continued)        Ice cream consumption

Figure 4.8 shows partial residual plots for each of the four explanatory variables in the ice cream data.

Figure 4.8: Partial residual plots for the ice cream data
\includegraphics[width=0.75\textwidth]{fig/icecreampar}
The four plots indicate clear relationships between the ice cream consumption and each of the explanatory variables. The relationships seem to be more-or-less straight-line, although there is some indication of possible slight curves in the plots against temperature and income. Also, in the plot against temperature, a single point appears to deviate from the trend of the rest of the points. This point could be an outlier.


You can see that the plots in Figure 4.8 are quite different from the scatterplots in Figure 4.7. This is because the partial residual plots take the other variables into account.


$ \diamondsuit$




4.6 Outliers and leverage points


This section concerns situations where one or a few observations are different, in some way, from the rest of the data. We distinguish between two ways a few points may differ from the remaining points. A data point might lie far from the general trend in the rest of the data: such a point is called an outlier. Outliers are discussed in Subsection 4.6.1.


Sometimes, a statistical analysis is very sensitive to a single (or a few) data point(s), in the sense that if the value of this point is changed even slightly, the outcome of the analysis alters greatly. Such points are called leverage points, and are discussed in Subsection 4.6.2.




4.6.1 Outliers


An outlier is an observation which differs from the main trend in the data. The reason might be (unforeseen) special circumstances about the particular observation (for example, imagine that one of the holiday cottages in Example 4.1 was designed by a famous architect, adding extra value to the sales price), or it might be a measurement error. But there is also the possibility that the unusual observation is simply due to random variation in the data: since the data are observations of random variables, there will be some variation away from the true relationship. Most points will lie closely around the true relationship, some will lie a little away, and a few might lie a bit further away.


Suppose that a point lies a bit away from the main trend in the data, and that we wish to assess whether this is due to random variation in the data, or whether the observation actually differs, in some way, from the rest of the data. There are various methods for doing this; here we shall use studentised residuals.


The idea behind this method is as follows. If a data point lies far from the general trend in the data, it is equivalent to the point having a large (raw) residual. Thus, we can re-phrase the issue of whether a point lies too far from the main trend to have happened by chance, into an issue of whether the corresponding residual is too large to have happened by chance. We know from (4.4) that the $ i$ th raw residual $ R_{i}$ has a normal distribution with zero mean and variance $ \left( 1-h_{ii}\right) \times
\,\sigma ^{2}$ ; so, in order to check whether the observed value $ r_{i}$ is too large to have happened by chance, we can compare $ r_{i}$ to the distribution of $ R_{i}$ : $ N\left( 0,\left( 1-h_{ii}\right) \times \,\sigma
^{2}\right) $ . This is a basic statistical problem: we have a normal distribution with unknown variance (since $ \sigma ^{2}$ is unknown), and we wish to test whether or not the observation $ r_{i}$ might have come from this distribution. To do this, we use a $ t$ -test. The $ t$ -statistic is given by

$\displaystyle T_{i}=\frac{R_{i}}{\sqrt{\,\left( 1-h_{ii}\right) \tilde{\sigma}_{i}^{2}}}=\frac{S_{i}}{\sqrt{\,\tilde{\sigma}_{i}^{2}}},
$

where $ \tilde{\sigma}_{i}^{2}$ is an appropriate estimate of the variance of the standardised residual $ S_{i}$ . It can be shown that an appropriate unbiased estimate is given by $ \tilde{\sigma}_{i}^{2}=\left( \left( n-k\right) S^{2}-S_{i}^{2}\right) /\left( n-k-1\right) $ , where $ S^{2}=\sum_{j=1}^{n}\left( Y_{j}-\hat{Y}_{j}\right) ^{2}/\left( n-k\right) $ . The test statistic $ T_{i}$ has a $ t\left( n-k-1\right) $ -distribution. That is,

$\displaystyle T_{i}=S_{i}\sqrt{\frac{n-k-1}{\left( n-k\right) S^{2}-S_{i}^{2}}}\,\sim \,t\left( n-k-1\right) ,$  $\displaystyle i=1,\ldots ,n.$ (4.11)

The variables $ T_{i}$ are called studentised residuals (because they are $ t$ -distributed; or, more precisely, Student's $ t$ -distributed). If the numerical value $ \vert t_{i}\vert$ of a studentised residual is (much) larger than the rest, it is an indication that the corresponding observation $ y_{i}$ may be an outlier.
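Formula (4.11) can be sketched in Python as follows. The argument `dof` stands for the quantity written $ n-k$ in this module; in the usage example it is set to $ n$ minus the number of columns of the design matrix, the usual residual degrees of freedom, which is an assumption about how $ k$ is counted rather than something stated here. The function name and the toy data are hypothetical.

```python
import numpy as np

def studentised_residuals(X, y, dof):
    """Observed studentised residuals t_i following (4.11).

    dof plays the role of the module's (n - k); its value is an assumption
    supplied by the caller.
    """
    H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat-matrix (4.5)
    h = np.diag(H)
    r = y - H @ y                             # raw residuals
    s = r / np.sqrt(1.0 - h)                  # standardised residuals (4.7)
    S2 = np.sum(r ** 2) / dof                 # estimate of sigma^2
    sigma_tilde2 = (dof * S2 - s ** 2) / (dof - 1)
    return s / np.sqrt(sigma_tilde2)

# Hypothetical toy data (not from the module).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.0, 4.2, 7.9, 7.1, 11.6, 12.3])
X = np.column_stack([np.ones_like(x1), x1, x2])

# Assumption: residual degrees of freedom = n minus number of columns of X.
dof = X.shape[0] - X.shape[1]
t = studentised_residuals(X, y, dof)

# Index of the observation with the largest |t_i|, the first candidate to inspect.
most_extreme = int(np.argmax(np.abs(t)))
```

With this choice of `dof`, each $ t_{i}$ coincides with what one would obtain by refitting the model without observation $ i$ and comparing $ y_{i}$ with its prediction from that reduced fit, which is why the estimate $ \tilde{\sigma}_{i}^{2}$ differs from one observation to the next.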


There is no fixed value (or quantile) beyond which a point is automatically an outlier. If the model is correct, we expect around 5% of the studentised residuals to lie outside the interval between the 2.5%- and 97.5%-quantiles of a $ t\left( n-k-1\right) $ -distribution, 1% to lie outside the interval between the 0.5%- and 99.5%-quantiles, and so on. For example, an observation with a studentised residual corresponding to the 99.9%-quantile may be an outlier if the dataset only contains 20 observations, but it is not an outlier in a dataset of 1000 observations. We would expect around 0.1% of the residuals to exceed the 99.9%-quantile; in a dataset of 20 observations, this corresponds to 0.02 observations out of the 20, so it is very unlikely that an extreme residual like this would have occurred by chance. However, if the dataset contains 1000 observations, we would expect 1 observation (0.1% of 1000) to have a residual exceeding the 99.9%-quantile. Hence, the point is not an outlier; in fact, it would be suspicious if there were no residuals around or beyond the 99.9%-quantile!
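The arithmetic in this argument can be checked with a short Python sketch. The function names are hypothetical, and the second function treats the residuals as if they were independent, which, as noted in Section 4.2, holds only approximately.

```python
def expected_exceedances(n_obs, tail_prob):
    """Expected number of residuals beyond a quantile with upper tail probability tail_prob."""
    return n_obs * tail_prob

def prob_at_least_one(n_obs, tail_prob):
    """Chance that at least one of n_obs residuals exceeds the quantile.

    Assumes the residuals behave as if independent (an approximation).
    """
    return 1.0 - (1.0 - tail_prob) ** n_obs

# The 99.9%-quantile leaves an upper tail probability of 0.001.
small = expected_exceedances(20, 0.001)      # 0.02 observations out of 20
large = expected_exceedances(1000, 0.001)    # 1 observation out of 1000

chance_small = prob_at_least_one(20, 0.001)
chance_large = prob_at_least_one(1000, 0.001)
```

Under the independence approximation, the chance of seeing at least one such residual is about 2% with 20 observations but more than 60% with 1000 observations, which is the sense in which the same residual is remarkable in a small dataset and unremarkable in a large one.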


When a possible outlier is detected, one should always try to find out whether there is a reason why this point may be different from the rest. For example, was the particular measurement taken by a different person, or on a different day or in a different place, or does the particular subject differ in some way from the rest? In the example on ice cream sales, one data point lies away from the trend: could this be because the particular period coincided with the summer holidays? Or because there was a fun fair in the town? Or ...? Or could it simply be a misprint? If you have collected the data yourself, or have access to additional information about the data collection, you might be able to account for the outlier (e.g. by correcting a misprint, or introducing an extra explanatory variable). In this course, however, we cannot investigate the background of outlying points, as there is no further information available on the collection and validation of the datasets that are used.


In situations where no explanation can be found for why a point is outlying, one has to decide whether to leave the corresponding observation in the dataset or to omit it when the data are analysed. (Alternatively, it is sometimes possible to transform the data in such a way that outlying points are pulled closer towards the general trend in the transformed data. Transforming data is discussed in Module 6.) Whether an unexplained outlier should be left in or omitted from the dataset depends both on its extremity and on its leverage. The next subsection concerns leverage, and how to check for outliers and leverage points in a diagnostic plot.


Note that studentised residuals are sometimes also used for checking normality of the random errors (Assumption (II)). But since the $ T_{i}$s in (4.11) are $ t$-distributed rather than normally distributed, this is not strictly correct. (In order to assess this assumption using the studentised residuals, the quantiles of the observed sample $ t_{1},t_{2},\ldots ,t_{n}$ should be plotted against the corresponding quantiles of an appropriate $ t$-distribution.) However, if the dataset is sufficiently large, the $ t$-distribution is very close to a normal distribution, and the quantiles are almost identical. Thus, for large datasets, one can use a normal probability plot as a good approximation to a $ t$-distribution probability plot.




4.6.2 Leverage points


A leverage point is a point whose observed value has a great influence on the analysis. An illustration of a leverage point is shown in Figure 4.9: suppose you have a cluster of data with $ x$-values not too far apart, and one further observation with an $ x$-value far away from the cluster. The value of this isolated point is disproportionately influential on the least squares line: one might say that it works as a lever. If the value of this observation is changed, the least squares line changes considerably (as illustrated in the figure). In contrast, if the value of one of the points within the cluster is changed, the least squares line will not be affected to the same extent.

Figure 4.9: A leverage point in regression
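
The lever effect can be reproduced numerically. The sketch below uses made-up data (a cluster of five $ x$-values and one isolated $ x$-value, not the data behind Figure 4.9) and compares how much the least squares slope changes when the response of the isolated point, or of a point inside the cluster, is moved by the same amount. The helper names are hypothetical.

```python
import numpy as np

# Hypothetical data: a cluster of x-values plus one isolated x-value.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 10.0])
y = np.array([2.1, 2.4, 3.1, 3.4, 4.0, 11.0])

def slope(x, y):
    """Slope of the least squares line through the points (x, y)."""
    return np.polyfit(x, y, 1)[0]

def slope_after_shift(index, shift):
    """Slope after moving the response of one observation by `shift`."""
    y_new = y.copy()
    y_new[index] += shift
    return slope(x, y_new)

base = slope(x, y)
change_isolated = slope_after_shift(5, -5.0) - base   # move the isolated point
change_cluster = slope_after_shift(2, -5.0) - base    # move a point in the cluster
print(base, change_isolated, change_cluster)
```

For these data the slope reacts several times more strongly to the isolated point than to the point inside the cluster, even though both responses are moved by the same amount.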


It can be shown that the diagonal element $ h_{ii}$ of the hat-matrix in (4.5) indicates the amount of leverage, or influence, the $ i$th observation has on the least squares line. The larger the value of $ h_{ii}$, the more influence the observation has on the least squares line. (Recall that the largest value $ h_{ii}$ can take is 1.) It can be shown that the average value of the $ h_{ii}$s is $ \left( k+1\right) /n$; a rule of thumb says that an observation is a leverage point if it has a hat-diagonal $ h_{ii}$ greater than $ 2\left( k+1\right) /n$. Recall that the hat-matrix, $ \mathbf{h}=\mathbf{x}\left( \mathbf{x}^{T}\,\mathbf{x}\right) ^{-1}\,\mathbf{x}^{T},$ only depends on the design matrix and not on the response variables $ Y_{i}$. That is, the observed value of the response variable is irrelevant with regard to whether or not a point $ \left( x_{i},y_{i}\right) $ is a leverage point.
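
A minimal sketch of the rule of thumb, assuming NumPy and a made-up set of $ x$-values with one isolated point; the function name `hat_diagonals` is hypothetical. Note that no response values are needed.

```python
import numpy as np

def hat_diagonals(X):
    """Diagonal elements h_ii of the hat-matrix X (X^T X)^{-1} X^T."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)

x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 10.0])   # hypothetical x-values
X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
n, k = len(x), X.shape[1] - 1                    # k explanatory variables

h = hat_diagonals(X)
cutoff = 2 * (k + 1) / n                         # rule-of-thumb threshold
flagged = np.flatnonzero(h > cutoff)             # indices of leverage points
print(np.round(h, 3), h.mean(), cutoff, flagged)
```

Here the hat-diagonals sum to $ k+1=2$, so their average is $ \left( k+1\right) /n$, and only the isolated point exceeds the threshold $ 2\left( k+1\right) /n$.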


Note that leverage points do not necessarily constitute a problem. If the observation $ y_{i}$ corresponding to a leverage point lies close to the general trend in the data, the point is called a good leverage point, and there is no reason to do anything about the data point. However, if $ y_{i}$ differs from the main trend (in particular, if $ y_{i}$ corresponds to an outlier), the point is called a bad leverage point, and should be removed from the dataset.


Example 4.2 (continued) Ice cream consumption

In Figure 4.10 the studentised residuals are plotted against the values $ h_{ii}$ of the hat-matrix for the ice cream data.

Figure 4.10: Studentised residuals and hat-diagonals for the ice cream data

The $ h_{ii}$s are plotted along the horizontal axis. In this example $ k=4$ and $ n=30$, so $ 2\left( k+1\right) /n=10/30=1/3$; that is, the rule of thumb suggests that we should investigate observations for which $ h_{ii}>1/3$. There is one observation with an $ h_{ii}$ around 1/3, but since the studentised residual for the point is close to zero, it seems to be a good leverage point. Two more points have high leverage ($ h_{ii}\approx 0.29$), one of which has a high studentised residual ($ t_{i}=2.27$, corresponding to the 98.4%-quantile). We could have considered omitting this point from the dataset before analysing the data in Module 3.


There is one point in Figure 4.10 for which the studentised residual is a fair bit larger than the rest ($ t_{i}=2.68$, corresponding to the 99.6%-quantile); this point corresponds to the outlier that was removed from this dataset in Module 3. (It is not a very extreme outlier and it has low leverage, so we could have chosen to leave it in the dataset.)


$ \diamondsuit$




4.7 Summary


The assumptions of multiple linear regression models are that the response variables are independent normally distributed random variables with constant variance and means depending linearly on the explanatory variables. These assumptions are equivalent to the random errors being independent normally distributed random variables with zero mean and constant variance. The assumptions on the response variables are checked by assessing the assumptions on the random errors. The normality assumption is checked by means of a normal probability plot of the standardised residuals. The assumption on constant variance is checked by means of a residual plot of the standardised residuals. The linearity assumption can be checked through partial residual plots. Finally, we can check for outliers by considering the studentised residuals, and for leverage points by considering the diagonal elements of the hat-matrix.


Keywords: model assumptions, independence assumption, normality assumption, homoscedasticity assumption, linearity assumption, raw residual, hat-matrix, standardised residual, normal probability plot, residual plot, partial residual, partial residual plot, outlier, studentised residual, leverage point, good leverage point, bad leverage point.



Last modified February 12, 2008.