This module reviews simple linear regression models, that is, regression
models with just one explanatory variable in which the relationship
between the response variable and the explanatory variable is a straight
line. Although these models are simple, they are important for
various reasons. Firstly, they are very common (you have already met several
examples in Module 1). This is partly due to the fact that non-linear
relationships often can be approximated by straight lines, over limited
ranges. Secondly, in cases where a scatterplot of the data displays a
non-linear relationship between the response variable and the explanatory
variable, it is sometimes possible to transform the data into a new pair of
variables with a straight-line relationship. That is, we can transform a
simple non-linear regression model into a simple linear
regression model, and analyse the data using methodology from linear models.
Lastly, the simplicity of these models makes them useful in providing an
overview of the general methodology. Later in the course, we shall extend
the results for simple linear regression models to more complex settings.
A formal definition of the simple linear regression model is given in
Section 2.2. In Section 2.3, we discuss how to fit the model, and how to
estimate the variation away from the line. Section 2.4 concerns inference on
simple linear regression models.
In most of the examples and exercises in Module 1, there was only one
explanatory variable, and the relationship between this variable and the
response variable was a straight line with some random fluctuation around
the line.
Example 2.1 Mobility of elderly people
These data concern the relationship between two methods for measuring the
mobility of elderly people: the TUG score ($x$) and the Berg score ($y$). A
scatterplot of the data is shown in Figure 2.1.
Figure 2.1: Berg score against TUG score
The relationship between the variables could be described as a straight
line plus some random fluctuations. Thus, we can use, as a model for the
data, the model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, \dots, n.$$
This is an example of a simple linear regression model.
Further details on this dataset can be found here.
Suppose that we have a response variable $Y$ and an explanatory variable
$x$. Then the simple linear regression model for $Y$ on $x$ is given by
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, \dots, n, \tag{2.1}$$
where $\beta_0$ and $\beta_1$ are unknown parameters, and the
$\epsilon_i$s are independent random variables with zero mean and
constant variance for all $i$.
The parameters $\beta_0$ and $\beta_1$ are called regression
parameters (or regression coefficients), and the line
$\beta_0 + \beta_1 x$ is called the regression line or
the linear predictor. (Recall that, in general, the mean of the response
viewed as a function of $x$ is called a regression curve.) The regression
parameters $\beta_0$ and $\beta_1$ are unknown, non-random parameters. They
are the intercept and the slope, respectively, of the straight line relating
$Y$ to $x$.
The name simple linear regression model refers to the fact that the
mean value of the response,
$$\mathbb{E}[Y_i] = \beta_0 + \beta_1 x_i,$$
is a linear function of the regression parameters $\beta_0$ and
$\beta_1$. (Note that $\mathbb{E}[Y_i]$ is an affine function of the
explanatory variable $x_i$.)
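The terms $\epsilon_i$ in (2.1) are called random
errors or random terms. The random error $\epsilon_i$ is the
term which accounts for the variation of the $i$th response variable
$Y_i$ away from the linear predictor $\beta_0 + \beta_1 x_i$
at the point $x_i$. That is,
$$\epsilon_i = Y_i - \beta_0 - \beta_1 x_i, \qquad i = 1, \dots, n. \tag{2.2}$$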
The $\epsilon_i$s are independent random variables with the same
variance and zero mean. Hence, the response variables $Y_i$ are
independent with means $\mathbb{E}[Y_i] = \beta_0 + \beta_1 x_i$, and
constant variance $\sigma^2$ equal to the variance of $\epsilon_i$.
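To make the model definition concrete, the following is a minimal sketch that simulates responses from a straight line plus independent errors. The design points, parameter values, the function name `simulate_responses` and the use of normal errors are all assumptions made for illustration; the model itself only requires independent errors with zero mean and constant variance.

```python
import random


def simulate_responses(x, beta0, beta1, sigma, seed=0):
    """Draw one response per x-value from Y_i = beta0 + beta1 * x_i + eps_i."""
    rng = random.Random(seed)
    # Assumption: normal errors. The model only asks for independent errors
    # with zero mean and constant variance sigma**2.
    return [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]


# Hypothetical design points and parameter values (not from the module).
x = [1, 2, 3, 4, 5]
y = simulate_responses(x, beta0=2.0, beta1=0.5, sigma=1.0)
noise_free = simulate_responses(x, beta0=2.0, beta1=0.5, sigma=0.0)

print(y)           # points scattered around the regression line
print(noise_free)  # with sigma = 0 the responses lie exactly on the line
```

Setting `sigma` to zero removes the random errors, so the simulated responses coincide with the linear predictor at each design point.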
Example 2.1 (continued) Mobility of elderly people
An interpretation of the regression parameters $\beta_0$ and $\beta_1$
is as follows:
- $\beta_0$: The expected Berg score for a hypothetical
patient with TUG score zero.
- $\beta_1$: The expected change in the Berg score, when the
TUG score is increased by one minute. Observe that the slope of the line is
negative, implying that the Berg score decreases with increasing TUG score.
Having decided that a straight line might describe the relationship in the
data well, the obvious question is now: which line fits the data best?
In Figure 2.2 four different lines are added to a
scatterplot for the data on mobility of elderly people. One or two of the
lines may look a little better than others, but it is difficult to decide
which line is the best.
Figure 2.2: Mobility data: four different regression lines
The most common criterion for estimating the best fitting line to data is
the principle of least squares. This criterion is described in
Subsection 2.3.1. Subsection 2.3.2 concerns a measure of the strength of the
straight-line relationship. When we estimate the regression line, we
effectively estimate the two regression parameters $\beta_0$ and
$\beta_1$. That leaves one remaining parameter in the model: the common
variance $\sigma^2$ of the response variables. We discuss how to estimate
$\sigma^2$ in Subsection 2.3.3.
The principle of least squares is based on the residuals. For any
line, the residuals are the deviations of the response variables
away from the line. (Note that residuals always refer to a given line
or curve.) The residuals are usually denoted by $\epsilon_i$, like the
random errors in (2.2). The reason for this notation is that,
if the line is the true regression line of the model, then the residuals are
exactly the random errors $\epsilon_i$ in (2.2). For a
given line $\beta_0 + \beta_1 x$,
the observed value of $\epsilon_i$ is the difference between the $i$th
observation $y_i$ and the linear predictor $\beta_0 + \beta_1 x_i$
at the point $x_i$. That is,
$$\epsilon_i = y_i - \beta_0 - \beta_1 x_i, \qquad i = 1, \dots, n. \tag{2.3}$$
The observed values of $\epsilon_i$ are called observed
residuals (or just residuals). In Figure 2.3, a
possible regression line has been drawn in a scatterplot of the data on
mobility of elderly people. The residuals are indicated as vertical lines in
the plot.
Figure 2.3: Mobility data: the observed residuals
Note that the better the line fits the data, the smaller the residuals will
be. Thus, we can use the 'sizes' of the residuals as a measure of how well a
proposed line fits the data. If we simply used the sum of the residuals,
large positive and large negative values would cancel out; this problem can
be avoided by using the sum of the squared residuals instead. If this
measure (the sum of squared residuals) is small, the line explains the
variation in the data well; if it is large, the line explains the variation
in the data poorly. The principle of least squares is to estimate the
regression line by the line which minimises the sum of squared residuals.
Or, equivalently: estimate the regression parameters $\beta_0$ and
$\beta_1$ by the values which minimise the sum of squared residuals.
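The sum of squared residuals, or, as it is usually called, the
residual sum of squares, is denoted by $RSS$ (or $RSS(\beta_0, \beta_1)$
to emphasise that it is a function of $\beta_0$ and $\beta_1$), and is
given by
$$RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2. \tag{2.4}$$
(For simplicity, we omit the limits $i = 1$ and $n$ on the summation symbols
in the following.)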
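As an illustration of how the residual sum of squares ranks candidate lines, here is a minimal sketch. The data set and the three candidate lines are hypothetical and chosen only so that the arithmetic is easy to check by hand; the function name `rss` is likewise an assumption.

```python
def rss(beta0, beta1, x, y):
    """Residual sum of squares of the line beta0 + beta1 * x for data (x, y)."""
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data (not the mobility data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Candidate lines given as (intercept, slope) pairs.
candidates = {"flat": (4.0, 0.0), "steep": (1.0, 1.0), "moderate": (2.2, 0.6)}
scores = {name: rss(b0, b1, x, y) for name, (b0, b1) in candidates.items()}
best = min(scores, key=scores.get)

for name, value in scores.items():
    print(f"{name:9s} RSS = {value:.2f}")
print("smallest RSS:", best)
```

Among these three candidates the line with intercept 2.2 and slope 0.6 has the smallest residual sum of squares; the principle of least squares searches over all possible lines rather than a short list.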
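In order to minimise $RSS(\beta_0, \beta_1)$ with respect to $\beta_0$ and
$\beta_1$ we differentiate (2.4), and get
$$\frac{\partial RSS}{\partial \beta_0} = -2 \sum (y_i - \beta_0 - \beta_1 x_i), \qquad \frac{\partial RSS}{\partial \beta_1} = -2 \sum x_i (y_i - \beta_0 - \beta_1 x_i).$$
Setting the derivatives equal to zero and rearranging the terms yields the
following equations
$$n \beta_0 + \beta_1 \sum x_i = \sum y_i, \qquad \beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i.$$
Solving the equations for $\beta_0$ and $\beta_1$ provides the least squares
estimates $\hat{\beta}_0$ (read 'beta-naught-hat') and $\hat{\beta}_1$
('beta-one-hat') of $\beta_0$ and $\beta_1$, respectively. They are given by
$$\hat{\beta}_1 = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
where $\bar{y} = \sum y_i / n$ and $\bar{x} = \sum x_i / n$ denote the
sample means of the response and explanatory variable, respectively.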
The estimated regression line is called the least squares line or
the fitted regression line and is given by
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x. \tag{2.5}$$
The values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
are called the fitted values or the predicted values. The fitted value
$\hat{y}_i$ is an estimate of the expected response for a given
value $x_i$ of the explanatory variable. The residuals corresponding to
the fitted regression line are called the fitted residuals, or
simply the residuals. They are given by
$$\hat{\epsilon}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i, \qquad i = 1, \dots, n. \tag{2.6}$$
The fitted residuals can be thought of as observations of the random errors
$\epsilon_i$ in the simple linear regression model (2.1).
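The fitted values and fitted residuals can be computed directly once the estimates are known. The sketch below uses hypothetical toy data, for which the least squares estimates happen to be 2.2 (intercept) and 0.6 (slope); it also checks the two conditions obtained by setting the derivatives of the residual sum of squares to zero, namely that the fitted residuals sum to zero and are orthogonal to the explanatory variable.

```python
# Hypothetical toy data and its least squares estimates (intercept, slope).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
beta0_hat, beta1_hat = 2.2, 0.6

# Fitted values: the least squares line evaluated at each x_i.
fitted = [beta0_hat + beta1_hat * xi for xi in x]
# Fitted residuals: observation minus fitted value.
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Both sums vanish (up to rounding) at the least squares solution.
sum_res = sum(residuals)
sum_x_res = sum(xi * ri for xi, ri in zip(x, residuals))

print([round(f, 2) for f in fitted])
print([round(r, 2) for r in residuals])
print(round(sum_res, 10), round(sum_x_res, 10))
```

If either sum were clearly non-zero, the proposed line could not be the least squares line for these data.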
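It is convenient to use the following shorthand notation for the sums
involved in the expressions for the parameter estimates (all summations are
for $i = 1, \dots, n$):
$$s_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n \bar{x}^2, \qquad s_{yy} = \sum (y_i - \bar{y})^2,$$
$$s_{xy} = s_{yx} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n \bar{x} \bar{y}.$$
The sums $s_{xx}$ and $s_{yy}$ are called corrected sums of squares, and the
sums $s_{xy}$ and $s_{yx}$ are called corrected sums of
cross products. (The corresponding sums involving the random variables $Y_i$
rather than the observations $y_i$ are denoted by upper-case
letters: $S_{YY}$, $S_{xY}$ and $S_{Yx}$.) In this notation, the least
squares estimates $\hat{\beta}_1$ and $\hat{\beta}_0$
of the slope and intercept of the regression line are given by
$$\hat{\beta}_1 = \frac{s_{xy}}{s_{xx}} \tag{2.7}$$
and
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \tag{2.8}$$
respectively.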
Note that the estimate of $\beta_1$ is undefined if $s_{xx} = 0$
(division by zero). But this is not a problem in practice: if $s_{xx} = 0$,
the explanatory variable only takes one value, and there can be no best line.
Note also that the least squares line passes through the centroid
(the point $(\bar{x}, \bar{y})$) of the data.
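The corrected-sums form of the estimates translates directly into code. The following is a minimal sketch on hypothetical toy data; the function name `least_squares` is an assumption, and the estimates are computed as the slope $s_{xy}/s_{xx}$ and the intercept $\bar{y} - \hat{\beta}_1 \bar{x}$.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least squares line for data (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sum of squares and corrected sum of cross products.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if s_xx == 0:
        raise ValueError("all x-values are equal, so the slope is undefined")
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Hypothetical toy data (not from the module).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
beta0_hat, beta1_hat = least_squares(x, y)
print(round(beta0_hat, 4), round(beta1_hat, 4))

# The fitted line passes through the centroid (x_bar, y_bar) = (3, 4).
at_centroid = beta0_hat + beta1_hat * 3
print(round(at_centroid, 4))
```

The explicit check for equal x-values mirrors the remark above about division by zero.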
Example 2.1 (continued) Mobility of elderly people
For the data on mobility of elderly people, the least squares estimates of
the regression parameters are given by
So, the fitted least squares line has equation
The least squares line is shown in Figure 2.4. The line
appears to fit the data reasonably well.
Figure 2.4: Mobility data: the least squares line
Example 2.2 Age and height of children
In the example from Module 1 on age and height of children from an Egyptian
village, the interest was in the overall growth pattern of the children. The
least squares line relating average height to age has the equation
That is,
Height  Age,
where height is measured in cm, and age in months. Figure 2.5 shows the least squares line in a scatterplot of the data.
You can see that the line fits the data very well.
Figure 2.5: Age and height data: the least squares line
Further details on this dataset can be found here.
The least squares principle is the traditional and most common method for
estimating the regression parameters. But other estimation criteria exist:
e.g. estimating the parameters by the values that minimise
the sum of absolute values of the residuals, or by the values that minimise
the sum of orthogonal distances between the observed values and the fitted
line. The principle of least squares has various advantages over the other
methods. For example, it can be shown that, if the response variables are
normally distributed (a common assumption), the least squares estimates
of the regression parameters are exactly the maximum likelihood estimates of
the parameters.
In the previous subsection we used the principle of least squares to fit the
'best' straight line to data. But how well does the least squares line
explain the variation in the data? In this subsection we describe a measure
for roughly assessing how well a fitted line describes the variation in
data: the coefficient of determination.
The coefficient of determination compares the amount of variation in the
data away from the fitted line with the total amount of variation in the
data. The argument is as follows: if we did not have the linear model we
would have to use the 'naïve' model $\hat{y} = \bar{y}$ instead. The
variation away from the naïve model is
$S_{YY} = \sum (Y_i - \bar{Y})^2$: the total amount of variation in the data.
However, if we use the least squares line (2.5) as model, the
variation away from the model is only
$$RSS = \sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.$$
A measure of the strength of the linear relationship between $Y$ and $x$ is
the coefficient of determination $R^2$: it is the proportional reduction in
variation obtained by using the least squares
line instead of the naïve model. That is, the reduction in variation
away from the model, $S_{YY} - RSS$, as a proportion of the total variation
$S_{YY}$:
$$R^2 = \frac{S_{YY} - RSS}{S_{YY}}.$$
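The larger the value of $R^2$, the greater the reduction from $S_{YY}$
to $RSS$ relative to $S_{YY}$, and the stronger the relationship between
$Y$ and $x$. An estimate of $R^2$ is found by substituting $S_{YY}$
and $RSS$ by the observed sums $s_{yy}$ and
$rss = \sum (y_i - \hat{y}_i)^2$, that is
$$r^2 = \frac{s_{yy} - rss}{s_{yy}}.$$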
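Note that the square root of $r^2$ is, up to sign, exactly the estimate
from Module 1 of the Pearson correlation coefficient, $\rho$, between $x$
and $Y$ when $x$ is regarded as a random variable:
$$r = \frac{s_{xy}}{\sqrt{s_{xx} s_{yy}}} = \frac{s_{xy} / (n - 1)}{s_x s_y},$$
where $s_x = \sqrt{s_{xx} / (n - 1)}$ and $s_y = \sqrt{s_{yy} / (n - 1)}$
are the standard deviations for $x$ and $y$, respectively.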
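The relationship between the estimated coefficient of determination and the sample correlation can be checked numerically. This is a minimal sketch on hypothetical toy data; the function name `determination_and_correlation` is an assumption.

```python
import math


def determination_and_correlation(x, y):
    """Return (r_squared, r) for a least squares line fitted to (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    # Variation away from the fitted line.
    rss = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
    r_squared = (s_yy - rss) / s_yy
    r = s_xy / math.sqrt(s_xx * s_yy)
    return r_squared, r


# Hypothetical toy data (not from the module).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_squared, r = determination_and_correlation(x, y)
print(round(r_squared, 4), round(r, 4), round(r ** 2, 4))
```

For these data the line removes 60% of the variation around the naïve model, and squaring the sample correlation reproduces the same value.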
The value of $r^2$ will always lie between 0 and 1 (or, in percentage,
between 0% and 100%). It is equal to 1 if $\hat{\beta}_1 \neq 0$ and
$rss = 0$, that is, if all the data points lie precisely on the fitted
straight line (i.e. when there is a 'perfect' relationship between $y$ and
$x$). If the coefficient of determination is close to 1, it is an indication
that the data points lie close to the least squares line. The value of
$r^2$ is zero if $rss = s_{yy}$, that is, the fitted straight-line model
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
offers no more information about the value of $y$ than the naïve model
$\hat{y} = \bar{y}$ does.
It is tempting to use $R^2$ as a measure of whether a model is good or
not. This is not appropriate. Try to think of why for a moment
before reading on.
The coefficient of determination is only a measure of how well a
straight-line model describes the variation in the data compared to the
naïve model, not to other models in general. Even if $R^2$
is close to 1 (i.e. a straight line explains a large proportion of
the variation), it could easily be that a non-linear model explains the
variation in the data much better than the linear model. Methods for
assessing the appropriateness of the assumption of a straight-line
relationship between $Y$ and $x$ will be discussed in Module 4.
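A small numerical experiment shows why a high coefficient of determination does not by itself justify a straight-line model. In this sketch the responses are exactly quadratic in the explanatory variable (hypothetical data, with no random error at all), yet the fitted straight line still yields a value close to 1. The function name `fit_line` is an assumption.

```python
def fit_line(x, y):
    """Return (intercept, slope, r_squared, residuals) of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    residuals = [yi - beta0_hat - beta1_hat * xi for xi, yi in zip(x, y)]
    rss = sum(r ** 2 for r in residuals)
    return beta0_hat, beta1_hat, (s_yy - rss) / s_yy, residuals


# Hypothetical data lying exactly on the curve y = x**2.
x = list(range(1, 11))
y = [xi ** 2 for xi in x]
beta0_hat, beta1_hat, r_squared, residuals = fit_line(x, y)

print(round(beta0_hat, 4), round(beta1_hat, 4), round(r_squared, 4))
# Residuals are positive at both ends and negative in the middle.
print([round(r, 1) for r in residuals])
```

The systematic sign pattern in the residuals reveals the curvature that the single summary number hides.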
Example 2.2 (continued) Age and height of children
The relevant summary statistics for these data are
The coefficient of determination is given by
Since the coefficient of determination is very high, the model seems to
describe the variation in the data very well.
In Subsection 2.3.1, we found that the principle of least squares can
provide estimates of the regression parameters in a simple linear regression
model. But in order to fit the model we also need an estimate of the
common variance $\sigma^2$. Such an estimate is required for making
statistical inferences about the true straight-line relationship between
$Y$ and $x$. Since $\sigma^2$ is the common variance of the random errors
$\epsilon_i$, it would be natural to estimate it by
the sample variance of the fitted residuals (2.6). That is, an
estimate would be
$$\frac{1}{n - 1} \sum (\hat{\epsilon}_i - \bar{\hat{\epsilon}})^2,$$
where $\bar{\hat{\epsilon}} = \sum \hat{\epsilon}_i / n$. However, it
can be shown that this is a biased estimate of $\sigma^2$, that
is, the corresponding estimator does not have the 'correct' mean value:
its mean differs from $\sigma^2$. An unbiased
estimate of the common variance, $\sigma^2$, is given by
$$s^2 = \frac{\sum \hat{\epsilon}_i^2}{n - 2}, \qquad \hat{\epsilon}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i. \tag{2.9}$$
The denominator in (2.9) is the residual degrees of freedom (d.f.), that is
d.f. = number of observations - number of estimated parameters.
In particular, for simple linear regression models, we have $n$ observations
and we have estimated the two regression parameters $\beta_0$ and
$\beta_1$, so the residual d.f. is $n - 2$.
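The difference between dividing by the residual degrees of freedom and dividing by one less than the number of observations is easy to see numerically. The sketch below uses hypothetical toy data and a hypothetical helper `variance_estimates`; because the fitted residuals of a line with an intercept average to zero, their sample variance equals the residual sum of squares divided by one less than the number of observations.

```python
def variance_estimates(x, y):
    """Return (unbiased, naive) estimates of the common error variance."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    rss = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
    unbiased = rss / (n - 2)  # divide by the residual degrees of freedom
    naive = rss / (n - 1)     # sample variance of the fitted residuals
    return unbiased, naive


# Hypothetical toy data (not from the module).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
s2, naive = variance_estimates(x, y)
print(round(s2, 4), round(naive, 4))
```

The naive estimate is systematically smaller, which is the direction of the bias described above.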
Example 2.2 (continued) Age and height of children
The relevant summary statistics for these data are
An unbiased estimate of the common variance
is given by
In Section 2.3 we produced an estimate of the straight line that describes
the variation in the data best. However, since the estimated line is based
on the particular sample of data, $x_1, \dots, x_n$ and $y_1, \dots, y_n$,
we have observed, we would almost certainly get a different line if we took
a new sample of data and estimated the line on the basis of the new sample.
For example, if we measured the heights and ages of children in the village
neighbouring the one in Example 2.2, we would invariably get different
measurements, and therefore a different least squares line. In other words:
the least squares line is an observation of a random line which
varies from one experiment to the next. Likewise, the least squares
estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ of the intercept and
slope, respectively, of the least squares line, are both observations of
random variables. These random variables are called the least
squares estimators. (An estimate is non-random and is an observation
of an estimator, which is a random variable.) The least squares
estimators are given by
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x} \tag{2.10}$$
and
$$\hat{\beta}_1 = \frac{S_{xY}}{s_{xx}} = \frac{\sum (x_i - \bar{x})(Y_i - \bar{Y})}{\sum (x_i - \bar{x})^2}, \tag{2.11}$$
where $\bar{Y} = \sum Y_i / n$, and with all summations from $i = 1$
to $n$. By a similar argument we find that an unbiased estimator for the
common variance $\sigma^2$ is given by
$$S^2 = \frac{\sum \hat{\epsilon}_i^2}{n - 2}, \tag{2.12}$$
where $\hat{\epsilon}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$, with
$\hat{\beta}_0$ and $\hat{\beta}_1$ being the least squares estimators.
Note that the randomness in the estimators is due to the response variables
only, since the explanatory variables are non-random. In particular, it can
be seen from (2.10) and (2.11) that $\hat{\beta}_0$ and $\hat{\beta}_1$
are linear combinations of the response variables.
It can be shown that the least squares estimators are unbiased, that
is, that they have the 'correct' mean values:
$$\mathbb{E}[\hat{\beta}_0] = \beta_0 \quad \text{and} \quad \mathbb{E}[\hat{\beta}_1] = \beta_1. \tag{2.13}$$
Also, the estimator $S^2$ is an unbiased estimator of the common variance
$\sigma^2$, that is
$$\mathbb{E}[S^2] = \sigma^2. \tag{2.14}$$
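Unbiasedness can be illustrated (though not proved) by simulation: over many repeated samples from the same model, the averages of the estimates should settle near the true parameter values. The following is a minimal sketch; the true parameter values, the design points, the number of replications, the seed and the normal error distribution are all assumptions chosen for illustration.

```python
import random


def fit(x, y):
    """Return (intercept, slope, variance estimate) from a least squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss / (n - 2)


rng = random.Random(1)          # fixed seed so the run is reproducible
x = [1, 2, 3, 4, 5]             # fixed (non-random) design points
beta0, beta1, sigma = 2.0, 0.5, 1.0
n_sims = 4000

totals = [0.0, 0.0, 0.0]
for _ in range(n_sims):
    # New responses each time; only the y-values are random.
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    for k, estimate in enumerate(fit(x, y)):
        totals[k] += estimate

mean_b0, mean_b1, mean_s2 = (t / n_sims for t in totals)
print(round(mean_b0, 2), round(mean_b1, 2), round(mean_s2, 2))
```

Individual estimates vary considerably from sample to sample; it is only their long-run averages that match the parameters.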
The variances of the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ can
be found from standard results on variances (we shall not do it here). The
variances are given by
$$\mathrm{Var}[\hat{\beta}_0] = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right) \tag{2.15}$$
and
$$\mathrm{Var}[\hat{\beta}_1] = \frac{\sigma^2}{s_{xx}}. \tag{2.16}$$
Note that both variances decrease when the sample size $n$ increases. Also,
the variances decrease if $s_{xx} = \sum (x_i - \bar{x})^2$ is
increased. (That is, if the $x$-values are widely dispersed.) In some
studies, it is possible to design the experiment such that the value of
$s_{xx}$ is high, and hence the variances of the estimators are small. It is
desirable to have small variances, as it improves the precision of results
drawn from the analysis.
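The effect of the design on precision can be made concrete with the standard variance expressions for the least squares estimators: the error variance divided by the corrected sum of squares of the explanatory variable for the slope, and the error variance times $1/n + \bar{x}^2 / s_{xx}$ for the intercept. The sketch below compares two hypothetical designs with the same number of observations and the same mean; the function name and the design points are assumptions.

```python
def estimator_variances(x, sigma2):
    """Return (Var[intercept estimator], Var[slope estimator]) for design x."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    var_b0 = sigma2 * (1 / n + x_bar ** 2 / s_xx)
    var_b1 = sigma2 / s_xx
    return var_b0, var_b1


# Two hypothetical designs: same n and same mean, different spread.
narrow = [2, 2.5, 3, 3.5, 4]
wide = [1, 2, 3, 4, 5]

narrow_var = estimator_variances(narrow, sigma2=1.0)
wide_var = estimator_variances(wide, sigma2=1.0)
print(narrow_var)
print(wide_var)
```

Doubling the spread of the design points multiplies the corrected sum of squares by four, so the variance of the slope estimator drops to a quarter.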
In order to make inferences about the model, such as testing hypotheses and
producing confidence intervals for the regression parameters, we need to
make some assumption on the distribution of the random variables $Y_i$.
The most common assumption, and the one we shall make here, is that the
response variables $Y_i$ are normally distributed.
Module 4 concerns various methods for checking the assumptions of regression
models. In this section, we shall simply assume the following about the
response variables: the $Y_i$s are independent normally distributed random
variables with equal variances and mean values depending linearly on
$x_i$.
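To test hypotheses and construct confidence intervals for the regression
parameters $\beta_0$ and $\beta_1$, we need the distributions of the
parameter estimators $\hat{\beta}_0$ and $\hat{\beta}_1$. Recall from
(2.10) and (2.11) that the least squares
estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are linear combinations
of the response variables $Y_i$. Standard theory on the normal
distribution says that a linear combination of independent, normal random
variables is normally distributed. Thus, since the $Y_i$s are independent,
normal random variables, the estimators $\hat{\beta}_0$ and
$\hat{\beta}_1$ are both normally distributed. In (2.13)-(2.16), we
found the mean values and variances of the estimators. Putting everything
together, we get that
$$\hat{\beta}_0 \sim N\left( \beta_0, \; \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right) \right) \quad \text{and} \quad \hat{\beta}_1 \sim N\left( \beta_1, \; \frac{\sigma^2}{s_{xx}} \right).$$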
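It can be shown that the distribution of the estimator $S^2$ of the
common variance $\sigma^2$ is given by
$$\frac{(n - 2) S^2}{\sigma^2} \sim \chi^2(n - 2),$$
where $\chi^2(n - 2)$ denotes a chi-square distribution with $n - 2$
degrees of freedom. Moreover, it can be shown that the estimator $S^2$
is independent of the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$.
(But the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are, in general,
not mutually independent.)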
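We can use these distributional results to test hypotheses on the regression
parameters. Since both $\hat{\beta}_0$ and $\hat{\beta}_1$ have normal
distributions with variances depending on the unknown quantity
$\sigma^2$, we can apply standard results for normal random variables with
unknown variances. Thus, in order to test $\beta_i$ equal to some value
$\beta_i^{\ast}$, $i = 0, 1$, that is, to test hypotheses of the form
$H_0 : \beta_i = \beta_i^{\ast}$ for $i = 0, 1$, we can use the
$t$-test statistic, given by
$$t_{\hat{\beta}_i}(y) = \frac{\hat{\beta}_i - \beta_i^{\ast}}{\mathrm{se}[\hat{\beta}_i]}, \qquad i = 0, 1, \tag{2.17}$$
where $\mathrm{se}[\hat{\beta}_i]$ denotes the estimated standard error of
the estimator $\hat{\beta}_i$. That is
$$\mathrm{se}[\hat{\beta}_0] = \sqrt{s^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right)}$$
and
$$\mathrm{se}[\hat{\beta}_1] = \sqrt{\frac{s^2}{s_{xx}}}.$$
It can be shown that, under the respective null hypotheses, both test
statistics $t_{\hat{\beta}_0}(Y)$ and $t_{\hat{\beta}_1}(Y)$ have
$t$-distributions with $n - 2$ degrees of freedom.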
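The test statistics can be computed from the same summary quantities as the estimates. The sketch below uses hypothetical toy data with five observations, so the reference distribution has three degrees of freedom; the function name `t_statistics` is an assumption, and the critical value 3.182 is the tabulated 0.975-quantile of a $t$-distribution with three degrees of freedom.

```python
import math


def t_statistics(x, y, beta0_star=0.0, beta1_star=0.0):
    """Return the t-statistics for H0: beta0 = beta0_star and H0: beta1 = beta1_star."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)                                 # unbiased variance estimate
    se_b0 = math.sqrt(s2 * (1 / n + x_bar ** 2 / s_xx))
    se_b1 = math.sqrt(s2 / s_xx)
    return (b0 - beta0_star) / se_b0, (b1 - beta1_star) / se_b1


# Hypothetical toy data (not from the module).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t0, t1 = t_statistics(x, y)

critical = 3.182  # tabulated 0.975-quantile of t with 3 degrees of freedom
print(round(t0, 3), round(t1, 3))
print("reject slope = 0 at the 5% level:", abs(t1) > critical)
```

With so few observations the slope estimate of 0.6 is not significantly different from zero, even though the fitted line rises visibly.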
The test statistics in (2.17) can be used for testing the parameter
$\beta_i$ equal to any value $\beta_i^{\ast}$.
However, for the slope parameter $\beta_1$, one value is particularly
important: if the hypothesis that $\beta_1$ equals zero cannot be rejected,
the simple linear regression model simplifies to
$$Y_i = \beta_0 + \epsilon_i, \qquad i = 1, \dots, n.$$
That is, the mean value of $Y_i$ does not depend on the value of $x_i$. In
other words: the response variable and the explanatory variable are
unrelated!
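It is common, for instance in computer output, to present the estimates and
standard errors of the least squares estimators in a table like the
following.
| Parameter | Estimate | Standard error | $t$-statistic | $p$-value |
|---|---|---|---|---|
| $\beta_0$ | $\hat{\beta}_0$ | $\mathrm{se}[\hat{\beta}_0]$ | $t_{\hat{\beta}_0}(y)$ | $P(\lvert t(n-2) \rvert \geq \lvert t_{\hat{\beta}_0}(y) \rvert)$ |
| $\beta_1$ | $\hat{\beta}_1$ | $\mathrm{se}[\hat{\beta}_1]$ | $t_{\hat{\beta}_1}(y)$ | $P(\lvert t(n-2) \rvert \geq \lvert t_{\hat{\beta}_1}(y) \rvert)$ |
The column '$t$-statistic' contains the $t$-test statistic (2.17)
for testing the hypotheses $H_0 : \beta_0 = 0$ and $H_0 : \beta_1 = 0$,
respectively. (Should you wish to test a parameter equal to a different
value, it is easy to produce the appropriate test statistic (2.17)
from the table.) The column '$p$-value' contains the $p$-values
corresponding to the $t$-test statistic in the same row.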
Example 2.2 (continued) Age and height of children
For the data on age and height of Egyptian children, the table is given by
| Parameter | Estimate | Standard error | $t$-statistic | $p$-value |
|---|---|---|---|---|
|  |  |  |  |  |
|  |  |  |  |  |
Not surprisingly, neither parameter can be tested equal to zero. If, for
some reason, we wished to test whether the slope parameter was equal to
0.58, say, the test statistic would be
Since
in this example, the test statistic has a
$t$-distribution. The
$p$-value for this test is 0.0279; thus, on the basis of
these data, we reject the hypothesis that the slope parameter is 0.58 at the
5% significance level.
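A second practical use of the table is to provide confidence intervals for
the regression parameters. The $(1 - \alpha) 100\%$
confidence intervals for $\beta_0$ and $\beta_1$
are given by, respectively,
$$\hat{\beta}_0 \pm t_{1 - \alpha/2}(n - 2) \, \mathrm{se}[\hat{\beta}_0]$$
and
$$\hat{\beta}_1 \pm t_{1 - \alpha/2}(n - 2) \, \mathrm{se}[\hat{\beta}_1].$$
In order to construct the confidence intervals, all that is needed is the
table and $t_{1 - \alpha/2}(n - 2)$: the $(1 - \alpha/2)$-quantile
of a $t(n - 2)$-distribution.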
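As the text notes, a confidence interval needs only the estimate, its standard error and one quantile. The following is a minimal sketch for hypothetical toy data with five observations; the function name `confidence_intervals` is an assumption, and the quantile is supplied by the caller (here 3.182, the tabulated 0.975-quantile of a $t$-distribution with three degrees of freedom) rather than computed.

```python
import math


def confidence_intervals(x, y, t_quantile):
    """Return ((lo, hi) for the intercept, (lo, hi) for the slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)
    se_b0 = math.sqrt(s2 * (1 / n + x_bar ** 2 / s_xx))
    se_b1 = math.sqrt(s2 / s_xx)
    # Estimate plus/minus quantile times estimated standard error.
    ci_b0 = (b0 - t_quantile * se_b0, b0 + t_quantile * se_b0)
    ci_b1 = (b1 - t_quantile * se_b1, b1 + t_quantile * se_b1)
    return ci_b0, ci_b1


# Hypothetical toy data (not from the module); n = 5, so n - 2 = 3.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
ci_b0, ci_b1 = confidence_intervals(x, y, t_quantile=3.182)
print(tuple(round(v, 3) for v in ci_b0))
print(tuple(round(v, 3) for v in ci_b1))
```

The interval for the slope contains zero, which agrees with a two-sided test at the 5% level not rejecting a zero slope for these data.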
Example 2.2 (continued) Age and height of children
For the data on age and height of Egyptian children, the 95% confidence
intervals for the regression parameters can be obtained from the table for
these data and the 0.975-quantile of a
-distribution:
. The confidence intervals for
and
are, respectively,
and
In this module, the simple linear regression model has been discussed. We
have described a method, based on the principle of least squares, for
fitting simple linear regression models to data. The principle of least
squares says to estimate the regression line by the line which minimises the
sum of the squared deviations of the observed data away from the line. The
intercept and slope of the fitted line are estimates of the regression
parameters $\beta_0$ and $\beta_1$, respectively. Further, an unbiased
estimate of the common variance has been given. Under the assumption of
normality of the response variables, we have tested hypotheses and
constructed confidence intervals for the regression parameters.
Keywords: simple linear regression model, regression parameters,
regression line, linear predictor, observed residuals, residuals, principle
of least squares, residual sum of squares, least squares estimates, least
squares line, fitted regression line, fitted values, predicted values,
fitted residuals, $R^2$, coefficient of determination, bias-corrected
estimate of common variance, degrees of freedom, least squares estimators,
distributions of least squares estimators, hypotheses on regression
parameters, confidence intervals for regression parameters.