ST111: Data
By Pia Veldt Larsen







1 Regression





1.1 Simple regression


Age and height of Egyptian children

Data: ageheight.txt

Keywords: Simple linear regression.

Description: Obviously the height of a child is not constant, but increases over time. On the other hand, it is well known that the growth pattern varies between children. In this dataset the focus is on determining the general growth pattern. One way to explore this is by using the average of several children's heights, as presented in this dataset.

The response variable is the average height of a group of 161 children in Kalama, an Egyptian village that was the site of a study of nutrition in developing countries. The data were obtained by measuring the heights of all 161 children in the village each month over several years. Time is the explanatory variable.

Number of observations: 12

Variable Description
age Age in months
height Average height in centimetres for children at this age
   

Source: DASL.


$ \diamondsuit$

When do babies start to crawl

Data: babycrawl.txt

Keywords: Linear regression, correlation.

Description: This study investigated whether babies take longer to learn to crawl in cold months, when they are often bundled in clothes that restrict their movements, than in warmer months. The study sought an association between babies' first crawling age and the average temperature in the month they first try to crawl (about 6 months after birth).

Parents brought their babies into the University of Denver Infant Study Center between 1988 and 1991 for the study. The parents reported the birth month and the age at which their child was first able to creep or crawl a distance of four feet in one minute.

Data were collected on 208 boys and 206 girls (40 pairs of which were twins). Correlation and regression can be used to examine the relationship between the average crawling age and the average temperature.

There are a few problems with this data set, which might affect analyses:

  1. The babies are not all independent because there are twins in the study.

  2. The normality assumption is dubious since outliers can only occur at higher ages of first crawling.

  3. The study was conducted on self-selected volunteers, who may be different from the general population.
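As a rough sketch of the correlation step mentioned above, the Pearson coefficient between the group-average crawling age and the temperature can be computed directly. The function name `pearson_r` and the toy numbers are illustrative assumptions, not values from babycrawl.txt. The sketch also ignores the group sizes `n`, so it treats the 12 monthly averages as equally reliable.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustration only (NOT the babycrawl.txt values):
# warmer months paired with slightly earlier crawling.
toy_temperature = [30.0, 40.0, 50.0, 60.0]
toy_crawling_age = [33.0, 32.0, 31.5, 30.0]
r = pearson_r(toy_temperature, toy_crawling_age)
```

A negative `r` here only reflects how the toy numbers were chosen; with the real data, the three caveats listed above still apply to any conclusion.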

Number of observations: 12

Variable Description
month Month of birth
crawlingage Average age in weeks that this group learned to crawl
sd Standard deviation of time to crawling for this group
n Number of infants in that birth month group
fahrenheit Average monthly temperature in Fahrenheit six months after birth month
celsius Average monthly temperature in Celsius six months after birth month
   

Source: DASL.


$ \diamondsuit$

Beetles in brackets

Data: beetles.txt

Keywords: Simple linear regression, through origin.

Description: In a botanical experiment a researcher wanted to estimate the number of individuals of a particular species of beetle (Diaperus maculatus) within fruiting bodies ('brackets') of the birch bracket fungus Polyporus betulinus. (This is a shelf fungus that grows on the trunks of dead birch trees.) When the brackets are stored in the laboratory, the beetle larvae within them mature over several weeks; the adults then emerge and can be removed and counted.

Number of observations: 25

Variable Description
weight Weights of the brackets (in grams)
beetles Number of beetles in bracket
   

Source: Pielou, E.C. (1974) Population and Community Ecology: Principles and Methods, Gordon and Breach, New York, pp. 117-121.


$ \diamondsuit$

Brain and body weights of mammals

Data: brainweight.txt

Keywords: Regression, log-log transformation.

Description: The average brain and body weights for 62 species of mammals. In ST111, it is considered as a problem of modeling brain weight as a function of body weight. These data were taken from a larger study and were collected for another purpose.

Number of observations: 62

Variable Description
body Body weight (in kilos)
brain Brain weight (in grams)
   

Source: Allison, T. and Cicchetti, D.H. (1976) Sleep in mammals: Ecological and constitutional correlates, Science, 194, pp. 732-734.


$ \diamondsuit$

Strength of cement

Data: cemstren.txt

Keywords: Regression, non-constant variance, non-linear.

Description: One of the things that influences the tensile strength of cement is the length of time for which the cement is 'cured' (that is, dried). An experiment was set up to test different batches of cement for tensile strength after different curing times.

Number of observations: 21

Variable Description
days Number of days of curing
strength Tensile strength of cement
   

Source: Hald, A. (1952) Statistical Theory with Engineering Applications, New York, John Wiley.


$ \diamondsuit$

Driving a car at constant speed

Data: constantcar.txt

Keywords: Linear regression, through origin.

Description: This is a hypothetical dataset. Imagine driving a car at a constant speed of 50 km/h and observing, at 5-minute intervals, the distance you have travelled.
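To illustrate the "through origin" keyword, here is a minimal sketch of a no-intercept least-squares fit. The helper name `slope_through_origin` and the noise-free readings are assumptions made for illustration; the actual constantcar.txt values were generated in S-Plus and may differ (for example, by including random noise).

```python
def slope_through_origin(x, y):
    """Least-squares slope b for the no-intercept model y = b * x."""
    sxx = sum(a * a for a in x)
    if sxx == 0:
        raise ValueError("x must contain a non-zero value")
    return sum(a * b for a, b in zip(x, y)) / sxx

# Hypothetical noise-free readings: 12 observations, one every 5 minutes.
# (Assumption: the real file may start at a different time and be noisy.)
times_min = [5 * k for k in range(1, 13)]            # 5, 10, ..., 60
distance_km = [50.0 * t / 60.0 for t in times_min]   # 50 km/h

b = slope_through_origin(times_min, distance_km)     # km per minute
speed_kmh = 60.0 * b                                 # back to km/h
```

Forcing the line through the origin is natural here because no distance has been covered at time zero; the fitted slope is then an estimate of the speed.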

Number of observations: 12

Variable Description
time Time in minutes
distance Distance in km
   

Source: The observations are generated using S-Plus.


$ \diamondsuit$

Axial lengths of ice crystals

Data: crystals.txt

Keywords: Simple linear regression.

Description: Measurements were made on the axial lengths of ice crystals at various times between 50 seconds and 180 seconds after introduction into a chamber maintained at a constant temperature of -5 °C.

Number of observations: 43

Variable Description
length Axial length of crystal (in micrometres)
time Time (in seconds) after introduction of the ice crystal into the chamber
   

Source: Ryan, B.F., Wishart, E.R. and Shaw, D.E. (1976) The growth rates and densities of ice crystals between -3 °C and -21 °C, J. Atmospheric Sciences, 33, pp. 842-850.


$ \diamondsuit$

Coal gasification

Data: gasification.txt

Keywords: Linear regression.

Description: The data represent the fuel gas temperature (in degrees Fahrenheit) and unit heat rate (in Btus per kilowatt hour) for a combustion turbine to be used in coal gasification.

Number of observations: 9

Variable Description
temp Fuel gas temperature
heat Unit heat rate
   

Source: Milton, J.S. & Arnold, J.C. (1995) Introduction to probability and statistics, McGraw Hill International Editions, p.445.


$ \diamondsuit$

Snow geese in Canada

Data: geese.txt

Keywords: Regression, non-constant variance.

Description: Two airborne observers (labelled A and B, respectively) were used to estimate the sizes of flocks of snow geese in an area west of Hudson Bay in Canada. In one study, photographs were also taken of the flocks, and careful counts were made from the film.

Number of observations: 45

Variable Description
photo Photographic counts
Aestimate Number estimated by observer A
Bestimate Number estimated by observer B
   

Source: Lunneborg, C.E. (1994) Modeling Experimental and Observational Data, Duxbury Press, p. 115.


$ \diamondsuit$

HDL, cholesterol and triglyceride

Data: hdl.txt

Keywords: Regression.

Description: An experiment involved a quantitative analysis of factors found in high-density lipoprotein (HDL) in a sample of human blood serum. Three variables thought to be predictive of or associated with HDL measurements are the total cholesterol and total triglyceride concentration in the sample, and the presence or absence of a certain sticky component called sinking pre-beta (or SPB). The data in this data set correspond to samples for which the SPB was absent.

Number of observations: 21

Variable Description
hdl Concentration of HDL
cholest Total concentration of cholesterol
triglyc Total concentration of triglyceride
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E., Nizam, A. (1998) Applied Regression Analysis and Other Multivariable Methods, 3rd Edition, Duxbury Press, Brooks/Cole Publishing Company, p. 202.


$ \diamondsuit$

House prices in Odense

Data: houses.txt

Keywords: Linear regression.

Description: It is natural to expect that the larger the house, the higher the price. That is, we expect the price and size of a house to be positively correlated. Ten houses were randomly selected from newspaper ads for houses. The relationship between area and price is weak, suggesting that it is not only the size of a house that determines the price.

Number of observations: 10

Variable Description
size Size of the house in m$^2$
price Price in 1000 DKK
   

Source: Bent Jørgensen.


$ \diamondsuit$

Measuring mobility of elderly people

Data: mobility.txt

Keywords: Linear regression, correlation.

Description: This dataset concerns the comparison of two measures of the mobility of elderly people. The two methods to be compared are the Berg score and Timed Up and Go (TUG). The Berg score is a measure based on how well the person performs in a number of different tasks. A low score corresponds to low mobility. The TUG score is simply the time it takes a person to get up from a chair, walk three metres and return to the chair. Measuring the Berg score is much more demanding and time-consuming than measuring the TUG score. It is of interest to determine the relationship between the results obtained using the two methods. If there is a strong relation, the fast method can be used as a good predictor of the slow method.

Number of observations: 16

Variable Description
tug TUG score
berg Berg score
   

Source: Dorte Skovhede.


$ \diamondsuit$

Olympic gold medal performances

Data: olympic.txt

Keywords: Linear regression, time series, prediction.

Description: This dataset contains the gold medal performances in the men's long jump, high jump and discus for the modern Olympic games from 1900 to 1984. Regressions and scatterplots of performance variables versus year show performance improvement.

The World Wars create some gaps in the data and can be seen in the graphical displays.

Number of observations: 20

Variable Description
high Height of high jump (cm)
discus Distance of throw (cm)
long Distance of jump (cm)
year Year of the Olympic Games
   

Source: DASL - but converted from inches to cm. Data from 1988 and 1992.


$ \diamondsuit$

Strength of Kraft paper

Data: paper.txt

Keywords: Regression, transformation.

Description: The tensile strength (p.s.i.) of Kraft paper was measured against the percentage of hardwood in the batch of pulp from which the paper was produced.

Number of observations: 19

Variable Description
strength Tensile strength
wood Percentage of hardwood in pulp
   

Source: Joglekar, G., Schuenemeyer, J.H. and LaRiccia, V. (1989) Lack-of-fit testing when replicates are not available, American Statistician, 43, pp. 135-143.


$ \diamondsuit$

Road and map distances

Data: roadmap.txt

Keywords: Linear regression, through the origin.

Description: This dataset contains the distances (in miles) by road, and the corresponding straight line distances (measured from a map) between twenty different pairs of points in Sheffield.

Number of observations: 20

Variable Description
road Road distances (in miles)
map Map distances (in miles)
   

Source: Gilchrist, W. (1984) Statistical modelling, John Wiley and Sons, Chichester, p.5.


$ \diamondsuit$

Dry matter and ascorbic acid in spinach

Data: spinach.txt

Keywords: Linear regression.

Description: This dataset stems from a study concerning the preservation of ascorbic acid in vegetables during drying and storing. The amount of acid preserved is the response variable, while the percentage dry matter is the explanatory variable.

Number of observations: 24

Variable Description
dry Percentage dry matter after drying at 90 °C
acid Percentage preserved ascorbic acid
   

Source: Hald, A. (1952) Statistical Theory with Engineering Applications, New York: Wiley.


$ \diamondsuit$

Memory retention

Data: strong.txt

Keywords: Regression, non-linear, non-constant variance.

Description: This is the psychologist Strong's famous data set on memory retention. Average percentage memory retention was measured against passing time. The measurements were taken five times during the first hour after subjects memorized a list of disconnected items, and then at various times up to a week later.

Number of observations: 13

Variable Description
memory Percentage of memory retention
time Times (in minutes)
   

Source: Mosteller, F., Rourke, R.E.K. and Thomas, G.B. (1970) Probability with statistical applications, 2nd edn. Addison-Wesley, p. 383.


$ \diamondsuit$

Advertising yield and spending

Data: tvads.txt

Keywords: Non-linear regression.

Description: These data concern the relation between advertising spending and advertising yield.

Number of observations: 21

Variable Description
company Company name
budget TV advertising budget, 1983 ($ millions)
impression Millions of retained impressions per week
   

Source: DASL.


$ \diamondsuit$

Enzymatic reaction

Data: velocity.txt

Keywords: Non-linear regression.

Description: These data represent the velocity of an enzymatic reaction as a function of substrate concentration.

Number of observations: 12

Variable Description
Velocity The counts per minute of radioactive product from the reaction
Concentration The substrate concentration (in parts per million)
   

Source: Severini, T.A. (2000) Likelihood methods in statistics, New York: Oxford University Press, p. 356.


$ \diamondsuit$

Wind power

Data: windspeed.txt

Keywords: Regression, transformation.

Description: These data concern the production of power from windmills. Direct current output was measured against wind speed (in miles per hour).

Number of observations: 25

Variable Description
output Current output produced by the windmill
speed Wind speed (in miles per hour)
   

Source: Joglekar, G., Schuenemeyer, J.H. and LaRiccia, V. (1989) Lack-of-fit testing when replicates are not available, American Statistician, 43, pp. 135-143.


$ \diamondsuit$




1.2 Multiple regression


Frank Anscombe's regression examples

Data: anscombe.txt

Keywords: Linear Regression, correlation.

Description: It is well known that correlation coefficients can be misleading. This dataset was constructed to shed light on whether the same can be the case for linear regression. Make the following four scatterplots: $x$ vs $y_1$, $x$ vs $y_2$, $x$ vs $y_3$, and $x_4$ vs $y_4$ ($x$ and $x_4$ are considered as explanatory variables). Then fit the four simple linear regressions and compare the results.
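The four fits can be sketched as follows. The helper `fit_line` is a hypothetical name, and the numbers are the widely reproduced values of Anscombe's quartet; that anscombe.txt stores exactly these values is an assumption.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1*x; returns (b0, b1, r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    r2 = sxy ** 2 / (sxx * syy)   # squared correlation
    return b0, b1, r2

# Widely reproduced quartet values (assumed to match anscombe.txt).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

pairs = {"y1~x": (x, y1), "y2~x": (x, y2), "y3~x": (x, y3), "y4~x4": (x4, y4)}
fits = {name: fit_line(xs, ys) for name, (xs, ys) in pairs.items()}
for name, (b0, b1, r2) in fits.items():
    print(f"{name}: intercept={b0:.2f} slope={b1:.2f} R^2={r2:.2f}")
```

With these values all four fits give an intercept close to 3 and a slope close to 0.5, which is why the scatterplots, not the fitted coefficients, are what distinguish the four cases.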

Number of observations: 11

Variable Description
x An explanatory variable
y1 A response variable
y2 A response variable
y3 A response variable
x4 An explanatory variable
y4 A response variable
   

Source: Jerry Dallal's Tufts Home Page.


$ \diamondsuit$

Heat from cement hardening

Data: cement.txt

Keywords: Multiple linear regression.

Description: The heat evolved during cement hardening is influenced by the composition of the cement. The heat evolved was measured for a number of samples. Further, the contents of tricalcium-aluminate (TA), tricalcium-silicate (TS) and tetracalcium-alumino-ferrite (TAF) were measured for each sample of cement.

Number of observations: 13

Variable Description
heat Evolved heat (in calories/g)
TA Amount (as percentage of weight) of tricalcium-aluminate (TA)
TS Amount (as percentage of weight) of tricalcium-silicate (TS)
TAF Amount (as percentage of weight) of tetracalcium-alumino-ferrite (TAF)
   

Source: Woods, H., Steiner, H.H. and Starke, H.R. (1932) Effects of composition of Portland cement on heat evolved during hardening, Industrial and Engineering Chemistry, 24, pp. 1207-1212.


$ \diamondsuit$

Weight, height and age of children

Data: children.txt

Keywords: Multiple linear regression.

Description: The weight, height, and age were measured for each member of a sample of 12 children with a particular kind of nutritional deficiency.

Number of observations: 12

Variable Description
weight The child's weight
height The child's height
age The child's age
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E., Nizam, A. (1998) Applied Regression Analysis and Other Multivariable Methods, 3rd Edition, Duxbury Press, Brooks/Cole Publishing Company, p. 112.


$ \diamondsuit$

Processing copper

Data: copper.txt

Keywords: Multiple linear regression, residual analysis.

Description: This dataset relates to the processing of copper ore in a given calendar month. The response variable ($y$) is the percentage of copper recovered for a certain production process. The explanatory variables are shown in the table below.

Number of observations: 24

Variable Description
Date Date of production
Solids Percentage of solids in the ore
Mesh A measure of mesh size
y Percentage of copper recovered for a certain production process
Retention Retention time
   

Source: Jørgensen, B. (1993) The Theory of Linear Models, Chapman & Hall. (Originally supplied by R.J. MacKay.)


$ \diamondsuit$

Detoxification of malathion in chickens

Data: detox3mc.txt

Keywords: Multiple linear regression.

Description: It is known that one can alter the toxicity of various types of chemicals (e.g. drugs, pesticides or insecticides) in mammals by inducing liver enzyme activity. This example relates to a study investigating the relationship between detoxification of malathion (an insecticide containing phosphorus) and induced enzyme activity in chickens. Five different enzyme activities were induced using the enzyme inducer 3-methylcholanthrene (3-MC).

Number of observations: 10

Variable Description
detox Detoxification of malathion (in percent, relative to a control, untreated, chicken).
enzyme1 Enzyme 1 activity in a treated chicken, relative to a control chicken (in %)
enzyme2 Enzyme 2 activity in a treated chicken, relative to a control chicken (in %)
enzyme3 Enzyme 3 activity in a treated chicken, relative to a control chicken (in %)
enzyme4 Enzyme 4 activity in a treated chicken, relative to a control chicken (in %)
enzyme5 Enzyme 5 activity in a treated chicken, relative to a control chicken (in %)
   

Source: Ehrich, M., Larson, C. and Arnold, J. (1983) Organophosphate Detoxification Related by Induced Hepatic Microsomal Enzymes in Chickens, American Journal of Veterinary Research, 45.


$ \diamondsuit$

Detoxification of malathion in chickens

Data: detoxbht.txt

Keywords: Multiple linear regression.

Description: It is known that one can alter the toxicity of various types of chemicals (e.g. drugs, pesticides or insecticides) in mammals by inducing liver enzyme activity. This example relates to a study investigating the relationship between detoxification of malathion (an insecticide containing phosphorus) and induced enzyme activity in chickens. Five different enzyme activities were induced using the enzyme inducer butylated hydroxytoluene (BHT).

Number of observations: 10

Variable Description
detox Detoxification of malathion (in percent, relative to a control, untreated, chicken).
enzyme1 Enzyme 1 activity in a treated chicken, relative to a control chicken (in %)
enzyme2 Enzyme 2 activity in a treated chicken, relative to a control chicken (in %)
enzyme3 Enzyme 3 activity in a treated chicken, relative to a control chicken (in %)
enzyme4 Enzyme 4 activity in a treated chicken, relative to a control chicken (in %)
enzyme5 Enzyme 5 activity in a treated chicken, relative to a control chicken (in %)
   

Source: Ehrich, M., Larson, C. and Arnold, J. (1983) Organophosphate Detoxification Related by Induced Hepatic Microsomal Enzymes in Chickens, American Journal of Veterinary Research, 45.


$ \diamondsuit$

Suppliers of medical devices

Data: devices.txt

Keywords: Multiple regression.

Description: Medical devices from three different suppliers for the continuous delivery of an anti-inflammatory hormone were tested on 27 patients.

Number of observations: 27

Variable Description
supplier The supplier of the device (labelled 1,2 or 3)
time Time the device was used (in hours)
remains Amount remaining in the device after use
   

Source: Efron, B. and Tibshirani, R.J. (1993) An Introduction to the Bootstrap, Chapman and Hall, New York, p. 107.


$ \diamondsuit$

Fuel consumption

Data: fuel.txt

Keywords: Multiple linear regression.

Description: For the 48 contiguous states, a number of variables were recorded in 1971-2: population size, motor fuel tax rate, number of licensed drivers, per capita income, extent of federal-aid primary highways, and fuel consumption.

Number of observations: 48

Variable Description
pop 1971 Population, in thousands
tax 1972 Motor fuel tax rate, in cents per gallon
nlic 1971 Thousands of licensed drivers
inc 1972 Per capita income in thousands of dollars
road 1971 Thousands of miles of federal-aid primary highways
fuelc 1972 Fuel consumption, in millions of gallons
dlic 1971 Percentage of population with driver's license
fuel Motor fuel consumption in gallons per person
   

Source: Weisberg, S. (1985) Applied Linear Regression, Wiley, p. 34.


$ \diamondsuit$

Analytical skills of young gifted children

Data: gifted.txt

Keywords: Multiple regression.

Description: An investigator is interested in understanding the relationship, if any, between the analytical skills of young gifted children and the following variables: father's IQ, mother's IQ, age in months when the child first said 'mummy' or 'daddy', age in months when the child first counted to 10 successfully, average number of hours per week the child's mother or father reads to the child, average number of hours per week the child watched an educational program on TV during the past three months, and average number of hours per week the child watched cartoons on TV during the past three months. The analytical skills are evaluated using a standard testing procedure, and the score on this test is used as the response variable.

Data were collected from schools in a large city on a set of thirty-six children who were identified as gifted children soon after they reached the age of four.

Number of observations: 36

Variable Description
score Score in test of analytical skills
fatheriq Father's IQ
motheriq Mother's IQ
speak Age in months when the child first said 'mummy' or 'daddy'
count Age in months when the child first counted to 10 successfully
read Average number of hours per week the child's mother or father reads to the child
edutv Average number of hours per week the child watched an educational program on TV during the past three months
cartoons Average number of hours per week the child watched cartoons on TV during the past three months
   

Source: Graybill, F.A. & Iyer, H.K., (1994) Regression Analysis: Concepts and Applications, Duxbury, p. 511-6.


$ \diamondsuit$

Household spending in grocery stores

Data: grocery.txt

Keywords: Multiple regression.

Description: The manager of the marketing division of a grocery store chain wants to conduct a study in a particular US city, where the company wants to open a store, to understand the relationship between the number of dollars a household spends in grocery stores each month and the following variables: monthly income for the household, number of children in the household, and number of adults in the household. A group of 27 grocery shoppers was selected by simple random sampling from a study population and asked to provide the needed information.
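A multiple regression of this kind, with one response and three explanatory variables, could be fitted along the following lines. This is only a sketch: `ols` is a hypothetical helper that solves the normal equations by Gaussian elimination, and the toy rows are invented (with income in thousands of dollars to keep the arithmetic well conditioned, unlike the file, which records US$). A statistics package would normally be used instead.

```python
def ols(X, y):
    """Least-squares coefficients for y = b0 + b1*x1 + ... + bp*xp.

    X is a list of rows (one per observation, without the constant column).
    Solves the normal equations (A'A) b = A'y by Gauss-Jordan elimination.
    """
    A = [[1.0] + [float(v) for v in row] for row in X]
    p = len(A[0])
    # Augmented matrix [A'A | A'y]
    M = [[sum(r[i] * r[j] for r in A) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(A, y))] for i in range(p)]
    for col in range(p):
        # Partial pivoting for numerical stability
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ValueError("design matrix is (near) singular")
        M[col], M[piv] = M[piv], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]

# Invented toy rows: (income in $1000, children, adults). NOT grocery.txt.
rows = [(2.0, 0, 1), (3.0, 1, 2), (2.5, 2, 2),
        (4.0, 3, 1), (3.5, 1, 1), (5.0, 2, 3)]
# Toy response built from known coefficients so the fit can be checked.
amount = [100 + 100 * inc + 50 * ch + 80 * ad for inc, ch, ad in rows]

coef = ols(rows, amount)   # [intercept, income, children, adults]
```

Because the toy response is constructed without noise, the fit simply recovers the coefficients used to build it; with the real 27 households the residuals would also need to be examined.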

Number of observations: 27

Variable Description
amount Monthly amount spent by household in grocery stores (in US$)
income Monthly income for the household (in US$)
children Number of children in the household
adults Number of adults in the household
   

Source: Graybill, F.A. and Iyer, H.K. (1994) Regression Analysis: Concepts and Applications, Duxbury, p. 286.


$ \diamondsuit$

Man-hours at naval hospitals

Data: hospitals.txt

Keywords: Multiple linear regression.

Description: A study was made on monthly man-hours associated with maintaining the anesthesiology service for twelve naval hospitals in the United States.

Number of observations: 12

Variable Description
manhours Monthly number of man-hours
cases Monthly number of surgical cases
population Eligible population (in thousands)
rooms Number of operating rooms in the hospital
   

Source: Brooks, D.G., Carroll, S.S. and Verdini, W.A. (1988) Characterizing the domain of a regression model, American Statistician, 42, pp. 187-190.


$ \diamondsuit$

Ice cream consumption

Data: icecream.txt

Keywords: Linear regression, time series, ANCOVA, autocorrelation.

Description: The purpose of the study was to determine if ice cream consumption depends on the variables price, income, or temperature. Further the variables Lag-temp (the temperature the next month) and Year have been added to the original data.

Ice cream consumption was measured over 30 four-week periods from March 18, 1951 to July 11, 1953.

Number of observations: 30

Variable Description
period Identifier for the four week period (1-30).
IC Ice cream consumption in pints per capita
price Price of ice cream per pint in dollars
income Weekly family income in dollars
temp Average temperature in Fahrenheit
year Year within the study (0=1951, 1=1952, 2=1953)
   

Source: Koteswara Rao Kadiyala (1970) Testing for the independence of regression disturbances, Econometrica, 38, pp. 97-117. Also found in: Hand, D.J., et al. (1994) A Handbook of Small Data Sets, London: Chapman & Hall, p. 214. DASL.


$ \diamondsuit$

Holiday cottages

Data: odsherred.txt

Keywords: Multiple linear regression.

Description: These data contain the sales prices of 5 holiday cottages in Odsherred, Denmark, together with the age and the livable area of each house.

Number of observations: 5

Variable Description
price Price, in DKK 1000 (Danish kroner)
age Age of the house, in years
area Livable area, in square metres
   

Source: Data from Nybolig, May 2003.


$ \diamondsuit$

Cracking paint

Data: paintcrack.txt

Keywords: Regression, transformation.

Description: A research study was conducted on cracking of latex paint on wooden structures. The primary concern in the study is to investigate the effect of water permeability and fracture energy (energy to propagate a crack through paint film) on paint crack rating.

Number of observations: 10

Variable Description
rating Crack rating (between 0-10)
permeability Water permeability
energy Fracture energy
   

Source: Milton, J.S. and Arnold, J.C. (1995) Introduction to Probability and Statistics, 3rd ed., McGraw Hill, p. 524.


$ \diamondsuit$

Level of pathology in psychotic patients

Data: pathology.txt

Keywords: Multiple regression.

Description: These data refer to a study of whether the level of pathology in psychotic patients 6 months after treatment can be predicted with reasonable accuracy from knowledge of pre-treatment symptom ratings of thinking disturbance and hostile suspiciousness.

Number of observations: 53

Variable Description
pathology Level of pathology
thinking Pre-treatment symptom rating of thinking disturbance
suspicious Pre-treatment symptom rating of hostile suspiciousness
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E., Nizam, A. (1998) Applied Regression Analysis and Other Multivariable Methods, 3rd Edition, Duxbury Press, Brooks/Cole Publishing Company, p. 125.


$ \diamondsuit$

Survival times of silkworm larvae

Data: silkworm.txt

Keywords: Multiple linear regression.

Description: An experiment was conducted in order to describe the toxic action of a certain chemical on silkworm larvae. The larvae were fed various doses of the chemical, and the survival times (i.e. time until death) were recorded, together with the weights of the larvae.

Number of observations: 15

Variable Description
survival Survival times of the larvae
dose Doses of the chemical
weight Weights of the larvae
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E., Nizam, A. (1998) Applied Regression Analysis and Other Multivariable Methods, 3rd Edition, Duxbury Press, Brooks/Cole Publishing Company.


$ \diamondsuit$

Predicting water supply from rainfall

Data: stream.txt

Keywords: Multiple linear regression, collinearity.

Description: Can Southern California's water supply be predicted from past rainfall data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners and policy makers could do their jobs more efficiently. The dataset contains 43 years' worth of precipitation measurements taken at four sites in the Owens Valley. Much of the water for Southern California is supplied by the Owens Valley Aqueduct. If the Owens Valley and the nearby Sierra Mountains get little rain, then water will be low in the aqueduct. The explanatory variables, labeled APSAB, APSLAKE, OPRC and OPSLAKE, are rainfall measurements for four sites in or near Owens Valley. The response, labeled BSAAM, is the stream runoff at a site near Bishop, California. Stream runoff volume is a stand-in for the volume of water delivered to the aqueduct.

The explanatory variables are highly correlated, which implies that two different models give adequate fits, depending on the test procedure.

Number of observations: 43

Variable Description
obs Observation number
year Year
APSAB Rainfall measurement
APSLAKE Rainfall measurement
OPRC Rainfall measurement
OPSLAKE Rainfall measurement
BSAAM Stream runoff
   

Source: Bent Jørgensen.


$ \diamondsuit$

Timber yield of cherry trees

Data: trees.txt

Keywords: Multiple regression.

Description: In order to find an estimate for the volume of a tree (and thereby the timber yield), the volumes, heights and diameters were collected for a sample of 31 black cherry trees in the Allegheny National Forest, Pennsylvania.

Number of observations: 31

Variable Description
volume Volume of the tree (in cubic feet)
height Height of the tree (in feet)
diameter Diameter of the tree (in inches, at 54 inches above ground)
   

Source: Atkinson, A.C. (1982) Regression diagnostics, transformations and constructed variables (with discussion). J. Royal Statistical Society, Series B, 44, pp. 1-36.


$ \diamondsuit$

Water usage of power plant

Data: water.txt

Keywords: Multiple linear regression.

Description: A production plant cost-control engineer is responsible for cost reduction. One of the costly items in his plant is the amount of water used by the production facilities each month. He decided to investigate water usage by collecting seventeen observations on his plant's water usage and other variables.

Number of observations: 17

Variable Description
temperature Average monthly temperature (F)
production Amount of production (M pounds)
days Number of plant operating days in the month
workers Number of workers on the monthly plant payroll
water Monthly water usage (gallons)
   

Source: OzDASL, Draper, N.R., and Smith, H. (1981) Applied Regression Analysis, 2nd Edition, Wiley: New York.


$ \diamondsuit$




1.3 ANCOVA


Algae in Danish lakes

Data: algae.txt

Keywords: ANCOVA.

Description: The data in this project concern the relationship between biomass (measured as the bio volume) and the concentration of the pigment chlorophyll a in lakes dominated by three common types of algae. (A lake is said to be dominated by a specific type of algae if at least 80% of the total biomass consists of that type.) The data were collected in 17 monitored lakes in the period 1989-99. Information about which lakes the different measurements were taken from is not available.

Number of observations: 584

Variable Description
class Type of algae (kisel, bluegreen or fure)
biovolume Bio volume (in mm$^3$/l)
chlorophyll Concentration of chlorophyll a (in mg/l)
   

Source: The data are provided by Anne Lilholt, Institute of Biology, SDU.


$ \diamondsuit$

Relation between live weight and dried weight of collembola

Data: collembola.txt

Keywords: Linear regression, ANCOVA, transformation.

Description: In the ecology of soils, one is interested in, among other things, measuring the biomass in the topsoil. A traditional measure is the live weight of small animals that live in the layers of the soil. However, it is difficult to obtain the live weight in practical field experiments. Instead, the small animals are extracted from the soil and dried, and the dry weight is determined. Hence we are interested in a model that predicts the live weight ($ LW$ ) from the dried weight ($ DW$ ). Traditionally the live weight ($ LW$ ) is calculated as a given percentage $ \alpha$ , say, of the dried weight ($ DW$ ), i.e. a functional relationship of the form

$\displaystyle LW = \alpha DW,$

where $ LW$ is the response and $ DW$ is the explanatory variable. A log-transformation leads to

$\displaystyle \log(LW) = \log(\alpha) + \log(DW)$

which results in a simple linear regression model

$\displaystyle \log(LW) = \log(\alpha) + \beta_1\log(DW) + \varepsilon$ (1.1)

$\displaystyle \phantom{\log(LW)} = \beta_0 + \beta_1\log(DW) + \varepsilon,$ (1.2)

where $ \beta_0=\log(\alpha)$ and $ \varepsilon \sim N(0,\sigma^2)$ . We have also introduced a slope parameter $ \beta_1$ . If the proposed functional relationship $ LW = \alpha DW$ holds, the slope should be $ 1$ . By re-expressing the model using the log-transform, we can test whether the slope is $ 1$ . We can also obtain an estimate of $ \beta_0$ and from this find $ \alpha$ , the parameter of interest.

Number of observations: 37

Variable Description
species Species of collembola (3 different)
logLW The logarithm of the live weight
logDW The logarithm of the dried weight
   

Source: Anvendt Statistik (1983-85), Opgaver, Vol 1-4. (Eds. Andersen, A.H. and Keiding, N.) Department of Theoretical Statistics, University of Aarhus.
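
The slope test described for this dataset can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `fit_line` is hypothetical, the numbers are made up (they are not the collembola measurements), and natural logarithms are assumed when back-transforming the intercept.

```python
import math

def fit_line(x, y):
    """Least-squares intercept, slope, and standard error of the slope."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b0, b1, se_b1

# Made-up illustrative values, NOT the data in collembola.txt.
log_dw = [0.0, 1.0, 2.0, 3.0, 4.0]
log_lw = [1.1, 1.9, 3.2, 3.9, 5.4]

b0, b1, se_b1 = fit_line(log_dw, log_lw)
t_stat = (b1 - 1.0) / se_b1   # compare with a t distribution on n - 2 df
alpha_hat = math.exp(b0)      # back-transform, assuming natural logarithms
```

A value of `t_stat` that is small relative to the t distribution on n - 2 degrees of freedom gives no evidence against a slope of 1. The sketch ignores the species factor, which the ANCOVA keyword suggests should also be examined.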


$ \diamondsuit$

Rats on diets

Data: diet.txt

Keywords: Comparing two regression lines.

Description: In a study on the effects of different diets on rats, two groups of rats (on different diets) were weighed weekly over a four-week period. (The rats had been on the diets for some time before the four-week period.)

Number of observations: 48

Variable Description
weight Weight of rat (in grams)
week Week number
diet Indicator for diet: diet=1 if rat is on first diet, diet=0 otherwise
   

Source: Crowder, M.J. and Hand, D.J. (1990) Analysis of repeated measures, Chapman and Hall, London, p. 19.
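
One way to compare two regression lines is to fit a line within each group and look at the differences in intercept and slope. These differences equal the coefficients of the indicator and of the indicator-by-covariate interaction in a single model. The sketch below assumes made-up rows (not the values in diet.txt) and hypothetical helper names `fit_line` and `group_fit`.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one group."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Made-up rows (weight, week, diet), NOT the values in diet.txt.
rows = [
    (250, 1, 1), (262, 2, 1), (271, 3, 1), (283, 4, 1),
    (240, 1, 0), (246, 2, 0), (253, 3, 0), (259, 4, 0),
]

def group_fit(rows, diet):
    """Fit weight on week using only the rows with the given diet code."""
    weeks = [w for (_, w, d) in rows if d == diet]
    weights = [y for (y, _, d) in rows if d == diet]
    return fit_line(weeks, weights)

int0, slope0 = group_fit(rows, 0)   # line for diet=0
int1, slope1 = group_fit(rows, 1)   # line for diet=1

# In the single model  weight = g0 + g1*diet + g2*week + g3*diet*week
# the least-squares estimates are g0 = int0, g2 = slope0,
# g1 = int1 - int0 and g3 = slope1 - slope0.
g1 = int1 - int0
g3 = slope1 - slope0
```

Testing whether `g3` is zero asks whether the two lines are parallel; testing `g1` asks whether they share an intercept. The sketch only gives the point estimates, not the tests.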


$ \diamondsuit$

Market research

Data: market.txt

Keywords: Comparing regression lines.

Description: Market research was conducted for a national retail company to compare the relationship between sales and advertising during the warm Spring and Summer seasons with that during the cold Autumn and Winter seasons. The data were collected over a period of several years.

Number of observations: 18

Variable Description
season The season (Warm=0, Cool=1)
expenditure The advertising expenditure (in million $)
revenue The sales revenue (in million $)
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E., Nizam, A. (1998) Applied Regression Analysis and Other Multivariable Methods, 3rd Edition, Duxbury Press, Brooks/Cole Publishing Company, p. 353.


$ \diamondsuit$

Determining the speed of light in 1879

Data: michelson.txt

Keywords: Comparing regression lines.

Description: In 1879, Michelson undertook a set of experiments in order to determine the speed of light. A total of 100 sets of experiments were undertaken over several days (mornings and afternoons) between June 5 and July 2 1879. In each set of experiments he made 10 speed determinations and reported the average of the ten values. (This dataset only contains 98 of his measurements.)

With modern technology, the speed of light has been measured to be 299,792.46 km per second in a vacuum. So Michelson's measurements were impressively precise all things considered.

Number of observations: 98

Variable Description
speed Average of the ten speed values measured in experiment (in km per second)
temp Ambient temperature at time of experiment (in Fahrenheit)
ampm Time of day (am=0, pm=1)
   

Source: Michelson, A.A. (1880) Experimental determination of the velocity of light made at the U.S. Naval Academy, Annapolis, Astronomical papers, 1, 109-145.


$ \diamondsuit$

Photosynthesis and radiation

Data: photosyn.txt

Keywords: Comparing regression lines.

Description: A study was made into photosynthesis rate and radiation for three levels (Low, Medium and High) of water availability. It is desired to determine how similar the relationship between the photosynthesis rate and radiation is at the three water levels.

Number of observations: 15

Variable Description
photosyn Photosynthesis rate
radiation Radiation
level1 Indicator for low water level (low=1, 0 otherwise)
level2 Indicator for medium water level (medium=1, 0 otherwise)
level Water level (low=1, medium=2, high=3)
   

Source: Krzanowski, W.J. (1998) Statistical Modelling, Arnold, London, p. 107.


$ \diamondsuit$

Systolic blood pressure for journalists and university teachers

Data: pressure.txt

Keywords: Linear regression, ANCOVA, same intercept.

Description: It is well-known that blood pressure increases with age. In this dataset we examine this relation. Age and systolic blood pressure were measured for 28 males. Fifteen of these are university teachers, while the remaining 13 are journalists. Along with the interest in the overall increase in systolic blood pressure, we can compare the regression lines between the two groups. [Note: data are not genuine.]

Number of observations: 28

Variable Description
occupation Occupation (0=journalist, 1=university lecturer)
age Age in years
systolic Systolic blood pressure in mmHg
   

Source: Blæsild, P. and Granfeldt, J., (1995), Statistik for biologer og geologer, Det Naturvidenskabelige Fakultet, University of Aarhus, Denmark.


$ \diamondsuit$

Rusting chondrites

Data: rust.txt

Keywords: Comparing regression lines.

Description: For a number of chondrites, the age and type (Type I or Type II) were noted, and the rust component was measured. Initially, the research question concerned the dependence of rustiness on age. A further question is whether there is a difference between the two types of chondrites, Type I and Type II.

Number of observations: 82

Variable Description
type Type of chondrite (Type I=0, Type II=1)
age Age of chondrite (in years)
rust Rust component (in percent)
   

Source: Dr T.B. Smith, The Open University.


$ \diamondsuit$

IQs of twins

Data: twins.txt

Keywords: Comparing regression lines.

Description: These data concern IQ scores of identical twins, one raised in a foster home and the other raised by natural parents. The data are divided into three groups according to the social class of the natural parents.

Number of observations: 27

Variable Description
foster IQ score of the twin raised in a foster home
natural IQ score of the twin raised by natural parents
high Twins from high social class (high=1, 0 otherwise)
middle Twins from middle social class (middle=1, 0 otherwise)
class Social class (high=1, middle=2, low=3)
   

Source: Weisberg, S. (1985) Applied linear regression, John Wiley & Sons, p. 180.


$ \diamondsuit$

Formaldehyde in homes

Data: uffi.txt

Keywords: Comparing regression lines.

Description: The data were collected to see if the presence of urea formaldehyde foam insulation (UFFI) had an effect on the formaldehyde concentration in homes.

Number of observations: 24

Variable Description
formaldehyde Average concentration of formaldehyde measured over a week (in ppb)
airtightness A measure of airtightness (calculated from several other measurements)
uffi Indicator for whether or not UFFI was used (uffi=1 if used, uffi=0 otherwise)
   

Source: Jørgensen, B. (1993) The Theory of Linear Models, Chapman & Hall, p. 120. (Originally supplied by R.J. MacKay.)


$ \diamondsuit$




1.4 Logistic regression


Surviving third-degree burns

Data: burns.txt

Keywords: Logistic regression.

Description: These data refer to 435 adults who were treated for third-degree burns by the University of Southern California General Hospital Burn Center. The patients were grouped according to the area of third-degree burns on the body. (The groups are identified as midpoints of set intervals of log(area +1).) For each patient, it was recorded whether or not they survived, and the area of their burn was recorded as the midpoint of the group corresponding to their burn.

Number of observations: 435

Variable Description
midpoint Midpoint of the group corresponding to the patient's burn
survive Binary variable: survived=1, died=0
   

Source: Fan, J., Heckman, N.E. and Wand, M.P. (1995) Local polynomial kernel regression for generalised linear models and quasi-likelihood functions, Journal of the American Statistical Association, 90, pp. 141-50.
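
A logistic regression of survive on midpoint can be fitted by Newton-Raphson iteration on the log-likelihood. The following is a minimal sketch with a single predictor, not the analysis from the source: the function name `fit_logistic` is hypothetical and the data values are made up, not taken from burns.txt.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson for logit P(y=1) = b0 + b1*x with a single predictor."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up illustrative values, NOT the data in burns.txt.
midpoint = [1.0, 1.0, 1.5, 1.5, 2.0, 2.0, 2.5, 2.5]
survive = [1, 1, 1, 0, 1, 0, 0, 0]

b0, b1 = fit_logistic(midpoint, survive)
```

A negative `b1` means the estimated survival probability decreases as the burn-area midpoint increases. The sketch has no convergence check and would fail on perfectly separated data, so a statistical package is preferable for real analyses.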


$ \diamondsuit$

Criminal conviction after treatment for drug abuse

Data: convict.txt

Keywords: Logistic regression.

Description: A study was carried out on factors related to a criminal conviction after treatment for drug abuse. For each of sixty people who had taken part in a drug rehabilitation programme, it was recorded whether they had a 'short' education (15 years or less) or a 'long' education (more than 15 years). It was also recorded whether or not they had a post-treatment conviction.

Number of observations: 60

Variable Description
education Categorical variable identifying length of education (more than 15 years=1, 15 years or less=0)
convicted Binary variable for post-treatment conviction: convicted=1, not-convicted=0
   

Source: Wilson, S. and Mandelbrote, B. (1978) Drug rehabilitation and criminality, British J. Criminology, 18, pp. 381-386.


$ \diamondsuit$

Shocking cows

Data: cows.txt

Keywords: Logistic regression.

Description: An experiment was carried out to investigate the effect of small electrical currents on farm animals. The eventual goal was to understand the effects of high-voltage powerlines on livestock. The experiment was carried out with 7 cows, and 6 shock intensities: 0, 1, 2, 3, 4 and 5 milliamps. (Shocks on the order of 15 milliamps are painful for many humans.) Each cow was given 30 shocks, five at each intensity, in random order. The entire experiment was then repeated, so each cow received a total of 60 shocks. For each shock, the response, mouth movement, was either present or absent. (These data are the same as the data in cows2.txt.)

Number of observations: 420

Variable Description
current Shock intensity (in milliamps)
movement Binary variable: movement=1, no-movement=0
   

Source: Weisberg, S. (1985) Applied Linear Regression, Wiley.


$ \diamondsuit$

Shocking cows, II

Data: cows2.txt

Keywords: Logistic regression.

Description: An experiment was carried out to investigate the effect of small electrical currents on farm animals. The eventual goal was to understand the effects of high-voltage powerlines on livestock. The experiment was carried out with 7 cows, and 6 shock intensities: 0, 1, 2, 3, 4 and 5 milliamps. (Shocks on the order of 15 milliamps are painful for many humans.) Each cow was given 30 shocks, five at each intensity, in random order. The entire experiment was then repeated, so each cow received a total of 60 shocks. For each shock, the response, mouth movement, was either present or absent.

It was thought that the reactions from the cows would depend on the shock intensity, but also, that it might differ slightly between the first experiment and the repeated experiment, due to fatigue of the animals, or due to learning. (These data are the same as the data in cows.txt.)

Number of observations: 420

Variable Description
current Shock intensity (in milliamps)
movement Binary variable: movement=1, no-movement=0
trial Categorical variable identifying the trial (1= first experiment, 2= repeated experiment)
   

Source: Weisberg, S. (1985) Applied Linear Regression, Wiley.


$ \diamondsuit$

Killing insects

Data: insecticides.txt

Keywords: Logistic regression.

Description: In a trial of three insecticides, batches of about fifty insects were exposed to varying deposits of each insecticide.

Number of observations: 882

Variable Description
killed Binary variable: killed=1, not-killed=0
insecticide Categorical variable identifying insecticide (numbered 1 to 3)
deposit Amount of deposit (in milligrams)
   

Source: Krzanowski, W.J. (1998) An Introduction to Statistical Modelling, London: Arnold. pp. 198-9.


$ \diamondsuit$

Oral contraceptives and myocardial infarction

Data: pill.txt

Keywords: Logistic regression.

Description: The link between use of oral contraceptives and the incidence of myocardial infarction was investigated. For each of 224 women, it was recorded whether or not they were using the oral contraceptive and whether or not they suffered a myocardial infarction.

Number of observations: 224

Variable Description
infarction Binary variable: infarction=1, no-infarction=0
pill Categorical variable for whether or not the pill is used (using pill=1, not using pill=0)
   

Source: Mann, J.I., Vessey, M.P., Thorogood, M. and Doll, R. (1975) British Medical Journal, 2, 241-245.
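
With a single binary explanatory variable, the logistic regression coefficients can be read directly off the 2x2 table: the intercept is the log-odds of infarction among non-users, and the slope is the log odds ratio. The sketch below uses hypothetical counts (not the counts in pill.txt) and a hypothetical helper `log_odds`.

```python
import math

# Hypothetical 2x2 counts, NOT the counts in pill.txt.
# counts[pill][infarction] = number of women
counts = {1: {1: 20, 0: 30}, 0: {1: 40, 0: 120}}

def log_odds(group):
    """Log-odds of infarction within one pill group."""
    return math.log(counts[group][1] / counts[group][0])

b0 = log_odds(0)                 # intercept: log-odds for non-users
b1 = log_odds(1) - log_odds(0)   # slope: log odds ratio, users vs non-users
odds_ratio = math.exp(b1)
```

An odds ratio above 1 corresponds to higher odds of infarction among pill users in the table.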


$ \diamondsuit$

Survival of infants with SIRDS

Data: sirds.txt

Keywords: Logistic regression.

Description: This data set contains the birth weights of fifty infants who exhibited severe idiopathic respiratory distress syndrome (SIRDS). This is a serious condition that may result in death, and in fact of the fifty children sampled only 23 survived.

Number of observations: 50

Variable Description
birthweight Weight at birth (in kg)
survival Binary variable: survived=1, died=0
   

Source: van Vliet, P.K. and Gupta, J.M. (1973) Sodium bicarbonate in idiopathic respiratory distress syndrome, Archives of Disease in Childhood, 48, pp. 249-255.


$ \diamondsuit$

Snoring and heart disease

Data: snoring.txt

Keywords: Logistic regression.

Description: A study was undertaken to investigate whether snoring is related to heart disease. In the survey, 2484 people were classified according to their proneness to snoring (never, occasionally, often, always) and whether or not they had heart disease.

Number of observations: 2484

Variable Description
disease Binary variable: having disease=1, not having disease=0
snoring Categorical variable indicating level of snoring (never=1, occasionally=2, often=3 and always=4)
   

Source: Norton, P.G. and Dunn, E.V. (1985) Snoring as a risk factor for disease: an epidemiological survey, British Medical Journal, 291, pp. 630-632.


$ \diamondsuit$

Carriers of Streptococcus pyogenes

Data: tonsil.txt

Keywords: Logistic regression.

Description: Some individuals are carriers of the bacterium Streptococcus pyogenes. An investigation was made into the possible relationship between carrier status and tonsil size in schoolchildren. A total of 1398 children were examined and classified according to tonsil size (normal, large and very large) and to whether or not they were carriers.

Number of observations: 1398

Variable Description
carrier Binary variable: carrier=1, no-carrier=0
tonsil Categorical variable indicating tonsil size (normal=1, large=2 and very large=3)
   

Source: Krzanowski, W. (1988) Principles of multivariate analysis, Oxford University Press, Oxford, p. 269.


$ \diamondsuit$

Transient vasoconstriction in skin of fingers

Data: vaso.txt

Keywords: Logistic regression.

Description: A study was made into the effect of volume and rate of air inspired by human subjects on the occurrence of transient vasoconstriction in the skin of the fingers. A total of 39 observations were obtained on these variables from 3 subjects in a laboratory. The data are assumed to be independent (including those on the same subject).

Number of observations: 39

Variable Description
volume Volume of air inspired by subject.
rate Rate of air inspired by subject.
survive Binary variable: occurrence of transient vasoconstriction in the skin of the fingers=1, no-occurrence=0
   

Source: Krzanowski, W.J. (1998) An Introduction to Statistical Modelling, London: Arnold. pp. 201-2.


$ \diamondsuit$




2 Analysis of variance





2.1 One-way ANOVA


Effect of brain dominance on recall ability

Data: braindom.txt

Keywords: One-way ANOVA.

Description: A study was made into how different kinds of brain dominance (left-brained, right-brained or integrative (=both)) affect the ability to recall information of various types. These data refer to an experiment in which subjects were asked to recall information presented to them in tabular form about the numbers of doctors practising in various US states. The subjects were divided into three groups, depending on whether they were predominantly left-brained (active, verbal, logical; Group 1), right-brained (receptive, spatial, intuitive; Group 2) or integrative (both; Group 3).

Number of observations: 24

Variable Description
score Score in recall test
brain Type of brain dominance (Group 1: left, Group 2: right, Group 3: both)
   

Source: Brown, T.S. and Evans, J.K. Muller, (1986) Hemispheric dominance and recall following graphical and tabular presentation of information Proceedings of the 1986 Annual Meeting of the Decision Sciences Institute, 1, p. 598.
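
For a one-way layout such as this one, the F statistic compares the variation between group means with the variation within groups. A minimal sketch follows; the function name `one_way_anova` is hypothetical and the scores are made up, not taken from braindom.txt.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand = sum(all_values) / n
    # Between-group sum of squares: group sizes times squared mean deviations.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: deviations from each group's own mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    f = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f, k - 1, n - k

# Made-up recall scores for three groups, NOT the data in braindom.txt.
scores = [[4, 5, 6], [7, 8, 9], [4, 6, 8]]
f_stat, df_between, df_within = one_way_anova(scores)
```

The statistic `f_stat` is compared with an F distribution on `df_between` and `df_within` degrees of freedom.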


$ \diamondsuit$

Potencies of cardiac substances

Data: cardiac.txt

Keywords: One-way ANOVA.

Description: Data were collected from an experiment designed to compare the relative potencies (dosages at death) of four cardiac substances. A suitable dilution of one of the substances was slowly infused into an anesthetised guinea pig, and the dosage at which the guinea pig died was recorded.

Number of observations: 40

Variable Description
potency Dosage at which the guinea pig died from substance
substance Categorical variable identifying the substance
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E. and Nizam, A. (1998) Applied Regression Analysis and Other Multivariable Methods, Duxbury Press, Brooks/Cole, p. 445.


$ \diamondsuit$

Selling breakfast cereals

Data: cereals.txt

Keywords: Factorial experiment, unequal cell numbers.

Description: A manufacturer conducted a pricing experiment to explore the effects of price decreases on sales of one of its breakfast cereals. The two largest supermarket chains in a particular market participated in the experiment. Ten stores from each chain were randomly selected, and each store was assigned a price level for the cereal (either the original price, or a 10% reduced price). If the competing chain had a store in the same vicinity, both stores were assigned the same price level. Some stores failed to complete the experiment due to competition from other supermarket chains. Sales volumes over the period of the study were noted for each of the 17 stores completing the experiment.

Number of observations: 17

Variable Description
sales Sales volumes (in hundreds of units)
chain Categorical variable identifying the chain (numbered 1 and 2)
price Categorical variable identifying the price level (original price=1, reduced price=2)
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E. and Nizam, A. (1998) Applied Regression Analysis and Other Multivariable Methods, Duxbury Press, Brooks/Cole, p. 583.


$ \diamondsuit$

Promoting growth of chicks

Data: chickfeed.txt

Keywords: Randomised complete block design.

Description: In an experiment, a drug was added to the feed of chicks in an attempt to promote growth. The aim of the experiment was to compare the effects of three types of feed: standard feed (a control), standard feed with a low dose of the drug added, and standard feed with a high dose of the drug added. It was thought that the position of the chicks in the bird house might also influence their growth (due to variation in lighting, ventilation, etc.), so the chicks in the experiment were grouped in eight blocks according to their location in the bird house. Each type of feed was fed to one unit (of three chicks each) from each block. Within each block, the three units were randomly allocated to the three types of feed. When the chicks had matured, the average weight per chick in each unit was recorded.

Number of observations: 24

Variable Description
weight Average weight per chick in unit (in pounds)
feed Type of feed (standard=1, low-dose=2, high-dose=3)
position Position of chick-unit (numbered 1 to 8)
   

Source: S.M. Free in Snee, R.D. (1985) Graphical display of results of three treatments randomized block experiment, Applied Statistics, 34, pp. 71-7.
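
In a randomised complete block design with one observation per block-treatment cell, the total sum of squares splits into treatment, block and error parts. The sketch below illustrates that decomposition; the function name `rcbd_anova` is hypothetical, and the table is made up (three blocks rather than eight, and not the values in chickfeed.txt).

```python
def rcbd_anova(table):
    """table[b][t] = response for block b, treatment t (one value per cell)."""
    b, t = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (b * t)
    block_means = [sum(row) / t for row in table]
    treat_means = [sum(table[i][j] for i in range(b)) / b for j in range(t)]
    ss_treat = b * sum((m - grand) ** 2 for m in treat_means)
    ss_block = t * sum((m - grand) ** 2 for m in block_means)
    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ss_error = ss_total - ss_treat - ss_block
    df_error = (b - 1) * (t - 1)
    f_treat = (ss_treat / (t - 1)) / (ss_error / df_error)
    return ss_treat, ss_block, ss_error, f_treat

# Made-up values: rows are blocks (positions), columns are feeds 1 to 3.
table = [
    [10, 12, 14],
    [11, 14, 14],
    [12, 13, 17],
]
ss_treat, ss_block, ss_error, f_treat = rcbd_anova(table)
```

Here `f_treat` is compared with an F distribution on t - 1 and (b - 1)(t - 1) degrees of freedom, where t is the number of feeds and b the number of blocks.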


$ \diamondsuit$

Sulphur content in coal seams

Data: coalseam.txt

Keywords: One-way ANOVA.

Description: A study was made to compare the sulphur content of the five major coal seams (numbered 1 to 5) in a particular region.

Number of observations: 42

Variable Description
sulphur Sulphur content
seam Factor: level corresponds to coal seam label
   

Source: Milton, J.S. & Arnold, J.C. (1995) Introduction to probability and statistics, McGraw Hill International Editions, p. 539.


$ \diamondsuit$

Antibody responses in normal and diabetic mice

Data: diabetes.txt

Keywords: One-way ANOVA.

Description: In an experiment, three groups of mice (Group 1: normal mice treated with a placebo, Group 2: alloxan-diabetic mice treated with a placebo, Group 3: alloxan-diabetic mice treated with insulin) were injected with 5 mg BSA antigen on days 0 and 28. On day 39, the amount of nitrogen-bound bovine serum albumin produced by the mice was measured.

Number of observations: 57

Variable Description
antibody Amount of nitrogen-bound bovine serum albumin produced by mouse (in micrograms per ml of undiluted mouse serum)
group Categorical variable identifying group (normal/placebo=1, diabetic/placebo=2, diabetic/insulin=3)
   

Source: Dolkart, R.E., Halpern, B. and Perlman, J. (1971) Comparison of antibody responses in normal and alloxan diabetic mice, Diabetes, 20, pp. 162-167.


$ \diamondsuit$

Dopamine activity in schizophrenic patients

Data: dopamine.txt

Keywords: Two-sample $ t$ -test.

Description: In a study into the causes of schizophrenia, 25 hospitalized schizophrenic patients were treated with anti-psychotic medication. After a period of time, they were classified as psychotic or non-psychotic. Samples of cerebrospinal fluid were taken from each patient and tested for dopamine b-hydroxylase enzyme activity.

Number of observations: 25

Variable Description
dopamine Dopamine b-hydroxylase enzyme activity in patient
state Indicator variable: 1 if psychotic, 2 if non-psychotic
   

Source: Sternberg, D.E., van Kammen, D.P. and Bunney, W.E. (1982) Schizophrenia: dopamine b-hydroxylase activity and treatment response, Science, 216, pp. 1423-1425.
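
The two-sample t-test named in the keywords compares the two group means using a pooled variance estimate. The sketch below assumes equal variances in the two groups; the function name `pooled_t` is hypothetical and the values are made up, not taken from dopamine.txt.

```python
import math

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance, and its degrees of freedom."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((v - ma) ** 2 for v in a)
    ssb = sum((v - mb) ** 2 for v in b)
    sp2 = (ssa + ssb) / (na + nb - 2)          # pooled variance estimate
    t = (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2

# Made-up activity values for the two states, NOT the data in dopamine.txt.
state1 = [1.0, 2.0, 3.0]
state2 = [4.0, 5.0, 6.0, 7.0]
t_stat, df = pooled_t(state1, state2)
```

The statistic `t_stat` is compared with a t distribution on `df` degrees of freedom.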


$ \diamondsuit$

Fatty doughnuts

Data: doughnuts.txt

Keywords: One-way ANOVA.

Description: These data concern the amount of fat absorbed by doughnuts when cooked. For each of four different types of fat (numbered 1 to 4, respectively), six batches of 24 doughnuts each were cooked. The total amount of fat absorbed by each batch was recorded.

Number of observations: 24

Variable Description
absorb Total amount of fat absorbed by batch (in grams)
fat Categorical variable identifying the type of fat
   

Source: Snedecor, G.W. and Cochran, W.G. (1967) Statistical Methods, 6th edition, Ames (IA), Iowa State University Press.


$ \diamondsuit$

Effectiveness of insecticides

Data: insects.txt

Keywords: One-way ANOVA.

Description: A study was made to compare the effectiveness of six different insecticides (numbered 1 to 6, respectively). For each insecticide, twelve batches of 50 insects were exposed to the insecticide for a fixed length of time. The numbers of insects in the batches still alive (of the 50) after the exposure time were recorded.

Number of observations: 72

Variable Description
alive Number of insects in the batches still alive after the exposure time
insecticide Categorical variable identifying the insecticide
   

Source: Lunneborg, C.E. (1994) Modeling Experimental and Observational Data, Duxbury Press, Ca., p. 150.


$ \diamondsuit$

Iron retention

Data: iron.txt

Keywords: Transformations, ANOVA.

Description: An experiment was performed to determine whether two forms of iron (Fe$ ^{2+}$ and Fe$ ^{3+}$ ) are retained differently. (If one form of iron were retained especially well, it would be the better dietary supplement.) The investigators divided 108 mice randomly into 6 groups of 18 each; 3 groups were given Fe$ ^{2+}$ in three different concentrations, 10.2, 1.2 and 0.3 millimolar, and 3 groups were given Fe$ ^{3+}$ at the same three concentrations. The mice were given the iron orally; the iron was radioactively labeled so that a counter could be used to measure the initial amount given. At a later time, another count was taken for each mouse, and the percentage of iron retained was calculated.

Number of observations: 108

Variable Description
retain Percentage of iron retained
iron Categorical variable identifying iron form (Fe$ ^{2+}$ =1 and Fe$ ^{3+}$ =2)
concentration Categorical variable identifying concentration (levels 1, 2 and 3 correspond to 10.2, 1.2 and 0.3 millimolar, respectively)
   

Source: Rice, J.A. (1988) Mathematical Statistics and Data Analysis, Wadsworth & Brooks/Cole, p. 357.


$ \diamondsuit$

Comparing laboratories

Data: laboratory.txt

Keywords: One-way ANOVA.

Description: A large number of laboratories are regularly used to measure the amount of toxic substances in various materials. There is concern that results not only vary due to normal measurement variability, but that there may be substantial variability due to different laboratory techniques. If true, this might raise a need for enforcing one 'standard' procedure for all laboratories. To test this concern, four laboratories were randomly selected and asked to measure the content of a certain chemical. Each laboratory was given six identical samples for testing.

Number of observations: 24

Variable Description
chemical Measured content of chemical (in parts per million)
laboratory Categorical variable identifying the laboratory
   

Source: Milton, J.S. & Arnold, J.C. (1995) Introduction to probability and statistics, McGraw Hill International Editions, p. 593.


$ \diamondsuit$

Pollution of fishing area

Data: oysters.txt

Keywords: One-way ANOVA.

Description: It was known that a toxic material was dumped in a river leading into a large salt water commercial fishing area. The way the water carried the toxic material was studied by measuring the amount of the toxic material (in parts per million) found in oysters harvested at three different locations, ranging from the estuary out into the bay where the majority of commercial fishing was carried out.

Number of observations: 24

Variable Description
toxic Toxic material in oysters
site Site at which the oysters were harvested
   

Source: Milton, J.S. & Arnold, J.C. (1995) Introduction to probability and statistics, McGraw Hill International Editions, p. 585.


$ \diamondsuit$

Growth of soya beans

Data: soya.txt

Keywords: One-way ANOVA.

Description: In an experiment on the effect of stress on the growth of soya beans, 52 soya beans were grown under different types of stress. Some were shaken for twenty minutes every day (Group 4), some were grown in semi-darkness (Group 1), some were grown in semi-darkness and shaken for twenty minutes each day (Group 2). A control group was grown without any imposed stress (Group 3). After sixteen days of growth, the leaf area of each plant was measured. (These data are the same as the data in soya2.txt.)

Number of observations: 52

Variable Description
leafarea Leaf area after 16 days
stress Categorical variable identifying the group
   

Source: Blæsild, P. and Granfeldt, J., (1995), Statistik for biologer og geologer, Det Naturvidenskabelige Fakultet, University of Aarhus, Denmark.


$ \diamondsuit$

Calibration of voltmeters

Data: voltmeter.txt

Keywords: One-way ANOVA.

Description: A utility company has a large stock of voltmeters that are used interchangeably by many employees. A study is conducted to detect differences among the average readings given by these voltmeters. If it appears that differences do exist, then all the meters in stock will be calibrated. A random sample of six meters is selected from stock and four readings are taken for each meter. The response variable is the difference between the meter reading and the known voltage being applied at the time of the reading.

Number of observations: 24

Variable Description
reading Difference between the meter reading and the known voltage
voltmeter Categorical variable identifying the voltmeter
   

Source: Milton, J.S. & Arnold, J.C. (1995) Introduction to probability and statistics, McGraw Hill International Editions, p. 575.


$ \diamondsuit$




2.2 Multiway ANOVA


Strength of adhesive product

Data: adhesive.txt

Keywords: Two-way ANOVA.

Description: An experiment was made to investigate the effect of temperature and humidity on the force required to separate an adhesive product from a certain material. Four specific temperatures and two specific humidities were of interest in the experiment. (Thus, the factors are fixed!)

Number of observations: 24

Variable Description
force Force required to separate the adhesive product from the material
temperature Categorical variable identifying the temperature level
humidity Categorical variable identifying the humidity level
   

Source: Milton, J.S. & Arnold, J.C. (1995) Introduction to probability and statistics, McGraw Hill International Editions, p. 607.


$ \diamondsuit$

Evaluation of TV advertisements

Data: ads.txt

Keywords: Two-way ANOVA.

Description: An advertising company evaluated three types of television advertisements for a new, low-cost car: visual appeal ads, budget appeal ads, and feature appeal ads. To control for age differences, viewers from four age groups were chosen to evaluate the persuasiveness of the ads (as measured on a scale from 1 to 10, where 1 represented the lowest level of persuasion, and 10 the highest). For each type of advertisement, two viewers from each age group were asked to evaluate the ad.

Number of observations: 24

Variable Description
score Score of the ad
type Categorical variable identifying the ad type (Visual=1, Budget=2, Feature=3)
age Categorical variable identifying the viewer's age ('18-25'=1, '26-35'=2, '36-45'=3, '46 and older'=4)
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E. and Nizam, A. (1998) Applied Regression Analysis and Other Multivariable Methods, Duxbury Press, Brooks/Cole, p. 557.


$ \diamondsuit$

Dye in residue bath

Data: dye.txt

Keywords: $ 2^k$ factorial experiment.

Description: A study was conducted on the effect of temperature, time in process and rate of temperature rise on the amount of dye left in the residue bath after a dyeing process. The experiment was run at two levels of temperature (120C, 135C), two levels of time in process (30 minutes, 60 minutes) and two levels of rate of temperature rise ($ R_1$ , $ R_2$ ). The experiment was run as a $ 2^3$ factorial experiment with two replications.

Number of observations: 16

Variable Description
dye Amount of dye left in the residue bath (in milligrams)
temp Categorical variable identifying the temperature level (120C=1, 135C=2)
time Categorical variable identifying the time in process (30 mins=1, 60 mins=2)
rate Categorical variable identifying the rate of temperature rise ($ R_1$ =1, $ R_2$ =2)
   

Source: Milton, J.S. & Arnold, J.C. (1995) Introduction to probability and statistics, McGraw Hill International Editions, p. 629.
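
In a $ 2^3$ factorial experiment, each main effect or interaction can be estimated as a contrast: the average response where the product of the coded levels is +1 minus the average where it is -1. The sketch below assumes the levels 1 and 2 in the data file are recoded to -1 and +1, and that replicates have been averaged per run. The function name `effect` is hypothetical and the responses are made up, not the measurements in dye.txt.

```python
from itertools import product

# Mean response per run, keyed by (temp, time, rate) coded as -1 (low) / +1 (high).
# Made-up values, NOT the measurements in dye.txt.
runs = dict(zip(product([-1, 1], repeat=3),
                [20, 24, 22, 26, 30, 34, 32, 44]))

def effect(columns):
    """Estimated effect for the factors in `columns` (0=temp, 1=time, 2=rate)."""
    total = 0.0
    for levels, y in runs.items():
        sign = 1
        for c in columns:
            sign *= levels[c]      # product of coded levels gives the contrast sign
        total += sign * y
    return total / (len(runs) / 2)

temp_effect = effect([0])
time_effect = effect([1])
rate_effect = effect([2])
temp_time_interaction = effect([0, 1])
```

Passing several column indices, as in `effect([0, 1])`, gives the corresponding interaction contrast.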


$ \diamondsuit$

Experience on assembly line

Data: experience.txt

Keywords: Randomised complete block design.

Description: A study was conducted to investigate the effect of experience on the average time required to complete an assembly task on an assembly line. (If experience is found to have an effect, a training program will be set up for new employees.) For each of eight different assembly tasks, four employees with 1, 2, 3 and 4 years of experience, respectively, were randomly selected to complete the task. The times it took to complete the tasks were recorded. The experiment was set up as a randomised complete block design with tasks as blocks and years of experience as factor levels.

Number of observations: 32

Variable Description
time Time it takes to complete the assembly task
experience Number of years of experience
task Categorical variable identifying the task (numbered 1 to 8)
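
In a randomised complete block design with one observation per block-treatment combination, the total degrees of freedom split into blocks, treatments and error. A short sketch of that bookkeeping for this layout (the function name `rcbd_degrees_of_freedom` is illustrative):

```python
def rcbd_degrees_of_freedom(n_blocks, n_treatments):
    """Degrees of freedom for an RCBD with one observation per cell."""
    total = n_blocks * n_treatments - 1
    blocks = n_blocks - 1
    treatments = n_treatments - 1
    error = blocks * treatments  # what remains after blocks and treatments
    assert blocks + treatments + error == total
    return {"blocks": blocks, "treatments": treatments, "error": error, "total": total}

# Eight tasks (blocks) and four experience levels (treatments).
df = rcbd_degrees_of_freedom(n_blocks=8, n_treatments=4)
print(df)
```

For this dataset the error line has 21 degrees of freedom, which is the denominator for the F-test of the experience effect.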
   

Source: Milton, J.S. & Arnold, J.C. (1995) Introduction to probability and statistics, McGraw Hill International Editions, p. 591.


$ \diamondsuit$

Titration of influenza virus

Data: influenza.txt

Keywords: Two-way ANOVA.

Description: In a laboratory working on influenza virus, there were three different operators of photoelectric titration equipment and two different methods of performing the titration. The methods involve several dilutions of the virus preparation; either a single pipette could be used for every dilution, or a fresh pipette could be used for each dilution. What was being studied was not the virus preparations themselves, but the operators and the measurement methods. In fact, apart from measurement variability caused by having different operators and different methods, there was no reason for the responses to differ, since all the measurements were made on samples drawn from the same virus preparation.

Number of observations: 24

Variable Description
measure Titration measurements
operator Categorical variable identifying operator (numbered 1 to 3)
pipette Categorical variable relating to pipettes: single pipette=1, multiple pipettes=2
   

Source: Osborn, J.F. (1979) Statistical Exercises in Medical Research, Oxford, Blackwell Scientific Publications.


$ \diamondsuit$

Decomposition of leaf packs

Data: leafpack.txt

Keywords: Two-way ANOVA.

Description: Decomposition of leaf packs was measured (in terms of weight loss of the leaf packs) in four different environments after 1, 2 and 3 months of exposure.

Number of observations: 24

Variable Description
decomp Weight loss of leaf pack (in grams)
environment Categorical variable identifying the environment (numbered 1 to 4)
time Categorical variable identifying the time length (1, 2 or 3 months, respectively)
   

Source: Milton, J.S. & Arnold, J.C. (1995) Introduction to probability and statistics, McGraw Hill International Editions, p. 639.


$ \diamondsuit$

Productivity in phone surveys

Data: phone.txt

Keywords: ANOVA, unequal cell numbers.

Description: The manager of a market research company conducted an experiment to investigate the productivity of three employees on each of two computerised data-entry systems. The employees conducted phone surveys, entering the survey data into the computer during the phone call. Productivity was measured as the time taken to complete a call in which the respondent agreed to complete the survey. Each employee used each system for one hour, and the order of use was randomised.

Number of observations: 51

Variable Description
time Time taken to complete call (in minutes)
employee Categorical variable identifying employee (numbered 1 to 3)
system Categorical variable identifying system (numbered 1 and 2, respectively)
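
With three employees and two systems there are six employee-system cells, and 51 calls cannot be divided evenly among them, so the design is necessarily unbalanced. A quick check of that arithmetic (the helper name `can_be_balanced` is illustrative):

```python
def can_be_balanced(n_observations, *level_counts):
    """Return True if n_observations could be split equally over all cells."""
    n_cells = 1
    for count in level_counts:
        n_cells *= count
    return n_observations % n_cells == 0

# 51 calls over 3 employees x 2 systems = 6 cells.
print(can_be_balanced(51, 3, 2))  # False: 51 is not a multiple of 6
```

A result of True would only mean that equal cell sizes are possible, not that the data actually have them; confirming that requires counting the rows per cell in the data file.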
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E. and Nizam, A., (1998), Applied Regression Analysis and Other Multivariable Methods, Duxbury Press, Brooks/Cole, p. 584.


$ \diamondsuit$

Diameters of pine trees

Data: pinetree.txt

Keywords: Mixed-effect two-way ANOVA.

Description: The diameters of three species of pine trees were compared at each of four locations using samples of five trees per species at each location.

Number of observations: 60

Variable Description
diameter Diameter of pine tree
species Categorical variable identifying the species (numbered 1 to 3)
location Categorical variable identifying the location (numbered 1 to 4)
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E. and Nizam, A., (1998), Applied Regression Analysis and Other Multivariable Methods, Duxbury Press, Brooks/Cole, p. 552.


$ \diamondsuit$

Failures of piston rings

Data: pistons.txt

Keywords: Randomised complete block design.

Description: These data concern the number of failures of piston rings for each of three legs of four steam-driven compressors.

Number of observations: 12

Variable Description
piston Numbers of failures of piston rings
leg Position of leg (North=1, Centre=2, South=3)
compressor Categorical variable identifying compressor (numbered 1 to 4)
   

Source: Davies, O.L. and Goldsmith, P.L. (1972) Statistical Methods in Research and Production, 4th edn. Oliver and Boyd, Edinburgh, p. 324.


$ \diamondsuit$

Different protein diets

Data: protein.txt

Keywords: Two-way ANOVA.

Description: Six groups, each of ten rats, were fed on diets which differed according to source of protein and amount of protein in diet. The weight gain for each rat was recorded.

Number of observations: 60

Variable Description
weight Weight gain (in grams)
protein Protein source: Beef=1, Cereal=2, Pork=3
amount Amount of protein: Low=1, High=2
   

Source: Snedecor, G.W. and Cochran, W.G. (1967) Statistical Methods, 6th edn. Iowa State University Press, p. 347.


$ \diamondsuit$

Sexism at US colleges

Data: sexism.txt

Keywords: Two-way ANOVA.

Description: A study was conducted to compare the sexist attitudes of students at various types of colleges in the US. The college types are: mixed (gender) college with at least 75% male students, mixed college with less than 75% male students, and single sex college. For each gender, random samples of 10 undergraduate students were selected from each of the three types of colleges. Each student filled in a questionnaire, from which a score for `degree of sexism' was determined (the higher the score, the more sexist the attitude). The `degree of sexism' is defined as the extent to which a student considered males and females to have different life roles.

Number of observations: 60

Variable Description
sexism Score for `degree of sexism'
type Categorical variable identifying the college type (mixed with $ \geq75\%$ males=1, mixed with $ <75\%$ males=2, single sex=3)
gender Categorical variable identifying the student's gender (male=1, female=2)
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E. and Nizam, A., (1998), Applied Regression Analysis and Other Multivariable Methods, Duxbury Press, Brooks/Cole, p. 549.


$ \diamondsuit$

Growth of soya beans, II

Data: soya2.txt

Keywords: Two-way ANOVA.

Description: In an experiment on the effect of stress on the growth of soya beans, 52 soya beans were grown under two different types of stress: some were shaken for twenty minutes every day, and some were grown in semi-darkness. After sixteen days of growth, the leaf area of each plant was measured. (These data are the same as the data in soya.txt.)

Number of observations: 52

Variable Description
leafarea Leaf area after 16 days
shake Categorical variable identifying the shaking-stress (1=shaken, 0=non-shaken)
light Categorical variable identifying the semidarkness-stress (1=semidarkness, 0=normal light)
   

Source: Blæsild, P. and Granfeldt, J., (1995), Statistik for biologer og geologer, Det Naturvidenskabelige Fakultet, University of Aarhus, Denmark.


$ \diamondsuit$

Reducing stress

Data: stressred.txt

Keywords: Two-way ANOVA.

Description: An experiment was conducted to investigate whether the drugs levorphanol and/or epinephrine reduce stress. Each of four treatments (Treatment 1: levorphanol, Treatment 2: levorphanol and epinephrine, Treatment 3: epinephrine, and Treatment 4: a control group, receiving neither drug) was given to five animals, and the cortical sterone level (which reflects the stress-level) was measured.

Number of observations: 20

Variable Description
level Level of cortical sterone
levor Indicating presence (1) or absence (0) of levorphanol in treatment
epine Indicating presence (1) or absence (0) of epinephrine in treatment
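
The four treatments form a 2 x 2 factorial in the two drugs, which is why the data are coded with two indicator variables rather than a single treatment label. A sketch of that mapping, using the treatment numbering from the description (the names `TREATMENT_CODES` and `design_rows` are illustrative):

```python
# (levor, epine) indicators for each treatment number in the description.
TREATMENT_CODES = {
    1: (1, 0),  # levorphanol only
    2: (1, 1),  # levorphanol and epinephrine
    3: (0, 1),  # epinephrine only
    4: (0, 0),  # control: neither drug
}

def design_rows(animals_per_treatment):
    """One (levor, epine) row per animal; responses are not included."""
    rows = []
    for treatment in sorted(TREATMENT_CODES):
        rows.extend([TREATMENT_CODES[treatment]] * animals_per_treatment)
    return rows

rows = design_rows(animals_per_treatment=5)
print(len(rows))  # 4 treatments x 5 animals
```

Coding the treatments this way lets a two-way ANOVA separate the effect of each drug from their interaction.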
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E. and Nizam, A., (1998), Applied Regression Analysis and Other Multivariable Methods, Duxbury Press, Brooks/Cole, p. 545.


$ \diamondsuit$

Removal of tattoos

Data: tattoos.txt

Keywords: Multiway ANOVA.

Description: These data concern patients who have had forearm tattoos removed by one of two different surgical methods (labelled A and B, respectively). The gender of each patient was recorded, as well as the size (small, medium or large) and the depth (moderate or deep) of the tattoos. The quality of the result was scored from 1 (poor) to 4 (excellent).

Number of observations: 55

Variable Description
method Method used to remove tattoo (A or B)
gender Patient's gender (m, f)
size Size of tattoo (small, medium, large)
depth Depth of tattoo (moderate, deep)
score Quality of result (1 to 4)
   

Source: Lunn, A.D. and McNeil, D.R. (1988) The SPIDA manual. Statistical Computing Laboratory.


$ \diamondsuit$

Wear of bus tyres

Data: tyres.txt

Keywords: Randomised complete block design.

Description: A small bus company wanted to evaluate the wear of four types of tyres. Since each of the company's five buses runs a different route with different terrain and driving conditions, the company decided to place one of each type of tyre on each of the buses (choosing the wheel positions randomly).

Number of observations: 20

Variable Description
wear Wear of tyre
tyre Type of tyre (numbered 1 to 4)
bus Specifying the bus (numbered 1 to 5)
   

Source: Milton, J.S. & Arnold, J.C. (1995) Introduction to probability and statistics, McGraw Hill International Editions, p. 561.


$ \diamondsuit$

Uric acid level in bloodstreams

Data: uric.txt

Keywords: Two-way ANOVA.

Description: These data are the uric acid levels found in the bloodstreams of persons with Down's syndrome and of subjects without Down's syndrome. All subjects were between 21 and 25 years of age.

Number of observations: 20

Variable Description
uric Uric acid level in bloodstream
down Categorical variable relating to Down's syndrome (persons with Down's syndrome=1, other=2)
gender Categorical variable identifying the person's gender (male=1, female=2)
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E. and Nizam, A., (1998), Applied Regression Analysis and Other Multivariable Methods, Duxbury Press, Brooks/Cole, p. 551.


$ \diamondsuit$

Confidence in legal system

Data: victim.txt

Keywords: Factorial experiment, unequal cell numbers.

Description: A crime victimisation study was undertaken in a medium-size southern US city. The main purpose was to determine the effects of being a crime victim on confidence in the law enforcement authority and in the legal system itself. A questionnaire was administered to a random sample of 40 city residents. Among the information elicited were data on the number of times the resident had been victimised, a measure of social class status, and a measure of the respondent's confidence in law enforcement and in the legal system.

Number of observations: 40

Variable Description
confidence Measure of confidence in law enforcement and legal system
victim Number of times resident has been victimised (0, 1 or 2+)
class Categorical variable identifying social class status (Low=1, Medium=2, High=3)
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E. and Nizam, A., (1998), Applied Regression Analysis and Other Multivariable Methods, Duxbury Press, Brooks/Cole, p. 576.


$ \diamondsuit$

Satisfaction with medical care

Data: worry.txt

Keywords: Factorial experiment, unequal cell numbers.

Description: These data concern a study of the satisfaction with medical care of pregnant women. The patients were classified according to two factors: patient worry (positive or negative), and the affectiveness of physician-patient communication (High, Medium or Low). The variables were developed from scales based on questionnaires administered to patients and their physicians.

Number of observations: 50

Variable Description
satisfy Satisfaction score (between 1 and 10)
worry Categorical variable identifying the level of worry (Negative=1, Positive=2)
commun Categorical variable identifying affectiveness of communication (High=1, Medium=2, Low=3)
   

Source: Kleinbaum, D.G., Kupper, L.L., Muller, K.E. and Nizam, A., (1998), Applied Regression Analysis and Other Multivariable Methods, Duxbury Press, Brooks/Cole, p. 562.


$ \diamondsuit$

Last modified May 23, 2006.
©2001-2005 Master of Applied Statistics