
7 Correlation and Causality

LEARNING GOALS

  • 7.1 Seeking Correlation Define correlation, explore correlations with scatterplots, and understand the correlation coefficient as a measure of the strength of a correlation.

  • 7.2 Interpreting Correlations Be aware of important cautions concerning the interpretation of correlations, especially the effects of outliers, the effects of grouping data, and the crucial fact that correlation does not necessarily imply causality.

  • 7.3 Best-Fit Lines and Prediction Become familiar with the concept of a best-fit line, recognize when such lines have predictive value and when they do not, and understand the general concept of multiple regression.

  • 7.4 The Search for Causality Understand the difficulty of establishing causality from correlation, and investigate guidelines that can be used to help establish confidence in causality.

FOCUS TOPICS

  • p. 271 Focus on Education: What Helps Children Learn to Read?

  • p. 273 Focus on Environment: What Is Causing Global Warming?

Does smoking cause lung cancer? Are drivers more dangerous when on their cell phones? Is human activity causing global warming? A major goal of many statistical studies is to search for relationships among different variables so that researchers can then determine whether one factor causes another. Once a relationship is discovered, we can try to determine whether there is an underlying cause. In this chapter, we will study relationships known as correlations and explore how they are important to the more difficult task of searching for causality.

The person who knows “how” will always have a job. The person who knows “why” will always be his boss.

—Diane Ravitch

7.1 SEEKING CORRELATION

What does it mean when we say that smoking causes lung cancer? It certainly does not mean that you’ll get lung cancer if you smoke a single cigarette. It does not even mean that you’ll definitely get lung cancer if you smoke heavily for many years, as some heavy smokers do not get lung cancer. Rather, it is a statistical statement meaning that you are much more likely to get lung cancer if you smoke than if you don’t smoke.

How did researchers learn that smoking causes lung cancer? The process began with informal observations, as doctors noticed that a surprisingly high proportion of their patients with lung cancer were smokers. These observations led to carefully conducted studies in which researchers compared lung cancer rates among smokers and nonsmokers. These studies showed clearly that heavier smokers were more likely to get lung cancer. In more formal terms, we say that there is a correlation between the variables amount of smoking and likelihood of lung cancer. A correlation is a special type of relationship between variables, in which a rise or fall in one goes along with a corresponding rise or fall in the other.

Smoking is one of the leading causes of statistics.

—Fletcher Knebel

Definition

A correlation exists between two variables when higher values of one variable consistently go with higher values of another variable or when higher values of one variable consistently go with lower values of another variable.

Here are a few other examples of correlations:

  • There is a correlation between the variables height and weight for people; that is, taller people tend to weigh more than shorter people.

  • There is a correlation between the variables demand for apples and price of apples; that is, demand tends to decrease as price increases.

  • There is a correlation between practice time and skill among piano players; that is, those who practice more tend to be more skilled.

It’s important to realize that establishing a correlation between two variables does not mean that a change in one variable causes a change in the other. The correlation between smoking and lung cancer did not by itself prove that smoking causes lung cancer. We could imagine, for example, that some gene predisposes a person both to smoking and to lung cancer. Nevertheless, identifying the correlation was the crucial first step in learning that smoking causes lung cancer. We will discuss the difficult task of establishing causality later in this chapter. For now, we concentrate on how we look for, identify, and interpret correlations.

BY THE WAY

Smoking is linked to many serious diseases besides lung cancer, including heart disease and emphysema. Smoking is also linked with many less lethal health conditions, such as premature skin wrinkling and sexual impotence.

TIME OUT TO THINK

Suppose there really were a gene that made people prone to both smoking and lung cancer. Explain why we would still find a strong correlation between smoking and lung cancer in that case, but would not be able to say that smoking causes lung cancer.

Scatterplots

Table 7.1 lists data for a sample of gem-store diamonds—their prices and several common measures that help determine their value. Because advertisements for diamonds often quote only their weights (in carats), we might suspect a correlation between the weights and the prices. We can look for such a correlation by making a scatterplot (or scatter diagram) showing the relationship between the variables weight and price.

TABLE 7.1 Prices and Characteristics of a Sample of 23 Diamonds from Gem Dealers

Diamond   Price     Weight (carats)   Depth   Table   Color   Clarity
1         $6,958        1.00          60.5     65
2         $5,885        1.00          59.2     65
3         $6,333        1.01          62.3     55
4         $4,299        1.01          64.4     62
5         $9,589        1.02          63.9     58
6         $6,921        1.04          60.0     61
7         $4,426        1.04          62.0     62
8         $6,885        1.07          63.6     61
9         $5,826        1.07          61.6     62
10        $3,670        1.11          60.4     60
11        $7,176        1.12          60.2     65
12        $7,497        1.16          59.5     60
13        $5,170        1.20          62.6     61
14        $5,547        1.23          59.2     65
15        $7,521        1.29          59.6     59
16        $7,260        1.50          61.1     65
17        $8,139        1.51          63.0     60
18        $12,196       1.67          58.7     64
19        $14,998       1.72          58.5     61
20        $9,736        1.76          57.9     62
21        $9,859        1.80          59.6     63
22        $12,398       1.88          62.9     62
23        $11,008       2.03          62.0     63

Notes: Weight is measured in carats (1 carat = 0.2 gram). Depth is defined as 100 times the ratio of height to diameter. Table is the size of the upper flat surface. (Depth and table determine “cut.”) Color and clarity are each measured on standard scales, where 1 is best. For color, 1 = colorless, and increasing numbers indicate more yellow. For clarity, 1 = flawless, and 6 indicates that defects can be seen by eye.

BY THE WAY

The word karats (with a k) used to describe gold does not have the same meaning as the term carats (with a c) for diamonds and other gems. A carat is a measure of weight equal to 0.2 gram. Karats are a measure of the purity of gold: 24-karat gold is 100% pure gold; 18-karat gold is 75% pure (and 25% other metals); 12-karat gold is 50% pure (and 50% other metals); and so on.

Definition

A scatterplot (or scatter diagram) is a graph in which each point represents the values of two variables.

Figure 7.1 shows the scatterplot, which can be constructed with the following procedure.

  • 1. We assign one variable to each axis and label the axis with values that comfortably fit all the data. Sometimes the axis selection is arbitrary, but if we suspect that one variable depends on the other, then we plot the explanatory variable on the horizontal axis and the response variable on the vertical axis. In this case, we expect the diamond price to depend at least in part on its weight; we therefore say that weight is the explanatory variable (because it helps explain the price) and price is the response variable (because it responds to changes in the explanatory variable). We choose a range of 0 to 2.5 carats for the weight axis and $0 to $16,000 for the price axis.

Figure 7.1 Scatterplot showing the relationship between the variables price and weight for the diamonds in Table 7.1. The dashed lines show how we find the position of the point for Diamond 10.

  • 2. For each diamond in Table 7.1, we plot a single point at the horizontal position corresponding to its weight and the vertical position corresponding to its price. For example, the point for Diamond 10 goes at a position of 1.11 carats on the horizontal axis and $3,670 on the vertical axis. The dashed lines on Figure 7.1 show how we locate this point.

  • 3. (Optional) We can label some (or all) of the data points, as is done for Diamonds 10, 16, and 19 in Figure 7.1.

Scatterplots get their name because the way in which the points are scattered may reveal a relationship between the variables. In Figure 7.1, we see a general upward trend indicating that diamonds with greater weight tend to be more expensive. The correlation is not perfect. For example, the heaviest diamond is not the most expensive. But the overall trend seems fairly clear.
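This imperfect trend can be checked directly against Table 7.1. The sketch below (plain Python; the weight–price pairs are transcribed from the table) confirms that the heaviest diamond and the most expensive diamond are different stones:

```python
# Weight (carats) and price pairs for the 23 diamonds in Table 7.1.
diamonds = [
    (1.00, 6958), (1.00, 5885), (1.01, 6333), (1.01, 4299), (1.02, 9589),
    (1.04, 6921), (1.04, 4426), (1.07, 6885), (1.07, 5826), (1.11, 3670),
    (1.12, 7176), (1.16, 7497), (1.20, 5170), (1.23, 5547), (1.29, 7521),
    (1.50, 7260), (1.51, 8139), (1.67, 12196), (1.72, 14998), (1.76, 9736),
    (1.80, 9859), (1.88, 12398), (2.03, 11008),
]

heaviest = max(diamonds, key=lambda d: d[0])  # largest weight
priciest = max(diamonds, key=lambda d: d[1])  # highest price

print(heaviest)  # (2.03, 11008): the heaviest diamond...
print(priciest)  # (1.72, 14998): ...is not the most expensive one
```

Because the two maxima fall on different diamonds, the correlation cannot be perfect, even though the overall upward trend is clear.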

TIME OUT TO THINK

Identify the points in Figure 7.1 that represent Diamonds 3, 7, and 23.

EXAMPLE  Color and Price

Using the data in Table 7.1, create a scatterplot to look for a correlation between a diamond’s color and price. Comment on the correlation.

SOLUTION We expect price to depend on color, so we plot the explanatory variable color on the horizontal axis and the response variable price on the vertical axis in Figure 7.2. (You should check a few of the points against the data in Table 7.1.) The points appear much more scattered than in Figure 7.1. Nevertheless, you may notice a weak trend diagonally downward from the upper left toward the lower right. This trend represents a weak correlation in which diamonds with more yellow color (higher numbers for color) are less expensive. This trend is consistent with what we would expect, because colorless diamonds appear to sparkle more and are generally considered more desirable.

Figure 7.2 Scatterplot for the color and price data in Table 7.1.

TIME OUT TO THINK

Thanks to a large bonus at work, you have a budget of $6,000 for a diamond ring. A dealer offers you the following two choices for that price. One diamond weighs 1.20 carats and has color = 4. The other weighs 1.18 carats and has color = 3. Assuming all other characteristics of the diamonds are equal, which would you choose? Why?

Types of Correlation

We have seen two examples of correlation. Figure 7.1 shows a fairly strong correlation between weight and price, while Figure 7.2 shows a weak correlation between color and price. We are now ready to generalize about types of correlation. Figure 7.3 shows eight scatterplots for variables called x and y. Note the following key features of these diagrams:

  • Parts a to c show positive correlations: The values of y tend to increase with increasing values of x. The correlation becomes stronger as we proceed from a to c. In fact, c shows a perfect positive correlation, in which all the points fall along a straight line.

  • Parts d to f show negative correlations: The values of y tend to decrease with increasing values of x. The negative correlation becomes stronger as we proceed from d to f. In fact, f shows a perfect negative correlation, in which all the points fall along a straight line.

  • Part g shows no correlation between x and y: Values of x do not appear to be linked to values of y in any way.

  • Part h shows a nonlinear relationship: x and y appear to be related, but the relationship does not correspond to a straight line. (Linear means along a straight line, and nonlinear means not along a straight line.)

Figure 7.3 Types of correlation seen on scatterplots.

Types of Correlation

Positive correlation: Both variables tend to increase (or decrease) together.

Negative correlation: The two variables tend to change in opposite directions, with one increasing while the other decreases.

No correlation: There is no apparent (linear) relationship between the two variables.

Nonlinear relationship: The two variables are related, but the relationship results in a scatterplot that does not follow a straight-line pattern.

TECHNICAL NOTE

In this text we use the term correlation only for linear relationships. Some statisticians refer to nonlinear relationships as “nonlinear correlations.” There are techniques for working with nonlinear relationships that are similar to those described in this text for linear relationships.
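These categories can also be checked numerically. Below is a minimal pure-Python sketch of the correlation coefficient r (defined formally later in this section); applied to points on an ascending straight line it returns 1, and on a descending straight line it returns −1, matching parts c and f of Figure 7.3:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Linear correlation coefficient r for paired data (illustrative sketch)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations, divided by the product of
    # the root sums of squared deviations (equivalent to the usual formula).
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2 * v + 1 for v in x]))   # 1 (perfect positive, up to rounding)
print(pearson_r(x, [-2 * v + 9 for v in x]))  # -1 (perfect negative, up to rounding)
```

Data with no linear pattern would give a value near 0, as in part g of Figure 7.3.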

EXAMPLE  Life Expectancy and Infant Mortality

Figure 7.4 shows a scatterplot for the variables life expectancy and infant mortality in 16 countries. What type of correlation does it show? Does this correlation make sense? Does it imply causality? Explain.

Figure 7.4 Scatterplot for life expectancy and infant mortality data.

Source: United Nations.

SOLUTION The diagram shows a moderate negative correlation in which countries with lower infant mortality tend to have higher life expectancy. It is a negative correlation because the two variables vary in opposite directions. The correlation makes sense because we would expect that countries with better health care would have both lower infant mortality and higher life expectancy. However, it does not imply causality between infant mortality and life expectancy: We would not expect that a concerted effort to reduce infant mortality would increase life expectancy significantly unless it was part of an overall effort to improve health care. (Reducing infant mortality will slightly increase life expectancy because having fewer infant deaths tends to raise the mean age of death for the population.)

Measuring the Strength of a Correlation

For most purposes, it is enough to state whether a correlation is strong, weak, or nonexistent. However, sometimes it is useful to describe the strength of a correlation in more precise terms. Statisticians measure the strength of a correlation with a number called the correlation coefficient, represented by the letter r. The correlation coefficient is easy to calculate in principle (see the optional section on p. 243), but the actual work is tedious unless you use a calculator or computer.

We can explore the interpretation of correlation coefficients by studying Figure 7.3, which shows the value of the correlation coefficient r for each scatterplot. Notice that the correlation coefficient is always between –1 and 1. When points in a scatterplot lie close to an ascending straight line, the correlation coefficient is positive and close to 1. When all the points lie close to a descending straight line, the correlation coefficient is negative with a value close to –1. Points that do not fit any type of straight-line pattern or that lie close to a horizontal straight line (indicating that the y values have no dependence on the x values) result in a correlation coefficient close to 0.

Properties of the Correlation Coefficient, r

  • The correlation coefficient, r, is a measure of the strength of a correlation. Its value can range only from –1 to 1.

  • If there is no correlation, the points do not follow any ascending or descending straight-line pattern, and the value of r is close to 0.

  • If there is a positive correlation, the correlation coefficient is positive (0 < r ≤ 1): Both variables increase together. A perfect positive correlation (in which all the points on a scatterplot lie on an ascending straight line) has a correlation coefficient r = 1. Values of r close to 1 indicate a strong positive correlation, and positive values closer to 0 indicate a weak positive correlation.

  • If there is a negative correlation, the correlation coefficient is negative (–1 ≤ r < 0): When one variable increases, the other decreases. A perfect negative correlation (in which all the points lie on a descending straight line) has a correlation coefficient r = –1. Values of r close to –1 indicate a strong negative correlation, and negative values closer to 0 indicate a weak negative correlation.

TECHNICAL NOTE

For the methods of this section, there is a requirement that the two variables result in data having a “bivariate normal distribution.” This basically means that for any fixed value of one variable, the corresponding values of the other variable have a normal distribution. This requirement is usually very difficult to check, so the check is often reduced to verifying that both variables result in data that are normally distributed.

EXAMPLE  U.S. Farm Size

Figure 7.5 shows a scatterplot for the variables number of farms and mean farm size in the United States. Each dot represents data from a single year between 1950 and 2000; on this diagram, the earlier years generally are on the right and the later years on the left. Estimate the correlation coefficient by comparing this diagram to those in Figure 7.3 and discuss the underlying reasons for the correlation.

Figure 7.5 Scatterplot for farm size data.

Source: U.S. Department of Agriculture.

SOLUTION The scatterplot shows a strong negative correlation that most closely resembles the scatterplot in Figure 7.3f, suggesting a correlation coefficient around r = –0.9. The correlation shows that when there were fewer farms, they tended to have a larger mean size, and when there were more farms, they tended to have a smaller mean size. This trend reflects a basic change in the nature of farming: Prior to 1950, most farms were small family farms. Over time, these small farms were replaced by large farms owned by agribusiness corporations.

BY THE WAY

In 1900, more than 40% of the U.S. population worked on farms; by 2000, less than 2% of the population worked on farms.

EXAMPLE  Accuracy of Weather Forecasts

The scatterplots in Figure 7.6 show two weeks of data comparing the actual high temperature for the day with the same-day forecast (part a) and the three-day forecast (part b). Estimate the correlation coefficient for each data set and discuss what these coefficients imply about weather forecasts.

Figure 7.6 Comparison of actual high temperatures with (a) same-day and (b) three-day forecasts.

SOLUTION If every forecast were perfect, each actual temperature would equal the corresponding forecasted temperature. This would result in all points lying on a straight line and a correlation coefficient of r = 1. In Figure 7.6a, in which the forecasts were made at the beginning of the same day, the points lie fairly close to a straight line, meaning that same-day forecasts are closely related to actual temperatures. By comparing this scatterplot to the diagrams in Figure 7.3, we can reasonably estimate this correlation coefficient to be about r = 0.8. The correlation is weaker in Figure 7.6b, indicating that forecasts made three days in advance aren’t as close to actual temperatures as same-day forecasts. This correlation coefficient is about r = 0.6. These results are unsurprising because we expect longer-term forecasts to be less accurate.

TIME OUT TO THINK

For further practice, visually estimate the correlation coefficients for the data for diamond weight and price (Figure 7.1) and diamond color and price (Figure 7.2).

Calculating the Correlation Coefficient (Optional Section)

The formula for the (linear) correlation coefficient r can be expressed in several different ways that are all algebraically equivalent, which means that they produce the same value. The following expression has the advantage of relating more directly to the underlying rationale for r:

r = [Σ ((x − x̄)/sx) × ((y − ȳ)/sy)] / (n − 1)

 USING TECHNOLOGY—SCATTERPLOTS AND CORRELATION COEFFICIENTS

EXCEL The screen shot below shows the process for making a scatterplot like that in Figure 7.1:

  • 1. Enter the data, which are shown in Columns B (weight) and C (price).

  • 2. Select the columns for the two variables on the scatterplot; in this case, Columns B and C.

  • 3. Choose “XY Scatter” as the chart type, with no connecting lines. You can then use the “chart options” (which comes up with a right-click in the graph) to customize the design, axis range, labels, and more.

  • 4. To calculate the correlation coefficient, shown in row 26, use the built-in function CORREL.

  • 5. [Optional] The straight line on the graph, called a best-fit line, is added by choosing the option to “Add Trendline”; be sure to choose the “linear” option for the trendline. You’ll also find options that add the two items shown in the upper left of the graph: the equation of the line and the value R², which is the square of the correlation coefficient. Best-fit lines and R² are discussed in Section 7.3.

Microsoft Excel 2008 for Mac.

STATDISK Enter the paired data in columns of the STATDISK Data Window. Select Analysis from the main menu bar, then select the option Correlation and Regression. Select the columns of data to be used, then click on the Evaluate button. The STATDISK display will include the value of the linear correlation coefficient r and other results. A scatterplot can also be obtained by clicking on the PLOT button.

TI-83/84 Plus Enter the paired data in lists L1 and L2, then press  and select TESTS. Using the option of LinRegTTest will result in several displayed values, including the value of the linear correlation coefficient r.

To obtain a scatterplot, press , then  (for STAT PLOT). Press   to turn Plot 1 on, then select the first graph type, which resembles a scatterplot. Set the X list and Y list labels to L1 and L2 and press , then select ZoomStat and press .

In the above expression, division by n − 1 (where n is the number of pairs of data) shows that r is a type of average, so it does not increase simply because more pairs of data values are included. The symbol sx denotes the standard deviation of the x values (or the values of the first variable), and sy denotes the standard deviation of the y values. The expression (x − x̄)/sx is in the same format as the standard score introduced in Section 5.2. By using the standard scores for x and y, we ensure that the value of r does not change simply because a different scale of values is used. The key to understanding the rationale for r is to focus on the product of the standard scores for x and the standard scores for y. Those products tend to be positive when there is a positive correlation, and they tend to be negative when there is a negative correlation. For data with no correlation, some of the products are positive and some are negative, with the net effect that the sum is relatively close to 0.
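As a concrete illustration, the standard-score form of r can be coded directly. The sketch below (pure Python, with sample standard deviations that divide by n − 1) applies it to the weight and price data from Table 7.1; the result, roughly r ≈ 0.78, is consistent with the fairly strong positive correlation seen in Figure 7.1:

```python
from math import sqrt

# Weight (carats) and price data for the 23 diamonds in Table 7.1.
weights = [1.00, 1.00, 1.01, 1.01, 1.02, 1.04, 1.04, 1.07, 1.07, 1.11, 1.12,
           1.16, 1.20, 1.23, 1.29, 1.50, 1.51, 1.67, 1.72, 1.76, 1.80, 1.88, 2.03]
prices = [6958, 5885, 6333, 4299, 9589, 6921, 4426, 6885, 5826, 3670, 7176,
          7497, 5170, 5547, 7521, 7260, 8139, 12196, 14998, 9736, 9859, 12398, 11008]

def sample_sd(data):
    """Sample standard deviation (divides by n - 1)."""
    n = len(data)
    mean = sum(data) / n
    return sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))

def r_from_standard_scores(xs, ys):
    """r = sum of products of the paired standard scores, divided by n - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx, sy = sample_sd(xs), sample_sd(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

r = r_from_standard_scores(weights, prices)
print(round(r, 2))  # fairly strong positive correlation (roughly 0.78)
```

Note that each product of standard scores contributes positively when a diamond is on the same side of the mean for both weight and price, which is why the sum ends up clearly positive for these data.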

The following alternative formula for r has the advantage of simplifying calculations, so it is often used whenever manual calculations are necessary. It is also easy to program into statistical software or calculators:

r = (nΣxy − (Σx)(Σy)) / (√(nΣx² − (Σx)²) × √(nΣy² − (Σy)²))

This formula is straightforward to use, at least in principle: First calculate each of the required sums, then substitute the values into the formula. Be sure to note that Σx² and (Σx)² are not equal: Σx² tells you to first square all the values of the variable x and then add them; (Σx)² tells you to add the x values first and then square this sum. In other words, perform the operation within the parentheses first. Similarly, Σy² and (Σy)² are not the same.
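A short pure-Python sketch of this shortcut formula also makes the Σx² versus (Σx)² distinction concrete. (The small four-point data set at the end is made up purely for illustration.)

```python
from math import sqrt

def r_shortcut(xs, ys):
    """Computational formula:
    r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx**2) * sqrt(n*Syy - Sy**2))."""
    n = len(xs)
    Sx, Sy = sum(xs), sum(ys)
    Sxx = sum(x * x for x in xs)              # sum of x-squared values
    Syy = sum(y * y for y in ys)              # sum of y-squared values
    Sxy = sum(x * y for x, y in zip(xs, ys))  # sum of the products x*y
    return (n * Sxy - Sx * Sy) / (sqrt(n * Sxx - Sx ** 2) * sqrt(n * Syy - Sy ** 2))

xs = [2, 4, 6]
print(sum(x * x for x in xs))  # 56: square first, then add -- this is the sum of x-squared
print(sum(xs) ** 2)            # 144: add first, then square -- the square of the sum
print(r_shortcut([1, 2, 3, 4], [2, 3, 5, 6]))  # strong positive correlation, close to 1
```

On any given data set, this shortcut formula must give the same value of r as the standard-score expression, since the two are algebraically equivalent.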

Section 7.1 Exercises

Statistical Literacy and Critical Thinking

1.

Correlation. In the context of correlation, what does r measure, and what is it called?

2.

Scatterplot. What is a scatterplot, and how does it help us investigate correlation?

3.

Correlation. After computing the correlation coefficient r from 50 pairs of data, you find that r = 0. Does it follow that there is no relationship between the two variables? Why or why not?

4.

Scatterplot. One set of paired data results in r = 1 and a second set of paired data results in r = –1. How do the corresponding scatterplots differ?

Does It Make Sense? For Exercises 5–8, decide whether the statement makes sense (or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of these statements have definitive answers, so your explanation is more important than your chosen answer.

5.

Births. A study showed that for one town, as the stork population increased, the number of births in the town also increased. It therefore follows that the increase in the stork population caused the number of births to increase.

6.

Positive Effect. An engineer for a car company finds that by reducing the weights of various cars, mileage (mi/gal) increases. Because this is a positive result, we say that there is a positive correlation.

7.

Correlation. Two studies both found a correlation between low birth weight and weakened immune systems. The second study had a much larger sample size, so the correlation it found must be stronger.

8.

Interpreting r. In investigating correlations between many different pairs of variables, in each case the correlation coefficient r must fall between –1 and 1.

Concepts and Applications

Types of Correlation. Exercises 9–16 list pairs of variables. For each pair, state whether you believe the two variables are correlated. If you believe they are correlated, state whether the correlation is positive or negative. Explain your reasoning.

9.

Weight/Cost. The weights and costs of 50 different bags of apples

10.

IQ/Hat Size. The IQ scores and hat sizes of randomly selected adults

11.

Weight/Fuel Efficiency. The total weights of airliners flying from New York to San Francisco and the fuel efficiency as measured in miles per gallon

12.

Weight/Fuel Consumption. The total weights of airliners flying from New York to San Francisco and the total amounts of fuel that they consume

13.

Points and DJIA. The total number of points scored in Super Bowl football games and the changes in the Dow Jones Industrial stock index in the years following those games

14.

Altitude/Temperature. The outside air temperature and the altitude of aircraft

15.

Height/SAT Score. The heights and SAT scores of randomly selected subjects who take the SAT

16.

Golf Score/Prize Money. Golf scores and prize money won by professional golfers

17.

Crickets and Temperature. One classic application of correlation involves the association between the temperature and the number of times a cricket chirps in a minute. The scatterplot in Figure 7.7 shows the relationship for eight different pairs of temperature/chirps data. Estimate the correlation coefficient and determine whether there appears to be a correlation between the temperature and the number of times a cricket chirps in a minute.

Figure 7.7 Scatterplot for cricket chirps and temperature.

Source: Based on data from The Song of Insects by George W. Pierce, Harvard University Press.

18.

Two-Day Forecast. Figure 7.8 shows a scatterplot in which the actual high temperature for the day is compared with a forecast made two days in advance. Estimate the correlation coefficient and discuss what these data imply about weather forecasts. Do you think you would get similar results if you made similar diagrams for other two-week periods? Why or why not?

Figure 7.8

19.

Safe Speeds? Consider the following table showing speed limits and death rates from automobile accidents in selected countries.

Country          Death rate (per 100 million vehicle-miles)   Speed limit (miles per hour)
Norway                 3.0                                        55
United States          3.3                                        55
Finland                3.4                                        55
Britain                3.5                                        70
Denmark                4.1                                        55
Canada                 4.3                                        60
Japan                  4.7                                        55
Australia              4.9                                        65
Netherlands            5.1                                        60
Italy                  6.1                                        75

Source: D. J. Rivkin, New York Times.

  • a. Construct a scatterplot of the data.

  • b. Briefly characterize the correlation in words (for example, strong positive correlation, weak negative correlation) and estimate the correlation coefficient of the data. (Or calculate the correlation coefficient exactly with the aid of a calculator or software.)

  • c. In the newspaper, these data were presented in an article titled “Fifty-five mph speed limit is no safety guarantee.” Based on the data, do you agree with this claim? Explain.

20.

Population Growth. Consider the following table showing percentage change in population and birth rate (per 1,000 of population) for 10 states over a period of 10 years.

State            Percentage change in population   Birth rate
Nevada                    50.1%                       16.3
California                25.7%                       16.9
New Hampshire             20.5%                       12.5
Utah                      17.9%                       21.0
Colorado                  14.0%                       14.6
Minnesota                  7.3%                       13.7
Montana                    1.6%                       12.3
Illinois                     0%                       15.5
Iowa                      –4.7%                       13.0
West Virginia             –8.0%                       11.4

Source: U.S. Census Bureau and Department of Health and Human Services.

  • a. Construct a scatterplot for the data.

  • b. Briefly characterize the correlation in words and estimate the correlation coefficient.

  • c. Overall, does birth rate appear to be a good predictor of a state’s population growth rate? If not, what other factor(s) may be affecting the growth rate?

21.

Brain Size and Intelligence. The table below lists brain sizes (in cm³) and Wechsler IQ scores of subjects (based on data from “Brain Size, Head Size, and Intelligence Quotient in Monozygotic Twins,” by Tramo et al., Neurology, Vol. 50, No. 5). Is there sufficient evidence to conclude that there is a linear correlation between brain size and IQ score? Does it appear that people with larger brains are more intelligent?

Brain Size   IQ
   965        90
 1,029        85
 1,030        86
 1,285       102
 1,049       103
 1,077        97
 1,037       124
 1,068       125
 1,176       102
 1,105       114

  • a. Construct a scatterplot for the data.

  • b. Briefly characterize the correlation in words and estimate the correlation coefficient.

  • c. Do these data suggest that people with larger brains are more intelligent? Explain.

22.

Movie Data. Consider the following table showing total box office receipts and total attendance for all American films.

Year   Total Gross Receipts (billions of dollars)   Tickets Sold (billions)
2001         8.4                                        1.49
2002         9.2                                        1.58
2003         9.2                                        1.53
2004         9.4                                        1.51
2005         8.8                                        1.38
2006         9.2                                        1.41
2007         9.7                                        1.40
2008         9.6                                        1.34
2009        10.6                                        1.41
2010        10.6                                        1.34

Source: Motion Picture Association of America.

  • a. Construct a scatterplot of the data.

  • b. Briefly characterize the correlation in words and estimate the correlation coefficient.

23.

TV Time. Consider the following table showing the average hours of television watched in households in five categories of annual income.

Household income      Weekly TV hours
Less than $30,000          56.3
$30,000–$40,000            51.0
$40,000–$50,000            50.5
$50,000–$60,000            49.7
More than $60,000          48.7

Source: Nielsen Media Research.

  • a. Construct a scatterplot for the data. To locate the dots, use the midpoint of each income category. Use a value of $25,000 for the category “less than $30,000,” and use $70,000 for “more than $60,000.”

  • b. Briefly characterize the correlation in words and estimate the correlation coefficient.

  • c. Suggest a reason why families with higher incomes watch less TV. Do you think these data imply that you can increase your income simply by watching less TV? Explain.

24.

January Weather. Consider the following table showing January mean monthly precipitation and mean daily high temperature for ten Northern Hemisphere cities (National Oceanic and Atmospheric Administration).

City         Mean daily high temperature for January (°F)   Mean January precipitation (inches)
Athens            54                                            2.2
Bombay            88                                            0.1
Copenhagen        36                                            1.6
Jerusalem         55                                            5.1
London            44                                            2.0
Montreal          21                                            3.8
Oslo              30                                            1.7
Rome              54                                            3.3
Tokyo             47                                            1.9
Vienna            34                                            1.5

Source: The New York Times Almanac.

  • a. Construct a scatterplot for the data.

  • b. Briefly characterize the correlation in words and estimate the correlation coefficient.

  • c. Can you draw any general conclusions about January temperatures and precipitation from these data? Explain.

25.

Retail Sales. Consider the following table showing one year’s total sales (revenue) and profits for eight large retailers in the United States.

Company      Total sales (billions of dollars)   Profits (billions of dollars)
Wal-Mart          315.6                              11.2
Kroger             60.6                               0.98
Home Depot         81.5                               5.8
Costco             60.1                               1.1
Target             52.6                               2.4
Starbucks           7.8                               0.6
The Gap            16.0                               1.1
Best Buy           30.8                               1.1

Source: Fortune.com.

  • a. Construct a scatterplot for the data.

  • b. Briefly characterize the correlation in words and estimate the correlation coefficient.

  • c. Discuss your observations. Does higher sales volume necessarily translate into greater earnings? Why or why not?

26.

Calories and Infant Mortality. Consider the following table showing mean daily caloric intake (all residents) and infant mortality rate (per 1,000 births) for 10 countries.

Country         Mean daily calories   Infant mortality rate (per 1,000 births)
Afghanistan     1,523    154
Austria         3,495
Burundi         1,941    114
Colombia        2,678     24
Ethiopia        1,610    107
Germany         3,443
Liberia         1,640    153
New Zealand     3,362
Turkey          3,429     44
United States   3,671

  • a.Construct a scatterplot for the data.

  • b.Briefly characterize the correlation in words and estimate the correlation coefficient.

  • c.Discuss any patterns you observe and any general conclusions that you can reach.

Properties of the Correlation Coefficient. For Exercises 27 and 28, determine whether the given property is true, and explain your answer.

27.

Interchanging Variables. The correlation coefficient remains unchanged if we interchange the variables x and y.

28.

Changing Units of Measurement. The correlation coefficient remains unchanged if we change the units used to measure x, y, or both.
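Exercises 27 and 28 can be explored numerically before answering. The sketch below uses made-up height and weight data; `pearson_r` is our own helper implementing the standard formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product of spreads.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

heights_in = [62, 65, 67, 70, 74]        # inches (made-up data)
weights_lb = [110, 140, 150, 170, 190]   # pounds (made-up data)

r = pearson_r(heights_in, weights_lb)

# Exercise 27: interchanging the variables x and y leaves r unchanged.
assert abs(r - pearson_r(weights_lb, heights_in)) < 1e-12

# Exercise 28: changing units (inches -> cm, pounds -> kg) leaves r unchanged.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w / 2.2046 for w in weights_lb]
assert abs(r - pearson_r(heights_cm, weights_kg)) < 1e-12
```

Both checks pass: r is symmetric in x and y, and multiplying either variable by a positive constant (a change of units) rescales the covariance and the spread by the same factor, so r is unchanged.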

PROJECTS FOR THE INTERNET & BEYOND

29.

Unemployment and Inflation. Use the Bureau of Labor Statistics Web page to find monthly unemployment rates and inflation rates over the past year. Construct a scatterplot for the data. Do you see any trends?

30.

Success in the NFL. Find last season’s NFL team statistics. Construct a table showing the following for each team: number of wins, average yards gained on offense per game, and average yards allowed on defense per game. Make scatterplots to explore the correlations between offense and wins and between defense and wins. Discuss your findings. Do you think that there are other team statistics that would yield stronger correlations with the number of wins?

31.

Statistical Abstract. Explore the “frequently requested tables” at the Web site for the Statistical Abstract of the United States. Choose data that are of interest to you and explore at least two correlations. Briefly discuss what you learn from the correlations.

32.

Height and Arm Span. Select a sample of at least eight people and measure each person’s height and arm span. (When you measure arm span, the person should stand with arms extended like the wings on an airplane.) Using the paired sample data, construct a scatterplot and estimate or calculate the value of the correlation coefficient. What do you conclude?

33.

Height and Pulse Rate. Select a sample of at least eight people and record each person’s pulse rate by counting the number of heartbeats in 1 minute. Also record each person’s height. Using the paired sample data, construct a scatterplot and estimate or calculate the value of the correlation coefficient. What do you conclude?

IN THE NEWS

34.

Correlations in the News. Find a recent news report that discusses some type of correlation. Describe the correlation. Does the article give any sense of the strength of the correlation? Does it suggest that the correlation reflects any underlying causality? Briefly discuss whether you believe the implications the article makes with respect to the correlation.

35.

Your Own Positive Correlations. Give examples of two variables that you expect to be positively correlated. Explain why the variables are correlated and why the correlation is (or is not) important.

36.

Your Own Negative Correlations. Give examples of two variables that you expect to be negatively correlated. Explain why the variables are correlated and why the correlation is (or is not) important.

7.2 INTERPRETING CORRELATIONS

  • Statistics show that of those who contract the habit of eating, very few survive. —Wallace Irwin

Researchers sifting through statistical data are constantly looking for meaningful correlations, and the discovery of a new and surprising correlation often leads to a flood of news reports. You may recall hearing about some of these discovered correlations: dark chocolate consumption correlated with reduced risk of heart disease; musical talent correlated with good grades in mathematics; or eating less correlated with increased longevity. Unfortunately, the task of interpreting such correlations is far more difficult than discovering them in the first place. Long after the news reports have faded, we may still be unsure of whether the correlations are significant and, if so, whether they tell us anything of practical importance. In this section, we discuss some of the common difficulties associated with interpreting correlations.

Beware of Outliers

Examine the scatterplot in Figure 7.9. Your eye probably tells you that there is a positive correlation in which larger values of x tend to mean larger values of y. Indeed, if you calculate the correlation coefficient for these data, you’ll find that it is a relatively high r = 0.880, suggesting a very strong correlation.

Figure 7.9 How does the outlier affect the correlation?

However, if you place your thumb over the data point in the upper right corner of Figure 7.9, the apparent correlation disappears. In fact, without this data point, the correlation coefficient is zero! In other words, removing this one data point changes the correlation coefficient from r = 0.880 to r = 0.

This example shows that correlations can be very sensitive to outliers. Recall that an outlier is a data value that is extreme compared to most other values in a data set (see Section 4.1). We must therefore examine outliers and their effects carefully before interpreting a correlation. On the one hand, if the outliers are mistakes in the data set, they can produce apparent correlations that are not real or mask the presence of real correlations. On the other hand, if the outliers represent real and correct data points, they may be telling us about relationships that would otherwise be difficult to see.

Note that while we should examine outliers carefully, we should not remove them unless we have strong reason to believe that they do not belong in the data set. Even in that case, good research principles demand that we report the outliers along with an explanation of why we thought it legitimate to remove them.
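The sensitivity of r to a single outlier is easy to demonstrate with made-up numbers (these are not the actual values plotted in Figure 7.9):

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product of spreads.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

# Six clustered points with no correlation at all...
x = [1, 1, 2, 2, 3, 3]
y = [2, 4, 1, 5, 2, 4]
print(pearson_r(x, y))  # 0.0

# ...plus one extreme point far to the upper right.
r_with_outlier = pearson_r(x + [10], y + [12])
print(round(r_with_outlier, 2))  # a strong apparent correlation
```

Adding the single outlier raises the correlation coefficient from exactly 0 to nearly 0.9, mirroring the behavior described for Figure 7.9.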

EXAMPLE  Masked Correlation

You’ve conducted a study to determine how the number of calories a person consumes in a day correlates with time spent in vigorous bicycling. Your sample consisted of ten women cyclists, all of approximately the same height and weight. Over a period of two weeks, you asked each woman to record the amount of time she spent cycling each day and what she ate on each of those days. You used the eating records to calculate the calories consumed each day. Figure 7.10 shows a scatterplot with each woman’s mean time spent cycling on the horizontal axis and mean caloric intake on the vertical axis. Do higher cycling times correspond to higher intake of calories?

Figure 7.10 Data from the cycling study.

SOLUTION If you look at the data as a whole, your eye will probably tell you that there is a positive correlation in which greater cycling time tends to go with higher caloric intake. But the correlation is very weak, with a correlation coefficient of r = 0.374. However, notice that two points are outliers: one representing a cyclist who cycled about a half-hour per day and consumed more than 3,000 calories, and the other representing a cyclist who cycled more than 2 hours per day on only 1,200 calories. It’s difficult to explain the two outliers, given that all the women in the sample have similar heights and weights. We might therefore suspect that these two women either recorded their data incorrectly or were not following their usual habits during the two-week study. If we can confirm this suspicion, then we would have reason to delete the two data points as invalid. Figure 7.11 shows that the correlation is quite strong without those two outlier points, and suggests that the number of calories consumed rises by a little more than 500 calories for each hour of cycling. Of course, we should not remove the outliers without confirming our suspicion that they were invalid data points, and we should report our reasons for leaving them out.

Figure 7.11 The data from Figure 7.10 without the two outliers.

Beware of Inappropriate Grouping

Correlations can also be misinterpreted when data are grouped inappropriately. In some cases, grouping data hides correlations. Consider a (hypothetical) study in which researchers seek a correlation between hours of TV watched per week and high school grade point average (GPA). They collect the 21 data pairs in Table 7.2.

The scatterplot (Figure 7.12) shows virtually no correlation; the correlation coefficient for the data is about r = –0.063. The lack of correlation seems to suggest that TV viewing habits are unrelated to academic achievement. However, one astute researcher realizes that some of the students watched mostly educational programs, while others tended to watch comedies, dramas, and movies. She therefore divides the data set into two groups, one for the students who watched mostly educational television and one for the other students. Table 7.3 shows her results with the students divided into these two groups.

Figure 7.12 The full set of data concerning hours of TV and GPA shows virtually no correlation.

TABLE 7.2 Hours of TV and High School GPA (hypothetical data)

Hours per week of TV    GPA
 3.2                    3.0
 3.1                    2.5
 2.9                    3.0
 2.5                    2.7
 2.8                    2.7
 2.5                    2.9
10                      3.4
12                      3.6
12                      2.5
14                      3.5
14                      2.3
15                      3.7
16                      2.0
20                      3.6
20                      1.9

Now we find two very strong correlations (Figure 7.13): a strong positive correlation for the students who watched educational programs (r = 0.855) and a strong negative correlation for the other students (r = –0.951). The moral of this story is that the original data set hid an important (hypothetical) correlation between TV and GPA: Watching educational TV correlated positively with GPA and watching non-educational TV correlated negatively with GPA. Only when the data were grouped appropriately could this discovery be made.
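The effect of grouping can be reproduced with a small made-up data set in the spirit of Tables 7.2 and 7.3 (the numbers below are our own, not the table values; `pearson_r` is our own helper):

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product of spreads.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

hours = [2, 5, 10, 15, 20]

# Group 1 (educational TV): GPA rises with viewing time.
gpa_edu = [2.8, 3.0, 3.3, 3.6, 3.8]
# Group 2 (regular TV): GPA falls with viewing time.
gpa_reg = [3.8, 3.5, 3.1, 2.6, 2.2]

r_all = pearson_r(hours + hours, gpa_edu + gpa_reg)
r_edu = pearson_r(hours, gpa_edu)
r_reg = pearson_r(hours, gpa_reg)

print(round(r_all, 2))  # weak: the pooled data hide the pattern
print(round(r_edu, 2))  # strong positive within group 1
print(round(r_reg, 2))  # strong negative within group 2
```

Pooling the two groups washes the opposing trends out to a weak overall correlation, while each group alone shows a very strong correlation, just as in the TV and GPA example.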

TABLE 7.3 Hours of TV and High School GPA—Grouped Data (hypothetical data)

Group 1: watched educational programs      Group 2: watched regular TV
Hours per week of TV   GPA                 Hours per week of TV   GPA
 2.5                   3.2                  2.8                   3.0
 2.7                   3.1                  2.9                   2.9
10                     3.4                  3.0
12                     3.6                  2.5
14                     3.5                  2.7
15                     3.7                  2.5
20                     3.6                 12                     2.5
                                           14                     2.3
                                           16                     2.0
                                           20                     1.9

BY THE WAY

Children ages 2–5 watch an average of 26 hours of television per week, while children ages 6–11 watch an average of 20 hours of television per week (Nielsen Media Research). Adult viewership averages more than 25 hours per week. If the average adult replaced television time with a job paying just $8 per hour, his or her annual income would rise by more than $10,000.

Figure 7.13 These scatterplots show the same data as Figure 7.12, separated into the two groups identified in Table 7.3.

In other cases, a data set may show a stronger correlation than actually exists among subgroups. Consider the (hypothetical) data in Table 7.4, showing the relationship between the weights and prices of selected cars. Figure 7.14 shows the scatterplot.

The data set as a whole shows a strong correlation; the correlation coefficient is r = 0.949. However, on closer examination, we see that the data fall into two rather distinct categories corresponding to light and heavy cars. If we analyze these subgroups separately, neither shows any correlation: The light cars alone (top six in Table 7.4) have a correlation coefficient r = 0.019 and the heavy cars alone (bottom six in Table 7.4) have a correlation coefficient r = –0.022. You can see the problem by looking at Figure 7.14. The apparent correlation of the full data set occurs because of the separation between the two clusters of points; there’s no correlation within either cluster.

TABLE 7.4 Car Weights and Prices (hypothetical data)

Weight (pounds)   Price (dollars)
1,500              9,500
1,600              8,000
1,700              8,200
1,750              9,500
1,800              9,200
1,800              8,700
3,000             29,000
3,500             25,000
3,700             27,000
4,000             31,000
3,600             25,000
3,200             30,000

Figure 7.14 Scatterplot for the car weight and price data in Table 7.4.
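The three correlation coefficients quoted above can be verified directly from Table 7.4. A quick Python check (`pearson_r` is our own helper implementing the standard formula):

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product of spreads.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

# Table 7.4 data, split into the light and heavy clusters.
light_w = [1500, 1600, 1700, 1750, 1800, 1800]
light_p = [9500, 8000, 8200, 9500, 9200, 8700]
heavy_w = [3000, 3500, 3700, 4000, 3600, 3200]
heavy_p = [29000, 25000, 27000, 31000, 25000, 30000]

r_all = pearson_r(light_w + heavy_w, light_p + heavy_p)
r_light = pearson_r(light_w, light_p)
r_heavy = pearson_r(heavy_w, heavy_p)

print(round(r_all, 3))    # strong correlation for the pooled data
print(round(r_light, 3))  # essentially zero within the light cluster
print(round(r_heavy, 3))  # essentially zero within the heavy cluster
```

The pooled coefficient is very strong, while each cluster alone shows essentially no correlation: the apparent relationship comes entirely from the gap between the two clusters.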

TIME OUT TO THINK

Suppose you were shopping for a compact car. If you looked at only the overall data and correlation coefficient from Figure 7.14, would it be reasonable to consider weight as an important factor in price? What if you looked at the data for light and heavy cars separately? Explain.

CASE STUDY Fishing for Correlations

Oxford physician Richard Peto submitted a paper to the British medical journal Lancet showing that heart-attack victims had a better chance of survival if they were given aspirin within a few hours after their heart attacks. The editors of Lancet asked Peto to break down the data into subsets, to see whether the benefits of the aspirin were different for different groups of patients. For example, was aspirin more effective for patients of a certain age or for patients with certain dietary habits?

Breaking the data into subsets can reveal important facts, such as whether men and women respond to the treatment differently. However, Peto felt that the editors were asking him to divide his sample into too many subgroups. He therefore objected to the request, arguing that it would result in purely coincidental correlations. Writing about this story in the Washington Post, journalist Rick Weiss said, “When the editors insisted, Peto capitulated, but among other things he divided his patients by zodiac birth signs and demanded that his findings be included in the published paper. Today, like a warning sign to the statistically uninitiated, the wacky numbers are there for all to see: Aspirin is useless for Gemini and Libra heart-attack victims but is a lifesaver for people born under any other sign.”

The moral of this story is that a “fishing expedition” for correlations can often produce them. That doesn’t make the correlations meaningful, even though they may appear significant by standard statistical measures.
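A rough calculation shows why fishing expeditions so often "succeed." The sketch below assumes, purely for illustration, that each subgroup analysis behaves like an independent test at the conventional 5% significance level:

```python
# If each of 12 independent subgroup analyses (say, one per zodiac sign)
# uses a 5% significance threshold, the chance that at least one subgroup
# shows a "significant" result purely by coincidence is surprisingly large.
p_single = 0.05
n_subgroups = 12

# P(at least one false positive) = 1 - P(no false positives)
p_at_least_one = 1 - (1 - p_single) ** n_subgroups
print(round(p_at_least_one, 2))  # about 0.46, nearly a coin flip
```

With a dozen subgroups, the chance of at least one purely coincidental "significant" finding is about 46%, which is why apparently significant correlations turn up so readily when we go looking for them.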

Correlation Does Not Imply Causality

Perhaps the most important caution about interpreting correlations is one we’ve already mentioned: Correlation does not necessarily imply causality. In general, correlations can appear for any of the following three reasons.

Possible Explanations for a Correlation

  • 1. The correlation may be a coincidence.

  • 2. Both correlation variables might be directly influenced by some common underlying cause.

  • 3. One of the correlated variables may actually be a cause of the other. But note that, even in this case, it may be just one of several causes.

For example, the correlation between infant mortality and life expectancy in Figure 7.4 is a case of common underlying cause: Both variables respond to the underlying variable quality of health care. The correlation between smoking and lung cancer reflects the fact that smoking causes lung cancer (see the discussion in Section 7.4). Coincidental correlations are also quite common; Example 2 below discusses one such case.

Caution about causality is particularly important in light of the fact that many statistical studies are designed to look for causes. Because these studies generally begin with the search for correlations, it’s tempting to think that the work is over as soon as a correlation is found. However, as we will discuss in Section 7.4, establishing causality can be very difficult.

EXAMPLE  How to Get Rich in the Stock Market (Maybe)

Every financial advisor has a strategy for predicting the direction of the stock market. Most focus on fundamental economic data, such as interest rates and corporate profits. But an alternative strategy might rely on a famous correlation between the Super Bowl winner in January and the direction of the stock market for the rest of the year: The stock market tends to rise when a team from the old, pre-1970 NFL wins the Super Bowl and tends to fall when the winner is not from the old NFL. This correlation successfully matched 28 of the first 32 Super Bowls to the stock market, which made the “Super Bowl Indicator” a far more reliable predictor of the stock market than any professional stockbroker during the same period. In fact, detailed calculations show that the probability of such success by pure chance is less than 1 in 100,000. Should you therefore make a decision about whether to invest in the stock market based on the NFL origins of the most recent Super Bowl winner?

SOLUTION The extremely strong correlation might make it seem like a good idea to base your investments on the Super Bowl Indicator, but sometimes you need to apply a bit of common sense. No matter how strong the correlation might be, it is inconceivable that the origin of the winning team actually causes the stock market to move in a particular direction. The correlation is undoubtedly a coincidence, and the fact that its probability of occurring by pure chance was less than 1 in 100,000 is just another illustration of the fact that you can turn up surprising correlations if you go fishing for them. This fact was borne out in more recent Super Bowls: Following Super Bowl 32, the indicator successfully predicted the stock market direction in only 5 of the next 10 years—exactly the fraction that would be expected by pure chance.
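The 1-in-100,000 figure can be checked with a short binomial calculation, treating the indicator as a fair coin that matches the market's direction with probability 1/2 each year:

```python
from math import comb

# Probability that a meaningless "indicator" (a fair coin) matches the
# market's direction in at least 28 of 32 years.
n, hits = 32, 28
p = sum(comb(n, k) for k in range(hits, n + 1)) / 2 ** n
print(p)  # indeed less than 1 in 100,000
```

The tail probability is indeed below 1 in 100,000. Yet, as the solution notes, when many candidate indicators are examined, finding one this "successful" by pure chance is not surprising.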

CASE STUDY Oat Bran and Heart Disease

If you buy a product that contains oat bran, there’s a good chance that the label will tout the healthful effects of eating oats. Indeed, several studies have found correlations in which people who eat more oat bran tend to have lower rates of heart disease. But does this mean that everyone should eat more oats?

Not necessarily. Just because oat bran consumption is correlated with reduced risk of heart disease does not mean that it causes reduced risk of heart disease. In fact, the question of causality is quite controversial in this case. Other studies suggest that people who eat a lot of oat bran tend to have generally healthful diets. Thus, the correlation between oat bran consumption and reduced risk of heart disease may be a case of a common underlying cause: Having a healthy diet leads people both to consume more oat bran and to have a lower risk of heart disease. In that case, for some people, adding oat bran to their diets might be a bad idea because it could cause them to gain weight, and weight gain is associated with increased risk of heart disease.

This example shows the importance of using caution when considering issues of correlation and causality. It may be a long time before medical researchers know for sure whether adding oat bran to your diet actually causes a reduced risk of heart disease.

Useful Interpretations of Correlation

In discussing uses of correlation that might lead to wrong interpretations, we have described the effects of outliers, inappropriate groupings, fishing for correlations, and incorrectly concluding that correlation implies causality. But there are many correct and useful interpretations of correlation, some of which we have already studied. So while you should be cautious in interpreting correlations, they remain a valuable tool in any field in which statistical research plays a role.

Section 7.2 Exercises

Statistical Literacy and Critical Thinking

1.

Correlation and Causality. In clinical trials of the drug Lisinopril, it is found that increased dosages of the drug correlated with lower blood pressure levels. Based on the correlation, can we conclude that Lisinopril treatments cause lower blood pressure? Why or why not?

2.

SIDS. An article in the New York Times on infant deaths included a statement that, based on the study results, putting infants to sleep in the supine position decreased deaths due to SIDS (sudden infant death syndrome). What is wrong with that statement?

3.

Outliers. When studying salaries paid to CEOs of large companies, it is found that almost all of them range from a few hundred thousand dollars to several million dollars, but one CEO is paid a salary of $1. Is that salary of $1 an outlier? In general, how might outliers affect conclusions about correlation?

4.

Scatterplot. Does a scatterplot reveal anything about a cause and effect relationship between two variables?

Does It Make Sense? For Exercises 5–8, decide whether the statement makes sense (or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of these statements have definitive answers, so your explanation is more important than your chosen answer.

5.

Scatterplot. A set of paired sample data results in a correlation coefficient of r = 0, so the scatterplot will show no pattern in the plotted points.

6.

Causation. If we have 20 pairs of sample data with a correlation coefficient of 1, then we know that one of the two variables is definitely the cause of the other.

7.

Causation. If we conduct a study showing that there is a strong negative correlation between resting pulse rate and amounts of time spent in rigorous exercise, we can conclude that decreases in resting pulse rates are somehow associated with increases in exercise.

8.

Causation. If we have two variables with one being the direct cause of the other, then there may or may not be a correlation between those two variables.

Concepts and Applications

Correlation and Causality. Exercises 9–16 make statements about a correlation. In each case, state the correlation clearly. (For example, we might state that “there is a positive correlation between variable A and variable B.”) Then state whether the correlation is most likely due to coincidence, a common underlying cause, or a direct cause. Explain your answer.

9.

Guns and Crime Rate. In one state, the number of unregistered handguns steadily increased over the past several years, and the crime rate increased as well.

10.

Running and Weight. It has been found that people who exercise regularly by running tend to weigh less than those who do not run, and those who run longer distances tend to weigh less than those who run shorter distances.

11.

Study Time. Statistics students find that as they spend more time studying, their test scores are higher.

12.

Vehicles and Waiting Time. It has been found that as the number of registered vehicles increases, the time drivers spend sitting in traffic also increases.

13.

Traffic Lights and Car Crashes. It has been found that as the number of traffic lights increases, the number of car crashes also increases.

14.

Galaxies. Astronomers have discovered that, with the exception of a few nearby galaxies, all galaxies in the universe are moving away from us. Moreover, the farther the galaxy, the faster it is moving away. That is, the more distant a galaxy, the greater the speed at which it is moving away from us.

15.

Gas and Driving. It has been found that as gas prices increase, the distances vehicles are driven tend to get shorter.

16.

Melanoma and Latitude. Some studies have shown that, for certain ethnic groups, the incidence of melanoma (the most dangerous form of skin cancer) increases as latitude decreases.

17.

Outlier Effects. Consider the scatterplot in Figure 7.15.

Figure 7.15

  • a.Which point is an outlier? Ignoring the outlier, estimate or compute the correlation coefficient for the remaining points.

  • b.Now include the outlier. How does the outlier affect the correlation coefficient? Estimate or compute the correlation coefficient for the complete data set.

18.

Outlier Effects. Consider the scatterplot in Figure 7.16.

Figure 7.16

  • a.Which point is an outlier? Ignoring the outlier, estimate or compute the correlation coefficient for the remaining points.

  • b.Now include the outlier. How does the outlier affect the correlation coefficient? Estimate or compute the correlation coefficient for the complete data set.

19.

Grouped Shoe Data. The following table gives measurements of weight and shoe size for 10 people (including both men and women).

  • a.Construct a scatterplot for the data. Estimate or compute the correlation coefficient. Based on this correlation coefficient, would you conclude that shoe size and weight are correlated? Explain.

    Weight (pounds)   Shoe size
    105
    112               4.5
    115
    123
    135
    155               10
    165               11
    170
    180               10
    190               12

  • b.You later learn that the first five data values in the table are for women and the next five are for men. How does this change your view of the correlation? Is it still reasonable to conclude that shoe size and weight are correlated?

20.

Grouped Temperature Data. The following table shows the average January high temperature and the average July high temperature for 10 major cities around the world.

City           January high   July high
Berlin         35             74
Geneva         39             77
Kabul          36             92
Montreal       21             78
Prague         34             74
Auckland       73             56
Buenos Aires   85             57
Sydney         78             60
Santiago       85             59
Melbourne      78             56

  • a.Construct a scatterplot for the data. Estimate or compute the correlation coefficient. Based on this correlation coefficient, would you conclude that January and July temperatures are correlated for these cities? Explain.

  • b.Notice that the first five cities in the table are in the Northern Hemisphere and the next five are in the Southern Hemisphere. How does this change your view of the correlation? Would you now conclude that January and July temperatures are correlated for these cities? Explain.

21.

Birth and Death Rates. Figure 7.17 shows the birth and death rates for different countries, measured in births and deaths per 1,000 population.

Figure 7.17 Birth and death rates for different countries.

Source: United Nations.

  • a.Estimate the correlation coefficient and discuss whether there is a strong correlation between the variables.

  • b.Notice that there appear to be two groups of data points within the full data set. Make a reasonable guess as to the makeup of these groups. In which group might you find a relatively wealthy country like Sweden? In which group might you find a relatively poor country like Uganda?

  • c.Assuming that your guess about groups in part b is correct, do there appear to be correlations within the groups? Explain. How could you confirm your guess about the groups?

22.

Reading and Test Scores. The following (hypothetical) data set gives the number of hours 10 sixth-graders read per week and their performance on a standardized verbal test (maximum of 100).

Reading time per week   Verbal test score
50                      65
56                      62
65                      60
75                      50
10                      88
12                      38

  • a.Construct a scatterplot for these data. Estimate or compute the correlation coefficient. Based on this correlation coefficient, would you conclude that reading time and test scores are correlated? Explain.

  • b.Suppose you learn that five of the children read only comic books while the other five read regular books. Make a guess as to which data points fall in which group. How could you confirm your guess about the groups?

  • c.Assuming that your guess in part b is correct, how does it change your view of the correlation between reading time and test scores? Explain.

PROJECTS FOR THE INTERNET & BEYOND

23.

Football-Stock Update. Find data for recent years concerning the Super Bowl winner and the end-of-year change in the stock market (positive or negative). Do recent results still agree with the correlation described in Example 2? Explain.

24.

Real Correlations.

  • a.Describe a real situation in which there is a positive correlation that is the result of coincidence.

  • b.Describe a real situation in which there is a positive correlation that is the result of a common underlying cause.

  • c.Describe a real situation in which there is a positive correlation that is the result of a direct cause.

  • d.Describe a real situation in which there is a negative correlation that is the result of coincidence.

  • e.Describe a real situation in which there is a negative correlation that is the result of a common underlying cause.

  • f.Describe a real situation in which there is a negative correlation that is the result of a direct cause.

IN THE NEWS

25.

Misinterpreted Correlations. Find a recent news report in which you believe that a correlation may have been misinterpreted. Describe the correlation, the reported interpretation, and the problems you see in the interpretation.

26.

Well-Interpreted Correlations. Find a recent news report in which you believe that a correlation has been presented with a reasonable interpretation. Describe the correlation and the reported interpretation, and explain why you think the interpretation is valid.

7.3 BEST-FIT LINES AND PREDICTION

Suppose you are lucky enough to win a 1.5-carat diamond in a contest. Based on the correlation between weight and price in Figure 7.1, it should be possible to predict the approximate value of the diamond. We need only study the graph carefully and decide where a point corresponding to 1.5 carats is most likely to fall. To do this, it is helpful to draw a best-fit line (also called a regression line) through the data, as shown in Figure 7.18. This line is a “best fit” in the sense that, according to a standard statistical measure (which we discuss shortly), the data points lie closer to this line than to any other straight line that we could draw through the data.

Figure 7.18 Best-fit line for the data from Figure 7.1.

BY THE WAY

The term regression comes from an 1877 study by Sir Francis Galton. He found that the heights of boys with short or tall fathers were closer to the mean than were the heights of their fathers. He therefore said that the heights of the children regress toward the mean, from which we get the term regression. The term is now used even for data that have nothing to do with a tendency to regress toward a mean.

Definition

The best-fit line (or regression line) on a scatterplot is a line that lies closer to the data points than any other possible line (according to a standard statistical measure of closeness).

Of all the possible straight lines that can be drawn on a diagram, how do you know which one is the best-fit line? In many cases, you can make a good estimate of the best-fit line simply by looking at the data and drawing the line that visually appears to pass closest to all the data points. This method involves drawing the best-fit line “by eye.” As you might guess, there are methods for calculating the precise equation of a best-fit line (see the optional topic at the end of this section), and many computer programs and calculators can do these calculations automatically. For our purposes in this text, a fit by eye will generally be sufficient.
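The standard statistical measure behind the best-fit line is the least-squares criterion: the line is chosen to minimize the sum of the squared vertical distances from the data points. The slope and intercept have simple closed forms, sketched below in Python (our own illustration, with made-up points):

```python
def best_fit_line(xs, ys):
    # Least-squares slope and intercept: the line minimizing the sum of
    # squared vertical distances from the data points to the line.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Points lying exactly on y = 2x + 1 recover that line exactly.
slope, intercept = best_fit_line([1, 2, 3], [3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```

For data that lie exactly on a line, the formulas return that line; for scattered data, they return the compromise line that the "by eye" method approximates.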

Predictions with Best-Fit Lines

We can use the best-fit line in Figure 7.18 to predict the price of a 1.5-carat diamond. As indicated by the dashed lines in the figure, the best-fit line predicts that the diamond will cost about $9,000. Notice, however, that two actual data points in the figure correspond to 1.5-carat diamonds, and both of these diamonds cost less than $9,000. That is, although the predicted price of $9,000 sounds reasonable, it is certainly not guaranteed. In fact, the degree of scatter among the data points in this case tells us that we should not trust the best-fit line to predict accurately the price for any individual diamond. Instead, the prediction is meaningful only in a statistical sense: It tells us that if we examined many 1.5-carat diamonds, their mean price would be about $9,000.

This is only the first of several important cautions about interpreting predictions with best-fit lines. A second caution is to beware of using best-fit lines to make predictions that go beyond the bounds of the available data. Figure 7.19 shows a best-fit line for the correlation between infant mortality and longevity from Figure 7.4. According to this line, a country with a life expectancy of more than about 80 years would have a negative infant mortality rate, which is impossible.

  • It is a capital mistake to theorize before one has data. —Arthur Conan Doyle

Figure 7.19 A best-fit line for the correlation between infant mortality and longevity from Figure 7.4.

Source: United Nations.

A third caution is to avoid using best-fit lines from old data sets to make predictions about current or future results. For example, economists studying historical data found a strong negative correlation between unemployment and the rate of inflation. According to this correlation, inflation should have risen dramatically in the mid-2000s when the unemployment rate fell below 6%. But inflation remained low, showing that the correlation from old data did not continue to hold.

Fourth, a correlation discovered with a sample drawn from a particular population cannot generally be used to make predictions about other populations. For example, we can’t expect that the correlation between aspirin consumption and heart attacks in an experiment involving only men will also apply to women.

  • It’s tough to make predictions, especially about the future. —attributed to Niels Bohr, Yogi Berra, and others

Fifth, remember that we can draw a best-fit line through any data set, but that line is meaningless when the correlation is not significant or when the relationship is nonlinear. For example, there is no correlation between shoe size and IQ, so we could not use shoe size to predict IQ.

Cautions in Making Predictions from Best-Fit Lines

  • 1. Don’t expect a best-fit line to give a good prediction unless the correlation is strong and there are many data points. If the sample points lie very close to the best-fit line, the correlation is very strong and the prediction is more likely to be accurate. If the sample points lie away from the best-fit line by substantial amounts, the correlation is weak and predictions tend to be much less accurate.

  • 2. Don’t use a best-fit line to make predictions beyond the bounds of the data points to which the line was fit.

  • 3. A best-fit line based on past data is not necessarily valid now and might not result in valid predictions of the future.

  • 4. Don’t make predictions about a population that is different from the population from which the sample data were drawn.

  • 5. Remember that a best-fit line is meaningless when there is no significant correlation or when the relationship is nonlinear.

EXAMPLE  Valid Predictions?

State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.

  • a. You’ve found a best-fit line for a correlation between the number of hours per day that people exercise and the number of calories they consume each day. You’ve used this correlation to predict that a person who exercises 18 hours per day would consume 15,000 calories per day.

  • b. There is a well-known but weak correlation between SAT scores and college grades. You use this correlation to predict the college grades of your best friend from her SAT scores.

  • c. Historical data have shown a strong negative correlation between national birth rates and affluence. That is, countries with greater affluence tend to have lower birth rates. These data predict a high birth rate in Russia.

  • d. A study in China has discovered correlations that are useful in designing museum exhibits that Chinese children enjoy. A curator suggests using this information to design a new museum exhibit for Atlanta-area school children.

  • e. Scientific studies have shown a very strong correlation between children’s ingesting of lead and mental retardation. Based on this correlation, paints containing lead were banned.

  • f. Based on a large data set, you’ve made a scatterplot for salsa consumption (per person) versus years of education. The diagram shows no significant correlation, but you’ve drawn a best-fit line anyway. The line predicts that someone who consumes a pint of salsa per week has at least 13 years of education.

SOLUTION

  • a. No one exercises 18 hours per day on an ongoing basis, so this much exercise must be beyond the bounds of any data collected. Therefore, a prediction about someone who exercises 18 hours per day should not be trusted.

  • b. The fact that the correlation between SAT scores and college grades is weak means there is much scatter in the data. As a result, we should not expect great accuracy if we use this weak correlation to make a prediction about a single individual.

  • c. We cannot automatically assume that the historical data still apply today. In fact, Russia currently has a very low birth rate, despite also having a low level of affluence.

  • d. The suggestion to use information from the Chinese study for an Atlanta exhibit assumes that predictions made from correlations in China also apply to Atlanta. However, given the cultural differences between China and Atlanta, the curator’s suggestion should not be considered without more information to back it up.

  • e. Given the strength of the correlation and the severity of the consequences, this prediction and the ban that followed seem quite reasonable. In fact, later studies established lead as an actual cause of mental retardation, making the rationale behind the ban even stronger.

  • f. Because there is no significant correlation, the best-fit line and any predictions made from it are meaningless.

BY THE WAY

In the United States, lead was banned from house paint in 1978 and from food cans in 1991, and a 25-year phaseout of lead in gasoline was completed in 1995. Nevertheless, many young children—especially children living in poor areas—still have enough lead in their blood to damage their health. Major sources of ongoing lead hazards include paint in older housing and soil near major roads, which has high lead content from past use of leaded gasoline.

EXAMPLE  Will Women Be Faster Than Men?

Figure 7.20 shows data and best-fit lines for both men’s and women’s world record times in the 1-mile race. Based on these data, predict when the women’s world record will be faster than the men’s world record. Comment on the prediction.

Figure 7.20 World record times in the mile (men and women).

SOLUTION If we accept the best-fit lines as drawn, the women’s world record will equal the men’s world record by about 2040. However, this is not a valid prediction because it is based on extending the best-fit lines beyond the range of the actual data. In fact, notice that the most recent world records (as of 2011) date all the way back to 1999 for men and 1996 for women, while the best-fit lines predict that the records should have fallen by several more seconds since those dates.

The Correlation Coefficient and Best-Fit Lines

Earlier, we discussed the correlation coefficient as one way of measuring the strength of a correlation. We can also use the correlation coefficient to say something about the validity of predictions with best-fit lines.

For mathematical reasons (not discussed in this text), the square of the correlation coefficient, or r2, is the proportion of the variation in a variable that is accounted for by the best-fit line (or, more technically, by the linear relationship that the best-fit line expresses). For example, the correlation coefficient for the diamond weight and price data (see Figure 7.18) turns out to be r = 0.777. If we square this value, we get r2 = 0.604, which we can interpret as follows: About 0.6, or 60%, of the variation in the diamond prices is accounted for by the best-fit line relating weight and price. That leaves 40% of the variation in price that must be due to other factors, presumably such things as depth, table, color, and clarity, which is why predictions made with the best-fit line in Figure 7.18 are not very precise.
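The interpretation of r2 as "variation accounted for" can be verified numerically. This sketch (with invented data, not the diamond figures) computes r2 directly from the correlation coefficient and also as 1 minus the ratio of unexplained to total variation:

```python
import numpy as np

# Invented paired data; any two related measurements would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient
r_squared = r ** 2            # proportion of variation accounted for

# Equivalent check: r^2 equals 1 - SSE/SST for the least-squares line.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
sse = np.sum(residuals ** 2)           # unexplained variation
sst = np.sum((y - y.mean()) ** 2)      # total variation
print(round(r_squared, 4), round(1 - sse / sst, 4))
```

The two printed values agree, which is exactly the sense in which r2 measures the share of the variation captured by the best-fit line.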

A best-fit line can give precise predictions only in the case of a perfect correlation (r = 1 or r = –1); we then find r2 = 1, which means that 100% of the variation in a variable can be accounted for by the best-fit line. In this special case of r2 = 1, predictions should be exactly correct, except for the fact that the sample data might not be a true representation of the population data.

Best-Fit Lines and r2

The square of the correlation coefficient, or r2, is the proportion of the variation in a variable that is accounted for by the best-fit line.

TECHNICAL NOTE

Statisticians call r2 the coefficient of determination.

EXAMPLE  Retail Hiring

You are the manager of a large department store. Over the years, you’ve found a strong correlation between your September sales and the number of employees you’ll need to hire for peak efficiency during the holiday season; the correlation coefficient is 0.950. This year your September sales are fairly strong. Should you start advertising for help based on the best-fit line?

SOLUTION In this case, we find that r2 = 0.9502 = 0.903, which means that 90% of the variation in the number of peak employees can be accounted for by a linear relationship with September sales. That leaves only 10% of the variation in the number of peak employees unaccounted for. Because 90% is so high, we conclude that the best-fit line accounts for the data quite well, so it seems reasonable to use it to predict the number of employees you’ll need for this year’s holiday season.

EXAMPLE  Voter Turnout and Unemployment

Political scientists are interested in knowing what factors affect voter turnout in elections. One such factor is the unemployment rate. Data collected in presidential election years since 1964 show a very weak negative correlation between voter turnout and the unemployment rate, with a correlation coefficient of about r = –0.1 (Figure 7.21). Based on this correlation, should we use the unemployment rate to predict voter turnout in the next presidential election?

Figure 7.21 Data on voter turnout and unemployment, 1964–2008.

Source: U.S. Bureau of Labor Statistics.

SOLUTION The square of the correlation coefficient is r2 = (–0.1)2 = 0.01, which means that only about 1% of the variation in the data is accounted for by the best-fit line. Nearly all of the variation in the data must therefore be explained by other factors. We conclude that unemployment is not a reliable predictor of voter turnout.

Multiple Regression

If you’ve ever purchased a diamond, you might have been surprised that we found such a weak correlation between color and price in Figure 7.2. Surely a diamond cannot be very valuable if it has poor color quality. Perhaps color helps to explain why the correlation between weight and price is not perfect. For example, maybe differences in color explain why two diamonds with the same weight can have different prices. To check this idea, it would be nice to look for a correlation between the price and some combination of weight and color together.

  • All who drink his remedy recover in a short time, except those whom it does not help, who all die. Therefore, it is obvious that it fails only in incurable cases. —Galen, Roman “doctor”

TIME OUT TO THINK

Check this idea in Table 7.1. Notice, for example, that Diamonds 4 and 5 have nearly identical weights, but Diamond 4 costs only $4,299 while Diamond 5 costs $9,589. Can differences in their color explain the different prices? Study other examples in Table 7.1 in which two diamonds have similar weights but different prices. Overall, do you think that the correlation with price would be stronger if we used weight and color together instead of either one alone? Explain.

There is a method for investigating a correlation between one variable (such as price) and a combination of two or more other variables (such as weight and color). The technique is called multiple regression, and it essentially allows us to find a best-fit equation that relates three or more variables (instead of just two). Because it involves more than two variables, we cannot make simple diagrams to show best-fit equations for multiple regression. However, it is still possible to calculate a measure of how well the data fit a linear equation. The most common measure in multiple regression is the coefficient of determination, denoted R2. It tells us how much of the scatter in the data is accounted for by the best-fit equation. If R2 is close to 1, the best-fit equation should be very useful for making predictions within the range of the data values. If R2 is close to zero, then predictions with the best-fit equation are essentially useless.

Definition

The use of multiple regression allows the calculation of a best-fit equation that represents the best fit between one variable (such as price) and a combination of two or more other variables (such as weight and color). The coefficient of determination, R2, tells us the proportion of the scatter in the data accounted for by the best-fit equation.

In this text, we will not describe methods for finding best-fit equations by multiple regression. However, you can use the value of R2 to interpret results from multiple regression. For example, the correlation between price and weight and color together results in a value of R2 = 0.79. This is somewhat higher than the r2 = 0.60 that we found for the correlation between price and weight alone. Statisticians who study diamond pricing know that they can get stronger correlations by including additional variables in the multiple regression (such as depth, table, and clarity). Given the billions of dollars spent annually on diamonds, you can be sure that statisticians play prominent roles in helping diamond dealers realize the largest possible profits.
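Although this text does not cover the mechanics of multiple regression, a short sketch helps make R2 concrete. Here, hypothetical prices (illustrative numbers only, not the book's diamond data) are fit to weight and a numeric color grade together using a least-squares solve:

```python
import numpy as np

# Hypothetical data: price explained by weight and a numeric color grade.
weight = np.array([0.5, 0.8, 1.0, 1.0, 1.3, 1.5, 1.5, 1.8])
color = np.array([3, 5, 2, 6, 4, 3, 7, 5])  # lower number = better color
price = np.array([1600, 3400, 4800, 3900, 6600, 8900, 7400, 10500])

# Design matrix with a constant column: price ~ b0 + b1*weight + b2*color
X = np.column_stack([np.ones_like(weight), weight, color])
coeffs, *_ = np.linalg.lstsq(X, price, rcond=None)

# Coefficient of determination R^2 for the multiple regression
fitted = X @ coeffs
sse = np.sum((price - fitted) ** 2)        # scatter left unexplained
sst = np.sum((price - price.mean()) ** 2)  # total scatter
R_squared = 1 - sse / sst
print(round(R_squared, 3))
```

Adding a second explanatory variable can only leave R2 the same or raise it relative to the single-variable r2, which is why statisticians keep adding relevant variables such as depth, table, and clarity.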

BY THE WAY

One study of alumni donations found that, in developing a multiple regression equation, one should include these variables: income, age, marital status, whether the donor belonged to a fraternity or sorority, whether the donor is active in alumni affairs, the donor’s distance from the college, and the nation’s unemployment rate, used as a measure of the economy (Bruggink and Siddiqui, “An Econometric Model of Alumni Giving: A Case Study for a Liberal Arts College,” The American Economist, Vol. 39, No. 2).

EXAMPLE  Alumni Contributions

You’ve been hired by your college’s alumni association to research how past contributions were associated with alumni income and years that have passed since graduation. It is found that R2 = 0.36. What does that result tell us?

SOLUTION With R2 = 0.36, we conclude that 36% of the variation in past contributions can be explained by the variation in alumni income and years since graduation. It follows that 64% of the variation in past contributions can be explained by factors other than alumni income level and years since graduation. Because such a large proportion of the variation can be explained by other factors, it would make sense to try to identify any other factors that might have a strong effect on past contributions.

Finding Equations for Best-Fit Lines (Optional Section)

The mathematical technique for finding the equation of a best-fit line is based on the following basic ideas. If we draw any line on a scatterplot, we can measure the vertical distance between each data point and that line. One measure of how well the line fits the data is the sum of the squares of these vertical distances. A large sum means that the vertical distances of data points from the line are fairly large and hence the line is not a very good fit. A small sum means the data points lie close to the line and the fit is good. Of all possible lines, the best-fit line is the line that minimizes the sum of the squares of the vertical distances. Because of this property, the best-fit line is sometimes called the least squares line.
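A small numerical check of this minimizing property, using invented data: the least-squares line's sum of squared vertical distances is no larger than that of any other candidate line we try.

```python
import numpy as np

# Invented data points for the demonstration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.3, 9.8])

def sum_sq(slope, intercept):
    """Sum of squared vertical distances from the data to the line."""
    return np.sum((y - (slope * x + intercept)) ** 2)

# The least-squares (best-fit) line minimizes this sum over all lines.
best_slope, best_intercept = np.polyfit(x, y, 1)

# Any other candidate line does at least as badly:
print(sum_sq(best_slope, best_intercept) <= sum_sq(2.0, 0.5))
print(sum_sq(best_slope, best_intercept) <= sum_sq(1.5, 1.0))
```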

You may recall that the equation of any straight line can be written in the general form

y = mx + b

where m is the slope of the line and b is the y-intercept of the line. The formulas for the slope and y-intercept of the best-fit line are as follows:

slope: m = r × (sy / sx)    y-intercept: b = ȳ − m × x̄

In the above expressions, r is the correlation coefficient, sx denotes the standard deviation of the x values (or the values of the first variable), sy denotes the standard deviation of the y values, x̄ represents the mean of the values of the variable x, and ȳ represents the mean of the values of the variable y. Because these formulas are tedious with manual calculations, we usually use a calculator or computer to find the slope and y-intercept of best-fit lines. Statistical software packages and some calculators, such as the TI-83/84 Plus family of calculators, are designed to automatically generate the equation of a best-fit line.

When software or a calculator is used to find the slope and intercept of the best-fit line, results are commonly expressed in the format y = b0 + b1x, where b0 is the intercept and b1 is the slope, so be careful to correctly identify those two values.
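The slope and intercept formulas can be checked against a software fit. This sketch (invented data) computes m and b from r, the standard deviations, and the means, and confirms that they match numpy's least-squares result:

```python
import numpy as np

# Invented paired data for the check.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.5, 4.0, 6.5, 8.0])

r = np.corrcoef(x, y)[0, 1]
sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)  # sample standard deviations

m = r * sy / sx              # slope of the best-fit line
b = y.mean() - m * x.mean()  # y-intercept of the best-fit line

# Cross-check against numpy's least-squares fit:
slope, intercept = np.polyfit(x, y, 1)
print(np.isclose(m, slope), np.isclose(b, intercept))
```

The same formulas hold whether sample or population standard deviations are used, as long as the choice is consistent, because only their ratio enters the slope.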

Section 7.3 Exercises

Statistical Literacy and Critical Thinking

1.

Best-Fit Line. What is a best-fit line (also called a regression line)? How is a best-fit line useful?

2.

r2. For a study involving paired sample data, it is found that r = –0.4. What is the value of r2? In general, what is r2 called, what does it measure, and how can it be interpreted? That is, what does its value tell us about the variables?

3.

Regression. An investigator has data consisting of heights of daughters and the heights of the corresponding mothers and fathers. She wants to analyze the data to see the effect that the height of the mother and the height of the father has on the height of the daughter. Should she use a (linear) regression or multiple regression? What is the basic difference between (linear) regression and multiple regression?

4.

R2. Using data described in Exercise 3, it is found that R2 = 0.68. Interpret that value. That is, what does that value tell us about the data?

Does It Make Sense? For Exercises 5–8, decide whether the statement makes sense (or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of these statements have definitive answers, so your explanation is more important than your chosen answer.

5.

r2 Value. A value of r2 = 1 is obtained from a sample of paired data with one variable representing the amount of gas (gallons) purchased and the total cost of the gas.

6.

r2 Value. A value of r2 = –0.040 is obtained from a sample of men, with each pair of data consisting of the height in inches and the SAT score for one man.

7.

Height and Weight. Using data from the National Health Survey, the equation of the best-fit line for women’s heights and weights is obtained, and it shows that a woman 120 inches tall is predicted to weigh 430 pounds.

8.

Old Faithful. Using paired sample data consisting of the duration time (in seconds) of eruptions of Old Faithful geyser and the time interval (in minutes) after the eruption, a value of r2 = 0.926 is calculated, indicating that about 93% of the variation in the interval after eruption can be explained by the relationship between those two variables as described by the best-fit line.

Concepts and Applications

Best-Fit Lines on Scatterplots. For Exercises 9–12, do the following.

  • a. Insert a best-fit line in the given scatterplot.

  • b. Estimate or compute r and r2. Based on your value for r2, determine how much of the variation in the variable can be accounted for by the best-fit line.

  • c. Briefly discuss whether you could make valid predictions from this best-fit line.

9.

Use the scatterplot for color and price in Figure 7.2.

10.

Use the scatterplot for life expectancy and infant mortality in Figure 7.4.

11.

Use the scatterplot for number of farms and size of farms in Figure 7.5.

12.

Use both scatterplots for actual and predicted temperature in Figure 7.6.

Best-Fit Lines. Exercises 13–20 refer to the tables in the Section 7.1 Exercises. In each case, do the following.

  • a. Construct a scatterplot and, based on visual inspection, draw the best-fit line by eye.

  • b. Briefly discuss the strength of the correlation. Estimate or compute r and r2. Based on your value for r2, identify how much of the variation in the variable can be accounted for by the best-fit line.

  • c. Identify any outliers on the scatterplot and discuss their effects on the strength of the correlation and on the best-fit line.

  • d. For this case, do you believe that the best-fit line gives reliable predictions outside the range of the data on the scatterplot? Explain.

13.

Use the data in Exercise 19 of Section 7.1.

14.

Use the data in Exercise 20 of Section 7.1.

15.

Use the data in Exercise 21 of Section 7.1.

16.

Use the data in Exercise 22 of Section 7.1.

17.

Use the data in Exercise 23 of Section 7.1. To locate the points, use the midpoint of each income category; use a value of $25,000 for the category “less than $30,000,” and use a value of $70,000 for the category “more than $60,000.”

18.

Use the data in Exercise 24 of Section 7.1.

19.

Use the data in Exercise 25 of Section 7.1.

20.

Use the data in Exercise 26 of Section 7.1.

PROJECTS FOR THE INTERNET & BEYOND

21.

Lead Poisoning. Research lead poisoning, its sources, and its effects. Discuss the correlations that have helped researchers understand lead poisoning. Discuss efforts to prevent it.

22.

Asbestos. Research asbestos, its sources, and its effects. Discuss the correlations that have helped researchers understand adverse health effects from asbestos exposure. Discuss efforts to prevent those adverse health effects.

23.

Worldwide Population Indicators. The following table gives five population indicators for eleven selected countries. Study these data and try to identify possible correlations. Doing additional research if necessary, discuss the possible correlations you have found, speculate on the reasons for the correlations, and discuss whether they suggest a causal relationship. Birth and death rates are per 1,000 population; fertility rate is per woman.

Country         Birth rate   Death rate   Life expectancy   Percent urban   Fertility rate
Afghanistan         50           22              43                20             6.9
Argentina           21            –              72                88             2.6
Australia           15            –              78                85             1.9
Canada              14            –              78                77             1.6
Egypt               29            –              64                45             3.4
El Salvador         30            –              68                45             3.1
France              13            –              78                73             1.6
Israel              21            –              77                91             2.8
Japan               10            –              79                78             1.5
Laos                45           15              51                22             6.7
United States       16            –              76                76             2.0

Source: The New York Times Almanac.

IN THE NEWS

24.

Predictions in the News. Find a recent news report in which a correlation is used to make a prediction. Evaluate the validity of the prediction, considering all of the cautions described in this section. Overall, do you think the prediction is valid? Why or why not?

25.

Best-Fit Line in the News. Although scatterplots are rare in the news, they are not unheard of. Find a scatterplot of any kind in a news article (recent or not). Draw a best-fit line by eye. Discuss what predictions, if any, can be made from your best-fit line.

26.

Your Own Multiple Regression. Come up with an example from your own life or work in which a multiple regression analysis might reveal important trends. Without actually doing any analysis, describe in words what you would look for through the multiple regression and how the answers might be useful.

7.4 THE SEARCH FOR CAUSALITY

A correlation may suggest causality, but by itself a correlation never establishes causality. Much more evidence is required to establish that one factor causes another. Earlier, we found that a correlation between two variables may be the result of either (1) coincidence, (2) a common underlying cause, or (3) one variable actually having a direct influence on the other. The process of establishing causality is essentially a process of ruling out the first two explanations.

In principle, we can rule out the first two explanations by conducting experiments:

  • • We can rule out coincidence by repeating the experiment many times (or by using a large number of subjects in the experiment). Because coincidences occur randomly, the same coincidence is unlikely to occur in repeated trials of an experiment.

  • • We can rule out a common underlying cause by controlling and randomizing the experiment to eliminate the effects of confounding variables (see Section 1.3). If the controls rule out confounding variables, any remaining effects must be caused by the variables of interest.

Unfortunately, these ideas are often difficult to put into practice. In the case of ruling out coincidence, it may be too time-consuming or expensive to repeat an experiment a sufficient number of times. To rule out a common underlying cause, the experiment must control for everything except the variables of interest, and this is often impossible. Moreover, there are many cases in which experiments are impractical or unethical, so we can gather only observational data. Because observational studies cannot definitively establish causality, we must find other ways of trying to establish causality.
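A quick simulation illustrates why repetition rules out coincidence. With small samples, two variables that are unrelated by construction often show impressively large correlations; with large samples, those coincidental correlations shrink toward zero. (The data here are purely synthetic random numbers.)

```python
import numpy as np

rng = np.random.default_rng(42)

def max_abs_corr(n, trials=200):
    """Largest |r| seen among `trials` pairs of unrelated samples of size n."""
    best = 0.0
    for _ in range(trials):
        x = rng.standard_normal(n)
        y = rng.standard_normal(n)  # independent of x by construction
        best = max(best, abs(np.corrcoef(x, y)[0, 1]))
    return best

# Small samples routinely produce "impressive" coincidental correlations;
# large samples do not.
print(max_abs_corr(5), max_abs_corr(500))
```

The first printed value is typically far larger than the second, which is exactly why the same coincidence is unlikely to survive repeated trials or large numbers of subjects.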

Establishing Causality

Suppose you have discovered a correlation and suspect causality. How can you test your suspicion? Let’s return to the issue of smoking and lung cancer. The strong correlation between smoking and lung cancer did not by itself prove that smoking causes lung cancer. In principle, we could have looked for proof with a controlled experiment. But such an experiment would be unethical because it would require forcing a group of randomly selected people to smoke cigarettes. So how was smoking established as a cause of lung cancer?

The answer involves several lines of evidence. First, researchers found correlations between smoking and lung cancer among many groups of people: women, men, and people of different races and cultures. Second, among groups of people that seemed otherwise identical, lung cancer was found to be more rare in nonsmokers. Third, people who smoked more and for longer periods of time were found to have higher rates of lung cancer. Fourth, when researchers accounted for other potential causes of lung cancer (such as exposure to radon gas or asbestos), they found that almost all the remaining lung cancer cases occurred among smokers (or people exposed to second-hand smoke).

BY THE WAY

Statistical methods cannot prove that smoking causes cancer, but statistical methods can be used to identify an association, and physical proof of causation can then be sought by researchers. Dr. David Sidransky of Johns Hopkins University and other researchers found a direct physical link that involves mutations of a specific gene among smokers. Molecular analysis of genetic changes allows researchers to determine whether cigarette smoking is the cause of a cancer. (See “Association Between Cigarette Smoking and Mutation of the p53 Gene in Squamous-Cell Carcinoma of the Head and Neck,” by Brennan, Boyle et al., New England Journal of Medicine, Vol 332, No. 11.)

These four lines of evidence made a strong case, but still did not rule out the possibility that some other factor, such as genetics, predisposes people both to smoking and to lung cancer. However, two additional lines of evidence made this possibility highly unlikely. One line of evidence came from animal experiments. In controlled experiments, animals were divided into randomly chosen treatment and control groups. The experiments still found a correlation between inhalation of cigarette smoke and lung cancer, which seems to rule out a genetic factor, at least in the animals. The final line of evidence came from biologists studying small samples of human lung tissue. The biologists discovered the basic process by which ingredients in cigarette smoke create cancer-causing mutations. This process does not appear to depend in any way on specific genetic factors, making it all but certain that lung cancer is caused by smoking and not by any preexisting genetic factor. The fact that second-hand smoke exposure is also associated with some cases of lung cancer further argues against a genetic factor (since second-hand smoke affects non-smokers) but is consistent with the idea that ingredients in cigarette smoke create cancer-causing mutations.

The following box summarizes these ideas about establishing causality. Generally speaking, the case for causality is stronger when more of these guidelines are met.

Guidelines for Establishing Causality

If you suspect that a particular variable (the suspected cause) is causing some effect:

  • 1. Look for situations in which the effect is correlated with the suspected cause even while other factors vary.

  • 2. Among groups that differ only in the presence or absence of the suspected cause, check that the effect is similarly present or absent.

  • 3. Look for evidence that larger amounts of the suspected cause produce larger amounts of the effect.

  • 4. If the effect might be produced by other potential causes (besides your suspected cause), make sure that the effect still remains after accounting for these other potential causes.

  • 5. If possible, test the suspected cause with an experiment. If the experiment cannot be performed with humans for ethical reasons, consider doing the experiment with animals, cell cultures, or computer models.

  • 6. Try to determine the physical mechanism by which the suspected cause produces the effect.

BY THE WAY

The first four guidelines to the left are called Mill’s methods after John Stuart Mill (1806–1873). Mill was a leading scholar of his time and an early advocate of women’s right to vote. In philosophy, the four methods are called, respectively, the methods of agreement, difference, concomitant variation, and residues.

TIME OUT TO THINK

There’s a great deal of controversy concerning whether animal experiments are ethical. What is your opinion of animal experiments? Defend your opinion.

CASE STUDY Air Bags and Children

By the mid-1990s, passenger-side air bags had become commonplace in cars. Statistical studies showed that the air bags saved many lives in moderate- to high-speed collisions. But a disturbing pattern also appeared. In at least some cases, young children, especially infants and toddlers in child car seats, were killed by air bags in low-speed collisions.

At first, many safety advocates found it difficult to believe that air bags could be the cause of the deaths. But the observational evidence became stronger, meeting the first four guidelines for establishing causality. For example, the greater risk to infants in child car seats fit Guideline 3, because it indicated that being closer to the air bags increased the risk of death. (A child car seat sits on top of the built-in seat, thereby putting a child closer to the air bags than the child would be otherwise.)

To seal the case, safety experts undertook experiments using dummies. They found that children, because of their small size, often sit where they could be easily hurt by the explosive opening of an air bag. The experiments also showed that an air bag could impact a child car seat hard enough to cause death, thereby revealing the physical mechanism by which the deaths occurred.

BY THE WAY

Based on these studies, the government now recommends that child car seats never be used on the front seat and that children under age 12 (or under 4 feet, 9 inches tall) sit in the back seat whenever possible.

CASE STUDY Cardiac Bypass Surgery

Cardiac bypass surgery is performed on people who have severe blockage of arteries that supply the heart with blood (the coronary arteries). If blood flow stops in these arteries, a patient may suffer a heart attack and die. Bypass surgery essentially involves grafting new blood vessels onto the blocked arteries so that blood can flow around the blocked areas. By the mid-1980s, many doctors were convinced that the surgery was prolonging the lives of their patients.

However, a few early retrospective studies turned up a disconcerting result: Statistically, the surgery appeared to be making little difference. In other words, patients who had the surgery seemed to be faring no better on average than similar patients who did not have it. If this were true, it meant that the surgery was not worth the pain, risk, and expense involved.

Because these results flew in the face of what many doctors thought they had observed in their own patients, researchers began to dig more deeply. Soon, they found confounding variables that had not been accounted for in the early studies. For example, they found that patients getting the surgery tended to have more severe blockage of their arteries, apparently because doctors recommended the surgery more strongly to these patients. Because these patients were in worse shape to begin with, a comparison of longevity between them and other patients was not really valid.

More important, the research soon turned up substantial differences in the results among patients who had the surgery in different hospitals. In particular, a few hospitals were achieving remarkable success with bypass surgery and their patients fared far better than patients who did not have the surgery or had it at other hospitals. Clearly, the surgical techniques used by doctors at the successful hospitals were somehow different and superior. Doctors studied the differences to ensure that all doctors could be trained in the superior techniques.

In summary, the confounding variables of amount of blockage and surgical technique had prevented the early studies from finding a real correlation between cardiac bypass surgery and prolonged life. Today, cardiac bypass surgery is accepted as a cause of prolonged life in patients with blocked coronary arteries. It is now among the most common types of surgery, and it typically adds decades to the lives of the patients who undergo it.

BY THE WAY

As you might guess, it is also difficult to define reasonable doubt. For criminal trials, the Supreme Court endorsed this guidance from Justice Ruth Bader Ginsburg: “Proof beyond a reasonable doubt is proof that leaves you firmly convinced of the defendant’s guilt. There are very few things in this world that we know with absolute certainty, and in criminal cases the law does not require proof that overcomes every possible doubt. If, based on your consideration of the evidence, you are firmly convinced that the defendant is guilty of the crime charged, you must find him guilty. If on the other hand, you think there is a real possibility that he is not guilty, you must give him the benefit of the doubt and find him not guilty.”

Hidden Causality

So far we have discussed how to establish causality after first discovering a correlation. However, sometimes a correlation—or the lack of a correlation—can hide an underlying causality. As the next case study shows, such hidden causality often occurs because of confounding variables.

Confidence in Causality

The six guidelines offer us a way to examine the strength of a case for causality, but we often must make decisions before a case of causality is fully established. Consider, for example, the well-known case of global warming. It may never be possible to prove beyond all doubt that the burning of fossil fuels is causing global warming (see the Focus on Environment at the end of this chapter), so we must decide whether to act while we still face some uncertainty about causation. How much must we know before we decide to act?

In other areas of statistics, accepted techniques help us deal with this type of uncertainty by allowing us to calculate a numerical level of confidence or significance. But there are no accepted ways to assign such numbers to the uncertainty that comes with questions of causality. Fortunately, another area of study has dealt with practical problems of causality for hundreds of years: our legal system. You may be familiar with the three broad ways of expressing a legal level of confidence shown below.

Broad Levels of Confidence in Causality

Possible cause: We have discovered a correlation, but cannot yet determine whether the correlation implies causality. In the legal system, possible cause (such as thinking that a particular suspect possibly committed a particular crime) is often the reason for starting an investigation.

Probable cause: We have good reason to suspect that the correlation involves cause, perhaps because some of the guidelines for establishing causality are satisfied. In the legal system, probable cause is the general standard for getting a judge to grant a warrant for a search or wiretap.

Cause beyond reasonable doubt: We have found a physical model that is so successful in explaining how one thing causes another that it seems unreasonable to doubt the causality. In the legal system, cause beyond reasonable doubt is the usual standard for convictions and generally demands that the prosecution have shown how and why (essentially the physical model) the suspect committed the crime. Note that beyond reasonable doubt does not mean beyond all doubt.

While these broad levels remain fairly vague, they give us at least some common language for discussing confidence in causality. If you study law, you will learn much more about the subtleties of interpreting these terms. However, because statistics has little to say about them, we will not discuss them much further in this text.

Section 7.4 Exercises

Statistical Literacy and Critical Thinking

1.

Correlation. Identify three different explanations for the presence of a correlation between two variables.

2.

Role of Experiments. In theory, we can use experiments to rule out two of the three different explanations for the presence of a correlation between two variables. Which of the three explanations do we not want to rule out? Why would we not want to rule it out?

3.

Confounding Variable. What is a confounding variable? How can a confounding variable create a situation in which an underlying causality is hidden?

4.

Correlation and Causality. What is the difference between finding a correlation between two variables and establishing causality between two variables?

Does It Make Sense? For Exercises 5–8, decide whether the statement makes sense (or is clearly true) or does not make sense (or is clearly false). Explain clearly; not all of these statements have definitive answers, so your explanation is more important than your chosen answer.

5.

Value of r. When analyzing paired sample data, the value of the correlation coefficient r allows us to determine whether one variable has a direct causal effect on the other.

6.

Value of r. A variable can have a direct causal effect on another variable only if the correlation coefficient is given by r = 1.

7.

Smoking and Cotinine. A study showed that there is a correlation between exposure to second-hand smoke and the measured amount of cotinine in the body. We can establish that exposure to second-hand smoke is a cause of cotinine if we can rule out coincidence as a possible explanation of the correlation.

8.

Smoking and Cotinine. When the body absorbs nicotine, it converts it into cotinine. Experiments have ruled out coincidence as an explanation for a correlation between exposure to second-hand smoke and cotinine in the body. The only possible explanations for that correlation are that the exposure causes cotinine or that there is some other underlying cause.

Concepts and Applications

Physical Models. For Exercises 9–12, determine whether the stated causal connection is valid. If the causal connection appears to be valid, provide an explanation.

9.

Test Grades. Test grades are affected by the amount of time and effort spent studying and preparing for the test.

10.

Magnet Treatment. Heart disease can be cured by wearing a magnetic bracelet on your wrist.

11.

Drinking and Reaction Time. Drinking greater amounts of alcohol decreases a person’s reaction time.

12.

IQ and Pulse Rate. People with higher resting pulse rates (beats per minute) tend to have higher IQ scores.

13.

Identifying Causes: Headaches. You are trying to identify the cause of late-afternoon headaches that plague you several days each week. For each of the following tests and observations, explain which of the six guidelines for establishing causality you used and what you concluded. Then summarize your overall conclusion based on all the observations.

  • a. The headaches occur only on days that you go to work.

  • b. If you stop drinking Coke at lunch, the headaches persist.

  • c. In the summer, the headaches occur less frequently if you open the windows of your office slightly. They occur even less often if you open the windows of your office fully.

14.

Smoking and Lung Cancer. There is a strong correlation between tobacco smoking and incidence of lung cancer, and most physicians believe that tobacco smoking causes lung cancer. Yet, not everyone who smokes gets lung cancer. Briefly describe how smoking could cause cancer when not all smokers get cancer.

15.

Other Lung Cancer Causes. Several things besides smoking have been shown to be probabilistic causal factors in lung cancer. For example, exposure to asbestos and exposure to radon gas, both of which are found in many homes, can cause lung cancer. Suppose that you meet a person who lives in a home that has a high radon level and insulation that contains asbestos. The person tells you, “I smoke, too, because I figure I’m doomed to lung cancer anyway.” What would you say in response? Explain.

16.

Longevity of Orchestra Conductors. A famous study in Forum on Medicine concluded that the mean lifetime of conductors of major orchestras was 73.4 years, about 5 years longer than that of all American males at the time. The author claimed that a life of music causes a longer life. Evaluate the claim of causality and propose other explanations for the longer life expectancy of conductors.

17.

Older Moms. A study reported in Nature claims that women who give birth later in life tend to live longer. Of the 78 women who were at least 100 years old at the time of the study, 19% had given birth after their 40th birthday. Of the 54 women who were 73 years old at the time of the study, only 5.5% had given birth after their 40th birthday. A researcher stated that “if your reproductive system is aging slowly enough that you can have a child in your 40s, it probably bodes well for the fact that the rest of you is aging slowly too.” Was this an observational study or an experiment? Does the study suggest that later child bearing causes longer lifetimes or that later child bearing reflects an underlying cause? Comment on how persuasive you find the conclusions of the report.

18.

High-Voltage Power Lines. Suppose that people living near a high-voltage power line have a higher incidence of cancer than people living farther from the power line. Can you conclude that the high-voltage power line is the cause of the elevated cancer rate? If not, what other explanations might there be for it? What other types of research would you like to see before you concluded that high-voltage power lines cause cancer?

19.

Gun Control. Those who favor gun control often point to a positive correlation between the availability of handguns and murder rates to support their position that gun control would save lives. Does this correlation, by itself, indicate that handgun availability causes a higher murder rate? Suggest some other factors that might support or weaken this conclusion.

20.

Vasectomies and Prostate Cancer. An article titled “Does Vasectomy Cause Prostate Cancer?” (Chance, Vol. 10, No. 1) reports on several large studies that found an increased risk of prostate cancer among men with vasectomies. In the absence of a direct cause, several researchers attribute the correlation to detection bias, in which men with vasectomies are more likely to visit the doctor and thereby are more likely to have any prostate cancer found by the doctor. Briefly explain how this detection bias could affect the claim that vasectomies cause prostate cancer.

PROJECTS FOR THE INTERNET & BEYOND

21.

Air Bags and Children. Starting from the Web site of the National Highway Traffic Safety Administration, research the latest studies on the safety of air bags, especially with regard to children. Write a short report summarizing your findings and offering recommendations for improving child safety in cars.

22.

Dietary Fiber and Coronary Heart Disease. In the largest study of how dietary fiber prevents coronary heart disease (CHD) in women (Journal of the American Medical Association, Vol. 281, No. 21), researchers detected a reduced risk of CHD among women who have a high-fiber diet. Find the research paper, summarize its findings, and discuss whether a cause for the correlation is proposed.

23.

Coffee and Gallstones. Writing in the Journal of the American Medical Association (Vol. 281, No. 22), researchers reported finding a negative correlation between incidence of gallstone disease and coffee consumption in men. Find the research paper, summarize its findings, and discuss whether a cause for the correlation is proposed.

24.

Alcohol and Stroke. Researchers reported in the Journal of the American Medical Association (Vol. 281, No. 1) that moderate alcohol consumption is correlated with a decreased risk of stroke in people 40 years of age and older. (Heavy consumption of alcohol was correlated with deleterious effects.) Find the research paper, summarize its findings, and discuss whether a cause for the correlation is proposed.

25.

Tobacco Lawsuits. Tobacco companies have been the subject of many lawsuits related to the dangers of smoking. Research one recent lawsuit. What were the plaintiffs trying to prove? What statistical evidence did they use? How well do you think they established causality? Did they win? Summarize your findings in one to two pages.

IN THE NEWS

26.

Causation in the News. Find a recent news report in which a statistical study led to a conclusion of causation. Describe the study and the claimed causation. Do you think the claim of causation is legitimate? Explain.

27.

Legal Causation. Find a news report concerning an ongoing legal case, either civil or criminal, in which establishing causality is important to the outcome. Briefly describe the issue of causation in the case and how the ability to establish or refute causality will influence the outcome of the case.

CHAPTER REVIEW EXERCISES

For Exercises 1–3, refer to the combined city–highway fuel economy ratings (mi/gal) for different cars. The old ratings are based on tests used before 2008 and the new ratings are based on tests that went into effect in 2008.

Old  16  27  17  33  28  24  18  22  20  29  21
New  15  24  15  29  25  22  16  20  18  26  19

1.

Construct a scatterplot. What does the result suggest?

2.

Estimate the value of the correlation coefficient. What does that value suggest?

3.

Can we conclude that the old ratings have a direct causal effect on the new ratings? Explain briefly.
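For Exercises 1–3 above, an estimate of the correlation coefficient can be checked numerically. Here is a minimal Python sketch (not part of the original exercises) that computes the Pearson correlation coefficient directly from its definition, using the paired ratings in the table:

```python
import math

# Combined city-highway ratings (mi/gal) from the table above
old = [16, 27, 17, 33, 28, 24, 18, 22, 20, 29, 21]
new = [15, 24, 15, 29, 25, 22, 16, 20, 18, 26, 19]

def pearson_r(x, y):
    """Pearson correlation coefficient for paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(old, new)
print(round(r, 3))  # very close to 1 (about 0.998)
```

Because each new rating tracks its old rating almost exactly, r comes out very close to 1, consistent with a scatterplot whose points lie nearly on a straight line. As Exercise 3 emphasizes, this near-perfect correlation still does not mean one rating causes the other; both reflect the same underlying cars.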

4.

In a study of casino size (square feet) and revenue, the value of r = 0.445 is obtained. Find the value of r2. What does that value tell us?
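The computation in Exercise 4 is direct arithmetic: the value asked for is simply the square of the correlation coefficient (the coefficient of determination). A one-line sketch:

```python
# Coefficient of determination from a correlation coefficient.
# With r = 0.445, as in the casino study above:
r = 0.445
r_squared = r ** 2
print(round(r_squared, 3))  # 0.198
```

That is, about 19.8% of the variation in revenue can be accounted for by the best-fit line with casino size.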

5.

In a study of global warming, assume that we have found a strong positive correlation between carbon dioxide concentration and temperature. Identify three possible explanations for this correlation.

6.

For 10 pairs of sample data, the correlation coefficient is computed to be r = –1. What do you know about the scatterplot?

7.

In a study of randomly selected subjects, it is found that there is a strong correlation between household income and number of visits to dentists. Is it valid to conclude that higher incomes cause people to visit dentists more often? Is it valid to conclude that more visits to dentists cause people to have higher incomes? How might the correlation be explained?

8.

You are considering the most expensive purchase that you are likely to make: the purchase of a home. Identify at least five different variables that are likely to affect the actual value of a home. Among the variables that you have identified, which single variable is likely to have the greatest influence on the value of the home? Identify a variable that is likely to have little or no effect on the value of a home.

9.

A researcher collects paired sample data and computes the value of the linear correlation coefficient to be 0. Based on that value, he concludes that there is no relationship between the two variables. What is wrong with this conclusion?

10.

Examine the scatterplot in Figure 7.22 and estimate the value of the correlation coefficient.

Figure 7.22

CHAPTER QUIZ

1.

Fill in the blanks: Every possible correlation coefficient must lie between the values of _____ and _____.

2.

Which of the following are likely to have a correlation?

  • a. SAT scores and weights of randomly selected subjects

  • b. Reaction times and IQ scores of randomly selected subjects

  • c. Height and arm span of randomly selected subjects

  • d. Proportion of seats filled and amount of airline profit for randomly selected flights

  • e. Value of cars owned and annual income of randomly selected car owners

3.

For a collection of paired sample data, the correlation coefficient is found to be –0.099. Which of the following statements best describes the relationship between the two variables?

  • a. There is no correlation.

  • b. There is a weak correlation.

  • c. There is a strong correlation.

  • d. One of the variables is the direct cause of the other variable.

  • e. Neither of the variables is the direct cause of the other variable.

4.

Estimate the correlation coefficient for the data in Figure 7.23.

Figure 7.23

5.

Refer again to the scatterplot in Figure 7.23. Does there appear to be a significant correlation between the two variables?

In Exercises 6–10, determine whether the given statement is true or false.

6.

If r = 0.200, then r2 = 0.040 and 4% of the plotted points lie on the line of best fit.

7.

If r = 1 or r = –1, then all points in the scatterplot lie directly on the line of best fit.

8.

If the value of the correlation coefficient is negative, the value of r2 must also be negative.

9.

A scatterplot is a graph in which the points are scattered throughout, without any noticeable pattern.

10.

If the line of best fit is inserted in a scatterplot, it must pass through every point in the graph.

FOCUS ON EDUCATION

What Helps Children Learn to Read?

Everyone has an idea about how best to teach reading to children. Some advocate a phonetic approach, teaching students to “sound out” words. Some advocate a “whole language” approach, teaching students to recognize words from their context. Others advocate a combination of these approaches, or something else entirely. These differing ideas would be unimportant if they were merely opinions. But in a nation that spends more than a trillion dollars per year on education, differing approaches to teaching reading involve major political confrontations among groups with different special interests.

The huge stakes involved in teaching reading demand statistics to measure the effectiveness of various approaches. Some of the most important educational statistics are those that come from the National Assessment of Educational Progress (NAEP), often known more simply as “the Nation’s Report Card.” The NAEP is an ongoing survey of student achievement conducted by a government agency, the National Center for Education Statistics, with authorization and funding from the U.S. Congress.

The NAEP uses stratified random sampling (see Chapter 1) to choose representative samples of fourth-, eighth-, and 12th-grade students of varying ethnicity, family income, type of school attended, and so on. Students chosen for the samples are given tests designed to measure their academic achievement in a particular subject area, such as reading, mathematics, or history. Samples are chosen on both state and national levels. Overall, a few thousand students are chosen for each test. Results from NAEP tests inevitably make the news, with articles touting improvements or decrying drops in test scores.

But what really causes improvement in reading performance? Researchers begin by searching for correlations between reading performance and other factors. Sometimes the correlations are clear, but offer no direction for improving reading. For example, parental education is clearly correlated with reading achievement—children with more highly educated parents tend to read more proficiently than those with uneducated parents—but this correlation doesn’t offer much guidance for the schools because children do not choose their parents. Other times the correlations may suggest ways to improve reading. For example, students who report reading more pages daily in school and for homework tend to score higher than students who read fewer pages. This suggests that schools should assign more reading.

Of course, the high stakes involved in education make education statistics particularly prone to misinterpretation or misuse. Consider just a few of the problems that make the NAEP reading tests difficult to interpret:

  • They are standardized tests that are mostly multiple choice. Some people believe that such tests are inevitably biased and cannot truly measure reading ability.

  • Because the tests generally don’t affect students’ grades, some students may not take the tests seriously, in which case test results may not reflect actual reading ability.

  • State-by-state comparisons may not be valid if the makeup of the student population (particularly in its fraction of students for whom English is a second language) varies significantly among states.

  • There is some evidence of cheating on the part of the adults involved in the NAEP tests by, for example, choosing samples that are not truly representative but instead skewed toward students who read better.

You can probably think of a dozen other problems that make it difficult to interpret NAEP results. So what can you do, as an individual, to help a child to read? Fortunately, the NAEP studies also reveal a few correlations that are uncontroversial and agree with common sense. For example, higher reading performance correlates with each of the following factors:

  • more total reading, both for school and for pleasure

  • more choice in reading—that is, allowing children to pick their own books to read

  • more writing, particularly of extended pieces such as essays or long letters

  • more discussion of reading material with friends and family

  • less television watching

These correlations give at least some guidance on how to help a child learn to read and should be good starting points for discussions of how to increase literacy.

QUESTIONS FOR DISCUSSION

  • 1. One result of the NAEP reading tests is that students in private schools tend to score significantly higher than students in public schools. Does this imply that private schools are “better” than public schools? Defend your opinion.

  • 2. Do you think that standardized tests like those of the NAEP are valid ways to measure academic achievement? Why or why not?

  • 3. Currently, the NAEP tests are given to only a few thousand of the millions of school children in the United States. Some people advocate giving similar tests to all students, on either a voluntary or a mandatory basis. Do you think such “standardized national testing” is a good idea? Why or why not?

  • 4. Have you ever helped a child learn to read? Compare your experiences with those of other classmates who have worked with young children.

  • 5. Read the latest edition of the NAEP Reading Report Card (available online). What are some of the latest results with regard to the teaching of reading in the United States?

FOCUS ON ENVIRONMENT

What Is Causing Global Warming?

Global warming is one of the most important issues of our time, yet surveys and media reports suggest that many people doubt that it is real or that humans are responsible for it. In this Focus, we will investigate the evidence that has led the vast majority of climate scientists to conclude that human activity is the cause of global warming.

As we discussed in the Focus on Environment for Chapter 3 (page 117), measurements clearly show that the atmospheric carbon dioxide concentration is rising rapidly and is now significantly higher than it has been at any time during at least the past 800,000 years (see Figure 3.46). Chemical analysis shows the added carbon dioxide is coming primarily from human activity, especially the burning of fossil fuels. Moreover, data from ice cores show that the carbon dioxide concentration is strongly correlated with the global average temperature. The key question, then, is whether this correlation implies causality. To answer it, we must understand how a gas like carbon dioxide can affect the temperature and then investigate whether the recent increase in the concentration is having the expected effects.

The fact that some atmospheric gases—called greenhouse gases—can trap heat has been well-known for more than 150 years, ever since Irish physicist John Tyndall measured the heat-absorbing effects of carbon dioxide and water vapor in his laboratory in 1859. Other scientists, most notably Swedish scientist Svante Arrhenius (1859–1927), later pointed out that the burning of fossil fuels releases carbon dioxide, and that this might therefore cause global warming. Today, the mechanism by which carbon dioxide and other greenhouse gases (the most important others being water vapor and methane) warm a planet is called the greenhouse effect. It is well understood and summarized in Figure 7.24.

Scientists can further test this understanding of the greenhouse effect by checking to see whether it successfully accounts for the temperatures of various planets. In the absence of greenhouse gases, a world’s average temperature would be determined by only two major factors: its distance from the Sun and the fraction of the incoming sunlight that its surface absorbs (the rest is reflected back into space). There is a simple equation that allows the calculation of the temperature in this case, and it successfully predicts the temperatures of worlds with no atmosphere, such as the Moon and the planet Mercury. For planets with atmospheres, however, scientists can successfully predict their temperatures only by taking the greenhouse effect into account, and the results clearly show that more greenhouse gases mean more excess heating. Our planetary neighbors vividly demonstrate this fact. Mars has a very thin carbon dioxide atmosphere that gives it a fairly weak greenhouse effect, making the planet about 11°F warmer than it would be otherwise. Venus, which has an extremely dense atmosphere containing nearly 200,000 times as much carbon dioxide as Earth’s atmosphere, has a correspondingly extreme greenhouse effect that makes its surface about 850°F hotter than it would be otherwise—giving it a surface hot enough to melt lead.
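The “simple equation” mentioned above is the standard radiative-balance formula, which sets the sunlight a world absorbs equal to the infrared energy it radiates. The short Python sketch below (not from the text; the solar constant and albedo are approximate standard values) evaluates it for Earth:

```python
# Radiative-balance temperature of a world with NO greenhouse effect:
#   T = [ S * (1 - A) / (4 * sigma) ] ** (1/4)
# S     = sunlight power per square meter at the world's distance
# A     = albedo, the fraction of incoming sunlight that is reflected
# sigma = Stefan-Boltzmann constant

SIGMA = 5.67e-8  # W / (m^2 K^4)

def no_greenhouse_temp(solar_constant, albedo):
    """Average surface temperature (kelvins), ignoring any greenhouse effect."""
    return (solar_constant * (1 - albedo) / (4 * SIGMA)) ** 0.25

# Earth: S is about 1361 W/m^2 and albedo is about 0.3
t_earth = no_greenhouse_temp(1361, 0.3)
print(round(t_earth - 273.15))  # roughly -19 C with these rounded inputs;
# the text's "about -16 C" depends on the exact albedo assumed
```

Either way, the no-greenhouse prediction for Earth is well below freezing, so the difference between it and the actual average of about 15°C is the measure of our atmosphere’s greenhouse effect.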

Figure 7.24 This diagram shows the basic mechanism of the greenhouse effect. The greater the abundance of greenhouse gases, the more the escape of infrared light is slowed and the warmer the planet becomes.

Earth is the lucky intermediate case. Without the greenhouse effect, Earth’s average temperature would be well below freezing, at about –16°C (3°F). But thanks to the carbon dioxide, methane, and water vapor in our atmosphere, the actual temperature is close to 15°C (60°F). From that standpoint, the greenhouse effect is a very good thing, because our lives would not be possible without it. Just keep in mind that the case of Venus offers proof that it’s possible to have too much of this good thing.

The evidence from laboratory measurements and studies of other planets leaves no reasonable doubt that carbon dioxide and other greenhouse gases cause a planet’s temperature to be higher than it would be otherwise. Nevertheless, because carbon dioxide is not the only thing that affects our planet’s temperature, we might wonder whether the effects of the recent rise in carbon dioxide are being offset by, say, reductions in the amount of other greenhouse gases or an increase in how much sunlight our planet reflects. Scientists can test this idea in two basic ways. First, they look at data showing changes in Earth’s average temperature. As you’ll see in Figure 7.25, Earth’s average temperature has indeed been rising, and the data give at least some hint that the rise has been accelerating in recent decades.

The second way to test the idea that human burning of fossil fuels is causing global warming is to conduct experiments. We obviously cannot perform controlled experiments with our entire planet, so scientists instead build computer models designed to simulate the way Earth’s climate works. Earth’s climate is incredibly complex, so the models cannot be perfect. Nevertheless, today’s models match real climate data quite well, giving scientists confidence that the models have predictive value. Figure 7.26 compares real data to models with and without the human contribution to the greenhouse gas concentration. We see a good match only for models that include the human contribution.

Figure 7.25 Clear evidence of a warming Earth: The black curve shows the mean global average temperature during each year; the red curve shows a running mean computed over 5-year periods. The vertical scale (“temperature anomaly”) shows the difference between each year’s actual average temperature and the average during the period 1951–1980. The blue bars represent the uncertainty ranges in the data at three different times; the uncertainty is lower for recent times because measurements have become more precise.

Figure 7.26 This graph compares observed temperature changes (black curve) with the predictions of climate models that include only natural factors such as changes in the brightness of the Sun and effects of volcanoes (blue curve), and models that also include the human-made increase in the greenhouse gas (red curve). Only the red curve matches the observations well. (The red and blue model curves are each averages of many scientists’ independent models of global warming, which generally agree with each other to within 0.1°C – 0.2°C.)

The conclusion is clear: Laboratory measurements of the greenhouse effect, studies of other planets, data for Earth’s rising carbon dioxide concentration and temperature, and computer models of the climate all provide evidence in favor of the claim that human activity is causing global warming. It is the fact that so many lines of evidence are all in agreement that makes scientists so confident that the causality is real.

QUESTIONS FOR DISCUSSION

  • 1. Look back at the six guidelines for establishing causality on page 265. Discuss whether or how each guideline is met by current data and understanding of global warming.

  • 2. Look back at the legal levels of confidence in causality discussed in Section 7.4. Would you say that the case for human activity as the cause of global warming is now at the level of possible cause, probable cause, or cause beyond reasonable doubt? Defend your opinion.

  • 3. Investigate some of the likely consequences of global warming. If current trends continue, what changes can you expect in the world by the year 2050? 2100?

  • 4. Based on what you’ve learned about the cause of global warming and its potential consequences, what do you think we should be doing about it, if anything?
