
Statistics for the Sciences
Charles Peters

Contents

1  Background
   1.1  Populations, Samples and Variables
   1.2  Types of Variables
   1.3  Random Experiments and Sample Spaces
   1.4  Computing in Statistics
   1.5  Exercises

2  Descriptive and Graphical Statistics
   2.1  Location Measures
        2.1.1  The Mean
        2.1.2  The Median and Other Quantiles
        2.1.3  Trimmed Means
        2.1.4  Grouped Data
        2.1.5  Histograms
        2.1.6  Robustness
        2.1.7  The Five Number Summary
        2.1.8  The Mode
        2.1.9  Exercises
   2.2  Measures of Variability or Scale
        2.2.1  The Variance and Standard Deviation
        2.2.2  The Coefficient of Variation
        2.2.3  The Mean and Median Absolute Deviation
        2.2.4  The Interquartile Range
        2.2.5  Boxplots
        2.2.6  Exercises
   2.3  Jointly Distributed Variables
        2.3.1  Side by Side Boxplots
        2.3.2  Scatterplots
        2.3.3  Covariance and Correlation
        2.3.4  Exercises

3  Probability
   3.1  Basic Definitions. Equally Likely Outcomes
   3.2  Combinations of Events
        3.2.1  Exercises
   3.3  Rules for Probability Measures
   3.4  Counting Outcomes. Sampling with and without Replacement
        3.4.1  Exercises
   3.5  Conditional Probability
        3.5.1  Relating Conditional and Unconditional Probabilities
        3.5.2  Bayes' Rule
   3.6  Independent Events
        3.6.1  Exercises
   3.7  Replications of a Random Experiment

4  Discrete Distributions
   4.1  Random Variables
   4.2  Discrete Random Variables
   4.3  Expected Values
        4.3.1  Exercises
   4.4  Bernoulli Random Variables
        4.4.1  The Mean and Variance of a Bernoulli Variable
   4.5  Binomial Random Variables
        4.5.1  The Mean and Variance of a Binomial Distribution
        4.5.2  Exercises
   4.6  Hypergeometric Distributions
        4.6.1  The Mean and Variance of a Hypergeometric Distribution
   4.7  Poisson Distributions
        4.7.1  The Mean and Variance of a Poisson Distribution
        4.7.2  Exercises
   4.8  Jointly Distributed Variables
        4.8.1  Covariance and Correlation
   4.9  Multinomial Distributions
        4.9.1  Exercises

5  Continuous Distributions
   5.1  Density Functions
   5.2  Expected Values and Quantiles for Continuous Distributions
        5.2.1  Expected Values
        5.2.2  Quantiles
        5.2.3  Exercises
   5.3  Uniform Distributions
   5.4  Exponential Distributions and Their Relatives
        5.4.1  Exponential Distributions
        5.4.2  Gamma Distributions
        5.4.3  Weibull Distributions
        5.4.4  Exercises
   5.5  Normal Distributions
        5.5.1  Tables of the Standard Normal Distribution
        5.5.2  Other Normal Distributions
        5.5.3  The Normal Approximation to the Binomial Distribution
        5.5.4  Exercises

6  Joint Distributions and Sampling Distributions
   6.1  Introduction
   6.2  Jointly Distributed Continuous Variables
        6.2.1  Mixed Joint Distributions
        6.2.2  Covariance and Correlation
        6.2.3  Bivariate Normal Distributions
   6.3  Independent Random Variables
        6.3.1  Exercises
   6.4  Sums of Random Variables
        6.4.1  Simulating Random Samples
   6.5  Sample Sums and the Central Limit Theorem
        6.5.1  Exercises
   6.6  Other Distributions Associated with Normal Sampling
        6.6.1  Chi Square Distributions
        6.6.2  Student t Distributions
        6.6.3  The Joint Distribution of the Sample Mean and Variance
        6.6.4  Exercises

7  Statistical Inference for a Single Population
   7.1  Introduction
   7.2  Estimation of Parameters
        7.2.1  Estimators
        7.2.2  Desirable Properties of Estimators
   7.3  Estimating a Population Mean
        7.3.1  Confidence Intervals
        7.3.2  Small Sample Confidence Intervals for a Normal Mean
        7.3.3  Exercises
   7.4  Estimating a Population Proportion
        7.4.1  Choosing the Sample Size
        7.4.2  Confidence Intervals for p
        7.4.3  Exercises
   7.5  Estimating Quantiles
        7.5.1  Exercises
   7.6  Estimating the Variance and Standard Deviation
   7.7  Hypothesis Testing
        7.7.1  Test Statistics, Type 1 and Type 2 Errors
   7.8  Hypotheses About a Population Mean
        7.8.1  Tests for the mean when the variance is unknown
   7.9  p-values
        7.9.1  Exercises
   7.10 Hypotheses About a Population Proportion
        7.10.1 Exercises

8  Regression and Correlation
   8.1  Examples of Linear Regression Problems
   8.2  Least Squares Estimates
        8.2.1  The "lm" Function in R
        8.2.2  Exercises
   8.3  Distributions of the Least Squares Estimators
        8.3.1  Exercises
   8.4  Inference for the Regression Parameters
        8.4.1  Confidence Intervals for the Parameters
        8.4.2  Hypothesis Tests for the Parameters
        8.4.3  Exercises
   8.5  Correlation
        8.5.1  Confidence Intervals for ρ
        8.5.2  Exercises

9  Inference from Multiple Samples
   9.1  Comparison of Two Population Means
        9.1.1  Large Samples
        9.1.2  Comparing Two Population Proportions
        9.1.3  Samples from Normal Distributions
        9.1.4  Exercises
   9.2  Paired Observations
        9.2.1  Crossover Studies
        9.2.2  Exercises
   9.3  More than Two Independent Samples: Single Factor Analysis of Variance
        9.3.1  Example Using R
        9.3.2  Multiple Comparisons
        9.3.3  Exercises
   9.4  Two-Way Analysis of Variance
        9.4.1  Interactions Between the Factors
        9.4.2  Exercises

10 Analysis of Categorical Data
   10.1 Multinomial Distributions
        10.1.1 Estimators and Hypothesis Tests for the Parameters
        10.1.2 Multinomial Probabilities That Are Functions of Other Parameters
        10.1.3 Exercises
   10.2 Testing Equality of Multinomial Probabilities
   10.3 Independence of Attributes: Contingency Tables
        10.3.1 Exercises

11 Miscellaneous Topics
   11.1 Multiple Linear Regression
        11.1.1 Inferences Based on Normality
        11.1.2 Using R's "lm" Function for Multiple Regression
        11.1.3 Factor Variables as Predictors
        11.1.4 Exercises
   11.2 Nonparametric Methods
        11.2.1 The Signed Rank Test
        11.2.2 The Wilcoxon Rank Sum Test
        11.2.3 Exercises
   11.3 Bootstrap Confidence Intervals
        11.3.1 Exercises

Chapter 1  Background

Statistics is the art of summarizing data, depicting data, and extracting information from it. Statistics and the theory of probability are distinct subjects, although statistics depends on probability to quantify the strength of its inferences. The probability used in this course will be developed in Chapter 3 and throughout the text as needed. We begin by introducing some basic ideas and terminology.

1.1 Populations, Samples and Variables

A population is a set of individual elements whose collective properties are the subject of investigation.

Usually, populations are large collections whose individual members cannot all be examined in detail.

In statistical inference a manageable subset of the population is selected according to certain sampling procedures and properties of the subset are generalized to the entire population. These generalizations are accompanied by statements quantifying their accuracy and reliability. The selected subset is called a sample from the population.

Examples :

(a) the population of registered voters in a congressional district, (b) the population of U.S. adult males, (c) the population of currently enrolled students at a certain large urban university, (d) the population of all transactions in the U.S. stock market for the past month, (e) the population of all peak temperatures at points on the Earth's surface over a given time interval.

Some samples from these populations might be:

(a) the voters contacted in a pre-election telephone poll, (b) adult males interviewed by a TV reporter, (c) the dean's list, (d) transactions recorded on the books of Smith Barney, (e) peak temperatures recorded at several weather stations.

Clearly, for these particular samples, some generalizations from sample to population would be highly questionable.

A population variable is an attribute that has a value for each individual in the population. In other words, it is a function from the population to some set of possible values. It may be helpful to imagine a population as a spreadsheet with one row or record for each individual member. Along the ith row, the values of a number of attributes of the ith individual are recorded in different columns.

The column headings of the spreadsheet can be thought of as the population variables. For example, if the population is the set of currently enrolled students at the urban university, some of the variables are academic classification, number of hours currently enrolled, total hours taken, grade point average, gender, ethnic classification, major, and so on. Variables, such as these, that are defined for the same population are said to be jointly observed or jointly distributed.

1.2 Types of Variables

Variables are classified according to the kinds of values they have. The three basic types are numeric variables, factor variables, and ordered factor variables. Numeric variables are those for which arithmetic operations such as addition and subtraction make sense. Numeric variables are often related to a scale of measurement and expressed in units, such as meters, seconds, or dollars. Factor variables are those whose values are mere names, to which arithmetic operations do not apply. Factors usually have a small number of possible values. These values might be designated by numbers. If they are, the numbers that represent distinct values are chosen merely for convenience. The values of factors might also be letters, words, or pictorial symbols. Factor variables are sometimes called nominal variables or categorical variables. Ordered factor variables are factors whose values are ordered in some natural and important way. Ordered factors are also called ordinal variables. Some textbooks have a more elaborate classification of variables, with various subtypes. The three types above are enough for our purposes.

Examples: Consider the population of students currently enrolled at a large university. Each student has a residency status, either resident or nonresident. Residency status is an unordered factor variable. Academic classification is an ordered factor with values "freshman", "sophomore", "junior", "senior", "post-baccalaureate" and "graduate student". The number of hours enrolled is a numeric variable with integer values. The distance a student travels from home to campus is a numeric variable expressed in miles or kilometers. Home area code is an unordered factor variable whose values are designated by numbers.
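The three types map directly onto R's data structures. The following sketch is not from the text and uses made-up values; it shows a numeric vector, an unordered factor, and an ordered factor:

> hours <- c(12, 15, 9)                                     # numeric variable
> residency <- factor(c("resident", "nonresident", "resident"))   # unordered factor
> class.rank <- factor(c("freshman", "junior", "sophomore"),
+   levels=c("freshman", "sophomore", "junior", "senior",
+            "post-baccalaureate", "graduate student"),
+   ordered=TRUE)                                            # ordered factor
> mean(hours)                       # arithmetic makes sense only for the numeric variable
> table(residency)                  # for factors we can only count values
> class.rank[1] < class.rank[2]     # comparisons respect the ordering of an ordered factor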

1.3 Random Experiments and Sample Spaces

An experiment can be something as simple as flipping a coin or as complex as conducting a public opinion poll. A random experiment is one with the following two characteristics:

(1) The experiment can be replicated an indefinite number of times under essentially the same experimental conditions.

(2) There is a degree of uncertainty in the outcome of the experiment. The outcome may vary from replication to replication even though experimental conditions are the same.

When we say that an experiment can be replicated under the same conditions, we mean that controllable or observable conditions that we think might affect the outcome are the same. There may be hidden conditions that affect the outcome, but we cannot account for them. Implicit in (1) is the idea that replications of a random experiment are independent, that is, the outcomes of some replications do not affect the outcomes of others. Obviously, a random experiment is an idealization of a real experiment. Some simple experiments, such as tossing a coin, approach this ideal closely while more complicated experiments may not.

The sample space of a random experiment is the set of all its possible outcomes. We use the Greek capital letter Ω (omega) to denote the sample space. There is some degree of arbitrariness in the description of Ω. It depends on how the outcomes of the experiment are represented symbolically.

Examples :

(a) Toss a coin.

Ω = {H, T}, where "H" denotes a head and "T" a tail. Another way of representing the outcome is to let the number 1 denote a head and 0 a tail (or vice-versa). If we do this, then Ω = {0, 1}. In the latter representation the outcome of the experiment is just the number of heads.

(b) Toss a coin 5 times, i.e., replicate the experiment in (a) 5 times. An outcome of this experiment is a 5 term sequence of heads and tails. A typical outcome might be indicated by (H,T,T,H,H), or by (1,0,0,1,1). Even for this little experiment it is cumbersome to list all the outcomes, so we use a shorter notation:

Ω = {(x_1, x_2, x_3, x_4, x_5) | x_i = 0 or x_i = 1 for each i}.

(A short R enumeration of this sample space is sketched after example (d).)

(c) Select a student randomly from the population of all currently enrolled students. The sample space is the same as the population. The word "randomly" is vague. We will define it later.

(d) Repeat the Michelson-Morley experiment to measure the speed of the Earth relative to the ether (which doesn't exist, as we now know). The outcome of the experiment could conceivably be any nonnegative number, so we take Ω = [0, ∞) = {x | x is a real number and x ≥ 0}. Uncertainty arises from the fact that this is a very delicate experiment with several sources of unpredictable error.
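As a small illustration (an R sketch, not part of the original text), the 32 outcomes of example (b) can be enumerated with expand.grid, coding heads as 1 and tails as 0:

> outcomes <- expand.grid(x1=0:1, x2=0:1, x3=0:1, x4=0:1, x5=0:1)
> nrow(outcomes)    # 32 = 2^5 possible outcomes
> head(outcomes)    # the first few 5-term sequences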

1.4 Computing in Statistics

Even moderately large data sets cannot be managed effectively without a computer and computer software. Furthermore, much of applied statistics is exploratory in nature and cannot be carried out by hand, even with a calculator. Spreadsheet programs, such as Microsoft Excel, are designed to manipulate data in tabular form and have functions for performing the common tasks of statistics. In addition, many add-ins are available, some of them free, for enhancing the graphical and statistical capabilities of spreadsheet programs. Some of the exercises and examples in this text make use of Excel with its built-in data analysis package. Because it is so common in the business world, it is important for students to have some experience with Excel or a similar program.

The disadvantages of spreadsheet programs are their dependence on the spreadsheet data format with cell ranges as input for statistical functions, their lack of flexibility, and their relatively poor graphics. Many highly sophisticated packages for statistics and data analysis are available. Some of the best known commercial packages are Minitab, SAS, SPSS, Splus, Stata, and Systat. The package used in this text is called R. It is an open source implementation of the same language used in Splus and may be downloaded free at http://www.r-project.org.

After downloading and installing R we recommend that you download and install another free package called Rstudio. It can be obtained from http://www.rstudio.com.

Rstudio makes importing data into R much easier and makes it easier to integrate R output with other programs. Detailed instructions on using R and Rstudio for the exercises will be provided.

Data files used in this course are from four sources. Some are local in origin and come from student or course data at the University of Houston. Others are simulated but made to look as realistic as possible. These and others are available at http://www.math.uh.edu/~charles/data.

Many data sets are included with R in the datasets library and other contributed packages. We will refer to them frequently. The main external sources of data are the data archives maintained by the Journal of Statistics Education .

www.amstat.org/publications/jse and the Statistical Science Web :

http://www.stasci.org/datasets.html.

1.5 Exercises

1. Go to http://www.math.uh.edu/~charles/data. Examine the data set "Air Pollution Filter Noise".

Identify the variables and give their types.

2. Highlight the data in Air Pollution Filter Noise. Include the column headings but not the language preceding the column headings. Copy and paste the data into a plain text file, for example with Notepad in Windows. Import the text file into Excel or another spreadsheet program. Create a new folder or directory named "math3339" and save both files there.

3. Start R by double clicking on the big blue R icon on your desktop. Click on the file menu at the top of the R Gui window. Select "change dir ...". In the window that opens next, find the name of the directory where you saved the text file and double click on the name of that directory. Suppose that you named your file "apfilternoise". (Name it anything you like.) Import the file into R with the command

> apfilternoise=read.table("apfilternoise.txt",header=T)

and display it with the command

> apfilternoise

Click on the file menu at the top again and select "Exit". At the prompt to save your workspace, click "Yes". If you open the folder where your work was saved you will see another big blue R icon. If you double click on it, R will start again and your previously saved workspace will be restored.

If you use Rstudio for this exercise you can import apfilternoise into R by clicking on the "Import Dataset" tab. This will open a window on your file system and allow you to select the file you saved in Exercise 2. The dialog box allows you to rename the data and make other minor changes before importing the data as a data frame in R.

4. If you are using Rstudio, click on the "Packages" tab and then the word "datasets". Find the data set "airquality" and click on it. Read about it. If you are using R alone, type > help(airquality) at the command prompt > in the Console window.

Then type > airquality to view the data. Could "Month" and "Day" be considered ordered factors rather than numeric variables?

5. A random experiment consists of throwing a standard 6-sided die and noting the number of spots on the upper face. Describe the sample space of this experiment.

6. An experiment consists of replicating the experiment in exercise 5 four times. Describe the sample space of this experiment. How many possible outcomes does this experiment have?

Chapter 2  Descriptive and Graphical Statistics

A large part of a statistician's job consists of summarizing and presenting important features of data.

Simply looking at a spreadsheet with 1000 rows and 50 columns conveys very little information. Most likely, the user of the data would rather see numerical and graphical summaries of how the values of different variables are distributed and how the variables are related to each other. This chapter concerns some of the most important ways of summarizing data.

2.1 Location Measures

2.1.1 The Mean

Suppose that x is the name of a numeric variable whose values are recorded either for the entire population or for a sample from that population. Let the n recorded values of x be denoted by x_1, x_2, ..., x_n. These are not necessarily distinct numbers. The mean or average of these values is

x̄ = (1/n) ∑_{i=1}^n x_i.

When the values of x for the entire population are included, it is customary to denote this quantity by μ(x) and call it the population mean. The mean is called a location measure partly because it is taken as a representative or central value of x. More importantly, it behaves in a certain way if we change the scale of measurement for values of x. Imagine that x is temperature recorded in degrees Celsius and we decide to change the unit of measurement to degrees Fahrenheit. If y_i denotes the Fahrenheit temperature of the ith individual, then y_i = 1.8 x_i + 32. In effect, we have defined a new variable y by the equation y = 1.8x + 32. The means of the new and old variables have the same relationship as the individual measurements have.

ȳ = (1/n) ∑_{i=1}^n y_i = (1/n) ∑_{i=1}^n (1.8 x_i + 32) = 1.8 x̄ + 32.

In general, if a and b > 0 are constants and y = a + bx, then ȳ = a + b x̄. Other location measures introduced below behave in the same way.

When there are repeated values of x, there is an equivalent formula for the mean. Let the m distinct values of x be denoted by v_1, ..., v_m. Let n_i be the number of times v_i is repeated and let f_i = n_i / n.

Note that ∑_{i=1}^m n_i = n and ∑_{i=1}^m f_i = 1. Then the average is given by

x̄ = ∑_{i=1}^m f_i v_i.

The number n_i is the frequency of the value v_i and f_i is its relative frequency.
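This identity is easy to check in R. A sketch with made-up values:

> x <- c(2, 2, 3, 5, 5, 5)                 # data with repeated values
> v <- as.numeric(names(table(x)))         # distinct values v_i
> f <- as.vector(table(x)) / length(x)     # relative frequencies f_i
> sum(f * v)                               # frequency-weighted average
> mean(x)                                  # same answer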

2.1.2 The Median and Other Quantiles

Let x be a numeric variable with values x_1, x_2, ..., x_n. Arrange the values in increasing order x_(1) ≤ x_(2) ≤ ... ≤ x_(n). The median of x is a number median(x) such that at least half the values of x are ≤ median(x) and at least half the values of x are ≥ median(x). This conveys the essential idea but unfortunately it may define an interval of numbers rather than a single number. The ambiguity is usually resolved by taking the median to be the midpoint of that interval. Thus, if n is odd, n = 2k + 1, where k is a positive integer, median(x) = x_(k+1), while if n is even, n = 2k,

median(x) = (x_(k) + x_(k+1)) / 2.

Let p ∈ (0, 1) be a number between 0 and 1. The pth quantile of x is more commonly known as the 100pth percentile; e.g., the 0.8 quantile is the same as the 80th percentile. We define it as a number q(x, p) such that the fraction of values of x that are ≤ q(x, p) is at least p and the fraction of values of x that are ≥ q(x, p) is at least 1 − p. For example, at least 80 percent of the values of x are ≤ the 80th percentile of x and at least 20 percent of the values of x are ≥ its 80th percentile. Again, this may not define a unique number q(x, p). Software packages have rules for resolving the ambiguity, but the details are usually not important.

The median is the 50th percentile, i.e., the 0.5 quantile. The 25th and 75th percentiles are called the first and third quartiles. The 10th, 20th, 30th, etc. percentiles are called the deciles. The median is a location measure as defined in the preceding section.
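In R, the quantile function computes percentiles (R actually offers several interpolation rules through its type argument, which is one reason different software can give slightly different answers). A sketch with made-up data:

> x <- c(3, 5, 7, 8, 12, 13, 14, 18, 21)
> median(x)
> quantile(x, probs=c(0.25, 0.5, 0.75))    # quartiles
> quantile(x, probs=0.8)                   # 80th percentile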

2.1.3 Trimmed Means

Trimmed means of a variable x are obtained by finding the mean of the values of x excluding a given percentage of the largest and smallest values. For example, the 5% trimmed mean is the mean of the values of x excluding the largest 5% of the values and the smallest 5% of the values. In other words, it is the mean of all the values between the 5th and 95th percentiles of x. A trimmed mean is a location measure.

2.1.4 Grouped Data

Sometimes large data sets are summarized by grouping values. Let x be a numeric variable with values x_1, x_2, ..., x_n. Let c_0 < c_1 < ... < c_m be numbers such that all the values of x are between c_0 and c_m. For each i, let n_i be the number of values of x (including repetitions) that are in the interval (c_{i−1}, c_i], i.e., the number of indices j such that c_{i−1} < x_j ≤ c_i. A frequency table of x is a table showing the class intervals (c_{i−1}, c_i] along with the frequencies n_i with which the data values fall into each interval. Sometimes additional columns are included showing the relative frequencies f_i = n_i / n, the cumulative relative frequencies F_i = ∑_{j≤i} f_j, and the midpoints of the intervals.

Example 2.1. The data below are 50 measured reaction times in response to a sensory stimulus, arranged in increasing order. A frequency table is shown below the data.

0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80
0.88 1.02 1.08 1.12 1.13 1.17 1.21 1.23 1.35 1.41
1.42 1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72
1.86 1.90 1.91 2.07 2.09 2.16 2.17 2.20 2.29 2.32
2.39 2.47 2.60 2.86 3.43 3.43 3.77 3.97 4.54 4.73

Interval   Midpoint   n_i   f_i    F_i
(0,1]      0.5        11    0.22   0.22
(1,2]      1.5        22    0.44   0.66
(2,3]      2.5        11    0.22   0.88
(3,4]      3.5         4    0.08   0.96
(4,5]      4.5         2    0.04   1.00

If only a frequency table like the one above is given, the mean and median cannot be calculated exactly. However, they can be estimated. If we take the midpoint of an interval as a stand-in for all the values in that interval, then we can use the formula in the preceding section for calculating a mean with repeated values. Thus, in the example above, we would estimate the mean as

0.22(0.5) + 0.44(1.5) + 0.22(2.5) + 0.08(3.5) + 0.04(4.5) = 1.78.

Estimating the median is a bit more difficult. By examining the cumulative frequencies F_i, we see that 22% of the data is less than or equal to 1 and 66% of the data is less than or equal to 2. Therefore, the median lies between 1 and 2. That is, it is 1 + a certain fraction of the distance from 1 to 2. A reasonable guess at that fraction is given by linear interpolation between the cumulative frequencies at 1 and 2. In other words, we estimate the median as

1 + [(0.50 − 0.22) / (0.66 − 0.22)] (2 − 1) = 1.636.

A cruder estimate of the median is just the midpoint of the interval that contains the median, in this case 1.5. We leave it as an exercise to calculate the mean and median from the data of Example 1 and to compare them to these estimates.
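The frequency table of Example 2.1 can be rebuilt in R with cut and table. This is a sketch; it assumes the reaction times are in the file reacttimes.txt used elsewhere in this chapter:

> reacttimes <- read.table("reacttimes.txt", header=T)
> intervals <- cut(reacttimes$Times, breaks=0:5)   # class intervals (0,1], ..., (4,5]
> n.i <- table(intervals)                          # absolute frequencies
> f.i <- n.i / sum(n.i)                            # relative frequencies
> F.i <- cumsum(f.i)                               # cumulative relative frequencies
> cbind(n.i, f.i, F.i)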

2.1.5 Histograms

The figure below is a histogram of the reaction times.

> reacttimes=read.table("reacttimes.txt",header=T)
> hist(reacttimes$Times,breaks=0:5,xlab="Reaction Times",main=" ")

The histogram is a graphical depiction of the grouped data. The end points c_i of the class intervals are shown on the horizontal axis. This is an absolute frequency histogram because the heights of the vertical bars above the class intervals are the absolute frequencies n_i. A relative frequency histogram would show the relative frequencies f_i. A density histogram has bars whose heights are the relative frequencies divided by the lengths of the corresponding class intervals. Thus, in a density histogram the area of the bar is equal to the relative frequency. If all class intervals have the same length, these types of histograms all have the same shape and convey the same visual information.

[Figure: absolute frequency histogram of the reaction times, with class intervals from 0 to 5 on the horizontal axis ("Reaction Times") and frequencies from 0 to 20 on the vertical axis ("Frequency").]

2.1.6 Robustness

A robust measure of location is one that is not affected by a few extremely large or extremely small values. Values of a numeric variable that lie a great distance from most of the other values are called outliers. Outliers might be the result of mistakes in measuring or recording data, perhaps from misplacing a decimal point. The mean is not a robust location measure. It can be affected significantly by a single extreme outlier if that outlying value is extreme enough. Thus, if there is any doubt about the quality of the data, the median or a trimmed mean might be preferred to the mean as a reliable location measure. The median is very insensitive to outliers. A 5% trimmed mean is insensitive to outliers that make up no more than 5% of the data values.
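A quick sketch (not from the text) of how a single outlier moves the mean but barely affects the median or a trimmed mean:

> x <- c(1, 2, 3, 4, 5)
> c(mean(x), median(x))       # both equal 3
> y <- c(1, 2, 3, 4, 500)     # one extreme outlier
> c(mean(y), median(y))       # the mean jumps to 102, the median is still 3
> mean(y, trim=0.2)           # trimming one value from each end also ignores the outlier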

2.1.7 The Five Number Summary

The five number summary is a convenient way of summarizing numeric data. The five numbers are the minimum value, the first quartile (25th percentile), the median, the third quartile (75th percentile), and the maximum value. Sometimes the mean is also included, which makes it a six number summary.

Example 2.2. The natural logarithms y of the data values x in Example 1 are, to two places:

-2.12 -1.20 -1.05 -0.99 -0.82 -0.56 -0.49 -0.48 -0.34 -0.22
-0.13  0.02  0.08  0.11  0.12  0.16  0.19  0.21  0.30  0.34
 0.35  0.35  0.38  0.40  0.42  0.43  0.47  0.48  0.52  0.54
 0.62  0.64  0.65  0.73  0.74  0.77  0.78  0.79  0.83  0.84
 0.87  0.90  0.96  1.05  1.23  1.23  1.33  1.38  1.51  1.55

It is sometimes advantageous to transform data in some way, i.e., to define a new variable y as a function of the old variable x. In this case, we have transformed the reaction times x with the natural logarithm transformation. We might want to do this so that we can more easily apply certain statistical inference procedures you will learn about later. The six number summary of the transformed data y is:

> reacttimes=read.table("reacttimes.txt",header=T)
> summary(log(reacttimes$Times))

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-2.12000  0.08605  0.42520  0.33710  0.78500  1.55400

2.1.8 The Mode

The mode of a variable is its most frequently occurring value. With numeric variables the mode is less important than the mean and median for descriptive purposes or for statistical inference. For factor variables the mode is the most natural way of choosing a "most representative" value. We hear this frequently in the media, in statements such as "Financial problems are the most common cause of marital strife". For grouped numeric data the modal class interval is the class interval having the highest absolute or relative frequency. In Example 1, the modal class interval is the interval (1,2].

2.1.9 Exercises

1. Find the mean and median of the reaction time data in Example 1.

2. Find the quartiles of the reaction time data. There is more than one acceptable answer.

3. The 40th value x 40 of the reaction time data has a value of 2.32. Replace that with 232.0.

Recalculate the mean and median. Comment.

4. Construct a frequency table like the one in Example 1 for the log-transformed reaction times of Example 2. Use 5 class intervals of equal length beginning at -3 and ending at 2. Draw an absolute frequency histogram.

5. Estimate the mean and median of the grouped log-transformed reaction times by using the techniques discussed in Example 1. Compare your answers to the summary in Example 2.

6. Repeat exercises 1, 2, and the histogram of exercise 4 by using R.

7. Let x be a numeric variable with values x_1, ..., x_{n−1}, x_n. Let x̄_n be the average of all n values and let x̄_{n−1} be the average of x_1, ..., x_{n−1}. Show that x̄_n = (1 − 1/n) x̄_{n−1} + (1/n) x_n. What happens if x_n → ∞ while all the other values of x are fixed?

2.2 Measures of Variability or Scale

2.2.1 The Variance and Standard Deviation

Let x be a population variable with values x_1, x_2, ..., x_n. Some of the values might be repeated. The variance of x is

var(x) = σ² = (1/n) ∑_{i=1}^n (x_i − μ(x))².

The standard deviation of x is

sd(x) = σ = √var(x).

When x_1, x_2, ..., x_n are values of x from a sample rather than the entire population, we modify the definition of the variance slightly, use a different notation, and call these objects the sample variance and standard deviation.

s² = (1/(n−1)) ∑_{i=1}^n (x_i − x̄)²,   s = √s².

The reason for modifying the definition for the sample variance has to do with its properties as an estimate of the population variance.

Alternate algebraically equivalent formulas for the variance and sample variance are

σ² = (1/n) ∑_{i=1}^n x_i² − μ(x)²,   s² = (1/(n−1)) (∑_{i=1}^n x_i² − n x̄²).

These are sometimes easier to use for hand computation.
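Note that R's var and sd functions always use the sample (n − 1) versions. A sketch comparing the formulas on made-up data:

> x <- c(2, 4, 4, 4, 5, 5, 7, 9)
> n <- length(x)
> var(x); sd(x)                          # sample variance and standard deviation
> mean((x - mean(x))^2)                  # population variance: divide by n instead of n - 1
> (sum(x^2) - n*mean(x)^2) / (n - 1)     # alternate formula, agrees with var(x)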

The standard deviation is called a measure of scale because of the way it behaves under linear transformations of the data. If a new variable y is defined by y = a + bx, where a and b are constants, sd(y) = |b| sd(x). For example, the standard deviation of Fahrenheit temperatures is 1.8 times the standard deviation of Celsius temperatures. The transformation y = a + bx can be thought of as a rescaling operation, or a choice of a different system of measurement units, and the standard deviation takes account of it in a natural way.

2.2.2 The Coefficient of Variation

For a variable that has only positive values, it may be more important to measure the relative variability than the absolute variability. That is, the amount of variation should be compared to the mean value of the variable. The coefficient of variation for a population variable is defined as

cv(x) = sd(x) / μ(x).

For a sample of values of x we substitute the sample standard deviation s and the sample average x̄.
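Base R has no built-in coefficient of variation, but it is a one-line computation (a sketch using the same made-up sample as above):

> x <- c(2, 4, 4, 4, 5, 5, 7, 9)
> sd(x) / mean(x)    # sample coefficient of variation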

2.2.3 The Mean and Median Absolute Deviation

Suppose that you must choose a single number c to represent all the values of a variable x as accurately as possible. One measure of the overall error with which c represents the values of x is

g(c) = √[ (1/n) ∑_{i=1}^n (x_i − c)² ].

In the exercises, you are asked to show that this expression is minimized when c = x̄. In other words, the single number which most accurately represents all the values is, by this criterion, the mean of the variable. Furthermore, the minimum possible overall error, by this criterion, is the standard deviation.

However, this is not the only reasonable criterion. Another is

h(c) = (1/n) ∑_{i=1}^n |x_i − c|.

It can be shown that this criterion is minimized when c = median(x). The minimum value of h(c) is called the mean absolute deviation from the median. It is a scale measure which is somewhat more robust (less affected by outliers) than the standard deviation, but still not very robust. A related very robust measure of scale is the median absolute deviation from the median, or mad:

mad(x) = median(|x − median(x)|).

2.2.4 The Interquartile Range

The interquartile range of a variable x is the difference between its 75th and 25th percentiles.

IQR(x) = q(x, 0.75) − q(x, 0.25).

It is a robust measure of scale which is important in the construction and interpretation of boxplots, discussed below.

All of these measures of scale are valid for comparison of the "spread" or variability of numeric variables about a central value. In general, the greater their values, the more spread out the values of the variable are. Of course, the standard deviation, median absolute deviation, and interquartile range of a variable will be different numbers and one must be careful to compare like measures.
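All of these scale measures are one-liners in R. One caution: by default R's mad function multiplies the median absolute deviation by 1.4826 so that it estimates the standard deviation for normally distributed data; set constant=1 to get the plain median absolute deviation defined above. A sketch with made-up data:

> x <- c(2, 4, 4, 4, 5, 5, 7, 9)
> sd(x)
> IQR(x)
> mad(x)                        # scaled by 1.4826 by default
> mad(x, constant=1)            # median(|x - median(x)|)
> mean(abs(x - median(x)))      # mean absolute deviation from the median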

2.2.5 Boxplots

Boxplots are also called box and whisker diagrams. Essentially, a boxplot is a graphical representation of the five number summary. The boxplot below depicts the sensory response data of the preceding section without the log transformation.

> reacttimes=read.table("reacttimes.txt",header=T)
> boxplot(reacttimes$Times,horizontal=T,xlab="Reaction Times")
> summary(reacttimes)
     Times
 Min.   :0.120
 1st Qu.:1.090
 Median :1.530
 Mean   :1.742
 3rd Qu.:2.192
 Max.   :4.730

[Figure: horizontal boxplot of the reaction times.]

The central box in the diagram encloses the middle 50% of the numeric data. Its left and right boundaries mark the first and third quartiles. The boldface middle line in the box marks the median of the data. Thus, the interquartile range is the distance between the left and right boundaries of the central box. For construction of a boxplot, an outlier is defined as a data value whose distance from the nearest quartile is more than 1.5 times the interquartile range. Outliers are indicated by isolated points (tiny circles in this boxplot). The dashed lines extending outward from the quartiles are called the whiskers. They extend from the quartiles to the most extreme values in either direction that are not outliers.

This boxplot shows a number of interesting things about the response time data.

(a) The median is about 1.5. The interquartile range is slightly more than 1.

(b) The three largest values are outliers. They lie a long way from most of the data. They might call for special investigation or explanation.

(c) The distribution of values is not symmetric about the median. The values in the lower half of the data are more crowded together than those in the upper half. This is shown by comparing the distances from the median to the two quartiles, by the lengths of the whiskers, and by the presence of outliers at the upper end.

The asymmetry of the distribution of values is also evident in the histogram of the preceding section.

2.2.6 Exercises

1. Find the variance and standard deviation of the response time data. Treat it as a sample from a larger population.

2. Find the interquartile range and the median absolute deviation for the response time data.

3. In the response time data, replace the value x_40 = 2.32 by 232.0. Recalculate the standard deviation, the interquartile range and the median absolute deviation and compare with the answers from problems 1 and 2.

4. Make a boxplot of the log-transformed reaction time data. Is the transformed data more symmetrically distributed than the original data?

5. Show that the function g(c) in section 2.2.3 is minimized when c = μ(x). Hint: Minimize g(c)².

6. Find the variance, standard deviation, IQR, mean absolute deviation and median absolute deviation of the variable "Ozone" in the data set "airquality". Use R or Rstudio. You can address the variable Ozone directly if you attach the airquality data frame to the search path as follows:

> attach(airquality)

The R functions you will need are "sd" for standard deviation, "var" for variance, "IQR" for the interquartile range, and "mad" for the median absolute deviation. There is no built-in function in R for the mean absolute deviation, but it is easy to obtain it.

> mean(abs(Ozone-median(Ozone)))

2.3 Jointly Distributed Variables

When two or more variables are jointly distributed, or jointly observed, it is important to understand how they are related and how closely they are related. We will first consider the case where one variable is numeric and the other is a factor.

This suggests that the placement test is not especially good at predicting a student's nal grade in the course. Notice the two outliers. The outlier for the "W" group is clearly a mistake in recording data because the scale of scores only went to 100.

> test.vs.grade=read.csv("test.vs.grade.csv",header=T) > attach(test.vs.grade) > plot(Test~Grade,varwidth=T) Go to TOC CHAPTER 2. DESCRIPTIVE AND GRAPHICAL STATISTICS 222.3.2 Scatterplots Suppose xand yare two jointly distributed numeric variables. Whether we consider the entire population or a sample from the population, we have the same number nof observed values for each variable. If we plot the npoints ( x 1; y 1) ; (x 2; y 2) ; : : : ; (x n; y n) in a Cartesian plane, we obtain a scatterplot or a scatter diagram of the two variables. Below are the rst 6 rows of the "Payroll" data set. The column labeled "payroll" is the total monthly payroll in thousands of dollars for each company listed. The column "employees" is the number of employees in each company and "industry" indicates which of two related industries the company is in. A scatterplot of all 50 values of the two variables "payroll" and "employees" is also shown.

> Payroll=read.table("Payroll.txt",header=T) > Payroll[1:6,] payroll employees industry 1 190.67 85 A 2 233.58 109 AA B C D F W 40 60 80 100 120 Grade Test Go to TOC CHAPTER 2. DESCRIPTIVE AND GRAPHICAL STATISTICS 23 3 244.04 130 B 4 351.41 166 A 5 298.60 154 B 6 241.43 124 B > attach(Payroll) > plot(payroll~employees,col=industry) The scatterplot shows that in general the more employees a company has, the higher its monthly payroll. Of course this is expected. It also shows that the relationship between the number of employees and the payroll is quite strong. For any given number of employees, the variation in payrolls for that number is small compared to the overall variation in payrolls for all employment levels. In this plot, the data from industry A is in black and that from industry B is red. The plot shows that for employees 100, payrolls for industry A are generally greater than those for industry B at the same level of employment.50 100 150 150 200 250 300 350 employees payroll Go to TOC CHAPTER 2. DESCRIPTIVE AND GRAPHICAL STATISTICS 24 2.3.3 Covariance and Correlation If xand yare jointly distributed numeric variables, we de ne their covariance as cov (x; y ) = 1 n n X i =1 ( x i (x ))( y i (y )) :

If xand ycome from samples of size nrather than the whole population, replace the denominator n by n 1 and the population means (x ), (y ) by the sample means x, y to obtain the sample covariance. The sign of the covariance reveals something about the relationship between xand y. If the covariance is negative, values of xgreater than (x ) tend to be accompanied by values of yless than (y ). Values of xless than (x ) tend to go with values of ygreater than (y ), so xand ytend to deviate from their means in opposite directions. If cov(x; y )> 0, they tend to deviate in the same direction. The strength of these tendencies is not expressed by the covariance because its magnitude depends on the variability of each of the variables about its mean. To correct this, we divide each deviation in the sum by the standard deviation of the variable. The resulting quantity is called the correlation between xand y:

cor(x; y ) = cov (x; y ) sd (x ) sd (y ):

The correlation between payroll and employees in the example above is 0.9782 (97.82 %).

Theorem 2.1. The correlation between xand ysatis es 1 cor (x; y ) 1. cor (x; y ) = 1 if and only if there are constants aand b >0 such that y= a+ bx .cor (x; y ) = 1 if and only if y= a+ bx with b <0.

A correlation close to 1 indicates a strong positive relationship (tending to vary in the same direction from their means) between xand ywhile a correlation close to 1 indicates a strong negative rela- tionship. A correlation close to 0 indicates that there is no linearrelationship between xand y. In this case, xand yare said to be (nearly) uncorrelated . There might be a relationship between xand y but it would be nonlinear. The picture below shows a scatterplot of two variables that are clearly related but very nearly uncorrelated.

> xs=runif(500,0,3*pi) > ys=sin(xs)+rnorm(500,0,.15) > cor(xs,ys) [1] 0.004200081 > plot(xs,ys) Go to TOC CHAPTER 2. DESCRIPTIVE AND GRAPHICAL STATISTICS 25Some sample scatterplots of variables with di erent population correlations are shown below.0 2 4 6 8 1.0 0.5 0.0 0.5 1.0 1.5 xs ys Go to TOC CHAPTER 2. DESCRIPTIVE AND GRAPHICAL STATISTICS 262.3.4 Exercises 1. With the Air Pollution Filter Noise data, construct side by side boxplots of the variable NOISE for the di erent levels of the factor SIZE. Comment. Do the same for NOISE and TYPE.

2. With the Payroll data, construct side by side boxplots of "employees" versus "industry" and "pay- roll" versus "industry". Are these boxplots as informative as the color coded scatterplot in Section 2.3.2?

3. If you are using Rstudio click on the "Packages" tab, then the checkbox next to the library MASS.

Click on the word MASS and then the data set "mammals" and read about it. If you are using R alone, in the Console window at the prompt >type > data(mammals,package="MASS").

View the data with 1 0 1 2 4 2 0 1 2 3 cor(x,y)=0 2 1 0 1 2 3 2 1 0 1 2 cor(x,y)=0.3 3 1 0 1 2 3 4 3 1 1 2 cor(x,y)= 0.5 2 1 0 1 2 2 1 0 1 2 cor(x,y)=0.9 Go to TOC CHAPTER 2. DESCRIPTIVE AND GRAPHICAL STATISTICS 27 > mammals Make a scatterplot with the following commands and comment on the result.

> attach(mammals) > plot(body,brain) Also make a scatterplot of the log transformed body and brain weights.

> plot(log(body),log(brain)) A recently discovered hominid species homo oresiensishad an estimated average body weight of 25 kg. Based on the scatterplots, what would you guess its brain weight to be?

4. Let xand ybe jointly distributed numeric variables and let z= a+ by, where aand bare constants. Show that cov(x; z ) = b cov (x; y ). Show that if b >0; cor (x; z ) = cor(x; y ). What happens if b <0? Go to TOC Chapter 3 Probability 3.1 Basic De nitions. Equally Likely Outcomes Let a random experiment with sample space be given. Recall from Chapter 1 that is the set of all possible outcomes of the experiment. An event is a subset of . A probability measure is a function which assigns numbers between 0 and 1 to events. If the sample space , the collection of events, and the probability measure are all speci ed, they constitute a probability model of the random experiment.

The simplest probability models have a nite sample space . The collection of events is the col- lection of all subsets of and the probability of an event is simply the proportion of all possible outcomes that correspond to that event. In such models, we say that the experiment has equal ly likely outcomes . If the sample space has Nelements, then each elementary eventf! g consisting of a single outcome has probability 1 N . If Eis a subset of , then P r(E ) = #( E) N :

Here we introduce some notation that will be used throughout this text. The probability measure for a random experiment is most often denoted by the abbreviation P r, sometimes with subscripts.

Events will be denoted by upper case Latin letters near the beginning of the alphabet. The expression #( E) denotes the number of elements of the subset E.

Example 3.1. The Payroll data consists of 50 observations of 3 variables, "payroll", "employees" and "industry". Suppose that a random experiment is to choose one record from the Payroll data and suppose that the experiment has equally likely outcomes. Then, as the summary below shows, the probability that industry A is selected is P r(industry =A) = 27 50 = 0 :54 :

> Payroll=read.table("Payroll.txt",header=T) > summary(Payroll) 28 Go to TOC CHAPTER 3. PROBABILITY 29 payroll employees industry Min. :129.1 Min. : 26.00 A:27 1st Qu.:167.8 1st Qu.: 71.25 B:23 Median :216.1 Median :108.50 Mean :228.2 Mean :106.42 3rd Qu.:287.8 3rd Qu.:143.25 Max. :354.8 Max. :172.00 In this example we use another common and convenient notational convention. The event whose probability we want is described in quasi-natural language as "industry=A" rather than with the the formal but too cumbersome f! 2 P ayroll jindustry (! ) = Ag. The description "industry=A" refers to the set of all possible outcomes of the experiment for which the variable "industry" has the value "A".

This sort of informal description of an event will be used again and again.

The assumption of equally likely outcomes is an assumption about the selection procedure for ob- taining one record from the data. It is conceivable that a selection method is employed for which this assumption is not valid. If so, we should be able to discover that it is invalid by replicating the experiment su ciently many times. This is a basic principle of classical statistical inference. It relies on a famous result of mathematical probability theory called the law of large numbers. One version of it is loosely stated as follows:

Law of Large Numbers : Let Ebe an event associated with a random experiment and let P rbe the probability measure of a true probability model of the experiment. Suppose the experiment is repli- cated ntimes and let b P r (E ) = 1 n # replications in which E occurs. Then b P r (E ) ! P r(E ) as n ! 1 .

b P r (E ) is called the empirical probability of E.
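A small simulation illustrates the law of large numbers. The following lines are an illustrative sketch (not from the text); they repeatedly select a record from the Payroll data with equally likely outcomes and track the running proportion of times industry A appears.

set.seed(1)                                   # for reproducibility
n <- 10000
draws <- sample(Payroll$industry, n, replace = TRUE)
cumsum(draws == "A")[c(10, 100, 1000, 10000)] / c(10, 100, 1000, 10000)
# the running proportions should approach Pr(industry = A) = 0.54 as n grows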

3.2 Combinations of Events

Events are related to other events by familiar set operations. Let E1, E2, ... be a finite or infinite sequence of events. The union of E1 and E2 is the event

  E1 ∪ E2 = {ω ∈ Ω | ω ∈ E1 or ω ∈ E2}.

More generally,

  ∪i Ei = E1 ∪ E2 ∪ ... = {ω ∈ Ω | ω ∈ Ei for some i}.

The intersection of E1 and E2 is the event

  E1 ∩ E2 = {ω ∈ Ω | ω ∈ E1 and ω ∈ E2},

and, in general,

  ∩i Ei = E1 ∩ E2 ∩ ... = {ω ∈ Ω | ω ∈ Ei for all i}.

Sometimes we omit the intersection symbol ∩ and simply conjoin the symbols for the events in an intersection. In other words, E1E2...En = E1 ∩ E2 ∩ ... ∩ En.

The complement of the event E is the event Ē = {ω ∈ Ω | ω ∉ E}. Ē occurs if and only if E does not occur. The event E1 − E2 (that is, E1 ∩ Ē2) occurs if and only if E1 occurs and E2 does not occur.

Finally, the entire sample space Ω is an event with complement ∅, the empty event. The empty event never occurs. We need the empty event because it is possible to formulate a perfectly sensible description of an event which happens never to be satisfied. For example, if Ω = Payroll the event "employees < 25" is never satisfied, so it is the empty event.

We also have the subset relation between events. E1 ⊆ E2 means that if E1 occurs, then E2 occurs, or in more familiar language, E1 is a subset of E2. For any event E, it is true that E ⊆ Ω.

E2 ⊇ E1 means the same as E1 ⊆ E2.

3.2.1 Exercises 1. A random experiment consists of throwing a pair of dice, say a red die and a green die, simultane- ously. They are standard 6-sided dice with one to six dots on di erent faces. Describe the sample space.

2. For the same experiment, let Ebe the event that the sum of the numbers of spots on the two dice is an odd number. Write Eas a subset of the sample space, i.e., list the outcomes in E.

3. List the outcomes in the event F= "the sum of the spots is a multiple of 3".

4. Find F̄, E ∪ F, E − F = E ∩ F̄, and Ē ∩ F̄.

5. Assume that the outcomes of this experiment are equally likely. Find the probability of each of the events in # 4.

6. Show that for any events E1 and E2, if E1 ⊆ E2 then Ē2 ⊆ Ē1.

7. Load the "mammals" data set into your R workspace. In Rstudio you can click on the "Pack- ages" tab and then on the checkbox next to MASS. Without Rstudio, type > data(mammals,package="MASS") Attach the mammals data frame to your R search path with > attach(mammals) Go to TOC CHAPTER 3. PROBABILITY 31 A random experiment is to choose one of the species listed in this data set. All outcomes are equally likely. You can obtain a list of the species in the event "body >200" with the command > subset(mammals,body >200) What is the probability of this event, i.e., what is the probability that you randomly select a species with a body weight greater than 200 kg?

8. What are the species in the event that the ratio of brain weight to body weight is greater than 0.02?

Remember that brain weight is recorded in grams and body weight in kilograms, so body weight must be multiplied by 1000 to make the two weights comparable. What is the probability of that event?

3.3 Rules for Probability Measures

The assumption of equally likely outcomes is the starting point for the construction of many probability models. There are many random experiments for which this assumption is wrong. No matter what other considerations are involved in choosing a probability measure for a model of a random experiment, there are certain rules that it must satisfy. They are:

1. 0 ≤ Pr(E) ≤ 1 for each event E.

2. Pr(Ω) = 1.

3. If E1, E2, ... is a finite or infinite sequence of events such that EiEj = ∅ for i ≠ j, then Pr(∪i Ei) = Σi Pr(Ei). If EiEj = ∅ for all i ≠ j we say that the events E1, E2, ... are pairwise disjoint.

These are the basic rules. There are other properties that may be derived from them as theorems.

4. Pr(E − F) = Pr(E) − Pr(EF) for all events E and F. In particular, Pr(Ē) = 1 − Pr(E).

5. Pr(∅) = 0.

6. Pr(E ∪ F) = Pr(E) + Pr(F) − Pr(EF) for all events E and F.

7. If E ⊆ F, then Pr(E) ≤ Pr(F).

8. If E1 ⊆ E2 ⊆ ... is an infinite sequence, then Pr(∪i Ei) = lim i→∞ Pr(Ei).

9. If E1 ⊇ E2 ⊇ ... is an infinite sequence, then Pr(∩i Ei) = lim i→∞ Pr(Ei).

3.4 Counting Outcomes. Sampling with and without Replacement

Suppose a random experiment with sample space Ω is replicated n times. The result is a sequence (ω1, ω2, ..., ωn), where ωi ∈ Ω is the outcome of the ith replication. This sequence is the outcome of a so-called compound experiment - the sequential replications of the basic experiment. The sample space of this compound experiment is the n-fold cartesian product Ω^n = Ω × Ω × ... × Ω. Now suppose that the basic experiment is to choose one member of a finite population with N elements.

We may identify the sample space with the population. Consider an outcome (ω1, ω2, ..., ωn) of the replicated experiment. There are N possibilities for ω1, and for each of those there are N possibilities for ω2, and for each pair ω1, ω2 there are N possibilities for ω3, and so on. In all, there are N × N × ... × N = N^n possibilities for the entire sequence (ω1, ω2, ..., ωn). If all outcomes of the compound experiment are equally likely, then each has probability 1/N^n. Moreover, it can be shown that the compound experiment has equally likely outcomes if and only if the basic experiment has equally likely outcomes, each with probability 1/N.

Definition: An ordered random sample of size n with replacement from a population of size N is a randomly chosen sequence of length n of elements of the population, where repetitions are possible and each outcome (ω1, ω2, ..., ωn) has probability 1/N^n.
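For small N and n the compound sample space can be listed explicitly. The lines below are an illustrative sketch (the labels "a", "b", "c" are ours, not from the text); they enumerate all ordered samples of size n = 2 drawn with replacement from a population of N = 3 items, each of which has probability 1/3^2 = 1/9 under equally likely outcomes.

population <- c("a", "b", "c")                               # a small population, N = 3
samples <- expand.grid(draw1 = population, draw2 = population)
samples                                                      # the 9 equally likely ordered samples
nrow(samples)                                                # 3^2 = 9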

Now suppose that we sample one element ω1 from the population, with all N outcomes equally likely. Next, we sample one element ω2 from the population excluding the one already chosen. That is, we randomly select one element from Ω − {ω1}, with all the remaining N − 1 elements being equally likely. Next, we randomly select one element ω3 from the N − 2 elements of Ω − {ω1, ω2}, and so on until at last we select ωn from the remaining N − (n − 1) elements of the population. The result is a nonrepeating sequence (ω1, ω2, ..., ωn) of length n from the population. A nonrepeating sequence of length n is also called a permutation of length n from the N objects of the population. The total number of such permutations is

  N(N − 1)···(N − n + 1) = N!/(N − n)!.

Obviously, we must have n ≤ N for this to make sense. The number of permutations of length N from a set of N objects is N!. It can be shown that, with the sampling scheme described above, all permutations of length n are equally likely to result. Each has probability (N − n)!/N! of occurring.

Definition: An ordered random sample of size n without replacement from a population of size N is a randomly chosen nonrepeating sequence of length n from the population, where each outcome (ω1, ω2, ..., ωn) has probability (N − n)!/N!.

Most of the time when sampling without replacement from a finite population, we do not care about the order of appearance of the elements of the sample. Two nonrepeating sequences with the same elements in different order will be regarded as equivalent. In other words, we are concerned only with the resulting subset of the population. Let us count the number of subsets of size n from a set of N objects. Temporarily, let C denote that number. Each subset of size n can be ordered in n! different ways to give a nonrepeating sequence. Thus, the number of nonrepeating sequences of length n is C times n!. So,

  N!/(N − n)! = C × n!,  i.e.,  C = N!/(n!(N − n)!) = (N choose n).

This is the same binomial coefficient (N choose n) that appears in the binomial theorem: (a + b)^N = Σ from n = 0 to N of (N choose n) a^n b^(N−n).

Definition: A simple random sample of size n from a population of size N is a randomly chosen subset of size n from the population, where each subset has the same probability of being chosen, namely 1/(N choose n).

A simple random sample may be obtained by choosing objects from the population sequentially, in the manner described above, and then ignoring the order of their selection.
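R's factorial and choose functions make these counts easy to evaluate. A short illustrative sketch (the values N = 10 and n = 4 are arbitrary examples, not from the text):

N <- 10; n <- 4
factorial(N) / factorial(N - n)   # permutations of length n from N objects: 5040
choose(N, n)                      # subsets of size n, N!/(n!(N - n)!): 210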

Example : The Birthday Problem There are N= 365 days in a year. (Ignore leap years.) Suppose n= 23 people are chosen ran- domly and their birthdays recorded. What is the probability that at least two of them have the same birthday?

Solution: Arbitrarily numbering the people involved from 1 to n, their birthdays form an ordered sample, with replacement, from the set of 365 birthdays. Therefore, each sequence has probability 1/N^n of occurring. No two people have the same birthday if and only if the sequence is actually nonrepeating.

The number of nonrepeating sequences of birthdays is N(N − 1)···(N − n + 1). Therefore, the event "No two people have the same birthday" has probability

  N(N − 1)···(N − n + 1)/N^n = (N/N)((N − 1)/N)···((N − n + 1)/N) = (1 − 1/N)(1 − 2/N)···(1 − (n − 1)/N).

With n = 23 and N = 365 we can find this in R as follows:

> prod(1-(1:22)/365) [1] 0.4927028 So, there is about a 49% probability that no two people in a random selection of 23 have the same birthday. In other words, the probability that at least two share a birthday is about 51%.

An important, intuitively obvious principle in statistics is that if the sample size nis very small in comparison to the population size N, a sample taken without replacement may be regarded as one taken with replacement, if it is mathematically convenient to do so. A sample of size 100 taken with replacement from a population of 100,000 has very little chance of repeating itself. The probability of a repetition is about 5%.
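The 5% figure quoted above can be checked with the same product formula. The line below is an illustrative check, not part of the original text.

1 - prod(1 - (1:99) / 100000)   # probability of at least one repetition; about 0.048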

3.4.1 Exercises 1. A red 6-sided die and a green 6-sided die are thrown simultaneously. The outcomes of this exper- iment are equally likely. What is the probability that at least one of the dice lands with a 6 on its upper face?

2. A hand of 5-card draw poker is a simple random sample from the standard deck of 52 cards. What is the probability that a 5-card draw hand contains the ace of hearts? Go to TOC CHAPTER 3. PROBABILITY 34 3. How many 5 draw poker hands are there? In 5-card stud poker, the cards are dealt sequentially and the order of appearance is important. How many 5 stud poker hands are there?

4. Everybody in Ourtown is a fool or a knave or possibly both. 70% of the citizens are fools and 85% are knaves. One citizen is randomly selected to be mayor. What is the probability that the mayor is both a fool and a knave?

5. A Martian year has 669 days. An R program for calculating the probability of no repetitions in a sample with replacement of n birthdays from a year of N days is given below.

> birthdays=function(n,N) prod(1-1:(n-1)/N) To invoke this function with, for example, n=12 and N=400 simply type > birthdays(12,400) Check that the program gives the right answer for N=365 and n=23. Then use it to nd the number n of Martians that must be sampled in order for the probability of a repetition to be at least 0.5.

6. A standard deck of 52 cards has four queens. Two cards are randomly drawn in succession, without replacement, from a standard deck. What is the probability that the first card is a queen? What is the probability that the second card is a queen? If three cards are drawn, what is the probability that the third is a queen? Make a general conjecture. Prove it if you can. (Hint: Does the probability change if "queen" is replaced by "king" or "seven"?)

3.5 Conditional Probability

Definition: Let A and B be events with Pr(B) > 0. The conditional probability of A, given B, is:

  Pr(A|B) = Pr(AB)/Pr(B).    (3.1)

Pr(A) itself is called the unconditional probability of A.

Example 3.2. R includes a tabulation by various factors of the 2201 passengers and crew on the Titanic. Read about it by typing > help(Titanic) We are going to look at these factors two at a time, starting with the steerage class of the passengers and whether they survived or not.

> apply(Titanic,c(1,4),sum)
      Survived
Class   No Yes
  1st  122 203
  2nd  167 118
  3rd  528 178
  Crew 673 212

Suppose that a passenger or crew member is selected randomly. The unconditional probability that that person survived is 711/2201 = 0.323.

> apply(Titanic,4,sum)
  No  Yes
1490  711
> apply(Titanic,1,sum)
 1st  2nd  3rd Crew
 325  285  706  885

Let us calculate the conditional probability of survival, given that the person selected was in a first class cabin. If A = "survived" and B = "first class", then

  Pr(AB) = 203/2201 = 0.0922 and Pr(B) = 325/2201 = 0.1477.

Thus, Pr(A|B) = 0.0922/0.1477 = 0.625.

First class passengers had about a 62% chance of survival. For random sampling from a finite population such as this, we can use the counts of occurrences of the events rather than their probabilities, because the denominators in Pr(AB) and Pr(B) cancel:

  Pr(A|B) = #(AB)/#(B) = 203/325 = 0.625.

For comparison, look at the conditional probabilities of survival for the other classes.
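They can be computed all at once from the class-by-survival table. The lines below are an illustrative sketch, assuming the Titanic table is available as above.

tab <- apply(Titanic, c(1, 4), sum)   # class-by-survival counts
tab[, "Yes"] / rowSums(tab)           # conditional survival probability for each class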

  Pr(survived | second class) = 118/285 = 0.414
  Pr(survived | third class) = 178/706 = 0.252
  Pr(survived | crew) = 212/885 = 0.240

3.5.1 Relating Conditional and Unconditional Probabilities

The defining equation (3.1) for conditional probability can be written as

  Pr(AB) = Pr(A|B)Pr(B),    (3.2)

which is often more useful, especially when Pr(A|B) is easily determined from the description of the experiment. There is an even more useful result sometimes called the law of total probability. Let B1, B2, ..., Bk be pairwise disjoint events such that each Pr(Bi) > 0 and Ω = B1 ∪ B2 ∪ ... ∪ Bk.

Let A be another event. Then,

  Pr(A) = Σ from i = 1 to k of Pr(A|Bi)Pr(Bi).    (3.3)

This is quite easy to show since A = (AB1) ∪ ... ∪ (ABk) is a union of pairwise disjoint events and Pr(ABi) = Pr(A|Bi)Pr(Bi).

Example 3.3. Diagnostic Tests:

Let D denote the presence of a disease in a randomly selected member of a given population. Suppose that there is a diagnostic test for the disease and let T denote the event that a random subject tests positive, that is, that the test indicates the disease. The conditional probability Pr(T|D) is called the sensitivity of the test. The conditional probability Pr(T̄|D̄) is called the specificity of the test. The unconditional probability Pr(D) is called the prevalence of the disease in the population. A good test will have both a high sensitivity and a high specificity, although there is usually a trade-off between the two. The unconditional probability that a randomly chosen subject tests positive for the disease is

  Pr(T) = Pr(T|D)Pr(D) + Pr(T|D̄)Pr(D̄).

Suppose that the disease is rare, Pr(D) = 0.02, and that the sensitivity of the test is Pr(T|D) = 0.95 with specificity Pr(T̄|D̄) = 0.85. The false positive rate for the test is Pr(T|D̄) = 1 − Pr(T̄|D̄) = 0.15. The unconditional probability of a positive test result is

  Pr(T) = 0.95 × 0.02 + 0.15 × 0.98 = 0.166.

16.6% of the population will test positive for the disease, even though only 2% have it.
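The same arithmetic in R, as a small illustrative sketch (the variable names are ours, not the text's):

prev <- 0.02; sens <- 0.95; spec <- 0.85
pT <- sens * prev + (1 - spec) * (1 - prev)   # law of total probability
pT                                            # 0.166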

3.5.2 Bayes' Rule

Bayes' rule is named for Thomas Bayes, an eighteenth century clergyman and part-time mathematician. As given below, it is merely a relationship between conditional probabilities, but it is associated with Bayesian inference, a distinct philosophy and methodology of statistical practice. Bayes' rule is often described as a rule for calculating conditional "posterior" probabilities from unconditional "prior" probabilities.

Bayes' Rule: Let A and B1, B2, ..., Bk be given as in the law of total probability (3.3) and assume Pr(A) > 0. Then for each i,

  Pr(Bi|A) = Pr(A|Bi)Pr(Bi)/Pr(A),    (3.4)

where Pr(A) is calculated as in (3.3).

Example 3.4. Urn 1 contains 3 red balls and 5 white balls. Urn 2 contains 6 red balls and 3 white balls. A fair coin is tossed (meaning that heads and tails are equally likely). If a head turns up, a ball is randomly selected from Urn 1. If a tail comes up, a ball is randomly selected from Urn 2. Given that a white ball was selected, what is the probability that it came from Urn 1?

Solution: From the law of total probability,

  Pr(White) = Pr(White|Urn 1)Pr(Urn 1) + Pr(White|Urn 2)Pr(Urn 2) = (5/8)(1/2) + (3/9)(1/2) = 23/48.

From Bayes' rule,

  Pr(Urn 1|White) = Pr(White|Urn 1)Pr(Urn 1)/Pr(White) = (5/8)(1/2)/(23/48) = 15/23.
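As a quick check, the same numbers as R arithmetic (an illustrative sketch):

pWhite <- (5/8) * (1/2) + (3/9) * (1/2)   # law of total probability: 23/48
((5/8) * (1/2)) / pWhite                  # Pr(Urn 1 | White) = 15/23, about 0.652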

Example 3.5. Diagnostic Tests (Continued):

A patient receiving a test result indicating a disease should be more interested in the conditional probability of having the disease (the probability posterior to receiving the diagnosis) than in the unconditional probability (the probability prior to receiving the diagnosis). That is, he or she wants to know P r(D jT ). This is easily obtained from Bayes' rule.

  Pr(D|T) = Pr(T|D)Pr(D)/Pr(T).

Let us assume the same prevalence, sensitivity and specificity as in the previous example. Then

  Pr(D|T) = (0.95 × 0.02)/0.166 = 0.1145.

Thus, if a disease is rare a positive test result may not strongly indicate the presence of the disease.
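Continuing the short R sketch from Example 3.3 (again illustrative, with our own variable names):

sens * prev / pT   # posterior probability of disease given a positive test, about 0.114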

3.6 Independent Events

Two events A and B are independent if Pr(AB) = Pr(A)Pr(B). If Pr(B) > 0, this is equivalent to Pr(A|B) = Pr(A). In other words, the probability of A is not affected by the occurrence or non-occurrence of B. This conforms to our intuitive understanding of independence. More generally, events in a collection C are independent if Pr(A1A2···An) = Pr(A1)Pr(A2)···Pr(An) for each finite subcollection {A1, A2, ..., An} of events in C. Events that are not independent are dependent.

Example 3.6. Draw 2 cards in succession without replacement from a standard deck. Let A be the event that the first card is a face card and let B be the event that the second card is a seven. The conditional probability of B, given A, is 4/51. The unconditional probability of B is 1/13. Therefore, A and B are dependent events. Let C be the event that the second card drawn is a heart. The unconditional probability of C is 1/4. It is an exercise to show that the conditional probability of C, given A, is also 1/4. Therefore, A and C are independent.

3.6.1 Exercises

1. A department store tabulated the relative frequencies of the amounts of purchases and the method of payment. The results are shown below.

             Cash  Credit  Debit
  < $20       .09     .03    .04
  $20-$100    .05     .21    .18
  > $100      .03     .23    .14

(a) What proportion of purchases are paid for in cash?

(b) Given that a purchase is for more than $100, what is the probability that it is paid for by credit?

(c) Are payment by credit and amount >$100 independent events?

2. Refer to Examples 3.3 and 3.5 above. What is Pr(D̄|T̄)?

3. Generalize equation (3.2) to show that for any events A1, A2, ..., An,

  Pr(A1A2···An) = Pr(An | A1A2···An−1) Pr(An−1 | A1···An−2) ··· Pr(A2 | A1) Pr(A1),

provided Pr(A1A2···An−1) > 0. Hint: Use an inductive argument.

4. The Montana text file is adapted from the Montana outlook poll conducted by the University of Montana in 1992. Use Rstudio to load it into your R workspace, or use plain R with the "read.table" function as shown below.

> Montana=read.table("Montana.txt",header=T)
> attach(Montana)
> table(AREA,INC)
     INC
AREA  <20K >35K 20-35K
  NE    13   21     22
  SE    17   21     31
  W     17   18     30
> table(AREA,POL)
     POL
AREA  Dem Ind Rep
  NE   15  12  30
  SE   30  16  31
  W    39  12  17

Are the events INC == >35K and AREA == W independent or dependent? What about the events AREA == W and POL == Rep?

5. Two cards are drawn in succession without replacement from a standard deck. Show that the events A = "face card on first draw" and B = "heart on second draw" are independent. Hint: Write A = A1 ∪ A2, where A1 = "face card and a heart on first draw" and A2 = "face card and not a heart on first draw".

3.7 Replications of a Random Experiment

In Chapter 1 we mentioned that replications of a random experiment are independent, without making that statement precise. We can now elaborate on that idea. Let Ω be the sample space of a basic random experiment. Replicating the experiment n times results in a compound random experiment whose sample space is the n-fold Cartesian product Ω^n = Ω × Ω × ... × Ω. Let A1, A2, ..., An be any subsets of Ω, that is, any events belonging to the basic experiment. Thus Pr(Ai) is a well defined probability for each i. The cartesian product A1 × A2 × ... × An is an event in the compound experiment, a subset of Ω^n. For a replicated experiment it must be true that

  Pr(A1 × A2 × ... × An) = Pr(A1)Pr(A2)···Pr(An)

for all choices of A1, A2, ..., An.

The notation in the last equation is slightly off. The symbol "Pr" on the left stands for the probability measure on Ω^n, whereas on the right it stands for the probability measure on Ω.

Chapter 4  Discrete Distributions

4.1 Random Variables

A random variable is a function whose domain is the sample space of a random experiment. The set of values (the range) of this function might be a finite set of letters, words, or other symbols.

Such a random variable is called a nominal variable, a categorical variable, or a factor. Other random variables, called numeric variables, have real number values whose order and arithmetic relationships are important. This chapter is mostly about numeric random variables.

Examples :

1. Select one person randomly from a population of M women and N men. Let X = 1 if the person selected is a woman and let X = 0 if a man. In other words, X is the number of women that occur in a single random selection. A random variable that has only the two values 0 and 1 is called a Bernoulli random variable.

2. Replicate the experiment in (1) n times, i.e., choose an ordered random sample of size n with replacement from the population. Let W be the number of women in the sample. W has possible values {0, 1, ..., n}. We may express W as W = X1 + X2 + ... + Xn, where Xi is 1 if a woman was selected on the ith replication and 0 if a man was selected.

3. Choose a random sample of size n withoutreplacement from a population of Mwomen and N men. Let Wbe the number of women in the sample.

4. Choose a random sample of size n from the population of prospective voters in a national election.

Let X 1be the number in the sample who self-identify as Democrats, X 2the number of Republicans, X 3 the number of Libertarians, X 4 the number of Greens, X 5 the number of Other Party respon- dents, and X 6the number a liated with no party. Individually, each of these variables is of the type described in examples (2) or (3), depending on whether sampling from the population is done with or without replacement. Since they are simultaneously observed on each outcome of the sampling experiment, they are said to be jointly observed or jointly distributed .

40 Go to TOC CHAPTER 4. DISCRETE DISTRIBUTIONS 41 5. Roll a fair 6-sided die twice. Let X 1be the sum of the two rolls and let X 2be the larger of the two rolls. Then X 1and X 2are jointly distributed random variables.

4.2 Discrete Random Variables

A random variable whose set of possible values (i.e., its range) is a finite or countably infinite set is called a discrete random variable. All of the random variables in the examples above are discrete. Let X denote such a variable. Its values can be arranged in a finite or infinite sequence x1, x2, ..., xn, ....

The probabilities with which X assumes these values are of fundamental importance. The set of experimental outcomes {ω ∈ Ω | X(ω) = xi} is an event and will be denoted by (X = xi) for short.

The probability Pr(X = xi) is the frequency of xi, or probability mass at xi, and the function f defined on the set {x1, x2, ...} by f(xi) = Pr(X = xi) is called the frequency function or probability mass function of X. For numeric variables it is convenient to allow f to be defined for all real numbers x by defining f(x) = Pr(X = x), with the understanding that Pr(X = x) is 0 if x is not one of the xi. As a consequence of the rules of probability we have f(x) ≥ 0 for each real x and Σx f(x) = 1, where the sum is taken over all real numbers x. In reality, the sum reduces to Σi f(xi) = 1.

Other probabilities can be expressed in terms of the frequency function. For example, if I is any kind of interval of real numbers, the set of outcomes (X ∈ I) = {ω ∈ Ω | X(ω) ∈ I} is an event and its probability may be calculated as

  Pr(X ∈ I) = Σ over x in I of f(x).

Example 4.1. Roll a 6-sided die twice. Assume that all 36 outcomes are equally likely. Let T denote the total number of spots on the two rolls. A table of values of T and their probabilities is given below.

  t      2     3     4     5     6     7     8     9     10    11    12
  f(t)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Suppose that we are interested in Pr(T ≤ 4). We can calculate it as

  Pr(T ≤ 4) = Σ over t ≤ 4 of f(t) = f(2) + f(3) + f(4) = 1/36 + 2/36 + 3/36 = 6/36.
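The distribution of T can also be built by enumeration in R. The following lines are an illustrative sketch, not part of the original text.

rolls <- expand.grid(die1 = 1:6, die2 = 1:6)   # the 36 equally likely outcomes
f <- table(rolls$die1 + rolls$die2) / 36       # frequency function of T
f
sum(f[as.numeric(names(f)) <= 4])              # Pr(T <= 4) = 6/36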

4.3 Expected Values

Definition 4.1. Let X be a discrete random variable with frequency function f(x) = Pr(X = x). The expected value or mean value of the distribution of X is

  E(X) = μ = Σx x f(x) = Σx x Pr(X = x).

In case the set of possible values of X is countably infinite, we require that the sum in the definition be absolutely convergent. If X has only a finite set of possible values this is not a concern.

Theorem 4.1. Let X be a discrete random variable with frequency function f(x) and let h(x) be a function defined on the range of X. The expected value of the random variable Y = h(X) is equal to

  E(Y) = E(h(X)) = Σx h(x) f(x).

Proof: For a given value y of Y,

  Pr(Y = y) = Σ over x with h(x) = y of Pr(X = x).

Hence,

  E(Y) = Σy y Σ over x with h(x) = y of Pr(X = x) = Σx h(x) f(x).

Example 4.2. For the random variable T of the preceding example, find E(T) and E(T^2).

Solution: Extending the table of Example 4.1,

  t          2     3     4      5      6      7      8      9      10     11     12
  f(t)       1/36  2/36  3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36
  t f(t)     2/36  6/36  12/36  20/36  30/36  42/36  40/36  36/36  30/36  22/36  12/36
  t^2 f(t)   4/36  18/36 48/36  100/36 180/36 294/36 320/36 324/36 300/36 242/36 144/36

Adding the entries in the last two rows gives E(T) = 7 and E(T^2) = 54.833.
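The same sums can be computed in R from the frequency table f built in the earlier sketch (again illustrative):

t <- as.numeric(names(f))
sum(t * f)       # E(T) = 7
sum(t^2 * f)     # E(T^2) = 54.833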

Definition 4.2. The variance of the distribution of a random variable X is

  var(X) = σ^2 = E((X − μ)^2),

where μ is the mean of X. The standard deviation of X is the square root of its variance:

  sd(X) = σ = √var(X).

The mean, variance and standard deviation of a distribution are analogous to, but not the same as, the sample mean, variance and standard deviation that we discussed previously for numeric data sets.

Like their sample counterparts, the mean and standard deviation of a distribution are measures of its location and spread.

Theorem 4.2. If Y = a + bX, where a and b are constants, then E(Y) = a + bE(X) and sd(Y) = |b| sd(X).

This leads to an alternate formula for the variance that is sometimes easier for calculation. Let μ = E(X). Then

  (X − μ)^2 = X^2 − 2μX + μ^2.

Hence,

  var(X) = E((X − μ)^2) = E(X^2) − 2μ^2 + μ^2 = E(X^2) − E(X)^2.

For the random variable T of the preceding example, we calculated E(T^2) = 54.833 and E(T) = 7.

Thus, var(T) = 54.833 − 49 = 5.833 and sd(T) = √5.833 = 2.415.
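Continuing the R sketch, the variance and standard deviation of T follow from the same quantities:

sum(t^2 * f) - sum(t * f)^2          # var(T) = 5.833
sqrt(sum(t^2 * f) - sum(t * f)^2)    # sd(T) = 2.415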

The next theorem, Chebyshev's inequality, places a universal restriction on the probabilities of deviations of random variables from their means.

Theorem 4.3. If X is a random variable with mean μ and standard deviation σ and if k is a positive constant, then

  Pr(|X − μ| > kσ) ≤ 1/k^2.

4.3.1 Exercises 1. A fair coin is tossed until either a head occurs or 6 tails in a row have occurred. Let Xdenote the number of tosses. Find the frequency function, mean, and variance of X.

2. Verify Chebyshev's inequality for k= 2 and k= 3 when Xis the total number of spots on two rolls of a fair 6-sided die.

3. Prove Theorem 2.

4. The function f(n) = 1/(n(n + 1)), n = 1, 2, 3, ..., is a legitimate frequency function. Show that its mean value does not exist.

4.4 Bernoulli Random Variables

A Bernoulli random variable has only two possible values, usually designated as 1 and 0. Often these are numeric codes for verbal descriptions like "success" and "failure". For example, roll a pair of dice and call it a success if the total number of spots is 7 or 11. Otherwise, call the experiment a failure.

Instead of defining a random variable X with possible values {success, failure} we typically let the values of X be {1, 0}, where 1 means success and 0 means failure. One advantage in doing this is that X is then a numeric variable and can be interpreted as the number of successes in one performance of the experiment.

For a given Bernoulli variable X let p denote Pr(X = 1). p is the so-called success probability. In the example just given, p = Pr(T = 7 or T = 11) = 8/36, but in general p could be any number between 0 and 1. The frequency function for X is

  f(x) = p if x = 1,  f(x) = 1 − p if x = 0,  and f(x) = 0 if x ≠ 0, 1.    (4.1)

A compact way of writing this is f(x) = p^x (1 − p)^(1−x) for x = 0, 1.

Example 4.3. Randomly select one person from a population of M men and N women. Let W be the number (either 0 or 1) of women selected. W is a Bernoulli variable with success probability p = N/(M + N).

4.4.1 The Mean and Variance of a Bernoulli Variable

Let X be a Bernoulli variable with success probability p = Pr(X = 1). The expected value of X is

  E(X) = 0 × (1 − p) + 1 × p = p.

Furthermore, since X^2 = X, E(X^2) = p also. Therefore,

  var(X) = E(X^2) − E(X)^2 = p(1 − p).

4.5 Binomial Random Variables

Let X be a Bernoulli random variable with success probability p arising from a given random experiment. Replicate the experiment n times and let X1, X2, ..., Xn be the values of X from the replications. The random variables X1, ..., Xn are independent, which is a most important property, not just for Bernoulli variables but for any jointly distributed random variables whenever it holds.

Definition 4.3. Jointly distributed random variables X1, X2, ..., Xn are independent if for all intervals I1, I2, ..., In,

  Pr(X1 ∈ I1, X2 ∈ I2, ..., Xn ∈ In) = Pr(X1 ∈ I1) Pr(X2 ∈ I2) ··· Pr(Xn ∈ In).

The expression (X1 ∈ I1, X2 ∈ I2, ..., Xn ∈ In) means the same thing as the intersection (X1 ∈ I1) ∩ (X2 ∈ I2) ∩ ... ∩ (Xn ∈ In).

For independent replications of a Bernoulli experiment, let Y = X1 + X2 + ... + Xn.

Y is the total number of successes in the n replications. Clearly, the possible values of Y are 0, 1, ..., n. We will derive Pr(Y = y) for any y in this range. Let x1, x2, ..., xn be any particular sequence of y 1's and n − y 0's. Since Pr(Xi = 1) = p, Pr(Xi = 0) = 1 − p, and the Xi are independent,

  Pr(X1 = x1, X2 = x2, ..., Xn = xn) = p^y (1 − p)^(n−y).

This is only one particular sequence of values of the Xi that leads to the event Y = y. In all, there are (n choose y) sequences of y 1's and n − y 0's. Thus,

  Pr(Y = y) = (n choose y) p^y (1 − p)^(n−y).

Definition 4.4. A random variable Y has a binomial distribution based on n trials and success probability p ∈ (0, 1) if the frequency function of Y is

  fY(y) = (n choose y) p^y (1 − p)^(n−y) if y ∈ {0, 1, ..., n},  and 0 otherwise.

Note that a Bernoulli random variable is a binomial random variable with n = 1. The family of all binomial distributions is a parametric family because specification of the values of the two parameters n and p singles out a specific member of that family. To indicate that Y has a binomial distribution with given parameter values n and p, we write Y ~ Binom(n, p).

Any numeric random variable X has a cumulative distribution function defined as FX(x) = Pr(X ≤ x) for all real numbers x. For discrete random variables the relationship between the frequency function and the cumulative distribution function is

  FX(x) = Σ over xi ≤ x of fX(xi),

where x1, x2, ... are the values of X. In particular, for a binomial random variable Y ~ Binom(n, p),

  FY(y) = Σ over 0 ≤ i ≤ y of (n choose i) p^i (1 − p)^(n−i) if 0 ≤ y ≤ n,  FY(y) = 0 if y < 0,  and FY(y) = 1 if y ≥ n.

Any cumulative distribution function F has the following properties:

1. F is a nondecreasing function defined on the set of all real numbers.

2. F is right-continuous. That is, for each a, F(a) = F(a+) = lim x→a+ F(x).

3. lim x→−∞ F(x) = 0 and lim x→+∞ F(x) = 1.

4. Pr(a < X ≤ b) = FX(b) − FX(a) for all real a and b, a < b.

5. Pr(X > a) = 1 − FX(a).

6. Pr(X < b) = FX(b−) = lim x→b− FX(x).

7. Pr(a < X < b) = FX(b−) − FX(a).

8. Pr(X = b) = FX(b) − FX(b−).

Here is the graph of the cumulative distribution function of X ~ Binom(20, 0.4). Notice that it is constant between successive possible values. The values of the frequency function of X are plotted below as vertical line segments.

  [Figure: the cumulative distribution function F(x) of Binom(20, 0.4), a step function for x between 0 and 20.]

R has a suite of functions related to binomial distributions. You can read about them by calling

> help(Binomial)

For now, the most important are the functions "dbinom" for calculating the frequency function and "pbinom" for calculating the cumulative distribution function. For example, if Y ~ Binom(20, 0.3), the frequency function f(10) = Pr(Y = 10) and the cumulative distribution F(10) = Pr(Y ≤ 10) are found by

> dbinom(10,size=20,prob=0.3)
[1] 0.03081708
> pbinom(10,size=20,prob=0.3)
[1] 0.98285525

  [Figure: the frequency function f(x) of Binom(20, 0.4), plotted as vertical line segments.]
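Plots like these can be reproduced with a few lines of R. The sketch below is ours, not the text's; any of several plotting approaches would do.

x <- 0:20
plot(stepfun(x, c(0, pbinom(x, size = 20, prob = 0.4))),
     verticals = FALSE, main = "", xlab = "x", ylab = "F(x)")   # the cdf as a step function
plot(x, dbinom(x, size = 20, prob = 0.4), type = "h",
     xlab = "x", ylab = "f(x)")                                 # the frequency function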

4.5.1 The Mean and Variance of a Binomial Distribution

To derive the mean and variance of a binomial distribution we will rely on the following results, which will be discussed in more detail later.

Theorem 4.4. If jointly distributed random variables X1, X2, ..., Xn have expected values and Y = X1 + X2 + ... + Xn, then E(Y) = E(X1) + E(X2) + ... + E(Xn). If X1, X2, ..., Xn are independent, then var(Y) = var(X1) + var(X2) + ... + var(Xn).

A binomial random variable Y ~ Binom(n, p) has the same distribution as X1 + ... + Xn, where X1, X2, ..., Xn are independent Bernoulli variables Xi ~ Binom(1, p). Therefore,

  E(Y) = E(X1) + ... + E(Xn) = np,
  var(Y) = var(X1) + ... + var(Xn) = np(1 − p),

and sd(Y) = √(np(1 − p)).
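A quick numerical check of these formulas with R's dbinom (an illustrative sketch; n = 20 and p = 0.3 are arbitrary):

n <- 20; p <- 0.3
y <- 0:n
sum(y * dbinom(y, n, p))                                   # np = 6
sum(y^2 * dbinom(y, n, p)) - sum(y * dbinom(y, n, p))^2    # np(1 - p) = 4.2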

4.5.2 Exercises 1. A 6-sided die is thrown twice. All outcomes are equally likely. Let Mdenote the maximum of the two numbers on the upper surfaces. Find the frequency function and the cumulative distribution function of M. Graph the cumulative distribution function.

2. Six people are randomly selected in succession, with replacement, from a class containing 25 men and 20 women. What is the probability of obtaining the sequence 1, 0, 0, 0, 1, 1, where 1 indicates a man was chosen and 0 indicates a woman was chosen?

3. Write down all the other outcomes of this sequential sampling experiment that lead to 3 men and 3 women being chosen. What are their probabilities?

4. What is the probability that 3 men are chosen in the sampling experiment?

5. What is the probability that 2 or more women are chosen?

6. Suppose that 6 people are randomly chosen without replacement from a population consisting of 2500 men and 2000 women. Find the approximate probability that there are 4 men in the sample.

Justify your answer.

7. Use both your calculator and R to nd the following probabilities.

(a) P r(Y = 5), Y Binom (12;0 :3).

(b) P r(Y > 8),Y Binom (12;0 :3).

(c) P r(jY 10j 4),Y Binom (20;0 :5). Go to TOC CHAPTER 4. DISCRETE DISTRIBUTIONS 49 8. Show that the sum of binomial frequencies is 1, i.e., that n X x =0 n x px (1 p)n x = 1 :

Hint: Expand 1 = 1 n = [ p+ (1 p)] n by the binomial theorem from calculus.

9. Sketch the cumulative distribution function of the Bernoulli distribution Binom(n = 1 ; p=:7).

10. Use R's "pbinom" function to verify Chebyshev's inequality for k= 2 and k= 3 when X Binomial (50;0 :4) :

4.6 Hypergeometric Distributions

Suppose that a random sample of size k is selected without replacement from an urn containing m white balls and n black balls. Let X denote the number of white balls in the sample. The distribution of X is called a hypergeometric distribution with parameters m, n, and k. X is an integer-valued random variable that lies between max{0, k − n} and min{k, m}.

Let x be an integer between max{0, k − n} and min{k, m}. In order for the event (X = x) to occur, a set of x white balls must be chosen. This can occur in (m choose x) ways, and for each such outcome, there are (n choose k − x) ways of choosing k − x black balls. Therefore, the number of outcomes of the sampling experiment in the event (X = x) is (m choose x)(n choose k − x).

The (m + n choose k) outcomes of the sampling experiment are all equally likely. Thus,

  fX(x) = Pr(X = x) = (m choose x)(n choose k − x) / (m + n choose k).    (4.2)

Definition 4.5. An integer valued random variable has a hypergeometric distribution with parameters m, n, and k (all positive integers, k ≤ m + n) if its frequency function is given by (4.2) for all integers x, max{0, k − n} ≤ x ≤ min{k, m}.

We have mentioned several times that if the sample size is small compared to the population size, we can regard a sample taken without replacement as one taken with replacement (or vice-versa) if it is mathematically convenient to do so. This is reflected in the following theorem.

Theorem 4.5. Let m, n, and k be positive integers and suppose that m, n → ∞ in such a way that m/(m + n) → p ∈ (0, 1). Then for any integer x between 0 and k,

  (m choose x)(n choose k − x) / (m + n choose k) → (k choose x) p^x (1 − p)^(k−x).

This theorem justifies approximating a hypergeometric distribution with a binomial distribution in certain circumstances. In general it is easier to work with a binomial distribution.

To indicate that X has a hypergeometric distribution with parameters m, n and k, we write X ~ Hyper(m, n, k). The R functions for the hypergeometric frequency function and the cumulative distribution function are "dhyper" and "phyper", respectively. Details on their use are in the R help file

> help(Hyper)

Example 4.4. A class consists of 25 men and 20 women. Six people are randomly selected from the class without replacement. What is the probability that 3 men are chosen?

Solution: The number X of men in the sample has a hypergeometric distribution with parameter values m = 25, n = 20, and k = 6. Hence,

  Pr(X = 3) = (25 choose 3)(20 choose 3) / (45 choose 6).

This can be calculated in R with the "dhyper" function or with the "choose" function for evaluating binomial coefficients.

> dhyper(x=3,m=25,n=20,k=6)
[1] 0.3219129
> choose(25,3)*choose(20,3)/choose(45,6)
[1] 0.3219129

For finding the value of the cumulative distribution function, calculating binomial coefficients quickly becomes tiresome. The R function is "phyper". For example, Pr(X ≤ 3) is

> phyper(3,25,20,6)
[1] 0.5527105

If the sampling had been with replacement, the distribution of X would have been binomial and the answers would have been

> dbinom(x=3,size=6,prob=25/45)
[1] 0.3010682
> pbinom(3,6,25/45)
[1] 0.5472216

4.6.1 The Mean and Variance of a Hypergeometric Distribution

A random sample of size k chosen without replacement from a population of m successes and n failures can be selected sequentially. After each selection all the remaining members of the population must be equally likely on the next selection. All subsets of size k are equally likely to result from this method. On the ith selection, let Xi = 1 if the choice is a success and Xi = 0 if it is not.

X1, X2, ..., Xk are Bernoulli variables, all with the same success probability p = m/(m + n), but they are not independent. Since Y = X1 + ... + Xk is the number of successes and has the hypergeometric distribution Y ~ Hyper(m, n, k), the first part of Theorem 4.4 applies to give

  E(Y) = kp = km/(m + n).

However, the second part of Theorem 4.4 does not apply because the Xi are not independent. The variance of the hypergeometric distribution differs from the variance of the binomial distribution by a factor sometimes called the correction for sampling from a finite population:

  var(Y) = kp(1 − p)(1 − (k − 1)/(m + n − 1)).

Notice that the correction factor 1 − (k − 1)/(m + n − 1) is almost 1 if k << m + n. This is another clue that sampling with and without replacement are almost the same under these circumstances.
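A numerical comparison of the two variances, using the class from Example 4.4 (an illustrative sketch):

m <- 25; n <- 20; k <- 6
p <- m / (m + n)
k * p * (1 - p)                                   # binomial variance
k * p * (1 - p) * (1 - (k - 1) / (m + n - 1))     # hypergeometric variance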

4.7 Poisson Distributions

Poisson distributions are important in modeling random phenomena such as subatomic decay events, meteor impacts and genetic mutations that occur sporadically in time or space. We shall first consider occurrences in time. For a given time interval I, let X(I) denote the number of occurrences of the phenomenon in question during that interval.

Definition 4.6. A Poisson process is a collection of non-negative integer valued random variables X(I) associated with time intervals I = (t, t + Δt) which satisfies the following conditions.

1. If no two of the time intervals I1, I2, ..., Im overlap, the random variables X(I1), X(I2), ..., X(Im) are independent.

2. If two time intervals I1 and I2 have the same length, the random variables X(I1) and X(I2) have the same distribution.

3. There is a constant λ > 0 such that for a time interval I of length Δt, Pr(X(I) > 0) = λΔt + ε, where ε/Δt → 0 as Δt → 0. For small time intervals, the probability of an occurrence during that interval is approximately proportional to its length, with negligible error. The proportionality constant λ is called the rate of the process.

4. Pr(X(I) > 1)/Δt → 0 as Δt → 0. The probability of more than one occurrence during a small time interval is negligible.

A spatial Poisson process satisfies the same conditions except that the random variables X(I) are associated with two or three-dimensional regions I of space and Δt is the area or volume of the region I rather than time duration.

Theorem 4.6. For a Poisson process with rate parameter λ > 0, the random variable X(I) has nonnegative integer values and has frequency function

  Pr(X(I) = x) = f(x) = e^(−λΔt) (λΔt)^x / x!

for x = 0, 1, 2, ....

Definition 4.7. A random variable X with nonnegative integer values has a Poisson distribution if its frequency function is

  f(x) = Pr(X = x) = e^(−μ) μ^x / x!    (4.3)

for x = 0, 1, 2, ..., where μ > 0 is a constant.

Thus, for a Poisson process the random variable X(I) has a Poisson distribution with parameter μ = λΔt. As with any frequency function, the sum of Poisson frequencies must equal 1. This is easy to show from the Maclaurin series for the exponential function:

  Σ from x = 0 to ∞ of e^(−μ) μ^x / x! = e^(−μ) Σ from x = 0 to ∞ of μ^x / x! = e^(−μ) e^(μ) = 1.

Solution : To model a process occurring in time as a Poisson process, it is necessary to specify the unit of time. In this problem it is convenient to take the time unit to be 10,000 years and the rate param- eter to be 1. One could choose one year as the time unit and adjust the rate to be = 0 :0001. The resulting answer would be correct as long as we keep the units in mind. The point is, the parameter must be expressed in units of 1/time. We will take = 1 for simplicity. Thus, an interval Iof 5000 years has length t= 0 :5 and the number of mutations X=X(I ) has a Poisson distribution with parameter = t= 0 :5.

P r(X > 1) = 1 P r (X 1) P r (X 1) = P r(X = 0) + P r(X = 1) = e 0:5 + e 0:5 0 :5 = 0 :6065 + 0 :3033 = 0 :9098 :

P r (X > 1) = 1 0:9098 = 0 :0902 :

If X has a Poisson distribution with parameter , we write X P ois ( ). The R functions for evaluating the frequency function and the cumulative distribution function are "dpois" and "ppois".

The parameter must be speci ed as an argument. Unfortunately, it is called "lambda" in R. Don't confuse it with the rate parameter in the discussion above. Go to TOC CHAPTER 4. DISCRETE DISTRIBUTIONS 53 > ppois(1,lambda=0.5) [1] 0.909796 > dpois(0,0.5); dpois(1,0.5) [1] 0.6065307 [1] 0.3032653 There is a close relationship between binomial and Poisson distributions. Let fp n g be a sequence of positive numbers such that p n !

0 and np n!

> 0. Rearrange the expression n x px n (1 p n )n x as n(n 1) (n x+ 1) n x 1 (1 p n )x ( np n)x x ! (1 np n n ) x Then (np n)x x ! (1 np n n ) n ! e x x !

while the rest of it goes to 1. Thus the binomial distribution Binom(n; p n) approaches the Poisson distribution P ois( ). In certain circumstances, the Poisson distribution P ois( = np) is a good approximation to the binomial distribution Binom(n; p ). It has been proved 1 that the error of approximation is at most np2 .

Example 4.6. The incidence of Hantavirus infection in New Mexico during an outbreak of the disease was 4.4 cases per million residents. What is the probability that in a sample of 10000 residents, there will be more than one case of Hantavirus?

Solution : Let Xdenote the number of cases. Considering this as a binomial experiment, X Binom (n = 10000 ; p= 4 :4 10 6 ). Using the Poisson approximation, X P ois ( = 4 :4 10 2 ).

According to R, the true probability is > 1-pbinom(1,size=10000,prob=4.4e-06) [1] 0.0009399798 and the Poisson approximation is > 1-ppois(1,lambda=4.4e-02) [1] 0.0009400684 The actual error of approximation is 8 :86 10 8 . The upper bound np2 for the error is 1 :94 10 7 . 1 Hodges, J.L. and LeCam, L.(1960) "The Poisson Approximation to the Binomial Distribution", Annals of Mathematical Statistics 31, 737-740 Go to TOC CHAPTER 4. DISCRETE DISTRIBUTIONS 54 4.7.1 The Mean and Variance of a Poisson Distribution Let X P ois ( ) with frequency function f(x ) = e x =x !

for x= 0 ;1 ;2 ; . We calculate the mean of Xfrom the de nition.

E (X ) = 1 X x =0 xe x x !

= e 1 X x =1 x x 1 x !

= e 1 X x =1 x 1 ( x 1)!

= e 1 X y =0 y y !

= e e = We leave it as an exercise to show by a similar argument that E(X (X 1)) = 2 . Thus, E(X 2 ) = 2 + and var(X ) = :

The mean and variance of a Poisson distribution are both equal to the parameter .

4.7.2 Exercises 1. A club has 50 members, 10 belonging to the ruling clique and 40 second-class members. Six mem- bers are randomly selected for free movie tickets. What is the probability that 3 or more belong to the ruling clique?

2. Answer the same question if the club has 50,000 members, 10,000 in the ruling clique and 40,000 second-class members.

3. Biologists tagged 50 animals of a species and then released them back into the wild. After a certain "mixing" time they captured a random sample of 50 animals and discovered that 6 of them had been tagged. Let Ndenote the size of the population and let Xdenote the number of tagged animals in a random sample of size 50 from the population. For N= 200 ;500 ;1000 calculate the probability that X 6.

4. Huck and Jim are waiting for a raft. The number of rafts oating by over intervals of time is a Poisson process with a rate of = 0 :4 rafts per day. They agree in advance to let the rst raft go and take the second one that comes along. What is the probability that they will have to wait more than a week? Hint: If they have to wait more than a week, what does that say about the number of rafts in a period of 7 days? Go to TOC CHAPTER 4. DISCRETE DISTRIBUTIONS 55 5. In each of the following cases, use R's "pbinom" or "dbinom" function to nd the true probability of the event. Then give the Poisson approximation and the value of the upper bound for the error.

How does the actual error compare to the upper bound?

(a) P r(X 6), X Binom (n = 36 ;000 ; p= 1 :67 10 4 ).

(b) P r(2 X 3), X Binom (105 ;4 :4 10 6 ).

(c) P r(X = 6), X Binom (n = 10 4 ; p = 5 10 4 ).

6. Fire ant colonies occur according to a spatial Poisson process with a rate of 1.5 colonies per acre.

What is the probability that a 10 acre plot of land will have 10 or fewer re ant colonies?

7. Complete the proof that for X P ois ( ); var (X ) = .

4.8 Jointly Distributed Variables Jointly distributed random variables require more than just their individual distributions to completely characterize them. For now, we concentrate on jointly distributed discrete variables and begin with the case of just two variables.

De nition 4.8. LetX 1and X 2be two jointly distributed discrete random variables. The joint frequency function of X 1and X 2is the function of two variables de ned as f(x 1; x 2) = P r(X 1= x 1 and X 2= x 2) :

The marginal frequency functions of X 1 and X 2 are simply their individual frequency functions as previously de ned.

f1( x 1) = P r(X 1= x 1) ; f 2( x 2) = P r(X 2= x 2) :

Suppose that x 1 is given and that f 1( x 1) = P r(X 1= x 1) > 0. The conditional frequency function of X 2, given that X 1= x 1, is the function of x 2 de ned by f 2j1 ( x 2j x 1) = P r(X 2= x 2j X 1= x 1) :

The conditional frequency function of X 1, given that X 2= x 2 is f 1j2 ( x 1j x 2) = P r(X 1= x 1j X 2= x 2) :

Theorem 4.7. LetX 1and X 2have joint frequency function f(x 1; x 2). Then (1) f 2( x 2) = P x1 f (x 1; x 2) for all x 2.

(2) f 1( x 1) = P x2 f (x 1; x 2) for all x 1. Go to TOC CHAPTER 4. DISCRETE DISTRIBUTIONS 56 (3) f 2j1 ( x 2j x 1) = f(x 1; x 2) =f 1( x 1) if f 1( x 1) > 0.

(4) X 1and X 2are independent if and only if f(x 1; x 2) = f 1( x 1) f 2( x 2) for all x 1; x 2.

(5) P x1 P x2 f (x 1; x 2) = 1.

For m 2 jointly distributed discrete random variables X 1; X 2; ; X m, the joint frequency function is the function of marguments de ned as f (x 1; x 2; ; x m) = P r(X 1= x 1; X 2= x 2; ;X m = x m ) :

Statements analogous to (1) through (5) of the preceding theorem also hold for more than two variables.

In particular, X 1; X 2; ; X mare independent if and only if f (x 1; x 2; ; x m) = f 1( x 1) f 2( x 2) ; ; f m( x m ) for all x 1; x 2; ; x m, where f i( x i) is the marginal frequency function of X i.

Example 4.7. Roll a standard pair of dice. All outcomes are equally likely. Let X 1be the maximum of the numbers of spots on the two dice and let X 2be the minimum of the two numbers. The joint frequency function of X 1and X 2can be displayed in tabular form as follows. X 1: max 1 2 3 4 5 6 X 2: min 1 1/36 2/36 2/36 2/36 2/36 2/36 11/36 2 0 1/36 2/36 2/36 2/36 2/36 9/36 3 0 0 1/36 2/36 2/36 2/36 7/36 4 0 0 0 1/36 2/36 2/36 5/36 5 0 0 0 0 1/36 2/36 3/36 6 0 0 0 0 0 1/36 1/36 1/36 3/36 5/36 7/36 9/36 11/36 1 The numbers in the rightmost column are the marginal frequencies of X 2. The numbers in the bot- tom row are the marginal frequencies of X 1. The conditional frequency function f 1j2 ( x 1j X 2= 3) is obtained by dividing each of the elements of the row corresponding to x 2 = 3 by their sum 7 =36. Thus, x 1 1 2 3 4 5 6 f 1j2 ( x 1j 3) 0 0 1/7 2/7 2/7 2/7 Clearly, X 1and X 2are not independent since the marginal frequency function of X 1is not equal to the conditional frequency function given that X 2 = 3. It is clear also because the joint frequency function obviously does not factor into the product of the marginal frequencies.

4.8.1 Covariance and Correlation De nition 4.9. LetXand Ybe jointly distributed random variables with respective means x and y and standard deviations x; y. The covariance of X and Yis Go to TOC CHAPTER 4. DISCRETE DISTRIBUTIONS 57 cov (X; Y ) =E((x x)( Y y)) :

The correlation of X and Yis cor(X; Y ) =cov (X; Y ) x y :

An alternate formula for the covariance is cov(X; Y ) =E(X Y ) E(X )E (Y ):

The covariance has the following properties. Most of them follow easily from the de nition.

1. cov (X; Y ) =cov(Y ; X ) 2. cov (X; X ) =var(X ) 3. If X; Y, and Zare jointly distributed and aand bare constants cov (X; aY +bZ ) = a cov (X; Y ) +b cov (X; Z ):

4. If Xand Yare independent, cov(X; Y ) = 0.

It can be shown that 1 cor (X; Y ) 1. cor (X; Y ) = 1 if and only if there are constants a; bsuch that Y=a+ bX and b >0. In other words, there is an exact linear relationship between Xand Ywith positive slope. cor(X; Y ) = 1 if and only if there is an exact linear relationship with negative slope.

In all other cases, the correlation is strictly between -1 and 1. In general, the correlation measures the strength of linear association between Xand Y.

Example 4.8. Let us nd the covariance and correlation between X 1 and X 2 in Example4.7. To begin, we have E(X 1X 2) = 1 1 1= 36 + 1 2 2=36 + + 6 5 0=36 + 6 6 1=36 = 12 :250 E (X 1) = 1 1=36 + 2 3=36 + 3 5=36 + 4 7= 36 + 5 9=36 + 6 11 =36 = 4 :472 E (X 2) = 1 11 =36 + 2 9= 36 + 3 7= 36 + 4 5=36 + 5 3=36 + 6 1= 36 = 2 :528 Thus, cov(X 1; X 2) = E(X 1X 2) E(X 1) E (X 2) = 0 :9445 :

To get the correlation, we must divide this by the product of the standard deviations. You can show that sd(X 1) = 1 :404 = sd(X 2). Hence, cor(X 1; X 2) = 0 :9445 1 :971 = 0 :479 :

Random variables X 1 and X 2 whose covariance is 0 are said to be uncorrelated . If X 1 and X 2 are independent then they are uncorrelated. The converse is not true. Being uncorrelated has an important inplication for variances. Go to TOC CHAPTER 4. DISCRETE DISTRIBUTIONS 58 Theorem 4.8. If jointly distributed random variables X 1; X 2; ; X nare pairwise uncorrelated, then var (X 1+ X 2+ +X n) = var(X 1) + var(X 2) + +var (X n) :

Proof : We will assume that n= 2. The general proposition can then be proved easily by induction.

var (X 1+ X 2) = cov(X 1+ X 2; X 1+ X 2) = cov (X 1; X 1+ X 2) + cov(X 2; X 1+ X 2) = cov (X 1; X 1) + cov(X 1; X 2) + cov(X 2; X 1) + cov(X 2; X 2) = var (X 1) + 2 cov(X 1; X 2) + var(X 2) Since by hypothesis cov(X 1; X 2) = 0, var (X 1+ X 2) = var(X 1) + var(X 2) :

4.9 Multinomial Distributions Suppose Xis a random variable which is a factor (nominal variable or categorical variable) with m possible values or levels L 1; L 2; ; L m. For example, if we randomly choose one member of the population of eligible voters, that that person will be classi ed in one and only one way as "Republican", "Democrat", "Libertarian", "Green", "Other", or "Independent". The random vari- able Xis party a liation and these six names are its levels. In general, let p 1 = P r(X =L 1), p 2 = P r (X =L 1) ; ; p m = P r (X =L m ). Each p i 2 (0;1) and p 1 + p 2 + +p m = 1. Because of this last condition, we can express one of the p i in terms of the others, e.g., p m = 1 P m 1 i =1 p i. This leaves only m 1 of the p i as free parameters which must satisfy P m 1 i =1 p i < 1.

Let the experiment giving rise to Xbe replicated ntimes independently. Let Y 1 be the number of replications for which X=L 1, Y 2 the number of replications for which X=L 2, and so on. Finally let Y m be the number of replications for which X=L m .

Y 1, Y 2, etc. are jointly distributed random variables whose joint distribution is called a multinomial distribution. The replications are called multinomial trials .

Let y 1; y 2; ; y m be nonnegative integers such that P m i =1 y i = n. Consider any particular n-term sequence of levels in which L 1 occurs y 1 times, L 2 occurs y 2 times, and so on until nally L m occurs y m times. By independence, the probability of this sequence resulting from the experiment is py 1 1 py 2 2 py m m :

However, this is only one way that the event ( Y 1 = y 1; Y 2 = y 2; ;Y m = y m ) could occur. The number of ways that it can occur is the total number of n-term sequences that have y 1 terms equal to L 1, y 2 terms equal to L 2, and so on. That number is given by the multinomial coe cient Go to TOC CHAPTER 4. DISCRETE DISTRIBUTIONS 59 n ! y 1!

y 2!

y m !:

Thus, the joint frequency function for Y 1; Y 2; ; Y m is f (y 1; y 2; ; y m) = P r(Y 1 = y 1; Y 2 = y 2; ;Y m = y m ) (4.4) = n ! y 1!

y 2!

y m !p y 1 1 py 2 2 py m m :

where y 1; y 2; ; y m are nonnegative integers that sum to nand p 1; p 2; ; p m are positive numbers that sum to 1.

The R function for the multinomial frequency function is "dmultinom". You can read about it by calling > help(Multinomial).

The required arguments for "dmultinom" are "x", which is the same as the vector ( y 1; y 2; ; y m) in the discussion above, and "prob" which is the vector ( p 1; p 2; ; p m) of probabilities of the levels. For example, suppose that n= 25, m= 4, and ( p 1; p 2; p 3; p 4) = (0 :2 ;0 :4 ;0 :2 ;0 :2). Say we want to nd P r (Y 1 = 5; Y 2 = 10; Y 3 = 5 :

Y 4 = 5). The answer is > dmultinom(x=c(5,10,5,5),prob=c(0.2,0.4,0.2,0.2)) [1] 0.00849941 Example 4.9. : Hardy-Weinberg genetic equilibrium A gene occurs in two forms, or alleles, a dominant form "A" and a recessive form "a". Each individual organism in the population carries two copies of the gene, one from each parent. The organism has genotype "AA" if both copies are of form A, "Aa" if one is of form A and the other of form a, or "aa" if both are allele a. Let denote the proportion of all the copies of the gene in the population which are of form A. Then 1 is the proportion of form a. The Hardy-Weinberg model for genetic equilibrium assumes that in matings, the alleles contributed by the parents are independently selected with probabilities equal to their frequencies in the population. Thus, the probability that an o spring will have genotype AA is 2 , the probability of aa is (1 )2 , and the probability of Aa is 2 (1 ).

This is a fair assumption if the population is large, thoroughly mixed, and none of the genotypes has a reproductive advantage over the others.

Suppose that a proportion = :65 of genes are of form A. The Hardy-Weinberg genotype probabilities are p AA = 0 :65 2 = 0 :4225, p Aa = 2 0:65 0:35 = 0 :455, and p aa = 0 :35 2 = 0 :1225. Suppose that a sample of size 100 is randomly selected from the population and each organisim in the sample is typed. Let Y AA , Y Aa , and Y aa denote the numbers of the three genotypes in the sample. The outcome ( Y AA ; Y Aa ; Y aa) = (42 ;46 ;12) is the most probable outcome. However, its probability is small. Go to TOC CHAPTER 4. DISCRETE DISTRIBUTIONS 60 > dmultinom(c(42,46,12),prob=c(.4225,.455,.1225)) [1] 0.01028722 Marginal and conditional distributions of a multinomial distribution are also multinomial. In partic- ular, each component Y i of Y= ( Y 1; Y 2; ; Y m) has a binomial distribution Y i Binom (n; p i). This is easy to see without any calculation. Simply call a trial a success if level L ioccurs, otherwise call it a failure. The conditional distributions are well illustrated by the case m= 3. Let ( Y 1; Y 2; Y 3) have a multinomial distribution based on ntrials with probabilities ( p 1; p 2; p 3). The conditional distribution of Y 1, given that Y 2 = y 2 is binomial with n y 2 trials and with success probability p 1 1 p 2 = p 1 p 1+ p 3 .

( Y 1j Y 2 = y 2) Binom (n y 2; p 1 1 p 2 ) : (4.5) When Y= ( Y 1; ; Y m) has a multinomial distribution, the components Y i of are correlated, therefore dependent. When m 3 the covariance between Y i and Y j; i 6 = jis cov (Y i; Y j) = np ip j (4.6) and the correlation is cor(Y i; Y j) = r p i 1 p i p j 1 p j :

(4.7) 4.9.1 Exercises 1. Roll a pair of standard dice. All outcomes are equally likely. Let X 1be the minimum of the num- bers on the dice and let X 2be their sum. Construct a joint frequency table like the one in Example 6. Include the marginal frequency functions by summing the rows and columns of the table.

2. Find the conditional frequency function of X 2, given that X 1= 2. Are X 1and X 2independent or dependent?

3. Let X 1 and X 2 be independent discrete random variables with frequency functions f 1 and f 2, respectively. Let Y=X 1+ X 2. The frequency function for Yis given by the convolution formula:

g (y ) = X x 1 f 2( y x 1) f 1( x 1) :

Verify the convolution formula for the case where X 1and X 2are independent rolls of a fair die.

4. The proportion of the dominant allele of a certain gene in a population is 0.75. The recessive proportion is 0.25. A sample of 20 members of the population is taken and their genotypes deter- mined. What is the probability that the sample had 12 pure dominant, 2 pure recessive, and 6 mixed genotypes?

5. From a set of nob jects, y 1 are to be chosen and labelled " L 1", y 2 are to be labelled " L 2", y 3 are to be labelled " L 3", and so on until nally, the last y m are labelled " L m ". The number of ways this can be done is Go to TOC CHAPTER 4. DISCRETE DISTRIBUTIONS 61 n y 1 n y 1 y 2 n y 1 y 2 y 3 ym y m :

Simplify this expression.

6. Let ( Y 1; Y 2; Y 3) have a multinomial distribution with n= 30 and ( p 1; p 2; p 3) = (0 :25 ;0 :40 ;0 :35).

What is the conditional distribution of Y 2, given that Y 1 = 10?

7. Prove equation4.5.

8. In problem 4 above, what are the covariance and correlation between the number of pure dominant and the number of mixed types in the sample of 20 organisms? What is the conditional distribution of the number of mixed types, given that the number of pure dominant types is 12?

9. Derive equation4.7from equation4.6. Show that the absolute value of the expression on the right hand side of4.7is less than 1. Go to TOC Chapter 5 Continuous Distributions 5.1 Density Functions All measurements are made with limited precision. Therefore, one could say that all observable nu- meric random variables are discrete. However, some numeric observations are made with very great precision and can be replicated many times. In such situations it may be hopeless to try to describe a discrete distribution for the observations. Instead, the distribution may be approximated closely by a continuous distribution which is more amenable to mathematical treatment.

Example 5.1. A number between 0 and 1 is randomly selected in the following way. A fair coin is tossed a large number of times (e.g., m= 100) to generate a string of 1's and 0's. These are the bits in a binary representation of the number. Thus, the number is Um = m X i =1 X i= 2 i ; where X 1; ; X mare independent Bernoulli random variables with success probability p= 0 :5.

There are 2 m possible values of U m , so it would be a chore to write them all down and tabulate the frequency function. In fact, the values of U m are all equally likely. (Can you see why?) Since all m-term binary expansions of numbers in (0 ;1) are equally likely, it seems that the values of U m should be evenly distributed over the interval (0 ;1). In other words, two subintervals of (0 ;1) of the same length should have the same proportion of values of U m . That is nearly true, and it is exactly true in the limit as m! 1 . If we allow in nite binary expansions and de ne U= 1 X i =1 X i= 2 i then the range of Uis the set of all real numbers between 0 and 1. For any subinterval ( u 1; u 2) of (0 ;1) P r(u 1 < U u 2) = F U ( u 2) F U ( u 1) = u 2 u 1:

62 Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 63 We say that the random variable Uis uniformly distributed over the interval (0 ;1). Note that Uis not observable but U m is. For reasonably large values of m, the discrete distribution of U m is closely approximated by the continuous distribution of U.

The cumulative distribution function of Uis F (u ) = 8 > < > : u if 0 u 1, 0 if u <0, 1 if u >1. (5.1)The cumulative distribution function5.1is the integral of another function called the density function of U.

f(u ) = ( 0 if u <0 or u >1, 1 if 0 u 1. (5.2) 1.0 0.5 0.0 0.5 1.0 1.5 2.0 0.0 0.2 0.4 0.6 0.8 1.0 u F(u) Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 64By this we mean that F(u ) = Z u 1 f (x )dx for all real numbers u. It follows that P r(u 1 < U u 2) = Z u2 u 1 f (u )du for any interval ( u 1; u 2). This distribution is called the uniform distribution over the interval (0 ;1).

In the notation used by R, we write U U nif (0;1) to indicate that the random variable Uhas this distribution.

De nition 5.1. A density function is a nonnegative function fde ned on the set of real numbers R such that Z1 1 f (x )dx = 1 : 0.5 0.0 0.5 1.0 1.5 0.0 0.2 0.4 0.6 0.8 1.0 u f(u) Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 65 Theorem 5.1. Iffis a density function, then its integral F(x ) = R x 1 f (u )du is a continuous cumula- tive distribution function. That is, Fis nondecreasing, lim x! 1 F (x ) = 0, and lim x!1 F (x ) = 1. If X is a random variable with this density function, then for any two real numbers x 1, x 2 with x 1 < x 2, P r (x 1 < X x 2) = Z x2 x 1 f (u )du:

Conversely, if Fis a continuous cumulative distribution function which is continuously di erentiable except perhaps at a nite set of points, its derivative f(x ) = F0 ( x ) is a density function and F(x ) = R x 1 f (u )du .

Except for some slight caveats it is accurate to say that the cumulative distribution is the integral of the density and that the density is the derivative of the cumulative distribution. The density function is analogous to the frequency function for a discrete random variable. The di erence is that sums are replaced by integrals. In the discrete case P r(a < X b) = X a

For a random variable Xwith a continuous distribution, P r(X =x) = 0 for each xed x. Thus, the events ( a < X b), ( a < X < b ), (a X < b ) and ( a X b) all have the same probability. End points of intervals may be included or not included according to convenience.

Example 5.2. LetXbe a Poisson process occurring in time, so that X(I ) is the number of "arrivals" in the time interval I. Let >0 be the rate of the process. Instead of focusing on the number of arrivals in a given interval of time, let us consider the times between successive arrivals. Let the random variable Tbe the time from the beginning (t=0) until the rst arrival and let t >0 be a given positive number. The event ( T > t) happens if and only if the number X(0; t) of occurrences in the interval (0 ; t) is zero. X(0; t) has a Poisson distribution with parameter = t. Hence, P r (T > t ) =P r(X (0; t) = 0) = e t :

The cumulative distribution of Tis F(t) = P r(T t) = 1 e t :

for t 0 and F(t) = 0 for t <0. The density of Tis f (t) = F0 ( t) = e t for t > 0. For t 0; f (t) = 0. The cumulative distribution and the density of Tare plotted below for = 1. This distribution is called an exponential distribution and the parameter is its rate parameter. We write T E xp (rate = ) to indicate that Thas this distribution. Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 66 > par(mfrow=c(2,1)) > curve(pexp(x),from=-1,to=4,ylab="F(x)") > abline(h=0,lty=2) > abline(v=0,lty=2) > curve(dexp(x),from=0,to=4,ylab="f(x)",xlim=c(-1,4)) > lines(c(-1,0),c(0,0)) > abline(v=0,lty=2) > abline(h=0,lty=2) > par(mfrow=c(1,1)) 1 0 1 2 3 4 0.0 0.4 0.8 x F(x) 1 0 1 2 3 4 0.0 0.4 0.8 x f(x) Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 67 5.2 Expected Values and Quantiles for Continuous Distribu- tions 5.2.1 Expected Values When a random variable Xhas a continuous distribution, we evaluate probabilities P r(X 2I) as integrals R I f (x )dx rather than sums P x2 I f (x ). Similarly, we evaluate expected values as integrals.

If X has density function f(x ), then E(X ) = = Z 1 1 xf (x )dx:

More generally, if his a function de ned on the range of X, E (h (X )) = Z 1 1 h (x )f (x )dx:

In particular, var(X ) = 2 = Z 1 1 ( x )2 f (x )dx = Z 1 1 x 2 f (x )dx 2 :

As before, we require that these expressions be absolutely convergent; otherwise the expected values do not exist.

Example 5.3. LetXhave the exponential distribution with rate parameter >0. For x 0 the density of Xis f(x ) = e x ; and f(x ) = 0 for x <0. The expected value of Xis = E(X ) = Z 1 1 xf (x )dx = Z 1 0 x e x dx = 1 Z 1 0 ue u du = 1 : Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 68 Similarly, the expected value of X2 is E (X 2 ) = Z 1 1 x 2 f (x )dx = Z 1 0 x 2 e x dx = 1 2 Z 1 0 u 2 e u du = 2 2:

It follows that the variance of Xis var(X ) = 2 2 (1 ) 2 = 1 2 and the standard deviation is sd(X ) = 1 :

The mean and standard deviation of an exponential distribution are the same.

The exponential distribution is often identi ed by its mean parameter rather than the rate parameter . When it is, the density and cumulative distribution are f(x ) = 1 e x= F (x ) = 1 e x= for x 0.

5.2.2 Quantiles De nition 5.2. LetFbe a given cumulative distribution and let p2 (0;1). The pth quantile of F , also called the 100 pth percentile of F , is de ned as F 1 (p ) = min fx jF (x ) pg:

When Fis identi ed with a particular random variable X, we also write q(X; p ) for F 1 (p ). F 1 (:25) ; F 1 (:5) , and F 1 (:75) are the rst quartile , themedian , and the third quartile ofF, respectively. The function F 1 as de ned above is called the quantile function .

For continuous distributions, F 1 (p ) is the smallest number xsuch that F(x ) = p. Often Fhas a true inverse function and nding the quantile reduces to nding the unique solution of the equation F (x ) = p.

Example 5.4. LetFbe the exponential distribution with mean and cumulative distribution F(x ) = 1 e x= forx 0. Setting F(x ) = pand solving for xgives x = F 1 (p ) = ln(1 p):

Thus, the median of Fis ln( :5) = ln 2. Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 69 5.2.3 Exercises 1. For 0 x 1 let f(x ) = kx(1 x), where kis a constant. Find the value of ksuch that fis a density function.

2. Find the mean and variance of the distribution in the preceding exercise.

3. For x 0, let f(x ) = 2 xe x2 . Show that fis a density function.

4. Find the cumulative distribution for the density in the preceding exercise.

5. Find the pth quantile of this distribution.

6. For a real number x, let F(x ) = ex =(1 + ex ). Find the density function for this cumulative distribution function. Find the quantile function F 1 (p ).

5.3 Uniform Distributions Let U U nif (0;1) and let aand b > a be constants. De ne a new random variable Xby linearly transforming Uas X=a+ ( b a)U . Since Ulies between 0 and 1, Xlies between aand b. We will calculate the cumulative distribution of Xby a method that is useful for many kinds of transforma- tions, both linear and nonlinear.

Let xbe an arbitrary number in ( a; b).

F X ( x ) = P r(X x) = P r(a + ( b a)U x) = P r (U x a b a) = F U (x a b a) = x a b a since F U ( u ) = ufor u2 (0;1). Thus the cumulative distribution of Xis F (x ) = 8 > < > : ( x a)= (b a) if a < x < b 0 if x a 1 if x b (5.3) By di erentiating we get the uniform density on ( a; b).

f (x ) = F0 ( x ) = ( 1=(b a) if a < x < b 0 otherwise (5.4) De nition 5.3. A random variable Xis uniformly distributed on the interval ( a; b) if its cumulative distribution function is given by5.3and its density function is5.4.

We indicate that Xis uniformly distributed on ( a; b) by writing X U nif (a; b ).

Example 5.5. X U nif ( 1;3). Find the following probabilities. Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 70 (a) P r(X 2) (b) P r( 2 X < 1) (c) P r(0< X < 3) (d) P r(X > 0) Solution: We will use both elementary calculations and the R function "punif" for the cumulative distribution.

(a) F(2) = (2 ( 1)) =(3 ( 1)) = 0 :75 > punif(2,min=-1,max=3) [1] 0.75 (b) F(1) F( 2) = 0 :5 0 = 0 :5 punif(1,-1,3)-punif(-2,-1,3) [1] 0.5 (c) F(3) F(0) = 1 1=4 = 0 :75 > punif(3,-1,3)-punif(0,-1,3) [1] 0.75 (d) 1 F(0) = 1 1=4 = 0 :75 > 1-punif(0,-1,3) [1] 0.75 The Mean, Variance and Quantile Function of a Uniform Distribution The formulas for the mean, variance and quantile function of a uniform distribution are simple calcu- lations which we leave as an exercise. Let X U nif (a; b ).

E (X ) = a + b 2 (5.5) var (X ) = ( b a)2 12 (5.6) q (X; p ) =a+ ( b a) p (5.7) 5.4 Exponential Distributions and Their Relatives 5.4.1 Exponential Distributions De nition 5.4. A random variable Xhas an exponential distribution with rate parameter >0, if its cumulative distribution is F(x ) = ( 1 e x ifx 0, 0 if x <0, (5.8) Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 71 with density function f(x ) = ( e x ifx 0, 0 if x <0. (5.9) When Xhas the exponential distribution with rate parameter , we write X E xp ( ).

In a previous section, we showed that the mean and standard deviation of the exponential distribution with rate are = 1 = and that the quantile function is given by F 1 (p ) = ln(1 p).

As previously mentioned, the exponential distributions have an intimate connection with Poisson pro- cesses. If we observe a Poisson process evolving in time and T 1 is the time from the beginning until the rst arrival, T 2 is the time between the rst and second arrivals, T 3 the time between the second and third arrivals, and so on, then T 1; T 2; T 3etc. are independent random variables all with the same exponential distribution E xp( ), where is the rate parameter for the Poisson process.

Exponential distributions are a starting point for the study of lifetimedistributions. Let Tdenote the length of time a randomly chosen member of a population survives in a particular state. The population is not necessarily a biological population; it could be the population of atoms in a lump of radioactive matter and the lifetime could be the amount of time an atom survives in an excited state before decaying into a lower energy state. Our main interest is in the survivalfunction S(t) = 1 F(T ) = P r(T > t ) fort >0, interpreted as the probability of survival past time t. If Thas an exponential distribution with rate parameter , then S (t) = e t :

This describes systems at the atomic level quite well. The exponential distribution has a peculiar property called the "memoryless" property which seems to be true for such things as atoms in an excited energy state. Informally, this property states that the probability that an ob ject survives an additional tunits of time is independent of the amount of time tthat the ob ject has already survived. In symbols, P r(T > t + tjT > t ) does not depend on t. For an exponential distribution P r (T > t + tjT > t ) =P r(T > t + t) =P r (T > t ) = e (t+ t)) =e t = e t :

Thus an ob ject whose lifetime distribution is exponential does not age. The exponential distributions are the only continuous distributions with the memoryless property. Lifetime distributions for com- plex systems (e.g., organisms) are not exponential because clearly they do age. The probability of survival for an additional 5 years is not the same for a 70 year old man as for a 20 year old man.

The cumulative distribution and quantile function for the exponential distributions are calculated in R with the functions "pexp" and "qexp". For "pexp" the required argument is x, the point at which the cumulative distribution is to be evaluated and the rateargument for the rate parameter is required only if it is di erent from 1.

> pexp(2) [1] 0.8646647 > pexp(2,rate=2) [1] 0.9816844 Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 72 For the quantile function qexp, the argument pis required and the rateargument is optional.

> qexp(.75) [1] 1.386294 > qexp(.75,rate=2) [1] 0.6931472 5.4.2 Gamma Distributions The gamma function is an extension to the set of all positive real numbers of the familiar factorial function de ned for nonnegative integers. It is de ned for >0 as ( ) = Z 1 0 x 1 e x dx:

This integral converges for any >0. Using integration by parts, it can be shown that ( + 1) = ( ) and it is easily seen that (1) = 1. Hence, (2) = 1 (1) = 1, (3) = 2 (2) = 2, (4) = 3 (3) = 3!, and by induction ( n) = ( n 1)! for positive integers n. Now modify the integrand in the de nition slightly as follows: Z1 0 x 1 e x dx; where >0 is a constant. By making the change of variables u= x , the integral becomes 1 Z 1 0 u 1 e u du = ( ) :

Thus, f(x ; ; ) = ( )x 1 e x for x > 0 is a density function. It is called the gamma density with shape parameter and rate parameter . Sometimes the formula is written in terms of the scale parameter = 1 = instead of the rate parameter . When Xhas a gamma distribution we write X Gamma ( ; rate = ) or X Gamma ( ; scale = ).

The exponential distributions are special cases of the gamma distributions corresponding to = 1.

Like exponential distributions, the gamma distributions have a connection to Poisson processes. The time between the kth and k+ mth arrivals in a Poisson process with rate parameter has the gamma distribution Gamma(shape =m; rate = ).

Several gamma density functions with di erent shape parameters are plotted below.

> par(mfrow=c(2,2)) > curve(dgamma(x,shape=.5),from=0,to=4,xlab="x",ylab="f(x)", + main="alpha = 0.5") Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 73 > curve(dgamma(x,shape=1),from=0,to=4,xlab="x",ylab="f(x)", + main="alpha = 1 (exponential)") > curve(dgamma(x,shape=2),from=0,to=8,xlab="x",ylab="f(x)", + main="alpha = 2") > curve(dgamma(x,shape=3),from=0,to=8,xlab="x",ylab="f(x)", + main="alpha = 3") > par(mfrow=c(1,1)) Aside from their relationship to Poisson processes, some gamma distributions are especially important in the theory of statistical inference. We will cover this aspect of them in a later chapter.0 1 2 3 4 0.0 1.0 2.0 alpha = 0.5 x f(x) 0 1 2 3 4 0.0 0.4 0.8 alpha = 1 (exponential) x f(x) 0 2 4 6 8 0.0 0.1 0.2 0.3 alpha = 2 x f(x) 0 2 4 6 8 0.00 0.10 0.20 alpha = 3 x f(x) Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 74 The Mean, Variance and Quantiles of a Gamma Distribution If X Gamma (shape = ; scale = ), E (X ) = 1 ( ) Z 1 0 x e x= dx = +1 ( + 1) ( ) = +1 ( ) ( ) = (5.10) By a similar argument E(X 2 ) = ( + 1) 2 :

Thus, var(X ) = 2 (5.11) To nd the pth quantile of a gamma distribution we would have to solve the equation p= 1 ( ) Z x 0 t 1 e t= :

This cannot be done in terms of elementary functions of x; however, the equation can be solved numerically. R's quantile function for the gamma distributions "qgamma" does this. For example, if = 3 and = 2, the third quartile of the corresponding gamma distribution is > qgamma(.75,shape=3,scale=2) [1] 7.840804 The function "pgamma" gives the value of the cumulative distribution.

> pgamma(7.840804,shape=3,scale=2) [1] 0.75 5.4.3 Weibull Distributions Engineers and scientists have found that a power transformation of an exponential random variable sometimes results in a more realistic representation of the lifetime distribution of a complex system.

Speci cally, suppose that X E xp ( = 1) and let T= X 1 = , where and are positive constants.

We write the exponent as 1 = rather than simply to make the formulas come out nicer in the end.

The survival function for Tis P r(T > t ) =P r( X 1 = > t ) = P r (X > (t ) ) = e (t ) (5.12) Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 75 From the survival function we easily derive the cumulative distribution and the density. F(t) = 1 e (t ) f (t) = ( t ) 1 e (t ) (5.13) The utility of the Weibull distribution in survival analysis is that it has the two adjustable parameters and which can be adapted to speci c data on survival of the systems under study. In other words, they can be estimated from data. Estimation of parameters is one of the ma jor goals of statistical inference. The parameter is called the scale parameter because it adjusts for a change in the scale of time measurement. The parameter is the shape parameter. It governs the shape of the density function and its rate of decay as t! 1 . For = 1 the Weibull distribution is the exponential distribution. For >1 the system wears out with age in the sense that the conditional probability of survival another tunits of time decreases as tincreases. For <1, the system improves with age - the conditional probability of survival another tunits increases as tincreases. Examples of the survivalfunction and density function for several values of are plotted below.

> par(mfrow=c(2,2)) > curve(1-pweibull(x,shape=0.5),from=0,to=4,ylab="S(t)",xlab="t", + main="alpha=0.5") > curve(1-pweibull(x,shape=1),from=0,to=4,ylab="S(t)",xlab="t", + main="alpha=1 (exponential)") > curve(1-pweibull(x,shape=3),from=0,to=4,ylab="S(t)",xlab="t", + main="alpha=3") > curve(1-pweibull(x,shape=4),from=0,to=4,ylab="S(t)",xlab="t", + main="alpha=4") > par(mfrow=c(1,1)) Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 76> par(mfrow=c(2,2)) > curve(dweibull(x,shape=0.5),from=0,to=4,ylab="f(t)",xlab="t", + main="alpha=0.5") > curve(dweibull(x,shape=1),from=0,to=4,ylab="f(t)",xlab="t", + main="alpha=1 (exponential)") > curve(dweibull(x,shape=3),from=0,to=4,ylab="f(t)",xlab="t", + main="alpha=3") > curve(dweibull(x,shape=4),from=0,to=4,ylab="f(t)",xlab="t", + main="alpha=4") > par(mfrow=c(1,1))0 1 2 3 4 0.2 0.4 0.6 0.8 1.0 alpha=0.5 t S(t) 0 1 2 3 4 0.0 0.4 0.8 alpha=1 (exponential) t S(t) 0 1 2 3 4 0.0 0.4 0.8 alpha=3 t S(t) 0 1 2 3 4 0.0 0.4 0.8 alpha=4 t S(t) Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 77The Mean, Variance and Quantile Function of a Weibull Distribution The mean, variance, and quantiles of a Weibull distribution are easily found from its relationship to the standard exponential distribution E xp( = 1) with density f(x ) = e x ; x 0. If Xhas this distribution, then Y= X 1 = has the Weibull distribution W eib(shape = ; scale = ). Therefore, E (Y ) = Z 1 0 x 1 = e x dx = ( 1 + 1) (5.14) Likewise, E(Y 2 ) = 2 ( 2 + 1) so that var(Y ) = 2 f ( 2 + 1) ( 1 + 1) 2 g : (5.15)0 1 2 3 4 0.0 0.5 1.0 1.5 2.0 alpha=0.5 t f(t) 0 1 2 3 4 0.0 0.4 0.8 alpha=1 (exponential) t f(t) 0 1 2 3 4 0.0 0.4 0.8 1.2 alpha=3 t f(t) 0 1 2 3 4 0.0 0.5 1.0 1.5 alpha=4 t f(t) Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 78 The transformation y= x 1 = is strictly increasing. Therefore, it maps quantiles of Xto correspond- ing quantiles of Y. The pth quantile of Xis ln(1 p). Thus, the pth quantile of Yis q (Y ; p ) = [ ln(1 p)] 1 (5.16) 5.4.4 Exercises 1. Suppose X U nif ( 1;3). Find the probabilities of the following events, both by hand calculation and with R's puniffunction.

(a) ( X 2) (b) ( X 1) (c) ( 0:5 < X < 1:5) (d) ( X= 0) 2. Find the median of U nif(a; b ).

3. Suppose U U nif (0;1). Show that 1 U U nif (0;1).

4. Suppose U U nif (0;1) and that >0 is a given constant. Let X = 1 log (1 U) . Find the cumulative distribution of X.

5. Find the rst and third quartiles of the exponential distribution with mean . Compare the in- terquartile range F 1 (:75) F 1 (:25) to the standard deviation.

6. Suppose X E xp ( = 2). Find the probabilities of the events (a) - (d) in exercise 1 above.

7. The lifetimes in years of air conditioning systems have a Weibull distribution with a shape param- eter = 4 and scale parameter = 8. What is the probability that your new a.c. system will last more than 10 years? What is the third quartile of a.c. lifetimes?

8. Refer to problem 7 above. Given that your a.c. system has already lasted more than 10 years, what is the probability that it will last at least one more year? Given that it has already lasted 5 years, what is the probability that it will last another year?

9. Suppose X Gamma ( ; scale = ) and that Y=kX with k >0 a constant. Show that Y Gamma ( ; scale =k ).

10. Suppose X Gamma ( = 3 ; scale = 2). Use R's "pgamma" function to nd:

(a) P r(X 4) (b) P r(1 X < 3) Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 79 5.5 Normal Distributions De nition 5.5. The random variable Zhas the standard normal distribution if its density function is (z ) = 1 p 2 e z2 = 2 (5.17) for all z, 1 < z < 1. This is indicated by the expression Z N orm (0;1).

By considering the rapid rate at which (z ) ! 0 as jz j ! 1 , it is easy to see that R 1 1 (z )dz is nite.

To show that the integral is 1, so that is actually a density function, is trickier. The standard trick is to let the value of the integral be denoted by Iand then to write I 2 = 1 2 Z 1 1 e x2 = 2 dx Z 1 1 e y2 = 2 dy:

Next, write this as a double integral.

I2 = 1 2 Z 1 1 Z 1 1 e (x 2 + y2 )= 2 dxdy:

Now change to polar coordinates x= rcos , y = rsin with r 0 and 0 < 2 .

I 2 = 1 2 Z 2 =0 Z 1 r =0 e r2 = 2 rdrd :

This expression is equal to 1. We leave the remaining details to the reader.

The standard normal cumulative distribution is (z) = Z z 1 (u )du = 1 p 2 Z z 1 e u2 = 2 du: (5.18) This integral cannot be expressed in terms of elementary functions, but it can be numerically evaluated as accurately as desired. The R function for evaluating is "pnorm" and the function for 1 is "qnorm". For example, > pnorm(1.645) [1] 0.9500151 > pnorm(1.98) [1] 0.9761482 > pnorm(-0.68) [1] 0.2482522 > qnorm(.975) [1] 1.959964 > qnorm(.25) [1] -0.6744898 Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 80 Although it is seldom needed, the density function may evaluated with "dnorm".

> dnorm(0) [1] 0.3989423 > dnorm(1.645) [1] 0.1031108 The density and cumulative distribution of the standard normal are plotted below. 5.5.1 Tables of the Standard Normal Distribution Tables of values of the standard normal cumulative distribution are widely available. The one below was produced with R. The rst two signi cant digits of zare arranged along the left hand margin and the third is read from the top row. The table entries are ( z) for nonnegative values of z. For negative zuse the relation ( z) = 1 ( z), which holds for all zbecause of the symmetry of the density function . 3 2 1 0 1 2 3 0.0 0.2 0.4 x 3 2 1 0 1 2 3 0.0 0.4 0.8 x Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 81 0 1 2 3 4 5 6 7 8 9 0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359 0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753 0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141 0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517 0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879 0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224 0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549 0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852 0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133 0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389 1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621 1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830 1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015 1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177 1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319 1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441 1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545 1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633 1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706 1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767 2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817 2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857 2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890 2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916 2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936 2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952 2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964 2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974 2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981 2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986 5.5.2 Other Normal Distributions Suppose that Z N orm (0;1) and that >0 and are constants. Let the random variable Xbe related to Zby X=Z + . We calculate the cumulative distribution of Xby the same transformation methods we have used before.

FX ( x ) = P r(X x) = P r (Z + x) = P r (Z x ) = ( x ) (5.19) Di erentiating F X gives the density function of the random variable X. Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 82 f X ( x ) = 1 0 ( x ) = 1 (x ) = 1 p 2 e ( x )2 2 2 (5.20) The distribution of Xis called the normal distribution with parameters and . This is indicated by the expression X N orm ( ; ).

Theorem 5.2. X N orm ( ; ) if and only if Z= ( X )= N orm (0;1).

If Z N orm (0;1), then E(Z ) = Z 1 1 z (z )dz = 0 because the integrand z(z ) is an odd function. To nd the variance var(Z ) = E(Z 2 ) = Z 1 1 z 2 (z )dz integrate by parts letting u= zand dv=z (z )dz . The result is var (Z ) = 1 :

If X N orm ( ; ), then X= + Z with Z N orm (0;1). Hence, E (X ) = + E (Z ) = ; and var(X ) = 2 var (Z ) = 2 :

Thus the parameter is the mean and the parameter is the standard deviation of a normal distri- bution.

The practical implication of Theorem5.2is that we can use the standard normal table for any normal distribution.

Example : Let X N orm ( = 2 ; = 3). Find P r(X 5:5). Find the 95 th percentile of X.

Solution : According to equation5.19, P r(X 5:5) = F X (5 :5) = ((5 :5 2)=3) = (1 :1667) Interpolating the table values, we get 0.8784.

Since X= + Z , F 1 X ( p ) = + 1 (p ): Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 83 Thus, F 1 X ( :95) = 2 + 3 1 (:95). Reading the table backward and interpolating again, we have 1 (:95) = 1 :645. Hence, F 1 X ( :95) = 6 :935.

The R functions for computing the normal cumulative distribution F X and its inverse F 1 X are "pnorm" and "qnorm", illustrated below.

> pnorm(5.5,mean=2,sd=3) [1] 0.8783275 > qnorm(.95,mean=2,sd=3) [1] 6.934561 5.5.3 The Normal Approximation to the Binomial Distribution We shall see in the next chapter that normal distributions occupy a central place in probability and statistics. This is because of a famous theorem called the central limit theorem. In just one of many applications of the central limit theorem, the binomial distributions may be approximated by normal distributions. We state the result as a theorem and postpone its justi cation until the next chapter.

Theorem 5.3. LetY Binom (n; p ), where pis a xed constant. For all real z, P r ((Y np )= p np (1 p) z) ! (z) as n! 1 .

Example 5.6. Suppose thatYis the number of heads in 30 tosses of a fair coin. Let us approximate P r (Y > 18).

Solution : When approximating the binomial with the normal distribution, a better approximation is obtained by applying the continuity correction. This means to adjust the inequalities describing events to avoid points of discontinuity of the binomial distribution, i.e., the possible values of the discrete random variable Y. Since Yis an integer, the events ( Y >18) and ( Y >18:5) are actually the same.

The mean of Yis 15 and its standard deviation is p 7 :5 = 2 :739.

P r (Y > 18:5) = P r(y 15 2 :739 > 18 :5 15 2 :739 ) = 1 (1 :278) R gives the answer > 1-pnorm(1.278) [1] 0.1006247 Compare this to the exact binomial probability > 1-pbinom(18,30,0.5) [1] 0.1002442 Go to TOC CHAPTER 5. CONTINUOUS DISTRIBUTIONS 84 5.5.4 Exercises 1. Let Z N orm (0;1). Use the normal table and also R's "pnorm" function to nd (a) P r(Z 1:45) (b) P r(Z > 1:28) (c) P r( 0:674 Z < 1:036) (d) P r(Z > 0:836) 2. Use the normal table and also R's "pnorm" function to nd (a) P r(X 6:13), X N orm (1;4) (b) P r(X > 2:35), X N orm ( 1;2) (c) P r( 0:872 < X 7:682), X N orm (2:5 ;5) (d) P r(X > 0:698), X N orm ( 2;4) 3. Use the normal table and also R's "qnorm" function to nd (a) The 90 th percentile of N orm(0;5).

(b) The 15 th percentile of N orm(1;3).

(c) The interquartile range, i.e., the distance from the rst to third quartiles of N orm( ; ).

4. A student makes a score of 700 on an achievement test with normally distributed scores having a mean of 600 and a standard deviation of 75. What is the student's percentile score?

5. X is the number of heads when a fair coin is tossed 30 times. Find the exact binomial probability P r (X = 15) and its normal approximation. Go to TOC Chapter 6 Joint Distributions and Sampling Distributions 6.1 Introduction We have already discussed jointly distributed discrete random variables, their joint and marginal distributions, and their conditional distributions. If Xand Yare jointly distributed discrete variables, their joint frequency function is related to joint probabilities by P r(X 2I 1 ; Y 2I 2 ) = X x 2 I 1 X y 2 I 2 f (x; y ):

The marginal (individual) frequency functions of the variables are related to the joint frequency function by, e.g., fX ( x ) = X y f (x; y ); and the conditional frequency function of one variable, given the value of the other, is fX jY ( x jy ) = P r(X =xjY =y) = f (x; y ) f Y ( y ) :

6.2 Jointly Distributed Continuous Variables The formal relationships for jointly distributed continuous variables are similar, except that sums must be replaced by integrals. If Xand Yare jointly distributed continuous variables, their joint density function is a function f(x; y ) 0 of two arguments such that for all intervals I 1 and I 2 , P r (X 2I 1 ; Y 2I 2 ) = Z I2 Z I1 f (x; y )dxdy: (6.1) More generally, if Ais any region in the x; ycartesian plane that has an area, P r ((X; Y )2 A) = Z A Z f(x; y )dxdy:

85 Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 86 Here ( X; Y) can be thought of as a random point in the x; yplane. In (6.1) if we let I 2 = ( 1 ;1 ) we nd that the marginal density function of Xis f X ( x ) = Z 1 1 f (x; y )dy: (6.2) Similarly, the marginal density of Yis fY ( y ) = Z 1 1 f (x; y )dx:

Let xbe a xed number such that f X ( x ) > 0. The conditional density function of Y, given that X =x, is the function of y:

fY jX ( y jx ) = f (x; y ) f X ( x ) :

(6.3) If we integrate this function with respect to yover an interval I, we obtain the conditional probability that Y2I, given that X=x.

P r (Y 2IjX =x) = Z I f Y jX ( y jx )dy: (6.4) In this situation, this is the de nition of P r(Y 2IjX =x). The elementary de nition of condi- tional probability does not work because the event ( X=x) has zero probability. Conditional and unconditional probabilities for Yare related by P r (Y 2I) = Z 1 1 P r (Y 2IjX =x)f X ( x )dx: (6.5) Example 6.1. The Uniform Distribution over a Region in the Plane Consider the shaded triangular region Twith vertices (0 ;0), (1 ;0), and (1 ;1) shown below. When we say that ( X; Y) is uniformly distributed over Twe mean that f(x; y ) has a constant value on Tand is zero outside of T. Since R T R f(x; y )dxdy = 1, this implies that for ( x; y)2 T, f (x; y ) = 1 =area of T :

For any region Ain the plane, P r((X; Y )2 A) is the proportion of the area of Toccupied by A, i.e., the area of A\T divided by the area of T. In this example the area of Tis 1 =2, so the joint density of X and Yisf(x; y ) = 2 if ( x; y)2 T and f(x; y ) = 0 otherwise. We will nd the marginal and conditional density functions associated with f. Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 87The basic formula for the marginal density of Xis given by (6.2). For x <0 or x >1,f(x; y ) = 0 for all yand therefore f X ( x ) = 0. For xed xbetween 0 and 1, as pictured, f(x; y ) = 0 if y <0 or y > x and f(x; y ) = 2 if 0 y x. Thus, (6.2) reduces to fX ( x ) = Z y= x y =0 2 dy = 2 x:

Similarly, for 0 y 1, fY ( y ) = Z x=1 x = y 2 dx = 2(1 y):

To nd the conditional density of Y, given that X=xwe must remember that the de nition requires that f X ( x ) > 0, i.e., that 0 < x 1. Since f(x; y ) = 0 for youtside the interval (0 ; x), f Y jX ( y jx ) = 2 2 x = 1 x for 0 y x. In other words, given X=x, Y is uniformly distributed on the interval (0 ; x). This0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x y Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 88 should be almost obvious from the picture. Likewise, given Y=y2 (0;1), Xis uniformly distributed on the interval ( y;1).

The conditional distribution of Y, given that X=xmay have a mean E (Y jX =x) = Z 1 1 yf YjX ( y jx )dy and a variance var(Y jX =x) = Z 1 1 y 2 f Y jX ( y jx )dy E(Y jX =x)2 :

If so, these will generally be functions of the number x. The relationship between the conditional and unconditional expected values of Yis readily obtained from the de nition of the conditional density.

Z 1 1 E (Y jX =x)f X ( x )dx =Z 1 1 yf Y( y )dy =E(Y ): (6.6) More generally, if g(y ) is a function de ned on the range of the random variable Y, E (g (Y )) = Z 1 1 E (g (Y )jX =x)f X ( x )dx:

6.2.1 Mixed Joint Distributions It is quite common to have two or more jointly distributed random variables, some continuous and others discrete. In most such cases their conditional distributions precede a description of their joint distribution. Suppose Xand Yare jointly distributed, Xis discrete with frequency function f X ( x ), and Yis continuous with conditional density function f Y jX ( y jx ). The joint distribution of Xand Y is characterized through their joint hybrid frequency-density function f(x; y ) = f X ( x )f Y jX ( y jx ) and the marginal density function of Yis f Y ( y ) = X x f Y jX ( y jx )f X ( x ):

Then the conditional frequency function of X, given that Y=yis f X jY ( x jy ) = f Y jX ( y jx )f X ( x ) f Y ( y ) :

(6.7) Example 6.2. The site of an archaeological excavation was at two di erent times occupied by two genetically distinct groups of people. The earlier group is thought to have comprised about 25% of their combined numbers. One distinguishing anatomical characteristic is the logarithm Yof the ratio of skull height to skull width. For the earlier group that is normally distributed with mean 0.223 and standard deviation 0.04. For the later group, the log of the skull ratio is normally distributed with mean 0.300 and standard deviation 0.04. A skull is excavated that has a value of Y= 0 :240. What is the probability that it came from the earlier group?

Solution Let X= 1 if a skull comes from the earlier group and X= 0 if it comes from the later group.

X is a Bernoulli variable with success probability p= 0 :25. We want to nd f X jY (1 ;0 :240) = P r(X = 1 jY = 0 :240). We will perform the calculations of equation (6.7) with R and its "dnorm()" function. Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 89 > fxy=0.25*dnorm(0.240,mean=0.233,sd=0.04) > fy=fxy+0.75*dnorm(0.240,mean=0.300,sd=0.04) > fxy/fy [1] 0.5027688 6.2.2 Covariance and Correlation If X and Yare jointly distributed with joint density function f(x; y ) and g(x; y ) is a real valued function, then g(X; Y ) is a random variable. If its expected value exists, it can be found by E(g (X; Y )) = Z 1 1 Z 1 1 g (x; y )f (x; y )dxdy: (6.8) Especially important is the covariance of X and Y.

cov (X; Y ) =E((X x)( Y y)) = E(X Y ) E(X )E (Y ); where x = E(X ) and y = E(Y ). The correlation between Xand Yis cor (X; Y ) =cov (X; Y ) x y ; a number between -1 and 1. Here x and y are the standard deviations of Xand Y, respectively. The interpretations of the covariance and correlation for continuous variables are the same as for discrete variables. The greater the absolute value of cor(X; Y ), the closer the variables Xand Ycome to satisfying a linear relationship of the form aX+bY =cfor some constants a; band c.

Example 6.3. We will calculate the covariance and correlation between Xand Yin Example 1.

E (X Y ) = Z 1 0 Z x 0 2 xy dy dx = Z 1 0 2 x Z x 0 ydy dx = Z 1 0 x 3 dx = 1 =4 We leave it as an exercise to show that E(X ) = 2 =3, E(Y ) = 1 =3, E(X 2 ) = 1 =2, and E(Y 2 ) = 1 =6.

It follows that var(X ) = 1 =2 (2=3) 2 = 1 =18 and var(Y ) = 1 =6 (1=3) 2 = 1 =18. Thus, cov (X; Y ) = 1 =4 (2=3)(1 =3) = 1 =36 and cor(X; Y ) =1 = 36 1 = 18 = 1 =2: Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 90 De nition 6.1. Two random variables Xand Yare uncorrelated if cov (X; Y ) = 0, equivalently, if E (X Y ) = E(X )E (Y ).

Example 6.4. Letf(x; y ) be the uniform density over the unit disc in the x; yplane. We leave it to the reader to show that E(X ) = E(Y ) = 0. Thus, the covariance of Xand Yis just E(X Y ). For all four quadrants Q, the integral Z Q Z xyf (x; y )dxdy is the same except for sign. It is positive in the rst and third quadrants and negative in the second and fourth quadrants. Thus, cov(X; Y ) = 0. Students should work out the details by actually doing the integrations. 6.2.3 Bivariate Normal Distributions A bivariate normal distribution depends on ve parameters, x, y, x > 0, y > 0, and 2 ( 1;1).

Let Xand Ybe jointly distributed variables and let X N orm ( x; x). Also, let the conditional 1.5 1.0 0.5 0.0 0.5 1.0 1.5 1.5 1.0 0.5 0.0 0.5 1.0 1.5 Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 91 distribution of Y, given that X=x, be normal with conditional mean E (Y jX =x) = y + y x ( x x) ; (6.9) conditional variance var(Y jX =x) = 2 y (1 2 ); (6.10) and standard deviation yp 1 2 . Then the joint density of Xand Yis the product of the two normal densities f(x; y ) = f Y jX ( y jx )f X ( x ): (6.11) It is tedious, but only algebra, to show that the joint density of Xand Yworks out to be f (x; y ) = 1 2 x yp 1 2 e 1 2 ( x = x x )2 2 ( x x x ) y y y + y y y 2 (6.12) This expression remains the same if all the x0 s and y0 s are interchanged, so it follows that the marginal distribution of YisN orm ( y; y) and that the conditional distribution of X, given that Y=y, is normal with conditional mean x + x y ( y y) and variance 2 x (1 2 ). The parameter is the correlation between Xand Y. Notice that if = 0, the joint density factors into the product of the marginal densities. f(x; y ) = f X ( x )f Y ( y ):

This means that Xand Yare independent. We will say more about this in the next section.

The level curves f(x; y ) = constant of the bivariate normal density function are ellipses whose incli- nation to the coordinate axes and eccentricities depend on the correlation . The gure below shows level curves for the bivariate normal density function with zero means, unit variances and = 0:7. Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 92To show that Xand Yhave a joint bivariate normal distribution, it is enough to show that Xis normally distributed and that the conditional distribution of Y, given that X=x, is normal with mean 0 + 1x , a linear function of x, and constant variance independent of x. The parameters y; yand are then uniquely determined by 0; 1; and the mean and standard deviation of X.

6.3 Independent Random Variables We have already de ned what it means for jointly distributed random variables X 1; ; X nto be independent. To repeat that de nition, it means that for all intervals I 1 ; ; I nof real numbers, P r (X 12 I 1 ; X 22 I 2 ; ;X n2 I n ) = P r(X 12 I 1 ) P r (X 22 I 2 ) P r(X n2 I n ) :

If each X ihas a density function or frequency function f i( x ), the X iare independent if and only if the joint frequency-density function factors into the product of the marginal frequency or density functions: f(x 1; x 2; ; x n) = f 1( x 1) f 2( x 2) f n ( x n) ; 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 0.22 3 2 1 0 1 2 3 3 2 1 0 1 2 3 Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 93 for all x 1; ; x n.

Theorem 6.1. LetXand Ybe jointly distributed independent variables with nite variances. Then X and Yare uncorrelated.

Proof : We will assume that Xand Yare continuous variables. The argument for other cases would be almost the same. Since Xand Yare independent, f(x; y ) = f X ( x )f Y ( y ) and E(X Y ) = Z 1 1 Z 1 1 xyf X( x )f Y ( y )dxdy = Z 1 1 xf X( x )dx Z 1 1 yf Y( y )dy = E(X )E (Y ) Thus, cov(X; Y ) =E(X Y ) E(X )E (Y ) = 0.

The converse is not true in general. In Example6.4, Xand Yare uncorrelated but not independent.

However, if Xand Yhave a bivariate normal distribution with correlation = 0, then they are independent.

6.3.1 Exercises 1. In Example 1, are Xand Yindependent or dependent? Give reasons.

2. In Example 1 nd P r(X > 1= 2) in two di erent ways, one by integrating the marginal density function and the other by considering ratios of areas.

3. 70% of students in calculus 1 are STEM ma jors. The semester average in calculus 1 for STEM ma jors is normally distributed with mean 81 and standard deviation 8. The semester average for non-STEM ma jors is normally distributed with mean 75 and standard deviation 10. Given that a student has a semester average of 90, what is the probability that he or she is a STEM ma jor?

4. In problem 3, what is the probability that a student is a STEM ma jor, given that his or her semester average is greater than or equal to 90?

5. In Example 4 show by integration that E(X ) = E(Y ) = 0 and E(X Y ) = 0.

6. Suppose that ( X; Y) has a bivariate normal distribution with parameters x = 0; y = 0; x = 1; y = 1; = 0:7. Plot E(Y jX =x) as a function of x. On the same axes plot E(X jY =y) as a function of y.

7. Suppose that Uand Vare independent U nif(0;1) random variables. Let W=U+V. Find the cumulative distribution of W,P r (W w), for 0 w 2. (Hint: Use (6.5) and consider 0 w 1 Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 94 and 1 w 2 separately.) 8. Find the density function of the random variable Win exercise 7 and sketch its graph.

6.4 Sums of Random Variables Let X 1and X 2be jointly distributed random variables. If their expected values exist, so does E(X 1+ X 2) and E(X 1+ X 2) = E(X 1) + E(X 2) (6.13) This can be extended to the sum of any number of jointly distributed variables. We calculate the variance of X 1+ X 2as follows:

var (X 1+ X 2) = cov(X 1+ X 2; X 1+ X 2) = cov (X 1; X 1+ X 2) + cov(X 2; X 1+ X 2) = cov (X 1; X 1) + cov(X 1; X 2) + cov(X 2; X 1) + cov(X 2; X 2) = var (X 1) + 2 cov(X 1; X 2) + var(X 1) We used three general properties of the covariance function above. One is that cov(X 1; X 2) = cov (X 2; X 1). Another is that var(X ) = cov(X; X ). Finally, we used the fact that the covariance function is linear in each of its arguments. These are all easy to establish from the de nition of the covariance.

This result can be extended to more than two jointly distributed random variables by induction.

var(n X i =1 X i) = n X i =1 var (X i) + 2 n X i =2 i 1 X j =1 cov (X i; X j) : (6.14) If the random variables X iare uncorrelated, then we have Theorem 6.2. IfX 1; X 2; ; X nare uncorrelated (in particular, if they are independent), var (n X i =1 X i) = n X i =1 var (X i) :

Next, we consider the sum of independent normal random variables. If X 1 N orm ( 1; 1) and X 2 N orm ( 2; 2) are independent, then we know from the preceding theorem that the mean of X 1+ X 2is 1 + 2 and the standard deviation of X 1+ X 2is p 2 1 + 2 2 . We will show that X 1+ X 2 is normally distributed. We have X1= 1 + 1Z 1; X 2= 2 + 2Z 2; where Z 1 and Z 2 are independent standard normal random variables. The joint density function of ( Z 1; Z 2) is bivariate normal with correlation = 0. Its formula is f (z 1; z 2) = 1 2 e 1 2 ( z 2 1 + z2 2 ) ; Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 95 and its level curves are circles centered at the origin. The joint density function depends only on the squared distance z2 1 + z2 2 of the point ( z 1; z 2) from the origin and not on the orientation of the axes. If the axes were rotated about the origin, the formula for the density function in the new coordinate system would be the same. A rotation of the axes through an angle corresponds to a transformation of ( Z 1; Z 2) of the form Z 0 1 = Z 1cos + Z 2sin (6.15) Z 0 2 = Z 1sin + Z 2cos :

So, the distribution of ( Z0 1 ; Z 0 2 ) is the same as that of ( Z 1; Z 2) and in particular, Z0 1 is standard normal, like Z 1. Let us choose so that cos = 1 p 2 1 + 2 2 sin = 2 p 2 1 + 2 2 :

We now have thatz1 z2 0.02 0.04 0.06 0.08 0.1 0.12 0.14 2 1 0 1 2 2 1 0 1 2 Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 96 1 p 2 1 + 2 2 Z 1+ 2 p 2 1 + 2 2 Z 2 is standard normal. Multiplying by p 2 1 + 2 2 , 1Z 1+ 2Z 2 N orm (0;q 2 1 + 2 2 ) ; and 1 + 2 + 1Z 1+ 2Z 2 N orm ( 1 + 2;q 2 1 + 2 2 ) :

Thus, X 1+ X 2is normally distributed. By induction, we can easily extend this result to sums of any number of independent normally distributed random variables.

Theorem 6.3. LetX i; i = 1 ; ; nbe independent normally distributed random variables X i N orm ( i; i). Then X 1+ X 2+ +X n is normally distributed with mean 1 + 2 + + n and variance 2 1 + +2 n .

We shall not prove it, but it is true that all linear combinations of two random variables with a bivariate normal distribution are normally distributed.

Theorem 6.4. LetX 1and X 2have a bivariate normal distribution with parameters 1; 2; 1; 2 and . Let a, b 1 , and b 2 be constants and let Y=a+ b 1 X 1+ b 2 X 2. Then Yis normally distributed with mean a+ b 1 1 + b 2 2 and variance b2 1 2 1 + b2 2 2 2 + 2 b 1 b 2 1 2.

6.4.1 Simulating Random Samples De nition 6.2. Arandom sample of sizenfrom a distribution (generically denoted by F) is a sequence X 1; X 2; ; X nof independent random variables whose marginal distributions are all F.

Random samples come from replicating an experiment with an associated random variable Xhaving distribution Fand letting X ibe the value of Xon the ith replication.

Computer simulation of random experiments is a very important tool in probability and statistics.

Simulation of an experiment involves generating a sequence of numbers that behave like values of a random sample, even though the mechanism for generating them is deterministic. Very good random number generators are available even on hand-held calculators. Below we describe the most basic method of simulation.

Recall that if Xis a random variable with cumulative distribution F, the pth quantile of F, or of X, is F 1 (p ) = q(X; p ) = min fx jF (x ) pg for 0 < p < 1. If we can calculate the quantile function and can simulate an observation from the uniform distribution U nif(0;1), then we can simulate an observation of Xwith distribution F.

Theorem 6.5. LetFbe a given cumulative distribution and let U U nif (0;1). Then the random variable X=F 1 (U ) has distribution F. Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 97 Proof : Let xbe an arbitrary real number. It follows from the de nition of the quantile function that F 1 (p ) xif and only if F(x ) p. Therefore, P r (X x) = P r(F 1 (U ) x) = P r(F (x ) U):

Since U U nif (0;1) and F(x ) 2 [0;1], P r (U F(x )) = F(x ):

Thus P r(X x) = F(x ) for all xand Fis the cumulative distribution of X.

To simulate a random sample of size nfrom F, simulate a random sample U 1; U 2; ; U nfrom U nif (0;1) and let X i= F 1 (U i) ; i = 1 ; ; n. Almost any calculator will simulate random sam- ples from U nif(0;1). Simply press the random number key ntimes. In R, the "runif" function will generate any number of uniform samples. Then to transform these into random samples from the exponential distribution with mean 1, we recall from the preceding chapter that the quantile function for E xp ( = 1) is log(1 p).

> us=runif(10) > xs=-log(1-us) > data.frame(us,xs) us xs 1 0.98573318 4.24981876 2 0.09089919 0.09529929 3 0.52470614 0.74382202 4 0.78792302 1.55080598 5 0.16579396 0.18127486 6 0.91731973 2.49277432 7 0.83268964 1.78790473 8 0.18240977 0.20139401 9 0.44970750 0.59730533 10 0.54134677 0.77946084 Although in theory this procedure is universally applicable, it is not necessarily the most e cient for generating random samples from given distributions. Each of the common families of distributions has a random sample simulator in R. For example, for the uniform distributions, it is "runif()", for exponential distributions "rexp()", for gamma distributions "rgamma()", for normal "rnorm()". The sample size nis a required argument and there are other required or optional arguments for selecting one particular member of the given family. We will illustrate this by generating a sample of size 100 from the gamma distribution with shape parameter = 2. We will plot a density histogram of the samples, and then superimpose the ideal density function.

> xs=rgamma(100,shape=2) > hist(xs,freq=F) > curve(dgamma(x,shape=2),add=T) Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 986.5 Sample Sums and the Central Limit Theorem Let X 1; X 2; ; X nbe a random sample from a distribution F. We are interested in the distributions of the sample sum T n = X 1+ +X n and the sample average X n= T n=n .

We assume that Fhas mean and standard deviation . From (6.2) we know that E(T n) = n and var (T n) = n2 . Hence, sd(T n) = p n . It then follows that E( X n) = and sd( X n) = =p n . The standardized values of T n and X nare equal by simple algebra.

Z n = T n n p n =p n ( X n ) (6.16) If the X iare normally distributed, then by Theorem 3, Z n is standard normal.

Example 6.5. The time for each of the four legs of a 400 meter relay race is normally distributed with mean 10 seconds and standard deviation 1.5 seconds. What is the probability that the race isHistogram of xs xs Density 0 2 4 6 8 0.00 0.05 0.10 0.15 0.20 0.25 0.30 Go to TOC CHAPTER 6. JOINT DISTRIBUTIONS AND SAMPLING DISTRIBUTIONS 99 run in less than 36 seconds?

Solution :

P r(T 4 < 36) = P r T4 4(10) 1 :5 p 4 < 36 4(10) 1 :5 p 4 = P r (Z 4 < 1:333) = ( 1:333) = 0 :0913 Theorem 6.6. The Central Limit Theorem Let X 1; X 2; ; X nbe a random sample from a distribution with mean and standard deviation .

Let T n and X n be the sample sum and sample average and let Z n be their standardized value as de ned in (6.16). Then lim n !1 P r (Z n z) = ( z) where is the standard normal cumulative distribution.

The remarkable thing about the central limit theorem is that its conclusion holds for any distribution whatsoever, as long as it has a positive, nite variance. An informal way of stating it is that the sample sum is approximately (or asymptotically) normal with mean n and standard deviation p n .

The sample average is approximately normal with mean and standard deviation =p n . A natural question is how large the sample size must be to have a good approximation. The folklore answer is that n 30 is usually su cient, but the answer really depends on how nearly normal the underlying distribution is. We will explore that question by simulating random samples from some common dis- tributions and checking to see if their averages appear to be normally distriibuted.

Instead of looking at a histogram of the averages, we will use a better graphical indication of normality called a normal quantile plot . A normal quantile plot is a plot of theoretical quantiles of the standard normal distribution on the horizontal axis and the sample quantiles on the vertical axis. If the data comes from a normal distribution, this plot should be close to a straight line. Any patterned departure from straightness is an indication of non-normality. The following is a normal quantile plot of a sample of size 30 from a normal distribution.

> xs=rnorm(30,mean=10,sd=3)
> qqnorm(xs)
> qqline(xs)

[Figure: normal quantile plot of the normal sample, with the reference line added by qqline]

The two left-hand figures below are a normal quantile plot and a histogram of a sample of size 50 from the gamma distribution Gamma(shape = 1/2, rate = 1). As the figures show, this is a very non-normal distribution. The two right-hand figures show the normal quantile plot and histogram obtained from 50 replications of the experiment of sampling 30 from the gamma distribution and calculating the 50 sample means. The figures appear to confirm that a sample of size 30 is borderline sufficient for the approximate normality of the sample mean, even from this very skewed distribution.

> par(mfrow=c(2,2))
> xs=rgamma(50,shape=0.5)
> qqnorm(xs); qqline(xs)
> xmeans=replicate(50,mean(rgamma(30,shape=0.5)))
> qqnorm(xmeans); qqline(xmeans)
> hist(xs,freq=F)
> curve(dgamma(x,shape=0.5),add=T)
> hist(xmeans,freq=F)
> curve(dnorm(x,mean=0.5,sd=sqrt(0.5/30)),add=T)

[Figures: normal quantile plots and density histograms of xs and xmeans]

Next, we investigate how well the central limit theorem applies to samples of size 30 from the Bernoulli distribution Binom(n = 1, p = 0.25). Since X = 0 or X = 1, a normal quantile plot and a histogram of the data would have no meaning, but we can look at them for sample averages.

> par(mfrow=c(2,1))
> xmeans=replicate(50,mean(rbinom(30,1,0.25)))
> qqnorm(xmeans); qqline(xmeans)
> hist(xmeans,freq=F)
> curve(dnorm(x,mean=0.25,sd=sqrt(0.25*0.75/30)),add=T)

[Figures: normal quantile plot and density histogram of the Bernoulli sample means]

Again, a sample size of 30 seems marginally acceptable. However, this is affected by the success probability. A value of p nearer 1/2 would give a different picture.
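To see the effect of the success probability, here is a small sketch (our own addition, not in the original text) that repeats the simulation with p = 0.5 alongside p = 0.25; the object names and plot titles are our own choices:

> par(mfrow=c(1,2))
> xmeans25=replicate(50,mean(rbinom(30,1,0.25)))
> xmeans50=replicate(50,mean(rbinom(30,1,0.5)))
> qqnorm(xmeans25,main="p = 0.25"); qqline(xmeans25)
> qqnorm(xmeans50,main="p = 0.5"); qqline(xmeans50)

The p = 0.5 averages typically follow the reference line more closely, since the Bernoulli distribution is symmetric in that case.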

6.5.1 Exercises

1. John's projected annual income after graduation is normally distributed with mean $80,000 and standard deviation $10,000. Sally's is normally distributed with mean $85,000 and standard deviation $12,000. If their incomes are independent, what is the probability that their combined income exceeds $180,000?

2. What is the standard deviation of their combined income if their incomes have a bivariate normal distribution and the correlation between them is 0.4? What is it if the correlation is -0.4?

3. What is the probability that their combined income exceeds $180,000 when the correlation is 0.4? When it is -0.4?

4. Suppose that John's and Sally's incomes have a bivariate normal distribution with a correlation of 0.4. Given that Sally's income is $95,000, what is the probability that John's income is greater than $100,000? What is the unconditional probability?

5. Back when people still wrote "checks" and entered the amounts in a "check register", it was common to round off the amounts entered to the nearest dollar. Assume that the roundoff errors are uniformly distributed between -$0.50 and $0.50 and that they are independent. If 30 checks are written, what is the probability that the sum of the roundoff errors is less than $5.00 in absolute value?

6. Do 50 replications of the experiment of adding the roundoff errors for 30 checks. Use normal quantile plots and histograms to investigate the normality of the sum of 30 roundoff errors.

7. The logistic distribution has cumulative distribution function

F(x) = 1/(1 + e^(−x)),  −∞ < x < ∞.

Find the quantile function of this distribution. Simulate a random sample of size 100 from the logistic distribution. Make a histogram and a normal quantile plot of the simulated data.

6.6 Other Distributions Associated with Normal Sampling

6.6.1 Chi Square Distributions

Definition 6.3. Let ν be a positive integer. The gamma distribution with shape parameter α = ν/2 and scale parameter β = 2 is called the chi square distribution with ν degrees of freedom. We write X ~ Chisq(df = ν) to say that X has such a distribution.

The density function of Chisq(df = ν) is

f(x) = 1/(2^(ν/2) Γ(ν/2)) x^(ν/2 − 1) e^(−x/2)     (6.17)

A key fact about gamma distributions is

Theorem 6.7. If X_1 ~ Gamma(shape = α_1, scale = β) and X_2 ~ Gamma(α_2, β) are independent, then X_1 + X_2 ~ Gamma(shape = α_1 + α_2, scale = β).

This theorem can be established with the convolution formula for the density of the sum of independent random variables. We will omit the proof. Notice that the theorem assumes that X_1 and X_2 have the same scale parameter. Since all chi square distributions have scale 2, it follows that the sum of independent chi square random variables has a chi square distribution.
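Both facts can be checked numerically in R; the following is a sketch we have added, not part of the text. The first comparison should report TRUE, since the chi square density is exactly the corresponding gamma density, and the quantile-quantile plot of a simulated sum of independent chi square variables should lie close to the line y = x:

> x=seq(0.5,10,by=0.5)
> all.equal(dchisq(x,df=3),dgamma(x,shape=3/2,scale=2))
> w=rchisq(10000,df=2)+rchisq(10000,df=3)
> qqplot(qchisq(ppoints(10000),df=5),w); abline(0,1)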

Corollary 6.1. If X_1, X_2, ..., X_n are independent and each X_i ~ Chisq(df = ν_i), then X_1 + X_2 + ... + X_n ~ Chisq(df = ν_1 + ν_2 + ... + ν_n).

Let W = Z², where Z ~ Norm(0, 1). The density of W can be calculated as follows.

Pr(W ≤ w) = Pr(−√w ≤ Z ≤ √w) = 2 Pr(0 ≤ Z ≤ √w) = (2/√(2π)) ∫₀^√w e^(−z²/2) dz

f(w) = d/dw [ (2/√(2π)) ∫₀^√w e^(−z²/2) dz ] = 1/(2^(1/2) √π) w^(−1/2) e^(−w/2)

This is the density of Gamma(shape = 1/2, scale = 2) = Chisq(df = 1).

Corollary 6.2. If Z_1, Z_2, ..., Z_n are independent Norm(0, 1) variables, then Z_1² + Z_2² + ... + Z_n² ~ Chisq(df = n).
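A short simulation (a sketch we have added, with an arbitrary choice of n = 4) illustrates Corollary 6.2:

> zsq=replicate(5000,sum(rnorm(4)^2))
> qqplot(qchisq(ppoints(5000),df=4),zsq); abline(0,1)

The simulated sums of squared standard normals should track the Chisq(df = 4) quantiles closely.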

Tables of Chi Square Distributions

The table below is a table of lower percentage points of chi square distributions with degrees of freedom given by the row headings. The column headings are the probabilities Pr(W ≤ w), where W ~ Chisq(df = row index) and w is the table entry. For example, the .025 quantile of Chisq(df = 10) is 3.246973.

df       0.01      0.025     0.05      0.1
1    0.000157  0.000982  0.003932  0.015791
2    0.020101  0.050636  0.102587  0.210721
3    0.114832  0.215795  0.351846  0.584374
4    0.297109  0.484419  0.710723  1.063623
5    0.554298  0.831212  1.145476  1.610308
6    0.872090  1.237344  1.635383  2.204131
7    1.239042  1.689869  2.167350  2.833107
8    1.646497  2.179731  2.732637  3.489539
9    2.087901  2.700389  3.325113  4.168159
10   2.558212  3.246973  3.940299  4.865182
11   3.053484  3.815748  4.574813  5.577785
12   3.570569  4.403789  5.226029  6.303796
13   4.106915  5.008751  5.891864  7.041505
14   4.660425  5.628726  6.570631  7.789534
15   5.229349  6.262138  7.260944  8.546756
16   5.812212  6.907664  7.961646  9.312236
17   6.407760  7.564186  8.671760  10.085186
18   7.014911  8.230746  9.390455  10.864936
19   7.632730  8.906516  10.117013 11.650910
20   8.260398  9.590777  10.850811 12.442609
21   8.897198  10.282898 11.591305 13.239598
22   9.542492  10.982321 12.338015 14.041493
23   10.195716 11.688552 13.090514 14.847956
24   10.856361 12.401150 13.848425 15.658684
25   11.523975 13.119720 14.611408 16.473408
26   12.198147 13.843905 15.379157 17.291885
27   12.878504 14.573383 16.151396 18.113896
28   13.564710 15.307861 16.927875 18.939242
29   14.256455 16.047072 17.708366 19.767744
30   14.953457 16.790772 18.492661 20.599235
31   15.655456 17.538739 19.280569 21.433565
32   16.362216 18.290765 20.071913 22.270594
33   17.073514 19.046662 20.866534 23.110197
34   17.789147 19.806253 21.664281 23.952253
35   18.508926 20.569377 22.465015 24.796655
36   19.232676 21.335882 23.268609 25.643300
37   19.960232 22.105627 24.074943 26.492094
38   20.691442 22.878482 24.883904 27.342950
39   21.426163 23.654325 25.695390 28.195785
40   22.164261 24.433039 26.509303 29.050523

The next table is a table of upper percentage points of the chi square distributions with degrees of freedom given by the row headings. The column headings are the probabilities Pr(W > w), where W ~ Chisq(df = row index) and w is the table entry. For example, the .975 quantile of Chisq(df = 10) is 20.483.

Quantiles of the chi square distributions are given in R by the "qchisq()" function.

> qchisq(.975,df=30)
[1] 46.97924

df       0.1       0.05      0.025     0.01
1    2.705543  3.841459  5.023886  6.634897
2    4.605170  5.991465  7.377759  9.210340
3    6.251389  7.814728  9.348404  11.344867
4    7.779440  9.487729  11.143287 13.276704
5    9.236357  11.070498 12.832502 15.086272
6    10.644641 12.591587 14.449375 16.811894
7    12.017037 14.067140 16.012764 18.475307
8    13.361566 15.507313 17.534546 20.090235
9    14.683657 16.918978 19.022768 21.665994
10   15.987179 18.307038 20.483177 23.209251
11   17.275009 19.675138 21.920049 24.724970
12   18.549348 21.026070 23.336664 26.216967
13   19.811929 22.362032 24.735605 27.688250
14   21.064144 23.684791 26.118948 29.141238
15   22.307130 24.995790 27.488393 30.577914
16   23.541829 26.296228 28.845351 31.999927
17   24.769035 27.587112 30.191009 33.408664
18   25.989423 28.869299 31.526378 34.805306
19   27.203571 30.143527 32.852327 36.190869
20   28.411981 31.410433 34.169607 37.566235
21   29.615089 32.670573 35.478876 38.932173
22   30.813282 33.924438 36.780712 40.289360
23   32.006900 35.172462 38.075627 41.638398
24   33.196244 36.415029 39.364077 42.979820
25   34.381587 37.652484 40.646469 44.314105
26   35.563171 38.885139 41.923170 45.641683
27   36.741217 40.113272 43.194511 46.962942
28   37.915923 41.337138 44.460792 48.278236
29   39.087470 42.556968 45.722286 49.587884
30   40.256024 43.772972 46.979242 50.892181
31   41.421736 44.985343 48.231890 52.191395
32   42.584745 46.194260 49.480438 53.485772
33   43.745180 47.399884 50.725080 54.775540
34   44.903158 48.602367 51.965995 56.060909
35   46.058788 49.801850 53.203349 57.342073
36   47.212174 50.998460 54.437294 58.619215
37   48.363408 52.192320 55.667973 59.892500
38   49.512580 53.383541 56.895521 61.162087
39   50.659770 54.572228 58.120060 62.428121
40   51.805057 55.758479 59.341707 63.690740

6.6.2 Student t Distributions

Definition 6.4. Let Z ~ Norm(0, 1) and W ~ Chisq(df = ν) be independent. The distribution of

T = Z/√(W/ν)

is called the student-t distribution with ν degrees of freedom. We indicate that T has this distribution by writing T ~ t(df = ν).

The graph of the student-t density function is bell-shaped and symmetric about 0 like the normal distribution, but heavier in the tails. As ν → ∞ it converges to the standard normal density function, so for large values of ν there is very little difference between student-t and standard normal.
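A quick way to see this convergence (a sketch we have added, not from the text) is to draw the densities on one set of axes:

> curve(dnorm(x),-4,4,lty=2)
> curve(dt(x,df=3),add=T)
> curve(dt(x,df=30),add=T)

The t density with 3 degrees of freedom is visibly heavier in the tails, while the one with 30 degrees of freedom is already hard to distinguish from the dashed normal curve.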

Tables of Student t Distributions Tables of upper percentage points of the student-t distribution are available in all statistics textbooks.

The one below was produced in R. The row headings 1 through 40 are the numbers of degrees of freedom. The column headings are the right tail probabilities Pr(T > t) and the table entries are the values of t for those probabilities. For example, the 99th percentile of the student-t distribution with 30 degrees of freedom is 2.457. Quantiles for the student-t distributions are given in R by the "qt()" function.

> qt(.99,df=30)
[1] 2.457262

df       0.1       0.05      0.025     0.01
1    3.077684  6.313752  12.706205 31.820516
2    1.885618  2.919986  4.302653  6.964557
3    1.637744  2.353363  3.182446  4.540703
4    1.533206  2.131847  2.776445  3.746947
5    1.475884  2.015048  2.570582  3.364930
6    1.439756  1.943180  2.446912  3.142668
7    1.414924  1.894579  2.364624  2.997952
8    1.396815  1.859548  2.306004  2.896459
9    1.383029  1.833113  2.262157  2.821438
10   1.372184  1.812461  2.228139  2.763769
11   1.363430  1.795885  2.200985  2.718079
12   1.356217  1.782288  2.178813  2.680998
13   1.350171  1.770933  2.160369  2.650309
14   1.345030  1.761310  2.144787  2.624494
15   1.340606  1.753050  2.131450  2.602480
16   1.336757  1.745884  2.119905  2.583487
17   1.333379  1.739607  2.109816  2.566934
18   1.330391  1.734064  2.100922  2.552380
19   1.327728  1.729133  2.093024  2.539483
20   1.325341  1.724718  2.085963  2.527977
21   1.323188  1.720743  2.079614  2.517648
22   1.321237  1.717144  2.073873  2.508325
23   1.319460  1.713872  2.068658  2.499867
24   1.317836  1.710882  2.063899  2.492159
25   1.316345  1.708141  2.059539  2.485107
26   1.314972  1.705618  2.055529  2.478630
27   1.313703  1.703288  2.051831  2.472660
28   1.312527  1.701131  2.048407  2.467140
29   1.311434  1.699127  2.045230  2.462021
30   1.310415  1.697261  2.042272  2.457262
31   1.309464  1.695519  2.039513  2.452824
32   1.308573  1.693889  2.036933  2.448678
33   1.307737  1.692360  2.034515  2.444794
34   1.306952  1.690924  2.032245  2.441150
35   1.306212  1.689572  2.030108  2.437723
36   1.305514  1.688298  2.028094  2.434494
37   1.304854  1.687094  2.026192  2.431447
38   1.304230  1.685954  2.024394  2.428568
39   1.303639  1.684875  2.022691  2.425841
40   1.303077  1.683851  2.021075  2.423257

6.6.3 The Joint Distribution of the Sample Mean and Variance

Theorem 6.8. Let X_1, X_2, ..., X_n be a random sample from the normal distribution Norm(μ, σ).

Let X̄ be the sample average and S² the sample variance. Then

1) X̄ ~ Norm(μ, σ/√n);
2) (n − 1)S²/σ² ~ Chisq(df = n − 1);
3) X̄ and S² are independent random variables;
4) T = √n(X̄ − μ)/S has a student-t distribution with n − 1 degrees of freedom.

Proof: Conclusion (1) follows from results of the previous section. Let us accept (2) and (3) for the time being. To prove (4), let Z = √n(X̄ − μ)/σ and let W = (n − 1)S²/σ². Then Z and W satisfy the conditions of the definition of a student-t random variable with ν = n − 1 and

T = Z/√(W/ν) = √n(X̄ − μ)/S.

We will prove (2) and (3) only for the case n = 2. The proof can be generalized to any n > 2, but it requires advanced linear algebra. We have X_1 = μ + σZ_1 and X_2 = μ + σZ_2, with Z_1 and Z_2 independent Norm(0, 1) variables. Then

X̄ = μ + σZ̄, and Z̄ = (Z_1 + Z_2)/2.

The sample variance S²_X of {X_1, X_2} is σ² times the sample variance S²_Z of {Z_1, Z_2}. Furthermore, it is easy to show that

S²_Z = (Z_1 − Z_2)²/2.

Now apply the transformations (6.15) with θ = π/4 to (Z_1, Z_2). The transformed variables

Z′_1 = (Z_1 + Z_2)/√2 and Z′_2 = (Z_2 − Z_1)/√2

are independent and standard normal. Thus,

(n − 1)S²_X/σ² = S²_Z = Z′_2² ~ Chisq(df = 1 = n − 1)

and √2(X̄ − μ)/σ = Z′_1 ~ Norm(0, 1) independently.
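Conclusion (4) can be checked informally by simulation (our own sketch, not part of the text; the values n = 8, μ = 5 and σ = 2 are arbitrary):

> tstats=replicate(2000,{x=rnorm(8,mean=5,sd=2); sqrt(8)*(mean(x)-5)/sd(x)})
> qqplot(qt(ppoints(2000),df=7),tstats); abline(0,1)

The simulated values of T should track the t(df = 7) quantiles closely.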

6.6.4 Exercises

1. Use R's "qchisq()" function and also the table provided above to find (a) the 95th percentile of Chisq(df = 24), (b) the 90th percentile of Chisq(df = 10).

2. Use R's "qt()" function and also the table provided above to nd (a) the 99th percentile of student-t with df= 15, (b) the 95th percentile of student-t with df= 30.

3. Let W ~ Chisq(df = ν). From known facts about gamma distributions, find E(W) and var(W).

4. Let W ~ Chisq(df = ν). Using the central limit theorem and Corollary 2 of Theorem (6.7), show that (W − ν)/√(2ν) approaches standard normal as ν → ∞. Confirm this by using R's "rchisq()" function with a large number of degrees of freedom to generate a random sample of size 100 from the chi square distribution.

Then make a normal quantile plot and a histogram of the simulated sample values.

Chapter 7 Statistical Inference for a Single Population

7.1 Introduction

In this chapter we begin the study of statistical inference. This is the science of inferring characteristics of an entire population from the information contained in a sample from that population. Statistical inferences are not just bald assertions about population characteristics. They must be accompanied by statements quantifying the probable accuracy of those assertions. Thus, probability is an essential ingredient of statistical inference. The probability statements accompanying inferences are derived from prior knowledge or experience, knowledge particular to the subject at hand, and theoretical assumptions concerning the processes that produce the values of population variables. Of course the probability statements must be internally consistent and satisfy the mathematical properties of probability as outlined in Chapter 3.

Statistical inference is divided into two broad categories, estimation and hypothesis testing. They are not mutually exclusive. We turn to estimation first.

7.2 Estimation of Parameters

A random variable X, whether it comes from sampling from a finite population or from an idealized random experiment, has a distribution which depends on one or more unknown parameters. Often we assume that the distribution is from a family whose members are completely determined by the values of a few parameters, for example, the family of normal distributions, whose members are determined by their means and standard deviations. In other cases, we make only minimal assumptions about the distributions, such as the existence of a variance.

7.2.1 Estimators

Let X_1, X_2, ..., X_n be a random sample from a distribution that depends in part on an unknown parameter θ. An estimator of θ is a function θ̂(X_1, X_2, ..., X_n) of the sample values whose value is taken as an estimate of the value of θ. θ̂ must be computable from the data X_1, ..., X_n alone, so it cannot involve θ or other unknown parameters in its formula or in any way except through the data values.

Also, observe that θ̂(X_1, ..., X_n) = θ̂ is a random variable, so it has a distribution derived from the distribution of X. Therefore, the distribution of θ̂ also depends on the unknown value of θ. Its distribution has a median, presumably a mean and a variance, and other general features of distributions.

Examples:

1. The sample mean X̄ is an estimator of the population mean μ.

2. The sample variance S² is an estimator of the population variance σ².

3. The sample proportion of successes p̂ = Y/n, where Y ~ Binom(n, p), is an estimator of the population proportion of successes p.

4. Sample quantiles are estimators of quantiles of a distribution.

7.2.2 Desirable Properties of Estimators

An unbiased estimator is one which, on average, gives the correct value of the unknown parameter.

More precisely,

Definition 7.1. An estimator θ̂ of a parameter θ is unbiased if E_θ(θ̂) = θ for all values of θ. The bias of θ̂ is the mean estimation error:

bias(θ̂, θ) = E_θ(θ̂) − θ.

θ̂ is asymptotically unbiased if its bias approaches zero as the sample size n → ∞.

The subscripts θ in the equations above are to remind you that the distribution (and therefore the expected value) of θ̂ depends on the unknown θ and that the equations must be true for any value it may have.

All other things being equal, it is nice for an estimator to be unbiased. However, there are many natural and useful estimators that are biased. For example, the sample standard deviation is a biased estimator of the population standard deviation. It is more important that an estimator be asymptotically unbiased. If it is not, there is a persistent systematic error which cannot be reduced by increasing the sample size. The sample standard deviation is asymptotically unbiased.
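The bias of the sample standard deviation is easy to see by simulation (a sketch we have added, using samples from Norm(0, 1) so that the true σ is 1):

> mean(replicate(10000,sd(rnorm(5))))      # average of S for n = 5
> mean(replicate(10000,sd(rnorm(100))))    # average of S for n = 100

The first average falls visibly short of σ = 1, while the second is nearly 1, illustrating a bias that shrinks as n grows.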

Another consideration is the accuracy of an estimator, as measured by the spread of its distribution about the true parameter value. When comparing two unbiased estimators of the same parameter, the one with smaller variance is better than the one with larger variance. Furthermore, we want the variance of an estimator to approach zero as n → ∞. If this is true for an asymptotically unbiased estimator then for large sample sizes, the probability is high that the estimated value of θ will be very close to the true value. All the common estimators that we will study are asymptotically unbiased.

For many of them it can be shown that their variances approach zero at least as fast as the variance of any other estimator of the same parameter. Therefore, they are in a sense nearly the best possible estimators.

7.3 Estimating a Population Mean

Let X be a variable with unknown mean μ and standard deviation σ and let X_1, X_2, ..., X_n be a random sample from the distribution of X. We will assume first that σ is known. Realistically, it probably is not known but we will postpone that issue for now. X̄ is an unbiased estimator of μ:

E(X̄) = E((1/n) Σ_{i=1}^n X_i) = (1/n) Σ_{i=1}^n E(X_i) = (1/n) Σ_{i=1}^n μ = μ.

Since the samples are independent, its variance goes to zero as n → ∞.

var(X̄) = var((1/n) Σ_{i=1}^n X_i) = (1/n²) Σ_{i=1}^n var(X_i) = (1/n²) Σ_{i=1}^n σ² = σ²/n.

sd(X̄) = σ/√n.

Therefore, for large sample sizes n, X̄ will be close to μ with high probability. Let us examine this statement more closely in the case where X̄ is normally distributed and its standardized value

Z = √n(X̄ − μ)/σ

has a standard normal distribution.

Let ε > 0 be the accuracy we would like to achieve in estimating μ, i.e., the maximum tolerable error of estimation or margin of error. Let 1 − α, where α ∈ (0, 1), be the probability with which we would like to achieve it. Then

Pr(|X̄ − μ| ≤ ε) = Pr(√n|X̄ − μ|/σ ≤ √nε/σ) = Pr(|Z| ≤ √nε/σ) = Φ(√nε/σ) − Φ(−√nε/σ) ≥ 1 − α

provided

√nε/σ ≥ Φ⁻¹(1 − α/2) = z_{α/2}.

Therefore, if

n ≥ z²_{α/2} σ²/ε²     (7.1)

we achieve accuracy to within ε units with probability at least 1 − α.

This is a rule for determining the sample size required to achieve desired experimental accuracy. It depends on the normality of X̄ and knowledge of σ. If the samples are not from a normal distribution, it can still be used as long as n is also large enough for the central limit theorem to apply to the distribution of X̄. Conventionally n ≥ 30 is considered large enough in most situations. The rule is useful also if the population standard deviation σ is known only approximately.

Example 7.1. Engineers would like to estimate the mean sulfur dioxide concentration in emissions from a particular industrial process. They would like their estimate to be accurate to 0.1 ppm with a probability of at least 95%. The standard deviation of measurements is at most 2 ppm. How many independent measurements of SO2 concentration should they make?

Solution: Since 1 − α = 0.95, z_{α/2} = z_{.025} = 1.96. We are given that ε = 0.1 and σ ≤ 2. Therefore, rounding up to the nearest integer, from (7.1) we need

n ≥ 1.96² × 2²/0.1² = 1537

samples. This is large enough that normality of X̄ should be of no concern.
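The arithmetic is easy to reproduce in R (our own check, not part of the example), using the exact normal quantile in place of 1.96:

> ceiling((qnorm(0.975)*2/0.1)^2)
[1] 1537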

7.3.1 Confidence Intervals

From the definition of z_{α/2} = Φ⁻¹(1 − α/2) we have Pr(|Z| < z_{α/2}) = 1 − α.

Assuming that X̄ is normally distributed,

1 − α = Pr(−z_{α/2} < √n(X̄ − μ)/σ < z_{α/2}).

We can rearrange these inequalities as follows.

1 − α = Pr(X̄ − z_{α/2} σ/√n < μ < X̄ + z_{α/2} σ/√n).

The interval

X̄ ± z_{α/2} σ/√n     (7.2)

with random endpoints includes the unknown parameter μ with probability 1 − α. It is called a 100(1 − α)% confidence interval for μ. A confidence interval is more informative than a single point estimate of a parameter because it gives a range of possible values for the parameter along with a statement of the probability that such an interval would include the parameter in repeated sampling.
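A minimal sketch of (7.2) in R (our own illustration with made-up numbers: simulated data of size 25, known σ = 10, true mean 50, and a 95% confidence level):

> xs=rnorm(25,mean=50,sd=10)
> mean(xs)+c(-1,1)*qnorm(0.975)*10/sqrt(25)

The two numbers returned are the lower and upper confidence limits; about 95% of intervals constructed this way would cover the true mean 50.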

All of the above is based on the assumption that we have prior knowledge of the population standard deviation σ, at least to a good approximation. When this is not the case, and the sample size is large, there is a modification of the central limit theorem that provides a solution.

Theorem 7.1. Let X̄ and S be the sample mean and standard deviation from a sample of size n from a distribution with mean μ and standard deviation σ. Then the distribution of

T = √n(X̄ − μ)/S

approaches the standard normal distribution as n → ∞.

When the population standard deviation is unknown and the sample size is large, an approximate 100(1 − α)% confidence interval for μ is

X̄ ± z_{α/2} S/√n.     (7.3)

When σ is unknown, n must be somewhat larger for a good approximation than when σ is known.

Below is a normal quantile plot of 100 replications of T for a sample of size n = 100 from the exponential distribution with mean 1. This is a very non-normal, skewed distribution. The plot indicates that n = 100 should be a large enough sample size in most applications. If the normal approximation is valid, we would expect that in 95% of the 100 replications, the lower confidence limit (lcl) would be less than the true mean of 1 and the upper confidence limit (ucl) would be greater than 1. The actual number is shown in the output.

> expsamp=matrix(rexp(10000),nrow=100)
> xbars=apply(expsamp,1,mean)
> sds=apply(expsamp,1,sd)
> zs=sqrt(100)*(xbars-1)/sds
> lcl=xbars-1.96*sds/sqrt(100)
> ucl=xbars+1.96*sds/sqrt(100)
> sum(lcl < 1 & 1 < ucl)
[1] 93
> qqnorm(zs,main="Standardized Exponential Averages")
> qqline(zs)

[Figure: normal quantile plot titled "Standardized Exponential Averages"]

7.3.2 Small Sample Confidence Intervals for a Normal Mean

The preceding results apply only when the sample size is large enough to assume that X̄ has a normal distribution. If the sample size is not large, but it is known that the samples are from a nearly normal distribution, there is a confidence interval for μ based on the student-t distribution.

From Theorem 8 of Chapter 6, if X_1, ..., X_n is a sample from Norm(μ, σ), the random variable

T = √n(X̄ − μ)/S

has a student-t distribution with n − 1 degrees of freedom. If t_{α/2}(n − 1) denotes the 1 − α/2 quantile of this distribution,

Pr(−t_{α/2}(n − 1) < T < t_{α/2}(n − 1)) = 1 − α.

Rearranging in the same way as before,

1 − α = Pr(X̄ − t_{α/2}(n − 1) S/√n < μ < X̄ + t_{α/2}(n − 1) S/√n).

Therefore, a 100(1 − α)% confidence interval for μ is

X̄ ± t_{α/2}(n − 1) S/√n.     (7.4)

For large samples there is almost no difference between the student-t and normal confidence intervals. The number t_{α/2}(n − 1) is almost the same as z_{α/2}.

Example 7.2. A sample of 10 middle school teachers from the Houston area with exactly 5 years teaching experience was obtained. Their salaries are shown below. Find a 90% confidence interval for the mean salary after 5 years in the whole Houston area. (This is made-up data. The mean of the normal distribution it was generated from is 50,000.)

50333 43683 50290 40389 49324 46840 50849 40397 53249 53325

Solution: Normal quantile plots for only 10 observations can be misleading. This one does not show any pronounced non-normality, so we shall assume that the distribution is normal. The mean of the observations is 47867.90 and the standard deviation is 4853.46. For 90% confidence, t_{α/2}(n − 1) = t_{.05}(9) = 1.833. This comes from the student-t table in Chapter 6 or from R with the command

> qt(.95,9)
[1] 1.833113

The 90% confidence interval is

X̄ ± t_{α/2} S/√n = 47867.90 ± 1.833 × 4853.46/√10 = 47867.90 ± 2813.29

or (45054.61, 50681.19).

R has a function that will find student-t confidence intervals from raw data for any given confidence level. It is "t.test", and it returns the confidence interval along with the results of a hypothesis test about the population mean. We will discuss hypothesis testing later in this chapter.

[Figure: normal quantile plot of the salary data]

> salaries
 [1] 50333 43683 50290 40389 49324 46840 50849 40397 53249 53325
> t.test(salaries,conf.level=.90)

        One Sample t-test

data:  salaries
t = 31.188, df = 9, p-value = 1.756e-10
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 45054.44 50681.36
sample estimates:
mean of x
  47867.9

The "t.test" function can be used in the large sample situation even with non-normal data. The confidence interval it returns is almost the same as the normal confidence interval because the student-t distribution converges to the standard normal distribution as the degrees of freedom increases without bound. Both the student-t procedure (7.4) and the normal procedure (7.3) are very robust against non-normality as long as the underlying distribution is symmetric and unimodal.

7.3.3 Exercises

1. A sample of size 36 from a normally distributed population variable with population standard deviation 20 had a sample mean of 88. Find a 90% confidence interval for the population mean.

2. A sample of size 90 from a population variable had a sample mean of 4.74 and a sample standard deviation of 0.71. Find a 95% confidence interval for the population mean.

3. Import the "reacttimes" data set and consider the 50 observations of the variable "Times" to be a sample from a larger population. Find a 99% con dence interval for the population mean. Construct a normal quantile plot and comment on the appropriateness of the procedure.

4. The Cauchy distribution with density function

f(x) = 1/(π(1 + x²))

does not have a mean. Show that the mean does not exist. The distribution is bell-shaped and symmetric about its median 0. Repeat the computations for the plot "Standardized Exponential Averages" above, but generate samples from the Cauchy distribution using the "rcauchy" command instead of "rexp". Use a sample size of 50. Count the number of the 100 intervals generated that contain the number 0. Display the confidence intervals with the command

> cbind(lcl,ucl)

Comment on the number of intervals that contain 0 and the lengths of the intervals. Repeat the exercise with a sample size of 200. What do you observe about the lengths of the intervals?

5. We wish to estimate the population mean of a variable that has standard deviation 70.5. We want to estimate it with an error no greater than 5 units with probability 0.99. How big a sample should we take from the population? What happens if the standard deviation and the margin of error are both doubled?

6. The data frame "Loblolly" is included with the datasets library of R. Bring it into your R workspace with > data(Loblolly,package="datasets") or just click on it under the "Packages" tab in Rstudio. This data set has measurements of height and age of 84 Loblolly pine trees. Select a sample of 10 of the 84 tree records with the command > mytrees=Loblolly[sample(84,10), ] Attach your personal sample of trees to your search path by > attach(mytrees) Assume that the ratio of height to age is normally distributed. Find a 90% con dence interval for the population mean of this ratio, using your sample of 10 trees. You can address the sample values of this variable simply by the R expression > height/age for example.

7. Assess the normality of "height/age", considering the 84 trees in Loblolly as a sample from a much larger population. Using this sample of 84 trees, find a 90% confidence interval for the mean ratio height/age in the larger population.

8. Henry Cavendish¹ made 29 measurements of the specific gravity of the earth. They are provided in the file "Cavendish.txt" at www.math.uh.edu/ charles/data. Import this data into R and find a 95% confidence interval for the specific gravity of the earth. Assume Cavendish's measurements are a random sample from a population whose mean is the true specific gravity.

According to NASA's Earth Fact Sheet

http://nssdc.gsfc.nasa.gov/planetary/factsheet/earthfact.html

the specific gravity of Earth is 5.541.

¹ Henry Cavendish, 1731-1810: British chemist and physicist. An important figure in the history of science.

7.4 Estimating a Population Proportion

Let Y be a random variable with a binomial distribution Binom(n, p), where n is the number of trials and p is the success probability. A random variable like Y most often arises from sampling n subjects from a population with a proportion p of "successes". Sampling is with replacement, although if the population is much larger than the sample size it matters very little whether the samples are obtained with or without replacement. The random variable p̂ = Y/n is the sample proportion of successes. It is an unbiased estimator of p.

E(p̂) = (1/n) E(Y) = (1/n)(np) = p.

Its variance and standard deviation are

var(p̂) = (1/n²) var(Y) = np(1 − p)/n² = p(1 − p)/n

sd(p̂) = √(p(1 − p)/n)

In fact, estimating a population proportion is just a special case of estimating a population mean. p̂ is the sample average of independent samples from the Bernoulli distribution with success probability p.

The main complication is that the population variance p(1 p) cannot be known without also knowing the mean p. Therefore, the assumption of a known population variance does not apply, except in one application described below.

Since we are dealing with a sample average, the central limit theorem applies and the distribution of the standardized sample success proportion

Z = √n(p̂ − p)/√(p(1 − p))     (7.5)

approaches standard normal as n → ∞. We will assume throughout this discussion that n is large enough for the normal approximation to be accurate. In fact, we assume that n is large enough for the enhanced version of the central limit theorem, Theorem 7.1, to apply. This implies that

Z′ = √n(p̂ − p)/√(p̂(1 − p̂))     (7.6)

is approximately standard normal also (with a somewhat larger value of n).

7.4.1 Choosing the Sample Size

Let ε be the maximum tolerable error in estimating p with p̂ and let 1 − α be the desired probability of achieving that degree of accuracy. From (7.5),

Pr(|p̂ − p| ≤ ε) = Pr(√n|p̂ − p|/√(p(1 − p)) ≤ √nε/√(p(1 − p)))
= Φ(√nε/√(p(1 − p))) − Φ(−√nε/√(p(1 − p)))
≥ Φ(z_{α/2}) − Φ(−z_{α/2}) = 1 − α

provided

√nε/√(p(1 − p)) ≥ z_{α/2}.

Solving for n,

n ≥ z²_{α/2} p(1 − p)/ε².     (7.7)

This is unsatisfactory because one would have to know p to use it. However, it might be that a good guess at p is already known and the purpose of the sampling experiment is simply to refine that guess.

If p* is a prior estimate of p, it can be substituted into the right hand side of (7.7), yielding

n ≥ z²_{α/2} p*(1 − p*)/ε².     (7.8)

This procedure is common in public opinion polling, especially during campaigns, when a candidate's approval rating p is updated every few days.

Another approach is to replace the right hand side of (7.7) by something larger. The function p(1 − p) has a maximum value of 1/4 when p = 1/2. Substituting p = 1/2 into the right hand side of (7.7),

n ≥ z²_{α/2}/(4ε²).     (7.9)

Example 7.3. Candidates A and B are competing in their party primary. Candidate A announced weeks ago and has been conducting polls frequently to determine his favorability rating (the percentage of all prospective voters who view him favorably). His favorability rating in the last poll was 0.65.

Candidate B just announced that she is running and has no polling history. Each wants to determine his or her favorability rating to within 3 percentage points. How many prospective voters should each candidate sample to achieve 3 percentage point accuracy with probability 0.95?

Solution: A has a prior estimate p* = 0.65 for his success probability, so he can use (7.8). With z_{α/2} = z_{0.025} = 1.96 and ε = 0.03, this results in

n ≥ 1.96² × 0.65 × 0.35/0.03² = 971.

Candidate B has no prior estimate of p, so she uses (7.9).

n ≥ 1.96²/(4 × 0.03²) = 1067.
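These bounds are easy to reproduce in R (our own check, not part of the example); the two expressions below evaluate to approximately 971 and 1067:

> qnorm(0.975)^2*0.65*0.35/0.03^2
> qnorm(0.975)^2/(4*0.03^2)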

7.4.2 Confidence Intervals for p

Assuming that the sample size n is large, we can apply (7.5) and say

Pr(p̂ − z_{α/2}√(p(1 − p)/n) < p < p̂ + z_{α/2}√(p(1 − p)/n)) = 1 − α.

This is unusable as a confidence interval because the unknown p is part of the expression for the end points. Below are 3 possible ways of modifying it.

Method 1 - substitute 1/2 for p. This gives the confidence interval

p̂ ± z_{α/2}/(2√n)     (7.10)

The width of this interval is greater than those of the other intervals in this list. Therefore, it sacrifices a bit of accuracy but achieves a confidence level a bit greater than the nominal level of 100(1 − α)%.

Method 2 - substitute p̂ for p. From (7.6) the distribution of the random variable Z′ approaches standard normal as n → ∞. Thus,

Pr(p̂ − z_{α/2}√(p̂(1 − p̂)/n) < p < p̂ + z_{α/2}√(p̂(1 − p̂)/n)) = 1 − α

approximately, and

p̂ ± z_{α/2}√(p̂(1 − p̂)/n)     (7.11)

is an approximate 100(1 − α)% confidence interval for p. This is probably the most often used confidence interval for p but it has been discovered recently² that the rate of convergence of the distribution of Z′ to standard normal is not uniform, even when extreme values of p near 0 or 1 are avoided. For certain values of the true proportion p the actual confidence level is significantly less than the nominal confidence unless n is quite large.

Method 3 - solve a quadratic inequality for p. We know from (7.5) that 1 − α is the probability that

−z_{α/2} ≤ √n(p̂ − p)/√(p(1 − p)) ≤ z_{α/2}.

Another way of writing this pair of inequalities is

n(p̂ − p)²/(p(1 − p)) ≤ z²_{α/2},

or

n(p − p̂)² ≤ z²_{α/2} p(1 − p).

This is a quadratic inequality in p which can be solved by elementary algebra. After a great deal of algebraic manipulation the solution set is the set of all p in the confidence interval M ± H. M is the midpoint of the interval and is given by

M = (p̂ + z²_{α/2}/(2n)) / (1 + z²_{α/2}/n).     (7.12)

H is the half-width of the interval.

H = z_{α/2} √(p̂(1 − p̂)/n + z²_{α/2}/(4n²)) / (1 + z²_{α/2}/n)     (7.13)

Unlike the other intervals this confidence interval is not centered at p̂. Rather, it is centered at a point between p̂ and 1/2. For large values of n the center is very close to p̂. The expression inside the radical in the half-width H is a weighted sum of the estimated variance p̂(1 − p̂)/n and the value 1/(4n) that we substitute for the unknown variance in Method 1. It can be shown that the endpoints M ± H of this interval always lie between 0 and 1. This is not true of the other intervals. Despite its complexity, this confidence interval is considered superior to the others.

² L.D. Brown, T. Cai, and A. DasGupta, "Interval Estimation for a Binomial Proportion", Statistical Science 16 (2), 101-133.

Method 3 is implemented in R through the "prop.test" function. For example, suppose that 12 successes were observed in n = 30 samples of a Bernoulli random variable. You should calculate M ± H by hand to verify that this is the Method 3 confidence interval.
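A sketch of that hand calculation in R (our own addition, applying formulas (7.12) and (7.13) at the 90% confidence level); the result should agree with the prop.test interval shown next:

> phat=12/30; n=30; z=qnorm(0.95)
> M=(phat+z^2/(2*n))/(1+z^2/n)
> H=z*sqrt(phat*(1-phat)/n+z^2/(4*n^2))/(1+z^2/n)
> c(M-H,M+H)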

> prop.test(x=12,n=30,correct=F,conf.level=.90)

        1-sample proportions test without continuity correction

data:  12 out of 30, null probability 0.5
X-squared = 1.2, df = 1, p-value = 0.2733
alternative hypothesis: true p is not equal to 0.5
90 percent confidence interval:
 0.2671262 0.5494187
sample estimates:
  p
0.4

"prop.test" is designed for testing hypotheses about p as well as for calculating confidence intervals.

Unless you specify otherwise, as we did here, "prop.test" will introduce a continuity correction that changes the answers slightly.

7.4.3 Exercises

1. The Food and Drug Administration monitors the production line of a breakfast cereal company to determine what proportion of its boxes of cereal contain insect parts. The FDA would like to know that proportion to within 5 percentage points with 95% confidence. How many boxes of cereal should they sample?

2. Suppose the company has been inspected before and that previously the proportion of cereal boxes with insect parts was 0.15. The FDA wants to be as unobtrusive as possible. How many boxes of cereal should they sample?

3. The FDA sampled 60 boxes of cereal and found 12 with insect parts. Find a 95% confidence interval for the true proportion using all three methods described above.

4. Use R to answer question 3 with Method 3.

7.5 Estimating Quantiles

Let X_1, X_2, ..., X_n be a random sample of a variable X with a cumulative distribution F. We shall assume that F is continuous and that n is large enough for the central limit theorem to apply. We are interested in estimating the pth quantile of F, θ = q(X, p) = F⁻¹(p) = min{x | F(x) ≥ p}.

For a given real number y let Y be the number of the samples that satisfy X_i ≤ y, and let

F̂(y) = Y/n.

F̂(y) is simply the sample proportion of "successes" X_i ≤ y, but as a function of y it is a bona fide cumulative distribution function. It is called the empirical distribution function and is a sample estimate of the cumulative distribution function F of the variable X. Its pth quantile F̂⁻¹(p) is the pth quantile of the samples and an estimator of θ = F⁻¹(p). Since F̂(θ) is a sample success proportion with mean p,

1 − α = Pr(p − z_{α/2}√(p(1 − p)/n) ≤ F̂(θ) < p + z_{α/2}√(p(1 − p)/n))
= Pr(F̂⁻¹(p − z_{α/2}√(p(1 − p)/n)) ≤ θ < F̂⁻¹(p + z_{α/2}√(p(1 − p)/n)))

Theorem 7.2. For large samples, a 100(1 − α)% confidence interval for the pth quantile θ = F⁻¹(p) of a continuous distribution F is

( F̂⁻¹(p − z_{α/2}√(p(1 − p)/n)), F̂⁻¹(p + z_{α/2}√(p(1 − p)/n)) )     (7.14)

In particular, a 100(1 − α)% confidence interval for the median θ = F⁻¹(.5) is

( F̂⁻¹(0.5 − z_{α/2}/(2√n)), F̂⁻¹(0.5 + z_{α/2}/(2√n)) )     (7.15)

F̂⁻¹(y) is the sample yth quantile, which is returned by R's "quantile" function. The intervals calculated may depend on the exact rules used by software to find the sample quantiles. In the example below, we have included the "type=1" argument to the quantile function to ensure that R calculates F̂⁻¹ as defined above.

Example 7.4. The data frame "test.vs.grade" has placement test scores and semester grades for 179 students. Twenty of the test scores are missing. We will treat the known 159 values of the variable "Test" as a sample from a larger population and find a 95% interval for the median of the population test scores.

> attach(test.vs.grade)
> summary(Test)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  24.00   70.00   80.00   76.55   88.00  100.00      20
> quantile(Test,0.5-1.96/(2*sqrt(159)),na.rm=T,type=1)
42.22809%
       76
> quantile(Test,0.5+1.96/(2*sqrt(159)),na.rm=T,type=1)
57.77191%
       84

So, the 95% confidence interval is from 76 to 84.

7.5.1 Exercises

1. Find 90% confidence intervals for the first and third quartiles of test scores.

2. Generate a sample of size 50 from the exponential distribution with mean 1. Use the "rexp" function in R. Find a 95% confidence interval for the median of the distribution. Is the true median in the interval? Repeat this several times. Repeat several more times with n = 100.

3. Repeat exercise 2 with a confidence level of 90% and a sample of size 50 from the Cauchy distribution with median 0.

4. Use the Loblolly data on 84 pine trees to find 90% confidence intervals for the quartiles and median of the population variable height/age.

5. Use Cavendish's data to find a 95% confidence interval for the specific gravity of the earth, assuming that his data is a random sample from a population whose median is the true specific gravity. This is a small data set, so the large sample confidence interval may not be reliable.

7.6 Estimating the Variance and Standard Deviation

The sample variance

S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²

from a sample X_1, X_2, ..., X_n of values of a numeric variable X is an unbiased estimator of σ² = var(X). To see this, note that

(n − 1)S² = Σ_{i=1}^n (X_i − X̄)² = Σ_{i=1}^n X_i² − nX̄².

Thus,

(n − 1)E(S²) = Σ_{i=1}^n E(X_i²) − nE(X̄²).

Also,

E(X_i²) = var(X_i) + E(X_i)² = σ² + μ²

and

E(X̄²) = var(X̄) + E(X̄)² = σ²/n + μ².

Putting all this together, we have

(n − 1)E(S²) = (n − 1)σ², and E(S²) = σ².

The sample standard deviation S is not an unbiased estimator of the population standard deviation σ, but it is asymptotically unbiased and usually is the estimator of choice for σ.

For samples from a normal distribution, (n − 1)S²/σ² has a chi square distribution with n − 1 degrees of freedom. This enables us to find confidence intervals for σ² and σ. Suppose we want a 100(1 − α)% confidence interval for σ². Let q(α/2) and q(1 − α/2) denote the α/2 and 1 − α/2 quantiles of the chi square distribution with n − 1 degrees of freedom. Then,

1 − α = Pr(q(α/2) < (n − 1)S²/σ² < q(1 − α/2))
= Pr(1/q(1 − α/2) < σ²/((n − 1)S²) < 1/q(α/2))
= Pr((n − 1)S²/q(1 − α/2) < σ² < (n − 1)S²/q(α/2))

To get a confidence interval for σ, simply take the square roots of the end points of the confidence interval for σ². Unfortunately, these confidence intervals are rather sensitive to departures from normality, so use them with caution.

Example 7.5. The R data set "airquality" has measurements of Ozone levels, wind speed, solar radiation and temperature for 153 days in New York. Wind speed measurements seem to be approximately normal, so we will apply the method above to find a 90% confidence interval for the variance and standard deviation of wind speeds in New York.

> attach(airquality)
> lcl=152*var(Wind)/qchisq(.95,152)
> ucl=152*var(Wind)/qchisq(.05,152)
> c(lcl,ucl)
[1] 10.37878 15.15278
> sqrt(.Last.value)
[1] 3.221612 3.892657

7.7 Hypothesis Testing

A statistical hypothesis is simply a statement about one or more distributions or random variables.

Statistical hypotheses usually specify or restrict the values of parameters of distributions. Researchers conduct controlled experiments with the goal of confirming a scientific hypothesis, which may be expressed in terms of the parameters of the distributions of experimental data. This is called the research hypothesis, and it asserts that there is a real experimental effect due to the pre-established conditions of the experiment. Opposed to the research hypothesis is the null hypothesis, which asserts that there is no real experimental effect and any apparent signal in the data is merely random variation.

In other words, it is the hypothesis of a null experimental effect. The burden of proof is on the research hypothesis because it is the one that makes a definite, positive assertion about the reality of a phenomenon, and that is the logic of experimental science. The research hypothesis is also called the alternative hypothesis.

7.7.1 Test Statistics, Type 1 and Type 2 Errors

Let H_0 denote the null hypothesis and H_1 the alternative or research hypothesis. These are assertions about the distribution of a population variable X, from which a sample X_1, X_2, ..., X_n is obtained.

Based on the data, a decision is made either to accept H_1, thereby rejecting H_0, or not to accept H_1 (not reject H_0). We do not usually say that we accept H_0; we either reject it or do not reject it.

Type 1 error - to reject H_0 when it is true, in other words, to accept H_1 when it is not true.

Type 2 error - to not reject H_0 when it is false, i.e., to not accept H_1 when it is true.

The data enters into the decision to reject or not reject H_0 through a test statistic, a function λ = λ(X_1, X_2, ..., X_n) of the data which is supposed to "point toward" H_1 and away from H_0 in the sense that the greater the value of λ the greater the degree of support for H_1. The test statistic is a random variable and its distribution must be known if H_0 is true. The hypothesis H_1 is accepted if λ is sufficiently large, larger than a critical value λ_α.

Decision rule: Reject H_0 (accept H_1) if λ(X_1, ..., X_n) > λ_α.

The probability of type 1 error then is

α = Pr(λ > λ_α | H_0).     (7.16)

The probability of type 2 error Pr(λ ≤ λ_α | H_1) usually cannot be calculated without more information because H_1 does not completely specify the distribution of λ. The notation Pr(· | H_0) is convenient in this context but it does not really refer to a conditional probability. H_0 is not an event.

The classical Neyman-Pearson³ paradigm for hypothesis testing is to hold the probability of type 1 error fixed at some small value α (conventionally, α = .01, .05, .10), adjust λ_α according to (7.16), and make a definite decision as to whether or not to reject H_0. If H_0 is rejected the value of λ is said to be significant at level α. We insist that α be small because we do not want to accept H_1 unless the evidence against H_0 is strong.

³ Jerzy Neyman 1894-1981 and Egon Pearson 1895-1980: eminent Polish and British mathematical statisticians.

7.8 Hypotheses About a Population Mean

Suppose that X_1, X_2, ..., X_n is a random sample from a distribution with mean μ and known standard deviation σ. If there is no true experimental effect, the null hypothesis is

H_0: μ = μ_0,

where μ_0 is some given number. The alternative is

H_1: μ > μ_0.

The sample average X̄ is a test statistic. Everyone would agree that larger values of X̄ offer stronger support for H_1: μ > μ_0. We will assume that the distribution of X̄ is normal, either because the samples are from a normal distribution or because the sample size is large enough for the central limit theorem to apply. Then if H_0 is true we know that X̄ ~ Norm(μ_0, σ/√n). Instead of X̄, we will take its standardized value Z = √n(X̄ − μ_0)/σ as our test statistic. If α is the desired probability of type 1 error, we choose the critical value z_α of Z so that Pr(Z > z_α | H_0) = α; i.e., z_α = Φ⁻¹(1 − α). We reject H_0 if Z > z_α; in other words, if

X̄ > μ_0 + z_α σ/√n.

The alternative H_1: μ > μ_0 is called a one-sided alternative. The alternative hypothesis H_1: μ < μ_0 is also one-sided. To force it into our template, we could choose the test statistic λ = −Z and λ_α = z_α. Then we reject H_0 if −Z > z_α, equivalently, if Z < −z_α or

X̄ < μ_0 − z_α σ/√n.

The alternative hypothesis H_1: μ ≠ μ_0 is two-sided. To test it against H_0: μ = μ_0 we use the test statistic λ = |Z| with critical value λ_α = z_{α/2} and reject H_0 if

|X̄ − μ_0| > z_{α/2} σ/√n.

Example 7.6. Test automobiles of a given type burning fuel of a given type are known to have a mean fuel efficiency of 23 mpg with a standard deviation of 3 mpg. A new gasoline additive is being tested which supposedly improves efficiency. A sample of 36 test cars were fueled with the newer gasoline and their fuel efficiencies were measured. Their average was 24.5 mpg. At a significance level of α = 0.05 can we conclude that the additive does improve efficiency?

Solution: We will assume that sample averages X̄ from samples of size n = 36 are normally distributed. We are testing the research hypothesis H_1: μ > 23 against the null hypothesis H_0: μ = 23. We also assume that the variance of fuel efficiency with the additive is the same as the variance without it. We accept H_1 if

X̄ > μ_0 + z_α σ/√n = 23 + 1.645 × 3/√36 = 23.8225

Since it is indeed true that X̄ > 23.8225, we accept H_1 and conclude that the additive improves efficiency.
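The critical value is easy to reproduce in R (our own check, not part of the example):

> 23+qnorm(0.95)*3/sqrt(36)

This returns approximately 23.82, and the observed average 24.5 exceeds it.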

7.8.1 Tests for the mean when the variance is unknown

When testing hypotheses about the mean of a variable X the variance will not usually be known. However, for large samples, the random variable

T = √n(X̄ − μ_0)/S,

where S is the sample standard deviation, has an approximate standard normal distribution under the hypothesis H_0: μ = μ_0. Therefore, T or |T| can be used as a test statistic for this null hypothesis. For the one-sided alternative H_1: μ > μ_0 we reject H_0 when T > z_α; that is, when

X̄ > μ_0 + z_α S/√n.     (7.17)

For the two-sided alternative H_1: μ ≠ μ_0, we reject H_0 when |T| > z_{α/2}; i.e., when

|X̄ − μ_0| > z_{α/2} S/√n.     (7.18)

Example 7.7. The verbal IQs of 200 school children of similar age and circumstances were measured. The average was 11.30 (not on a 100 point scale), and the sample standard deviation was 2.26. The purpose of the study was to determine whether the population mean IQ is different from 11, which was the mean for children of the same characteristics 10 years previously. At a significance level of α = 0.05, can we conclude that the mean is different? What if α = 0.10?

Solution: The observed value of T is

T = √n(X̄ − μ_0)/S = √200(11.30 − 11)/2.26 = 1.877.

We are not assuming that scores are normally distributed, but n = 200 should be plenty large enough for T to be nearly standard normal. Since the alternative H_1: μ ≠ 11 is two-sided, the test statistic is |T| and we reject H_0: μ = 11 if

|T| > z_{α/2} = z_{.025} = 1.96.

However, 1.877 is not greater than 1.96, so we do not accept H_1. If α = 0.10, then z_{α/2} = z_{.05} = 1.645 and we would accept H_1.

Student t Tests for Small Samples

If n is not large, but the samples come from a normal distribution and H_0 is true, T has a student-t distribution with n − 1 degrees of freedom and the critical value is t_α(df = n − 1), the 100(1 − α)th percentile of student-t. For H_1: μ > μ_0 we reject H_0 when

X̄ > μ_0 + t_α(n − 1) S/√n,     (7.19)

and for H_1: μ ≠ μ_0 reject H_0 when

|X̄ − μ_0| > t_{α/2}(n − 1) S/√n.     (7.20)

Example 7.8. W.S. Gosset, 1876-1937, was an English statistician who worked for the Guinness brewing company. He discovered the distributions we know as the student-t distributions and published his work under the pseudonym "Student". In one of the first applications, he compared the yield of barley seeds dried by two different methods. Seeds dried by each method were planted in adjacent small plots (split plots) that had no difference in soil properties or rainfall.⁴ There were 11 split plots and the yields for each drying method on each split plot are shown below. We assume that the difference in yield for a split plot is normally distributed. We do not know the mean or the variance of the distribution of yield differences. We are interested in the research hypothesis that the mean difference is not equal to 0 against the null hypothesis that it is 0.

Plot        1     2     3     4     5     6     7     8     9    10    11
Regular  1903  1935  1910  2496  2108  1961  2060  1444  1612  1316  1511
Kiln     2009  1915  2011  2463  2180  1925  2122  1482  1542  1443  1535
Diff      106   -20   101   -33    72   -36    62    38   -70   127    24

The variable X = Diff = Kiln − Regular is assumed to have a normal distribution. The sample mean is X̄ = 33.73 and the sample standard deviation is S = 66.17. Thus, the observed value of T is

T = √11 × 33.73/66.17 = 1.691.

If α = 0.05, the critical value of |T| is t_{α/2}(df = n − 1) = t_{.025}(10) = 2.228. Therefore, we do not conclude that there is a difference in yield for the two drying methods.

⁴ W.S. Gosset, "The Probable Error of a Mean", Biometrika 6 (1908), 1-25.
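The same analysis can be run in R from the data in the table (a sketch we have added; the variable names are our own):

> regular=c(1903,1935,1910,2496,2108,1961,2060,1444,1612,1316,1511)
> kiln=c(2009,1915,2011,2463,2180,1925,2122,1482,1542,1443,1535)
> t.test(kiln-regular)

The reported t statistic is about 1.69 with 10 degrees of freedom, matching the hand calculation, and the two-sided p-value is well above 0.05.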

7.9 p-values

Instead of comparing the test statistic to a critical value for a pre-established significance level, many statisticians prefer to simply report the p-value of the statistic. Recall that the larger the value of the test statistic λ, the greater is its degree of confirmation of H_1 and disconfirmation of H_0. One way to quantify this idea is to compare the observed value of λ to the distribution of values it would have in future replications of the experiment, assuming H_0 to be true. Let λ_obs be the observed value of λ, a fixed number once the experiment has been done. We define the p-value of λ_obs to be Pr(λ > λ_obs | H_0). The smaller this probability is, the larger λ_obs is, comparatively speaking. A very small p-value is strong evidence that H_1 is true and H_0 is not. There is a temptation to think of the p-value as the probability of H_0, but that is a misinterpretation. H_0 is not an event.

The figure below is a pictorial representation of p-values.

[Figure: density of the test statistic under H_0, with the p-value shown as the area to the right of t_obs]

For the one-sided alternative H_1: μ > μ_0 the p-value is Pr(T > T_obs | H_0). For the two-sided alternative H_1: μ ≠ μ_0, it is Pr(|T| > |T_obs| | H_0).

To test at a fixed significance level α, simply compare the p-value to α. If the p-value is less than or equal to α, reject H_0. We will calculate the p-value in Example 7.7. The observed value of T was 1.877. Since we were assuming that T is normally distributed and the alternative was two-sided, the p-value is

Pr(|T| > 1.877) = 2 Pr(T > 1.877) = 2(1 − Φ(1.877)) = 0.0605.

Since 0.0605 > 0.05, we do not reject H_0 at significance level α = 0.05.
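In R the same p-value can be obtained with the normal cdf (our own one-line check):

> 2*(1-pnorm(1.877))

which returns approximately 0.0605.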

It is a serious misapplication of p-values to use them to shop for alternative hypotheses. In Example 7.7, the p-value of T_obs = 1.877 for the one sided alternative H_1: μ > 11 is 0.0303, whereas for the alternative H_1: μ ≠ 11 it is 0.0605. If we take α = 0.05 as the required significance level, and have not specified H_1 in advance, then we have the ridiculous situation that we are willing to believe μ > 11 but not willing to believe μ ≠ 11. Hypothesis testing makes sense only when the hypotheses are formulated before the collection of data.

7.9.1 Exercises

1. For each of the following scenarios state whether H_0 should be rejected or not. State any assumptions that you make beyond the information that is given.

(a) H_0: μ = 4, H_1: μ ≠ 4, n = 15, X̄ = 3.4, S = 1.5, α = .05.

(b) H_0: μ = 21, H_1: μ < 21, n = 75, X̄ = 20.12, S = 2.1, α = .10.

(c) H_0: μ = 10, H_1: μ ≠ 10, n = 36, p-value = 0.061.

2. Use the "test.vs.grade" data and test the null hypothesis that the mean test score for the population is 70 against the alternative that it is greater than 70. Find a p-value and state your conclusion if = 0 :05. Repeat for the null hypothesis = 75.

3. Use your sample of 10 trees to test the null hypothesis that the mean value of height/age is 2 against the alternative that it is greater than 2. Give a p-value and state your conclusion if α = 0.10.

4. Use the Cavendish data to test the research hypothesis that the speci c gravity of the earth is greater than 5.4. Give a p-value.

7.10 Hypotheses About a Population Proportion

Let p denote the proportion of successes in a population and let Y be the number of successes in a sample with replacement of size n from the population. The sample proportion of successes p̂ = Y/n is an unbiased estimator of p. Consider the null hypothesis

H0: p = p0

and the one-sided alternative

H1: p > p0.

If n is large and H0 is true,

Z = √n (p̂ − p0) / √(p0(1 − p0)) = (Y − n p0) / √(n p0(1 − p0))

is approximately standard normal. Therefore, Pr(Z > z_α | H0) = α. A test of significance level α for H0: p = p0 against the alternative H1: p > p0 is to reject H0 when Z > z_α, equivalently when

p̂ > p0 + z_α √(p0(1 − p0)/n),   or when   Y > n p0 + z_α √(n p0(1 − p0)).

The level α test for H0 against the two-sided alternative H1: p ≠ p0 rejects H0 when |Z| > z_{α/2}, that is, when

|Y − n p0| > z_{α/2} √(n p0(1 − p0)).

The p-values for the one-sided and two-sided alternatives are, respectively,

Pr(Z > Zobs | H0) = 1 − Φ(zobs)   and   Pr(|Z| > |Zobs| | H0) = 2(1 − Φ(|zobs|)),

where

Zobs = (Yobs − n p0) / √(n p0(1 − p0)).

Example 7.9. A gene occurs in a dominant form (allele) with probability 2/3 and in recessive form with probability 1/3. An organism in the population shows the dominant physical characteristic if its two copies of the gene are either two dominant copies or a dominant and a recessive copy. Organisms with two recessive copies of the gene show the recessive physical characteristic. If the population is in genetic equilibrium, the frequency of the dominant characteristic in the population is 89%. A biologist suspects that the population is not in genetic equilibrium and to test his suspicion collects 100 specimens. Eighty of them had the dominant characteristic. Is the biologist's claim supported?

Solution: Let p denote the proportion of dominant physical types in the population. The research hypothesis is H1: p ≠ 0.89 and the null hypothesis is H0: p = 0.89. The observed value of Y, the number in the sample of n = 100 with the dominant characteristic, is Yobs = 80, and the observed value of Z is

Zobs = (80 − 89) / √(100(.89)(.11)) = −2.876.

Since the alternative is two-sided, the p-value is

Pr(|Z| > 2.876 | H0) = 2(1 − Φ(2.876)) = 0.004.

Since the p-value is so small, we can say that there is strong evidence that the population is not in genetic equilibrium.
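The arithmetic in this example is easy to check in R, and it parallels the prop.test call shown next:

> y <- 80; n <- 100; p0 <- 0.89
> z <- (y - n*p0) / sqrt(n*p0*(1 - p0))
> z                          # about -2.876
> 2 * (1 - pnorm(abs(z)))    # two-sided p-value, about 0.004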

The R function "prop.test" that we used to nd con dence intervals for a success probability pis also used to test hypotheses about p. We will use it to answer the question in the preceding example.

> prop.test(x=80,n=100,p=0.89,correct=F)

        1-sample proportions test without continuity correction

data:  80 out of 100, null probability 0.89
X-squared = 8.2737, df = 1, p-value = 0.004022
alternative hypothesis: true p is not equal to 0.89
95 percent confidence interval:
 0.7111708 0.8666331
sample estimates:
  p
0.8

We omitted the Yates continuity correction so that R's answer could be compared to the one above.

The Yates correction alters the answers slightly. The X-squared statistic given in the R output is the square of the test statistic Z, so there is a 1-1 correspondence between its values and the values of |Z|.

If the alternative is one-sided, the argument alternative="g" (greater) or alternative="l" (less) can be included in the function call. For example, if H1: p < 0.89 is the alternative hypothesis,

> prop.test(x=80,n=100,p=0.89,correct=F,alternative="l")

        1-sample proportions test without continuity correction

data:  80 out of 100, null probability 0.89
X-squared = 8.2737, df = 1, p-value = 0.002011
alternative hypothesis: true p is less than 0.89
95 percent confidence interval:
 0.0000000 0.8574982
sample estimates:
  p
0.8

7.10.1 Exercises

1. Let p denote the proportion of all Math 3339 students who are women. On some random class day, count the number of students attending your class and the number of them who are women. At a significance level of α = 0.05, test the null hypothesis H0: p = 1/2 against the alternative H1: p < 1/2. Assume that the students attending your class are a random sample of Math 3339 students.

2. Let X be a random variable with a continuous distribution and suppose its median m is unique. Consider the null hypothesis H0: m = m0 and the alternative H1: m > m0. Suppose that a large sample of n values of X is obtained. Let Y be the number of sample values Xi ≤ m0. Y has a binomial distribution, Y ~ Binom(n, p), with success probability p. If H0 is true, what is the value of p? What does H1 imply about p? Show how to test H0 against H1 with Y. This is called the sign test for the median of X. Apply the sign test to the variable "Times" in the data set "react.times" and test H0: m = 1.4 against H1: m > 1.4. Give a p-value. You can count the number of observations of Times less than or equal to 1.4 with

> sum(Times <= 1.4)

after attaching react.times to your workspace. Use "prop.test" to perform the test.

3. Modify the sign test to test the hypothesis that the first quartile of Times is equal to 1.2 against the alternative that it is less than 1.2.

Chapter 8 Regression and Correlation

8.1 Examples of Linear Regression Problems

The R data set "mammals" is included in the library "MASS". It lists the mean body weights in kilograms and the mean brain weights in grams of 62 mammal species. It is part of a larger data set in a study of sleep in mammals.¹ If for each of the 62 species we plot a point in a rectangular coordinate system, with body weight on the horizontal x axis and brain weight on the vertical y axis, we obtain a scatterplot of the data.

¹ Allison, T. and Cicchetti, D.V. Sleep in mammals: ecological and constitutional correlates. Science 194 (1976).

[Figure: scatterplot of brain weight against body weight for the 62 mammal species.]

The purpose of such an exercise might be to discover a relationship between body weight and brain weight that is characteristic of mammalian development. The scatterplot does not suggest much of a relationship. One reason is that the diagram is dominated by a few large mammals and the others are crowded together near the lower left corner. This can be alleviated by plotting the logarithms of the variables.

> data(mammals,package="MASS")
> attach(mammals)
> plot(log(body),log(brain),xlab="log body wt",ylab="log brain wt")

Since this data is a result of sampling, it is fair to consider Y = log(brain wt) and X = log(body wt) as jointly distributed random variables.

For mammals with log body weights X near a given value x, there is a distribution of log brain weights Y. It appears that the centers of these distributions increase almost linearly as x increases. Furthermore, the vertical dispersion of the distribution of Y does not seem to vary much as x varies. We will hypothesize that the conditional mean of Y, given that X = x, is a linear function of x:

E(Y | X = x) = β0 + β1 x      (8.1)

where β0 and β1 are unknown intercept and slope parameters, respectively, and that the variance of the conditional distribution of Y, given that X = x, is constant, independent of x:

var(Y | X = x) = σ²      (8.2)

The constant variance σ² is also unknown. These are the two basic assumptions of simple linear regression. The term "regression" was first applied in this context by Francis Galton (1822-1911), who studied the relationship between the heights of fathers and their full-grown sons. He observed that the sons of unusually tall fathers tended to be tall, but not as tall as their fathers. Hence, their heights "regressed to the mean".

[Figure: scatterplot of log brain weight against log body weight.]

The values of X in this example were not predetermined as part of the design of the experiment. Rather, they were simply observed along with the corresponding values of Y. When this is the case, the experiment is said to be an observational study. A designed experiment is one in which the values of X are controlled. This type of experiment is more common in engineering, some of the hard sciences and pharmaceutical research than in social sciences and business. The scatterplot below shows data on lifetimes of 1.5 volt batteries.² The X variable is the voltage of the battery, which decreases slowly from 1.5 to a smaller value when the battery is under a load. The Y variable is the measured time for the voltage of a battery to decrease to the experimenter-determined level of x = 1.3, 1.2, 1.1, etc. The constant variance assumption (8.2) appears to be at least approximately true, but the assumption of linearity (8.1) seems to be violated. There is a noticeable curvature in the pattern of points in the scatterplot. The methods of linear regression analysis developed in this chapter should be used with caution in such a case.

² Peter K. Dunn, Comparing the lifetimes of two brands of batteries, Journal of Statistics Education 21, 1 (2013)

[Figure: scatterplot of battery lifetime (Time) against Voltage.]

Regardless of whether the experiment is designed or an observational study, we will call X the design variable and Y the response variable. Even if the data comes from an observational study, we write the x-values in lower case since we are concerned only with the conditional distribution of Y, given the value of X. Therefore, the values of X are treated as non-random inputs.

8.2 Least Squares Estimates

The data in a simple linear regression problem consists of n pairs {(xi, Yi)}, i = 1, ..., n. Both variables are numeric in character. We can represent the data as:

Y1 = β0 + β1 x1 + ε1
Y2 = β0 + β1 x2 + ε2
   ...                        (8.3)
Yn = β0 + β1 xn + εn

The εi in the equations above are the random deviations of the Yi from their expected values. They are unobservable since we don't know the values of β0 and β1.

There is an alternate parametrization of (8.3) that is convenient. Let x̄ be the average of x1, ..., xn and write

Yi = α + β1(xi − x̄) + εi,

where α = β0 + β1 x̄, equivalently β0 = α − β1 x̄. We will use this parametrization in the derivation below of the least squares estimates. If we have estimators α̂ of α and β̂1 of β1, we can immediately get the estimator β̂0 = α̂ − β̂1 x̄ of β0.

Given estimators α̂ and β̂1, the estimated expected value or predicted value of Yi is

Ŷi = α̂ + β̂1(xi − x̄),

and the deviation of the observation Yi from its predicted value is

ei = Yi − Ŷi.

ei is also called the ith residual. Think of it as an estimate of the unobservable true random error εi in (8.3).

The method of least squares selects estimators α̂ and β̂1 that minimize the residual sum of squares:

SS(resid) = Σᵢ eᵢ² = Σᵢ (Yi − Ŷi)² = Σᵢ (Yi − α̂ − β̂1(xi − x̄))².

With a little algebra, this can be written

SS(resid) = Σᵢ (Yi − α̂)² − 2 β̂1 Σᵢ (Yi − α̂)(xi − x̄) + β̂1² Σᵢ (xi − x̄)²
          = Σᵢ (Yi − α̂)² − 2 β̂1 Sxy + β̂1² Sxx,      (8.4)

where

Sxy = Σᵢ Yi(xi − x̄)   and   Sxx = Σᵢ (xi − x̄)².

Now, the first term on the right of (8.4) is a quadratic function of α̂ alone and does not involve β̂1. The sum of the other two terms on the right is a quadratic function of β̂1 alone. Therefore, SS(resid) will be minimized when we choose α̂ to minimize the first term and β̂1 to minimize the sum of the other two terms. The solutions are:

α̂ = Ȳ = (1/n) Σ Yi      (8.5)

β̂1 = Sxy / Sxx          (8.6)

The least squares estimator of the intercept parameter β0 in the original formulation is

β̂0 = Ȳ − β̂1 x̄.          (8.7)

The line with equation y = β̂0 + β̂1 x is the fitted line. For a small data set, a reasonably good calculator will calculate the least squares estimates with the punch of a button, once the x's and Y's are entered. Even without using this feature, the least squares estimates are easy to find. Note that β̂1 is the sample covariance between the x's and Y's divided by the sample variance of the x's.
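Formulas (8.5)-(8.7) translate directly into R. The short sketch below uses a pair of made-up vectors x and Y (illustrative values only, not data from the text) and checks the hand formulas against R's built-in fit.

> x <- c(1, 2, 3, 4, 5)
> Y <- c(2.1, 3.9, 6.2, 7.8, 10.1)        # illustrative values only
> Sxy <- sum(Y * (x - mean(x)))
> Sxx <- sum((x - mean(x))^2)
> b1 <- Sxy / Sxx                         # slope estimate (8.6)
> b0 <- mean(Y) - b1 * mean(x)            # intercept estimate (8.7)
> c(b0, b1)
> coef(lm(Y ~ x))                         # should agree with b0 and b1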

Example 8.1. The data frame "Mileage" shown below has 11 predetermined fuel mixture ratios and the mileages per gallon of 11 test cars with those fuel ratios. The table leaves space for you to fill in the cross-product terms Yi(xi − x̄) and the terms (xi − x̄)². Their sums or averages are at the bottom. This is a good way to organize the calculations if you must do them without technology.

  i   xi: fuel   Yi: mileage   Yi(xi − x̄)   (xi − x̄)²
  1     0.40       17.42
  2     0.43       17.56
  3     0.46       17.62
  4     0.49       17.69
  5     0.52       18.08
  6     0.55       18.01
  7     0.58       18.01
  8     0.61       18.34
  9     0.64       18.26
 10     0.67       18.50
 11     0.70       18.41
 average or sum   0.55   17.99   0.356   0.099

From the bottom line of the table, we have α̂ = Ȳ = 17.99, Sxy = 0.356, and Sxx = 0.099. Thus β̂1 = 0.356/0.099 = 3.594. The equation of the fitted line is

y = 17.99 + 3.594(x − 0.55) = 16.01 + 3.594 x

There are several ways of calculating the least squares estimates in R. If all you want is the estimated values of β0 and β1, the simplest way to get them is

> attach(Mileage)
> coef(lsfit(fuel,mileage))
Intercept         X
16.014242  3.593939

Here "Mileage" is the name of the data frame that contains the variables "fuel" and "mileage". The x-variable "fuel" must be entered first as an argument to "lsfit", which does the computational work. "coef" extracts the coefficients from the object returned by "lsfit". The figure below shows the scatterplot and the fitted line. The function "abline" adds a line to an existing plot.

> Mileage=read.csv("Mileage.csv",header=T)
> plot(Mileage)
> attach(Mileage)
> abline(coef(lsfit(fuel,mileage)))

[Figure: scatterplot of mileage against fuel with the fitted line.]

8.2.1 The "lm" Function in R

The "lsfit" function calculates the least squares estimates but does not provide all the information needed in linear regression analysis. The R function "lm" (which stands for linear model) does. We will illustrate its use by re-analyzing the data from the preceding example.

> Mileage=read.csv("Mileage.csv",header=T)
> attach(Mileage)
> mileage.lm=lm(mileage ~ fuel, data=Mileage)
> summary(mileage.lm)

Call:
lm(formula = mileage ~ fuel, data = Mileage)

Residuals:
     Min       1Q   Median       3Q      Max
-0.12000 -0.06982 -0.03182  0.04845  0.19691

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16.0142     0.1858   86.18 1.93e-14 ***
fuel          3.5939     0.3329   10.79 1.89e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1048 on 9 degrees of freedom
Multiple R-squared: 0.9283,  Adjusted R-squared: 0.9203
F-statistic: 116.5 on 1 and 9 DF,  p-value: 1.887e-06

Much of the output above relates to topics discussed later. The part labeled "Coefficients:" contains the least squares estimates of the intercept β0 and the coefficient β1 of fuel, the x-variable. It also contains information needed for inferences about the true values of those parameters.

"lm" creates an R object called a linear model object, which has been given the name "mileage.lm". The arguments to lm are the model formula, "mileage ~ fuel", which tells R that mileage is a linear function of fuel, and the "data" argument, which tells R that the variables mileage and fuel are in the data frame "Mileage". The tilde "~" separates the y-variable from the x-variable.

Once the linear model object is created by calling lm, the information is displayed with the summary function. The model object actually contains a great deal of information and the summary only displays the most important part of it. Note that it gives a 5-number summary of the residuals. The residuals are stored as part of the model object. If you want to see all of them, type

> residuals(mileage.lm)

8.2.2 Exercises

1. Fill in the blank cells of the table in Example 8.1 and verify that the numbers given for Sxy and Sxx are correct.

2. Listed below are the log body weights and log brain weights of the primate species in the data set "mammals". Find the equation of the least squares line with y = log brain weight and x = log body weight. Do it by hand, by constructing a table like the one in Example 8.1. Then do it with your calculator as efficiently as possible. Finally, use the lm function in R to do it by creating a linear model object "primates.lm". The model formula is "log(brain) ~ log(body)". You can select the primates and put them in a new data frame by first listing the primate species names:

> primatenames=c("Owl monkey", "Patas monkey", "Gorilla", etc.)

and then

> primates=mammals[primatenames, ]

Your "data" argument in calling lm would be "data=primates", as in

> primates.lm=lm(log(brain) ~ log(body),data=primates)

                  log body  log brain
Owl monkey      -0.7339692   2.740840
Patas monkey     2.3025851   4.744932
Gorilla          5.3327188   6.006353
Human            4.1271344   7.185387
Rhesus monkey    1.9169226   5.187386
Chimpanzee       3.9543159   6.086775
Baboon           2.3561259   5.190175
Verbet           1.4327007   4.060443
Galago          -1.6094379   1.609438
Slow loris       0.3364722   2.525729

3. A recently discovered hominid species Homo floresiensis, nicknamed the hobbit, had a body weight of about 25 kilograms. Use the fitted line from the preceding problem to predict its brain weight. Read about H. floresiensis in Wikipedia or some other source. Some of the scientific arguments were very contentious and involved the creature's brain weight.

4. Repeat problem 2 for the rodent species in "mammals". The data are

                            log body   log brain
Mountain beaver            0.30010459   2.0918641
Guinea pig                 0.03922071   1.7047481
Chinchilla                -0.85566611   1.8562980
Ground squirrel           -2.29263476   1.3862944
Arctic ground squirrel    -0.08338161   1.7404662
African giant pouched rat  0.00000000   1.8870696
Yellow-bellied marmot      1.39871688   2.8332133
Golden hamster            -2.12026354   0.0000000
Mouse                     -3.77226106  -0.9162907
Rabbit                     0.91629073   2.4932055
Rat                       -1.27296568   0.6418539
Mole rat                  -2.10373423   1.0986123

5. Solve the equation log(y) = β0 + β1 log(x) for y and simplify. y will be expressed as a power function of x. Find the estimated power functions for primates and rodents.

8.3 Distributions of the Least Squares Estimators

Henceforth, α̂, β̂0, and β̂1 will refer only to the least squares estimators. So far, we have not assumed much about the distribution of the Yi, except for their means and variances. Now we assume in addition that they are independent and normally distributed. Equivalently, we assume that the errors εi are normally distributed with mean 0 and variance σ². Thus, Y1, Y2, ..., Yn are independent and Yi ~ Norm(β0 + β1 xi, σ). These assumptions have profound consequences.

Theorem 8.1. If the errors εi in (8.3) are independent and normally distributed with mean 0 and variance σ², then

(1) α̂ ~ Norm(α, σ/√n);
(2) β̂1 ~ Norm(β1, σ/√Sxx);
(3) β̂0 ~ Norm(β0, σ √(1/n + x̄²/Sxx));
(4) SS(resid)/σ² ~ Chisq(df = n − 2);
(5) α̂, β̂1, and SS(resid) are independent random variables.
(6) Let x be a given value of the design variable X and let μ(x) = E(Y | X = x) = α + β1(x − x̄) be the expected value of the response Y when X = x. Let μ̂(x) = α̂ + β̂1(x − x̄) be its estimated value. Then

μ̂(x) ~ Norm( μ(x), σ √(1/n + (x − x̄)²/Sxx) ).

Proof: We shall prove (1), (2), and (6) only. (3) is the special case of (6) when x = 0. α̂, β̂1, and μ̂(x) are all linear combinations of Y1, Y2, ..., Yn, which are normally distributed and independent. Therefore, these parameter estimates are normally distributed and we need only calculate their means and variances. First, α̂:

E(α̂) = E(Ȳ) = (1/n) Σᵢ E(Yi) = (1/n) Σᵢ (α + β1(xi − x̄)) = α

since Σᵢ (xi − x̄) = 0. Also, by independence and since var(Yi) = σ²,

var(α̂) = var((1/n) Σᵢ Yi) = (1/n²) Σᵢ var(Yi) = nσ²/n² = σ²/n.

Next, β̂1:

E(β̂1) = E(Sxy/Sxx) = Σᵢ (xi − x̄) E(Yi) / Sxx = Σᵢ (xi − x̄)(α + β1(xi − x̄)) / Sxx = β1 Σᵢ (xi − x̄)² / Sxx = β1 Sxx / Sxx = β1,

again because Σᵢ (xi − x̄) = 0. Since the Yi are independent and have common variance σ²,

var(β̂1) = Σᵢ (xi − x̄)² var(Yi) / Sxx² = Σᵢ (xi − x̄)² σ² / Sxx² = σ²/Sxx.

Finally,

E(μ̂(x)) = E(α̂) + E(β̂1)(x − x̄) = α + β1(x − x̄) = μ(x).

Assuming the independence assertion (5),

var(μ̂(x)) = var(α̂) + (x − x̄)² var(β̂1) = σ²/n + (x − x̄)² σ²/Sxx = σ² (1/n + (x − x̄)²/Sxx).
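Theorem 8.1 can be made concrete with a small simulation. The sketch below uses assumed parameter values β0 = 1, β1 = 2, σ = 1.5 (not data from the text); it repeatedly generates responses from the model and checks that the slope estimates scatter around β1 with standard deviation close to σ/√Sxx, as part (2) asserts.

> set.seed(1)
> x <- seq(-5, 5, by = 1); Sxx <- sum((x - mean(x))^2)
> b0 <- 1; b1 <- 2; sigma <- 1.5
> slopes <- replicate(5000, {
+   Y <- b0 + b1 * x + rnorm(length(x), mean = 0, sd = sigma)
+   sum(Y * (x - mean(x))) / Sxx        # slope estimate from (8.6)
+ })
> mean(slopes)           # close to b1 = 2
> sd(slopes)             # close to sigma/sqrt(Sxx)
> sigma / sqrt(Sxx)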

8.3.1 Exercises

1. The numbers below are the values of the design variable X in a linear regression problem whose true parameter values are β0 = 1 and β1 = 1. The error variance σ² is 2.55. Find the 5th and 95th percentiles of the distribution of β̂1.

x: -10 -8 -6 -4 -2 0 2 4 6 8 10

2. With the same data, find the 5th and 95th percentiles of α̂.

3. With the same data, find the 5th and 95th percentiles of μ̂(7.5).

4. With the same data, find the 95th percentile of the distribution of SS(resid).

5. Find Pr(|β̂1| > 1.38).

8.4 Inference for the Regression Parameters

The goal in this section is to develop confidence intervals and hypothesis tests for the unknown parameters α, β0, β1, and σ², the constant variance of the errors εi in (8.3). Let us reconsider equation (8.4) when α̂ and β̂1 are the least squares estimators. Since β̂1 = Sxy/Sxx,

SS(resid) = Σᵢ (Yi − α̂)² − 2 β̂1 Σᵢ (Yi − α̂)(xi − x̄) + β̂1² Σᵢ (xi − x̄)² = Σᵢ (Yi − Ȳ)² − β̂1² Sxx.

Write this as

Σᵢ (Yi − Ȳ)² = SS(resid) + β̂1² Sxx,

or

SS(tot) = SS(resid) + SS(regr)      (8.8)

The total sum of squares is the total squared variation of the Yi about their average. If β1 = 0, then the Yi are just a sample from a single distribution and SS(tot)/(n − 1) is just the usual sample variance. The regression sum of squares β̂1² Sxx can also be written

SS(regr) = Σᵢ (Ŷi − Ȳ)².

It is the total squared deviation of the fitted values Ŷi from their average, which is also Ȳ. The residual sum of squares and the regression sum of squares are independent random variables. Divide both sides of (8.8) by SS(tot):

1 = SS(resid)/SS(tot) + SS(regr)/SS(tot)

Definition 8.1. The coefficient of determination R² is defined as

R² = SS(regr)/SS(tot) = 1 − SS(resid)/SS(tot).

Here is the correct interpretation of R². The total squared variation SS(tot) of the observations Yi about their average can be decomposed into two independent parts. One is the variation

SS(regr) = Σᵢ (β̂1(xi − x̄))²

accounted for by the presumed linear relationship and the variation of the inputs xi. The other is the variation SS(resid) that comes from the random deviation from the linear relationship. If SS(regr) is a high percentage of SS(tot), then most of the variation in the Yi comes from the variation in the xi and little of it comes from random error. Thus, R² is interpreted as a measure of the strength of the association between X and Y.
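The decomposition (8.8) and the value of R² are easy to verify numerically. The sketch below assumes the data frame Mileage is attached and the object mileage.lm from Section 8.2.1 has been created; the pieces computed by hand match the "Multiple R-squared" reported by summary.

> yhat <- fitted(mileage.lm); y <- mileage
> SS.tot   <- sum((y - mean(y))^2)
> SS.regr  <- sum((yhat - mean(y))^2)
> SS.resid <- sum((y - yhat)^2)
> SS.tot - (SS.regr + SS.resid)     # essentially zero, confirming (8.8)
> SS.regr / SS.tot                  # R-squared, about 0.9283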

Definition 8.2. The mean square residual is

S² = SS(resid)/(n − 2).

The mean square residual is also denoted by MS(resid). Its square root S is called the residual standard error.

From (4) of Theorem 8.1,

E[ SS(resid)/σ² ] = n − 2.

It follows that E(S²) = σ² and S² is an unbiased estimator of σ². S is the preferred estimator of σ, though it is not unbiased.

Recall the definition of the student-t distributions from Chapter 6. If Z ~ Norm(0,1) and W ~ Chisq(df = ν) are independent, then

T = Z / √(W/ν)

has the student-t distribution with ν degrees of freedom. In the following theorem, we apply this definition with Z equal to the standardized values of α̂, β̂1, and μ̂(x) (see Theorem 8.1), and with W = SS(resid)/σ² and ν = n − 2. We then have √(W/ν) = S/σ.

Theorem 8.2. Under the assumptions of Theorem 8.1, the following random variables all have student-t distributions with n − 2 degrees of freedom:

√n (α̂ − α) / S,

√Sxx (β̂1 − β1) / S,

(μ̂(x) − μ(x)) / ( S √(1/n + (x − x̄)²/Sxx) ).

These are sometimes called the studentized values of the estimators.

8.4.1 Confidence Intervals for the Parameters

From the preceding theorem, we can immediately get confidence intervals for the regression parameters.

Corollary 8.1. 100(1 − α)% confidence intervals for α, β1, and μ(x) are:

α̂ ± t_{α/2}(df = n − 2) S/√n,      (8.9)

β̂1 ± t_{α/2}(n − 2) S/√Sxx,       (8.10)

μ̂(x) ± t_{α/2}(n − 2) S √(1/n + (x − x̄)²/Sxx).      (8.11)

The estimated standard deviations of the estimators, obtained by substituting the estimator S for the unknown parameter σ, are

se(α̂) = S/√n,   se(β̂1) = S/√Sxx,   se(μ̂(x)) = S √(1/n + (x − x̄)²/Sxx),

and are called their standard errors. So, the confidence intervals above may be expressed as:

α̂ ± t_{α/2}(n − 2) se(α̂),   β̂1 ± t_{α/2}(n − 2) se(β̂1),   μ̂(x) ± t_{α/2}(n − 2) se(μ̂(x)).

8.4.2 Hypothesis Tests for the Parameters

Tests of hypotheses about the regression parameters are based on the student-t distribution of the studentized parameter estimates. By far the most important hypothesis to be tested is the null hypothesis that the slope parameter β1 is equal to zero. If β1 = 0 and the other assumptions of the linear regression model hold, then the distribution of the response Y does not depend on the value of the design variable X. We will consider the more general null hypothesis

H0: β1 = β10

and the two-sided alternative

H1: β1 ≠ β10.

The p-value is Pr(|T| > |Tobs|), where T ~ TDist(df = n − 2) and

Tobs = √Sxx (β̂1(obs) − β10) / S

is the observed value of the studentized β̂1 in Theorem 8.2. For one-sided alternatives, eliminate the absolute value signs and adjust the direction of the inequality appropriately.

Example 8.2. The table below shows vacancy rates (percentage of apartments that are vacant for over 1 month) and rental rates per 10 square feet in 30 American cities.

Vacancy  3.00 11.00 17.00  2.00  7.00 18.00  9.00 12.00 13.00 13.00
Rent    21.70 16.42 14.84 24.90 16.62 11.75 19.33 14.60 12.01 19.88

Vacancy  8.00 16.00 10.00  3.00 12.00 17.00  3.00 11.00 16.00 20.00
Rent    19.83 17.78 17.79 17.08 19.39 15.81 21.15 12.33 15.58 10.83

Vacancy 20.00 14.00  8.00  5.00  2.00 20.00 15.00 19.00  2.00 14.00
Rent    17.38 15.09 23.19 19.27 14.88 16.27 18.53 14.25 17.68 19.74

Find a 90% confidence interval for the expected rental rate when the vacancy rate is 15%. Test the null hypothesis that the expected increase in Rent for a unit increase in Vacancy is 0.

Solution: What is required is a 90% confidence interval for μ(15), where the response Y = Rent and the expected response is a linear function of X = Vacancy. The expected increase in Rent for a unit increase in Vacancy is the slope parameter β1. The data is in the text file "rents". We will import it into R and analyze it with R's full capabilities later. For now we will show all the steps in obtaining the confidence interval. You are encouraged to follow the steps with your calculator.

You can verify that Ȳ = 17.197, x̄ = 11.333, Sxx = 1028.667, and Sxy = −312.507. These should all be easy to obtain with the mean, variance, and covariance buttons on your calculator. The least squares estimates of the parameters are:

α̂ = Ȳ = 17.197

β̂1 = Sxy/Sxx = −312.507/1028.667 = −0.304

and the fitted line is

y = 17.197 − 0.304(x − 11.333).

Next, we need to find S, the residual standard error, and for that we need SS(resid). It isn't necessary to sum the squares of the residuals. Instead, we will use equation (8.8) in the form

SS(resid) = SS(tot) − SS(regr).

SS(tot) is n − 1 times the sample variance of the responses Yi. Its value is 326.087. SS(regr) has essentially been calculated:

SS(regr) = β̂1² Sxx = Sxy²/Sxx = 94.939

Thus,

SS(resid) = 326.087 − 94.939 = 231.148
MS(resid) = SS(resid)/(n − 2) = 231.148/28 = 8.255
S = √MS(resid) = 2.873

Now we can calculate the coefficient of determination R² and the standard errors.

R² = SS(regr)/SS(tot) = 94.939/326.087 = 0.2911

se(μ̂(x)) = S √(1/n + (x − x̄)²/Sxx) = 2.873 √(1/30 + (15 − 11.333)²/1028.667) = 0.619

and

se(β̂1) = S/√Sxx = 2.873/32.073 = 0.090

The predicted Rent for Vacancy = 15 is

μ̂(15) = 17.197 − 0.304(15 − 11.333) = 16.082.

The 90% confidence interval for μ(15), the expected Rent when Vacancy = 15, is

μ̂(15) ± t_{.05}(28) se(μ̂(15)) = 16.082 ± 1.053,

i.e., (15.029, 17.135).

The observed T statistic for testing the null hypothesis H0: β1 = 0 is

Tobs = (β̂1 − 0)/se(β̂1) = −0.304/0.090 = −3.378.

Since the alternative H1: β1 ≠ 0 is 2-sided, the p-value is

Pr(|T| > 3.378) = 2 Pr(T > 3.378) = 0.0022,

which is a highly significant result. We can conclude that β1 ≠ 0. The scatterplot and fitted line are shown below.

[Figure: scatterplot of Rent against Vacancy with the fitted line.]

The F Test for Significance of Regression

Definition 8.3. The F distribution with ν1 degrees of freedom in the numerator and ν2 degrees of freedom in the denominator is the distribution of a random variable

F = (U/ν1) / (V/ν2),

where U ~ Chisq(df = ν1) and V ~ Chisq(df = ν2) are independent. That F has this distribution is indicated by F ~ FDist(ν1, ν2).

It follows immediately from the definition that if F ~ FDist(ν1, ν2), then 1/F ~ FDist(ν2, ν1). Tables of the F distributions are included in most textbooks, but because two parameters have to be specified, they tend to be rather coarse. We will use the R functions "qf" and "pf" for evaluating the quantile function and the cumulative distribution. For example, the 95th percentile of FDist(ν1 = 20, ν2 = 30) and Pr(F ≤ 1.4) are

> qf(.95,20,30)
[1] 1.931653
> pf(1.4,20,30)
[1] 0.8024855

Now, if H0: β1 = 0 is true, U = SS(regr)/σ² ~ Chisq(df = ν1 = 1) and V = SS(resid)/σ² ~ Chisq(df = ν2 = n − 2) are independent. Furthermore, V/ν2 = MS(resid)/σ². Although it may seem pointless, divide SS(regr) by its degrees of freedom ν1 = 1 and call it MS(regr). Then

F = MS(regr)/MS(resid)

has the F distribution with 1 degree of freedom in the numerator and n − 2 degrees of freedom in the denominator. It is in fact the square of the student-t statistic for testing H0: β1 = 0 against H1: β1 ≠ 0. Therefore, we reject H0 and accept H1 if F is larger than a critical value, or if its p-value is smaller than a given significance level.

Example 8.2 in R

We will use R's "lm" function to answer the questions in the preceding example. First, we create the linear model object and give it a name, then call the summary function for that object.

> rents.lm=lm(Rent~Vacancy,data=rents)
> summary(rents.lm)

Call:
lm(formula = Rent ~ Vacancy, data = rents)

Residuals:
    Min      1Q  Median      3Q     Max
-5.1521 -2.2374  0.1688  1.9937  4.9807

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.63971    1.14279  18.061  < 2e-16 ***
Vacancy     -0.30380    0.08958  -3.391  0.00209 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.873 on 28 degrees of freedom
Multiple R-squared: 0.2911,  Adjusted R-squared: 0.2658
F-statistic: 11.5 on 1 and 28 DF,  p-value: 0.002089

The "Coefficients" section of the output shows the estimated intercept and slope parameters, their standard errors, the observed student-t statistics associated with them, and the p-values for the 2-sided alternative to the null hypothesis that the parameter is equal to 0. Compare the numbers derived in Example 8.2 to those in the line preceded by "Vacancy". Except for some slight roundoff error they are the same. Also notice that the residual standard error and the value of R² ("Multiple R-squared") agree with those in the example. The F-statistic at the very bottom of the summary has the same p-value as the t-test statistic for H0: β1 = 0, as it should. The adjusted R-squared value can be ignored as it is useful only in multiple regression problems with more than one design variable.

Since the parameter estimates and their standard errors are given in the output, it would be easy to calculate confidence intervals for the parameters, e.g.,

β̂1 ± t_{α/2}(n − 2) se(β̂1).

However, this is unnecessary because R will do it for us. To get 95% confidence intervals for both β0 and β1, use the "confint" function with the name of the model object as an argument.

> confint(rents.lm)
                 2.5 %     97.5 %
(Intercept) 18.2988060  22.980611
Vacancy     -0.4873016  -0.120294

If you want confidence intervals with a level other than 95% you must enter the "level" argument to the function. For example, for 90% confidence intervals,

> confint(rents.lm,level=.90)
                   5 %        95 %
(Intercept) 18.6956704  22.5837464
Vacancy     -0.4561913  -0.1514043

R will also give you a confidence interval for μ(x). To repeat the results of the example,

> predict(rents.lm,newdata=data.frame(Vacancy=15),interval="c",level=.90)
       fit      lwr      upr
1 16.08274 15.02987 17.13562

The first argument in "predict" is the name of the fitted linear model object. The "newdata" argument has to be given in the way shown above. The new value(s) of X must be given the same name as the X variable on the right of the model formula and it must be inside the data.frame function as indicated above. The "interval" argument is either "c" for a confidence interval, or "p" for a prediction interval. The "level" argument is to specify the confidence level. Its default is 95%.

A new observation of the response Y at X = x is contained in the prediction interval

μ̂(x) ± t_{α/2}(n − 2) S √(1 + 1/n + (x − x̄)²/Sxx)

with probability 1 − α. The prediction interval is wider than the confidence interval.
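Following the same pattern as above, a 90% prediction interval for a new observation at Vacancy = 15 can be requested simply by changing the "interval" argument:

> predict(rents.lm,newdata=data.frame(Vacancy=15),interval="p",level=.90)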

8.4.3 Exercises

1. Show that

F = (n − 2) R² / (1 − R²).

2. The data below gives the responses Y to a sample of values of a design variable X. Without using R, except to check your work, find the following: (a) the estimates α̂, β̂1, β̂0, (b) all three terms in SS(tot) = SS(resid) + SS(regr), (c) the residual standard error, (d) the standard errors of the estimates, and (e) R².

3. With the same data, calculate the student-t test statistics for the hypotheses that the intercept and slope parameters are equal to 0. Calculate the F statistic for significance of regression. Calculate their p-values. (You can use R's "pt" and "pf" functions for that.)

4. Without using R, find 95% confidence intervals for β0 and β1.

       X     Y
 1  7.44  1.61
 2  5.36 -0.22
 3  6.37  0.21
 4  5.46  0.22
 5  5.73 -0.76
 6  6.00  0.74
 7  5.66  1.66
 8  4.57  1.13
 9  7.47  3.24
10  5.72 -0.47

5. With the "primates" data find a 90% prediction interval for the brain weight of a primate whose body weight is 25 kg. Hint: First find the prediction interval for log brain weight. Use R.

6. The R datasets package has a data frame called "airquality", which lists ozone concentration, solar radiation, wind speed and temperature in New York for 154 days in 1973. Some of the data values are missing, but R will automatically omit those cases with missing data. Fit a linear model with Ozone as the response and Wind as the X variable. Find 90% confidence intervals for the expected ozone concentration when wind speed is 0 and for the expected increase in ozone concentration for a unit increase in wind speed.

7. With the airquality data, test the null hypothesis H0: β1 = −5 against the one-sided alternative β1 < −5. The output of lm does not give you the answer directly, but it does give you the estimated value of β1 and its standard error. You know that the test statistic has the student-t distribution with n − 2 degrees of freedom. Give a p-value. Warning: Because of missing data, n = 116, not 154.

8.5 Correlation

In an observational study, with X and Y jointly distributed random variables, modeling Y as a function of X and predicting the response for new values of X might not be important. Instead, one might simply want a measure of the linear association between X and Y. One such measure is the correlation between X and Y, more formally called the Pearson product-moment correlation:

ρ = cor(X, Y) = cov(X, Y) / (σx σy)

The correlation is a parameter, a characteristic of the joint distribution of X and Y. If X and Y have a bivariate normal distribution, ρ enters explicitly in the formula for the joint density function and in the expression for the conditional expectation of Y, given X = x,

E(Y | X = x) = μy + ρ (σy/σx)(x − μx),

from which we see that β1 = ρ σy/σx in the regression equation. This means that the regression null hypothesis H0: β1 = 0 is equivalent to H0: ρ = 0.

The sample correlation is

R = Sxy / √(Sxx Syy),      (8.12)

where Sxy and Sxx are the same as before and Syy is our new name for SS(tot):

Syy = Σᵢ (Yi − Ȳ)².

It is simple algebra to show that

R = β̂1 √(Sxx/Syy).

Recall the student-t statistic for testing H0: β1 = 0,

T = β̂1 √Sxx / S.

It is just more simple algebra to write this as

T = R √(n − 2) / √(1 − R²).

Therefore, if X and Y have a bivariate normal distribution, a test of H0: ρ = 0 against H1: ρ ≠ 0 is to reject H0 when |T| > t_{α/2}(n − 2), or when the p-value based on the student-t distribution TDist(df = n − 2) is smaller than the chosen significance level. We have three equivalent tests in the case of a bivariate normal distribution: the T test for H0: β1 = 0, the T test for H0: ρ = 0, and the F test for significance of regression.

8.5.1 Confidence intervals for ρ

The following theorem is proved by a complicated argument that starts with the central limit theorem. We will accept its conclusion without proof.

Theorem 8.3. As the sample size n → ∞, the distribution of

Z = √(n − 3) (θ̂ − θ)

approaches standard normal, where

θ = ½ ln((1 + ρ)/(1 − ρ))      (8.13)

θ̂ = ½ ln((1 + R)/(1 − R))      (8.14)

Thus a large sample 100(1 − α)% confidence interval for θ is

θ̂ ± z_{α/2}/√(n − 3).

The function ln((1 + ρ)/(1 − ρ)) is strictly increasing for −1 < ρ < 1 and the expression (8.13) can be inverted to give

ρ = (e^θ − e^{−θ})/(e^θ + e^{−θ}) = tanh(θ),

the hyperbolic tangent function. So, a large sample 100(1 − α)% confidence interval for ρ is

( tanh(θ̂ − z_{α/2}/√(n − 3)),  tanh(θ̂ + z_{α/2}/√(n − 3)) ).
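This interval is easy to compute by hand in R, since atanh and tanh implement (8.13) and its inverse. The sketch below uses placeholder values r = −0.54 and n = 30, chosen to be close to the Vacancy/Rent example analyzed with cor.test next:

> r <- -0.54; n <- 30
> theta.hat <- atanh(r)                  # same as 0.5*log((1+r)/(1-r))
> z <- qnorm(0.975)
> tanh(theta.hat + c(-1, 1) * z / sqrt(n - 3))   # approximate 95% confidence interval for rho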

The R function for testing hypotheses and obtaining confidence intervals for the correlation between two variables is "cor.test". We will illustrate it with the variables "Vacancy" and "Rent" in the data frame "rents".

> attach(rents)
> cor.test(Vacancy,Rent)

        Pearson's product-moment correlation

data:  Vacancy and Rent
t = -3.3912, df = 28, p-value = 0.002089
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7533935 -0.2225779
sample estimates:
       cor
-0.5395793

The confidence level is adjustable with a "conf.level" argument, e.g., "conf.level=.90".

8.5.2 Exercises 1. With the airquality data nd a 99 % con dence interval for the correlation between Ozone and Wind.

2. With the observations of Xand Yin Exercises 4.3, nd the p-value for the test that the correlation between Xand Yis zero against a two sided alternative. Go to TOC Chapter 9 Inference from Multiple Samples 9.1 Comparison of Two Population Means Let X 1; X 2; ; X mbe a random sample from a distribution with mean x and standard deviation x. Let Y 1; Y 2; ; Y nbe a random sample from another distribution with mean y and standard deviation y. The two samples are independent of each other, so everything in sight is independent of everything else. Our goal is to nd con dence intervals for the di erence x y of the population means or to test the null hypothesis H 0:

x = y that they are equal.

Let X , Y ,S x, and S y denote the sample means and standard deviations of the two samples. Naturally, our inferences will be based on the di erence X Y between the sample averages. The expected value and variance of this di erence are:

E( X Y ) = E( X ) E( Y ) = x y (9.1) var ( X Y ) = var( X ) + var( Y ) = 2 x m + 2 y n (9.2) The standard deviation of X Y is sd( X Y ) = r 2 x m + 2 y n and its standard error, obtained by estimating x and y by S x and S y, respectively, is se ( X Y ) = r S 2 x m +S 2 y n (9.3) 9.1.1 Large Samples The central limit theorem applies not just to each sample average separately, but also to their di er- ence. In other words, for large sample sizes mand n, the distribution of 161 Go to TOC CHAPTER 9. INFERENCE FROM MULTIPLE SAMPLES 162 Z = ( X Y ) ( x y) sd ( X Y ) (9.4) is approximately standard normal. More usefully, since the population variances are not usually known, the studentized value T= ( X Y ) ( x y) se ( X Y ) (9.5) approaches standard normal as m; n! 1 . Ifz = 2is the 100(1 = 2) percentile of the standard normal distribution, then approximately P r( z = 2< T < z =2) = 1 :

Theorem 9.1. For large sample sizes mand na 100(1 )% con dence interval for x y is X Y z = 2r S 2 x m +S 2 y n :

Theorem 9.2. For large sample sizes mand n, a test of signi cance level for the null hypothesis H 0:

x = y against the two sided alternative H 1:

x 6 = y rejects H 0when j T j> z =2; i.e., when X Y > z =2r S 2 x m +S 2 y n :

For the one sided alternative, H 1:

x > y, reject H 0when X Y > z r S 2 x m +S 2 y n :

The p-value for the two-sided alternative is 2(1 ( jT obs j )), where is the standard normal cumulative distribution.

Example 9.1. The data set "lungcap" has measurements of forced expiratory volume (fev), a measure of lung capacity, for 85 male sub jects between the ages of 16 and 18. Thirty ve were smokers and 50 were not. The mean and variance for the smoking group were 3.624 and 0.084. For the nonsmokers they were 3.747 and 0.120. Find a 95% con dence interval for the di erence between the mean fev of nonsmokers and smokers for this age and gender population.

Solution : The sample sizes m= 50 and n= 35 should be large enough to have con dence in the central limit theorem. We will apply Theorems9.1and9.2. Later we will con rm our conclusions with other procedures. Go to TOC CHAPTER 9. INFERENCE FROM MULTIPLE SAMPLES 163 The standard error of X Y is se( X Y ) = r S 2 x m +S 2 y n = r 0 :120 50 + 0 :084 35 = 0 :069 So, the 95% con dence interval is X Y z :025 se( X Y ) = 3 :747 3:624 1:96(0 :069) = 0 :123 0:135 ; i.e., the interval ( 0:012 ;0 :258).

9.1.2 Comparing Two Population Proportions If the X iand Y j are samples of Bernoulli random variables with success probabilities p x and p y, then the averages X and Y are just the sample proportions b p x and b p y. Their variances are, respectively, var (b p x ) = p x (1 p x ) m and var(b p y) = p y(1 p y) n :

The standard deviation of b p x b p y = X Y is sd (b p x b p y) = r p x (1 p x ) m + p y(1 p y) n and its standard error is se(b p x b p y) = r b p x (1 b p x ) m + b p y(1 b p y) n :

Therefore, by Theorem9.1, if mand nare large, a 100(1 )% con dence interval for the di erence p x p y in the population proportions is b p x b p y z = 2r b p x (1 b p x ) m + b p y(1 b p y) n (9.6) Example 9.2. In a sample ofm= 40 older houses in New Orleans, 12 had termite damage. In a sample of n= 60 older houses in Houston, 14 had termite damage. Find a 90 % con dence interval for the di erence in the incidence of termite damage in New Orleans and Houston.

Solution : The sample incidences are b p x = 12 =40 = 0 :30 for New Orleans and b p y = 14 =60 = 0 :233 for Houston. Therefore, the 90% con dence interval is Go to TOC CHAPTER 9. INFERENCE FROM MULTIPLE SAMPLES 164 0 :30 0:233 z :05 r :

30( :70) 40 + :

233( :767 60 = 0 :067 1:645(0 :091) = 0 :067 0:149 Testing Equality of Population Proportions To test the null hypothesis that two population proportions are equal H 0:

p x = p y, one could apply Theorem9.2adapted for Bernoulli samples. However, there is a test that is slightly more powerful, that is, less likely to make a type 2 error. If H 0is true, and p x = p y = p, then a better estimator of the common value pthan either b p x or b p y is their weighted average b p = m b p x + nb p y m +n ; called the pooled estimator ofp. If H 0is true then the standard deviation and standard error of b p x b p y are sd(b p x b p y) = s p (1 p) 1 m + 1 n se (b p x b p y) = s b p (1 b p ) 1 m + 1 n (9.7) For large mand nand signi cance level reject H 0for H 1:

p x 6 = p y if j b p x b p yj > z =2se (b p x b p y) (9.8) For a one sided alternative, erase the absolute value signs, replace z = 2with z and reverse the inequality, if necessary.
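As a quick check of (9.7) and (9.8) with the termite data of Example 9.2, the pooled test statistic can be computed by hand; it parallels the prop.test call shown next.

> x <- 12; m <- 40; y <- 14; n <- 60
> phat.x <- x/m; phat.y <- y/n
> phat <- (x + y)/(m + n)                        # pooled estimate
> z <- (phat.x - phat.y) / sqrt(phat*(1 - phat)*(1/m + 1/n))
> z^2                                            # about 0.554, the X-squared value below
> 2 * (1 - pnorm(abs(z)))                        # two-sided p-value, about 0.46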

Comparing Proportions with R

Previously we used the "prop.test" function in R to get confidence intervals and test hypotheses for a single proportion. The same function works for comparing two population proportions. To illustrate, we will rework the preceding example.

> prop.test(c(12,14),c(40,60),conf.level=.90,correct=F)

        2-sample test for equality of proportions without continuity correction

data:  c(12, 14) out of c(40, 60)
X-squared = 0.5544, df = 1, p-value = 0.4565
alternative hypothesis: two.sided
90 percent confidence interval:
 -0.08256681  0.21590014
sample estimates:
   prop 1    prop 2
0.3000000 0.2333333

Except for roundoff error, the confidence interval is the same as the one we derived by hand. The p-value of 46% means that there is no evidence that the two cities differ in infestation rates. The first argument to prop.test, c(12,14), is the vector of counts of successes in the two samples and the second argument, c(40,60), is the vector of the sample sizes, or numbers of trials in the two binomial experiments. The argument "correct=F" was to prevent R from applying the Yates continuity correction, so that the answers would be the same as before. There is some dispute about whether the Yates correction is desirable, but it is the default option in R and it does cause slightly different answers to be returned. The number labelled "X-squared" is essentially the square of the standardized test statistic. Therefore, under H0, it has approximately a chi-square distribution with 1 degree of freedom.

9.1.3 Samples from Normal Distributions

When m and n are not large enough for reliance on the central limit theorem, but the two distributions are nearly normal, there are procedures based on student-t distributions that are similar to the procedures for inferences about a single population mean.

The Welch Test and Confidence Interval

Theorem 9.3. If the samples {Xi} and {Yj} are from normal distributions, then the statistic T in (9.5) has an approximate student-t distribution with degrees of freedom equal to

ν = (Sx²/m + Sy²/n)² / ( Sx⁴/(m²(m − 1)) + Sy⁴/(n²(n − 1)) )      (9.9)

Though the student-t distribution is only an approximation, the test for the null hypothesis H0: μx = μy associated with this theorem performs very well in small sample situations. The test is known as Welch's t test. If you are using a table of the student-t distributions, round ν given above to the nearest integer. Student-t distributions with fractional degrees of freedom are perfectly legitimate and most software is capable of handling them.

Using this result, the Welch 100(1 − α)% confidence interval for μx − μy is

X̄ − Ȳ ± t_{α/2}(df = ν) √(Sx²/m + Sy²/n)      (9.10)

where ν is given by (9.9).

If α is the desired significance level, the Welch test for H0: μx = μy against H1: μx ≠ μy rejects H0 if

|X̄ − Ȳ| > t_{α/2}(ν) √(Sx²/m + Sy²/n)      (9.11)

For one-sided alternatives, make obvious modifications to (9.11).

The t-test with Equal Variances

When the distributions are normal and have equal variances σx² = σy² = σ², there are exact student-t procedures for confidence intervals and tests. Both Sx² and Sy² are unbiased estimators of σ², but a better estimator is a weighted average of them called the pooled estimator of the common variance:

Sp² = ( (m − 1)Sx² + (n − 1)Sy² ) / (m + n − 2).

(m + n − 2) Sp²/σ² has a chi-square distribution with m + n − 2 degrees of freedom and is independent of both X̄ and Ȳ. Since

X̄ − Ȳ ~ Norm( μx − μy, σ √(1/m + 1/n) ),

it follows that

Tp = (X̄ − Ȳ − (μx − μy)) / ( Sp √(1/m + 1/n) )

has a student-t distribution with m + n − 2 degrees of freedom. The 100(1 − α)% confidence interval is

X̄ − Ȳ ± t_{α/2}(m + n − 2) Sp √(1/m + 1/n)      (9.12)

Example 9.3. The "lungcap" data set is adapted from the forced expiratory volume data set provided by the Journal of Statistics Education¹ in the JSE Data Archive. It has two variables, one named "fev" for forced expiratory volume and the other named "smoke" for smoking status. "smoke" is a factor with two levels, "no" and "yes".

It is always a good idea to look at side by side boxplots before applying a formal inference procedure.

¹ Kahn, M., An exhalent problem for teaching statistics, JSE vol. 13, 2, 2005

[Figure: side-by-side boxplots of fev for the smoke = "no" and smoke = "yes" groups.]

The shapes of the boxplots are consistent with the normality assumption, but their spreads seem to cast doubt on the equal variance assumption. We will apply R's "t.test" function for both equal and unequal variances to get confidence intervals and to test the hypothesis that the population means are equal.

> attach(lungcap)
> t.test(fev~smoke)

        Welch Two Sample t-test

data:  fev by smoke
t = 1.7642, df = 80.225, p-value = 0.0815
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.01565884  0.26039598
sample estimates:
 mean in group no mean in group yes
          3.746740          3.624371

> t.test(fev~smoke,var.equal=T)

        Two Sample t-test

data:  fev by smoke
t = 1.7101, df = 83, p-value = 0.09098
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.01995121  0.26468836
sample estimates:
 mean in group no mean in group yes
          3.746740          3.624371

The t-tests and the normal test in Example 9.1 all give nearly the same answers. The confidence intervals are also nearly the same. At α = 10% we would reject H0: μx = μy, but not at α = 5%.

The formula "fev ~ smoke" in these calls to t.test tells R to treat fev as a function of smoke, i.e. to separate the values of fev into two groups according to the value of smoke and to treat them as independent samples. There are other ways of calling t.test, including optional arguments for specifying one-sided alternatives and different confidence levels. Read about them by calling the help function.

> help(t.test)

9.1.4 Exercises

1. A sample of size 60 from one population of weights had a sample average of 10.4 lb. and a sample standard deviation of 2.7 lb. An independent sample of size 100 from another population of weights had a sample average of 9.7 lb. with a sample standard deviation of 1.9 lb. Find a 95% confidence interval for the difference between the population means.

2. Repeat problem 1, but assume the sample sizes are 6 and 10. State any assumptions you make.

3. The Payroll data set has data on the numbers of employees and the monthly payroll in thousands of dollars for 50 firms in two different industries. If you divide "payroll" by "employees" you get the average monthly salary for each firm. The populations of interest are the firms in industry A and those in industry B. The population variable of interest is the average monthly salary in each of the firms of these populations. At a significance level of α = 0.05 test the null hypothesis that the means of by-firm average monthly salaries in industries A and B are equal.

4. Construct side-by-side boxplots of average monthly salaries per firm in industries A and B. Critique your answer in problem 3.

5. Samples of sizes 100 and 80 of calculus students were acquired. The students in the first sample got into calculus by passing the pre-calculus course. Those in the second sample got in by getting a passing score on a placement test. In the first group, 65 succeeded in calculus. In the second group, 41 succeeded. Without using R, find a 95% confidence interval for the difference in the success rates of the two populations.

6. For the same data, find the p-value for a test of the alternative hypothesis that the two success rates are not equal, without using R.

7. Repeat problems 5 and 6 using R.

9.2 Paired Observations

An experimental setup that is superficially similar to two independent samples is the paired observations design. In this design, a pair of observations X and Y is made on each of n experimental subjects, or perhaps 2n subjects are matched in pairs according to similar characteristics and X is measured on one subject of each pair while Y is measured on the other. In the end, we have n pairs (Xi, Yi), i.e., a sample of size n from a bivariate distribution. This is not the same thing as a sample of n values of X from one population and n independent values of Y from another population. The goal is inference about the mean of D = X − Y and usually it suffices to apply the one-sample t-test or confidence interval to the n differences Di = Xi − Yi. Recall Gosset's split plot experiment for comparing two methods of drying seeds before planting. The experimental units are small homogeneous plots of land. Half of each plot i is planted with seeds dried by the regular method, resulting in a yield Xi. The other half is planted with seeds that are kiln dried, resulting in yield Yi. We repeat the data and the analysis here.

Plot     1    2    3    4    5    6    7    8    9   10   11
REG   1903 1935 1910 2496 2108 1961 2060 1444 1612 1316 1511
KILN  2009 1915 2011 2463 2180 1925 2122 1482 1542 1443 1535

> attach(gosset)
> t.test(REG-KILN)

        One Sample t-test

data:  REG - KILN
t = -1.6905, df = 10, p-value = 0.1218
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -78.18164  10.72710
sample estimates:
mean of x
-33.72727

There is another way to do this in R, illustrated below.

> t.test(REG,KILN,paired=T)

        Paired t-test

data:  REG and KILN
t = -1.6905, df = 10, p-value = 0.1218
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -78.18164  10.72710
sample estimates:
mean of the differences
              -33.72727

9.2.1 Crossover Studies

In a typical crossover study, a potentially effective treatment and a placebo are both applied to each of m + n human subjects and their responses (Xi, Yi) to the treatment and placebo are recorded. Because of concerns that the order of application might affect the responses, they are applied in random order.

Thus, m of the subjects get the treatment first and n get the placebo first. The treatment effect is measured for each subject by first observation − second observation. For those who receive the treatment first, it is

Di = Xi − Yi,  i = 1, ..., m,

and for the subjects who receive the placebo first it is

D′j = Y′j − X′j,  j = 1, ..., n.

If the treatment has no real effect, the distributions of D and D′ should have the same mean, μD = μD′. The order of application would not affect D and D′ differently. On the other hand, if the treatment effect is real, μD ≠ μD′. So, even though we are looking at differences of paired observations, the problem is a two-sample problem with independent samples from the populations "treatment first" and "placebo first".

Example 9.4. Twenty mildly hypertensive men were recruited for a crossover study of the effectiveness of a drug to lower blood pressure. Each man was given the new drug (drug A) for a week and an older drug (drug B) for a week. Their average morning systolic blood pressures for each week were recorded. The order of administration of the drugs was randomized and unknown both to the subjects and to medical attendants. The data is in the data set "bpcrossover". The average blood pressure for each subject and each drug is shown. The "period" variable indicates whether the new drug was given first or second. The goal is to determine if there is a difference in the two drugs.

    drugA  drugB  period
 1    112    139       1
 2    140    135       2
 3    125    138       1
 4    149    138       2
 5    139    151       1
 6    121    127       2
 7    136    137       2
 8    130    139       1
 9    146    120       2
10    125    122       2
11    121    138       1
12    145    166       1
13    132    139       2
14    146    128       2
15    125    136       2
16    143    142       2
17    143    152       1
18    129    136       1
19    136    132       2
20    128    140       1

If this problem is approached as an ordinary paired observation problem, the student-t test is not significant.

mean of x -4.2 However, treating the problem correctly as a crossover experiment, there is a signi cant di erence in the drugs.

> diff1=drugA[period==1]-drugB[period==1] > diff2=drugB[period==2]-drugA[period==2] > t.test(diff1,diff2) Welch Two Sample t-test data: diff1 and diff2 t = -2.5781, df = 16.544, p-value = 0.01984 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -18.568571 -1.835469 sample estimates:

mean of x mean of y -14.111111 -3.909091 Go to TOC CHAPTER 9. INFERENCE FROM MULTIPLE SAMPLES 172 Estimating the Size of the E ect In the example just above, the student-t procedure returns a con dence interval for the di erence in means E(X Y) E(Y 0 X0 ) = E(X Y) + E(X 0 Y0 ) ; where Xis the response to drug A when it is given rst, X0 is the drug A response when it is given second, Yis the drug B response when it is given second and Y0 is the drug B response when it is given rst. Each of the terms on the right is a measure of the di erence in the e cacy of drug A compared to drug B, under di erent experimental conditions. Therefore, a reasonable overall measure of A's e cacy compared to B's is their average.

(1/2) [ E(X - Y) + E(X' - Y') ].

If we divide the end points of the returned confidence interval by 2, we obtain an interval estimate of this difference in mean efficacy. In this particular example, the interval is (-9.285, -0.918).
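In R this can be done directly from the object returned by "t.test". The following is a minimal sketch, assuming diff1 and diff2 from Example 9.4 are still in the workspace; "conf.int" is the name of the confidence interval component of the list that t.test returns.

> ci <- t.test(diff1,diff2)$conf.int   # confidence interval for E(X-Y) + E(X'-Y')
> ci/2                                 # interval estimate of the average effect, about (-9.285, -0.918)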

9.2.2 Exercises

1. Concentrations of particulate matter were measured at 25 locations in a lake after rains caused heavy runoff from surrounding areas. After a period of dry weather, they were measured again at the same locations. The data are in the file "runoff". Find a 90% confidence interval for the effect of runoff on particulate concentrations.

2. Repeat problem 1 treating the rainy and dry measurements as independent samples. Compare the results. Which is the better procedure?

3. With the bpcrossover data, make side-by-side boxplots of drugA - drugB for both values of period.

Does this indicate anything about whether the order of administration of the drugs makes a difference? If it exists, this is called a period effect.

4. Go to StatSci.org at http://www.statsci.org/data/general/vitaminc.html and download the data set "Effect of Vitamin C on Muscular Endurance" [2]. Perform the appropriate analysis to determine if vitamin C has an effect and, if so, the size of the effect.

[2] Keith, R. E., and Merrill, E. (1983). The effects of vitamin C on maximum grip strength and muscular endurance. Journal of Sports Medicine and Physical Fitness, 23, 253-256.

9.3 More than Two Independent Samples: Single Factor Analysis of Variance

Let X be a variable defined for several subgroups of a main population. Let

X_{11}, X_{12}, ..., X_{1n_1} be a random sample of size n_1 from Norm(μ_1, σ),
X_{21}, X_{22}, ..., X_{2n_2} be a random sample of size n_2 from Norm(μ_2, σ),
...
X_{M1}, X_{M2}, ..., X_{Mn_M} be a random sample of size n_M from Norm(μ_M, σ).

We assume that the samples are independent of each other; everything is independent of everything else. Also note that we are assuming that the variances of the variable X in the M groups are all the same: var(X) = σ². Typically, the M groups are defined by the M levels of a discrete factor variable A that covaries with X in the larger population. If you like, you can think of X_ij as a sample from the conditional distribution of X, given A = i.

The objective is to decide if there is any real difference in the group means μ_1, μ_2, ..., μ_M, that is, to test the null hypothesis

H_0: μ_1 = μ_2 = ... = μ_M

against the many sided alternative

H_1: μ_j ≠ μ_k for at least one pair j, k.

If we reject this null hypothesis, we conclude that factor A has a real effect on the expected response.

Let

\bar{X}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}

denote the average of the observations in the ith group. Let N = \sum_{i=1}^{M} n_i be the total number of observations in all the groups, and let

\bar{X} = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{n_i} X_{ij}

be the average of all the observations (the grand average). The grand average is also a weighted average of the M group averages:

\bar{X} = \frac{1}{N} \sum_{i=1}^{M} n_i \bar{X}_i .

The total sum of squares is

SS(tot) = \sum_{i=1}^{M} \sum_{j=1}^{n_i} (X_{ij} - \bar{X})^2 .   (9.13)

SS(tot) can be expanded by the binomial expansion:

SS(tot) = \sum_{i=1}^{M} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i + \bar{X}_i - \bar{X})^2
        = \sum_{i=1}^{M} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2 + 2 \sum_{i=1}^{M} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)(\bar{X}_i - \bar{X}) + \sum_{i=1}^{M} \sum_{j=1}^{n_i} (\bar{X}_i - \bar{X})^2 .

The middle term above is 0. After simplifying the third term,

SS(tot) = \sum_{i=1}^{M} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2 + \sum_{i=1}^{M} n_i (\bar{X}_i - \bar{X})^2 = SS(resid) + SS(betw).   (9.14)

The residual sum of squares, also sometimes called the within group or error sum of squares, is

SS(resid) = \sum_{i=1}^{M} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2 .

The between group sum of squares, or treatment sum of squares, or factor A sum of squares is

SS(betw) = \sum_{i=1}^{M} n_i (\bar{X}_i - \bar{X})^2 .

It is pretty clear that SS(betw) is a measure of how widely dispersed the group averages \bar{X}_i are about their center \bar{X}. Thus, it is the basis for a test statistic for accepting the alternative hypothesis that the population means are not all the same. To be useful, it has to be compared to a measure of the inherent variability of the data. SS(resid) is just such a measure. The following theorem tells us how to relate the two.

Theorem 9.4. SS(resid) and SS(betw) are independent random variables, and

SS(resid)/σ² ~ Chisq(df = N - M).

If H_0: μ_1 = μ_2 = ... = μ_M is true, then

SS(betw)/σ² ~ Chisq(df = M - 1)

and

F = \frac{(N - M)}{(M - 1)} \frac{SS(betw)}{SS(resid)} ~ FDist(M - 1, N - M),

the F distribution with M - 1 degrees of freedom in the numerator and N - M degrees of freedom in the denominator.

We define the mean square residual to be MS(resid) = SS(resid)/(N - M) and the factor A mean square to be MS(betw) = SS(betw)/(M - 1), so that F can be written

F = MS(betw) / MS(resid).

For a given significance level α, let f_α(ν_1, ν_2) be the 100(1 - α)th percentile of FDist(ν_1, ν_2). For a test at significance level α of H_0: μ_1 = μ_2 = ... = μ_M against H_1: μ_i ≠ μ_j for at least one pair i, j, reject H_0 if

F = MS(betw) / MS(resid) > f_α(M - 1, N - M).   (9.15)

In R, f_α can be found with the function "qf", the quantile function of the F distribution, and the p-value Pr(F > F_obs) with the function "pf".
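For instance, with M = 4 groups and N = 40 observations (the numbers used in Example 9.5 below), the following minimal sketch finds the critical value and a p-value at α = 0.05.

> qf(0.95, 3, 36)        # critical value f_0.05(M-1, N-M)
> 1 - pf(2.014, 3, 36)   # p-value for an observed F statistic of 2.014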

Example 9.5. The summary statistics for samples of observations of a variable X on four groups are shown below.

        n  mean    var
group1 10 1.786 13.792
group2 10 3.268  5.649
group3 10 2.419  3.860
group4 10 4.374  1.631

There are M = 4 groups and each group sample size n_i is 10.

N = 40 is the total combined sample size. To find the grand average, we multiply each group average by the group sample size, add, and divide by N:

\bar{X} = (10(1.786) + 10(3.268) + 10(2.419) + 10(4.374)) / 40 = 2.962.

Now, to get SS(resid) we multiply the sample variance for each group by n_i - 1 and add them together.

SS(resid) = 9(13.792) + 9(5.649) + 9(3.860) + 9(1.631) = 224.388

The factor A sum of squares SS(betw) = \sum_{i=1}^{M} n_i (\bar{X}_i - \bar{X})^2 is

SS(betw) = 10(1.786 - 2.962)^2 + 10(3.268 - 2.962)^2 + 10(2.419 - 2.962)^2 + 10(4.374 - 2.962)^2 = 37.652,

and then, finally,

MS(betw)  = SS(betw)/(M - 1)  = 37.652/3   = 12.551
MS(resid) = SS(resid)/(N - M) = 224.388/36 = 6.233
F = MS(betw)/MS(resid) = 12.551/6.233 = 2.014.

We will use the R function "pf" to find the p-value Pr(F > 2.014).

> 1-pf(2.014,3,36)
[1] 0.1293147

Thus the test is not significant at α = 0.10. We do not conclude that there is a difference in population means.
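The whole calculation is easy to script. Below is a minimal sketch that reproduces the hand computations of Example 9.5 from the summary statistics; the variable names are ours, not part of any packaged data set.

> n    <- c(10, 10, 10, 10)                  # group sample sizes
> xbar <- c(1.786, 3.268, 2.419, 4.374)      # group means
> v    <- c(13.792, 5.649, 3.860, 1.631)     # group sample variances
> N <- sum(n); M <- length(n)
> grand   <- sum(n*xbar)/N                   # grand average
> SSresid <- sum((n-1)*v)
> SSbetw  <- sum(n*(xbar-grand)^2)
> Fstat   <- (SSbetw/(M-1)) / (SSresid/(N-M))
> 1 - pf(Fstat, M-1, N-M)                    # p-value, about 0.129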

This kind of problem and the kind of analysis shown above is called one-way analysis of variance or single factor analysis of variance because the observations of X are categorized by the levels of a single factor variable. Analysis of variance is abbreviated anova.

9.3.1 Example Using R

The data set "apfilternoise.txt" shows data presented to a Senate subcommittee on a comparison of a new type of automobile pollution filter to an older type. One of the variables of concern was the noise created inside the vehicle by the filter. A factor that might influence the noise is the size of the vehicle. The data below is a subset "filternoise1" of the full data set consisting only of measurements for the old filter type.

   NOISE    SIZE
1    810   small
2    820   small
3    820   small
4    840 midsize
5    840 midsize
6    845 midsize
7    785   large
8    790   large
9    785   large
10   835   small
11   835   small
12   835   small
13   845 midsize
14   855 midsize
15   850 midsize
16   760   large
17   760   large
18   770   large

R has a function "aov" for performing analyses of variance for data such as this, but the "lm" function that we used for regression actually gives more information. In fact, linear regression models and analysis of variance models are both examples of linear models, and "lm" is designed for all types of linear models.

> filternoise.lm=lm(NOISE~SIZE,data=filternoise1)
> summary(filternoise.lm)

Call:
lm(formula = NOISE ~ SIZE, data = filternoise1)

Residuals:
     Min       1Q   Median       3Q      Max
-15.8333  -5.8333  -0.8333   9.1667  15.0000

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  825.833      4.271 193.362  < 2e-16 ***
SIZEmidsize   20.000      6.040   3.311  0.00475 **
SIZElarge    -50.833      6.040  -8.416 4.59e-07 ***
---
Residual standard error: 10.46 on 15 degrees of freedom
Multiple R-squared:  0.907,    Adjusted R-squared:  0.8946
F-statistic: 73.11 on 2 and 15 DF,  p-value: 1.841e-08

> anova(filternoise.lm)
Analysis of Variance Table

Response: NOISE
          Df  Sum Sq Mean Sq F value    Pr(>F)
SIZE       2 16002.8  8001.4  73.109 1.841e-08 ***
Residuals 15  1641.7   109.4
---

The estimated coefficients in the summary of the linear model object "filternoise.lm" need some explanation. The variable SIZE is a factor, i.e., a discrete variable with nominal values or levels, "small", "midsize" and "large". These are stored internally in that order, but this is merely coincidental. SIZE is not an ordered factor, which is a separate class in R.

The Intercept coefficient in the summary is the estimated mean of NOISE for SIZE="small". In other words, it is the quantity \bar{X}_1 in the discussion above. The second coefficient, labelled SIZEmidsize, is the difference in estimated means for SIZE="midsize" and SIZE="small". That is, it is the quantity \bar{X}_2 - \bar{X}_1. Finally, the coefficient SIZElarge is the estimated difference in the mean of NOISE when SIZE="large" and its mean when SIZE="small", i.e., \bar{X}_3 - \bar{X}_1. The first level of the factor is treated as a base level and the others are compared to it.

For each coefficient, the summary gives a student-t statistic for testing the hypothesis that the true coefficient is zero and a p-value for the statistic. Since the p-values here are so small, we are justified in concluding that μ_2 ≠ μ_1 and μ_3 ≠ μ_1. The summary gives no information to justify the conclusion that μ_3 ≠ μ_2.

After creating the linear model object, the call

> anova(filternoise.lm)

produces an analysis of variance table. The first line, headed "SIZE", shows the between group degrees of freedom M - 1, the between group sum of squares SS(betw), the between group mean square MS(betw) and the F statistic F = MS(betw)/MS(resid). The last entry in the first row is the p-value for the observed value of F. The second row, headed "Residuals", gives the degrees of freedom N - M, the residual sum of squares SS(resid), and the mean square residual MS(resid). An anova table is the traditional way of presenting the calculations in an analysis of variance. Some of them look slightly different from this, but they all convey about the same information. Notice that the F statistic and its p-value are the same in the linear model summary and the anova table.

9.3.2 Multiple Comparisons

In the example just above, we concluded that μ_2 ≠ μ_1 and μ_3 ≠ μ_1 because the corresponding p-values were so small. We were unable to draw any conclusion about μ_3 - μ_2. At best, the anova table allows us to say that some of the group means are different, without saying which ones. We could perform separate two-sample t-tests at level α on each pair of groups and reject some of the hypotheses μ_i = μ_j while not rejecting others. The problem is, when there are a lot of pairwise comparisons to be made, the probability that one or more of these pairwise null hypotheses is rejected is substantially greater than α, even if they are all true. This is the problem of multiple comparisons. A solution is to reduce the significance level of the pairwise tests enough so that the probability of one or more type 1 errors in the whole set of comparisons is less than α. The Bonferroni method of adjustment reduces the significance level for the pairwise tests to α/k, where k is the number of comparisons. The Holm method of adjustment is considerably more complicated. It is the default method used in R, although the Bonferroni method can be selected as an option.

The R function is "pairwise.t.test". It actually adjusts p-values instead of pre-established values of α.

We will illustrate it with the data of the preceding example.

> attach(filternoise1)
> pairwise.t.test(NOISE,SIZE)

        Pairwise comparisons using t tests with pooled SD

data:  NOISE and SIZE

        large   midsize
midsize 1.8e-08 -
small   9.2e-07 0.0047

P value adjustment method: holm

> pairwise.t.test(NOISE,SIZE,"bonferroni")

        Pairwise comparisons using t tests with pooled SD

data:  NOISE and SIZE

        large   midsize
midsize 1.8e-08 -
small   1.4e-06 0.014

P value adjustment method: bonferroni

> pairwise.t.test(NOISE,SIZE,"none")

        Pairwise comparisons using t tests with pooled SD

data:  NOISE and SIZE

        large   midsize
midsize 5.9e-09 -
small   4.6e-07 0.0047

P value adjustment method: none

The method of adjustment didn't make any difference here because the p-values were so small and there were only three comparisons. The default holm method is recommended, in general.
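The same adjustments can be applied to any vector of raw p-values with R's "p.adjust" function. A minimal sketch, using the three unadjusted p-values shown above, reproduces the holm and bonferroni results.

> praw <- c(5.9e-09, 4.6e-07, 0.0047)      # unadjusted pairwise p-values from above
> p.adjust(praw, method="holm")            # 1.8e-08, 9.2e-07, 0.0047
> p.adjust(praw, method="bonferroni")      # 1.8e-08, 1.4e-06, 0.014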

9.3.3 Exercises

1. Using the summary data in Example 9.5, construct an anova table.

2. The table below shows summary statistics for normally distributed measurements on 5 groups. The population variances are all equal. Construct an anova table and determine if there is a difference in the population means by calculating a p-value.

      n  mean    var
grp1 10 52.40 243.38
grp2 21 55.00 142.00
grp3 16 36.25 246.73
grp4 20 53.65 173.82
grp5 18 47.50 267.91

3. The data set "airquality" in the R datasets library has data on ozone concentration, wind speed, temperature, and solar radiation by month and day for May through September in New York. Attach airquality to your workspace and then construct side-by-side boxplots of Wind by Month. Month is a numeric variable in the airquality data frame. You can treat it as a factor by using the "as.factor" function, e.g.,

> plot(Wind ~ as.factor(Month))

Next, do an analysis of variance to determine if wind speed varies significantly by month. Finally, use the "pairwise.t.test" function to pick out which pairs of months are significantly different. Are the answers what you would expect from looking at the boxplots?

4. From the course data folder www.math.uh.edu/~charles/data/Reading import the reading comprehension data. "Group" is a factor whose levels are abbreviations of three methods of reading instruction. Create a linear model object in R with "Post3" as a function of "Group".

Apply the "summary" function to the linear model object and interpret the coefficients. Note the associated p-values. Apply the "anova" function to the linear model object to create an anova table and interpret its output. Apply the function "pairwise.t.test" in the following manner.

> pairwise.t.test(Post3,Group,"bonferroni")

Compare these p-values to the p-values in the linear model summary.

9.4 Two-Way Analysis of Variance

Agricultural researchers are interested in comparing the yield of 4 varieties of corn. To control for the effects of rainfall, climate, and soil fertility, they select 5 small plots of land and divide each into fourths. The four varieties are randomly assigned to the subplots. At harvest, the yield of the jth variety on the ith plot is measured. Let X_ij denote its value. There are two factors that affect the yield - the variation in growing conditions between the main plots and the variation between seed varieties.

Hence, this is a two-factor experiment. It is a generalization of the paired observation experimental design discussed earlier. Researchers are primarily interested in the differences between seed varieties and not so much in the effect of the growing conditions. Note that there is only one observation for each combination of levels of the two factors.

Let A denote the plot factor. It has a = 5 levels. Let B denote the variety factor, with b = 4 levels.

Let μ_ij = E(X_ij) be the expected response (yield) for level i of A and level j of B. It is assumed that the X_ij are independent and normally distributed with common variance σ².

No inferences are possible without making explicit what kind of effects we are looking for and restricting the expected values μ_ij in an appropriate way. The model we choose is called the additive model:

μ_ij = μ + α_i + β_j   (9.16)

It is called the additive model because changing only the level of A simply adds some amount to μ_ij, and it is the same amount irrespective of j - the level of B. A similar statement holds for changing the level of B. The parameters satisfy the restrictions

\sum_{i=1}^{a} α_i = \sum_{j=1}^{b} β_j = 0.

Because of these restrictions, only a - 1 of the α_i and b - 1 of the β_j vary freely. Including μ, then, there are (a - 1) + (b - 1) + 1 = a + b - 1 parameters that determine the ab means μ_ij.

If the means μ_ij do satisfy this model, then μ, α_i, and β_j can be found as follows:

μ = \bar{μ}_{··} = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} μ_{ij},   α_i = \bar{μ}_{i·} - μ,   β_j = \bar{μ}_{·j} - μ,

where

\bar{μ}_{i·} = \frac{1}{b} \sum_{j=1}^{b} μ_{ij},   \bar{μ}_{·j} = \frac{1}{a} \sum_{i=1}^{a} μ_{ij}.

The model can also be expressed as

μ_ij = \bar{μ}_{i·} + \bar{μ}_{·j} - μ,   (9.17)

and the degree to which the model is untrue is

\sum_{i=1}^{a} \sum_{j=1}^{b} (μ_{ij} - \bar{μ}_{i·} - \bar{μ}_{·j} + μ)^2 .   (9.18)

Now, factor A has no effect on the expected response if and only if all the α_i are equal to zero, i.e., if and only if

\sum_{i=1}^{a} α_i^2 = \sum_{i=1}^{a} (\bar{μ}_{i·} - μ)^2 = 0.   (9.19)

Similarly, factor B has no effect if and only if

\sum_{j=1}^{b} β_j^2 = \sum_{j=1}^{b} (\bar{μ}_{·j} - μ)^2 = 0.   (9.20)

Finally, assuming the model to be true, neither factor has any effect if and only if the μ_ij are all the same, i.e., if and only if

\sum_{i=1}^{a} \sum_{j=1}^{b} (μ_{ij} - μ)^2 = 0.

The following algebraic identity holds for all of these quantities:

\sum_{i=1}^{a} \sum_{j=1}^{b} (μ_{ij} - μ)^2 = b \sum_{i=1}^{a} (\bar{μ}_{i·} - μ)^2 + a \sum_{j=1}^{b} (\bar{μ}_{·j} - μ)^2 + \sum_{i=1}^{a} \sum_{j=1}^{b} (μ_{ij} - \bar{μ}_{i·} - \bar{μ}_{·j} + μ)^2 .   (9.21)

Of course none of the parameters in this equation are known, but they all have estimated values from the observations X_ij. Exactly the same identity holds for the estimated values:

SS(tot) = SS(A) + SS(B) + SS(resid),   (9.22)

where

SS(tot)   = \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X})^2 ,
SS(A)     = b \sum_{i=1}^{a} (\bar{X}_{i·} - \bar{X})^2 ,
SS(B)     = a \sum_{j=1}^{b} (\bar{X}_{·j} - \bar{X})^2 ,
SS(resid) = \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i·} - \bar{X}_{·j} + \bar{X})^2 .
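These sums of squares are easy to compute directly when the data are arranged as a matrix with one observation per cell, rows indexed by the levels of A and columns by the levels of B. The sketch below uses a small hypothetical matrix X; it is only meant to illustrate the formulas above, not any particular data set from the text.

> X <- matrix(c(12, 15, 11,
+               14, 18, 13), nrow=2, byrow=TRUE)   # hypothetical data, a = 2, b = 3
> a <- nrow(X); b <- ncol(X)
> rowm <- rowMeans(X); colm <- colMeans(X); grand <- mean(X)
> SSA <- b*sum((rowm - grand)^2)
> SSB <- a*sum((colm - grand)^2)
> SSresid <- sum((X - outer(rowm, rep(1,b)) - outer(rep(1,a), colm) + grand)^2)
> SStot <- sum((X - grand)^2)                       # equals SSA + SSB + SSresid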

Theorem 9.5. The three terms on the right of (9.22) are independent random variables. Furthermore,

SS(resid)/σ² ~ Chisq(df = (a - 1)(b - 1)).

If α_1 = α_2 = ... = α_a = 0 (factor A has no effect),

SS(A)/σ² ~ Chisq(df = a - 1).

If β_1 = β_2 = ... = β_b = 0 (factor B has no effect),

SS(B)/σ² ~ Chisq(df = b - 1).

Define mean square figures by dividing each of the above by its degrees of freedom:

MS(A) = SS(A)/(a - 1),   MS(B) = SS(B)/(b - 1),   MS(resid) = SS(resid)/((a - 1)(b - 1)),

and F statistics by

F_A = MS(A)/MS(resid),   F_B = MS(B)/MS(resid).

To test the null hypothesis H_A: α_1 = ... = α_a = 0 against the alternative that α_i ≠ α_k for some i and k, reject H_A if

F_A > f_α(a - 1, (a - 1)(b - 1)),

where α is the desired type 1 error probability. To test H_B: β_1 = ... = β_b = 0, reject H_B if

F_B > f_α(b - 1, (a - 1)(b - 1)).

Example 9.6. We return to the auto pollution filter noise data. Below is a two-way table showing one observation of NOISE for each combination of the factors SIZE and TYPE.

        standard Octel
small      825.8 822.5
midsize    845.8 821.7
large      775.0 770.0

In the exercises below, you are asked to perform the calculations above by hand and test the null hypothesis that the type of filter has no effect on noise. This is an absurdly small data set, so hand calculations are feasible. For any problem of significant size, the data is more likely to be given in a form similar to the data frame below.

> filternoise2
     SIZE     TYPE NOISE
1   small standard 825.8
2 midsize standard 845.8
3   large standard 775.0
4   small    Octel 822.5
5 midsize    Octel 821.7
6   large    Octel 770.0

We will use R's "anova" and "lm" functions to construct the analysis of variance table.

> anova(lm(NOISE~SIZE+TYPE,data=filternoise2))
Analysis of Variance Table

Response: NOISE
          Df Sum Sq Mean Sq F value  Pr(>F)
SIZE       2 4341.0 2170.48 32.5434 0.02981 *
TYPE       1  175.0  174.96  2.6233 0.24674
Residuals  2  133.4   66.69
---

Based on the p-values for the F statistics, we can conclude that vehicle size has an effect on noise, but we cannot conclude that the type of filter has an effect. This is probably not what the manufacturers were hoping for.

9.4.1 Interactions Between the Factors

With only one observation for each combination of factor levels, the additive model must be taken as given. If there are multiple observations for each combination, then it becomes possible to test the hypothesis that all the interactions μ_ij - \bar{μ}_{i·} - \bar{μ}_{·j} + μ are equal to zero, in other words, to test the hypothesis that the additive model is true. This is because with multiple observations there is an independent estimate of the error variance σ². Rather than go through the mathematics, we will illustrate with an example in R.

Example 9.7. The IRS is concerned about the time it takes taxpayers to fill out tax forms. This example is purely fictional. Managers arranged an experiment in which subjects from three income brackets are timed in their completion of four forms. Since the forms are different, the times to complete them are expected to be different. Managers are mainly interested in the effect of income group on completion time. Ten response times were recorded for each combination of income group and form. The data is available in two different formats, "taxforms" in a format more suitable for presentation and "Taxforms" in a format more suitable for R.

We will first do the analysis of variance without interactions, i.e., assuming an additive model.

> anova(lm(time~group+form,data=Taxforms))
Analysis of Variance Table

Response: time
           Df Sum Sq Mean Sq F value  Pr(>F)
group       2   6719  3359.4  4.1039 0.01901 *
form        3   6280  2093.3  2.5572 0.05868 .
Residuals 114  93319   818.6
---

Next, allowing interactions:

> anova(lm(time~group*form,data=Taxforms))
Analysis of Variance Table

Response: time
            Df Sum Sq Mean Sq F value  Pr(>F)
group        2   6719  3359.4  4.1127 0.01899 *
form         3   6280  2093.3  2.5627 0.05857 .
group:form   6   5102   850.3  1.0410 0.40297
Residuals  108  88217   816.8
---

The third line in the table gives the sum of squares, mean square and F statistic for interactions.

The p-value of 40% does not indicate a significant interaction between the factors. The main additive effect for the group factor is significant and the additive effect for form is almost significant. Notice that the analysis without interactions gives almost the same answers for the main effects.

This experimental design is balanced in that each combination of factor levels has the same number of responses. Unbalanced designs can be analyzed but the interpretation of the answers becomes more complicated.

9.4.2 Exercises

1. Do the calculations in Example 9.6 by hand, without using R except as a calculator.

2. Use the paired observations student-t test to determine if there is an effect due to TYPE in Example 9.6. Compare the p-value to the anova p-value.

3. With the apfilternoise data perform a two way analysis of variance with NOISE as the response and SIZE and TYPE as the factors. Ignore the variable SIDE. Do the analysis first without and then with interactions. What are your conclusions?

Chapter 10  Analysis of Categorical Data

10.1 Multinomial Distributions

Let X be a factor variable with levels L_1, ..., L_m. This means that X is discrete with a small number of distinct values, which may be expressed numerically for convenience, but generally signify categories such as "male", "female" or "low income", "middle income", "high income". Let p_i = Pr(X = L_i) be the probability of the ith category. These probabilities satisfy p_i > 0 for each i and \sum_{i=1}^{m} p_i = 1. Because of this last constraint, only m - 1 of the p_i may be specified with some degree of freedom.

Suppose we have n independent observations of X. Let Y_i be the number of observations in which level L_i occurs, i = 1, ..., m. These jointly distributed random variables Y_1, ..., Y_m have nonnegative integer values and \sum_{i=1}^{m} Y_i = n. As we showed in Chapter 4, their joint frequency function is

Pr(Y_1 = y_1, Y_2 = y_2, ..., Y_m = y_m) = \frac{n!}{y_1! \, y_2! \cdots y_m!} \, p_1^{y_1} p_2^{y_2} \cdots p_m^{y_m},

where the y_i are nonnegative integers whose sum is n.

Example 10.1. Suppose that the students in a calculus class of 45 can be regarded as a random sample from the much larger population of all calculus students. In the population of all calculus students, 15% make A's, 25% make B's, 30% make C's, 15% make D's and 15% either fail or drop.

What is the probability that in this particular class there are 4 A's, 9 B's, 10 C's, 12 D's, and 10 F's or W's?

Solution: According to the formula, the probability is

\frac{45!}{4! \, 9! \, 10! \, 12! \, 10!} \, (0.15)^4 (0.25)^9 (0.30)^{10} (0.15)^{12} (0.15)^{10}.

It is highly recommended that you use the R function "dmultinom" for such calculations.

> dmultinom(c(4,9,10,12,10),size=45,prob=c(.15,.25,.30,.15,.15))
[1] 1.85789e-05

10.1.1 Estimators and Hypothesis Tests for the Parameters

For each level or category i, the marginal distribution of the number of occurrences Y_i is binomial, Y_i ~ Binom(n, p_i). For large n, the natural estimator \hat{p}_i = Y_i/n is approximately normal with mean p_i and standard deviation \sqrt{p_i(1 - p_i)/n}. Any of the methods described in Chapter 6 may be used to construct confidence intervals and hypothesis tests for the individual p_i. Right now we are more interested in a measure of the overall accuracy with which all m estimators \hat{p}_1, ..., \hat{p}_m approximate their target values p_1, ..., p_m. One such measure is the weighted average squared relative error

\sum_{i=1}^{m} p_i \left( \frac{\hat{p}_i - p_i}{p_i} \right)^2 .   (10.1)

In terms of the observations Y_i, this is

\frac{1}{n} \sum_{i=1}^{m} \frac{(Y_i - n p_i)^2}{n p_i} .

The factor of 1/n is irrelevant for our purposes, so we ignore it. We also abbreviate the expected value E(Y_i) = n p_i by E_i and rewrite the expression as

Q = \sum_{i=1}^{m} \frac{(Y_i - E_i)^2}{E_i} .   (10.2)

The following theorems and the tests derived from them are primarily due to Karl Pearson [1].

Theorem 10.1. As n → ∞, the distribution of Q approaches the chi-square distribution with m - 1 degrees of freedom, i.e., for large n, Q ~ Chisq(df = m - 1), approximately.

[1] Karl Pearson 1857-1936. One of the founders of modern mathematical statistics. A student of Francis Galton.

To see how this theorem might lead to a hypothesis test, suppose that a null hypothesis specifies the values of p_1, ..., p_m, while respecting the constraint \sum_{i=1}^{m} p_i = 1. If the estimated values \hat{p}_i are far from the hypothesized values p_i as measured by (10.1), then this tends to disconfirm H_0. For a formal test of significance level α of

H_0: p_1, p_2, ..., p_m = given numbers,

against the many sided alternative that at least one of the p_i is not equal to the given value, reject H_0 when

Q > χ²_α(m - 1),

where χ²_α(m - 1) is the 100(1 - α) percentile of Chisq(df = m - 1).

Example 10.2. Returning to Example 10.1, let us test the null hypothesis that the distribution of grades in the particular class is the population distribution of calculus grades. We will assume that the numbers in each grade category are the ones given in Example 10.1. The Y_i are 4, 9, 10, 12, and 10. The expected values, assuming that the population proportions are the true parameters for this particular class, are E_1 = 45(0.15) = 6.75, E_2 = 45(0.25) = 11.25, E_3 = 45(0.30) = 13.5, E_4 = E_5 = 45(0.15) = 6.75. The observed value of Q is

Q_obs = \frac{(4 - 6.75)^2}{6.75} + \frac{(9 - 11.25)^2}{11.25} + \frac{(10 - 13.5)^2}{13.5} + \frac{(12 - 6.75)^2}{6.75} + \frac{(10 - 6.75)^2}{6.75} = 8.13.

Since under the null hypothesis Q ~ Chisq(df = 4), the p-value is

> 1-pchisq(8.13,4)
[1] 0.08693054

So, the grade distribution in this class is significantly different from the combined grade distribution in all classes at significance level α = 0.10, but not at α = 0.05.
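The same test can be carried out in one line with R's "chisq.test" function by supplying the hypothesized probabilities. A minimal sketch using the counts of this example:

> chisq.test(c(4,9,10,12,10), p=c(.15,.25,.30,.15,.15))   # X-squared = 8.13, df = 4, p-value = 0.0869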

10.1.2 Multinomial Probabilities That Are Functions of Other Parameters

In many applications, the null hypothesis H_0 does not completely specify the values of p_1, ..., p_m.

Instead it restricts their values by expressing them as functions p_i(θ_1, ..., θ_k) of more fundamental unknown parameters θ_1, ..., θ_k, where k < m - 1. These must be well-behaved, regular functions, but that is seldom a matter of practical concern. A good example is the problem of testing the Hardy-Weinberg model of genetic equilibrium, which we discussed in Chapter 4. A particular gene has two alleles, designated A and a. Each individual has two copies of this gene, and thus has one of the genotypes AA, Aa, or aa. Let p_AA, p_Aa, and p_aa denote the proportions of these genotypes in the population. In a random sample of size n from the population, let Y_AA, Y_Aa, and Y_aa denote the counts of the three genotypes. These random variables have a joint multinomial distribution.

Let θ denote the proportion of all the copies of the gene in the population that are allele A. It is easy to see that

θ = p_AA + (1/2) p_Aa.   (10.3)

If the population is thoroughly mixed and breeding indiscriminately, then the pairing of gene copies in reproduction is random and the frequencies of the genotypes do not change over time. In this case, p_AA = θ², p_Aa = 2θ(1 - θ), and p_aa = (1 - θ)². This is the null hypothesis if one intends to test whether or not this particular gene is in equilibrium in the population.

H_0: There is a number θ in (0, 1) such that p_AA = θ² and p_Aa = 2θ(1 - θ).

H_1: There is no such number.

If we reject H_0, then we conclude that the population is not in genetic equilibrium.

In the general setting, when p_i = p_i(θ_1, ..., θ_k), we have to find estimates of the underlying parameters θ_1, ..., θ_k. They are supposed to be maximum likelihood estimates \hat{θ}_i which maximize the multinomial log-likelihood function

l(\hat{θ}_1, ..., \hat{θ}_k) = \sum_{i=1}^{m} Y_i \log p_i(\hat{θ}_1, ..., \hat{θ}_k).   (10.4)

Fortunately, the maximum likelihood estimates often turn out to be common sense estimators. That is the case in the most important applications below. In the Hardy-Weinberg example, the maximum likelihood estimator of θ is the sample analog of (10.3),

\hat{θ} = \frac{Y_AA}{n} + \frac{Y_Aa}{2n}.   (10.5)

If \hat{θ}_1, ..., \hat{θ}_k are maximum likelihood estimates, let \hat{E}_i = n p_i(\hat{θ}_1, ..., \hat{θ}_k) be the corresponding estimate of E(Y_i) = n p_i.

Theorem 10.2. As n → ∞, the distribution of

\hat{Q} = \sum_{i=1}^{m} \frac{(Y_i - \hat{E}_i)^2}{\hat{E}_i}   (10.6)

approaches the chi square distribution with m - 1 - k degrees of freedom.

Example 10.3. Suppose we have a random sample of 60 organisms and are able to determine the genotype of each one. Suppose Y_AA = 24, Y_Aa = 12 and Y_aa = 24. Can we conclude that the population is not in equilibrium?

Solution: The maximum likelihood estimate of θ is

\hat{θ} = 24/60 + 12/120 = 0.5,

and the estimated expected counts of the genotypes are

\hat{E}_AA = 60(0.5)² = 15,   \hat{E}_Aa = 60(2)(0.5)(0.5) = 30,   \hat{E}_aa = 60(0.5)² = 15.

So, the observed value of \hat{Q} is

\hat{Q}_obs = \frac{(24 - 15)^2}{15} + \frac{(12 - 30)^2}{30} + \frac{(24 - 15)^2}{15} = 21.6.

\hat{Q} has a chi square distribution with df = 3 - 1 - 1 = 1. The p-value is tiny,

> 1-pchisq(21.6,1)
[1] 3.358518e-06

so we definitely conclude that the population is not in equilibrium.
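The computation above is easy to reproduce in R. A minimal sketch using the genotype counts of this example; the variable names are ours.

> Y <- c(24, 12, 24)                          # counts of AA, Aa, aa
> n <- sum(Y)
> theta <- (Y[1] + Y[2]/2)/n                  # maximum likelihood estimate, 0.5
> Ehat  <- n*c(theta^2, 2*theta*(1-theta), (1-theta)^2)
> Q <- sum((Y - Ehat)^2/Ehat)                 # 21.6
> 1 - pchisq(Q, df=1)                         # df = m - 1 - k = 3 - 1 - 1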

10.1.3 Exercises 1. A six-sided die is thrown 50 times. The numbers of occurrences of each face are shown below. Face 1 2 3 4 5 6 Count 12 5 9 11 6 7 Go to TOC CHAPTER 10. ANALYSIS OF CATEGORICAL DATA 190 Can you conclude that the die is not fair?

2. Look at the variable "payroll" in the data set "Payroll". It has 50 values, all between 100 and 400.

Count the number of values in each of the intervals 100 - 160, 160 - 220, 220 - 280, 280 - 340, 340 - 400. Can you conclude that this data does not come from the uniform distribution on the interval (100,400)? (Hint: If the distrbution is uniform, what is the expected count in each interval?) The counting can be done with R's histogram function.

> paycounts=hist(Payroll$payroll, breaks=seq(100,400,60))$counts > paycounts Do the calculations by hand, except for nding the p-value. Then check your work by using R's "chisq.test" function.

> chisq.test(paycounts) 3. Explain how this same method could be adapted to test the hypothesis that the data comes from any given continuous cumulative distribution Fsuch that F(100) = 0 and F(400) = 1.

4. A sample of 60 animals from a given population was obtained. The genotype counts were Y AA = 20, Y Aa = 30, Y aa = 10. Is the population in equilibrium?

5. In the Hardy-Weinberg example, the log-likelihood function (10.4) is l( b ) = Y AA log b 2 + Y Aa log(2 b (1 b )) + Y aa log(1 b )2 :

Show that this is maximized by (10.5). Make use of properties of the logarithm.

6. You can make a graph depicting genetic equilibrium as follows.

> theta=seq(.01,.99,.01) > pAA=theta^2 > pAa=2*theta*(1-theta) > plot(pAA,pAa,type="l") > paa=(1-theta)^2 > plot(pAA,paa,type="l") Points on the curve represent genetic equilibrium, Points o the curve represent disequilibrium. Pick a few points o the curve and calculate where on the curve they end up after applying equation (10.3).

10.2 Testing Equality of Multinomial Probabilities

Suppose that r independent multinomial experiments are performed and that all the experiments have the same set of categories L_1, ..., L_m. Let Y_ij denote the number of occurrences of L_j in the ith experiment and let p_ij denote the probability of L_j in the ith experiment. Let n_i denote the number of trials in the ith experiment. This information can be arranged in tabular form as

Experiment  L_1   L_2   ...  L_m   n_i
    1       Y_11  Y_12  ...  Y_1m  n_1
    2       Y_21  Y_22  ...  Y_2m  n_2

.

. .

.

. .

.

. .

.

. .

.

. .

.

. r Y r1 Y r2 Y rm n r The ith row of the table gives the data for the ith multinomial experiment. The rows are independent of one another.

We are interested in testing the null hypothesis that the probability of L j is the same for all r experiments. More precisely, H0:

p 11 = p 21 = p 31 = =p r1 = 1; (10.7) p 12 = p 22 = p 32 = =p r2 = 2; .

.

.

p 1m = p 2m = p 3m = =p rm = m ; where 1; 2; ; m are unknown positive numbers whose sum is 1.

For each j= 1 ; : : : ; m , letY j = P r i =1 Y ij . Let N=P r i =1 n i. The log-likelihood function for the r combined experiments is the sum of their individual log-likelihoods.

l( b 1; : : : ; b m ) = r X i =1 m X j =1 Y ij log b j = m X j =1 Y j log b j:

(10.8) Given the constraint P m j =1 b j = 1, this is maximized when b j = Y j N :

This is a perfectly sensible estimator. It is just the combined proportion of occurrences of category L j in all rexperiments. The corresponding estimated expected value of Y ij is b E ij = n ib j = n iY j N :

Now let

\hat{Q} = \sum_{i=1}^{r} \sum_{j=1}^{m} \frac{(Y_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}} .   (10.9)

By a slight extension of Theorem 10.2, as all n_i → ∞, the distribution of \hat{Q} approaches chi square.

The number of degrees of freedom is equal to the number of free parameters in the full model, without assuming the null hypothesis, minus the number of free parameters under the null hypothesis. The number of free parameters in the unrestricted model is r (m 1), while under the null hypothesis it is m 1 Therefore, if all n i are large, Go to TOC CHAPTER 10. ANALYSIS OF CATEGORICAL DATA 192 b Q C hisq (df = ( r 1)( m 1)) :

To test the null hypothesis (10.7) at signi cance level , reject H 0if b Q > 2 (( r 1)( m 1)) :

Example 10.4. A certain course had 3 sections last semester. The observed counts of their grades are shown below. Can we conclude that the probabilities of the grade categories are actually different, perhaps because the 3 instructors have different standards, or could the apparent differences be due merely to chance?

> grades38 A B C DFW class 1 7 17 17 22 class 2 17 14 11 15 class 3 13 14 11 13 > addmargins(grades38) A B C DFW Sum class 1 7 17 17 22 63 class 2 17 14 11 15 57 class 3 13 14 11 13 51 Sum 37 45 39 50 171 The second table is the same as the rst, except that the rows and columns have been summed. This gives us the values n i and Y j in the discussion above. The number Nis the grand total in the lower right corner, N= 171. For given iand j, the estimated expected count b E ij = n iY j N is the sum of row itimes the sum of column jdivided by the grand total N, e.g., b E 11 = 63 37 171 = 13 :63 ; b E 12 = 63 45 171 = 16 :58 ; .

.

.

b E 34 = 51 50 171 = 14 :91 :

\hat{Q} = \frac{(7 - 13.63)^2}{13.63} + \frac{(17 - 16.58)^2}{16.58} + \cdots + \frac{(13 - 14.91)^2}{14.91} = 7.38

The p-value is

> 1-pchisq(7.38,6)
[1] 0.2871293

Thus, there is little evidence that the grade probabilities differ among the 3 instructors.
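The estimated expected counts and \hat{Q} can also be computed directly from the table of counts. A minimal sketch, assuming grades38 is stored as a matrix or table of counts as shown above:

> Ehat <- outer(rowSums(grades38), colSums(grades38))/sum(grades38)   # row total times column total over grand total
> Qhat <- sum((grades38 - Ehat)^2/Ehat)
> 1 - pchisq(Qhat, df=(nrow(grades38)-1)*(ncol(grades38)-1))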

In R, the procedure is simple.

> chisq.test(grades38)

        Pearson's Chi-squared test

data:  grades38
X-squared = 7.3753, df = 6, p-value = 0.2875

10.3 Independence of Attributes: Contingency Tables

Let N randomly sampled members of a large population be cross-classified according to two factors or attributes A and B. For example, A might be income level, low, medium or high, and B might be political party affiliation, Whig, Free Soil, Know Nothing or Other. The question of interest is whether these factor variables are independent or not. In general, let us designate the levels of A as i = 1, 2, ..., r and the levels of B as j = 1, 2, ..., m. No particular ordering of levels is implied.

Let Y ij be the number of individuals in the sample that have attribute Aat level iand attribute B at level j. Let Y j = r X i =1 Y ij ; Y i = m X j =1 Y ij :

Then r X i =1 m X j =1 Y ij = r X i =1 Y i = m X j =1 Y j :

The joint distribution of the r m variables Y ij is multinomial, with Ntrials and with outcome probabilities pij = P r (A = i; B =j):

Note that there are rm 1 free parameters. Let p i = P r (A = i) and p j = P r (B =j). Aand Bare independent if and only if H0:

p_ij = p_{i·} p_{·j} for all i and j

is true. In the restricted model, as specified by H_0, there are r - 1 free choices for the p_{i·} parameters and m - 1 free choices for the p_{·j} parameters, for a total of r + m - 2. The difference is

(rm - 1) - (r + m - 2) = (r - 1)(m - 1).

The maximum likelihood estimators of the parameters p_{i·} and p_{·j} are just the natural sample frequency estimators

\hat{p}_{i·} = Y_{i·}/N,   \hat{p}_{·j} = Y_{·j}/N,

and so the estimated expected counts at the combinations of factor levels are b E ij = Nb p i b p j = Y i Y j N :

Notice that b E ij is row total column total divided by grand total , the same formula that was used in testing equality of several multinomial distributions. In fact, the procedure for testing for inde- pendence is exactly the same as for testing equality of multinomial parameters. For large Nthe distribution of b Q (10.9) is chi square with ( r 1)( m 1) degrees of freedom. If the p-value for the observed value of b Q is too small we conclude that factors Aand Bare dependent.

A tabular layout such as

         A  B  C DFW
class 1  7 17 17  22
class 2 17 14 11  15
class 3 13 14 11  13

is called a contingency table, and whether testing for equality of multinomial parameters or testing for independence of attributes, the chi-square test is called a contingency table analysis. The difference is merely one of emphasis.

Obviously, more than two factors could be in play. If so, the contingency table would have three or more dimensions. The extension of the procedure is straightforward.

Example 10.5. The data set "Titanic" included with R has a cross classi cation of 2201 passengers on the Titanic, classi ed according to sex, class of accommodations, adult or child, and survived or did not survive. Read about the data:

> help(Titanic) We will test whether survival rates for males and females are equal or di erent. We could do this by testing for equality of proportions the way we did in Chapter 9, but instead we will create a 2 2 contingency table and apply the procedure above.

> margin.table(Titanic,c(2,4))
        Survived
Sex        No  Yes
  Male   1364  367
  Female  126  344

> chisq.test(.Last.value,correct=F)

        Pearson's Chi-squared test

data:  .Last.value
X-squared = 456.8742, df = 1, p-value < 2.2e-16

Obviously, sex and survival status are not independent. Factor 2 is "Sex" and factor 4 is "Survived".

The argument "c(2,4)" to the function "margin.table" tells R to sum the entries in the 4 dimensional array along all the variables but these two. The argument "correct=F" to "chisq.test" prevents R from applying the Yates continuity correction so that the answer will agree with the answer derived from the procedure described above.

Sometimes the data is not already cross tabulated. Instead it may be presented in the form of a text file, a spreadsheet, or an R data frame that lists individual cases. If so, the "table" function in R will convert it to suitable input for "chisq.test". Data from the Montana outlook poll conducted by the University of Montana is included in the course data folder at http://www.math.uh.edu/~charles/data/Montana.txt. For 209 randomly selected residents it lists age group, sex, income group, political affiliation, region of residence, personal financial outlook, and opinion about state outlook. All of these are categorical variables, although SEX is coded as 0 for male and 1 for female. Hence it will be imported in R as a numeric variable. If the imported data frame is named "Montana", this can be fixed by

> Montana$SEX=factor(Montana$SEX,labels=c("m","f"))

After making this change, the summary of all the variables looks like this:

> summary(Montana) AGE SEX INC POL AREA FIN <35 :72 f:102 <20K :47 Dem :84 NE:58 better:71 >=55 :70 m:107 >35K :60 Ind :40 SE:78 same :76 35-54:66 20-35K:83 Rep :78 W :73 worse :61 NA 's : 1 NA 's :19 NA 's: 7 NA 's : 1 STAT better :118 no better: 63 NA 's : 28 We will tabulate political a liation and area of residence and test the null hypothesis that these two attributes are independent, > attach(Montana) > table(POL,AREA) Go to TOC CHAPTER 10. ANALYSIS OF CATEGORICAL DATA 196 AREA POL NE SE W Dem 15 30 39 Ind 12 16 12 Rep 30 31 17 > chisq.test(.Last.value) Pearson 's Chi-squared test data: .Last.value X-squared = 13.849, df = 4, p-value = 0.007793 From the p-value we see that they are not independent.The west is blue, northeast red, and southeast purple.

10.3.1 Exercises 1. Carry out the calculations of the Titanic example by hand.

2. Use "margin.table" to get the Titanic marginal table for factors 1 (class of accomodations) and 4 (survival). Apply the chi square test to see if class and survival are dependent. Do it both by hand and with "chisq.test" in R.

3. With the Montana survey data, tabulate other pairs of variables and test for independence. Go to TOC Chapter 11 Miscellaneous Topics 11.1 Multiple Linear Regression In Chapter 8 we studied simple linear regression, in which the expected value of the response random variable Ydepends linearly on a single predictor variable X:

E (Y jX =x) = 0 + 1x:

In multiple linear regression we allow multiple predictor variables X 1; ; X kand the expected re- sponse for given values of X 1; ; X kis E (Y jX 1= x 1; X 2= x 2; ; X k= x k) = 0 + 1x 1 + 2x 2 + + kx k; (11.1) where 0; 1; ; kare unknown real numbers, the k+ 1 regression coe cients .

As in simple linear regression, we assume that the variance of the response Yis constant, independent of the values x 1; ; x k:

var(Y jX 1= x 1; ; X k= x k) = 2 ; (11.2) where 2 is an unknown positive constant.

The data in a multiple regression experiment arises from nindependent observations of the response Y corresponding to npossibly di erent values of each of the predictor variables X 1; ; X k. In tabular form it would look something like this. Y X 1 X 2 X k Y 1 x 11 x 12 x 1k .

.

. .

.

. .

.

. .

.

. .

.

. Y n x n1 x n2 x nk Here x ij is the ith value of the design variable X j.

197 Go to TOC CHAPTER 11. MISCELLANEOUS TOPICS 198 Given estimates b 0; b 1; ;b k of the regression coe cients, de ne the ith prediction error or residual as ei = Y i b 0 k X j =1 b jx ij :

As in simple linear regression, we estimate the regression coe cients by the method of least squares.

That is, we choose b 0; b 1; ;b k to minimize the residual sum of squares S S(resid ) =n X i =1 e 2 i :

The least squares estimates satisfy a linear system of p = k + 1 equations. The system can have a unique solution only if n ≥ p, which we shall henceforth assume. Deriving the system of equations and their solution requires a background in linear algebra, so we will omit it. Complete derivations can be found in many textbooks [1].
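For readers who do know a little matrix algebra: with Y the response vector and X the n × p design matrix whose first column is all 1's, the least squares estimates are usually written as the solution of the normal equations (X'X)b = X'Y. The sketch below, on small simulated data, is only an illustration of that standard formula, not the text's own derivation; it also checks the result against R's "lm".

> set.seed(1)
> n <- 20; x1 <- rnorm(n); x2 <- rnorm(n)       # hypothetical predictors
> y  <- 2 + 3*x1 - x2 + rnorm(n)                # hypothetical response
> X  <- cbind(1, x1, x2)                        # design matrix with intercept column
> bhat <- solve(t(X) %*% X, t(X) %*% y)         # solves the normal equations
> cbind(bhat, coef(lm(y ~ x1 + x2)))            # the two sets of estimates agree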

11.1.1 Inferences Based on Normality

For the rest of this chapter, we will assume that the conditional distribution of Y, given the values of X_1, ..., X_k, is normal, with mean given by (11.1) and constant variance (11.2).

De nition 11.1. The predicted or tted value of Y i is b Y i = b 0 + k X j =1 x ij b j:

The total sum of squares is S S(tot ) = n X i =1 ( Y i Y )2 :

The regression sum of squares is S S(regr ) = n X i =1 ( b Y i Y i) 2 :

The residual sum of squares is S S(resid ) =n X i =1 ( Y i b Y i) 2 :

The proof of the following theorem requires linear algebra for a complete understanding. See the book by Montgomery, Peck and Vining previously cited.

Theorem 11.1. 1.S S (tot ) = S S(regr ) +S S(resid ).

Under the assumption of normality, 1 E.g., Introduction to Linear Regression Analysis , 5th Ed. by Montgomery, Peck and Vining, Wiley 2012 Go to TOC CHAPTER 11. MISCELLANEOUS TOPICS 199 2. Each of the estimated regression coe cients b j has a normal distribution with mean j. The standard deviation of b j depends only on = p 2 and a known function of all of the values fx il g .

3. S S (regr ) and S S(resid ) are independent random variables.

S S(resid ) 2 C hisq (df =n p):

If 1 = 2 = = k = 0, then S S(regr ) 2 C hisq (df =k):

4. If M S(regr ) =S S (regr ) k and M S(resid ) =S S (resid ) n p and 1 = = k = 0, then F = M S (regr ) M S (resid ) (11.3) has the F distribution with kdegrees of freedom in the numerator and n pdegrees of freedom in the denominator: F F Dist (k; n p).

5. If in the expression for the standard deviation of b j we replace the unknown with S= p M S (resid ), the resulting value is the standard error of b j, se (b j), and b j j se (b j) t( df =n p); (11.4) the student-t distribution with n pdegrees of freedom.

11.1.2 Using R's "lm" Function for Multiple Regression

The computations involved in multiple regression problems are virtually impossible without computer help. The principal tool in R for multiple regression is the function "lm". We will illustrate its use by fitting a linear model to the data "nlschools" [2]. The response variable Y is the score on a language test administered to 200 school children in the Netherlands. The predictor variables are verbal IQ (VerbIQ), class size, and a numeric measure of socioeconomic status (SocEconStatus). Below is R's output.

[2] Snijders, T. A. B. and Bosker, R. J. (1999). Multilevel Analysis. An Introduction to Basic and Advanced Multilevel Modelling. London: Sage.

> summary(nlschools.lm) Call:

lm(formula = Language ~ VerbIQ + ClassSize + SocEconStatus, data = nlschools)

Residuals:
     Min       1Q   Median       3Q      Max
-21.8839  -4.7511   0.1737   5.6190  16.3403

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     7.16999    3.83467   1.870    0.063 .

VerbIQ          2.60929    0.27116   9.623   <2e-16 ***
ClassSize      -0.01034    0.09123  -0.113    0.910
SocEconStatus   0.01973    0.05978   0.330    0.742
---
Residual standard error: 8.078 on 196 degrees of freedom
Multiple R-squared:  0.3574,   Adjusted R-squared:  0.3475
F-statistic: 36.33 on 3 and 196 DF,  p-value: < 2.2e-16

In the "Coefficients" section of the output the estimated regression coefficients \hat{β}_j are listed, labelled with the names of the predictor variables they are associated with. The next column gives their standard errors se(\hat{β}_j). Next come the values of the student-t test statistics (11.4) for testing the individual null hypotheses H_0: β_j = 0. Finally, the p-values of the test statistics are given. In this example, only the estimated coefficient of VerbIQ is significantly different from 0, and it is highly significant. There is no reason to conclude that ClassSize and SocEconStatus have any predictive power for the language test score when VerbIQ is included in the model.

"F-statistic" is the value of F in (11.3) and is the test statistic for the null hypothesis that all the regression coefficients are equal to 0. Clearly, we can reject that hypothesis in this example. "Multiple R-squared" and "Adjusted R-squared" are defined by the equations

1 - R^2 = \frac{SS(resid)}{SS(tot)},   1 - R^2_{adj} = \frac{MS(resid)}{MS(tot)},   where   MS(tot) = \frac{SS(tot)}{n - 1}.

R 2 is interpreted in the same way as in simple linear regression. It is the fraction of the total squared variation in the response Yaccounted for by the linear relationship and the variation in the predictor variables. If additional predictor variables are added to the model equation, the value of R2 always increases, indicating greater success at predicting the observed values Y i. However, at some point, this greater success may be so small that the cost of increasing the complexity of the model is not justi ed.
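These quantities are easy to recompute from a fitted model object, which can be a useful check on one's understanding. A minimal sketch, assuming the object nlschools.lm and the data frame nlschools from the example above are in the workspace (n = 200 observations and p = 4 estimated coefficients):

> SSresid <- sum(resid(nlschools.lm)^2)
> SStot   <- sum((nlschools$Language - mean(nlschools$Language))^2)
> 1 - SSresid/SStot                              # Multiple R-squared
> 1 - (SSresid/(200 - 4))/(SStot/(200 - 1))      # Adjusted R-squared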

Adjusted R2 does not always increase when additional terms are included. Typically, it begins to decrease when more terms are included than are needed. For this reason, it is a good criterion for deciding when to stop adding terms. To illustrate, we will re t the model above with only VerbIQ as a predictor.

> summary(nlschools.lm2) Call:

lm(formula = Language ~ VerbIQ, data = nlschools) Go to TOC CHAPTER 11. MISCELLANEOUS TOPICS 201 Residuals: Min 1Q Median 3Q Max -21.8126 -4.6522 0.3191 5.6919 16.2251 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.038 2.902 2.425 0.0162 * VerbIQ 2.642 0.252 10.484 <2e-16 *** --- Residual standard error: 8.04 on 198 degrees of freedom Multiple R-squared: 0.357, Adjusted R-squared: 0.3537 F-statistic: 109.9 on 1 and 198 DF, p-value: < 2.2e-16 Notice that the smaller model has a smaller value of R2 but a larger value of R2 adj .

The "con nt" function works for multiple regression just as it does for simple linear regression to give con dence intervals for the regression coe cients.

> confint(nlschools.lm,level=.90) 5 % 95 % (Intercept) 0.8325551 13.5074157 VerbIQ 2.1611647 3.0574217 ClassSize -0.1611106 0.1404393 SocEconStatus -0.0790690 0.1185373 The "predict" function works for multiple regression as well.

> predict(nlschools.lm,newdata=data.frame(VerbIQ=10,ClassSize=30, SocEconStatus=12),interval="c") fit lwr upr 1 33.18966 31.19966 35.17966 A visual assessment of how well the model ts the data can be obtained by plotting the tted values b Y i on the horizontal axis and the observed values Y i on the vertical axis.

> nlschools=read.csv("nlschools.csv")
> nlschools.lm=lm(Language~VerbIQ+ClassSize+SocEconStatus,data=nlschools)
> plot(fitted(nlschools.lm),nlschools$Language)
> abline(0,1)

11.1.3 Factor Variables as Predictors

Suppose X is a factor variable with only two levels L_0 and L_1. Code the values of X numerically as 0 for L_0 and 1 for L_1. Let Y be a numeric variable whose distribution depends on the value of X.

Consider the simple linear regression equation E(Y jX =x) = 0 + 1x:

Under the usual assumptions for linear regression, the values of Y are grouped as independent samples from two normal populations having the same variance, one corresponding to X = 0 with mean E(Y | X = 0) = β_0, and the other corresponding to X = 1 with mean E(Y | X = 1) = β_0 + β_1.

[Figure: observed Language scores plotted against the fitted values from nlschools.lm, with the line of intercept 0 and slope 1 superimposed.]

Therefore, the parameter β_1 is the difference between the population means. If we want to test hypotheses or find confidence intervals for the difference in means, we can use the methods of simple linear regression to make inferences about β_1. We already have a method for two-sample problems from Chapter 9, the two-sample t test with equal variances. In fact, the two sample t test with equal variances and the regression approach are mathematically equivalent. To illustrate we will revisit the "lungcap" data set to see how the distributions of the variable "fev" (forced expiratory volume) depend on whether the subject is a smoker or not. Using the two sample t-test, the results are:

> t.test(fev~smoke,var.equal=T,data=lungcap) Two Sample t-test data: fev by smoke t = 1.7101, df = 83, p-value = 0.09098 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.01995121 0.26468836 sample estimates: mean in group no mean in group yes 3.746740 3.624371 Using the regression approach, > fev.lm=lm(fev~smoke,data=lungcap) > summary(fev.lm) Call:

lm(formula = fev ~ smoke, data = lungcap) Residuals: Min 1Q Median 3Q Max -0.87474 -0.20437 0.01363 0.19526 0.76826 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.74674 0.04592 81.60 <2e-16 *** smokeyes -0.12237 0.07155 -1.71 0.091 .

--- Residual standard error: 0.3247 on 83 degrees of freedom Multiple R-squared: 0.03404, Adjusted R-squared: 0.0224 F-statistic: 2.925 on 1 and 83 DF, p-value: 0.09098 > confint(fev.lm) Go to TOC CHAPTER 11. MISCELLANEOUS TOPICS 204 2.5 % 97.5 % (Intercept) 3.6554150 3.83806503 smokeyes -0.2646884 0.01995121 As shown here, it is not necessary to actually carry out the numeric 0-1 coding of the factor levels.

R automatically does that internally. It chooses one level as the base level and the other is compared to it. The results from "lm" are the same as those from "t.test" except for the sign of the di erence in means. For unordered factors Xwith more than two levels, we have already observed in Chapter 10 that "lm" gives results equivalent to analysis of variance. For ordered factors, the interpretation of the estimated coe cients returned by "lm" is quite di erent and we shall not discuss it here.

A more interesting kind of problem is one in which the expected response is modeled as a linear function of several predictors, some numeric and others factor variables.

Example 11.1. In the Netherlands school data, let us treat the response "Language" as a linear func- tion of the numeric predictor "SocEconStatus" and also of the two-level factor variable "CombGrades", which is an indicator of whether the student was taught in a classroom with combined grades or not.

"CombGrades" is already coded as 0 for no and 1 for yes.

> nlschools.lm1=lm(Language~SocEconStatus+CombGrades,data=nlschools) > summary(nlschools.lm1) Call:

lm(formula = Language ~ SocEconStatus + CombGrades, data = nlschools) Residuals: Min 1Q Median 3Q Max -26.4304 -6.3586 0.8444 7.2351 22.1529 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 32.17093 1.84627 17.425 < 2e-16 *** SocEconStatus 0.26762 0.06668 4.014 8.49e-05 *** CombGrades -4.76910 1.36909 -3.483 0.00061 *** --- NA Residual standard error: 9.49 on 197 degrees of freedom Multiple R-squared: 0.1085, Adjusted R-squared: 0.09941 F-statistic: 11.98 on 2 and 197 DF, p-value: 1.227e-05 The mean Language score for students not in combined classrooms exceeds the means score for students in combined classrooms by 4.76910 for all values of SocEconStatus. The tted lines for the two groups of students are parallel, separated by a vertical distance of 4.76910.

> plot(Language~SocEconStatus,col=CombGrades+1,data=nlschools) > abline(32.17093,0.26762,col=1) > abline(32.17093-4.76910,0.26762,col=2) > legend(x=35,y=15,legend=c("Not combined","Combined"),fill=c(1,2)) Go to TOC CHAPTER 11. MISCELLANEOUS TOPICS 205The tted lines for the two groups in the preceding example are parallel because our model formula speci ed that there be no interaction between SocEconStatus and CombGrades. In other words, the expected e ect of being in a combined classroom is the same for all values of socioeconomic status. It is certainly conceivable that being in a combined classroom could a ect the rate at which increased status leads to increased language comprehension. In that case, there would be an interaction between SocEconStatus and CombGrades. Let us temporarily rename these variables X 1 and X 2 and the response Y. Consider the model regression equation E(Y jX 1= x 1; X 2= x 2) = 0 + 1x 1 + 2x 2 + x 1x 2; where is another unknown constant. The term x 1x 2 is called an interaction term.

When X_2 = 0 we have

E(Y | X_1 = x_1, X_2 = 0) = β_0 + β_1 x_1,

whereas when X_2 = 1,

E(Y | X_1 = x_1, X_2 = 1) = (β_0 + β_2) + (β_1 + γ) x_1.

(Figure: scatterplot of Language vs. SocEconStatus with the two parallel fitted lines, legend "Not combined" and "Combined".)

Therefore, the parameter γ is the difference in expected rate of change of Y with respect to x_1 when X_2 = 1, i.e., when students are in a combined classroom. The output from R with the interaction model is

> nlschools.lm=lm(Language~SocEconStatus*CombGrades,data=nlschools)
> summary(nlschools.lm)

Call:

lm(formula = Language ~ SocEconStatus * CombGrades, data = nlschools)

Residuals:
    Min      1Q  Median      3Q     Max
-26.709  -6.876   1.274   6.504  20.603

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)               34.82266    2.36785  14.706  < 2e-16 ***
SocEconStatus              0.15743    0.09087   1.732  0.08476 .
CombGrades               -10.91000    3.72008  -2.933  0.00376 **
SocEconStatus:CombGrades   0.23578    0.13292   1.774  0.07764 .
---

Residual standard error: 9.439 on 196 degrees of freedom
Multiple R-squared: 0.1225,  Adjusted R-squared: 0.1091
F-statistic: 9.124 on 3 and 196 DF,  p-value: 1.109e-05

The coefficients whose estimates are given in this summary are, in order from the top, β_0, β_1, β_2, and γ. With a p-value of 7.8%, the estimated value of γ is marginally significantly different from 0.
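The intercepts and slopes of the two fitted lines can be recovered directly from the coefficient vector. A minimal sketch, assuming the fitted object nlschools.lm from above is still in the workspace:

> b=coef(nlschools.lm)
> c(b[1],b[2])            # intercept and slope when CombGrades = 0
> c(b[1]+b[3],b[2]+b[4])  # intercept and slope when CombGrades = 1

These are the same intercepts and slopes used in the abline() calls below.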

> plot(Language~SocEconStatus,col=CombGrades+1,data=nlschools)
> abline(34.82266,0.15743,col=1)
> abline(34.82266-10.91000,0.15743+0.23578,col=2)
> legend(x=35,y=15,legend=c("Not combined","Combined"),fill=c(1,2))

11.1.4 Exercises

1. The airquality data has missing values in some of the variables. Eliminate those records from the data set with the command

> airquality2=airquality[complete.cases(airquality), ]

With the airquality2 data, fit a multiple linear regression model with Ozone as the response and Solar.R and Wind as predictor variables. Which variables contribute significantly to ozone levels?

2. Find a 95% confidence interval for the expected value of Ozone when Solar.R=300 and Wind=15.

Find 95% confidence intervals for the regression coefficients.
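One possible way to set these computations up, should a hint be wanted; the object name aq.lm is arbitrary and introduced only for this sketch:

> aq.lm=lm(Ozone~Solar.R+Wind,data=airquality2)
> summary(aq.lm)
> predict(aq.lm,newdata=data.frame(Solar.R=300,Wind=15),interval="confidence")
> confint(aq.lm)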

3. Make a scatterplot of observed values of Ozone vs. fitted values, with fitted values on the horizontal axis and observed values on the vertical axis. Superimpose the line with intercept 0 and slope 1.

4. With the data "nlschools" fit a multiple linear regression model with Language as the response and ClassSize, SocEconStatus and CombGrades as predictor variables. Allow interaction between SocEconStatus and CombGrades by using the model formula

Language ~ ClassSize + SocEconStatus*CombGrades

The asterisk * separating two terms in a model formula means to include additive effects as well as interactions between the variables.

Interpret the results. How does the expected value of Language depend on ClassSize and SocEconStatus when CombGrades = 0? When CombGrades = 1?

11.2 Nonparametric Methods

The families of distributions studied in this course up to now are parametric families. That is, individual members of the family are singled out by giving the values of a few parameters. For example, the family of normal distributions is parametric because the values of the mean and standard deviation completely determine which normal distribution is meant. Other parametric families are the binomial distributions, the Poisson distributions, the gamma distributions, the Weibull distributions, and so on. Except for large sample procedures for a population mean, the inference procedures we have studied so far are primarily inferences (hypothesis tests or confidence intervals) for parametric families. In this section we introduce some methods that make almost no assumptions about the underlying distributions, except that they are continuous.

11.2.1 The Signed Rank Test

The Wilcoxon 3 signed rank test and rank sum test utilize the ranks of sample values rather than the data values themselves. The signed rank test is designed to test the hypothesis that a continuous distribution is symmetric about a certain value and also to find a confidence interval for the center of symmetry.

Definition 11.2. The distribution of a random variable X is symmetric about 0 if X and −X have the same distribution. If the cumulative distribution F of X is a continuous function, this is equivalent to

F(−x) = 1 − F(x)                                    (11.5)

for all real numbers x. The distribution of X is symmetric about a number θ if X − θ is symmetric about 0, that is, if

F(θ − x) = 1 − F(θ + x)                             (11.6)

for all x.
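As a quick numerical illustration of (11.6), the normal distribution with mean θ satisfies it. The particular values of theta and x below are arbitrary, chosen only for this sketch:

> theta=2; x=1.3
> all.equal(pnorm(theta-x,mean=theta),1-pnorm(theta+x,mean=theta))   # should be TRUE
> all.equal(dnorm(theta-x,mean=theta),dnorm(theta+x,mean=theta))     # should be TRUE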

If a distribution is symmetric about θ, then θ is a median. If the distribution has a mean, then θ is its mean. If it has a continuous density function f, then f(θ − x) = f(θ + x) for all real x.

3 Frank Wilcoxon, 1892-1965

Let x_1, x_2, ..., x_n be distinct numbers. The rank of x_i is 1 if it is the smallest of these n numbers.

rank(x_i) = 2 if x_i is the next smallest, rank(x_i) = 3 if x_i is the third smallest. Finally, rank(x_i) = n if x_i is the largest of the numbers. Below is a list of 10 numbers and beneath it is the list of their comparative ranks.

> round(rexp(10),digits=3)
 [1] 0.083 0.916 0.369 0.009 1.422 1.065 0.439 0.571 0.004 3.904
> rank(.Last.value)
 [1]  3  7  4  2  9  8  5  6  1 10

Let X_1, X_2, ..., X_n be a random sample from a continuous distribution F. Find the comparative ranks of their absolute values and let rank(|X_i|) denote the rank of |X_i|. Let sgn(X_i) = 1 if X_i > 0 and sgn(X_i) = −1 if X_i < 0. The Wilcoxon signed rank statistic is

W_1 = Σ_{i=1}^n sgn(X_i) rank(|X_i|).               (11.7)

Theoretically, since the distribution F is continuous, no two of the absolute values will be equal and none of the observations will be exactly 0. Therefore, the signs and ranks are unambiguously determined. Of course roundoff error may intrude and cause some ties in the ranks. R has a procedure for handling ties, which we need not be concerned about now.

If the distribution of X is symmetric about 0, then there is no association between ranks and signs.

Each possible rank is as likely to go with a positive sample value as with a negative sample value. In that case, the distribution of W_1 is the same as the distribution of

V = Σ_{i=1}^n i S_i,                                (11.8)

where S_1, ..., S_n are independent and each S_i is either +1 or −1, each with probability 1/2. On the other hand, if the distribution of X is symmetric about a number θ > 0, then the higher ranks will tend to go with positive sample values and W_1 will tend to be greater than 0. If θ < 0, then W_1 tends to be less than 0. Therefore, W_1 is a reasonable test statistic for the null hypothesis H_0: θ = 0 against either a one-sided or two-sided alternative. To make use of W_1, we must be able to compute p-values Pr(V > v) or Pr(V < v), where v is the observed value of W_1. Many textbooks 4 have tables of the distribution of V. We will use R's built-in function "wilcox.test" for calculations.
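The null distribution of V in (11.8) is also easy to simulate, which gives a quick way to approximate such p-values. A minimal sketch; the sample size n and observed value v below are made up for illustration:

> n=10; v=25
> V=replicate(10000,sum((1:n)*sample(c(-1,1),n,replace=T)))   # simulated values of V
> mean(V>=v)                                                  # approximate Pr(V >= v) under the null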

To test for symmetry of the distribution of X about some point θ_0 ≠ 0, simply replace rank(|X_i|) with rank(|X_i − θ_0|) and sgn(X_i) with sgn(X_i − θ_0) in (11.7).

The sum of the ranks for positive observations is

W_1^+ = Σ_{X_i > 0} rank(|X_i|)                     (11.9)

and the sum of ranks for negative observations is

W_1^− = Σ_{X_i < 0} rank(|X_i|).

4 E.g., Devore, J.L., Probability and Statistics for Engineering and the Sciences, 8th Ed., Brooks-Cole 2012

From the definitions, we have

W_1^+ − W_1^− = W_1

and

W_1^+ + W_1^− = Σ_{i=1}^n i = n(n + 1)/2.

These two equations imply that

W_1^+ = (1/2) W_1 + n(n + 1)/4.                     (11.10)

Therefore, W_1^+ is just as good as W_1 as a test statistic. In fact, it is the statistic used in R.

The null distribution of W_1^+ is the same as the distribution of

V^+ = Σ_{i: S_i = 1} i.                             (11.11)

Corresponding to (11.10), we have

V^+ = (1/2) V + n(n + 1)/4.                         (11.12)

In the exercises below you are asked to calculate the exact distribution of V^+ for a very simple case.

The Mean and Variance of V and V^+

In the definition of V, the expected value of S_i is 0 and its variance is 1. Thus,

E(V) = Σ_{i=1}^n i E(S_i) = 0                       (11.13)

and

var(V) = Σ_{i=1}^n i^2 = n(n + 1)(2n + 1)/6.        (11.14)

From these and (11.12), we have

E(V^+) = n(n + 1)/4                                 (11.15)

and

var(V^+) = (1/4) var(V) = n(n + 1)(2n + 1)/24.      (11.16)

Proofs of the following theorem as well as other theorems in this section can be found in the book by Randles and Wolfe. 5

5 R.H. Randles and D.A. Wolfe, Introduction to the Theory of Nonparametric Statistics, Wiley 1979.

Theorem 11.2. As n → ∞, the distribution of

Z = (V^+ − E(V^+)) / sd(V^+)

approaches standard normal. The same is true with V^+ replaced by V.

For small values of n the exact distribution of V^+ or V is tabulated or can be calculated without much trouble. For larger values of n, Theorem 11.2 allows us to use a normal approximation for finding p-values.
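Both routes are easy to carry out in R. The sketch below first enumerates the exact null distribution of V^+ for a small n by listing all 2^n patterns of positive signs, then computes a normal-approximation p-value from (11.15) and (11.16); the values of n and the observed statistic v are made up for illustration.

> # exact null distribution of V+ for n = 4
> n=4
> signs=as.matrix(expand.grid(rep(list(0:1),n)))   # each row is one pattern of positive signs
> Vplus=as.vector(signs%*%(1:n))                   # value of V+ for each pattern
> table(Vplus)/2^n                                 # exact null probabilities
> # normal approximation to Pr(V+ >= v) for a larger n
> n=25; v=230
> z=(v-n*(n+1)/4)/sqrt(n*(n+1)*(2*n+1)/24)
> pnorm(z,lower.tail=F)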

Example 11.2. The signed rank test is most often used with paired data to test the hypothesis that the difference in paired observations has median 0. We will use it to analyze Gosset's split plot data, for which we used a student-t test in Chapter 9. The yields for the two drying methods on each of the split plots and their differences are repeated below. The R output shows two methods of using the function "wilcox.test" for paired observations. The t test is repeated for comparison. Note the close agreement in p-values.

PLOT    1    2    3    4    5    6    7    8    9   10   11
REG  1903 1935 1910 2496 2108 1961 2060 1444 1612 1316 1511
KILN 2009 1915 2011 2463 2180 1925 2122 1482 1542 1443 1535
DIFF -106   20 -101   33  -72   36  -62  -38   70 -127  -24

> wilcox.test(DIFF)

        Wilcoxon signed rank test

data:  DIFF
V = 15, p-value = 0.123
alternative hypothesis: true location is not equal to 0

> wilcox.test(REG,KILN,paired=T)

        Wilcoxon signed rank test

data:  REG and KILN
V = 15, p-value = 0.123
alternative hypothesis: true location shift is not equal to 0

> t.test(DIFF)

        One Sample t-test

data:  DIFF
t = -1.6905, df = 10, p-value = 0.1218
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -78.18164  10.72710
sample estimates:

mean of x
-33.72727

Confidence Intervals for the Location Parameter

Let X_1 and X_2 be independent random variables having the same continuous distribution F. The pseudomedian of F is the median of the random variable (X_1 + X_2)/2. If the distribution is symmetric about θ, then both the median and the pseudomedian are equal to θ. If X_1, ..., X_n is a random sample from F, the sample pseudomedian is the median of the n(n + 1)/2 pairwise averages

A_ij = (X_i + X_j)/2

for all i and j, i ≤ j. The signed rank statistic W_1^+ is equal to the number of positive A_ij 6. The sample pseudomedian is also called the Hodges-Lehmann estimator of θ 7. The Hodges-Lehmann estimator is in some respects a better estimator of θ than the sample median is if the distribution is symmetric.

A confidence interval for the parameter θ is obtained by taking sample quantiles of the A_ij as end points. We will illustrate again with Gosset's data.

> wilcox.test(REG,KILN,paired=T,conf.int=T)

        Wilcoxon signed rank test

data:  REG and KILN
V = 15, p-value = 0.123
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -84  20
sample estimates:
(pseudo)median
         -34.5

Compare this to the paired sample t test results.
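The reported (pseudo)median can be checked directly from the definition above by computing the pairwise averages of the differences. A small sketch, assuming the vector DIFF from the table above is available:

> A=outer(DIFF,DIFF,"+")/2           # all pairwise averages (X_i + X_j)/2
> A=A[upper.tri(A,diag=TRUE)]        # keep the pairs with i <= j
> median(A)                          # Hodges-Lehmann estimate
> quantile(A,c(.025,.975))           # rough end points for a confidence interval

The interval produced by "wilcox.test" is based on particular order statistics of the A_ij, so it will generally be close to, but not identical with, these sample quantiles.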

Exercises

1. Tabulate the exact distribution of V^+ for n = 5. (Hint: The possible values of V^+ are the integers 0 through 15. One way that V^+ = 5 could occur is for ranks 1 and 4 to belong to positive observations and all the other ranks to belong to negative observations, but there are other ways it could occur.)

6 Randles and Wolfe, op cit, p. 57
7 J.L. Hodges 1922-2000, E.L. Lehmann 1917-2009

2. Apply the signed rank test and confidence interval to the paired data in the data frame "runoff".

Compare the results to the results of a student-t test and confidence interval.

3. From the course data folder www.math.uh.edu/~charles/data/punts.txt import the paired data on punt distance for footballs inflated with Helium and with air. Find the Hodges-Lehmann estimator and confidence interval for the difference between Helium punt distance and air punt distance. Since the data is recorded with low precision, there will be ties in the ranks.

"wilcox.test" can handle this, but will return some warnings that exact intervals cannot be obtained.

To suppress the warnings, use the argument "exact=F" in the call to wilcox.test. This forces a normal approximation.

4. If a distribution is symmetric, the median and pseudomedian are equal and also equal to the mean, if it exists. This is not necessarily true for non-symmetric distributions. Find the median and pseudomedian of the exponential distribution with mean 1. If X_1 and X_2 are independent and have the exponential distribution with mean 1, then X_1 + X_2 ~ Gamma(shape = 2, scale = 1). You will have to either find a numerical approximation of the pseudomedian or estimate it by simulation.

11.2.2 The Wilcoxon Rank Sum Test

For testing equality of population means with independent samples from two distributions, we have so far been restricted to a large sample normal test, which assumes the distributions have variances, or the student-t test, which for small samples assumes near normality of the populations. A nonparametric alternative to these procedures is the Wilcoxon rank sum test. The development of the rank sum test closely parallels that of the Wilcoxon signed rank test. The outline below omits many of the details.

Let X_1, X_2, ..., X_n and Y_1, Y_2, ..., Y_m be independent samples from two continuous cumulative distributions F_X and F_Y. If these distributions differ only in location, there is a constant Δ such that Y has the same distribution as X + Δ, or in other words,

F_Y(y) = F_X(y − Δ)

for all real numbers y. Δ is called a shift parameter. It is the difference between the quantiles of Y and corresponding quantiles of X, and also the difference between their means if they exist. We are interested in testing the null hypothesis that Δ = 0, in which case the distributions of X and Y are the same.

Rank all m + n data values (signed values, not absolute values) together, with the smallest getting rank 1 and the largest getting rank m + n. One form of the Wilcoxon rank sum statistic is

W_Y = Σ_{i=1}^m rank(Y_i).

If Δ = 0 and X and Y have the same distribution, then there is nothing special about the ranks assigned to the Y's. W_Y will be just the sum of m randomly chosen integers between 1 and m + n. This defines the null distribution of W_Y. If Δ > 0, the ranks of the Y's will tend to be among the larger ones and W_Y will tend to be large. If Δ < 0, W_Y will tend to be small. Thus, W_Y is a reasonable test statistic for H_0: Δ = 0 against either a one-sided or two-sided alternative.

W_Y, after the modification introduced below, is also the number of pairs (i, j) such that Y_j > X_i. The test using this expression for W_Y is called the Mann-Whitney test 8. It was not recognized at first that the Mann-Whitney test and the rank sum test are mathematically equivalent.

It is convenient to modify the definition of W_Y by subtracting the smallest possible value of the rank sum. Thus,

W_Y = Σ_{i=1}^m rank(Y_i) − m(m + 1)/2.             (11.17)

Theorem 11.3. The mean of the random variable W_Y is

E(W_Y) = mn/2.

Its variance is

var(W_Y) = mn(m + n + 1)/12.

If Δ = 0, for large m and n the distribution of

Z = (W_Y − E(W_Y)) / sd(W_Y)

is approximately standard normal.
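The mean and variance in Theorem 11.3 are easy to check by simulating the null distribution of the modified rank sum. A minimal sketch with arbitrary sample sizes:

> m=6; n=8
> WY=replicate(10000,sum(sample(1:(m+n),m))-m*(m+1)/2)   # ranks of the Y's are a random subset under the null
> c(mean(WY),m*n/2)                                      # simulated and theoretical mean
> c(var(WY),m*n*(m+n+1)/12)                              # simulated and theoretical variance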

Estimating the Shift Parameter

The shift parameter Δ is estimated much like the center of symmetry θ is in the one-sample Wilcoxon signed rank procedure. The estimator is also called the Hodges-Lehmann estimator and it is defined as the median of the mn pairwise differences

d_ij = Y_j − X_i.

End points of confidence intervals for Δ are obtained from quantiles of the d_ij.
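For two numeric vectors x and y (hypothetical names standing in for the two samples), the Hodges-Lehmann estimate of Δ can be computed directly from this definition:

> d=as.vector(outer(y,x,"-"))        # all mn pairwise differences Y_j - X_i
> median(d)                          # Hodges-Lehmann estimate of the shift
> quantile(d,c(.025,.975))           # rough end points for a confidence interval

As with the one-sample case, "wilcox.test" with conf.int=T bases its interval on particular order statistics of the d_ij, so these sample quantiles will only approximate it.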

Example 11.3. In Chapter 9 we used the two sample student-t test to test for a difference in forced expiratory volume (fev) for smokers and nonsmokers. The data is in the "lungcap" data set. We will repeat the t test, assuming equal variances, and compare the results to the results of the Wilcoxon test and confidence interval.

> t.test(fev~smoke,data=lungcap,var.equal=T)

        Two Sample t-test

data:  fev by smoke
t = 1.7101, df = 83, p-value = 0.09098
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.01995121  0.26468836
sample estimates:
 mean in group no mean in group yes
          3.746740          3.624371

> wilcox.test(fev~smoke,data=lungcap,conf.int=T)

        Wilcoxon rank sum test with continuity correction

data:  fev by smoke
W = 1072.5, p-value = 0.07855
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -0.01502935  0.27396080
sample estimates:
difference in location
             0.1389037

As you can see, the results are very similar. This data is nearly normal, so the Wilcoxon test compares favorably to the t-test even when conditions are favorable for the latter.

8 Henry B. Mann 1905-2000, Donald R. Whitney 1915-2001

11.2.3 Exercises

1. In Chapter 9, Example 4 we analyzed the data in "bpcrossover" from a crossover experiment on blood pressure. Repeat this analysis, first using the student-t test and then the Wilcoxon rank sum test. Compare the results. In the call to "wilcox.test" use the argument "exact=F" so that you won't get warnings about ties. However, note that the sample sizes are small, so the normal approximation is questionable.

2. The rank sum test is usually applied in the context of two distributions that are supposedly the same except for location. However, it applies to the more general situation in which one of the distributions is always less than or equal to the other one, i.e., the null and alternative hypotheses

H_0: F_X = F_Y
H_1: F_X(x) ≤ F_Y(x) for all x, with strict inequality for some x

(that is, X is stochastically larger than Y).

For example, the rank sum test detects the difference between two exponential distributions even though they are not related by a shift parameter. Generate a random sample of X values of size 10 from the exponential distribution with mean 2 and a random sample of Y values of size 10 from the exponential distribution with mean 1. Compare them with "wilcox.test" using the optional argument "alternative = "greater"". Play with the sample sizes and the mean of Y. Note the cases when the p-value reaches 10% or less.

11.3 Bootstrap Confidence Intervals

A random variable X with density function

f(x) = 2 / (1 + x)^3,   x ≥ 0

has a mean but does not have a variance. The conditions for the large sample confidence interval for the mean that we used in Chapter 7 are not satisfied. The distribution of X is not symmetric, so the Hodges-Lehmann estimator for the center of symmetry also cannot be justified. Bootstrap confidence intervals are a way of proceeding when one wants to make as few theoretical assumptions as possible.

The name comes from the fable about a character lifting himself by his own bootstraps.

The histogram below is of a random sample of size 30 from the distribution above. The idea behind the bootstrap is that repeated resampling from the original samples yields as much information about precision of estimators as can be obtained from the data itself without making additional theoretical assumptions. One sample of the same size as the original sample, obtained by sampling the data with replacement, is called a bootstrap sample.

(Figure: histogram of the sample of size 30, horizontal axis x, vertical axis Frequency.)

The values in a bootstrap sample are among the values of the original sample data, but some will be repeated and others will be omitted.

For the distribution above, which incidentally has a mean of 1, the original sample was given a name.

> thesample=runif(30)
> thesample=1/sqrt(thesample)-1
> thesample
 [1] 1.01865 0.69580 0.40683 3.02067 0.46590 0.04336 1.07206 1.33057 0.02912 1.12058 2.51662 0.05060 0.10014 0.30890 3.32139
[16] 0.75437 0.35492 0.10888 0.92074 0.08729 0.17593 0.35980 0.66303 0.01714 0.88771 0.09114 1.01506 0.10497 0.27204 0.36616
> mean(thesample)
[1] 0.7226787

Then a single bootstrap sample of size 30 is obtained and its mean (the bootstrap mean) is

> bootsamp=sample(thesample,30,replace=T)
> bootsamp
 [1] 0.01714 0.35492 1.07206 0.17593 1.12058 0.17593 0.09114 0.10014 1.01506 0.69580 0.40683 0.40683 3.32139 1.12058 0.92074
[16] 3.02067 0.05060 1.01506 0.27204 0.75437 0.10014 0.10497 0.04336 3.32139 1.33057 0.75437 0.02912 3.02067 3.02067 1.07206
> mean(bootsamp)
[1] 0.9635043

Notice that the bootstrap mean x̄* is different from the original sample mean x̄. If another bootstrap sample should be obtained and its bootstrap mean calculated, it would be different too. In other words, there is a distribution of bootstrap means. The key assumption in the bootstrap confidence interval is that the bootstrap distribution of x̄* − x̄ is similar to the sampling distribution of x̄ − µ, where µ is the population mean. At least they are similar enough that quantiles of the bootstrap distribution of x̄* − x̄ can substitute for quantiles of x̄ − µ.

If the quantiles q(x̄ − µ, α/2) and q(x̄ − µ, 1 − α/2) were known, we could form the 100(1 − α)% confidence interval

x̄ − q(x̄ − µ, 1 − α/2) < µ < x̄ − q(x̄ − µ, α/2).

Since they are not known, we substitute the quantiles

q(x̄* − x̄, 1 − α/2) = q(x̄*, 1 − α/2) − x̄   and   q(x̄* − x̄, α/2) = q(x̄*, α/2) − x̄

of the bootstrap distribution of x̄*. After rearranging terms, this leads to the bootstrap confidence interval

2x̄ − q(x̄*, 1 − α/2) < µ < 2x̄ − q(x̄*, α/2)          (11.18)

for µ.

To get the bootstrap quantiles of x̄*, we generate many bootstrap samples of size 30 and calculate the bootstrap mean x̄* for each one. Then we take the sample quantiles of all the bootstrap means. The R procedure is

> bootmeans=replicate(1000,mean(sample(thesample,30,replace=T)))
> 2*mean(thesample)-quantile(bootmeans,.975)
    97.5%
0.3890227
> 2*mean(thesample)-quantile(bootmeans,.025)
     2.5%
 0.986073

For comparison, the student-t confidence interval is

> t.test(thesample)

        One Sample t-test

data:  thesample
t = 4.6273, df = 29, p-value = 7.139e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.4032608 1.0420965
sample estimates:

mean of x
0.7226787

The t-test confidence interval happens to be better in this one instance, in the sense that it does contain the target value of 1. It should be noted that no two bootstrap confidence intervals are exactly the same, even starting with the same primary sample data. This is because of the randomness in obtaining bootstrap samples.
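The whole recipe in (11.18) can be wrapped in a small function for reuse. This is only a sketch; the function name boot.ci.mean is made up for this illustration:

> boot.ci.mean=function(x,B=1000,conf=.95){
+   alpha=1-conf
+   # bootstrap means from B resamples of the same size as x
+   bm=replicate(B,mean(sample(x,length(x),replace=T)))
+   # interval (11.18): 2*xbar - q(xbar*, 1-alpha/2) < mu < 2*xbar - q(xbar*, alpha/2)
+   c(2*mean(x)-quantile(bm,1-alpha/2),2*mean(x)-quantile(bm,alpha/2))
+ }

For the sample above, boot.ci.mean(thesample) should produce an interval of roughly the kind shown, though not the identical numbers, because each call resamples anew.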

As another example, we will find a bootstrap confidence interval for the standard deviation of an exponential distribution with mean 3, and therefore with standard deviation σ = 3. If s is the sample standard deviation from the primary sample and s* is the bootstrap sample standard deviation, we could equate bootstrap quantiles of s* − s with sampling quantiles of s − σ in the manner above.

However, since σ is a positive scale parameter, it is preferable to equate bootstrap quantiles of s*/s to sampling quantiles of s/σ. This leads to the bootstrap confidence interval

s^2 / q(s*, 1 − α/2) < σ < s^2 / q(s*, α/2).        (11.19)

In R,

> thesample=rexp(30,rate=1/3)
> sd(thesample)
[1] 2.180378
> bootsd=replicate(1000,sd(sample(thesample,30,replace=T)))
> var(thesample)/quantile(bootsd,.975)
   97.5%
1.891767
> var(thesample)/quantile(bootsd,.025)
    2.5%
2.874222

11.3.1 Exercises

1. The coefficient of variation of a distribution with mean µ and standard deviation σ is σ/µ.

Create a function in your R workspace for calculating the sample coefficient of variation.

> cv=function(x) sd(x)/mean(x)

Use the "rgamma" function in R to simulate a sample of size 30 from a gamma distribution with shape parameter = 4. Mimic the example above and find a 95% bootstrap confidence interval for the coefficient of variation. Treat the coefficient of variation like a scale parameter. The call to "replicate" should be

> bootcv=replicate(1000,cv(sample(thesample,30,replace=T)))

Repeat this with other shape parameters. How many of the confidence intervals contain the true coefficient of variation?