
Week 3 Lecture 11: Regression Analysis

Regression analysis is the development of an equation that shows the impact of the independent variables (the inputs we can generally control) on the output result. While the mathematical language may sound strange, most of you are quite familiar with regression-like instructions and use them regularly. To make a cake, we take 1 box mix, add 1¼ cups of water, ½ cup of oil, and 3 eggs. All of this is combined and cooked. The recipe is an example of a regression equation. The output (or result, or dependent variable) is the cake; the independent variables are the inputs used. Each input is accompanied by a coefficient (also called a weight or amount) that tells us how "much" of the variable is "used," or weighted, in the outcome. In equation form, this cake recipe might look like:

Y = 1*X1 + 1.25*X2 + 0.5*X3 + 3*X4

where: Y = cake, X1 = box mix, X2 = cups of water, X3 = cups of oil, and X4 = an egg.

Of course, for the cake, the recipe needs to go through the cooking process; for other regression equations, the inputs go through whatever "process" turns them into the output – this is often called "life."

Example. With a regression analysis, we can identify what factors influence an outcome. So, with our salary issue, the natural question to help us answer our research question of whether males and females get equal pay for equal work would be: what factors influence or explain an individual's pay? This is a perfect question for a multivariate regression. Multivariate simply means we have multiple input variables with a single output variable (Lind, Marchal, & Wathen, 2008).
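The cake-recipe equation can be sketched as a small Python function. This is purely illustrative; the coefficients are the recipe "amounts" from the text, and the function just computes the weighted sum that every regression equation represents:

```python
# A regression equation is a weighted sum of inputs; the coefficients
# below are the cake-recipe amounts from the text (illustration only).
def recipe_output(box_mix, cups_water, cups_oil, eggs):
    """Compute Y = 1*X1 + 1.25*X2 + 0.5*X3 + 3*X4."""
    return 1 * box_mix + 1.25 * cups_water + 0.5 * cups_oil + 3 * eggs

# One batch: 1 box mix, 1.25 cups water, 0.5 cup oil, 3 eggs
print(recipe_output(1, 1.25, 0.5, 3))  # -> 11.8125
```

The point is not the number it prints but the structure: each input is multiplied by its coefficient and the results are added up.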

Variables. A regression analysis uses two distinct types of data. The first type is variables that are at least interval level (the same as in the other techniques we have used so far). The other is called a dummy variable: a variable coded 0 or 1 to indicate the presence of some characteristic. In our data set, we have two variables that can be used as dummy-coded variables in a regression, Degree and Gender, both coded 0 or 1. For Degree, 0 stands for having a bachelor's degree and 1 stands for having an advanced degree. For Gender, 0 means male and 1 means female. How these are interpreted in a regression output will be discussed below. For now, the significance of dummy coding is that it allows us to include nominal or ordinal data in our analysis.

Excel Approach. For our question of what factors influence pay, we will use Excel's Regression function, found in the Data Analysis section. This function produces two output tables of interest. The first table tests whether the entire regression equation is statistically significant; that is, whether the input variables significantly impact the output variable. If so, we then examine the second table, which holds the coefficients used in the regression equation for each of the variables. We have a second set of hypothesis statements for each variable: the null is that the coefficient equals 0, versus an alternate that the coefficient is not equal to 0. Typically, we list these before we start the analysis.

Step 1: For the regression equation:

Ho: The regression equation is not significant.
Ha: The regression equation is significant.

For the coefficients, if the regression equation is significant:

Ho: The regression coefficient equals 0.
Ha: The regression coefficient is not equal to 0.

Note: We would write one pair of statements for each variable; for space reasons, we include only one general statement that should be applied to each variable.

Step 2: Reject each null hypothesis if the related p-value is less than (<) alpha = .05.
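The Step 2 decision rule can be sketched as a small helper. The alpha of .05 is the one used throughout these lectures; the p-values passed in below are invented examples:

```python
# Sketch of the decision rule: reject the null hypothesis when the
# p-value falls below alpha (.05 in this course).
ALPHA = 0.05

def decide(p_value, alpha=ALPHA):
    """Return the hypothesis-test decision for a given p-value."""
    return "Reject Ho" if p_value < alpha else "Fail to reject Ho"

print(decide(0.003))   # well under .05 -> Reject Ho
print(decide(0.2231))  # above .05 -> Fail to reject Ho
```

The same rule is applied once to the overall equation (ANOVA table) and then to each coefficient's p-value.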

Step 3: Regression analysis.

Step 4: Perform the test. Selecting the Regression option in Data Analysis will open a familiar data entry box. The Input Y Range would be the salary range, including the label. The Input X Range would be the labels and data for our input variables; in this case we will use Midpoint, Age, Performance Rating, Service, Raise, Degree, and Gender. Be sure to check the Labels box and pick the upper-left corner of an output range. This will result in the following output (values rounded to three decimal places):

Step 5: Conclusions and interpretation. Let's look at each table separately.
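Excel does the fitting for us, but the idea behind the coefficients can be sketched with a one-variable least-squares fit in pure Python. The data points below are invented (they lie exactly on the line Y = 1 + 2*X); Excel's Regression tool does the multivariable version of this same calculation:

```python
# Pure-Python sketch of a one-variable least-squares fit.
def fit_line(xs, ys):
    """Return (intercept, slope) for the least-squares line Y = A + B*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # toy data on the line y = 1 + 2x
print(a, b)  # -> 1.0 2.0
```

The output tables Excel produces (Regression Statistics, ANOVA, and coefficients) are all built on this kind of least-squares calculation, extended to several input variables at once.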

The Regression Statistics table shows a Multiple R and an R Square value. Multiple R is the multiple correlation coefficient; similar to our Pearson coefficient, it shows the relationship between the dependent variable (the output, or Salary in this case) and all of the independent or input variables together. R Square is the multiple coefficient of determination; similar to the Pearson coefficient of determination, it displays the percent of variation in common between the dependent and all of the independent variables. The Adjusted R Square reduces R Square by a factor that involves the number of variables and the sample size – a suggestion of whether the design impacted the outcome more than the variables did. We have an insignificant reduction. The Standard Error is a measure of variation in the outcome used for predictions. The observations count shows the number of cases used in the regression.

The ANOVA table, sometimes called ANOR (analysis of regression), provides us with our test of significance. As with the ANOVA covered in Week 3, we look at the Significance F value (the p-value) to see if we reject or fail to reject the null hypothesis of no significance. In this case, the p-value of 8.44E-36 (equaling 0.00000000000000000000000000000000000844) is less than .05, so we reject the null of no significance. The regression equation explains a significant proportion of the variation in our dependent variable, salary.

Now that we have a significant regression equation, we move on to the final table, which presents and tests the coefficients for each variable. One of the important features of a regression equation is that it shows us the impact of each factor when all other factors are held constant. A regression has the form:

Y = A + B1*X1 + B2*X2 + B3*X3 + ...

where Y is the output, A is the intercept (which places the line up or down on the Y axis when all other values are 0), the B's are the coefficient values, and the X's are the variable names.
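What R Square reports can be sketched directly: the share of variation in Y that the fitted values explain. The numbers below are invented for illustration, not taken from our salary data:

```python
# Sketch of the coefficient of determination: R^2 = 1 - SSE/SST,
# the share of variation in Y explained by the regression.
def r_squared(actual, predicted):
    mean_y = sum(actual) / len(actual)
    ss_total = sum((y - mean_y) ** 2 for y in actual)          # total variation
    ss_error = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # unexplained
    return 1 - ss_error / ss_total

# A close fit: predictions nearly match the actual values, so R^2 is near 1.
print(r_squared([2, 4, 6, 8], [2.2, 3.9, 6.1, 7.8]))
```

An R Square near 1 means the inputs account for almost all of the variation in the output; an R Square near 0 means they account for almost none of it.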
Before considering whether each coefficient is statistically significant, our equation would be:

Salary = -4.009 + 1.22*Midpoint + 0.029*Age - 0.096*Perf Rat - 0.074*Service + 0.834*Raise + 1.002*Degree + 2.552*Gender

Whew! What does this mean? The intercept is an adjustment factor, one that we do not need to analyze. For Midpoint, it means that as midpoint goes up by a thousand dollars (remember, salary and midpoint are measured in thousands), salary goes up by 1.22 thousand – higher-graded employees are paid relatively more compared to midpoint than others (all other things equal). For Performance Rating, employees lose $96 (-0.096 thousand) for every higher PR point they have – certainly not what HR would like!

Now, let's look at our dummy variables, Degree and Gender. For Degree, an extra $1,002 is added for employees having a Degree code of 1, since if Degree = 0, then 1.002*0 = 0; so graduate degree holders get an extra $1,002 per year. The same logic applies to Gender: those coded 0 get nothing extra, and those coded 1 get $2,552 more per year (all other things equal). Since females are coded 1, if this factor is significant, they would be paid $2,552 more than males with all other factors equal (the definition of equal work).

So, now let's look at the statistical significance of each of the variables. This is determined with the P-value column (next to the t Stat value). It is read the same way as in the t-test and ANOVA tables: if the value is less than 0.05, we reject the null hypothesis of no significance. While the intercept has a significance value, we tend to ignore it and include the intercept in all equations. For the other variables, the only significant ones are Midpoint, Perf Rating (unrounded, its p-value was 0.0497994...), and Gender. So, the regression equation including only our statistically significant factors is:

Sal = -4.009 + 1.22*Midpoint - 0.096*Perf Rat + 2.552*Gender
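The significant-factors equation can be checked with a quick sketch. Only the coefficients come from the regression output in the text; the midpoint (48) and performance rating (90) below are invented example values used to show that the gender difference is exactly the Gender coefficient when everything else is held constant:

```python
# Evaluate Sal = -4.009 + 1.22*Midpoint - 0.096*Perf Rat + 2.552*Gender
# (coefficients from the text; input values are invented for illustration).
def predicted_salary(midpoint, perf_rat, gender):
    """Predicted salary in thousands; gender is dummy coded 0 = male, 1 = female."""
    return -4.009 + 1.22 * midpoint - 0.096 * perf_rat + 2.552 * gender

male = predicted_salary(48, 90, 0)    # Gender coded 0
female = predicted_salary(48, 90, 1)  # same midpoint and rating, Gender coded 1
print(round(female - male, 3))        # -> 2.552, the Gender coefficient
```

Holding the other inputs constant, the only difference between the two predictions is the dummy variable's coefficient – which is exactly how a dummy-coded factor is interpreted.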
So, we now have a clear answer to our question about males and females getting equal pay for equal work. Not only is the answer no (since gender is a significant factor in determining salary), but females are paid $2,552 more annually, all other things equal! This is certainly not the outcome most of us expected when we began this journey. What we see is that variation within any measure has some often unanticipated causes, and unless we examine the inputs into our results, we often do not understand them very well. Single-measure tests such as the t and ANOVA tests are quite valuable for comparing similar results, but they do not always get to the root of what causes differences.

Reference

Lind, D. A., Marchal, W. G., & Wathen, S. A. (2008). Statistical techniques in business & economics (13th ed.). Boston: McGraw-Hill Irwin.