Statistics

Week 8 Discussion Board Example – Linear Regression – Certification Time

An oil company wants to develop an aptitude test that can predict how efficient a new employee will be before they spend thousands of dollars training and outfitting them to work on their floating oil rigs.

They want you to evaluate the relationship between the test scores and the length of time it took the workers to complete their on-site certification. You collected the information on 50 new hires and performed a linear regression on their test scores and the time it took them to complete their certification. Use what you have learned about linear regression to answer the following questions. The output from the Excel ToolPak Regression tool is located below.

DATA

 #  Test Score  Hours to Complete Certification    #  Test Score  Hours to Complete Certification
 1  119         277                               26  164         237
 2  91          266                               27  114         275
 3  110         258                               28  202         261
 4  256         227                               29  208         154
 5  239         223                               30  259         193
 6  292         103                               31  247         145
 7  176         217                               32  210         228
 8  193         176                               33  264         254
 9  211         281                               34  93          221
10  196         183                               35  200         242
11  124         287                               36  172         181
12  120         276                               37  203         168
13  263         110                               38  251         152
14  190         202                               39  193         174
15  252         231                               40  264         108
16  179         236                               41  196         228
17  237         191                               42  290         141
18  290         275                               43  236         185
19  198         288                               44  120         233
20  286         108                               45  122         275
21  180         232                               46  166         290
22  206         219                               47  106         279
23  231         175                               48  104         251
24  176         265                               49  121         206
25  174         233                               50  230         213

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.6070713
R Square            0.368535564
Adjusted R Square   0.355380055
Standard Error      41.99982907
Observations        50

ANOVA
             df   SS             MS         F          Significance F
Regression    1   49415.90919    49415.91   28.01378   2.95E-06
Residual     48   84671.31081    1763.986
Total        49   134087.22

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept    323.5161       21.0445          15.3729   0.0000    281.2032    365.8290    281.2032      365.8290
Test Score   -0.5494        0.1038           -5.2928   0.0000    -0.7582     -0.3407     -0.7582       -0.3407

What is the regression equation from the Summary Output? Is this a useful model? How do you know?

$\hat{y} = 323.5 - 0.5494x$

Here x is the test score and $\hat{y}$ is the predicted number of hours to complete certification. The tool automatically performs a test of hypothesis to determine if the slope of the population regression model ($\beta_1$) is equal to zero.
0: 1= 0 If 1 is equal to zero, then the predicted y value would be a constant and there would be no value in knowing the regression equation. The p -value for this test is less than 0.0000. Therefore, we can reject the null hypothesis and we know the model is useful. Are the assumptions of regression satisfied? How did you verify them? There are four assumptions we need to verify, before we try to use the regression model. Linearity – Is there a linear relationship between the dependent and independent variables? We check this assumption by looking at the scatter plot of the original da ta. The scatter plot shows a relatively weak negative linear relationship. Independence – Are the errors (residuals) independent of each other? We check this assumption by looking at the residuals plot. The residuals plot does not show any patterns or tren ds. This indicates the errors are independent. Normality – Are the residuals normally distributed around the regression line? We check this assumption by looking at the Normal Probability Plot of the residuals. A straight line indicates the residuals are n ormally distributed. Our plot shows a small step near the left end of the curve, but other than that, look s fairly linear. Equal variance – Is the spread of the residuals approximately the same across the range of the dependent variable? We check this assu mption by looking at the residuals plot. The spread appears to be wider on the far right side and narrow on the far left. It appears fairly equal across the mid -section. The equal variance assumption appears questionable, but the other three assumptions ar e satisfied . These assumptions are fairly robust. We should use care when using the regression model. Does test score appear to be a good predictor for the certification time ? Why do you think that? The coefficient of correlation (R 2) is the percentage of the change in y (change in certification time ) that can be explained by the change in x (change in test score ). 
The $R^2$ value for these two variables is 0.3685. About 37% of the variation in certification time can be explained by the variation in test scores. The test scores appear to be a fair predictor for the time required to complete certification, but there are other factors that are probably equally important.

One of the company’s new employees scored a 150 on the test while another scored 25. What is the predicted time to complete certification for these two employees?

If we start with the regression equation and plug 150 in for x, we can calculate the predicted time to complete certification for the first employee.

$\hat{y} = 323.5 - 0.5494x$
$\hat{y} = 323.5 - 0.5494(150)$
$\hat{y} \approx 241$

This employee should complete certification in approximately 241 hours. We should not use this regression model to predict the second employee’s certification time.

The model was built from test scores ranging from about 90 to 300. A score of 25 is too far outside that range, so any prediction for it would be an unreliable extrapolation.
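The two predictions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the original Excel analysis: the function name `predict_hours` is hypothetical, the coefficients are the rounded values from the Coefficients table, and the range limits are the smallest and largest test scores in the data table (91 and 292).

```python
# Rounded coefficients taken from the summary output.
INTERCEPT = 323.5
SLOPE = -0.5494

# Smallest and largest test scores among the 50 new hires.
MIN_SCORE = 91
MAX_SCORE = 292


def predict_hours(test_score):
    """Return predicted certification hours, or None when the score
    falls outside the range of test scores used to fit the model."""
    if not MIN_SCORE <= test_score <= MAX_SCORE:
        return None  # extrapolation: the model is not reliable here
    return INTERCEPT + SLOPE * test_score


for score in (150, 25):
    hours = predict_hours(score)
    if hours is None:
        print(f"Score {score}: outside the fitted range, no prediction")
    else:
        print(f"Score {score}: about {hours:.0f} hours")
```

Returning `None` for out-of-range scores is one way to make the extrapolation limit explicit; a stricter version could raise an error instead.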