Follow the instructions for Assignment# and use RStudio to perform simple linear regression and multiple linear regression. Then submit two documents: (1) Your write-up. This should be a PDF that i

This study resource was shared via CourseHero.com

For this assignment, you will use the baseballdata.csv file, which can be found on our Blackboard page. To complete this assignment, you will analyze the data for your year. Each student in the class will be assigned a unique year, so no two submissions will be the same. Submit your entire set of answers, which will include some things typed by you as well as several screenshots, as a single PDF file via the class Blackboard site.

Your Tasks:

I. The CSV file in front of you is pretty massive, but thankfully, you only have to deal with one year’s worth of data... but first, you need to isolate your data.

A. Working with the spreadsheet, delete all the rows that pertain to years other than yours, or copy the rows that do pertain to your year and paste them into a new sheet.

B. Once you have done this, resave your csv file with a new name.

Note: The sort function was used first to make sure all the years were in order. Then all rows except those from 1989 were deleted.

II. Read your csv into R.

A. Using the mean() function in R, find the average number of wins for all teams in your season.
a. What code did you enter into R to accomplish this? __________________

B. Now, use the mean function again to determine the average number of losses for all teams in your season.
a. What code did you enter into R to accomplish this? ___________________
b. What did you notice about the results from the last two functions you used? Why does this make sense?

The two outputs are the same. Every win for one team is also a loss for its opponent. As a result, the total number of wins must equal the total number of losses, and so the two averages must be equal as well.

C. Make a scatterplot that shows wins per team as a function of runs. What do you notice about this relationship?
Include a screenshot of your R source code, and the scatterplot that it generates. Be sure to label your x and y axes.

Figure 1. Team Runs versus Team Wins

The plot shows a general trend: the more runs a team scored, the more games it tended to win. This is understandable. Because the total number of games (161 or 162) was almost the same for all teams, more total runs means a higher average number of runs per game. The higher a team's average runs per game, the more likely it is to win any given game, and therefore the more total wins it is likely to have.

D. Now, add a line of best fit to the scatterplot. Show a screenshot of your R source code as well as the resulting output.

Figure 2. Linear regression of number of wins on number of runs (x-axis: Runs; y-axis: Wins)

E. Was any league dominant this season? Create a vertical barplot that enables a viewer to compare the win totals for each league (NL West, AL East, etc.) side-by-side. Give each league’s bar a unique color. Show a screenshot of your R source code as well as the resulting output.

Figure 3. Wins by League in 1989

The results show that all leagues have a similar number of wins, but AL West has around 100 more wins (21.27%) than the lowest, NL West.

F. Create a histogram that shows the number of wins per team on the x-axis, and the frequency on the y-axis. Now, suppose you want to see a finer level of detail. How can you increase the number of bins in your histogram? Show a screenshot with your R source code and the resulting output.

Figure 4. Histogram of number of wins in 1989

The shape of the distribution is not clearly defined. In my opinion, it should follow a binomial distribution.

F.
Using the GGally package, create a scatter plot matrix that shows the relationship among all of the following variables: Wins, Losses, Runs, Runs Against, Average Batter’s Age, Average Pitcher’s Age. Show a screenshot of your R source code as well as the resulting output.

Figure 5. Correlation Matrix

Wins and Losses have a correlation of -1 because every team played almost the same number of games, so a team's losses are essentially its games played minus its wins. For the same reason, in each column the correlations for Wins and Losses sum to 0. Wins is clearly positively correlated with Runs and clearly negatively correlated with Runs Against: the more runs a team's opponents score, the more likely the team is to lose. Furthermore, average pitcher’s age is also positively correlated with Wins, which may be explained by older pitchers having more experience and pitching better.

G. Now, build a heatmap correlation matrix that shows the relationship between the same variables from part F. Show a screenshot of your R source code as well as the resulting output.

Figure 6. Heatmap of Correlation Matrix

H. Run the prcomp() function on three of the variables in your data set: wins, runs, and runs against. How many Principal Components did it require for you to account for more than 80% of the variation in the data? Show a screenshot of your R source code as well as the resulting output.

The results show that PC1 and PC2 together explain about 99.82% of the variance. PC1 indicates a positive relationship between Runs and Runs Against, with Wins contributing slightly in the negative direction. PC2 reflects the positive relationship between Wins and Runs that was also seen in Questions C and D.
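To answer the "more than 80%" question in part H numerically, the cumulative share of variance can be computed from the standard deviations that prcomp() returns. The following is only a minimal sketch on made-up data: the data frame `season`, its column names W, R and RA, and the team count of 26 are all assumptions for illustration and are not taken from baseballdata.csv, so the printed numbers will not match the figures reported above.

```r
# Hypothetical stand-in for one season of team data.
# Column names (W, R, RA) and n = 26 are assumptions, not the real file layout.
set.seed(42)
n <- 26
season <- data.frame(
  R  = round(rnorm(n, mean = 680, sd = 50)),  # runs scored
  RA = round(rnorm(n, mean = 680, sd = 50))   # runs allowed (Runs Against)
)
# Wins loosely driven by run differential plus noise (illustrative only).
season$W <- round(81 + 0.1 * (season$R - season$RA) + rnorm(n, sd = 3))

# PCA on the three variables, without scaling (prcomp centers by default).
pca <- prcomp(season[, c("W", "R", "RA")])

# Each component's variance is its squared standard deviation.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
cum_share <- cumsum(var_share)

# Smallest number of components whose cumulative share exceeds 80%.
n_needed <- which(cum_share > 0.80)[1]

print(round(cum_share, 4))
print(n_needed)
```

The same cumulative shares appear in the "Cumulative Proportion" row printed by summary(pca).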
If the variables are scaled before the principal component analysis, the results show that the first two principal components account for 97.87% of the variance. PC1 is easier to interpret in this case, because it indicates that Wins and Runs are positively related and that Runs Against is negatively related to them. PC2 indicates that Runs and Runs Against have a positive relationship.
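One plausible reason the scaled and unscaled analyses read differently is that, without scaling, variables with larger variance (here, likely Runs and Runs Against) dominate the leading component. The sketch below compares the two on made-up data. The data frame `season` and the column names W, R and RA are assumptions for illustration only, and the output will not reproduce the percentages reported above.

```r
# Hypothetical stand-in data; column names W, R, RA are assumptions.
set.seed(1)
n <- 26
R  <- round(rnorm(n, mean = 680, sd = 50))            # runs scored
RA <- round(rnorm(n, mean = 680, sd = 50))            # runs allowed
W  <- round(81 + 0.1 * (R - RA) + rnorm(n, sd = 3))   # wins (illustrative)
season <- data.frame(W = W, R = R, RA = RA)

# Unscaled PCA works on the covariance matrix;
# scaled PCA (scale. = TRUE) works on the correlation matrix.
pca_raw    <- prcomp(season)
pca_scaled <- prcomp(season, scale. = TRUE)

# Per-variable variances show why scaling matters here.
print(sapply(season, var))

# Loadings (rotation) for each version, side by side.
print(round(pca_raw$rotation, 3))
print(round(pca_scaled$rotation, 3))

# Component variances of the scaled PCA equal the eigenvalues
# of the correlation matrix.
eig <- eigen(cor(season))$values
print(all.equal(pca_scaled$sdev^2, eig))
```

With scaling, each variable contributes one unit of variance, so the loadings reflect correlation structure and not measurement scale.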