StudyDaddy Statistics


QUESTION

Apr 25, 2019


The data set is http://www.ats.ucla.edu/stat/data/binary.csv

Problem 1: A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average), and prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable. Load the attached data file, binary.csv, into R, then use the R function str() to find out the structure of the file and the function head() to view the first few rows of the data.
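A minimal sketch of the loading step, assuming binary.csv has been saved where R can find it. The helper name load_admissions is hypothetical and only wraps the three calls the problem names.

```r
# Hypothetical helper: read the CSV, show its structure and first rows.
# Assumption: `path` points to a local copy of binary.csv.
load_admissions <- function(path = "binary.csv") {
  admissions <- read.csv(path)  # comma-separated file -> data frame
  str(admissions)               # column names, types, number of observations
  print(head(admissions))       # first six rows
  invisible(admissions)         # return the data frame without re-printing it
}

# Usage, once binary.csv is in the working directory:
# admissions <- load_admissions("binary.csv")
```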

(a) Use the xtabs() function to obtain a two-way contingency table of the categorical outcome and predictors; we want to make sure there are no empty (zero-count) cells. Use the R glm() (generalized linear model) function to estimate the coefficients of the logistic regression model with predictor variables gre, gpa, and as.factor(rank); i.e., treat the variable "rank" as a categorical (or indicator) variable and NOT as a numerical variable. Make sure to use glm() with the option family = "binomial", as shown in the example in class; see the lecture notes for additional detail. Then use the summary() function to obtain a summary of the results.
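One way part (a) might look in R, as a sketch rather than the lecture's own example. It assumes the data frame has columns named admit, gre, gpa, and rank; the function name fit_admission_model is hypothetical.

```r
# Hypothetical helper for part (a).
# Assumption: `d` has columns admit (0/1), gre, gpa, and rank.
fit_admission_model <- function(d) {
  # Cross-tabulate the outcome against rank; a zero cell would make the
  # corresponding rank coefficient unstable.
  print(xtabs(~ admit + rank, data = d))

  # as.factor(rank) makes glm() build indicator variables for every rank
  # except the lowest one, which serves as the reference level.
  glm(admit ~ gre + gpa + as.factor(rank), data = d, family = "binomial")
}

# Usage, once binary.csv is in the working directory:
# admissions <- read.csv("binary.csv")
# fit <- fit_admission_model(admissions)
# summary(fit)
```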

(b) (i) For every one-unit change in gre, by how much do the log odds of admission (versus non-admission) increase? (ii) For a one-unit increase in gpa, by how much do the log odds of being admitted to graduate school increase? (iii) The indicator variables for rank have a slightly different interpretation. For example, how do the log odds of admission change for having attended an undergraduate institution with a rank of 2, versus an institution with a rank of 1?
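The questions in part (b) can be read off the fitted model once it is written out. The equation below is a sketch that assumes rank takes the values 1 to 4 with rank 1 as the reference level; the symbols p and the beta coefficients are notation introduced here, not taken from the problem statement.

```latex
% p = probability of admission; \mathbf{1}\{\cdot\} is an indicator (1 if true, 0 otherwise)
\log\frac{p}{1-p}
  = \beta_0
  + \beta_1\,\mathrm{gre}
  + \beta_2\,\mathrm{gpa}
  + \beta_3\,\mathbf{1}\{\mathrm{rank}=2\}
  + \beta_4\,\mathbf{1}\{\mathrm{rank}=3\}
  + \beta_5\,\mathbf{1}\{\mathrm{rank}=4\}
```

Under this form, a one-unit increase in gre changes the log odds by \beta_1 and a one-unit increase in gpa changes them by \beta_2, each with the other predictors held fixed. \beta_3 is the difference in log odds between a rank 2 and a rank 1 institution. The estimates of these coefficients appear in the "Estimate" column of the summary() output.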

(c) Use the predict() function to calculate the predicted probability of admission at each value of rank, holding gre and gpa at their means.
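A possible sketch of part (c), assuming the model was fitted with the formula admit ~ gre + gpa + as.factor(rank). The helper name predict_by_rank and the column name prob are hypothetical.

```r
# Hypothetical helper for part (c).
# Assumptions: `fit` comes from glm(admit ~ gre + gpa + as.factor(rank), ...)
# and `d` is the data frame the model was fitted on.
predict_by_rank <- function(fit, d) {
  # One row per observed rank, with gre and gpa fixed at their sample means.
  newdata <- data.frame(
    gre  = mean(d$gre),
    gpa  = mean(d$gpa),
    rank = sort(unique(d$rank))
  )
  # type = "response" returns probabilities instead of log odds.
  newdata$prob <- predict(fit, newdata = newdata, type = "response")
  newdata
}

# Usage:
# admissions <- read.csv("binary.csv")
# fit <- glm(admit ~ gre + gpa + as.factor(rank), data = admissions,
#            family = "binomial")
# predict_by_rank(fit, admissions)
```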

Problem 2: Refer to the previous problem.

(a) Use the first 200 cases of the data as "training data" to fit a logistic regression with predictor variables gre, gpa, and as.factor(rank); i.e., treat the variable "rank" as a categorical (or indicator) variable and NOT as a numerical variable. In other words, repeat the model fit from part (a) of the previous problem, BUT using only the first 200 cases. Then use the predict() function to calculate the predicted probability of admission for each of the remaining 200 cases. Since the admission status is known for each of these remaining 200 cases, the predictions can be compared with the actual outcomes. Do you think the model gives a good prediction? Why? Explain.

This is related to the so-called cross-validation method.

(b) Repeat part (a), but use the first 300 cases as "training data". Then use the fit to predict the remaining 100 cases. Do you think the model gives a good prediction? Why? Explain.
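Parts (a) and (b) of Problem 2 differ only in the size of the training set, so both can be sketched with one helper. The function name holdout_check, the 0.5 cutoff for turning a probability into an admit/don't admit prediction, and the baseline comparison are all assumptions made for illustration; the problem itself does not prescribe how to judge the predictions.

```r
# Hypothetical helper for Problem 2.
# Assumptions: `d` has columns admit (0/1), gre, gpa, rank; every rank that
# occurs in the held-out rows also occurs in the training rows.
holdout_check <- function(d, n_train, cutoff = 0.5) {
  train <- d[seq_len(n_train), ]    # first n_train cases
  test  <- d[-seq_len(n_train), ]   # all remaining cases

  fit <- glm(admit ~ gre + gpa + as.factor(rank),
             data = train, family = "binomial")

  # Predicted admission probabilities for the held-out cases.
  prob <- predict(fit, newdata = test, type = "response")
  pred <- as.integer(prob > cutoff) # assumed rule: admit if prob > cutoff

  list(
    confusion = table(actual = test$admit, predicted = pred),
    accuracy  = mean(pred == test$admit),
    # Accuracy of always predicting the more common outcome among the
    # held-out cases, as a simple point of comparison.
    baseline  = max(mean(test$admit), 1 - mean(test$admit))
  )
}

# Usage:
# admissions <- read.csv("binary.csv")
# holdout_check(admissions, n_train = 200)  # part (a)
# holdout_check(admissions, n_train = 300)  # part (b)
```

Comparing accuracy with baseline gives one concrete basis for the "Why? Explain." part: a model that does not beat the baseline adds little beyond always guessing the more common outcome.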