I attached Task details and data file. I need the to be done by tomorrow morning October 1, 2021 by 11AM EST
Tasks: The file, “datafile.csv”, is to be used for this assignment and contains 7 independent attributes, a1 to a7, and 1 dependent attribute a8. Using “datafile.csv”, code a Python or R script to perform the following:
• Read “datafile.csv” into a dataframe
• Plot each attribute against each other and view how the data are related to each other
o Since 8 attributes might make the graph less easy to read, use 4 attributes at a time
o Try out “ggpairs(dataframe)” from the ggplot2 and GGally packages. Note the differences between the ggpairs plot and the plot command learned in class.
• Figure out how many 1’s and 0’s are in the a8 attribute column. Note the imbalance.
• Make a new dataframe by selecting randomly 150 of the rows where a8=1 and 150 of the rows where a8=0.
• From the new dataframe, make a training and a test set where the training set is 70% of the new dataframe and the test set 30%. o Try one more train-test split, such as 60-40.
• Build five logistic regression models on the training set using all attributes, a1 through a7 with a8, and four different subsets of the attributes, such as using a1, a2, and a3 with a8. Note the differences in the accuracies and examine the confusion tables.
• With the logistic regression model that you consider the best, run the model on the entire new dataframe, look at the confusion matrix, and note the accuracy.
• Generate the ROC curve with the results from the entire new dataset. What do you think it means?
• Generate the PR curve with the results from the entire new dataset. What do you think it means?
• Using k-fold cross-validation, run logistic regression on the entire new dataframe with 2 different fold sizes. Note the confusion tables, accuracies, and average accuracy.
• Save the new dataframe to a file called “newdatafile.csv” (without a row name column).
• Answer the following questions at the end of your script file using comments:
1. Which train-test split worked best for you? 70-30 or another? Why? Texas Tech University Whitacre College of Engineering
2. Which logistic regression model worked best of the five you tried? Why?
3. What does the resulting ROC curve tell you about the model?
4. What does the resulting PR curve tell you about the model?
5. Which size fold worked the best? Why?