Purdue STAT350

STAT 350 (Fall 2017) Lab 1

Lab 1 (100 points): Introduction to Statistical Packages

Objectives: Loading files, cleaning and manipulating the data.

A. (10 pts.) Online Prelab

B. (90 points) US Demographic, Crime, and Test-Score Data.

This semester, we are going to be exploring some data concerning Demographic, Crime, and Test-Score data for counties across the United States. The data we will analyze is in the data set "USData.txt". The variable names and definitions are listed in the file "US_Data set_Definition.docx". In this lab, we are going to explore what is included in the data set, load it into the software package, and do some basic manipulations.

1. (10 points) How many variables does this data set contain? Which are categorical or qualitative variables and which are quantitative or numeric variables? Besides looking at the documentation file provided, you might want to look at the data file itself in a spreadsheet, notepad or the software package (R only).

2. (16 pts.) Write two analysis questions that can be answered from the data provided. In the project due at the end of the semester, your group will have to pose general questions that can be answered by three different statistical methods. You will be allowed to change the question when you start the project, but this will get you thinking of possibilities.

3. (20 points) Load the data into your software package, and provide the programming code used to do so. If you used menu options to load the data, rather than code, please describe the procedure you followed. No output is required.

4. (19 points) Are there missing values (NA) in the data set? If so, please create a new data set by removing any rows that contain one or more NAs from the original data set. Please save this new data set to your computer and/or ITAP folder; this will be the one you use for the rest of the semester.
   a. (5 pts.) Code
   b. (9 pts.) We want to know how many rows were removed, so please answer:
      i. How many observations are there in the original data set? (The output is all that is required.)
      ii. How many observations are there after removing the incomplete data? (The output is all that is required.)
      iii. How many rows were removed (show the work, even though it is a quick calculation)?
   c. (5 pts.) In which directory did you save your cleaned data set?

5. (10 points) For readability, we want to transform the values of Region from the two-character code to the full region name. That is, please create a new variable called RegionNew such that: If Region is "NE", RegionNew is "Northeast", "NC" → "North Central", "SO" → "South", and "WE" → "West".
   a. (5 pts.) Code
   b. (5 pts.) Print or display the data set (on the computer, not to physical paper), and take screen clippings which demonstrate the following rows: 2, 22, 222, and 322.
      Please highlight or somehow indicate the changes. These rows capture all four regions and will prove that your code worked correctly. To save space, you are permitted to restrict the data set to show only the relevant columns.

6. (15 points) We are going to show that "PopulationDensity" can be calculated from other variables in the data set.
   a. (5 pts.) Write down the equation relating "PopulationDensity" to "Population" and "LandArea."
   b. (5 pts.) Write code (and provide it here) to create a new variable called PopulationDensityNew which implements the calculation described in part (a).
   c. (5 pts.) Show that your code is correct by displaying the original variable "PopulationDensity" and "PopulationDensityNew". Please only print out the first 6 rows. To save space, you are permitted to restrict the data set to show only the relevant columns.
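
For orientation only, the manipulations in questions 4 through 6 might be sketched as follows. This is a minimal sketch in Python rather than the course's software package, and it rests on several assumptions that are not confirmed by this handout: that the data are tab-separated, that missing values appear as the text "NA", and that population density is simply Population divided by LandArea (the usual definition; the definition file may specify units or scaling). The sample rows, the County column, and the helper names load_rows, drop_incomplete, and add_derived_columns are hypothetical and made up for illustration; they are not taken from USData.txt.

```python
import csv
import io

# Hypothetical stand-in for a few rows of USData.txt. The real file's
# delimiter, columns, and values are NOT known here; tab-separated text
# with "NA" marking missing values is an assumption.
SAMPLE = (
    "County\tRegion\tPopulation\tLandArea\tPopulationDensity\n"
    "A\tNE\t1000\t50\t20\n"
    "B\tNC\t3000\tNA\tNA\n"
    "C\tSO\t450\t90\t5\n"
    "D\tWE\t1200\t400\t3\n"
)

# Two-character region codes and their full names, as listed in question 5.
REGION_NAMES = {
    "NE": "Northeast",
    "NC": "North Central",
    "SO": "South",
    "WE": "West",
}


def load_rows(text):
    """Parse tab-separated text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def drop_incomplete(rows):
    """Keep only rows in which no field equals the string 'NA'."""
    return [r for r in rows if "NA" not in r.values()]


def add_derived_columns(rows):
    """Return copies of the rows with RegionNew and PopulationDensityNew added."""
    out = []
    for r in rows:
        new = dict(r)
        new["RegionNew"] = REGION_NAMES[r["Region"].strip()]
        # Assumed relationship: density = population / land area.
        new["PopulationDensityNew"] = float(r["Population"]) / float(r["LandArea"])
        out.append(new)
    return out


rows = load_rows(SAMPLE)
clean = drop_incomplete(rows)

# Row counts: original, after removing incomplete rows, and the difference.
print(len(rows), len(clean), len(rows) - len(clean))

# Compare the original and recomputed density for at most the first 6 rows.
for r in add_derived_columns(clean)[:6]:
    print(r["Region"], r["RegionNew"], r["PopulationDensity"], r["PopulationDensityNew"])
```

With the real data, SAMPLE would be replaced by the contents of the actual file, and the delimiter and missing-value marker would need to be checked against it first. Printing the original and recomputed density columns side by side is the same check that question 6(c) asks for.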