Purdue STAT350

SAS Tutorial for Lab 1
Author: Leonore Findsen, Cheng Li, Min Ren
STAT 350: Introduction to Statistics
Department of Statistics, Purdue University, West Lafayette, IN 47907

0. Downloading Data

You will have to download all of the data from the internet before SAS can access it. If the file is accessed via a link, right-click on the file name and save it to a directory on your hard drive. It can be difficult to find the files afterwards, so if you are unfamiliar with PCs, I suggest that you download them directly to the W drive (main campus drive). SAS cannot read a zipped file, so if any file is zipped, remember to extract it before you try to access it in SAS. Please remember where your files are stored so that you can access them for the labs.

1. General Information for SAS

a) To help in debugging SAS code, the SAS editor color codes the information. The colors can help you debug the code, so I recommend that you memorize them or refer back to this listing. The following are some of the colors used by SAS:
    dark blue or blue: commands
    black: responses
    green: comments
    blue green: numbers
    purple: text/titles
    black: data

b) In the code that is presented, there are a large number of comments (in green). Please read them to understand what the variables represent. Comments are not required in your submitted code.

c) All command lines in SAS must end in a semicolon ';'.

d) To run a program, be sure that the editor window is active and then choose Run → Submit, or just click on the Submit icon in the toolbar. In addition to running all of the code, you may highlight part of the code and run only that part.

e) If the file does not run as you expect, please look in the log, specifically at the information in red or green, to see if that helps you find the problem.

f) SAS appends the output to the Results Viewer and Log screens, which can cause great confusion when you are doing more than one problem in a session.
If you are running the program locally, that is, you are NOT using goremote, I strongly suggest that you add the following code to the beginning of each program. This will clear both of these screens when you run the code:

    ods html close;
    ods html;

g) Data set names need to be one word (no spaces) and start with a letter. A common way to indicate a space is to use an underscore '_'. Capitalization is important. Not all special characters can be used. Please change your data set names and variable names to match each situation. Points will be deducted if this is not done.

h) When you submit the code in your lab report, please include all of the code for the question at the beginning of the question. This makes it easier to replicate your results.

i) Do not print out the whole data set inside of the program! We will take off points if there is code for this in your submission.

j) In SAS, every time you load a file, you need to input ALL of the variables, not just the variables that are used to answer the question.

2. Importing data sets, cleaning, manipulating, and printing them

Hummingbirds and flowers. (Data Set: ex01-88helicon_m.txt) Different varieties of the tropical flower Heliconia are fertilized by different species of hummingbirds. Over time, the lengths of the flowers and the form of the hummingbirds' beaks have evolved to match each other.
Here are data on the lengths in millimeters of three varieties of these flowers on the island of Dominica:

    data helicon;
        infile 'W:/ex01-88helicon_m.txt' delimiter = '09'x firstobs = 2;
        input variety $ length;
        /* delimiter = '09'x means the file uses tabs as delimiters.
           firstobs = 2 means we start reading the data from the second line,
           since the first line holds the names of the variables. */
        /* Since variety is a categorical variable, a '$' follows it. */
    run;
    proc print data = helicon;
    run;

I am only printing out the whole data set because it is small. DO NOT DO THIS FOR THE DATA SET IN CLASS. The first part of the output is below:

[Output table: first rows of the helicon data set]

I have highlighted the first five data points. You will see some error messages because there is some missing data. We will be removing those values later.

Please name your data set and variables in context. In this case, we are interested in Heliconia flowers, so I named the data set helicon. The variables in the text file are Variety and Length, so those are the names I use in the input statement. Since capitalization is important, I often use only small letters. If you want to put a space in the name, use an underscore, _. Note that there is a maximum length in SAS for both data set names and variable names, so be careful.

If you want to copy tables (or parts of tables) from the SAS output, I suggest that you use the Snipping Tool. You can also use that tool to highlight your answer. This is the procedure that I used in the above table.

For large data sets, do NOT print out the whole data set with all of the variables; only print out parts of it. The following code will print out the first 10 data points for the variable length only.
    proc print data = helicon (obs = 10);  * the obs keyword has to be in parentheses;
        var length;
    run;

If you want to print out more than one variable, separate them by spaces. The variables will be printed out in the order specified. For example, if I wanted to switch the order in which the variables were printed out, I would use the code:

    proc print data = helicon (obs = 10);
        var length variety;
    run;

Lastly, in large data sets, it can be important to print only specific rows (and columns). The following code prints rows 2, 20, and 50.

    ** Create data set printme with only the desired rows;
    data printme;
        set helicon;
        if _n_ in (2, 20, 50);
    run;
    ** Print it;
    proc print data = printme;
        var length variety;
    run;

The Obs column does not correspond to the rows in helicon, but to the rows in the new data set 'printme'.

Cleaning and saving data sets

Since many data sets have missing values, it is important to process them before beginning the analysis. For this class, we will simply remove those rows. The following code removes rows that are not complete and then displays the cleaned data set. When you are displaying, be sure that you include the correct data set name. Remember to NEVER print out large data files!

    data helicon_cleaned;
        set helicon;
        if nmiss(of _NUMERIC_) = 0 AND variety ^= "NA";
        /* only outputs a row if its numeric data were read as numbers
           (no missing values, e.g. from text such as NA) and variety is not NA */
    run;
    proc print data = helicon_cleaned;
    run;

In the above code, the new data set is called 'helicon_cleaned', which is based on the old data set called 'helicon'.
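Before removing rows, it can help to see how many values are actually missing. The following is a minimal sketch of one possible check, not a required step of the lab. It assumes the data set helicon and the variables variety and length created earlier in this tutorial; using proc means and proc freq for this purpose is an illustrative choice.

```sas
/* Sketch: count the non-missing (n) and missing (nmiss) values of length. */
/* Assumes the data set helicon from the infile step already exists.       */
proc means data = helicon n nmiss;
    var length;
run;

/* Sketch: tabulate variety so that any rows recorded as NA are visible.   */
/* The missing option also counts blank values as their own level.         */
proc freq data = helicon;
    tables variety / missing;
run;
```

Running the same two steps on helicon_cleaned afterwards should show no missing values of length and no NA level for variety, which is a quick way to confirm that the cleaning step did what was intended without printing the whole data set.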
Once you have cleaned the data set, you will want to save it so that you don't have to clean the data each time you use it. The procedure to save the new data set is as follows:

    File → Export Data
    → In 'Member', select the appropriate data set (in this case, helicon_cleaned), and be sure to select "Write variable labels as column names" (Fig. 1)
    → Next
    → Select Tab Delimited Files (*.txt) (Fig. 2)
    → Next
    → Where do you want to save the file? → Browse: browse on your computer for the appropriate location (I would suggest your W: drive). As I do not have a W: drive on my office computer, I am choosing another location (Fig. 3)
    → Finish

Use the following screenshots as guides.

[Fig. 1: selecting the data set]  [Fig. 2: selecting the file type]  [Fig. 3: selecting the save location]

Manipulating Data

For readability, you might want to change a shortened name or abbreviation to the full version. This is done by the following code. Remember to NEVER print out large data files!

    data helicon_new;
        length variety $ 15;
        set helicon_cleaned;
        if variety = 'red' then variety = 'Carabaea_Red';
        if variety = 'yellow' then variety = 'Carabaea_Yellow';
    run;
    proc print data = helicon_new;
    run;

Some of the output is shown below:

[Output table: helicon_new with the full variety names]

If the length command is not used, the name will be truncated. Always use a length command if you are replacing an abbreviation with a full name or increasing the length of a categorical variable. This command needs to be placed right after the data statement (before the set statement).

In addition, you might want to create a new variable based on mathematical operations on old variable(s).
You can use the sample code below to convert the lengths of the flowers from millimeters to inches. The conversion factor is 1/25.4. Remember to NEVER print out large data files!

    data helicon_new;
        set helicon_new;
        length_inches = length / 25.4;
    run;
    proc print data = helicon_new;
    run;

You will see a new column in the data set called length_inches.

Note that it is not necessary to change the name of the new data set. However, I strongly recommend that you do change it since (1) in case there's a mistake, you won't overwrite the original data, and (2) you can compare the new with the original. In addition, you should never re-use variable names. That is, your modified variable should always have a distinct name from any other variable in your data set.
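The recommendation to change the data set name can be sketched as follows. This is only an illustration: helicon_inches is a hypothetical name chosen for this example, and the code assumes that helicon_new, with the variables variety and length, already exists from the steps above.

```sas
/* Sketch: leave helicon_new untouched and write the result to a new data set. */
/* helicon_inches is a hypothetical name used only for this illustration.      */
data helicon_inches;
    set helicon_new;
    length_inches = length / 25.4;   /* 1 inch = 25.4 mm */
run;

/* Print only the first 5 rows rather than the whole data set. */
proc print data = helicon_inches (obs = 5);
    var variety length length_inches;
run;
```

Because the millimeter values stay in length and the converted values go into the separate variable length_inches, both can be compared side by side, and a mistake in the conversion does not overwrite the original data.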