
Final Project (100 Points)
Due: August 4 at 5:30 PM ET

Directions: Please submit an electronic copy of your R code, with clear documentation, using Blackboard Learn E-mail. Name your script file “lastname_Project.R”. See further instructions in the course syllabus. 10 points will be deducted from your project grade for every hour of delay.

The project consists of writing an R function that performs the nonparametric Gehan test, applying the function to two datasets, and creating a useful graphical description of each dataset. The method for the Gehan test is given below, along with an example. Follow the guidelines given at the end of this document for writing the function and performing the analysis.

Gehan Test

The Gehan test is a nonparametric procedure for comparing the medians of two independent samples that may contain data that are left truncated at different values. We must assume that the truncation mechanism is the same for both populations. However, we do not have to make any assumptions about the variance of the population distributions.

This procedure is frequently used with environmental data to determine whether levels of a chemical at a site are different from levels that naturally occur in nearby areas. For example, arsenic, a known carcinogen, naturally occurs in soil and is also a byproduct of mining activity. The Gehan test can be used to compare soil samples from a mining site to nearby areas that are not affected by mining. Furthermore, environmental data are often left truncated. Laboratory machines have a detection limit: if the concentration of a chemical is lower than this limit, the concentration cannot be detected, and the observation is left truncated at the known detection limit.

Procedure

The following procedure is taken from the 2002 Naval Facilities Engineering Command Guidance for Environmental Background Analysis, Volume 1: Soil, available at the Argonne National Laboratory website, http://www.ead.anl.gov/.
Suppose m background samples and n site samples are collected. Site refers to a potentially hazardous site that is being investigated, and background refers to nearby areas that reflect naturally occurring levels. If an observation is a non-detect, that is, laboratory machines are unable to detect the chemical concentration, then the detection limit of the machine is given along with a less-than sign to denote the truncated observation. The following procedure is used to test the hypotheses

H0: Median of Site = Median of Background
Ha: Median of Site > Median of Background

I. List the combined m background and n site measurements, including non-detect values, from smallest to largest. The total number of combined samples is N = m + n. Use the given detection limit for non-detect data.

II. Determine the N ranks, $R_1, R_2, \ldots, R_N$, for the N ordered data values using the method described in the example below.

III. Compute the N scores, $a(R_1), a(R_2), \ldots, a(R_N)$, where $a(R_i) = 2R_i - N - 1$, for $i = 1, 2, \ldots, N$.

IV. Compute the Gehan statistic, G,

$$G = \frac{\sum_{i=1}^{N} h_i \, a(R_i)}{\left\{ mn \sum_{i=1}^{N} [a(R_i)]^2 \big/ [N(N-1)] \right\}^{1/2}},$$

where $h_i$ is an indicator: $h_i = 1$ if the ith observation is from the site population and $h_i = 0$ if the ith observation is from the background population.

V. Calculate the p-value. When $m \ge 10$ and $n \ge 10$, calculate the p-value using a large-sample approximation. Otherwise, for small m and n, calculate the p-value using a permutation test.

   • For large samples, the distribution of G is approximately standard normal. Therefore, reject the null hypothesis if $G.obs \ge Z_{1-\alpha}$, where G.obs is the observed value of the Gehan statistic G and $Z_{1-\alpha}$ is the $100(1-\alpha)$th percentile of the standard normal distribution.

   • To perform a permutation test for small samples,
     (a) Take a random sample of size n from the pooled data without replacement. These n values represent site data and the other m observations are the background data.
     (b) Calculate G for this resample.
     (c) Repeat steps (a) and (b) several thousand times.
     (d) The distribution of the test statistics calculated in step (c) approximates the sampling distribution under the null hypothesis. The permutation p-value is the proportion of resamples that give a result at least as great as the observed G.

Example: Below are 10 samples each from the site and background areas. The < denotes a non-detect observation, that is, data left truncated at the detection limit.

Background: 1 <4 5 7 <12 15 18 <21 <25 27
Site: 2 <4 8 17 20 25 34 <35 40 43

The following steps are used to create Table 1 below, which is then used to calculate G.

1. List the combined m background and n site measurements in column 1 of the table from smallest to largest. Use the given detection limit for non-detect data.

2. Place a 0 or 1 in the second column of the table, $h_i$, using the following rule:
   $h_i = 1$ if the ith measurement is from the site
   $h_i = 0$ if the ith measurement is from background

3. Place a 0 or 1 in the third column of Table 1, $\delta_i$, using the following rule:
   $\delta_i = 1$ if the ith measurement is a detection
   $\delta_i = 0$ if the ith measurement is a non-detect

4. Determine the values of $d_i$ and $e_i$ using these rules:
   • If the first value is a detect, that is, if $\delta_1 = 1$, then set $d_1 = 1$ and $e_1 = 0$.
   • If the first value is a non-detect, that is, if $\delta_1 = 0$, then set $d_1 = 0$ and $e_1 = 1$.
   • For each successive row, increase $d_i$ by 1 when $\delta_i = 1$; i = 2, …, 20.
   • For each successive row, increase $e_i$ by 1 when $\delta_i = 0$; i = 2, …, 20.

5. Let T denote the total number of non-detect values in the pooled background and site datasets. For this dataset there are T = 6 non-detects. Compute the rank of the ith observation by

$$R_i = d_i + (T + e_i)/2 \quad \text{if } \delta_i = 1,$$
$$R_i = (T + 1 + d_i)/2 \quad \text{if } \delta_i = 0.$$

6. Compute the N = 20 scores, $a(R_1), a(R_2), \ldots, a(R_N)$, where $a(R_i) = 2R_i - N - 1$.

Using the columns of Table 1, the Gehan statistic is G = 1.77. Since G > 1.645 = $Z_{1-0.05}$, we reject the null hypothesis at the 0.05 level.

Table 1
Data   h_i   δ_i   d_i   e_i   R_i    a(R_i)
1      0     1     1     0     4      -13
2      1     1     2     0     5      -11
<4     0     0     2     1     4.5    -12
<4     1     0     2     2     4.5    -12
5      0     1     3     2     7      -7
7      0     1     4     2     8      -5
8      1     1     5     2     9      -3
<12    0     0     5     3     6      -9
15     0     1     6     3     10.5   0
17     1     1     7     3     11.5   2
18     0     1     8     3     12.5   4
20     1     1     9     3     13.5   6
<21    0     0     9     4     8      -5
<25    0     0     9     5     8      -5
25     1     1     10    5     15.5   10
27     0     1     11    5     16.5   12
34     1     1     12    5     17.5   14
<35    1     0     12    6     9.5    -2
40     1     1     13    6     19     17
43     1     1     14    6     20     19

Project Guidelines

1. The goal of the project is to develop a function(s) that will be useful to other statisticians who are interested in using the Gehan test for large and small samples. You will need to decide on the structure and organization of your function(s). For example, you could write one function that performs both the large-sample approximation and the permutation test. This function could include an argument that indicates which method to use, with a default approach based on the sample size. Or you could write two separate functions, one for each approach. You will also need to carefully consider the format and type of arguments that would be most appropriate for general use of the Gehan test. Do whatever you think is best, but keep in mind that you are writing a general program for others to use.
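As a check on the worked example above, the ranks, scores, and Gehan statistic of Table 1 can be reproduced with a few lines of R. This is only an illustrative sketch, not a template for the required function(s). It assumes that tied values are ordered with non-detects before detects (as Table 1 does for <25 and 25), and the variable names (value, delta, n_nd, and so on) are chosen here for illustration.

```r
# Example data from the text; detection limits stand in for non-detects.
background  <- c(1, 4, 5, 7, 12, 15, 18, 21, 25, 27)
bg_detect   <- c(1, 0, 1, 1, 0, 1, 1, 0, 0, 1)    # 0 marks a "<" value
site        <- c(2, 4, 8, 17, 20, 25, 34, 35, 40, 43)
site_detect <- c(1, 0, 1, 1, 1, 1, 1, 0, 1, 1)

value <- c(background, site)
delta <- c(bg_detect, site_detect)                 # detect indicator
h     <- rep(c(0, 1), c(length(background), length(site)))  # 1 = site

# Sort by value; assumption: at tied values, non-detects come first.
ord   <- order(value, delta)
value <- value[ord]
delta <- delta[ord]
h     <- h[ord]

N    <- length(value)
m    <- sum(h == 0)
n    <- sum(h == 1)
n_nd <- sum(delta == 0)      # T in the text; T is avoided since it aliases TRUE

d <- cumsum(delta == 1)      # running count of detects
e <- cumsum(delta == 0)      # running count of non-detects

# Ranks (step 5) and scores (step 6)
R <- ifelse(delta == 1, d + (n_nd + e) / 2, (n_nd + 1 + d) / 2)
a <- 2 * R - N - 1

# Gehan statistic
G <- sum(h * a) / sqrt(m * n * sum(a^2) / (N * (N - 1)))
round(G, 2)   # 1.77, the value reported in the text
```

The running counts d and e are obtained with cumsum(), which is equivalent to the row-by-row rules in step 4.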
Requirements
   • Write an R function(s) that performs the Gehan test using the large-sample approximation and the permutation test.
   • Include code that checks if the arguments are valid and returns an error message if there is an invalid argument.
   • Return a warning message if the large-sample test is used when m < 10 and n < 10.
   • Function(s) should return an object that contains the test statistic, the p-value, and the method.
   • The print method should output: the name of the test (Gehan), the method used, the test statistic, and the p-value.
   • Comment your code, including a description of the function(s), the type and format of the arguments, and a description of the values returned. Your comments should act like a manual for how to use the function. Also include comments in the body of the function that describe what is going on.

2. Apply your function(s) to the following two datasets. Use the large-sample approximation for the first dataset and the permutation test for the second dataset.

Dataset 1
Background: 4 <18 13 27 39 11 <23 <6 <3 9 29 <19 36
Site: 49 10 <17 <28 50 30 32 20 <26 34 37 48 45

Dataset 2
Background: 18 <10 27 22 <3
Site: 30 44 23 <16 13

A < denotes non-detect data, in which case the detection limit is given.

3. Create at least one graph for each dataset that will be useful for comparing site and background data. For non-detect data, use the detection limit.

Source: Dr. A. Wahed, University of Pittsburgh
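The datasets above mark non-detects with a leading <. One possible way to bring such entries into R is to store them as character strings and split each into a numeric value and a detect indicator. The sketch below is only one option among many. The helper name parse_censored and its return format are hypothetical, not part of the assignment.

```r
# Hypothetical helper: turn entries such as "18" or "<10" into a data frame
# holding the numeric value (the detection limit for non-detects) and a
# detect indicator (1 = detect, 0 = non-detect).
parse_censored <- function(x) {
  x <- trimws(as.character(x))
  nondetect <- startsWith(x, "<")
  # Strip the "<" prefix; non-numeric entries become NA (with a warning).
  value <- as.numeric(sub("^<\\s*", "", x))
  if (anyNA(value)) {
    stop("entries must be numbers, optionally prefixed by '<'")
  }
  data.frame(value = value, detect = as.integer(!nondetect))
}

# Background values of Dataset 2, typed as they appear in the text
bg2 <- parse_censored(c("18", "<10", "27", "22", "<3"))
bg2
```

Keeping the detection limit and the detect indicator in separate columns matches the Data and detect-indicator columns of Table 1, and the value column can be passed directly to plotting functions.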