 QUESTION

# Exercise 26: Determining the Normality of a Distribution

Most parametric statistics require that the variables being studied are normally distributed. The normal curve has a symmetrical or equal distribution of scores around the mean with a small number of outliers in the two tails. The first step to determining normality is to create a frequency distribution of the variable(s) being studied. A frequency distribution can be displayed in a table or figure. A line graph figure can be created whereby the x axis consists of the possible values of that variable, and the y axis is the tally of each value. The frequency distributions presented in this Exercise focus on values of continuous variables. With a continuous variable, higher numbers represent more of that variable and the lower numbers represent less of that variable, or vice versa. Common examples of continuous variables are age, income, blood pressure, weight, height, pain levels, and health status (see Exercise 1).

The frequency distribution of a variable can be presented in a frequency table, which is a way of organizing the data by listing every possible value in the first column of numbers, and the frequency (tally) of each value as the second column of numbers. For example, consider the following hypothetical age data for patients from a primary care clinic. The ages of 20 patients were: 45, 26, 59, 51, 42, 28, 26, 32, 31, 55, 43, 47, 67, 39, 52, 48, 36, 42, 61, and 57.

First, we must sort the patients' ages from lowest to highest values:

26, 26, 28, 31, 32, 36, 39, 42, 42, 43, 45, 47, 48, 51, 52, 55, 57, 59, 61, 67


Next, each age value is tallied to create the frequency. This is an example of an ungrouped frequency distribution. In an ungrouped frequency distribution, researchers list all categories of the variable on which they have data and tally each datum on the listing. In this example, all the different ages of the 20 patients are listed and then tallied for each age.

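| Age | Frequency |
|-----|-----------|
| 26  | 2 |
| 28  | 1 |
| 31  | 1 |
| 32  | 1 |
| 36  | 1 |
| 39  | 1 |
| 42  | 2 |
| 43  | 1 |
| 45  | 1 |
| 47  | 1 |
| 48  | 1 |
| 51  | 1 |
| 52  | 1 |
| 55  | 1 |
| 57  | 1 |
| 59  | 1 |
| 61  | 1 |
| 67  | 1 |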
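The tallying step can also be sketched in a few lines of Python. This is only an illustration of the ungrouped frequency distribution described above; the variable names `ages` and `frequency` are assumptions made for this example and do not come from the original exercise.

```python
from collections import Counter

# Ages of the 20 hypothetical clinic patients, as listed in the text.
ages = [45, 26, 59, 51, 42, 28, 26, 32, 31, 55,
        43, 47, 67, 39, 52, 48, 36, 42, 61, 57]

# Counter tallies how many times each distinct age occurs.
frequency = Counter(ages)

# Print the ungrouped frequency distribution, lowest age first.
print("Age  Frequency")
for age in sorted(frequency):
    print(f"{age:>3}  {frequency[age]}")
```

Only the ages 26 and 42 occur twice; every other age has a frequency of 1.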

Because most of the ages in this dataset have frequencies of "1," it is better to group the ages into ranges of values. These ranges must be mutually exclusive (i.e., a patient's age can only be classified into one of the ranges). In addition, the ranges must be exhaustive, meaning that each patient's age will fit into at least one of the categories. For example, we may choose to have ranges of 10, so that the age ranges are 20 to 29, 30 to 39, 40 to 49, 50 to 59, and 60 to 69. We may choose to have ranges of 5, so that the age ranges are 20 to 24, 25 to 29, 30 to 34, etc. The grouping should be devised to provide the greatest possible meaning to the purpose of the study. If the data are to be compared with data in other studies, groupings should be similar to those of other studies in this field of research. Classifying data into groups results in the development of a grouped frequency distribution. Table 26-1 presents a grouped frequency distribution of patient ages classified by ranges of 10 years. Note that the range starts at "20" because there are no patient ages lower than 20, nor are there ages higher than 69.

TABLE 26-1

GROUPED FREQUENCY DISTRIBUTION OF PATIENT AGES WITH PERCENTAGES

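| Adult Age Range | Frequency (f) | Percentage (%) | Cumulative Percentage |
|-----------------|---------------|----------------|-----------------------|
| 20-29 | 3  | 15%  | 15%  |
| 30-39 | 4  | 20%  | 35%  |
| 40-49 | 6  | 30%  | 65%  |
| 50-59 | 5  | 25%  | 90%  |
| 60-69 | 2  | 10%  | 100% |
| Total | 20 | 100% |      |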

Table 26-1 also includes percentages of patients with an age in each range; the percentages for the sample should add up to 100%, so the cumulative percentage reaches 100% in the last age range. This table provides an example of a percentage distribution that indicates the percentage of the sample with scores falling into a specific group. Percentage distributions are particularly useful in comparing this study's data with results from other studies.
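As a rough sketch, the grouped frequencies, percentages, and cumulative percentages in Table 26-1 can be reproduced as follows. The 10-year width and the starting value of 20 come from the text; the names `rows`, `width`, and `start` are assumptions made for this illustration.

```python
# Ages of the 20 hypothetical clinic patients, as listed in the text.
ages = [45, 26, 59, 51, 42, 28, 26, 32, 31, 55,
        43, 47, 67, 39, 52, 48, 36, 42, 61, 57]

width = 10   # each age range spans 10 years
start = 20   # no patient is younger than 20

rows = []          # (range label, frequency, percentage, cumulative percentage)
cumulative = 0.0
for lower in range(start, 70, width):
    upper = lower + width - 1
    # Ranges are mutually exclusive: each age falls in exactly one range.
    f = sum(1 for age in ages if lower <= age <= upper)
    pct = 100 * f / len(ages)
    cumulative += pct
    rows.append((f"{lower}-{upper}", f, pct, cumulative))

for label, f, pct, cum in rows:
    print(f"{label}  f={f}  {pct:.0f}%  cumulative={cum:.0f}%")
```

Changing `width` to 5 would produce the alternative grouping (20 to 24, 25 to 29, and so on) mentioned in the text.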


As discussed earlier, frequency distributions can be presented in figures. The common figures used to present frequencies include graphs, charts, histograms, and frequency polygons. Figure 26-1 is a line graph of the frequency distribution for age ranges, where the x axis represents the different age ranges and the y axis represents the frequencies (tallies) of patients with ages in each of the ranges.

FIGURE 26-1  Frequency distribution of patient age ranges.

The Normal Curve

The theoretical normal curve is an expression of statistical theory. It is a theoretical frequency distribution of all possible scores (see Figure 26-2). However, no real distribution exactly fits the normal curve. This theoretical normal curve is symmetrical, unimodal, and has continuous values. The mean, median, and mode are equal (see Figure 26-2). The distribution is completely defined by the mean and standard deviation, which are calculated and discussed in Exercises 8 and 27.
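The properties listed above can be checked numerically with Python's standard library. This is a minimal sketch, not part of the original exercise; the mean of 50 and standard deviation of 10 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical parameters: the curve is fully defined by these two numbers.
curve = NormalDist(mu=50, sigma=10)

# Symmetry: the curve has the same height at equal distances from the mean.
left = curve.pdf(50 - 15)
right = curve.pdf(50 + 15)

# Half of the area lies below the mean, so the mean is also the median.
area_below_mean = curve.cdf(50)

# Unimodal: the curve is highest at the mean, so the mean is also the mode.
peak_at_mean = curve.pdf(50) > max(curve.pdf(49.9), curve.pdf(50.1))

print(left, right, area_below_mean, peak_at_mean)
```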

FIGURE 26-2  The normal curve.

Skewness

Any frequency distribution that is not symmetrical is referred to as skewed or asymmetrical. Skewness may be exhibited in the curve in a variety of ways. A distribution may be positively skewed, which means that the largest portion of data is below the mean. For example, data on length of enrollment in hospice are positively skewed because most of the people die within the first 3 weeks of enrollment, whereas increasingly smaller numbers of people survive as time increases. A distribution can also be negatively skewed, which means that the largest portion of data is above the mean. For example, data on the occurrence of chronic illness in an older age group are negatively skewed, because more chronic illnesses occur in seniors. Figure 26-3 includes both a positively skewed distribution and a negatively skewed distribution.

FIGURE 26-3  Examples of positively and negatively skewed distributions.

In a skewed distribution, the mean, median, and mode are not equal. Skewness interferes with the validity of many statistical analyses; therefore, statistical procedures have been developed to measure the skewness of the distribution of the sample being studied. Few samples will be perfectly symmetrical; however, as the deviation from symmetry increases, the seriousness of the impact on statistical analysis increases. In a positively skewed distribution, the mean is greater than the median, which is greater than the mode. In a negatively skewed distribution, the mean is less than the median, which is less than the mode (see Figure 26-3). The effects of skewness on the types of statistical analyses conducted in a study are discussed later in this exercise.
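The ordering of the mean, median, and mode in a positively skewed distribution can be illustrated with a short Python sketch. The values are the "Number Times Fired" column transcribed from Table 26-2 in this exercise; the variable names are assumptions made for this example.

```python
from statistics import mean, median, mode

# "Number Times Fired" for the 20 veterans in Table 26-2 (IDs 1 through 20).
times_fired = [2, 3, 5, 6, 0, 1, 1, 0, 0, 3,
               0, 1, 3, 2, 0, 3, 0, 0, 0, 0]

mean_value = mean(times_fired)      # pulled upward by the few high values
median_value = median(times_fired)  # middle of the sorted values
mode_value = mode(times_fired)      # most frequent value

print(f"mean={mean_value}, median={median_value}, mode={mode_value}")
```

Here the mean is 1.5, the median is 1, and the mode is 0, which is the mean > median > mode pattern described for a positively skewed distribution.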

Kurtosis

Another term used to describe the shape of the distribution curve is kurtosis. Kurtosis explains the degree of peakedness of the frequency distribution, which is related to the spread or variance of scores. An extremely peaked distribution is referred to as leptokurtic, an intermediate degree of kurtosis as mesokurtic, and a relatively flat distribution as platykurtic (see Figure 26-4). Extreme kurtosis can affect the validity of statistical analysis because the scores have little variation. Many computer programs analyze kurtosis before conducting statistical analyses. A kurtosis of zero indicates that the curve is mesokurtic, kurtosis values above zero indicate that the curve is leptokurtic, and negative values (below zero) indicate a platykurtic curve (Grove, Burns, & Gray, 2013).
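One common way to write a sample kurtosis statistic is shown below. It is offered as a sketch rather than as the formula used in this workbook; the symbols m_k, g_2, and G_2 are notation assumed for this illustration, and the bias-adjusted form G_2 is, to the best of our knowledge, the version that statistical packages such as SPSS report. Subtracting 3 is what makes a normal (mesokurtic) curve score zero.

```latex
m_k = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{k}, \qquad
g_2 = \frac{m_4}{m_2^{2}} - 3, \qquad
G_2 = \frac{n-1}{(n-2)(n-3)}\left[(n+1)\,g_2 + 6\right]
```

Here n is the sample size, x̄ is the sample mean, and m_2 and m_4 are the second and fourth moments about the mean.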

FIGURE 26-4  Examples of kurtotic distributions.

Tests of Normality

Skewness and kurtosis should be assessed prior to statistical analysis, and the importance of such non-normality needs to be determined by both the researcher and the statistician. Skewness and kurtosis statistic values of ≥ +1 or ≤ −1 are fairly severe and could impact the outcomes from parametric analysis techniques. When the deviation from normality is severe enough to compromise the validity of the parametric tests, nonparametric analysis techniques should be computed instead. Nonparametric statistics do not assume that the scores are normally distributed (Daniel, 2000).

Some statistics provide an indication of both the skewness and kurtosis of a given frequency distribution. The Shapiro-Wilk W test is a formal test of normality that assesses whether a variable's distribution is skewed and/or kurtotic. This test is sensitive to both skewness and kurtosis because it compares the shape of the variable's frequency distribution to that of a perfect normal curve. For large samples (n > 2000), the Kolmogorov-Smirnov D test is an alternative test of normality.

SPSS Computation

A randomized experimental study examined the impact of a special type of vocational rehabilitation on employment among veterans with felony histories (LePage, Bradshaw, Cipher, & Hooshyar, 2014). Age at study enrollment, age at first arrest, years of education, and number of times fired were among the study variables examined. A simulated subset of the study data is presented in Table 26-2.

TABLE 26-2

AGE, EDUCATION, AND TERMINATION HISTORY AMONG VETERANS WITH FELONIES

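| ID | Age | Age at 1st Arrest | Education | Number Times Fired |
|----|-----|-------------------|-----------|--------------------|
| 1  | 46 | 19 | 13 | 2 |
| 2  | 56 | 23 | 13 | 3 |
| 3  | 48 | 24 | 12 | 5 |
| 4  | 50 | 19 | 12 | 6 |
| 5  | 58 | 21 | 11 | 0 |
| 6  | 41 | 20 | 12 | 1 |
| 7  | 56 | 14 | 12 | 1 |
| 8  | 56 | 12 | 14 | 0 |
| 9  | 47 | 23 | 12 | 0 |
| 10 | 52 | 38 | 12 | 3 |
| 11 | 63 | 16 | 14 | 0 |
| 12 | 60 | 59 | 12 | 1 |
| 13 | 62 | 17 | 12 | 3 |
| 14 | 49 | 19 | 11 | 2 |
| 15 | 60 | 31 | 13 | 0 |
| 16 | 56 | 28 | 12 | 3 |
| 17 | 52 | 43 | 12 | 0 |
| 18 | 58 | 27 | 14 | 0 |
| 19 | 43 | 29 | 12 | 0 |
| 20 | 63 | 42 | 14 | 0 |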


This is how our dataset looks in SPSS.

Step 1: From the "Analyze" menu, choose "Descriptive Statistics" and "Frequencies." Move the four study variables over to the right.


Step 2: Click "Statistics." Check "Skewness" and "Kurtosis." Click "Continue."

Step 3: Click "Charts." Check "Histograms." Click "Continue" and then "OK."


Interpretation of SPSS Output

The following tables are generated from SPSS. The first table contains the skewness and kurtosis statistics for the four variables.

Frequencies

The next four tables contain the frequencies, or tallies, of the variable values. The last four tables contain the frequency distributions of the four variables.

Frequency Table


Histogram


In terms of skewness, the frequency distribution for "Age at enrollment" appears to be negatively skewed, and the other three variables' frequency distributions appear to be positively skewed. The absolute values of the skewness statistics for "Age at first arrest" and "Number of times fired" are greater than 1.0. The kurtosis statistic for "Age at first arrest" is also greater than 1.0. No other skewness or kurtosis statistics were greater than 1.0.
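For readers without SPSS, the skewness and kurtosis statistics can be approximated with a short Python sketch. The formulas below are the bias-adjusted sample statistics, which we assume correspond to what SPSS reports; the function names and the `data` dictionary are hypothetical, and the values are transcribed from Table 26-2.

```python
from math import sqrt

def central_moment(values, k):
    """k-th moment about the mean, dividing by n."""
    m = sum(values) / len(values)
    return sum((x - m) ** k for x in values) / len(values)

def skewness(values):
    """Bias-adjusted sample skewness (assumed to match SPSS)."""
    n = len(values)
    g1 = central_moment(values, 3) / central_moment(values, 2) ** 1.5
    return g1 * sqrt(n * (n - 1)) / (n - 2)

def kurtosis(values):
    """Bias-adjusted excess kurtosis (0 for a normal curve)."""
    n = len(values)
    g2 = central_moment(values, 4) / central_moment(values, 2) ** 2 - 3
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

# Columns of Table 26-2, in ID order 1 through 20.
data = {
    "Age at enrollment": [46, 56, 48, 50, 58, 41, 56, 56, 47, 52,
                          63, 60, 62, 49, 60, 56, 52, 58, 43, 63],
    "Age at first arrest": [19, 23, 24, 19, 21, 20, 14, 12, 23, 38,
                            16, 59, 17, 19, 31, 28, 43, 27, 29, 42],
    "Years of education": [13, 13, 12, 12, 11, 12, 12, 14, 12, 12,
                           14, 12, 12, 11, 13, 12, 12, 14, 12, 14],
    "Number of times fired": [2, 3, 5, 6, 0, 1, 1, 0, 0, 3,
                              0, 1, 3, 2, 0, 3, 0, 0, 0, 0],
}

results = {name: (skewness(v), kurtosis(v)) for name, v in data.items()}
for name, (skew, kurt) in results.items():
    # Flag statistics at or beyond the +1 / -1 guideline from the text.
    flag = "  <- |value| >= 1" if abs(skew) >= 1 or abs(kurt) >= 1 else ""
    print(f"{name:<22} skewness={skew:6.2f}  kurtosis={kurt:6.2f}{flag}")
```

A negative skewness value indicates a negatively skewed distribution, and a positive value indicates a positively skewed distribution.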

To assess the study variables' overall deviation from normality (and thereby assess skewness and kurtosis simultaneously), we must compute a Shapiro-Wilk test of normality.

Step 1: From the "Analyze" menu, choose "Descriptive Statistics" and "Explore." Move the four study variables over to the box labeled "Dependent List."


Step 2: Click "Plots." Check "Normality plots with tests." Click "Continue" and "OK."

SPSS yields many tables and figures—for this example, SPSS produces over 20 tables and figures. In the interest of saving space, we will focus on the table of interest, titled "Tests of Normality." This table contains the Shapiro-Wilk tests of normality for the four study variables. The last column contains the p values of the Shapiro-Wilk statistics. Of the four p values, three are significant at p < 0.05. "Age at enrollment" is the only variable that did not significantly deviate from normality (p = 0.373).
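A comparable Shapiro-Wilk test can be sketched outside SPSS. The example below assumes the third-party SciPy library is installed and uses its `scipy.stats.shapiro` function; the `data` dictionary and other names are hypothetical, the values are transcribed from Table 26-2, and small numerical differences from the SPSS output are possible.

```python
from scipy.stats import shapiro

# Columns of Table 26-2, in ID order 1 through 20.
data = {
    "Age at enrollment": [46, 56, 48, 50, 58, 41, 56, 56, 47, 52,
                          63, 60, 62, 49, 60, 56, 52, 58, 43, 63],
    "Age at first arrest": [19, 23, 24, 19, 21, 20, 14, 12, 23, 38,
                            16, 59, 17, 19, 31, 28, 43, 27, 29, 42],
    "Years of education": [13, 13, 12, 12, 11, 12, 12, 14, 12, 12,
                           14, 12, 12, 11, 13, 12, 12, 14, 12, 14],
    "Number of times fired": [2, 3, 5, 6, 0, 1, 1, 0, 0, 3,
                              0, 1, 3, 2, 0, 3, 0, 0, 0, 0],
}

normality = {}
for name, values in data.items():
    # shapiro returns the W statistic and its p value.
    w, p = shapiro(values)
    normality[name] = (w, p)
    verdict = "deviates from normality" if p < 0.05 else "no significant deviation"
    print(f"{name:<22} W={w:.3f}  p={p:.3f}  {verdict}")
```

A p value below 0.05 is read the same way as in the SPSS "Tests of Normality" table: the distribution deviates significantly from a normal curve.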

Explore

In summary, the skewness statistics as well as the Shapiro-Wilk values for "Age at first arrest" and "Number of times fired from a job" indicated significant deviations from normality. "Age at enrollment," while appearing to be slightly negatively skewed, did not yield skewness, kurtosis, or Shapiro-Wilk values that indicated deviations from normality. Years of education appeared to be positively skewed but did not have an extreme skewness or kurtosis value. However, the Shapiro-Wilk p value was significant at p = 0.001. It is common for Shapiro-Wilk values to conflict with skewness and kurtosis statistics, because the Shapiro-Wilk test examines the entire shape of the distribution while skewness and kurtosis statistics examine only skewness and kurtosis, respectively. When a Shapiro-Wilk value is significant and visual inspection of the frequency distribution indicates non-normality, the researcher must consider a nonparametric statistical alternative. See Exercise 23 for a review of nonparametric statistics that would be appropriate when the normality assumption for a parametric statistic is not met.


Study Questions

1. Define skewness.

2. Define kurtosis.

3. Given this set of numbers, plot the frequency distribution:

1, 2, 9, 9, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 14, 14.

4. How would you characterize the skewness of the distribution in Question 3: positively skewed, negatively skewed, or approximately normal? Provide a rationale for your answer.

5. Given this set of numbers, plot the frequency distribution:

1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 10, 11.

6. How would you characterize the skewness of the distribution in Question 5: positively skewed, negatively skewed, or approximately normal? Provide a rationale for your answer.


7. Given this set of numbers, plot the frequency distribution:

4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6.

8. How would you characterize the kurtosis of the distribution in Question 7: leptokurtic, mesokurtic, or platykurtic? Provide a rationale for your answer.

9. When looking at the frequency distribution for "Age at first arrest" in the example data, where is the mean in relation to the median?

10. What is the mode for "Years of Education"?


Answers to Study Questions

1. Skewness is the degree to which a frequency distribution is asymmetrical; a skewed distribution is one that is not symmetrical.

2. Kurtosis is defined as the degree of peakedness of the frequency distribution.

3. The frequency distribution approximates the following plot:

4. The skewness of the distribution in Question 3 is negatively skewed, as evidenced by the "tail" of the distribution appearing below the mean.

5. The frequency distribution approximates the following plot:

6. The skewness of the distribution in Question 5 is positively skewed, as evidenced by the "tail" of the distribution appearing above the mean.


7. The frequency distribution approximates the following plot:

8. The kurtosis of the distribution in Question 7 is leptokurtic, as evidenced by the peakedness of the distribution and the limited variance of the values.

9. The mean for "Age at First Arrest" is above, or higher than, the median because the distribution is positively skewed.

10. The mode for "Years of Education" is 12.


Using the same example from LePage and colleagues (2014), the following data include only the last 15 observations (the first 5 were deleted). The data are presented in Table 26-3.

TABLE 26-3

AGE, EDUCATION, AND TERMINATION HISTORY AMONG VETERANS WITH FELONIES

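| ID | Age at Enrollment | Age at 1st Arrest | Education | Number Times Fired |
|----|-------------------|-------------------|-----------|--------------------|
| 6  | 41 | 20 | 12 | 1 |
| 7  | 56 | 14 | 12 | 1 |
| 8  | 56 | 12 | 14 | 0 |
| 9  | 47 | 23 | 12 | 0 |
| 10 | 52 | 38 | 12 | 3 |
| 11 | 63 | 16 | 14 | 0 |
| 12 | 60 | 59 | 12 | 1 |
| 13 | 62 | 17 | 12 | 3 |
| 14 | 49 | 19 | 11 | 2 |
| 15 | 60 | 31 | 13 | 0 |
| 16 | 56 | 28 | 12 | 3 |
| 17 | 52 | 43 | 12 | 0 |
| 18 | 58 | 27 | 14 | 0 |
| 19 | 43 | 29 | 12 | 0 |
| 20 | 63 | 42 | 14 | 0 |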


EXERCISE 26 Questions to Be Graded

Name: _______________________________________________________ Class: _____________________

Date: ___________________________________________________________________________________

1. Plot the frequency distribution for "Age at Enrollment" by hand or by using SPSS.

2. How would you characterize the skewness of the distribution in Question 1—positively skewed, negatively skewed, or approximately normal? Provide a rationale for your answer.

3. Compare the original skewness statistic and Shapiro-Wilk statistic with those of the smaller dataset (n = 15) for the variable "Age at First Arrest." How did the statistics change, and how would you explain these differences?

4. Plot the frequency distribution for "Years of Education" by hand or by using SPSS.


5. How would you characterize the kurtosis of the distribution in Question 4—leptokurtic, mesokurtic, or platykurtic? Provide a rationale for your answer.

6. What is the skewness statistic for "Age at Enrollment"? How would you characterize the magnitude of the skewness statistic for "Age at Enrollment"?

7. What is the kurtosis statistic for "Years of Education"? How would you characterize the magnitude of the kurtosis statistic for "Years of Education"?

8. Using SPSS, compute the Shapiro-Wilk statistic for "Number of Times Fired from Job." What would you conclude from the results?

9. In the SPSS output table titled "Tests of Normality," the Shapiro-Wilk statistic is reported along with the Kolmogorov-Smirnov statistic. Why is the Kolmogorov-Smirnov statistic inappropriate to report for these example data?

10. How would you explain the skewness statistic for a particular frequency distribution being low and the Shapiro-Wilk statistic still being significant at p < 0.05?

(Grove 271-290)

Grove, Susan K., and Daisha Cipher. Statistics for Nursing Research: A Workbook for Evidence-Based Practice, 2nd Edition. Saunders, 2016. VitalBook file.
