Suppose you are a manager and have to choose a set of evaluation instruments for assessing the work performance of your employees. Pick any job you wish. Pages 145-154 of the textbook describe three g

EVALUATION INSTRUMENTS

So far, the topic of performance evaluation has been at a very general level. Let’s get down in the weeds and examine some very specific approaches to evaluation. There is a host of instruments that have been proposed and used in an attempt to measure performance in a reliable and valid way with high levels of accuracy, precision, and fairness. Mostly, the attempts are efforts to produce credible subjective measures.

Checklists

One approach is to have the appraisers go through a list of job-relevant behaviors and score the worker in terms of the number of behaviors observed during the performance period. Ideally, the list of behaviors would be generated from a job analysis. The simplest form of the checklist is to just count the number of behaviors checked by the appraiser. This is a “down-and-dirty” approach that yields very little valuable information. Two variations on the checklist instrument are the weighted checklist and the forced-choice checklist.

WEIGHTED CHECKLIST

The problem with the standard checklist is that not all of the behaviors are of equal value. The job of a bank teller is not only to handle money, but also includes greeting the customers, answering questions, and keeping the workspace organized. Of these four behaviors, handling the money is clearly the most important, but the standard checklist would give it the same value as the others. The weighted checklist, however, assigns a value (weight) to each behavior so that the final tally reflects the differences in importance of the behaviors. Here is what a weighted checklist for team behaviors might look like:

Check all that apply:
1. _____ Cooperates fully on team tasks (3)
2. _____ Attends meetings on time (1)
3. _____ Assumes leadership role when working in teams (5)
4. _____ Gives information during group discussions (2)

Now, when one or more of these behaviors is checked off and counted, the tally reflects the weight assigned (shown in parentheses on the right side). If Paul is evaluating Mary and checks behaviors 1, 3, and 4, she will get a total score of 10 (3 + 5 + 2 = 10). If Paul is evaluating Sue and checks behaviors 1, 2, and 4, she will get a total score of 6 (3 + 1 + 2 = 6). In this case, Mary is a better team member than Sue, mainly because she is willing to take on a leadership role, which is a highly valued behavior for this organization.

How are the weights established? This is frequently done by SMEs (often groups of supervisors). The weights themselves are typically not shown to either the evaluator or the worker, and this practice serves two purposes. First, it frees the evaluators from the leniency bias: if they do not know the value of the behaviors, selecting one is not seen as more positive than selecting another (although there is a chance that the evaluator may be able to guess the values). Second, keeping the weights disguised tends to neutralize the punitive nature of the evaluation for the workers for the same reason (all the items appear equally positive, unless workers are able to guess their value).
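To make the tally concrete, here is a minimal sketch of how a weighted-checklist score could be computed. The behaviors and weights mirror the team example above; the code itself is purely illustrative and not part of any particular evaluation system.

```python
# Minimal sketch of a weighted-checklist tally (illustrative only).
# Weights mirror the team-behavior example above; in practice they would
# come from SMEs and be hidden from both the evaluator and the worker.

BEHAVIOR_WEIGHTS = {
    "cooperates on team tasks": 3,
    "attends meetings on time": 1,
    "assumes leadership role": 5,
    "gives information in discussions": 2,
}

def weighted_checklist_score(checked_behaviors):
    """Sum the hidden weights of the behaviors the evaluator checked."""
    return sum(BEHAVIOR_WEIGHTS[b] for b in checked_behaviors)

# Paul's ratings of Mary and Sue from the text:
mary = weighted_checklist_score([
    "cooperates on team tasks",
    "assumes leadership role",
    "gives information in discussions",
])
sue = weighted_checklist_score([
    "cooperates on team tasks",
    "attends meetings on time",
    "gives information in discussions",
])
print(mary, sue)  # 10 6
```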

FORCED-CHOICE CHECKLIST

The idea behind the forced-choice checklist is to create a set of behavioral items that appear to be of equal value and then have the evaluator pick which item is most characteristic of the employee. In reality, one of the items in the set has a high value and the other (or others, if there are more than two) has a lower value. If the evaluator consistently picks the high-value items, then the employee scores high; if low-value items are consistently selected, the employee scores low. There are many variations of the forced-choice procedure, but a simple version is shown below.

Check the one item in the pair that most applies to the employee:

Pair #1
1. _____ Takes initiative in starting new projects
2. _____ Provides helpful information for planning new projects

Pair #2
1. _____ Is dependable and gets to meetings on time
2. _____ Works cooperatively with others on the team

For the first pair, although both behaviors are positive, item 1 is the preferred behavior. In the second pair, item 2 is the preferred behavior. The forced-choice checklist is a bit more difficult to construct than the weighted checklist. Although the item pairs could be determined by SMEs, a far better method is to generate a pool of behavioral statements and then correlate them with other performance measures (e.g., sales, work quality, customer comments). Statements that correlate highly with these other measures can then be paired with statements that correlate weakly with them. For example, for a technology firm, item 1 in the first pair above might correlate highly with measures such as the number of patents filed and revenue generated for the firm, whereas item 2 does not. As with the weighted checklist, when all items are positive and the difference in value is disguised, the procedure limits the leniency bias and removes much of the negativity.
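The sketch below illustrates how pairs might be assembled from item-criterion correlations. The statements are taken from the examples above, but the correlation values and the 0.30 cutoff are hypothetical; a real application would compute the correlations from actual rating and criterion data.

```python
# Illustrative sketch of building forced-choice pairs from item-criterion
# correlations. The correlation values and the 0.30 cutoff are hypothetical.

# Each behavioral statement with its (assumed) correlation against external
# performance measures such as sales or work quality.
item_correlations = {
    "Takes initiative in starting new projects": 0.62,
    "Provides helpful information for planning new projects": 0.12,
    "Is dependable and gets to meetings on time": 0.08,
    "Works cooperatively with others on the team": 0.55,
}

CUTOFF = 0.30
high = [s for s, r in item_correlations.items() if r >= CUTOFF]
low = [s for s, r in item_correlations.items() if r < CUTOFF]

# Pair each high-correlation (scored) item with a low-correlation (unscored)
# item so the two look equally positive to the evaluator.
for scored, filler in zip(high, low):
    print(f"Pair: [{scored}]  vs  [{filler}]")
```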

COMPARISON RANKINGS

Ranking employees in terms of their level of performance is an age-old method of evaluation. In its simplest form, the supervisor or manager creates an ordered listing of the employees, placing the highest performers at the top and the lowest performers at the bottom. This is a crude method, but it does have the advantage of eliminating the leniency bias (not all employees can be regarded as high performers). Again, there are several variations on this process, beginning with the simple rank order.

RANK ORDER

The easiest way to do a rank-order list is to consider the overall performance of each employee and then order them from highest to lowest. This has the advantage of simplicity, but overlooks a number of important aspects of evaluation. First, it only considers one dimension of performance. A better approach is to divide performance into several dimensions (at a minimum, divide performance into quantity, quality, and timeliness) and rank order employees on each separate dimension. A second drawback is that using only overall performance gives employees very little information that they can use to improve future performance (just knowing that you are at the bottom of the stack is not very helpful). A third drawback is that the rank ordering does not indicate the distance between employees. The distance between the top-performing employee and the next best may not be the same as between number 2 and number 3. This distance information can be useful when making decisions about merit pay, promotions, or salary increases. For example, it would be difficult to justify a promotion for one person if there is only a hairsbreadth difference between that person and the next best person. Also, salary levels and amount of merit pay sometimes are based on the degree of performance differences between people, but the rank-order technique does not provide information on such degrees.
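A minimal sketch of ranking employees separately on several dimensions, rather than on a single overall judgment, is shown below. The names, dimensions, and scores are hypothetical, and the output is only an ordering; as noted above, it carries no information about the distances between people.

```python
# Sketch of rank-ordering employees on each dimension separately.
# All names and scores are hypothetical.

scores = {
    "quantity":   {"Avery": 8, "Blake": 6, "Casey": 9},
    "quality":    {"Avery": 7, "Blake": 9, "Casey": 5},
    "timeliness": {"Avery": 6, "Blake": 8, "Casey": 7},
}

for dimension, by_person in scores.items():
    ranked = sorted(by_person, key=by_person.get, reverse=True)
    print(dimension, ranked)  # highest performer listed first
```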

PAIRED COMPARISON

A paired-comparison technique is a ranking process that allows the evaluator to indicate degrees of distance between employees. The method involves obtaining a list of employees and then comparing each person with every other person, proceeding two at a time. Once all of the two-by-two comparisons are made, the evaluator then gives a score to each person based on the number of times the person is chosen as the better performer in each pair. A ranking of workers then shows not only the order of best to worst performer, but also the distance between people (scaled in terms of the number of times the person came out on top in the paired comparisons). The distance scaling is clearly an advantage, but the downside to this approach is that it can be very labor intensive. With 10 employees, the number of comparisons is 45 [10 x (10 - 1)/2 = 45]. With 30 employees, the number jumps to 435.
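Here is a minimal sketch of the paired-comparison arithmetic and scoring. The employee names and the evaluator's pairwise judgments are hypothetical; only the n(n - 1)/2 comparison counts come from the text.

```python
from itertools import combinations

# Sketch of the paired-comparison procedure (illustrative data only).

employees = ["Ann", "Ben", "Cal", "Dee"]

# The number of two-by-two comparisons grows as n(n - 1)/2.
n = len(employees)
print(n * (n - 1) // 2)    # 6 comparisons for 4 people
print(10 * (10 - 1) // 2)  # 45, as in the text
print(30 * (30 - 1) // 2)  # 435

# Hypothetical judgments: for each pair, the evaluator picks the better performer.
judged_better = {
    ("Ann", "Ben"): "Ann", ("Ann", "Cal"): "Ann", ("Ann", "Dee"): "Dee",
    ("Ben", "Cal"): "Cal", ("Ben", "Dee"): "Dee", ("Cal", "Dee"): "Dee",
}

wins = {e: 0 for e in employees}
for pair in combinations(employees, 2):
    wins[judged_better[pair]] += 1

# Score = number of times chosen; the gaps between scores show the distances.
for name in sorted(wins, key=wins.get, reverse=True):
    print(name, wins[name])  # Dee 3, Ann 2, Cal 1, Ben 0
```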

POINT ALLOCATION

The point-allocation comparison is pretty simple and also allows for a distance scaling. The evaluator is given a fixed number of points (e.g., 100) that she or he can distribute to the workers being evaluated. For any given performance dimension (e.g., quality, timeliness, quantity), the evaluator assigns points to each of the workers, depending on their perceived proficiency, with the stipulation that only the fixed number of points can be handed out. For example, if a supervisor has 100 points to work with and has 10 workers, she may assign 25 points to her most proficient employee. That then leaves only 75 points to distribute among the other 9 people. If she then assigns 20 points to the next most proficient worker, she has only 55 points left to assign to the remaining 8. The rest of the point allocation might then look something like this:

Person 8 = 15
Person 7 = 12
Person 6 = 10
Person 5 = 7
Person 4 = 5
Person 3 = 3
Person 2 = 2
Person 1 = 1

As you can see, this system rank orders the workers and also assigns points that show the distances between people. This system can be used for a single performance dimension (e.g., overall proficiency) or repeated for several dimensions. If points are allocated for several dimensions, then a total score can be calculated from the separate scores. The easiest calculation is simply to sum the separate points for each person; the people with the highest totals are the superior workers. As with the checklist, however, this simple calculation assumes that each dimension is of equal value. If this is not the case, then the separate dimensions need to be assigned weights, and a weighted sum should be calculated.
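The sketch below shows the point-allocation tally, first for the single-dimension 100-point example above and then for an unweighted versus weighted sum across dimensions. The multi-dimension names, points, and weights are hypothetical, not the values in Figure 5.3.

```python
# Sketch of a point-allocation tally (illustrative only).

# One dimension, ten workers, 100 points in total (most to least proficient).
allocation = [25, 20, 15, 12, 10, 7, 5, 3, 2, 1]
assert sum(allocation) == 100  # the evaluator may only hand out the fixed budget

# Several dimensions: unweighted vs. weighted totals (hypothetical numbers).
points = {
    "Pat":   {"quantity": 30, "quality": 20, "timeliness": 10},
    "Robin": {"quantity": 10, "quality": 20, "timeliness": 30},
}
weights = {"quantity": 5, "quality": 3, "timeliness": 1}  # e.g., SME importance ratings

for person, dims in points.items():
    unweighted = sum(dims.values())
    weighted = sum(dims[d] * weights[d] for d in dims)
    print(person, unweighted, weighted)
    # Pat   60 220
    # Robin 60 140  -> tied unweighted, separated once weights are applied
```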

(I will send Figure 5.3 separately.)

Figure 5.3 provides a hypothetical example of a point-allocation system with three performance dimensions (quantity, quality, and timeliness) along with the unweighted sum (total) and weighted sum (total) of points. As you may have guessed, using a weighted set of points (columns 5–8) is more involved than simply adding the points (columns 1–4). First, you must establish the weights. In this case, SMEs rated each dimension on a 5-point scale, with one (1) indicating that the dimension was minimally important, three (3) indicating moderate importance, and five (5) indicating that the dimension was very important. For the purpose of illustration, the weights in Figure 5.3 have been simplified to whole numbers, but in a real-life situation, several SMEs would rate the dimensions and their average ratings, expressed with fractional points, would become the weights. After each dimension receives a weight, the points for each person (columns 1–3) are multiplied by the weight (columns 5–7) to get a weighted score for each dimension. Finally, in column 8, all of the weighted scores are summed to get a total score for each employee.

This whole weighting scheme may seem like a lot of effort, but it does have a couple of advantages. First, the system allows the users to make finer distinctions between people. Take the first two workers, Leslie and Lynn, for example. If we just total the assigned points without any weighting, their scores are tied at 60 (column 4). If we weight the points, then Leslie scores higher than Lynn (200 versus 160 in column 8). This is because Leslie earned high points in an area that was most important (quantity, with a rating of 5), whereas Lynn got her highest points in a less important area (timeliness, with a rating of 1). Without the ratings (weights), the evaluators would not be able to distinguish between these two workers. A second advantage of the weighted point-allocation system is that one can drill down and examine the relative strengths and weaknesses of each employee. For example, in Figure 5.3, we can see that Leslie and Sandy are very good at completing lots of work, but the quality of Sandy’s work is lacking. Lynn scores higher than anyone in the timeliness area, but because delivery time is not very important, Lynn’s high score does not contribute much to her weighted total (about 15%) and does not mean as much as the quantity and quality scores.

FORCED DISTRIBUTION

The last form of ranking (forced distribution) will be recognized by most students as the “grading on a curve” approach. If we think about performance evaluation in the same way as grading students on a normal distribution, then we are forced to place the high performers in the areas above average and the low performers in the areas below average. Figure 5.4 is a normal distribution divided into six segments (inadequate performance, marginal performance, below expectation, meets expectation, exceeds expectation, and outstanding performance). The person doing the performance appraisal must place workers into each segment, with the stipulation that only a fixed percentage can go in each spot. The distribution in Figure 5.4 is based on standard deviations, and so the percentages shown apply to all normal distributions.
That is, for all normal distributions, only 5% of the cases fall beyond two standard deviations from the mean (2.5% in the lower tail, 2.5% in the upper tail); 27% fall in the segments between one and two standard deviations (13.5% in the lower part, 13.5% in the upper part); and 68% are in the segments between the mean and one standard deviation (34% in the lower part and 34% in the upper part). If a personnel department wishes to use the normal distribution, then the instructions to the evaluators would state that no more than 2.5% of the workers should be classified as inadequate and only another 2.5% should be considered outstanding. Similarly, only 13.5% should be marginal and only 13.5% should exceed expectations. The rest (68%) should either fall below expectations or meet expectations. Imagine that a supervisor has 40 people working for her. If she were to use the standard deviation cutoffs, then only one person would be outstanding and another would be inadequate (2.5% of 40 = 1); about 11 people would be split between marginal and exceeds expectations (27% of 40 = 10.8); and for the rest she would have to decide which half fell below expectations and which half met expectations.
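The quota arithmetic for this 40-person example can be sketched as follows. The percentages come from the normal-curve segments described above; how the fractional quotas get rounded is a policy choice and is simply assumed here.

```python
# Sketch of forced-distribution quotas for a supervisor with 40 employees,
# using the normal-curve percentages described above. Simple rounding is assumed.

segments = {
    "inadequate": 2.5, "marginal": 13.5, "below expectation": 34.0,
    "meets expectation": 34.0, "exceeds expectation": 13.5, "outstanding": 2.5,
}
n_employees = 40

for label, pct in segments.items():
    quota = n_employees * pct / 100
    print(f"{label}: {pct}% -> {quota:.1f} people (~{round(quota)})")
# inadequate and outstanding come out to 1 person each, marginal and
# exceeds expectation to about 5 each, and the remaining 27 or so are
# split between below expectation and meets expectation.
```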

These are pretty strict cutoff points. The forced-distribution approach does not have to use standard deviations as the basis for the segments, however. The evaluation can be based on any cut points that seem reasonable for the organization. Moreover, the number of segments and the labels for each segment can vary (e.g., far below average, below average, average, above average, far above average), depending on the needs of the organization. For example, with five segments, the very bottom and very top (e.g., “far below” and “far above average”) could be 10% each, the next segments (“below” and “above average”) could be 20% each, and the middle segment (“average”) could be 40%. With this system, our supervisor with 40 employees would have 4 at the very bottom and 4 at the very top; 8 below average and 8 above average; and 16 in the middle (average). Regardless of how the segments are defined, the important point is that only a fixed percentage of people can be placed in each segment, thus imposing a rank order on the evaluation process.

RATING SCALES

Possibly the most common form of evaluation tool is the rating scale. People sometimes use the terms ranking and rating interchangeably, but they are actually two very different approaches. Whereas rankings force the user to compare and order the employees, ratings require no such direct comparisons. Rating scales come in many forms (see Figure 5.5), but they all have several things in common. First, they all have a continuum that ranges from low to high performance (this is called the response scale). The response scale may start with low levels and progress to higher levels, but you will also see scales that start high and end low. Also, the scales are mostly arranged horizontally, but there is nothing that says they cannot be shown vertically. The second common feature of rating scales is that points along the scale are labeled with numbers, verbal descriptors, or both. These labels may appear just at the beginning and end (Figure 5.5 [a]) or at several points along the scale (Figure 5.5 [b]). The number of points on the scale can vary widely, but usually there are no fewer than three and no more than nine.

Figure 5.5. Examples of different rating scale formats.

Some of the points on the scale may serve as “anchors” that more clearly define these spots and are placed at the beginning, end, and middle of the scale (Figure 5.5 [c]). The third common feature of all rating scales is that the rater is required to assess the level of performance by marking a spot on the scale. If the scale is a straight line (Figure 5.5 [a], [b], and [c]), the marks may be placed anywhere to reflect judgments that fall between discrete points. If the evaluators prefer judgments in whole units, boxes can replace the line (Figure 5.5 [d] and [e]). Each person is usually rated independently based on the rater’s best judgment of the level of performance, but in many cases, the ratings of one person may be influenced by the ratings of others. Raters can be asked to make just an overall judgment of performance, but more typically the rater is asked to judge several relevant dimensions of performance (established by a job analysis), and these separate ratings are combined into an overall (composite) score.

The design of rating scales is more than just the cosmetic look of the scale. The scales need to provide sufficient information to allow the rater to make a reasoned and accurate judgment. The trick to creating meaningful scales is to define and label the anchors carefully. This is often a labor-intensive process that involves a careful analysis of the critical job behaviors. Three behaviorally oriented rating scales are presented in Figure 5.6 and discussed below.

BEHAVIORALLY ANCHORED RATING SCALE (BARS)

The Behaviorally Anchored Rating Scale (BARS) was originated by Smith and Kendall (1963) to base evaluations on observable job behaviors rather than on personal traits such as dependability or positive attitude. Traits are very subjective qualities, and poor ratings on traits are difficult to defend either to the person being rated or in a court of law. In addition, trait ratings do not give the employee much valuable feedback to use for future improvement (what concrete actions can you take after learning that your attitude is poor?). Behaviors, on the other hand, are much more observable and objective, provided they are relevant to the job. A BARS is developed in several stages using SMEs, and if done precisely as outlined by Smith and Kendall, it can be a laborious process. In essence, the first step has SMEs identify the job dimensions that must be evaluated. Several means may be used to accomplish this, but most are based on some form of job analysis (e.g., critical incidents analysis). Next, a separate group of SMEs write behavioral statements that describe behaviors characteristic of high, average, and low levels of performance for each job dimension. Once the statements are generated, numerical scale values are assigned to each so that statements reflecting low performance levels get low numbers and statements reflecting higher performance levels get higher numbers (see the example in Figure 5.6 [a] of a BARS for the dimension of team interaction). Ideally, the BARS process should occur several times during a performance period, and these accumulated ratings should then be used to generate a final rating. In practice, however, the scale is often used only once at the end of the performance period, and that single use becomes the final rating.
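A minimal sketch of scoring a single BARS dimension is shown below. The dimension name echoes the team-interaction example above, but the anchor statements and their scale values are hypothetical; real anchors would be written and scaled by SMEs.

```python
# Illustrative sketch of scoring one BARS dimension ("team interaction").
# The anchor statements and scale values below are hypothetical.

team_interaction_anchors = {
    7: "Volunteers to coordinate the team and resolves conflicts constructively",
    5: "Shares information willingly and completes assigned team tasks",
    3: "Participates when asked but rarely contributes beyond assigned tasks",
    1: "Withholds information and misses team commitments",
}

def bars_score(chosen_anchor):
    """Return the scale value of the anchor the rater judged most characteristic."""
    for value, statement in team_interaction_anchors.items():
        if statement == chosen_anchor:
            return value
    raise ValueError("anchor not on this scale")

print(bars_score("Shares information willingly and completes assigned team tasks"))  # 5
```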

BEHAVIORAL SUMMARY SCALE (BSS)

Raters sometimes have difficulties using BARS, and so other scales have been developed as replacements. Borman (1979) developed the Behavioral Summary Scale (BSS), which still requires identifying job dimensions and characteristic behaviors, but replaces the more specific behavioral anchors of the BARS with more general descriptors (see Figure 5.6 [b]). The research tends to indicate that the BSS and other scale formats are just as good as the BARS, are easier to use, and are preferred by raters (Borman, 1979; Landy & Farr, 1980; Pulakos, 1997).

BEHAVIORAL OBSERVATION SCALE (BOS)

The Behavioral Observation Scale (BOS) takes a slightly different approach to behavioral ratings (Latham & Wexley, 1981). As with the BARS and BSS, the job is broken down into dimensions and relevant job behaviors are identified, but instead of using actual behavioral anchors like the BARS, the BOS requires the rater to report the frequency with which certain behaviors occur (see Figure 5.6 [c]). These frequency judgments can be further clarified and made more specific by using a time index (e.g., 0–30%, 31–50%, 51–70%, 71–100% of the time). The original intent of the scale was to remove some of the more evaluative anchors from the evaluation process (“poor” and “excellent” in the BSS) and encourage the rater to report on observations of behavior and not whether the behavior was good or bad. Research by Murphy, Martin, and Garcia (1982) suggests that this is not what usually happens. Raters are predisposed to interpret frequency ratings as mirroring favorable and unfavorable appraisals, so the purpose of the frequency ratings is somewhat diminished. In general, it would appear that the choice of the BARS, BSS, or BOS is fairly arbitrary and mostly based on personal preference and organizational needs.
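To illustrate the frequency-based judgments that the BOS relies on, here is a minimal sketch that converts an observed percentage of time into a scale point using the time-index bands mentioned above. Mapping each band to a 1–4 rating point, and the example behaviors, are assumptions for illustration only.

```python
# Sketch of converting observed behavior frequencies into BOS-style ratings
# using the time-index bands from the text (0-30%, 31-50%, 51-70%, 71-100%).
# The band-to-point mapping and the example behaviors are assumptions.

def bos_rating(pct_of_time):
    """Return a rating point for the percentage of time a behavior was observed."""
    if pct_of_time <= 30:
        return 1
    if pct_of_time <= 50:
        return 2
    if pct_of_time <= 70:
        return 3
    return 4

observations = {"greets customers promptly": 85, "updates the work log": 40}
for behavior, pct in observations.items():
    print(behavior, bos_rating(pct))  # 4 and 2, respectively
```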

COMBINING INSTRUMENTS

Up to this point, I have described the various evaluation instruments as if they were isolated measures. In reality, job performance is evaluated by a collection of different measures that are combined into a collective or composite score. There is no single preferred way to create these combined scores, but two methods that are recommended are standardized scores and the objectives matrix (performance indexing).
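As a brief illustration of the first of these two methods, the sketch below standardizes each measure (as z-scores) before summing, so that instruments on different scales can be combined. The measure names and raw scores are hypothetical, and equal weighting of the standardized measures is an assumption.

```python
from statistics import mean, stdev

# Sketch of combining different evaluation instruments by standardizing each
# measure (z-scores) and then summing. All names and values are hypothetical.

raw_scores = {
    "checklist":         {"Ann": 10,  "Ben": 6,   "Cal": 8},
    "paired_comparison":  {"Ann": 2,   "Ben": 0,   "Cal": 3},
    "rating_scale":       {"Ann": 4.5, "Ben": 3.0, "Cal": 4.0},
}

def standardize(scores):
    """Convert raw scores to z-scores so measures on different scales are comparable."""
    m, s = mean(scores.values()), stdev(scores.values())
    return {name: (x - m) / s for name, x in scores.items()}

composite = {name: 0.0 for name in raw_scores["checklist"]}
for measure, scores in raw_scores.items():
    for name, z in standardize(scores).items():
        composite[name] += z

for name in sorted(composite, key=composite.get, reverse=True):
    print(name, round(composite[name], 2))
```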

NOTE: All figures mentioned in this reading will be sent in a separate document.