Understanding the Challenger Disaster: Organizational Structure and the Design of Reliable Systems
Author(s): C. F. Larry Heimann
Source: The American Political Science Review, Vol. 87, No. 2 (Jun., 1993)
Published by: American Political Science Association
Stable URL: http://www.jstor.org/stable/2939051

UNDERSTANDING THE CHALLENGER DISASTER: ORGANIZATIONAL STRUCTURE AND THE DESIGN OF RELIABLE SYSTEMS

C. F. LARRY HEIMANN, Michigan State University

The destruction of the space shuttle Challenger was a tremendous blow to American space policy. To what extent was this loss the result of organizational factors at the National Aeronautics and Space Administration? To discuss this question analytically, we need a theory of organizational reliability and agency behavior. Martin Landau's work on redundancy and administrative performance provides a good starting point for such an effort. Expanding on Landau's work, I formulate a more comprehensive theory of organizational reliability that incorporates both type I and type II errors.
These principles are then applied in a study of NASA and its administrative behavior before and after the Challenger accident.

On 28 January 1986, the entire nation focused on a single event. Seventy-three seconds after lift-off, the space shuttle Challenger was destroyed in a powerful explosion fifty thousand feet above the Kennedy Space Center. The losses resulting from this catastrophe were quite high. Seven astronauts, including Teacher-in-Space Christa McAuliffe, were killed as the shuttle broke apart and fell into the sea. The shuttle itself had to be replaced at a cost of over two billion dollars. The launch of many important commercial and military satellites had to be delayed as American space policy ground to a complete halt. The accident had a profound impact on the National Aeronautics and Space Administration (NASA) as well. The agency's credibility and its reputation for flawless execution of complex technological tasks were lost, along with the Challenger. To this day, the legacy of the Challenger haunts the decisions of both the agency and its political superiors in Congress and the White House.

An examination of the shuttle remnants and other launch data revealed the technical cause of the accident. The shuttle was destroyed when an O-ring seal on the right solid rocket motor failed, allowing the escaping hot gases to burn through and ignite the main fuel tank of liquid hydrogen and liquid oxygen. Although identifying the technical cause of this disaster is important, it represents only one aspect of the problem. Perrow (1984) has argued that organizational and technological failures have become so intimately linked that to fully understand the cause of most major accidents, we must analyze both the administrative and technical aspects of the situation.
While there have been some administrative critiques of NASA in the wake of this disaster, almost all have centered on issues of bureaucratic culture, such as the agency's propensity to ignore key evidence and its myopic view of its mission. Surprisingly little has been done by way of systematic analysis of the NASA organizational structure and how it may or may not have contributed to the loss of the Challenger. One reason for this void is that the necessary analytical tool, a comprehensive theory of organizational reliability, is still lacking. Those involved in the debate over structural design have adopted two opposing stances. The traditional public administration focus on organizational design has involved the pursuit of efficiency in the sense of minimizing costs for a given level of output. Implicit in this traditional analysis is the assumption that the reliable performance of the organization was constant. The critical question, therefore, has usually been how one might achieve the same level of services at a lower cost. As a result, the policy recommendations from this traditional line of thinking have been to streamline administrative systems and reduce organizational redundancy as much as possible.

The work of Martin Landau (1969) on redundancy in organizations was a particularly important contribution to this debate, inasmuch as he recognized that administrative reliability is dependent on structural factors. In breaking with conventional wisdom, Landau's landmark 1969 essay, "Redundancy, Rationality, and the Problem of Duplication and Overlap," asserted that the critical question was not how to cut the costs of administrative performance but, rather, how to ensure the organization's effectiveness. Landau began by noting that "no matter how much a part is perfected, there is always the chance that it will fail" (p. 350). As he observed, a streamlined system requires only one part to fail for the entire system to fail.
Drawing on concepts from engineering reliability theory, Landau argued that redundancy built into the system can make an organization more reliable than any of its parts. Therefore, Landau concluded, administrative redundancy and duplication can be an important part of effective government.

Some work has been done to extend Landau's initial insights regarding organizational redundancy and administrative reliability, primarily by students and colleagues of Landau. Jon Bendor's (1985) Parallel Systems, originally written as a dissertation under Landau, was the most comprehensive effort to follow up on these ideas. Bendor formalized many of Landau's concepts and was the first to conduct empirical testing of these propositions, focusing on transportation planning and operations in three American metropolitan areas. Donald Chisholm's (1989) Coordination without Hierarchy, also written originally as a dissertation under Landau, discussed the notion of reliability and redundancy as it existed in informal organizational structures. Others have extended some of Landau's concepts of reliability to the operations of air traffic controllers and aircraft carrier battle groups (LaPorte and Consolini 1991; Rochlin, LaPorte, and Roberts 1987).

The policy prescriptions that follow from these two positions are thus quite different. The traditionalist argument tells us to reduce redundancy and streamline administrative systems whenever possible, whereas Landau advocates increasing redundancy by adding parallel units to the system. Most interesting, however, is the fact that an analysis of the organizational changes at NASA reveals that neither the traditionalists nor Landau are fully correct.
Certain parts of the space agency followed a traditionalist policy of streamlining, while other segments of NASA adopted the Landau perspective by generating new parallel linkages. Yet (as I shall show) both these decisions were wrong and contributed to the untimely destruction of the Challenger.

How can this be? The reason is a fundamental limitation in the current framework for discussing organizational reliability. Most work in this area has implicitly assumed that there was only one kind of institutional failure and thus only two possible states for organizational performance: the agency either adopted the proper policy or not. But it should be recognized that the latter possibility conceals two different problems: the agency can simply fail to act, or it can adopt an improper policy. Considering the impact of both forms of error is important, because it leads us to a different set of policy prescriptions. An organizational structure that is effective at preventing one type of error may not be equally effective at preventing the other type of error. To date, Bendor alone has formally recognized this distinction; but his analysis was limited (1985, 49-52).

I shall extend our understanding of organizational reliability by exploring how each kind of failure is affected by different kinds of administrative structures. I shall lay a foundation for an analysis of multiple forms of administrative failure by describing the principles from engineering reliability theory, then show how different kinds of structures leave the agency vulnerable to different kinds of errors. Applying these arguments to the structure and decision-making process of NASA in the pre- and post-Challenger eras, I shall argue that the disaster had its roots in the structural changes adopted by the space agency in the 1970s and 1980s.
Although occasionally cited by students of public administration, Landau's work on institutional performance and reliability has received little attention from political scientists. One possible reason for this stems from the engineering roots of reliability theory. Political scientists may have considered the issue of organizational redundancy and reliability simply to be a technocratic problem that could be solved by reference to engineering formulas that dictate the appropriate structural form and do not raise political issues. Indeed, if we limit the scope of the problem to two-state devices, this apolitical view of structural design has some merit. But in a system with multiple types of errors, trade-offs between the errors must be made; and this moves us into the realm of politics. Which goals will be embraced? How will resources be allocated to combat each kind of error? Answers to these questions are ultimately political issues.

TWO TYPES OF ADMINISTRATIVE ERRORS

Before proceeding further, it is important to discuss more thoroughly what is meant by the term policy failure. As just noted, administrative problems often have a richer structure than the simple two-state (operating/failed) model can incorporate. In particular, bureaucracies are often in the position to commit two types of errors: (1) implementation of the wrong policy, an error of commission; and (2) failure to act when action is warranted, an error of omission. If we consider the agency's decision to take action to be comparable to the acceptance of a hypothesis, we can relate these two kinds of failures to the more familiar type I and type II errors often studied in statistics. To establish this link, we must first define the null and alternative hypotheses in terms of potential bureaucratic action.
Since presumption often favors the status quo, let us consider the null hypothesis to be that the agency should not take any new action and the alternative hypothesis to be that the bureau should take such action. Therefore, if the agency chooses to act when it is improper to do so (rejects the null hypothesis when it is true), a type I error has been committed. Likewise, if the bureau fails to act in a situation where it should (accepts the null hypothesis when it is false), then a type II error has been committed.

To illustrate the concept of type I and type II errors in organizational systems, let us consider the example of NASA and its decision to launch the space shuttle. The space agency traditionally approaches the launch decision with the assumption that a mission is not safe to fly. Subordinates are then required to prove that such is not the case before the launch is permitted. The null hypothesis, therefore, is that the mission should be aborted. If NASA were to reject the null hypothesis by launching a mission that is actually unsafe, it would be committing a type I error. On the other hand, if NASA decided not to launch a mission that was technologically sound, then it would have committed a type II error. The agency's choices and consequences are summarized in Figure 1.

FIGURE 1. Summary of NASA Responses and Possible Errors Regarding Launch Decisions

                            The proper course of action:
NASA's decision     Launch                      Abort
Launch              Correct decision:           Type I error:
                    mission successful          accident occurs; possible loss
                                                of life and/or equipment
Abort               Type II error:              Correct decision:
                    missed opportunity;         accident avoided
                    wasted resources

Note: The null hypothesis is that the mission should be aborted.

Each form of failure is associated with a different set of costs. By committing a type II error, NASA loses the opportunity to achieve its objectives and wastes time, effort, and materials that could have been usefully employed elsewhere. For example, the shuttle's propellants in the external tank, liquid hydrogen and liquid oxygen, are lost if the mission is scrubbed and have alone been valued at approximately five hundred thousand dollars. Furthermore, the agency may forfeit the opportunity to carry out rare scientific research, as was the case when NASA missed the launch date for its ASTRO mission to study Halley's Comet. The Challenger accident clearly demonstrated, however, that a type I failure may be far more costly. The destroyed shuttle was replaced by the Endeavour at an expense of over two billion dollars, and the death of the seven astronauts represents an incalculable loss. Certainly, in the case of NASA, type I errors are associated with greater costs than type II failures. The exact cost trade-off between these types of failure would vary, of course, for different agencies and according to the individual circumstances prevailing at the time of each decision. For this reason, it is important to develop a general approach to the study of organizational reliability that recognizes both types of errors and identifies the consequences associated with them.

By recognizing both forms of potential failure, we are better able to understand the nature of the trade-offs demanded. As with hypothesis testing, gains in type I reliability often come at the expense of type II reliability. However, just as it is conceivable to reduce both α and β in hypothesis testing by increasing the sample size, it is possible to increase both type I and type II reliability by raising an agency's resource levels.
But because of resource limitations in the real world, it is inevitable that bureaucrats and their political superiors will have to strike a balance between each type of reliability. On the whole, three-state devices allow us to consider both forms of error and thus provide a richer framework for studying the issue of organizational reliability.1

BASIC CONCEPTS OF STRUCTURE AND RELIABILITY

General Assumptions

Throughout this discussion I will be contrasting component and system reliability. It is important to clarify, in advance, what is meant by these terms. A system is a collection of subunits, known as components, which are linked together in a particular structure. In administrative theory, identifying components is dependent on the system level in question. If one were concentrating on the behavior of an agency, then the agency as a whole is the system, and the offices within it are the components. Alternatively, if one were looking at the executive branch as a whole, that would be the system and each agency a component within the system. In an organizational context, determining what the components are depends largely on the system with which you are concerned. In this case study of NASA, the system of concern is the agency, and each component represents an office or division inside NASA.

Three other assumptions of this work must also be specified. First, I start by assuming that the probability of failure for each component in the system has already been determined either by testing or past history,2 then, later, relax this assumption, demonstrating that the theoretical framework is valuable even when the exact probabilities of component failure are not known. Making this assumption at the outset, however, is useful for developing the theory; since the components are assumed to be known quantities, the remaining question is how to assemble these components into a reliable network.
Second, I assume that organizational reliability is static, not dynamic. In the engineering literature, reliability is often treated as time-dependent so as to simulate the breakdown of mechanical components; the probability of such failure naturally increases with age. For administrative systems, however, it is not clear how component reliability would change over time. One might argue that agents become more reliable over time because they have greater experience and expertise with the issues and are thus better able to address them. On the other hand, it could be said that agents are less reliable over time because they are more secure in their positions and lose their incentive to perform well. Additionally, interest-group capture of some public agency may affect the reliability of its performance. While these ideas raise interesting questions, static models of administrative reliability will provide sufficient insight for our needs here.

Finally, I assume that the states of all components are statistically independent. In other words, the failure of one component or subsystem does not affect the probability of failure of other components. Those who study public administration may question the validity of this assumption. Bendor, however, proves a theorem demonstrating that some degree of component interaction does not negate the general results of reliability theory (1985, 44-49). Therefore, the assumption of component independence is a useful simplifying assumption that does not undercut the generalizability of the results.

FIGURE 2. Redundant Systems: (a) serial configuration; (b) parallel configuration. [Diagram omitted.]
With these assumptions in hand, we can now focus our attention on the structural configurations commonly found in organizations.

Types of Organizational Structures

There are several basic organizational forms employed in the development of administrative systems. The first is a serial structure, a form often found in organizations. In a traditional serial structure, pictured in Figure 2, the first component must correctly process a policy initiative before sending it on to the next component. To be effective, policy must successfully pass through each of these components. The result is that in order for the system as a whole to fail, it is only necessary for one of the components in series to fail. If any one component were to fail to pass the policy to the next unit, then all the components that followed it would be unable to act, and the policy could not get through the system.

Another possible organizational form is the establishment of parallel linkages between components. In a parallel structure, such as the one illustrated in Figure 2, a policy may pass through any one of the components in order to get to the implementation stage. It differs from the serial structure inasmuch as even if one or more units fail to pass the policy along, it may still be able to make it through the system. The end result is that for a policy to fail to get through this type of system, all components must fail.

One variation of this structural form is the k-out-of-m unit network. In certain cases, we require that a certain number, k, of the m units in an active parallel redundant system must work for the system to be successful. One example of this type of system might be the requirement that at least two of the space shuttle's four major computers be on-line for launch. An organizational application of this system would be an agency director's decision rule not to implement any policy that a majority of the staff cannot agree upon (k = m/2 + 1).
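As a sketch of how such decision rules translate into reliability arithmetic (this is my own illustration, not the article's; the function name is hypothetical), the reliability of a k-out-of-m network of identical, independent components follows a binomial sum:

```python
from math import comb

def k_of_m_reliability(k, m, p):
    """Probability that at least k of m independent components,
    each working with probability p, are working."""
    return sum(comb(m, j) * p**j * (1 - p)**(m - j) for j in range(k, m + 1))

# Shuttle-style rule: at least 2 of 4 computers on-line, each 99% reliable.
print(k_of_m_reliability(2, 4, 0.99))
```

Setting k = 1 reproduces the parallel formula 1 - (1 - p)^m, and k = m reproduces p^m, the two special cases treated in the text.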
There are two special cases of the k-out-of-m unit network which merit attention. The first instance is where k = 1, in which case we are back to a simple parallel system. The second case is where k = m, which effectively reduces the structure to a serial system. (See Appendix for proof.) In the latter case, although the network is configured as an active parallel system, in terms of reliability it is behaving as a serial structure. To differentiate between this type of system and a traditional serial network, we will classify this type of structure as a serially independent system. This name signifies that while the system is essentially a serial one, its components operate completely independently from each other.

In an organizational context, an important distinction between a serially independent system and the traditional serial structure is the difference in processing time. In a traditional serial system, the first component must process the information, then pass it on to the next component for processing. This continues until all the components have processed the information. The total processing time of the system is the sum of processing times for each component plus some factor to account for transmission delay between units. In contrast, the serially independent system allows all components to operate simultaneously. The time needed for the serially independent system to complete its task is simply the time it takes the slowest component to finish its operations. So while both structures yield the same level of reliability, serially independent systems require less processing time.

Larger organizations often utilize combinations of serial and parallel structural forms. Two examples of this can be seen in Figure 3. In a series-parallel system there are several parallel subsystems linked together in a serial fashion.
A parallel-series configuration is the result of several serial substructures combined into a larger parallel network. As these two examples illustrate, most large and complex structures can be decomposed into smaller units for easier analysis.

FIGURE 3. Combinations of Structures with m Units in Series and n Units in Parallel: (a) series-parallel configuration; (b) parallel-series configuration. [Diagram omitted.]

THE ADVANTAGE OF PARALLEL SYSTEMS IN A TWO-STATE WORLD

Components aligned in a series configuration are perhaps the easiest systems to analyze, as well as the most commonly encountered. As I noted earlier, in order for the system as a whole to fail, only one of the components in series has to fail. Defining the probability of failure for component i as f_i, a mathematical statement of the reliability of a serial system with m components would be

    R_s = ∏_{i=1}^{m} (1 - f_i).

We can see that two factors determine the reliability of a series system: component reliability and the number of components in the system. In order to increase the reliability of this type of system, one must either increase the reliability of the components or decrease the total number of components employed. As illustrated in Figure 4, marginal gains in system reliability from increasing component performance decrease as component reliability increases. Because the costs of increasing component reliability often rise exponentially, it is more effective in many cases to reduce the total number of components in series to reach reliability goals.

Perhaps the most common means of increasing reliability in a two-state world is to add parallel components to a system.
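Before turning to parallel structures, the serial formula R_s = ∏(1 - f_i) can be checked numerically. The following sketch is my own illustration, not the article's code; it shows system reliability declining as the series lengthens, in the spirit of Figure 4:

```python
from math import prod

def serial_reliability(failure_probs):
    """R_s = product of (1 - f_i): every component in series must work."""
    return prod(1 - f for f in failure_probs)

# With identical components of failure probability f = 0.05,
# reliability falls steadily as components are added in series:
for m in (1, 5, 15):
    print(m, round(serial_reliability([0.05] * m), 3))
```

With f = 0.05, reliability drops from .95 for a single component to roughly .46 for fifteen in series, which is why the traditionalist prescription is to shorten the chain.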
It is assumed that all branches are active and that a signal needs to pass through only one branch to be successfully transmitted. Since it is assumed that all branches must fail in order for the system to fail, the reliability function for a parallel system with n components is simply

    R_p = 1 - ∏_{i=1}^{n} f_i.

Figure 5 illustrates the relationship between component reliability, the number of parallel elements, and overall system reliability. In this case, we see that the marginal gains from adding parallel channels decrease as the number of parallel components increases.

FIGURE 4. Declining Reliability of Serial Systems. [Plot omitted: system reliability versus the number of components in series (1 to 15), for component reliabilities R = 0.99, 0.98, and 0.95.]

In comparing Figures 4 and 5, the attractiveness of parallel systems in a two-state world is clear. Holding component reliability constant, the addition of redundancy in a parallel fashion will raise the reliability of the overall system, while creating serial redundancies decreases total system reliability. It is not surprising, therefore, that many scholars in this area have spurned serial systems and focused, instead, on parallel linkages when discussing this issue. Pressman and Wildavsky (1973) have criticized the existence of "multiple clearance points" (a serial system) for an implementive decision because it reduces the likelihood that any policy can ever be executed.
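The parallel formula R_p = 1 - ∏ f_i can be verified the same way; this sketch (again mine, not the article's) shows the diminishing marginal gains from redundancy visible in Figure 5:

```python
from math import prod

def parallel_reliability(failure_probs):
    """R_p = 1 - product of f_i: the system works unless every branch fails."""
    return 1 - prod(failure_probs)

# Component reliability 0.7 (failure probability 0.3); each added branch
# raises system reliability, but by less each time:
for n in (1, 2, 3, 4):
    print(n, round(parallel_reliability([0.3] * n), 4))
```

In a two-state world, then, every added parallel branch helps, and every added serial stage hurts, which is what gives the Landau prescription its force.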
Landau (1969) directs the same logic against "streamlined" serial structures, claiming that such systems are more susceptible to failure than those with parallel redundancies. As a result, the major policy prescription of the work of Landau (1969, 1973, 1991) and others (Bendor 1985; Lerner 1986) has been to emphasize the use of parallel linkages in order to create more reliable organizations.

FIGURE 5. Increasing Reliability with Parallel Systems. [Plot omitted: system reliability versus component reliability (0.50 to 1.00), for varying numbers of parallel components n.]

However, if it is true that serial systems are generally less reliable, why are such structures ever adopted by organizations? To answer this question fully, we must go beyond the two-state world to allow for multiple forms of error.

TYPE I AND TYPE II ERRORS IN AN ADMINISTRATIVE SETTING

The Advantages of Serial Systems in a Three-State World

As we have seen in the two-state world, serial structures are less reliable than others of comparable size. Nonetheless, it is clear that this structural form has been employed repeatedly in the development of bureaucracies. The reason this type of organizational structure has survived, I will suggest, is that it is valuable when we must design systems to accommodate multiple types of error. I previously assumed that there existed only two states for every component: good (operating) and failed (not operating). I shall now elaborate a theory of organizational reliability that allows for the existence of both type I and type II errors.

In general, a series structure is better suited to stop type I errors from occurring.
Since the null hypothesis is that the agency should not take any action, if any component chooses to pass the proposed policy through its part of the system, then it is rejecting the null hypothesis. For any policy to pass through a serial system successfully, all units must agree to pass it along. If rejecting the null hypothesis was actually the incorrect decision (a type I error), then all units must commit such an error for the serial system as a whole to fail. Mathematically, we can identify both the probability of a type I failure occurring in the system (F_α) and the reliability of the system against this error (R_α) as

    F_α = ∏_{i=1}^{m} α_i,
    R_α = 1 - ∏_{i=1}^{m} α_i,

where α_i is the probability of a type I error occurring in the ith component and m is the number of components linked in series.

For an example, consider NASA's launch decision process. We would find that a serial structure is more effective in preventing unsafe launches. In order to approve an unsafe launch in a serial system, every unit must err. The more hurdles are established (via increasing numbers of serial components), the harder it is for an unsafe launch proposal to pass through the system unopposed.

With regard to type II errors, however, series structures are less effective. Consider the fact that if one unit accepts the null hypothesis of no new action, then the policy cannot be passed through the rest of the serial system. If accepting the null hypothesis was actually incorrect (a type II error), then the whole system would have failed. Thus, only one component in series must fail in this manner for the system as a whole to commit a type II error.
The more components added in series, the greater the probability that some component will commit a type II error, causing the system to fail (F_β). The reliability of a series structure with regard to type II errors (R_β) can be represented as

    F_β = 1 - ∏_{i=1}^{m} (1 - β_i),
    R_β = ∏_{i=1}^{m} (1 - β_i),

where β_i is the probability of a type II error occurring in the ith component and m is the number of components linked in series.

We may also be interested in the overall reliability of the system, the likelihood that an agency would commit either a type I or type II error. To find this, we assume that for a given event at a given time, it is not possible for an administrative system to commit both a type I and a type II error simultaneously. This assumption is not unreasonable. A type I error requires that the agency act, while a type II error demands that the agency postpone any action. An organization cannot both act and not act at the same time. Consider NASA's decision to launch the shuttle. The space agency can either decide to launch the shuttle now (possibly resulting in a type I error) or abort the launch until another time (possibly resulting in a type II error). It cannot decide to both launch and abort at the same time. Given this argument, we can conclude that type I and type II errors are, at any point in time, mutually exclusive events. Therefore, the probability of either a type I or type II error in a serial system can be found as

    P(F_α ∪ F_β) = P(F_α) + P(F_β) = ∏_{i=1}^{m} α_i + [1 - ∏_{i=1}^{m} (1 - β_i)].

The overall reliability of the serial system is then found to be

    R_sys = 1 - P(F_α ∪ F_β) = ∏_{i=1}^{m} (1 - β_i) - ∏_{i=1}^{m} α_i.

A brief numerical example will demonstrate these properties of serial systems.
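These serial formulas can be sketched in code. This is my own illustration under the article's assumptions of identical, independent components (the function names are hypothetical); it uses the α = .10, β = .20 values of the worked example that follows:

```python
from math import log

def serial_system(alpha, beta, m):
    """Type I failure, type II failure, and overall reliability
    for m identical, independent components in series."""
    f_alpha = alpha**m              # all m units must err for a type I failure
    f_beta = 1 - (1 - beta)**m      # one erring unit suffices for a type II failure
    r_sys = (1 - beta)**m - alpha**m
    return f_alpha, f_beta, r_sys

fa, fb, r = serial_system(0.10, 0.20, 2)   # fa ~ .01, fb ~ .36, r ~ .63

def optimal_m(alpha, beta):
    """Appendix result: m* = ln[ln(1-beta)/ln(alpha)] / ln[alpha/(1-beta)]."""
    return log(log(1 - beta) / log(alpha)) / log(alpha / (1 - beta))
```

For these parameter values m* is about 1.1, so a second serial check already costs more in type II risk than it buys in type I protection.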
Assume that the NASA launch decision structure has two components that operate in series, each having the probability of committing a type I error α = .10 and the probability of committing a type II error β = .20. The probability that the overall system approves the unsafe launch (type I) is F_α = .01, which is much less than that of either of the two components. However, the probability that NASA will abort a safe launch (type II) is F_β = .36. The system as a whole is more likely to commit the type II error than either of its elements. The overall reliability of the system is found to be R_sys = .63.

If we concern ourselves only with the system's overall performance, we can find the number of elements in series that, for a given level of component reliability, will result in the optimal level of reliability. The system reliability of a series network consisting of m identical and independent components is R_sys = (1 - β)^m - α^m. Differentiating this equation with respect to m and setting the result equal to zero will give us the optimal number of components to be linked in series in order to maximize the system's reliability. This result (derived in the Appendix) is

    m* = ln[ln(1 - β) / ln α] / ln[α / (1 - β)].

The Disadvantage of Parallel Systems in a Three-State World

As the reader will have realized by now, there is an inverse relationship between the reliabilities of series and parallel systems. For instance, although series structures are ineffective against type II errors, parallel systems are able to reduce the probability of such errors occurring. This is because it is necessary for all components to fail in this manner for the overall system to commit a type II error. Likewise, type I errors are increased in such a framework, because it is only necessary for the incorrect action to pass through one channel in order to be implemented by the system.
Therefore, the more independent parallel branches are attached to a structure, the more likely a type I error will occur but the less likely a type II error will be committed by the system. The mathematical formulation of the reliability of a parallel system is

F_α = 1 − ∏_{i=1}^{n} (1 − α_i),    R_α = ∏_{i=1}^{n} (1 − α_i),

F_β = ∏_{i=1}^{n} β_i,    R_β = 1 − ∏_{i=1}^{n} β_i,

R_sys = ∏_{i=1}^{n} (1 − α_i) − ∏_{i=1}^{n} β_i.

Modifying the example used earlier, let us assume that instead of connecting the two units in series, NASA links its components together in a parallel network. In this case, launch will occur if either unit recommends it. Now, the probability of launching an unsafe mission is greater, F_α = .19; but the probability of aborting a good mission is only F_β = .04. The overall reliability of this system in preventing either type of error is much greater than before: R_sys = .77 in parallel, while R_sys = .63 in series. Just as we were able to do earlier for the series structure, we can calculate the optimal number of identical and independent components linked in parallel. The system reliability of a parallel network consisting of n identical and independent components is R_sys = (1 − α)^n − β^n. Differentiating this equation with respect to n and setting the result equal to zero will give us the optimal number of components to be linked in parallel in order to maximize the system's reliability. This result (derived in the Appendix) is as follows:

n* = ln[ln β/ln(1 − α)] / ln[(1 − α)/β].

Combinations of Serial and Parallel Systems

In hypothesis testing, the probability of type I and type II errors can be simultaneously reduced by increasing the sample size. Similarly, we can lower the probability of both types of error occurring in reliability theory by adding additional components to the system both in series and in parallel.
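Before turning to combined structures, the two closed-form optima can be sanity-checked numerically (a sketch under this section's assumptions; the helper names are mine):

```python
from math import log

def m_star(alpha, beta):
    """Continuous optimum of the series reliability (1-beta)**m - alpha**m."""
    return log(log(1 - beta) / log(alpha)) / log(alpha / (1 - beta))

def n_star(alpha, beta):
    """Continuous optimum of the parallel reliability (1-alpha)**n - beta**n."""
    return log(log(beta) / log(1 - alpha)) / log((1 - alpha) / beta)

def r_series(alpha, beta, m):
    return (1 - beta) ** m - alpha ** m

a, b = 0.10, 0.20
m = m_star(a, b)
# Reliability at the optimum beats nearby values of m:
assert r_series(a, b, m) > r_series(a, b, m - 0.1)
assert r_series(a, b, m) > r_series(a, b, m + 0.1)
print(round(m, 2), round(n_star(a, b), 2))
```

For the running example (α = .10, β = .20), the continuous optima come out near m* ≈ 1.1 and n* ≈ 1.8; in practice one would compare the integer neighbors.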
With regard to multiple errors, we may employ the expressions derived earlier to analyze the reliability of networks that contain both parallel and series subsystems. To do this, we must first reduce the overall structure into a set of serial and/or parallel subsystems. The various subsystems are evaluated to find the probability of type I and type II failure at this level. Each subsystem is then treated as a single component in a larger model of the overall structure. Using this system reduction method, we are able to find the reliability of more complex administrative networks. Finding the optimal number of components in a mixed structure is more difficult than it was for simple series or parallel systems. Consider a combined system having m components in series and n components in parallel. We can raise the overall reliability level of the system to a point arbitrarily close to 1 by simply increasing the number of components in both series and parallel without bound (Barlow and Proschan 1965, 187). However, given a fixed number of components in either series or parallel, we can find the optimal number of components needed in the other dimension. I shall discuss this for both series-parallel and parallel-series systems. First, let us consider a series-parallel system having m components in series and n components in parallel (as in Figure 4). All components are assumed to be identical and independent. Regarding the system only as a series, we find that its overall reliability is R_sys-SP = (1 − F_β)^m − F_α^m, where F_α and F_β now denote the failure probabilities of each parallel subsystem. Next, we must find the probability of type I and type II errors in each parallel subsystem: F_α = 1 − (1 − α)^n and F_β = β^n.
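The reduction method just described can be sketched in code: each parallel subsystem is collapsed into a single equivalent component, and the equivalent components are then combined in series (an illustrative sketch; the function names are my own):

```python
from math import prod

def reduce_parallel(units):
    """Collapse parallel components, given as (alpha, beta) pairs,
    into one equivalent component."""
    alpha_eq = 1 - prod(1 - a for a, _ in units)   # any one approval acts
    beta_eq = prod(b for _, b in units)            # all must balk
    return alpha_eq, beta_eq

def reduce_series(units):
    """Collapse serial components into one equivalent component."""
    alpha_eq = prod(a for a, _ in units)           # error must pass everyone
    beta_eq = 1 - prod(1 - b for _, b in units)    # any one balk aborts
    return alpha_eq, beta_eq

# A series-parallel network: two serial stages, each a parallel
# pair of identical (alpha = .10, beta = .20) components.
stage = reduce_parallel([(0.10, 0.20)] * 2)
f_alpha, f_beta = reduce_series([stage] * 2)
print(round(1 - f_alpha - f_beta, 4))   # overall system reliability
```

The result agrees with the closed form for this network, (1 − β^n)^m − (1 − (1 − α)^n)^m, evaluated at n = m = 2.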
Combining these equations, we find that the overall system reliability of the n × m series-parallel network is

R_sys-SP = (1 − β^n)^m − (1 − (1 − α)^n)^m.

Once we have the equation for the overall reliability of the system and have decided whether to fix the level of m or n, we can find the optimal number of components in the other dimension by differentiating R_sys-SP with respect to the variable we seek to optimize and setting the result equal to zero. There are a number of computer routines available that are also capable of solving this type of problem. This approach also works for parallel-series systems. Figure 5 shows a parallel-series network of n × m independent and identical components. Using the same method as before, we find the overall system reliability of a parallel-series system to be

R_sys-PS = (1 − α^m)^n − (1 − (1 − β)^m)^n.

Again, once we have the equation for the overall reliability of the system and have decided whether to fix the level of m or n, we can find the optimal number of components in the other dimension by differentiating R_sys-PS with respect to the variable we seek to optimize and setting the result equal to zero.

ALTERING RELIABILITY THROUGH SYSTEM LINKAGES AT NASA

This theoretical approach to structural design can now be used to examine the institutional failures at NASA that ultimately led to the destruction of the Challenger. I will show that during the 1970s and 1980s, NASA altered its organizational structure in order to achieve different reliability goals. I examine changes within two specific areas of NASA's structure that the Rogers Commission mentioned in its report on the Challenger accident in 1986. The first area of concern involves the organization of NASA's reliability-and-quality-assurance (R&QA) functions. The second area involves changes that took place in the agency's launch decision structure.
Changes Within NASA's Reliability-and-Quality-Assurance Function

Prior to 1961, NASA had no explicit reliability-and-quality-assurance function within the agency.3 In trying to match the Soviet space achievements, NASA devoted little of its effort toward preventing bad launches (type I error) and used almost all its resources trying to launch as often as possible (avoiding type II error). As a result, the agency's mission success rate from 1958 to 1961 was dismal (Weiss 1971). At that point, the Soviet Union seemed to be winning the space race, inasmuch as they were able to draw attention to their many successes, while the United States had experienced a number of visible failures.4 It became clear that the only hope of beating the Soviets was for the United States to play down the numbers of launches and emphasize mission quality. This fact underscored the need for an agencywide effort to increase type I reliability. Consequently, NASA administrator James Webb established the first R&QA function within the space agency in 1961. The R&QA function was initially formed on three levels: within headquarters, at the field centers, and in the contractors' plants. As such, it represents a classic example of serial redundancy. For a mistake to be made by NASA, the error would have to pass through all three checkpoints undetected. As I have noted, system reliability with regard to type I error increases with each additional serial component. Constructing such an organizational structure is wholly consistent with the agency's increased concern over type I reliability at this time. Following the success of the Apollo program in the 1970s, NASA faced less demand for type I reliability and more for type II reliability. In the Apollo era, NASA's primary concern was for successful achievement. The agency had a specific mandate that it knew had to be fulfilled at any cost.
Having secured victory in the race to the moon, NASA faced increasingly tight budget constraints. At the same time, the demand from politicians for services continued unabated. This meant that NASA now had to do more with fewer resources available. As a result of the changing political and economic landscape, the agency focus in the Shuttle era had shifted to the efficiency and cost-effectiveness of its policies (Heimann 1991). By demanding that NASA develop more cost-effective policies, politicians put the agency under greater pressure to pursue type II reliability. As noted earlier, type II failure costs are generally associated with wasted resources and inefficient behavior. From a short-term perspective, then, a concern for type II reliability appeared to be more sensible than type I reliability.5 The greater pressure for efficient space policy, therefore, led NASA to allocate more of its resources to type II reliability. The results of this shift are visible in the changes in the R&QA structure since 1970. At the headquarters level, there was a consolidation of the R&QA Office with the Safety Office to take advantage of economies of scale.6 Later, in 1973, this office was combined with other staff and placed under the associate administrator for organization and management.7 Some of the R&QA work was integrated with the Office of Procurement at this time. Further consolidations occurred in 1977, and all safety- and reliability-oriented functions were transferred to the Office of the Chief Engineer. Placement of the R&QA function within the Office of the Chief Engineer did not promote type I reliability.
Documentation from NASA makes it clear that while safety and reliability were considered important, they were a secondary function of the chief engineer's office (National Aeronautics and Space Administration 1983). The efforts of R&QA were also hampered by the continual loss of personnel. From 1970 to 1985, NASA experienced a 31% decline in total personnel; but within the total R&QA function, this decline was over 62%. As a result, the NASA staff allocated to R&QA was just over 5.1% in 1970; in 1985, it was less than 2.8%.8 By the end of 1985, the R&QA staff at NASA headquarters totalled only 17 people. As Chief Engineer Silveira stated in an interview, "We were trying to use the field center organizations rather than having that function here."9 NASA had all but eliminated one of the serial components in its safety and reliability-and-quality-assurance function. This action reduced the probability of NASA's committing a type II error but increased the chances of experiencing a type I failure. The R&QA function was not immune to changes at the field centers. This level experienced reductions in manpower as well.10 From 1970 to 1985, the three major centers for manned space flight (i.e., the Johnson Space Center, Kennedy Space Center, and Marshall Space Flight Center) cut R&QA personnel by […], 54, and 84%, respectively. The heavy cuts at Marshall, in particular, were understandable, given the pressure the center was under to be cost-effective. Again, cutting R&QA personnel can be seen as reducing the serial linkages at this level that can serve to prevent type I errors. In addition to the personnel reduction, the field centers' ability to supervise the activities of contractors was limited in this period. Normally, NASA seeks to "penetrate" its contractors to provide an adequate check on their work.
Aerospace contractors, however, generally conduct business with the Department of Defense as well as NASA. The Defense Department in the 1970s became concerned that too many NASA inspectors at the contractors' plants could jeopardize national security and wanted to place a cap on the number and scope of their inspection activities (Smith 1989, 230-31). Inasmuch as the agency was dependent on the Defense Department for political support for the shuttle program, NASA had little choice but to accept these limitations on the plant inspection efforts. After the Challenger, NASA's R&QA function changed dramatically. The Rogers Commission castigated the space agency for its "silent safety program" and recommended that it revitalize its R&QA function. In response, NASA created the Office of Safety, Reliability, Maintainability, and Quality Assurance (later renamed the Office of Safety and Mission Quality). Established as a level-1 organization, this office is headed by an associate administrator who reports directly to the NASA administrator. Staffing for the new headquarters office was immediately doubled and has since experienced more than 350% growth over its 1985 level. While NASA personnel levels have grown by 10% over the past five years, the R&QA function as a whole has increased by 123% during that time. As a result, R&QA as a percent of total NASA staff is back at 5.6%, similar to its position at the time of Apollo 11 in 1969; NASA has made a strong and visible effort to restore the serial component at this level. Within this office, there exist several divisions, such as the Safety Division and the Space Station Safety and Product Assurance Division, which formulate office policy in their respective areas and provide some monitoring to ensure compliance. The real "teeth" of the office, however, are found in the Systems Assessment Division and the Programs Assurance Division.
In Systems Assessment, top-level senior engineers from a wide range of disciplines perform independent evaluation of technical problem areas and testing of systems readiness. The Programs Assurance Division also employs senior engineers to identify critical technical problems and to ensure that the office's concerns are properly addressed by the program organizations and field centers. Inasmuch as either of these two units has the ability to stop a launch perceived to be unsafe, the two divisions work as serially independent linkages within the headquarters office. Creating two serial components within the larger serial component at headquarters increases the organization's reliability with regard to type I errors. The R&QA function at the field center level has also been resuscitated in the wake of the Challenger failure. Marshall, which had initiated the largest cuts in this area, has since increased its R&QA personnel by 178% over 1985 levels. R&QA staffing levels at the Kennedy Space Center and the Johnson Space Center have also increased over 1985 levels, by 175% and 56%, respectively. The agency has further augmented its R&QA function through the development of the NASA Safety Reporting System (NSRS). Run by a contractor with no other NASA business, the NSRS provides employees with a confidential means of reporting problems that they believe have not been properly addressed. The system acts as an additional serially independent component in the NASA R&QA structure. Officials at the Office of Safety and Mission Quality state that there have been "no showstoppers" among the reports received by the NSRS.
Furthermore, the problems coming through the NSRS have almost always been identified by the office first, evidence that its administrative structure is doing its job properly. It is important to consider these organizational changes in the context of three-state reliability theory. In Table 1 (Set A), I analyze the various R&QA structures, first under the assumption that each component has a 15% chance of making type I and type II errors. The original Apollo structure had three serial units, reducing the chance of a type I error to a remote .3%. The advantage of the Apollo structure can be clearly seen. Even if the individual components are not that reliable with regard to type I error, the structure as a whole would guard against such a failure. As we noted earlier, the NASA structure in the Shuttle era prior to Challenger was effectively reduced to a single unit. This shift resulted in a dramatic reduction of type II errors and lowered the probability of either form of system failure from 38.9% to 30%. The disadvantage is that a type I error, such as the Challenger accident, would be far more likely, rising in this case from .3% to 15%. Since then, NASA has made a concerted effort to restore the serial structure in R&QA and has even added an additional unit through the Office of Safety and Mission Quality. Adding the fourth component to the serial unit in this case lowers the probability of a type I error to a paltry .1%. Some could argue that a change to a more streamlined structure would be acceptable if accompanied by an increase in component reliability. This is not necessarily the case. In Set B, I allow each component to lower its probability of error by a factor of three, from 15% to 5%. Under these circumstances, the pre-Challenger structure has a 5% chance of committing a type I error.
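The entries in Table 1 follow mechanically from the serial formulas; a short script (illustrative only, using the same assumed component probabilities) regenerates both sets:

```python
def serial_error_rates(alpha, beta, m):
    """Type I, type II, and overall error rates (as percentages) for
    m identical components in series; m = 1 is the single-unit case."""
    f1 = alpha ** m
    f2 = 1 - (1 - beta) ** m
    return 100 * f1, 100 * f2, 100 * (f1 + f2)

for p in (0.15, 0.05):                      # Set A, then Set B
    for label, m in (("Apollo", 3), ("Shuttle", 1), ("Post-Challenger", 4)):
        f1, f2, total = serial_error_rates(p, p, m)
        print(f"{label:16s} {f1:5.1f} {f2:5.1f} {total:5.1f}")
```

For Set A this reproduces the Apollo row (.3, 38.6, 38.9) to the table's rounding.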
The Apollo structure, even with less reliable components, had less than a 1% chance of committing such an error. This comparison makes it clear that these structural changes at NASA were consequential. As this analysis has shown, streamlining the R&QA function increased the probability that a type I failure such as the Challenger accident would eventually occur.

TABLE 1. Probabilities of Failure for NASA R&QA Structures (all results expressed as a percentage)

COMPONENTS & STRUCTURES                SET A    SET B
Component Failure
  Type I error                          15.0      5.0
  Type II error                         15.0      5.0
Apollo Structure
  Type I error                           0.3     <0.1
  Type II error                         38.6     14.3
  Overall error                         38.9     14.3
Shuttle Structure Before Challenger
  Type I error                          15.0      5.0
  Type II error                         15.0      5.0
  Overall error                         30.0     10.0
After Challenger
  Type I error                           0.1     <0.1
  Type II error                         47.8     18.5
  Overall error                         47.8     18.5

Note: A type I error for NASA would be a decision to launch an unsafe mission. A type II error would be a decision to abort a technically sound mission. The figures for component reliability, as well as the calculations which follow, are not empirically derived estimates but rather assumptions made for the purpose of illustration.

Changes in the Launch Decision Structure

At the same time that the R&QA function at NASA was undergoing significant modifications, important changes in the launch decision process also occurred that exacerbated the problem of type I reliability at NASA. I shall examine both the development and the impact of structural changes in the launch decision process prior to the Challenger. Figure 6 illustrates NASA's official launch decision structure. Designed originally in the Apollo era to limit type I errors, the system has a large number of serial components. Although the field centers are configured as parallel units in the diagram, this level of the structure actually operates as a serially independent system.
The reason for this is that the operating rule at the preflight review is that if any center reports it is unready to fly, the mission is aborted. This is an example of a k-out-of-m network, where k = m. While most of this structure remained intact throughout the shuttle program, critical changes occurred at the Marshall Space Flight Center. Marshall has responsibility for three aspects of the shuttle program: the main engines, the external tank, and the solid rocket boosters. At the center, there is a project manager and staff assigned to each section of the program. Before a launch, contractors for each element in the shuttle must certify in writing that their components have been examined and are ready to fly the specified mission. After this step, the Marshall staff responsible for that segment of the program must also verify that it is safe to fly under these conditions. This process is illustrated in Figure 7.

[Figure 6. NASA Official Launch Structure, 1981-1986: the Mission Management Team and Flight Readiness Review sit above flight readiness reviews at the Marshall, Kennedy, and Johnson centers, which in turn rest on certifications from the shuttle element contractors.]

This structure works well at preventing type I failures, but it is not as effective with regard to type II errors. Assume for a moment that the probability of each component in this system committing a type I error is 5% and the chance of a type II error is 15%.11 In such a case, each subsystem (solid rocket boosters, main engine, and external tank) has only a .25% chance of committing a type I failure, but a 27.75% chance of allowing a type II error.
The probability that the center as a whole would be responsible for any type II launch failure is 62.29%. This is certainly not good news for Marshall, which was under a great deal of pressure to operate cost-effectively. The center had been slated for shutdown following the completion of Apollo. As former NASA administrator James Fletcher told his successor, "Closing Marshall has been on the Office of Management and Budget's agenda ever since I came to NASA in 1971" (Smith 1989, 84). To prevent this from occurring, NASA headquarters sent several high-visibility projects, such as the Space Telescope and the shuttle's solid rocket motors, to Marshall. Despite these efforts, Marshall was still threatened with large reductions in personnel and other resources following Apollo.12 This pressure became even more intense as the shuttle neared operational status. In 1978, the OMB had expressed to NASA that it wished to impose major cuts at Marshall. Although NASA suggested "new roles" for the center, OMB officials made it clear they were not interested.13 While NASA as an institution faced strong pressure to run its operations cost-effectively, Marshall felt this demand more intensely. Consequently, managers at the center sought to increase their type II reliability, so as not to have a launch stopped on account of a Marshall part (McConnell 1987, 109, 112).
To achieve this objective, Marshall Center director William Lucas unofficially changed the organizational structure from a serial system to a parallel one (Figure 8). If either the contractor or the Marshall staff stated that a launch was justifiable, Lucas would insist that all parties agree to forward the necessary paperwork. Through his personal supervision of the launch decision process and his domineering style of management, Lucas was able to ensure that this unofficial structure prevailed.

[Figure 7. Official Structure of Flight Readiness Review at the Marshall Space Flight Center: for each element (space shuttle main engine, solid rocket booster, external tank), the contractor and the Marshall staff certify readiness in series before the Pre-Flight Readiness Review (Level 2 Review).]

[Figure 8. An Unofficial Change in the Structure of the Flight Readiness Review at the Marshall Space Flight Center: for each element, the contractor and the Marshall staff operate as parallel channels into the Pre-Flight Readiness Review (Level 2 Review).]

Following the investigation of the Challenger accident, it was widely publicized that Marshall solid rocket booster managers had coerced engineers at Morton Thiokol into agreeing to a launch they opposed. Such pressure, however, has worked in both directions. In January 1985, solid rocket booster manager Larry Mulloy sent an urgent memo to Thiokol concerning O-ring erosion. A week later, in the review for shuttle mission 51-D, Thiokol's launch decision was to "accept risk" (Presidential Commission on the Space Shuttle Challenger Accident 1986, vol.
1). Once this opinion was expressed, the ability of the Marshall staff to stop the launch was extremely limited. In sum, the launch decision structure at Marshall had been transformed from a serial system, which would require both parties to authorize the flight readiness of the equipment, into a parallel structure needing only one component to approve the launch. This structural shift had a profound impact on the reliability of the launch decision process at Marshall. Table 2 (Set C) contrasts the organizational reliability of each system, assuming once again that each component commits type I and type II errors with a probability of .05 and .15, respectively. The new parallel structure reduces the probability that any subsystem will commit a type II error from 27.75% to 2.25%. Curtailing this form of error was completely consistent with Marshall's requirements at the time. The converse is that the probability of a type I failure in each subsystem increases from .25% to 9.75%. The probability that at least one subsystem at the Marshall center would commit a type I error rose from .75% to 26.49% under the new process. As noted, NASA had changed its emphasis from type I to type II reliability. To consider the impact of this transformation, Table 2 (Set D) compares the two structures with the component probabilities of type I and type II errors reversed. Even when biased toward type II reliability, the serial decision structure is able to keep type I error rates at a low 2.25% in each subsystem. In contrast, the new parallel organization increases this failure rate to 27.75%. The prescribed serial structure clearly limits the possibility that such organizational changes would lead to fatal type I mistakes, a feature that is lost in the unofficial launch process.
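The Set C and Set D figures for Table 2 can be regenerated the same way (an illustrative script; the two-component subsystem and three-subsystem center follow the structures in Figures 7 and 8):

```python
def subsystem_errors(alpha, beta, serial):
    """Contractor and Marshall staff as two components: in the official
    serial structure both must certify readiness; in the unofficial
    parallel structure one approval suffices."""
    if serial:
        return alpha ** 2, 1 - (1 - beta) ** 2
    return 1 - (1 - alpha) ** 2, beta ** 2

def center_errors(alpha, beta, serial):
    """Three subsystems (boosters, main engine, external tank); the
    center commits an error if at least one subsystem does."""
    f1, f2 = subsystem_errors(alpha, beta, serial)
    return 1 - (1 - f1) ** 3, 1 - (1 - f2) ** 3

for alpha, beta in ((0.05, 0.15), (0.15, 0.05)):      # Set C, then Set D
    for serial in (True, False):
        c1, c2 = center_errors(alpha, beta, serial)
        print(f"{'serial' if serial else 'parallel':8s} "
              f"{100 * c1:5.2f} {100 * c2:5.2f}")
```

For Set C this yields center-level failure rates of .75% and 62.29% under the serial structure, and 26.49% and 6.60% under the parallel one, matching the table.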
In light of this analysis, perhaps the most surprising aspect of the Challenger accident was that it did not happen earlier. Since then, the launch decision process at Marshall has been restored to its previous status. In addition, the Office of Safety and Mission Quality has been given a direct voice in the launch decision at the flight readiness review. This office now has the authority to stop any launch that it believes is unsafe, linking it to the system in a serially independent manner. This authority has been exercised on previous launch attempts when the office was concerned with hydrogen leaks on the shuttle, defective door lug-bolts, and other technical problems. On the whole, the launch decision system is now structured to be more protective against type I failures and prevent another mishap like the Challenger.

TABLE 2. Probabilities of Failure for Marshall Launch Decision Structures (all results expressed as a percentage)

                              SET C             SET D
COMPONENTS & STRUCTURES   TYPE I  TYPE II   TYPE I  TYPE II
Component Failure           5.0    15.0      15.0     5.0
Serial Structure
  Subsystem Failure         0.25   27.75      2.25    9.75
  Center Failure            0.75   62.29      6.60   26.49
Parallel Structure
  Subsystem Failure         9.75    2.25     27.75    0.25
  Center Failure           26.49    6.60     62.29    0.75

Note: A type I error for NASA would be a decision to launch an unsafe mission. A type II error would be a decision to abort a technically sound mission. The figures for component reliability, as well as the calculations which follow, are not empirically derived estimates but rather assumptions made for the purpose of illustration.

CONCLUSION

Several conclusions can be drawn at this point. First, organizational structure can have an important impact on administrative reliability.
I have demonstrated, both in theory and for the case of NASA, that changes in the number and alignment of administrative components alter the probability that an agency will commit either a type I or type II error. Perhaps most interesting is the fact that NASA changed its institutional configuration to appease both the traditional and the Landau schools of thought in public administration, yet both decisions contributed to the biggest failure in NASA history. With regard to its reliability-and-quality-assurance function, NASA followed the traditional public administration prescription by eliminating redundancies and streamlining its organizational system. At the same time, the space agency also heeded the advice of Landau by creating parallel linkages in the launch decision structure. As I have shown, however, both those decisions were incorrect and contributed to the destruction of the Challenger. While recognizing its limitations, we should not be overly critical of the work in this field by Landau and his colleagues. It is clear that Landau's major objective at the time was to respond to those in the discipline who emphasized efficiency over the pursuit of reliability. As such, his work was path-breaking, and his focus on two-state (operating/failed) devices was understandable. As I have shown here, however, the three-state world gives us a theoretically richer and more empirically useful framework for evaluating organizational reliability. Expanding this subject to allow for multiple types of error is an important step toward a more general theory of organizational reliability and agency behavior. Using this theoretical framework, it is also clear that the agency's choice of structure can provide insights on its priorities. The NASA case is particularly illustrative of this point. In the Apollo era, the agency pursued type I reliability more than type II.
To do this, it developed large serial structures in both the reliability-and-quality-assurance function and in the launch decision process. Over time, there was greater demand for type II reliability, which was met through a series of structural changes in the agency. The intense criticism following the Challenger led NASA to radically shift its structure to a system that was even more effective against type I errors. That structural changes mirrored the shifting demands for each form of reliability does not surprise us. As Alfred Chandler, the eminent management historian, once put it, structure follows strategy (Chandler 1962). By looking at the structural modifications within an agency and understanding how these changes influence organizational reliability, we can gain some insight into the agency's true preferences and priorities with regard to different types of reliability. It is possible to perform such an analysis even if we do not know the exact values of component reliability for both types of error. In the theoretical section, I assumed that the reliability of each component was known. When analyzing NASA, however, I was still able to apply these principles by making some reasonable assumptions about the reliability of the components. More generally, when circumstances dictate greater effort toward one form of reliability, we can use these principles to develop a set of structures that are relatively better at minimizing the error of concern. The theory can also be used to recognize potential weak points in more complex organizations and allow us to reinforce them with regard to the particular error of concern. Explicitly incorporating multiple forms of error in our analysis will open up new avenues in the public administration research agenda. For example, more work could be done on the study of agency incentives to pursue different forms of reliability.
This work helps us to understand how agencies may adjust their structural design to meet the demands for different forms of reliability. The question of why agencies make the choices they do is one that would be of great interest to political scientists. As Moe (1990) notes, structural choice is, at heart, a political issue. This is particularly true when we recognize that agencies, endowed with limited resources, must make trade-offs among the different forms of reliability. These choices have political consequences and are influenced by political factors.

Further research could also be done comparing component and system reliability. Until now, the focus has been on the system as a whole, not the individual components. Concentrating on component issues raises a whole new set of questions. What factors affect component reliability? To what extent is component reliability influenced by the strategic behavior of agents? To what extent can and should agencies pursue reliability objectives through increased component reliability, rather than through structural changes? Answering these questions will also contribute to the development of a more general theory of organizational reliability and agency behavior.

In the end, we must recognize that a theory of organizational reliability is not sufficient to eliminate risks altogether. Space exploration is still a risky business; it always has been and always will be. The men and women who have gone into space understood these risks. Even with the technological advances resulting from Project Apollo and the organizational changes after the Challenger, there remains a large element of danger in manned space flight. According to a report by the Office of Technology Assessment in April 1990, there is a 50% chance of losing another orbiter over the next 34 missions.
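The OTA figure can be restated on a per-mission basis. The sketch below is my own back-of-the-envelope illustration, not the OTA's calculation; it assumes, purely for illustration, statistically independent missions with a constant per-mission loss probability p, so that a 50% cumulative risk over 34 missions corresponds to solving 1 - (1 - p)^34 = 0.5. The function name per_mission_loss is hypothetical.

```python
# Back-of-the-envelope restatement of the cited OTA estimate.
# Assumption (for illustration only): missions are independent, each with
# a constant loss probability p, so the cumulative risk over n missions
# is 1 - (1 - p)**n.

def per_mission_loss(cumulative_risk: float, missions: int) -> float:
    """Solve 1 - (1 - p)**missions = cumulative_risk for p."""
    return 1.0 - (1.0 - cumulative_risk) ** (1.0 / missions)

if __name__ == "__main__":
    p = per_mission_loss(0.5, 34)
    print(f"Implied per-mission loss probability: {p:.1%}")
```

Under this simple model, the 50% figure over 34 missions implies a loss probability of roughly 2% per mission.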
Understanding organizational reliability cannot eliminate these risks altogether, but it can help to ensure that NASA's next technical problem is not compounded by managerial mistakes.

APPENDIX

Proof That a k-out-of-m Unit Network Reduces to a Serial System Where k = m

Since the success or failure of each of the m units can be modeled as a Bernoulli process, we can use the binomial distribution to describe the system reliability of a k-out-of-m unit network. Letting f represent the probability of failure for each independent unit, the overall reliability of a k-out-of-m unit network may be represented mathematically as

    R_{k|m} = \sum_{i=k}^{m} \binom{m}{i} (1 - f)^{i} f^{m-i}.

When k = m, this expression reduces to

    R_{m|m} = \binom{m}{m} (1 - f)^{m} f^{0},

which further reduces to R = (1 - f)^{m}. Note that this expression is equivalent to equation 1, which represents the reliability of a serial system in a two-state world.

Series System Optimization

The system reliability of a series network consisting of m identical and independent components is

    R_s = (1 - \beta)^{m} - \alpha^{m}.

Differentiating this equation with respect to m and setting the result equal to zero will give us the optimal number of components to be linked in series in order to maximize the system's reliability:

    \frac{\partial R_s}{\partial m} = [(1 - \beta)^{m} \ln(1 - \beta)] - [\alpha^{m} \ln \alpha] = 0

    (1 - \beta)^{m} \ln(1 - \beta) = \alpha^{m} \ln \alpha.

Taking the natural log of each side and grouping like terms,

    m \ln\left(\frac{1 - \beta}{\alpha}\right) = \ln\left(\frac{\ln \alpha}{\ln(1 - \beta)}\right)

    m^{*} = \frac{\ln\left(\ln \alpha / \ln(1 - \beta)\right)}{\ln\left((1 - \beta)/\alpha\right)}.

Parallel System Optimization

The system reliability of a parallel network consisting of n identical and independent components is

    R_p = (1 - \alpha)^{n} - \beta^{n}.

Differentiating this equation with respect to n and setting the result equal to zero will give us the optimal number of components to be linked in parallel in order to maximize the system's reliability:

    \frac{\partial R_p}{\partial n} = [(1 - \alpha)^{n} \ln(1 - \alpha)] - [\beta^{n} \ln \beta] = 0

    (1 - \alpha)^{n} \ln(1 - \alpha) = \beta^{n} \ln \beta.

Taking the natural log of each side and grouping like terms,

    n \ln\left(\frac{1 - \alpha}{\beta}\right) = \ln\left(\frac{\ln \beta}{\ln(1 - \alpha)}\right)

    n^{*} = \frac{\ln\left(\ln \beta / \ln(1 - \alpha)\right)}{\ln\left((1 - \alpha)/\beta\right)}.

Notes

A previous version of this paper was presented at the 1992 Midwest Political Science Association Meeting. I am indebted, for their helpful comments, to Don Chisholm, John Gilmour, Jim Granato, Tom Hammond, Paula Kearns, Jack Knight, Jack Knott, and Bill Lowry.

1. A three-state device is a unit that can be described as either operating correctly, committing a type I error, or committing a type II error. One might argue that what we have here are four-state devices, inasmuch as there are two types of correct decisions as well. While this is true, the focus of reliability theory has historically been on the analysis and impact of failure. As a result, it is the norm to refer to multiple-error units as three-state devices.

2. While it is later relaxed, making this assumption is not unreasonable. Measuring personnel productivity is a subject that has been discussed in industrial engineering and other areas of management science. Research from these areas makes it possible to generate statistically sound estimates of component performance and reliability.

3. The reliability-and-quality-assurance function at NASA refers to those offices and individuals specifically charged with agency oversight on matters of reliability and safety. This function is not limited to a single office and has undergone a number of changes over time.

4. The Soviets' control over the media allowed them to limit their coverage to the successful flights. In contrast, the United States' first attempt to launch a satellite was a highly publicized failure.
The United States learned from this embarrassing experience and tried to minimize press coverage in later launch activities. Such news can rarely be kept a complete secret in our society, however; and the continued contrast between Soviet success and American failure had a political impact at home and abroad.

5. It may well be that considering the policy over a longer time horizon would demonstrate the cost-effectiveness of type I reliability. However, the American political system is notorious for its myopic approach to many public policy issues. After the Challenger, many experts were critical of the space agency's shortsightedness and claimed that such behavior was irrational. If the political environment of the agency is factored in, the policy outcomes may seem undesirable; but they are not irrational from the agency's perspective (Heimann 1990, 1991).

6. While these two offices had overlapping responsibility, the Reliability and Quality Assurance Office was primarily concerned with hardware and technological issues, while the Safety Office focused on human factors and included responsibility for worker safety procedures.

7. Special Announcement, February 6, 1973, "Establishment of Office of Safety and Reliability and Quality Assurance," NASA History Office Library.

8. Figures on reliability-and-quality-assurance manpower were calculated from numbers provided to Congress by the chief engineer and by the Office of Safety and Mission Quality and the Office of Human Resources at NASA.

9. Mark Tapscott, "Cuts Hurt NASA's Safety," Washington Times, 5 March 1986.

10. Directors at each center have traditionally been given flexibility to set R&QA staffing and funding for projects under their control.
Before the Challenger, R&QA officials at headquarters could make recommendations about center staffing but had no direct control over this process, so as to reinforce the independent nature of the field center units.

11. This assumption would be consistent with the belief that NASA is generally risk-averse with regard to type I error. As noted earlier, however, increasing type I reliability often raises the likelihood that a type II error could occur.

12. News Release, 5 March 1974, no. 74-31, NASA History Office Library.

13. Memorandum, assistant associate administrator for management operations to associate administrator for management operations, September 7, 1978, "Institutional Assessment Presentation to OMB," NASA History Office Library.

References

Barlow, Richard E., and Frank Proschan. 1965. Mathematical Theory of Reliability. New York: John Wiley & Sons.

Bendor, Jonathan B. 1985. Parallel Systems: Redundancy in Government. Berkeley: University of California Press.

Chandler, Alfred. 1962. Strategy and Structure: Chapters in the History of Industrial Enterprise. Cambridge, MA: MIT Press.

Chisholm, Donald W. 1989. Coordination Without Hierarchy: Informal Structures in Multiorganizational Systems. Berkeley: University of California Press.

Heimann, C. F. Larry. 1990. "Reliability Versus Efficiency: Striking a Balance in the Political Arena." Presented at the Annual Meeting of the Southern Political Science Association, Atlanta, November 8-10.

Heimann, C. F. Larry. 1991. Acceptable Risks: A Theory of Organizational Reliability and Agency Behavior. Ph.D. dissertation, Washington University, St. Louis.

Landau, Martin. 1969. "Redundancy, Rationality, and the Problem of Duplication and Overlap." Public Administration Review 29(4) (July-August): 346-358.

Landau, Martin. 1973. "Federalism, Redundancy, and System Reliability." Publius 3(2) (Fall): 173-196.

Landau, Martin. 1991.
"On Multiorganizational Systems in Public Administration." Journal of Public Administration Research and Theory 1(1) (January): 5-18.

LaPorte, Todd R., and Paula M. Consolini. 1991. "Working in Practice But Not in Theory: Theoretical Challenges of High-Reliability Organizations." Journal of Public Administration Research and Theory 1(1) (January): 19-47.

Lerner, Allan W. 1986. "There Is More Than One Way To Be Redundant." Administration and Society 18(3) (November): 334-59.

McConnell, Malcolm. 1987. Challenger: A Major Malfunction. Garden City, NY: Doubleday & Company.

Moe, Terry M. 1990. "The Politics of Structural Choice: Towards a Theory of Public Bureaucracy." In Organization Theory, ed. Oliver E. Williamson. New York: Oxford University Press.

National Aeronautics and Space Administration. 1983. The NASA Organization. Internal report: NASA.

Perrow, Charles. 1984. Normal Accidents: Living with High-Risk Technologies. New York: Basic Books.

Pressman, Jeffrey L., and Aaron Wildavsky. 1973. Implementation. Berkeley: University of California Press.

Rochlin, Gene I., Todd R. LaPorte, and Karla H. Roberts. 1987. "The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea." Naval War College Review (Autumn).

Smith, Robert W. 1989. The Space Telescope. Cambridge, England: Cambridge University Press.

U.S. Presidential Commission on the Space Shuttle Challenger Accident. 1986. Report to the President. Washington: The Commission.

Weiss, Howard M. 1971. "NASA's Quality Program-Achievements and Forecast." Presented at the 25th ASQC Technical Conference, Chicago.

C. F. Larry Heimann is Assistant Professor of Political Science, Michigan State University, East Lansing, MI 48824-1032.