
Journal of Business and Behavioral Sciences, Vol. 25, No. 1; Spring 2013

THE RISE OF STATISTICAL SIGNIFICANCE TESTING IN PUBLIC ADMINISTRATION RESEARCH AND WHY THIS IS A MISTAKE

Raymond Hubbard
C. Kenneth Meyer
Drake University

ABSTRACT: The growth of statistical significance testing in articles published in the Public Administration Review for the period 1945 through 2008 is examined. Comparisons with sister journals, the American Political Science Review and the American Journal of Political Science, show this growth to be less emphatic and of more recent origin than theirs. That public administration researchers are not yet quite entrenched in this practice is good, because statistical significance testing, with its focus on p-values, is largely ritualistic and adds almost nothing of scientific value to a study. The justification for this conclusion is presented. Instead of the infatuation with p-values, we encourage public administration researchers to report and interpret sample statistics, effect sizes, and their confidence intervals in empirical work. This offers a better prospect for developing cumulative knowledge within the field.

INTRODUCTION

It is evident that the current practice of focusing exclusively on a dichotomous reject-nonreject decision strategy of null hypothesis testing can actually impede scientific progress. I suspect that the continuing appeal of null hypothesis significance testing is that it is considered to be an objective scientific procedure for advancing knowledge. In fact, focusing on p-values and rejecting null hypotheses actually distracts us from our real goals: deciding whether data support our scientific hypotheses and are practically significant. The focus of research should be on our scientific hypotheses, what data tell us about the magnitude of effects, the practical significance of effects, and the steady accumulation of knowledge (Kirk, 2003, p. 100).

A heated debate over the merits of quantitative versus qualitative research methodologies has occurred recently in a number of public administration and public policy journals, especially the Journal of Public Administration Research and Theory and Administration and Society (see, e.g., Gill and Meier, 2000; Luton, 2007, 2008; Lynn, Heinrich, and Hill, 2008; Meier and O'Toole, 2007). Because our paper deals with methodology, we would like readers to know our own perspective on this debate from the outset. Both of us are staunch advocates of a postpositivist philosophy of science, credited to Bhaskar (1978, 1979), called critical realism. While incorporating aspects of both, critical realism provides an alternative philosophy to those found wanting: positivism/empiricism on the one hand and relativism/interpretivism on the other (Sayer, 2000). In essence, this philosophy states that, first, the world exists independently of our knowledge of it. Second, the goal of science is to create genuine, but always fallible, knowledge about the world. That knowledge is produced socially, and hence is theory laden, does not make it theory determined (Sayer, 2000). Third, all theories concerning knowledge claims must be subject to critical evaluation; knowledge is not immune to empirical check (Sayer, 1992). It is through this critical evaluation of competing theories that the scientific community, over time, is able to decide which ones to retain and which to discard. That is, not all knowledge is equally fallible (Smith, 2006).
Critical realism, then, sees value in both quantitative and qualitative research approaches to the examination of knowledge claims. Both camps can help in the triangulation of findings. It must be acknowledged, however, that throughout our careers we have been engaged chiefly with quantitative research. But this should not be read as a ringing endorsement of a quantitative research orientation. This brings us to the subject of this paper, one which finds much at fault with mainstream quantitative work, especially its obsession with p-values.

In many social science journals the p-value from a statistical significance test is a staple of empirical research. This same index is now appearing with greater frequency in the pages of the Public Administration Review (PAR). We agree with Gill and Meier (2000, p. 163) that this is not a welcome trend, and we show in some depth why this is the case. While considered de rigueur in the social sciences, tests of statistical significance are largely bereft of value in the analysis of data. Statistical significance testing is mostly a ritual, a "meaningless parlor game" (Ziliak and McCloskey, 2008, p. 2), which appears to lend "scientific" respectability to the research enterprise. It lends no such thing.

This paper presents evidence on the growth of statistical significance testing and p-values in empirical work published in the PAR over the period 1945 through 2008. For purposes of comparison, we do likewise with sister journals the American Political Science Review (APSR) and the American Journal of Political Science (AJPS). Next, the popularity of p-values in empirical public administration and political science research (and, for that matter, the social sciences in general) is explained. This revolves around, first, the desire for "scientific" credibility in these disciplines, and the role that statistical analysis might play in this endeavor. A second reason for their omnipresence is that researchers often have little idea of what a p-value is, other than somehow being associated with "statistical significance." We show, therefore, exactly what a p-value is and how, even when examined on its own terms, it is a very poor measure of statistical evidence. The third reason for the popularity of p-values is an extension of the second; because most researchers don't know what a p-value is, it is erroneously invested with all kinds of powerful capabilities it simply does not have. The popularity of p-values, and their attendant limitations and uses, is acknowledged despite the vast body of scholarly literature that has called their worth in empirical research into question over the last fifty years.
For instance, Bakan refers to the test of significance as being "…essentially mindlessness in the conduct of research" (Bakan, 1966); Hunter suggested "A Ban on the Significance Test" (Hunter, 1997); Hubbard and Ryan found it unfathomable that "…a methodology as bereft of value as SST [statistical significance testing] has survived…more than four decades of criticism in the psychology literature" (Hubbard and Ryan, 2000); Ziliak and McCloskey "…say that a finding of 'statistical' significance…is on its own almost valueless, a meaningless parlor game" (Ziliak and McCloskey, 2008); and Stang, Poole, and Kuss, noting the misunderstanding of the p-value, assert: "The ubiquitous misuse and tyranny of SST [statistical significance test] threatens scientific discoveries and may even impede scientific progress [and] harm patients…" (Stang, Poole, and Kuss, 2010). A compilation of these conclusions and those of many other researchers is presented in Table 1.

Table 1: Overt Criticism of the Worth of Statistical Significance Testing (NHST)

Bakan (1966, p. 436): "…the test of significance in psychological research may be taken as an instance of a kind of essential mindlessness in the conduct of research…"

Carver (1978, p. 378): "The emphasis on statistical significance over scientific significance in educational research represents a corrupt form of the scientific method."

Cohen (1990, p. 1310): "I believe…that hypothesis testing has…diverted our attention from crucial issues. Mesmerized by a single all-purpose, mechanized, 'objective' ritual in which we convert numbers into other numbers and get a yes–no answer, we have come to neglect close scrutiny of where the numbers came from."

Cohen (1994, p. 997): "…null hypothesis significance testing (NHST; I resisted the temptation to call it statistical hypothesis inference testing)…"

Cox (1977, p. 60): "As noted…there are considerable dangers in overemphasizing the role of significance tests in the interpretation of data."

Cox (1982, pp. 327–328): "The criterion for publication should be the achievement of reasonable precision and not whether a significant effect has been found."

Cox (1986, p. 120): "It has been widely felt, probably for 30 years or more, that significance tests are overemphasized and often misused and that more emphasis should be put on estimation and predictions."

Falk and Greenbaum (1995, pp. 75-76): "Our position is that the prevalence of the significance-testing practice is due not only to mindlessness and the force of habit…there are profound psychological reasons leading scholars to believe that they cope with the question of chance and minimize their uncertainty via producing a significant result."

Greenwald (1975, p. 19): "…it is to be hoped that journal editors will base publication decisions on criteria of importance and methodological soundness, uninfluenced by whether a result supports or rejects a null hypothesis."

Guttman (1985, p. 4): "We shall marshal arguments against such [statistical significance] testing, leading to the conclusion that it be abandoned by all substantive science and not just by educational research and other social sciences which have begun to raise voices against the virtual tyranny of this branch of inference in the academic world."

Hubbard and Ryan (2000, p. 678): "It seems inconceivable to admit that a methodology as bereft of value as SST [statistical significance testing] has survived, as the centerpiece of inductive inference no less, more than four decades of criticism in the psychology literature."

Hunter (1997, p. 3): "Needed: A Ban on the Significance Test."

Loftus (1996, p. 162): "…I believe the reliance on NHST [Null Hypothesis Significance Testing] has channeled our field into a series of methodological cul-de-sacs…"

Lykken (1968, p. 158): "The moral of the story is that the finding of statistical significance is perhaps the least important attribute of a good experiment: it is never a sufficient condition for concluding that a theory has been corroborated, that a useful empirical fact has been established with reasonable confidence—or that an experimental report ought to be published."

McCloskey and Ziliak (1996, p. 111): "We would not assert that every economist misunderstands statistical significance, only that most do, and these [are] some of the best economic scientists."

Morrison and Henkel (1970, p. v): "Even their strongest proponents [of statistical significance testing] agree that there is much misuse, misinterpretation, and meaningless use of the tests."

Nelder (1999, p. 257): "The kernel of these non-scientific procedures is the obsession with significance tests as the end point of any analysis."

Nester (1996, p. 407): "Clearly, point hypothesis testing has no place in statistical practice… This means that most paired and unpaired t-tests, analyses of variance…linear contrasts and multiple comparisons, and tests of significance for correlation and regression coefficients should be avoided by statisticians and discarded from the scientific literature."

Rosnow and Rosenthal (1989, p. 1277): "It may not be an exaggeration to say that for many PhD students, for whom the .05 alpha has acquired almost an ontological mystique, it can mean joy, a doctoral degree, and a tenure-track position at a major university if their dissertation p is less than .05. However, if the p is greater than .05, it can mean ruin, despair, and their advisor's suddenly thinking of a new control condition that should be run. …surely, God loves the .06 nearly as much as the .05."

Rozeboom (1960, p. 417): "The thesis to be advanced is that despite the awesome pre-eminence this method has attained in our experimental journals and textbooks of applied statistics, [the Null Hypothesis Significance Test] is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research."

Rozeboom (1997, p. 335): "Null-hypothesis significance testing is surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students."

Salsburg (1985, p. 220): "And it provides Salvation: Proper invocation of the religious dogmas of Statistics will result in publication in prestigious journals. This form of Salvation yields fruit in this world (increases in salary, prestige, invitations to speak at meetings) and beyond this life (continual references in the citation indexes)."

Schmidt (1996, p. 116): "My conclusion is that we must abandon the statistical significance test."

Schmidt and Hunter (1997, p. 57): "Significance testing never makes a useful contribution to the development of cumulative knowledge."
Schmidt and Hunter (2002, p. 65): "…most researchers in the physical sciences regard reliance on significance testing as unscientific…"

Shrout (1997, p. 1): "Significance testing of null hypotheses is the standard epistemological method for advancing scientific knowledge in psychology, even though it has drawbacks and it leads to common inferential mistakes."

Stang, Poole, and Kuss (2010, p. 1): "…the P-value is perhaps the most misunderstood statistical concept in clinical research… The ubiquitous misuse and tyranny of SST [statistical significance testing] threatens scientific discoveries and may even impede scientific progress [and] harm patients…"

Tryon (1998, p. 796): "…NHST, the fact that statistical experts and investigators publishing in the best journals cannot consistently interpret the results of these analyses is extremely disturbing. Seventy-two years of education have resulted in miniscule, if any, progress toward correcting this situation. It is difficult to estimate the handicap that widespread, incorrect, and intractable use of a primary data analytic method has on a scientific discipline, but the deleterious effects are undoubtedly substantial and may be the strongest reason for adopting other data analytic measures."

Walster and Cleary (1970, p. 16): "A virtual prerequisite for the publication of research in the social sciences is the attainment of statistical significance."

Ziliak and McCloskey (2008, p. 2): "We…say that a finding of 'statistical' significance…is on its own almost valueless, a meaningless parlor game."

So while a p-value is of only trivial scientific importance, it has nevertheless emerged as the most decisive arbiter in interpreting research outcomes. This truly is a remarkable state of affairs. Rather than recording p-values, we recommend instead that researchers in public administration (and political science) would better serve their fields by reporting and discussing sample statistics, effect sizes, and the confidence intervals around them. This strategy is a more productive one for acquiring a cumulative body of scientific knowledge.

THE PUBLICATION FREQUENCY OF STATISTICAL SIGNIFICANCE TESTING IN PUBLIC ADMINISTRATION AND POLITICAL SCIENCE EMPIRICAL RESEARCH

Based upon a simple random sample of one issue of each journal per year, Tables 2, 3, and 4 show the publication frequency of both empirical research and empirical research employing p-values, in the PAR and APSR from 1945 through 2008, and in the AJPS from 1957 through 2008, respectively.

Table 2: The Growth of Statistical Significance Testing in the Public Administration Review, 1945–2008

Years     Total Papers   Empirical Papers   % of Total   Empirical Papers Using Significance Tests   % of Empirical
1945–49   33             -                  -            -                                            -
1950–59   56             1                  1.8          -                                            -
1960–69   68             11                 16.2         -                                            -
1970–79   105            14                 13.3         2                                            14.3
1980–89   99             38                 38.4         16                                           42.1
1990–99   91             28                 30.8         20                                           71.4
2000–08   73             33                 45.2         28                                           84.8

Note: Obtained from a content analysis of a randomly selected issue of the PAR for each year from 1945 through 2008.

Table 3: The Growth of Statistical Significance Testing in the American Political Science Review, 1945–2008

Years     Total Papers   Empirical Papers   % of Total   Empirical Papers Using Significance Tests   % of Empirical
1945–49   49             2                  4.1          -                                            -
1950–59   99             20                 20.2         4                                            20.0
1960–69   100            38                 38.0         13                                           34.2
1970–79   125            70                 56.0         50                                           71.4
1980–89   117            74                 63.2         54                                           73.0
1990–99   113            62                 54.9         58                                           93.5
2000–08   80             42                 52.5         38                                           90.5

Note: Obtained from a content analysis of a randomly selected issue of the APSR for each year from 1945 through 2008.
Table 4: The Growth of Statistical Significance Testing in the American Journal of Political Science, 1957–2008

Years     Total Papers   Empirical Papers   % of Total   Empirical Papers Using Significance Tests   % of Empirical
1957–59   15             7                  46.7         -                                            -
1960–69   50             31                 62.0         11                                           35.5
1970–79   99             87                 87.9         47                                           54.0
1980–89   103            89                 86.4         76                                           85.4
1990–99   135            113                83.7         108                                          95.6
2000–08   112            97                 86.6         95                                           97.9

Note: Obtained from a content analysis of a randomly selected issue of the AJPS for each year from 1957 through 2008. The AJPS initially was called the Midwest Journal of Political Science.

The tables reveal some interesting patterns. For example, the PAR has a history, continued to this day, in which empirical research does not dominate its pages. As displayed in Table 2, only 45.2% of papers published in the PAR during 2000-2008 were empirical. A similar picture can be seen with regard to the APSR in Table 3, where 52.5% of research for this same time period is empirical. On the other hand, as revealed in Table 4, this balance between empirical and conceptual articles found in the PAR and APSR is absent in the AJPS, where data-based research has occupied well over 80% of its contents since the 1970s.

Tables 2, 3, and 4 also show for the three journals the percentage of empirical work using tests of statistical significance. Increasing reliance on these methods is clearly seen. Based on our sample data, none of the journals used significance tests in the 1945–1949 period. In stark contrast, by 2000–2008 the percentages of empirical articles employing p-values in the PAR, APSR, and AJPS were 84.8%, 90.5%, and 97.9%, respectively. The reporting of these indexes is now seen to be well-nigh indispensable in empirical research. Further, it is noteworthy that in comparison with the AJPS and APSR, the PAR is a relative newcomer with respect to the usage of statistical significance testing. In our sample, the first occurrence of p-values in the PAR was in the 1970s, when 14.3% of empirical work adopted them, while the corresponding figures are 54.0% for the AJPS and 71.4% for the APSR. However, we view this slower embrace of p-values in the PAR with approval. Despite their near universality, p-values for the most part are scientifically meaningless. This raises the question: Why the hegemony of p-values in empirical research? As noted earlier, there seem to be three major reasons for this dominance. First, there is the desire for scientific respectability among those in the social sciences; second, there is widespread confusion over what p-values are; and third, because of these misunderstandings, p-values are imbued with many useful features they do not possess. These three issues are discussed below.

Why the Hegemony of P-Values? The Desire for Scientific Authenticity in the Social Sciences

From the outset there were aspirations of establishing the scientific legitimacy of the social sciences, political science included. Thus, for example, Henry Jones Ford wrote in the early twentieth century that a goal of political science should be to provide "universal principles permanent in their applicability" (Ross, 1991, p. 288), which Abbott Lawrence Lowell thought might be attained via the use of statistical methods (Ross, 1991, p. 290).
But by far the most influential person in the adoption of statistical techniques in the social sciences was the eminent statistician Ronald A. Fisher.

Encouragingly, Fisher pointed out that "Statistical methods are essential to social studies, and it is principally by the aid of such methods that these studies may be raised to the rank of sciences" (1970, p. 2). He promoted the role of significance tests and p-values in the numerous editions of his ground-breaking books Statistical Methods for Research Workers (1925) and The Design of Experiments (1935). For Fisher, a significance test is a method for assessing the probability (p-value) of an outcome on a null hypothesis (H0) of zero effect or relationship. That is, the investigator proposes a null hypothesis that a sample comes from a hypothetical infinite population with a known sampling distribution. The null hypothesis is said to be rejected if the sample estimate differs from the mean of the sampling distribution by more than a specified criterion, the level of significance (Gigerenzer & Murray, 1987; Hubbard & Bayarri, 2003). Fisher then cemented this criterion at p ≤ .05: "It is usual and convenient for experimenters to take 5 percent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard" (1966, p. 13). Moreover, Fisher viewed the p-value as an objective way of judging the (im)plausibility of the null hypothesis: "…the feeling induced by a test of significance has an objective basis in that the probability statement on which it is based is a fact communicable to and verifiable by other rational minds. The level of significance in such cases fulfils the conditions of a measure of the rational grounds for the disbelief [in the null hypothesis] it engenders" (1959, p. 43).

So here is arguably the greatest statistician of all time with a message for members of the "social studies" that the way to elevate their fields to the rank of sciences is through the adoption of statistical methods. In addition, he provides them (and others) with an ostensibly objective means of adjudicating knowledge claims: be sure to reject the null hypothesis at p ≤ .05. It was advice that was greeted enthusiastically by social science researchers following World War II, and its uptake intensified with the continuing growth, to the present day, of both hardware and software that make the computation of p-values effortless. But the great man of statistics misled the scientific community with his emphasis on significance tests and p-values. Even judged on their own terms, they are of marginal scientific worth. To see why the p-value is in no way commensurate with its ubiquity, it is instructive to state explicitly what it is. This is done in the next section.

What is a P-Value?

Fisher used discrepancies in the data to reject the null hypothesis, H0. So he calculated the probability of the data on a true null hypothesis, or Pr(x | H0), where x stands for the data. More formally, p = Pr(T(X) ≥ T(x) | H0). The p-value is the probability of getting a test statistic T(X) greater than or equal to the observed result, T(x), in addition to more extreme, unobserved ones, assuming a true null hypothesis of no effect or association. The rationale is that if the data are judged to be rare or discrepant under H0, this constitutes inductive evidence against H0. Consequently, values like p < .01 and p < .001 are seen to indicate even greater evidence against H0 than p ≤ .05.
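To make this definition concrete, the following is a minimal Python sketch (ours, not part of the original article; the sample data and the unit-variance null are invented assumptions). It computes p = Pr(T(X) ≥ T(x) | H0) for a one-sample location test by simulating the sampling distribution of the test statistic under a true null:

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical observed sample (assumed data, for illustration only)
    x = rng.normal(loc=0.4, scale=1.0, size=30)

    def t_stat(sample):
        # Two-sided test statistic: absolute standardized sample mean
        return abs(sample.mean()) / (sample.std(ddof=1) / np.sqrt(len(sample)))

    t_obs = t_stat(x)

    # Sampling distribution of T(X) under H0: population mean = 0
    # (unit variance assumed known, purely to keep the sketch short)
    t_null = np.array([t_stat(rng.normal(0.0, 1.0, size=len(x)))
                       for _ in range(100_000)])

    # p-value: share of null-world results at least as extreme as the observed one
    p_value = np.mean(t_null >= t_obs)
    print(f"T(x) = {t_obs:.3f}, simulated p = {p_value:.4f}")

Note what is, and is not, being computed here: the probability refers to hypothetical data generated under H0, not to the probability that H0 itself is true. That asymmetry is the source of the misconceptions discussed next.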
Most applied researchers believe that a p-value of .05 means that there is only a 5% probability that the results are due to "chance" (Berger & Sellke, 1987). Although ruling out chance explanations for findings is an admirable goal, the p-value does not, in fact, say that there is only a 5% probability that the results are due to chance. It says much less. To see this, note from the definition above that a p-value is a conditional probability, one whose calculation is dependent on the truth of the null hypothesis. This seemingly innocuous caveat, which is habitually overlooked, has important implications. It means that a p-value for an outcome is calculated on the assumption of a zero difference between a pair of means, or a zero correlation between two variables, in the population. But findings that a difference between two means, or a correlation between two variables, is not exactly zero, and that such a departure from zero would arise from sampling error alone only 5% (p = .05) of the time, are trivial findings, and rarely of interest to scientists. Sadly, such findings are the lingua franca of empirical social science.

In addition, the usefulness of p-values as credible measures of evidence is severely challenged in a study by Sellke, Bayarri, and Berger (2001). They used an applet, accessible at www.stat.duke.edu/~berger, which permits a simulation of a long series of significance tests on normal data of the form H0: θ = 0; HA: θ ≠ 0. This is a point null hypothesis, the kind routinely tested in the social sciences. The simulation records how often H0 is true for p-values in given ranges, such as those approximately equal to .05 or .01. Of concern, Sellke et al. (2001) demonstrate that "statistically significant" outcomes near the .05 or .01 levels often come from true null hypotheses. Specifically, they found that in tests for which the p-value is close (e.g., .049) to the .05 level, at least 22% (and typically about 50%) came from true nulls. Thus, a p-value of .05 may constitute no evidence at all against H0. (A minimal simulation in this spirit is sketched at the end of this section.)

While true nulls may be specified in theoretical research, as above, this is not so in applied work. Taken literally, point null hypotheses of precisely zero differences between means or precisely zero correlations between variables do not exist in nature. In the real world point null hypotheses are always false, even if only to some small degree, such that large enough samples will lead to their rejection. Or as another leading statistician, John Tukey (1991, p. 100), put it: "All we know about the world teaches us that the effects of A and B are always different — in some decimal place — for any A and B. Thus asking 'Are the effects different?' is foolish." But if the point null hypothesis is always false, what is the point of testing it?

Indeed, there is a growing literature, summarized by Hubbard and Lindsay (2008), showing the p-value to be a very poor measure of evidence in data analysis. A common thread running through much of this literature is that p-values exaggerate the evidence against H0, thereby making the Holy Grail of "statistically significant" results easy to attain. Unfortunately, this means that the validity of much published work, even work with p ≤ .05 results, must be called into question. We have seen that the p-value, examined on its own merits, is a minor statistic, and certainly not one deserving of the center stage it holds in empirical investigations.
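The applet itself belongs to Sellke et al.; what follows is only a minimal sketch in the same spirit (the 50/50 split between true and false nulls, the effect size of 0.5, and the sample size are our illustrative assumptions, not figures from the article):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, n_studies = 50, 200_000

    # Half the simulated studies have a true null (theta = 0),
    # half a genuine effect (theta = 0.5); both proportions are assumptions.
    theta = np.where(rng.random(n_studies) < 0.5, 0.0, 0.5)

    # One z-test per study: the standardized mean of n unit-variance
    # observations is distributed N(theta * sqrt(n), 1).
    z = rng.normal(loc=theta * np.sqrt(n), scale=1.0)
    p = 2 * stats.norm.sf(np.abs(z))  # two-sided p-values

    # Among "significant" results with p just under .05,
    # how many actually came from true nulls?
    near_05 = (p > 0.04) & (p < 0.05)
    share_true = np.mean(theta[near_05] == 0.0)
    print(f"Share of p-values near .05 arising from true nulls: {share_true:.2f}")

Under these assumptions the share comes out near one half, consistent with the "at least 22%, and typically about 50%" finding reported above.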
This center-stage status has come about because too many researchers do not appreciate the strictly limited role the p-value plays. They have instead embellished this index by erroneously investing it with all kinds of magical powers it simply does not have.

Common Misconceptions About P-Values

There is widespread misunderstanding, perpetuated in articles, textbooks, and in the classroom, about the capabilities of p-values (see Carver, 1978; Kline, 2004; and Nickerson, 2000). A brief discussion of these misconceptions follows.

A p-value is the probability of the null hypothesis being true. This is a restatement of the argument noted earlier that a p-value of .05 means there is only a 5% probability that the results are due to "chance," or Pr(H0 | x). But this is not the case. The p-value is Pr(x | H0), the probability of the data (and more extreme observations) conditional on a true null hypothesis.

A p-value is the probability (in the sense of 1 − p) of the alternative hypothesis being true. If a researcher gets a p-value of .05, this is generally taken to mean there is a .95 probability that HA is true, or Pr(HA | x). Not so. In the first place, Fisher never had an alternative hypothesis, or ever saw the need for one (Hubbard and Bayarri, 2003). The alternative hypothesis was introduced by Jerzy Neyman and Egon Pearson as a way of "improving" on Fisher's model. Second, only Bayesians can give probabilities of hypotheses; frequentist statisticians like Fisher and Neyman–Pearson (although of very different stripes) cannot.

A p-value is the probability (again in the sense of 1 − p) that the result will replicate. On this view, a p-value of .05 means that there is a .95 chance that the result(s) will replicate. This is false. There is no formal warrant for using p-values as measures of the replicability of results. Yet many academic psychologists in the UK (Oakes, 1986) and Germany (Gigerenzer, Krauss, and Vitouch, 2004), including, in the latter sample, those teaching statistics, subscribe to the 1 − p view of replication success.

A p-value measures the magnitude of an effect. This myth is promoted when researchers use language such as: p ≤ .05 is a significant result, p < .01 is a very significant result, and p < .001 is an extremely significant result, usually accompanied by *, **, and ***, respectively. But a p-value says nothing about the magnitude of an effect. A trivial effect with a large enough sample will be statistically significant; a large effect with too small a sample will not (see the sketch following this list). And this has dire implications for determining whether a result is of substantive or practical significance in any given field.

A p-value is a Type I error rate (α). Statistics textbooks in the social sciences typically present an anonymous hybrid of two incompatible frequentist paradigms, Fisher's and Neyman–Pearson's, as if it constituted a single, coherent method of statistical analysis (see Hubbard and Bayarri, 2003, for details). Because of this there are two entirely different conceptions of what "statistical significance" means. One is Fisher's p-value, a data-dependent random variable distributed uniformly over the interval [0, 1] under the null hypothesis, and a measure of inductive evidence against H0 applicable to individual studies. The other is Neyman–Pearson's α level, the rate of erroneous rejection of the null hypothesis, a fixed value that is specified prior to conducting the study and is of relevance only to hypothetical long-run repetitions of an experiment. The two conceptions of statistical significance could hardly be more incongruent, but since they both appear in the hybrid model it is not surprising to see that p-values routinely are misinterpreted as "data-adjusted" Type I error rates (Bayarri and Berger, 2000; Hubbard and Bayarri, 2003).

A p-value is a measure of the generalizability of a result. Another mistaken interpretation. A p-value yields no information about whether a result obtained in one set of circumstances will generalize to other contexts. And yet the attainment of statistically significant results in one study leads to over-optimism, of the 1 − p kind, regarding the generalizability of findings. The results of single studies with p ≤ .05 are credited with far broader application than they deserve.
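The magnitude misconception flagged in the list above is easy to demonstrate. In this minimal sketch (ours; the "trivial" standardized effect of 0.02 is an invented example), the same negligible effect moves from nonsignificant to overwhelmingly "significant" purely because n grows, while the estimated effect size stays trivial throughout:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    true_effect = 0.02  # a trivial standardized effect (hypothetical)

    for n in (100, 10_000, 1_000_000):
        a = rng.normal(0.0, 1.0, size=n)
        b = rng.normal(true_effect, 1.0, size=n)
        _, p = stats.ttest_ind(a, b)
        # Cohen's d: difference in means over the pooled standard deviation
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        d = (b.mean() - a.mean()) / pooled_sd
        print(f"n = {n:>9,} per group: p = {p:.4g}, Cohen's d = {d:.3f}")

The p-value collapses toward zero as n increases, yet Cohen's d hovers around 0.02 at every sample size: statistical significance and substantive significance are simply different questions.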
REPORT CONFIDENCE INTERVALS AROUND POINT ESTIMATES

Rather than the obsession with significance testing and p-values, the aim of empirical research in individual studies should be the estimation of sample statistics, effect sizes, and the confidence intervals (CIs) around them. The reporting of CIs focuses attention on estimation over testing. Scientific progress usually involves plausible estimates of the magnitude of effect sizes in the population (Edwards, 1992), and the CI provides this. CIs also incorporate the precision or reliability of the estimate through the width of the interval. Moreover, because they are expressed in the same units as the point estimate, CIs make it easier to judge whether the results are theoretically or substantively, as opposed to statistically, significant. And although we do not recommend its use in this fashion, a CI can be employed as a significance test; a 95% CI which does not include the null value (mostly zero) is equivalent to rejecting the hypothesis at the .05 level.

Also, and of crucial importance, initial findings must be replicated and extended. Once more, CIs play a fundamental role in this process. In particular, we suggest the criterion of overlapping CIs around sample statistics and effect sizes across similar studies as a measure of replication success. Overlapping CIs indicate agreement on estimates of the same population parameter(s). In this manner, use of CIs fosters cumulative knowledge development by stressing commonalities in the data, whereas p-values reward the search for differences. (A small worked example of this reporting style follows below.)

Finally, CIs are a frequentist statistical measure. Frequentist statistics is mainstream or orthodox statistics; it is what is taught in undergraduate and graduate courses throughout the world. Therefore, the transition to emphasizing CIs rather than p-values should be a fairly straightforward one, unlike teaching future generations of students a different methodological paradigm, such as Bayesian statistics.
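Here is a minimal sketch of that reporting style (our illustration; the two "studies" and their common population effect of 0.5 are simulated assumptions). Each study reports its mean difference with a 95% CI, and replication is judged by whether the intervals overlap:

    import numpy as np
    from scipy import stats

    def mean_diff_ci(a, b, level=0.95):
        """Mean difference and its CI, using the Welch approximation."""
        diff = b.mean() - a.mean()
        va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
        se = np.sqrt(va + vb)
        # Welch-Satterthwaite degrees of freedom
        df = se**4 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
        half = stats.t.ppf(1 - (1 - level) / 2, df) * se
        return diff, diff - half, diff + half

    rng = np.random.default_rng(7)
    # Two hypothetical studies estimating the same population effect (0.5)
    study1 = (rng.normal(0, 1, 60), rng.normal(0.5, 1, 60))
    study2 = (rng.normal(0, 1, 80), rng.normal(0.5, 1, 80))

    d1, lo1, hi1 = mean_diff_ci(*study1)
    d2, lo2, hi2 = mean_diff_ci(*study2)
    print(f"Study 1: {d1:.2f}, 95% CI [{lo1:.2f}, {hi1:.2f}]")
    print(f"Study 2: {d2:.2f}, 95% CI [{lo2:.2f}, {hi2:.2f}]")
    print("CIs overlap (replication by this criterion):", lo1 <= hi2 and lo2 <= hi1)

Unlike a bare p-value, each interval conveys both the magnitude of the estimated effect and the precision with which it is estimated, in the units of the outcome itself.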
CONCLUSIONS

We have shown how the p-value has been gaining in popularity in empirical work published in the PAR. However, it is not yet quite as firmly ensconced in the PAR as it is in the AJPS and APSR. Nor would we want it to be. Significance tests and p-values lend the appearance of scientific rigor to empirical research in public administration (and political science). Yet this appearance is purely deceptive. That the p-value, a statistic of such limited consequence, lies at the heart of social "scientific method" is scarcely credible. It has led to a situation in the social sciences where a very poor measure of statistical inference is now equated with scientific inference. And this is why we agree wholeheartedly with the distinguished statistician John Nelder (1999, p. 261) and his prescription to "demolish the P-value culture."

In summary, researchers in public administration should not follow the example of our colleagues in political science with their fixation on p-values. We understand, however, that there is substantial peer, reviewer, and editorial pressure to include p-values in empirical work. So report them if you must. But more importantly, it is vital to emphasize the role of sample statistics, effect sizes, and their CIs in the interpretation of data. The latter approach offers a far better route to the acquisition of a body of scientific knowledge in public administration research.

REFERENCES

Bakan, D. (1966). The Test of Significance in Psychological Research. Psychological Bulletin, 77, 423-437.
Bayarri, M. J., & Berger, J. O. (2000). P Values for Composite Null Models. Journal of the American Statistical Association, 95(4), 1127-1142.
Berger, J. O., & Sellke, T. (1987). Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with comments). Journal of the American Statistical Association, 82(1), 112-139.
Bhaskar, R. (1978). A Realist Theory of Science. Hassocks: Harvester Press.
Bhaskar, R. (1979). The Possibility of Naturalism. Hassocks: Harvester Press.
Carver, R. P. (1978). The Case Against Statistical Significance Testing. Harvard Educational Review, 48(3), 378-399.
Cohen, J. (1990). Things I Have Learned (So Far). American Psychologist, 45, 1304-1312.
Cohen, J. (1994). The Earth is Round (p < .05). American Psychologist, 49, 997-1003.
Cox, D. R. (1977). The Role of Significance Tests (with discussion). Scandinavian Journal of Statistics, 4, 49-70.
Cox, D. R. (1982). Statistical Significance Tests. British Journal of Clinical Pharmacology, 14, 325-331.
Cox, D. R. (1986). Some General Aspects of the Theory of Statistics. International Statistical Review, 54, 117-126.
Edwards, A. W. F. (1992). Likelihood (Expanded ed.). Baltimore, MD: Johns Hopkins University Press.
Falk, R., & Greenbaum, C. W. (1995). Significance Tests Die Hard: The Amazing Persistence of a Probabilistic Misconception. Theory & Psychology, 5, 75-98.
Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.
Fisher, R. A. (1935). The Design of Experiments. Edinburgh: Oliver and Boyd.
Fisher, R. A. (1959). Statistical Methods and Scientific Inference (2nd ed.). Edinburgh: Oliver and Boyd.
Fisher, R. A. (1966). The Design of Experiments (8th ed.). Edinburgh: Oliver and Boyd.
Fisher, R. A. (1970). Statistical Methods for Research Workers (14th ed.). New York: Hafner Publishing Company.
Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The Null Ritual: What You Always Wanted to Know About Significance Testing But Were Afraid to Ask. In D. Kaplan (Ed.), The SAGE Handbook of Quantitative Methodology for the Social Sciences (pp. 391-408). Thousand Oaks, CA: Sage.
Gigerenzer, G., & Murray, D. J. (1987). Cognition as Intuitive Statistics. Hillsdale, NJ: Erlbaum.
Gill, J., & Meier, K. J. (2000). Public Administration Research and Practice: A Methodological Manifesto. Journal of Public Administration Research and Theory, 10(1), 157-199.
Greenwald, A. G. (1975). Consequences of Prejudice Against the Null Hypothesis. Psychological Bulletin, 82, 1-20.
Guttman, L. (1985). The Illogic of Statistical Inference for Cumulative Science. Applied Stochastic Models and Data Analysis, 1, 3-10.
Hubbard, R., & Bayarri, M. J. (2003). Confusion Over Measures of Evidence (p's) Versus Errors (α's) in Classical Statistical Testing (with comments). The American Statistician, 57(3), 171-182.
Hubbard, R., & Lindsay, R. M. (2008). Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing. Theory & Psychology, 18(1), 69-88.
Hubbard, R., & Ryan, P. A. (2000). The Historical Growth of Statistical Significance Testing in Psychology—and Its Future Prospects (with discussion). Educational and Psychological Measurement, 60, 661-696.
Hunter, J. E. (1997). Needed: A Ban on the Significance Test. Psychological Science, 8, 3-7.
Kirk, R. E. (2008). The Importance of Effect Magnitude. In S. Davis (Ed.), Handbook of Research Methods in Experimental Psychology (pp. 83-105). Oxford: Wiley-Blackwell.
Kline, R. B. (2004). Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. Washington, DC: American Psychological Association.
Loftus, G. R. (1996). Psychology Will Be a Much Better Science When We Change the Way We Analyze Data. Psychological Science, 7, 161-171.
Luton, L. S. (2007). Deconstructing Public Administration Empiricism. Administration and Society, 39, 527-544.
Luton, L. S. (2008). Beyond Empiricists Versus Postmodernists. Administration and Society, 40, 211-218.
Lykken, D. T. (1968). Statistical Significance in Psychological Research. Psychological Bulletin, 70, 151-159.
Lynn, L. E., Heinrich, C. J., & Hill, C. J. (2008). The Empiricist Goose Has Not Been Cooked. Administration and Society, 40, 104-109.
McCloskey, D. N., & Ziliak, S. T. (1996). The Standard Error of Regressions. Journal of Economic Literature, 34, 97-114.
Meier, K. J., & O'Toole, L. J. (2007). Deconstructing Larry Luton: Or What Time is the Next Train to Reality Junction? Administration and Society, 39, 786-796.
Morrison, D. E., & Henkel, R. E. (Eds.) (1970). The Significance Test Controversy: A Reader. Chicago: Aldine.
Nelder, J. A. (1999). Statistics for the Millennium: From Statistics to Statistical Science. The Statistician, 48(Part 2), 257-269.
Nester, M. R. (1996). An Applied Statistician's Creed. Applied Statistics, 45(4), 401-410.
Nickerson, R. S. (2000). Null Hypothesis Statistical Testing: A Review of an Old and Continuing Controversy. Psychological Methods, 5(2), 241-301.
Oakes, M. (1986). Statistical Inference: A Commentary for the Social and Behavioural Sciences. Chichester: Wiley.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical Procedures and the Justification of Knowledge in Psychological Science. American Psychologist, 44, 1276-1284.
Ross, D. (1991). The Origins of American Social Science. Cambridge: Cambridge University Press.
Rozeboom, W. W. (1960). The Fallacy of the Null-Hypothesis Significance Test. Psychological Bulletin, 57, 416-428.
Rozeboom, W. W. (1997). Good Science is Abductive, Not Hypothetico-Deductive. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What If There Were No Significance Tests? (pp. 335-391). Mahwah, NJ: Erlbaum.
Salsburg, D. S. (1985). The Religion of Statistics as Practiced in Medical Journals. The American Statistician, 39, 220-223.
Sayer, A. (1992). Method in Social Science: A Realist Approach (2nd ed.). London: Routledge.
Sayer, A. (2000). Realism and Social Science. London: Sage.
Schmidt, F. L. (1996). Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for the Training of Researchers. Psychological Methods, 1(1), 115-129.
Schmidt, F. L., & Hunter, J. E. (1997). Eight Common But False Objections to the Discontinuation of Significance Testing in the Analysis of Research Data. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What If There Were No Significance Tests? (pp. 37-64). Mahwah, NJ: Erlbaum.
Schmidt, F. L., & Hunter, J. E. (2002). Are There Benefits From NHST? American Psychologist, 57, 65-66.
Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p Values for Testing Precise Null Hypotheses. The American Statistician, 55(1), 62-71.
Shrout, P. E. (1997). Should Significance Tests Be Banned? Psychological Science, 8, 1-2.
Smith, M. L. (2006). Overcoming Theory-Practice Inconsistencies: Critical Realism and Information Systems Research. Information and Organization, 16, 191-211.
Tryon, W. W. (1998). The Inscrutable Null Hypothesis. American Psychologist, 53, 796.
Tukey, J. W. (1991). The Philosophy of Multiple Comparisons. Statistical Science, 6(1), 100-116.
Walster, G. W., & Cleary, T. A. (1970). A Proposal for a New Editorial Policy in the Social Sciences. The American Statistician, 24, 16-19.
Ziliak, S. T., & McCloskey, D. N. (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor, MI: University of Michigan Press.
