Statistical Significance Tests are Unnecessary Even When Properly Done and Properly Interpreted: Reply to Commentaries

A Review of the Latest Literature on Whether Statistical Significance Tests Should Be Banned

1999

Controversy over the merits of Null Hypothesis Statistical Significance Testing (NHST) as a tool for advancing knowledge in the social sciences has intensified in recent years. The present paper reviews the literature concerning arguments both in favor of and opposed to the use of statistical significance tests and summarizes three major limitations of these tests. The first is that p values themselves cannot be used as indices of effect size. The second is the recognition that unlikely results are not necessarily interesting or important. The third is that p values do not bear on the important issue of result replicability, because statistical tests do not evaluate whether sample results will recur in the population. Finally, a summary is presented of what null hypothesis statistical significance tests can and cannot do.
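The first limitation — that p values are not indices of effect size — can be made concrete with a small sketch (not from the paper; a hypothetical two-sample z test with an assumed standardized mean difference). The same effect yields wildly different p values depending only on sample size:

```python
import math

def two_sided_p(d, sd, n):
    """Two-sided p value for a two-sample z test with equal group sizes n,
    raw mean difference d, and common standard deviation sd."""
    se = sd * math.sqrt(2.0 / n)            # standard error of the difference
    z = d / se
    return math.erfc(abs(z) / math.sqrt(2.0))

# Identical effect size (0.2 SD) at two sample sizes:
p_small = two_sided_p(d=0.2, sd=1.0, n=50)     # 50 per group
p_large = two_sided_p(d=0.2, sd=1.0, n=5000)   # 5000 per group

print(f"n=50:   p = {p_small:.3f}")   # well above .05
print(f"n=5000: p = {p_large:.2e}")   # far below .05, same effect
```

Since p conflates effect magnitude with sample size, it cannot by itself tell a reader how large an effect is.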

The Roles of Statistical Significance Testing In Research

The research methodology literature in recent years has included a full frontal assault on statistical significance testing. The purpose of this paper is to promote the position that, while significance testing as the sole basis for result interpretation is a fundamentally flawed practice, significance tests can be useful as one of several elements in a comprehensive interpretation of data. Specifically, statistical significance is but one of three criteria that must be demonstrated to establish a position empirically. Statistical significance merely provides evidence that an event did not happen by chance; it provides no information about the meaningfulness (practical significance) of an event or whether the result is replicable. Thus, we support other researchers who recommend that statistical significance testing be accompanied by judgments of the event's practical significance and replicability.

A review of post-1994 literature on whether statistical significance tests should be banned

annual meeting of the Southwest Educational …, 2000

The present paper summarizes the literature regarding statistical significance testing with an emphasis on (a) the post-1994 literature in various disciplines, (b) alternatives to statistical significance testing, and (c) literature exploring why researchers have demonstrably failed to be influenced by the 1994 APA publication manual's "encouragement" (p. 18) to report effect sizes. Also considered are defenses of statistical significance tests.

Null hypothesis significance tests: A mix-up of two different theories, the basis for widespread confusion and numerous misinterpretations

2014

Null hypothesis statistical significance tests (NHST) are widely used in quantitative research in the empirical sciences, including scientometrics. Nevertheless, since their introduction nearly a century ago, significance tests have been controversial. Many researchers are not aware of the numerous criticisms raised against NHST. As practiced, NHST has been characterized as a 'null ritual' that is overused and too often misapplied and misinterpreted. NHST is in fact a patchwork of two fundamentally different classical statistical testing models, often blended with some wishful quasi-Bayesian interpretations. This is undoubtedly a major reason why NHST is very often misunderstood. But NHST also has intrinsic logical problems, and the epistemic range of the information provided by such tests is much more limited than most researchers recognize. In this article we introduce to the scientometric community the theoretical origins of NHST, which are mostly absent from standard statistical textbooks, and we discuss some of the most prevalent problems relating to the practice of NHST, tracing these problems back to the mix-up of the two different theoretical origins. Finally, we illustrate some of the misunderstandings with examples from the scientometric literature and bring forward some modest recommendations for a more sound practice in quantitative data analysis.

Even statisticians are not immune to misinterpretations of Null Hypothesis Significance Tests

International Journal of Psychology, 2003

We investigated the way experienced users interpret Null Hypothesis Significance Testing (NHST) outcomes. An empirical study was designed to compare the reactions of two populations of NHST users, psychological researchers and professional applied statisticians, when faced with contradictory situations. The subjects were presented with the results of an experiment designed to test the efficacy of a drug by comparing two groups (treatment/placebo). Four situations were constructed by combining the outcome of the t test (significant vs. nonsignificant) and the observed difference between the two means D (large vs. small). Two of these situations appeared conflicting (t significant/D small and t nonsignificant/D large). Three fundamental aspects of statistical inference were investigated by means of open questions: drawing inductive conclusions about the magnitude of the true difference from the data in hand, making predictions for future data, and making decisions about stopping the experiment. The subjects were 25 statisticians from pharmaceutical companies in France, well versed in statistics, and 20 psychological researchers from various laboratories in France, all with experience in processing and analyzing experimental data. On the whole, statisticians and psychologists reacted in a similar way and were very impressed by significant results. It should be stressed that professional applied statisticians were not immune to misinterpretations, especially in the case of nonsignificance. However, the interpretations that accustomed users attach to the outcome of NHST can vary from one individual to another, and it is hard to conceive that there could be a consensus in the face of seemingly conflicting situations. In fact, beyond the superficial report of "erroneous" interpretations, the misuses of NHST reveal intuitive judgmental "adjustments" that try to overcome its inherent shortcomings.
These findings encourage the many recent attempts to improve the habitual ways of analyzing and reporting experimental data.
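The study's two "conflicting" cells — a significant t with a small observed difference, and a nonsignificant t with a large one — are easy to reproduce numerically. A minimal sketch with hypothetical numbers, using a z approximation in place of the exact t test:

```python
import math

def two_sided_p(diff, sd, n):
    """Two-sided p value for a two-sample z test (equal group sizes n)."""
    z = diff / (sd * math.sqrt(2.0 / n))
    return math.erfc(abs(z) / math.sqrt(2.0))

# Significant test, small difference: a tiny effect with a huge sample.
p1 = two_sided_p(diff=0.8, sd=10.0, n=2000)   # D = 0.08 SD per group of 2000
# Nonsignificant test, large difference: a big effect with a tiny sample.
p2 = two_sided_p(diff=8.0, sd=10.0, n=5)      # D = 0.8 SD per group of 5

print(f"small D, n=2000: p = {p1:.4f}")  # significant
print(f"large D, n=5:    p = {p2:.4f}")  # nonsignificant
```

A reader who equates "significant" with "large" will misread both cells, which is precisely the confusion the study probed.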

Significance tests as sorcery: Science is empirical—significance tests are not

Theory & Psychology, 2012

Since the 1930s, many of our top methodologists have argued that significance tests are not conducive to science. Bakan (1966) believed that “everyone knows this” and that we slavishly lean on the crutch of significance testing because, if we didn’t, much of psychology would simply fall apart. If he was right, then significance testing is tantamount to psychology’s “dirty little secret.” This paper will revisit and summarize the arguments of those who have been trying to tell us—for more than 70 years—that p values are not empirical. If these arguments are sound, then the continuing popularity of significance tests in our peer-reviewed journals is at best embarrassing and at worst intellectually dishonest.

Statistical significance and its critics: practicing damaging science, or damaging scientific practice?

Synthese

While the common procedure of statistical significance testing and its accompanying concept of p-values have long been surrounded by controversy, renewed concern has been triggered by the replication crisis in science. Many blame statistical significance tests themselves, and some regard them as sufficiently damaging to scientific practice as to warrant being abandoned. We take a contrary position, arguing that the central criticisms arise from misunderstanding and misusing the statistical tools, and that in fact the purported remedies themselves risk damaging science. We argue that banning the use of p-value thresholds in interpreting data does not diminish but rather exacerbates data-dredging and biasing selection effects. If an account cannot specify outcomes that will not be allowed to count as evidence for a claim—if all thresholds are abandoned—then there is no test of that claim. The contributions of this paper are: To explain the rival statistical philosophies underlying the...
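The selection-effect worry can be illustrated with a small simulation (not from the paper): when every null hypothesis is true, repeated testing still produces "significant" p values at the nominal rate, so a researcher free to dredge among many outcomes will reliably find some to report.

```python
import math
import random

def two_sample_p(xs, ys):
    """Two-sided p value for a two-sample z test on equal-size samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((v - mx) ** 2 for v in xs) / (n - 1)
    vy = sum((v - my) ** 2 for v in ys) / (n - 1)
    z = (mx - my) / math.sqrt((vx + vy) / n)
    return math.erfc(abs(z) / math.sqrt(2.0))

random.seed(0)
trials, n, alpha = 2000, 100, 0.05
hits = 0
for _ in range(trials):
    # Both groups come from the SAME distribution: every null is true.
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    ys = [random.gauss(0.0, 1.0) for _ in range(n)]
    if two_sample_p(xs, ys) < alpha:
        hits += 1

false_positive_rate = hits / trials
print(f"fraction of 'significant' null comparisons: {false_positive_rate:.3f}")
```

Roughly 5% of the purely null comparisons cross the threshold, which is why pre-specified error control, rather than post hoc selection among outcomes, is central to the authors' defense of testing.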

Significance Testing is Still Wrong, and Damages Real Lives: A Brief Reply to Spreckelsen and Van Der Horst, and Nicholson and McCusker

Sociological Research Online, 2017

This paper is a brief reply to two responses to a paper I published previously in this journal. In that first paper I presented a summary of part of the long-standing literature critical of the use of significance testing in real-life research, and reported again on how significance testing is abused, leading to invalid and therefore potentially damaging research outcomes. I illustrated and explained the inverse logic error that is routinely used in significance testing, and argued that all of this should now cease. Although clearly disagreeing with me, neither of the responses to my paper addressed these issues head on. One focussed mainly on arguing with things I had not said (such as that there are no other problems in social science). The other tried to argue either that the inverse logic error is not prevalent, or that there is some other unspecified way of presenting the results of significance testing that does not involve this error. This reply paper summarises my original p...

Null hypothesis significance testing: On the survival of a flawed method

American Psychologist, 2001

Null hypothesis significance testing (NHST) is the researcher's workhorse for making inductive inferences. This method has often been challenged, has occasionally been defended, and has persistently been used through most of the history of scientific psychology. This article reviews both the criticisms of NHST and the arguments brought to its defense. The review shows that the criticisms address the logical validity of inferences arising from NHST, whereas the defenses stress the pragmatic value of these inferences. The author suggests that both critics and apologists implicitly rely on Bayesian assumptions. When these assumptions are made explicit, the primary challenge for NHST-and any system of induction-can be confronted. The challenge is to find a solution to the question of replicability. Editor's note. J. Bruce Overmier served as action editor for this article.

The Magical Influence of Statistical Significance

This paper examined 1122 statistical tests found in 55 master's theses accredited during 1995-2000 at Mu'tah University. It tried to answer two main questions: first, do researchers still rely on the level of significance (α) as the only criterion to judge the importance of relations and differences? Second, to what extent can practical significance be found alongside statistical significance? Results showed that researchers do consider statistical significance the only criterion for judging the importance of their findings: 74.33% of the statistically significant tests had small practical significance, and only 10.27% had large practical significance.