Implications of "Too Good to Be True" for Replication, Theoretical Claims, and Experimental Design: An Example Using Prominent Studies of Racial Bias (original) (raw)
Related papers
On the reproducibility of psychological science
Journal of the American Statistical Association, 2016
Investigators from a large consortium of scientists recently performed a multi-year study in which they replicated 100 psychology experiments. Although statistically significant results were reported in 97% of the original studies, statistical significance was achieved in only 36% of the replicated studies. This article presents a reanalysis of these data based on a formal statistical model that accounts for publication bias by treating outcomes from unpublished studies as missing data, while simultaneously estimating the distribution of effect sizes for those studies that tested nonnull effects. The resulting model suggests that more than 90% of tests performed in eligible psychology experiments tested negligible effects, and that publication biases based on p-values caused the observed rates of nonreproducibility. The results of this reanalysis provide a compelling argument for both increasing the threshold required for declaring scientific discoveries and for adopting statistical summaries of evidence that account for the high proportion of tested hypotheses that are false. Supplementary materials for this article are available online.
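As a rough illustration of the mechanism this reanalysis attributes to publication bias (not the paper's Bayesian missing-data model), the following sketch simulates a literature in which most tested effects are null, "publishes" only studies with p < .05, and then replicates the published studies without that filter. All parameter values (proportion of non-null effects, effect size, sample sizes) are invented for illustration.

```python
# Hypothetical simulation: selecting studies on p < .05 yields a published
# literature that is almost entirely "significant" even when most tested
# effects are negligible, so unfiltered replications succeed far less often.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_per_group = 20_000, 30
prop_nonnull, true_effect = 0.10, 0.5      # assumed values for illustration

is_nonnull = rng.random(n_studies) < prop_nonnull
effects = np.where(is_nonnull, true_effect, 0.0)

def p_value(effect):
    """Two-sided p-value of a two-sample t-test on data with the given effect."""
    a = rng.normal(effect, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    return stats.ttest_ind(a, b).pvalue

p_original = np.array([p_value(e) for e in effects])
published = p_original < 0.05               # publication filter on p-values

p_replication = np.array([p_value(e) for e in effects[published]])
print(f"fraction of all studies published (p < .05):      {published.mean():.0%}")
print(f"replications of published studies with p < .05:   {(p_replication < 0.05).mean():.0%}")
```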
Excess success for three related papers on racial bias
Frontiers in psychology, 2015
Three related articles reported that racial bias altered perceptual experience and influenced decision-making. These findings have been applied to training programs for law enforcement, and elsewhere, to mitigate racial bias. However, a statistical analysis of each of the three articles finds that the reported experimental results should be rare, even if the theoretical ideas were correct. The analysis estimates that the probability of the reported experimental success for the articles is 0.003, 0.048, and 0.070, respectively. These low probabilities suggest that similar future work is unlikely to produce similarly successful outcomes and indicate that readers should be skeptical about the validity of the reported findings and their theoretical implications. The reported findings should not be used to guide policies related to racial bias, and new experimental work is needed to judge the merit of the theoretical ideas.
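A minimal sketch of the "excess success" calculation that underlies probabilities of this kind, using hypothetical per-experiment power values rather than the estimates from the three articles:

```python
# If an article reports k experiments and every one is significant, the
# probability of that joint outcome is roughly the product of the experiments'
# estimated powers. The power values below are invented for illustration.
import numpy as np

estimated_powers = [0.55, 0.60, 0.48, 0.52, 0.50]   # hypothetical per-experiment power
p_all_successful = float(np.prod(estimated_powers))
print(f"P(all {len(estimated_powers)} experiments significant) ≈ {p_all_successful:.3f}")
```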
Too good to be true: Publication bias in two prominent studies from experimental psychology
Psychonomic Bulletin & Review, 2012
Empirical replication has long been considered the final arbiter of phenomena in science, but replication is undermined when there is evidence for publication bias. Evidence for publication bias in a set of experiments can be found when the observed number of rejections of the null hypothesis exceeds the expected number of rejections. Application of this test reveals evidence of publication bias in two prominent investigations from experimental psychology that have purported to reveal evidence of extrasensory perception and to indicate severe limitations of the scientific method. The presence of publication bias suggests that those investigations cannot be taken as proper scientific studies of such phenomena, because critical data are not available to the field. Publication bias could partly be avoided if experimental psychologists started using Bayesian data analysis techniques.
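The test described here compares the observed number of null-hypothesis rejections with the number expected from the experiments' estimated powers. A small sketch under assumed power values (not the values estimated for the two investigations) might look like this:

```python
# Compare observed rejections with the number expected from estimated powers,
# and compute how surprising the observed count is under those powers.
import numpy as np

powers = np.array([0.5, 0.6, 0.4, 0.7, 0.5, 0.55])   # assumed per-experiment power
observed_rejections = 6                               # e.g., every experiment "worked"

expected_rejections = powers.sum()

# Exact distribution of the number of rejections (Poisson-binomial) by convolution.
dist = np.array([1.0])
for w in powers:
    dist = np.convolve(dist, [1 - w, w])

p_at_least_observed = dist[observed_rejections:].sum()
print(f"expected rejections: {expected_rejections:.2f}")
print(f"P(>= {observed_rejections} rejections | powers) = {p_at_least_observed:.3f}")
```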
Replication unreliability in psychology: elusive phenomena or "elusive" statistical power
Frontiers in psychology, 2012
The focus of this paper is to analyze whether the unreliability of results related to certain controversial psychological phenomena may be a consequence of their low statistical power. Under Null Hypothesis Statistical Testing (NHST), still the most widely used statistical approach, unreliability derives from the failure to refute the null hypothesis, in particular when exact or quasi-exact replications of experiments are carried out. Taking as examples the results of meta-analyses related to four different controversial phenomena, subliminal semantic priming, incubation effect for problem solving, unconscious thought theory, and non-local perception, it was found that, except for semantic priming on categorization, the statistical power of the typical study to detect the expected effect size (ES) is low or very low. The low power in most studies undermines the use of NHST to study phenomena with moderate or low ESs. We conclude by providing some suggestions on how to increase the statistical power or use different statistical approaches to help discriminate whether the results obtained may or may not be used to support or refute the reality of a phenomenon with a small ES.
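For readers who want to reproduce this kind of check, the sketch below computes the approximate power of a two-sample test for an assumed meta-analytic effect size and sample size; the labels and numbers are placeholders, not the meta-analytic estimates used in the paper.

```python
# Normal-approximation power of a two-sided, two-sample test for a given
# standardized effect size d and per-group sample size n.
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test."""
    se = (2.0 / n_per_group) ** 0.5          # SE of the standardized mean difference
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return 1 - stats.norm.cdf(z_crit - d / se) + stats.norm.cdf(-z_crit - d / se)

# Hypothetical "typical study" settings, for illustration only.
for label, d, n in [("subliminal priming (assumed)", 0.20, 25),
                    ("incubation effect (assumed)", 0.30, 30)]:
    print(f"{label:30s} d={d:.2f}, n={n}/group -> power ≈ {two_sample_power(d, n):.2f}")
```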
Are Most Published Research Findings False?
In a provocative article Ioannidis (2005) argues that, in disciplines employing statistical tests of significance, professional journals report more wrong than true significant results. This short note sketches the argument and explores under what conditions the assertion holds. The “positive predictive value” (PPV) is lower than 1/2 if the a priori probability of the truth of a hypothesis is low. However, computation of the PPV includes only significant results. If both significant and non-significant results are taken into account the “total error ratio” (TER) will not exceed 1/2 provided no extremely large publication bias is present. Moreover, it is shown that theory-driven research may reduce the proportion of errors. Also, the role of replications is emphasized; replication studies of original research are so important because they drastically decrease the error ratio.
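The PPV and TER arguments can be reproduced in a few lines. The sketch below uses the standard definitions with example values of the prior probability, significance level, and power; the specific numbers are illustrative.

```python
# PPV: share of significant results that are true.
# TER: share of all results, significant or not, that are errors.
def ppv(prior, alpha=0.05, power=0.8):
    """P(hypothesis true | significant result)."""
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)

def total_error_ratio(prior, alpha=0.05, power=0.8):
    """Type I errors among nulls plus Type II errors among true effects."""
    false_pos = alpha * (1 - prior)
    false_neg = (1 - power) * prior
    return false_pos + false_neg

for prior in (0.05, 0.2, 0.5):
    print(f"prior={prior:.2f}  PPV={ppv(prior):.2f}  TER={total_error_ratio(prior):.2f}")
```

With a low prior probability of a true hypothesis (e.g., 0.05), PPV falls below 1/2 even though the total error ratio stays well under 1/2, which is the distinction the note turns on.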
Replication and Reproducibility in Psychological, Medical and other Sciences
Medical Research Archives, 2022
As there are no universal constants in psychological, medical and economic sciences, only constructive-phenomenon replications are meaningful. Yet, psychologists continue to perform direct replications, as evidenced by recent preregistered multilab attempts at exact replications of the ego depletion effect. Statistics are driving the replication movement into a ditch because of an overemphasis on the determination of statistical magnitude of effects while ignoring commonsense magnitude and other criteria for evaluating phenomena's validity, reliability, and viability. The nature of the human mind and the variability of psychological phenomena pose difficult challenges for the scientific method and insurmountable obstacles for precise replications in psychological sciences. The situation is no better in medical and economic sciences. The interaction effect of person (genetics) and environment (lifestyle) calls for constructive replications to determine, for example, drugs' efficacy as a function of group and individual differences. The vaccine-vaccination paradox is an interesting case because psychological and medical sciences meet at this intersection. In all fields, science advances by theory building and model expansion, not by replication tests of statistical hypotheses. Rigorous logical and theoretical analysis always precedes and guides good empirical tests. The nonexistence of an effect is not viable if it can withstand rigorous logical and theoretical analyses. Empirical studies are mainly evaluated for their theoretical relevance and importance, not their success or failure to exactly reproduce the original findings.
Estimating the reproducibility of psychological science
Science (New York, N.Y.), 2015
Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Experimental psychology is said to be having a reproducibility crisis, marked by a low rate of successful replication. Researchers attempting to respond to the problem lack a framework for consistently interpreting the results of statistical tests, as well as standards for judging the outcomes of replication studies. In this paper I introduce an error-statistical framework for addressing these issues. I demonstrate how the severity requirement (and the associated severity construal of test results) can be used to avoid fallacious inferences that are complicit in the perpetuation of unreliable results. Researchers, I argue, must probe for error beyond the statistical level if they want to support substantive hypotheses. I then suggest how severity reasoning can be used to address standing questions about the interpretation of replication results.
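As a concrete, hedged illustration of the severity construal mentioned here, the sketch below computes SEV(mu > mu1) for a one-sided Normal test of H0: mu <= mu0 with known sigma, a textbook setting rather than an example taken from the paper.

```python
# After a rejection of H0: mu <= mu0, the claim "mu > mu1" passes with severity
# equal to the probability of a result less extreme than the one observed
# if mu were only mu1.
from scipy import stats

def severity(xbar_obs, mu1, sigma, n):
    """SEV(mu > mu1) = P(Xbar <= xbar_obs ; mu = mu1)."""
    se = sigma / n ** 0.5
    return stats.norm.cdf((xbar_obs - mu1) / se)

# Illustrative numbers: mu0 = 0, sigma = 10, n = 100, observed mean 2.5 (z = 2.5).
for mu1 in (0.0, 1.0, 2.0, 3.0):
    print(f"SEV(mu > {mu1:.1f}) = {severity(2.5, mu1, 10, 100):.2f}")
```

High severity for small discrepancies (mu > 0, mu > 1) and low severity for larger ones (mu > 3) is what licenses, and limits, the substantive inference drawn from a single significant result.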