A more principled use of the p-value? Not so fast: a critique of Colquhoun’s argument.
Related papers
Steven Goodman: Twelve P-value Misconceptions
The P value is a measure of statistical evidence that appears in virtually all medical research papers. Its interpretation is made extraordinarily difficult because it is not part of any formal system of statistical inference. As a result, the P value's inferential meaning is widely and often wildly misconstrued, a fact that has been pointed out in innumerable papers and books appearing since at least the 1940s. This commentary reviews a dozen of these common misinterpretations and explains why each is wrong. It also reviews the possible consequences of these improper understandings or representations of its meaning. Finally, it contrasts the P value with its Bayesian counterpart, the Bayes' factor, which has virtually all of the desirable properties of an evidential measure that the P value lacks, most notably interpretability. The most serious consequence of this array of P-value misconceptions is the false belief that the probability of a conclusion being in error can be calculated from the data in a single experiment without reference to external evidence or the plausibility of the underlying mechanism. Semin Hematol 45:135-140
2018
Almost a century after its introduction, the p-value remains the most frequently used inferential tool of statistical science for steering research in various scientific domains. This ubiquitous, powerful statistic is now under fire, surrounded by numerous controversies. Here we review some of the important papers that highlight the prevailing myths, misunderstandings and controversies about this statistic. We also discuss recent developments by the American Statistical Association (ASA) in interpreting the p-value and guiding researchers to avoid confusion. Our paper is based on a search of selected databases, and we do not claim it to be an exhaustive review. It specifically aims to help medical researchers and professionals who have little background in this contentious statistic and have been chasing it indiscriminately in order to publish significant findings.
P-values are a common component and outcome measure in almost every published observational study or randomized clinical trial. However, junior faculty, fellows, and residents have little or no training in statistics and are forced to rely on interpretations of results provided by the authors or by secondary sources. This education gap applies to an even larger audience, including many physicians, researchers, journalists, and policy makers. That is a dangerous approach. Statistical analysis of data often involves calculating the p-value and reporting it as statistically significant or not, without much further thought. But p-values are highly unreplicable, and their definition is not directly associated with reproducibility. Findings from clinical studies are not valid if they cannot be reproduced. Although other methodological issues relate to reproducibility, such as the statistical power to reproduce an effect, the p-value is arguably at the root of the problem given its wide variability from study to study. Many common misinterpretations and misuses of the p-value are practiced. It is essential to bring more awareness to this critical issue by providing a deeper educational understanding of the p-value and of the proper interpretation of study results. Recognizing this need, the American Statistical Association (ASA) recently published its first-ever policy statement concerning the proper use and interpretation of p-values for scientists and researchers. This policy statement addresses the misguided practice of interpreting study results based solely on the p-value, given that it is often irreproducible in subsequent, similar studies. To further educate and illustrate this issue, we investigated the irreproducibility of the p-value using simulation software and results reported from a published randomized controlled trial. We show that the probability of attaining another statistically significant p-value varied quite widely on replication.
We also show that power alone determines the distribution of p, and that power varies with sample size and effect size. The percentage of replication means that fell within the original confidence interval (CI) from each replicated experiment revealed that the 95% CI included only 85.4% of future replication means. In conclusion, p-values interpreted in isolation, devoid of context, can be misleading and may lead to biased inferences from clinical studies.
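The replication experiment described in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' simulation software or trial data: it assumes two groups drawn from normal distributions with a known standard deviation of 1, a true effect of 0.5, 32 subjects per group, and a two-sided z-test. All function names and parameter values are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, mean, median

STD_NORMAL = NormalDist()  # standard normal, used for the z-test


def two_sided_p(sample_a, sample_b, sigma=1.0):
    """Two-sided z-test p-value for a difference in means (known common SD)."""
    se = sigma * (1.0 / len(sample_a) + 1.0 / len(sample_b)) ** 0.5
    z = (mean(sample_b) - mean(sample_a)) / se
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def simulate_p_values(effect, n_per_group, n_replicates, seed=0):
    """Repeat the same two-group experiment and collect one p-value per run."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_replicates):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        p_values.append(two_sided_p(control, treated))
    return p_values


def theoretical_power(effect, n_per_group, alpha=0.05):
    """Power of the two-sided z-test for the assumed effect and sample size."""
    shift = effect / (2.0 / n_per_group) ** 0.5  # true effect in SE units
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return (1.0 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


p_values = simulate_p_values(effect=0.5, n_per_group=32, n_replicates=2000)
fraction_significant = sum(p < 0.05 for p in p_values) / len(p_values)
power = theoretical_power(effect=0.5, n_per_group=32)

print(f"smallest p: {min(p_values):.5f}")
print(f"median p:   {median(p_values):.5f}")
print(f"largest p:  {max(p_values):.5f}")
print(f"fraction with p < 0.05: {fraction_significant:.3f}")
print(f"theoretical power:      {power:.3f}")
```

Under these assumptions the power is close to one half, so identical experiments on the same true effect produce p-values ranging from far below 0.001 to well above 0.5, and the share of replicates with p < 0.05 tracks the power.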
The arbitrary magic of p<0.05: Beyond statistics
Journal of B.U.ON. : official journal of the Balkan Union of Oncology, 2020
Modern research and scientific conclusions are widely regarded as valid when the study design and analysis are interpreted correctly. The p-value is considered to be the most commonly used method to provide a dichotomy between true and false data in evidence-based medicine. However, many authors, reviewers and editors may be unfamiliar with the true definition and correct interpretation of this number. This article intends to point out how misunderstanding or misuse of this value can have an impact on both the scientific community and the society we live in. The foundation of the medical education system rewards the abundance of scientific papers rather than the careful search for the truth. Appropriate research ethics should be practised in all stages of the publication process.
P-values: misunderstood and misused
P-values are widely used in both the social and natural sciences to quantify the statistical significance of observed results. The recent surge of big data research has made the p-value an even more popular tool to test the significance of a study. However, substantial literature has been produced critiquing how p-values are used and understood. In this paper we review this recent critical literature, much of which is rooted in the life sciences, and consider its implications for social scientific research. We provide a coherent picture of what the main criticisms are, and draw together and disambiguate common themes. In particular, we explain how the False Discovery Rate is calculated, and how this differs from a p-value. We also make explicit the Bayesian nature of many recent criticisms, a dimension that is often underplayed or ignored. We conclude by identifying practical steps to help remediate some of the concerns identified. We recommend that (i) far lower significance levels are used, such as 0.01 or 0.001, and (ii) p-values are interpreted contextually, and situated within both the findings of the individual study and the broader field of inquiry (through, for example, meta-analyses).
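The difference between a False Discovery Rate and a p-value can be made concrete with a short calculation. The sketch below uses the common textbook formulation, in which the false discovery rate is the share of significant results that come from true nulls; it is not taken from the paper itself. The inputs (10% of tested hypotheses being real effects, 80% power) are illustrative assumptions, and the function name is hypothetical.

```python
def false_discovery_rate(prior_real, power, alpha):
    """Share of 'significant' results that are false positives.

    prior_real: assumed fraction of tested hypotheses with a real effect
    power:      probability of detecting a real effect
    alpha:      significance threshold (false positive rate under the null)
    """
    false_positives = alpha * (1.0 - prior_real)  # true nulls declared significant
    true_positives = power * prior_real           # real effects declared significant
    return false_positives / (false_positives + true_positives)


# Illustrative assumptions: 10% of hypotheses are real effects, power is 80%.
fdr_at_05 = false_discovery_rate(prior_real=0.10, power=0.80, alpha=0.05)
fdr_at_001 = false_discovery_rate(prior_real=0.10, power=0.80, alpha=0.001)

print(f"FDR at alpha = 0.05:  {fdr_at_05:.3f}")
print(f"FDR at alpha = 0.001: {fdr_at_001:.3f}")
```

With these inputs, 0.045 of all tests are false positives and 0.08 are true positives, so about 36% of significant results are false discoveries at the 0.05 threshold, against roughly 1% at 0.001. The calculation depends on the assumed prior, which is the Bayesian element the abstract refers to.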
Biology Letters, 2019
The p-value has long been the figurehead of statistical analysis in biology, but its position is under threat. p is now widely recognized as providing quite limited information about our data, and as being easily misinterpreted. Many biologists are aware of p's frailties, but less clear about how they might change the way they analyse their data in response. This article highlights and summarizes four broad statistical approaches that augment or replace the p-value, and that are relatively straightforward to apply. First, you can augment your p-value with information about how confident you are in it, how likely it is that you will get a similar p-value in a replicate study, or the probability that a statistically significant finding is in fact a false positive. Second, you can enhance the information provided by frequentist statistics with a focus on effect sizes and a quantified confidence that those effect sizes are accurate. Third, you can augment or substitute p-value...
The fickle P value generates irreproducible results
Nature Methods, 2015
The reliability and reproducibility of science are under scrutiny. However, a major cause of this lack of repeatability is not being considered: the wide sample-to-sample variability in the P value. We explain why P is fickle to discourage the ill-informed practice of interpreting analyses based predominantly on this statistic.
On the challenges of drawing conclusions from p-values just below 0.05
In recent years researchers have attempted to provide an indication of the prevalence of inflated Type 1 error rates by analyzing the distribution of p-values in the published literature. De Winter and Dodou (2015) analyzed the distribution (and its change over time) of a large number of p-values automatically extracted from abstracts in the scientific literature. They concluded there is a ‘surge of p-values between 0.041-0.049 in recent decades’ which ‘suggests (but does not prove) questionable research practices have increased over the past 25 years’. I show the changes in the ratio of fractions of p-values between 0.041-0.049 over the years are better explained by assuming the average power has decreased over time. Furthermore, I propose that their observation that p-values just below 0.05 increase more strongly than p-values above 0.05 can be explained by an increase in publication bias (or the file drawer effect) over the years (cf. Fanelli, 2012; Pautasso, 2010), which has led to a relative decrease of 'marginally significant' p-values in abstracts in the literature (instead of an increase in p-values just below 0.05). I explain why researchers analyzing large numbers of p-values need to relate their assumptions to a model of p-value distributions that takes the average power of the performed studies, the ratio of true positives to false positives in the literature, the effects of publication bias, and the Type 1 error rate (and possible mechanisms through which it has inflated) into account. Finally, I discuss why publication bias and underpowered studies might be a bigger problem for science than inflated Type 1 error rates, and explain the challenges when attempting to draw conclusions about inflated Type 1 error rates from a large heterogeneous set of p-values.
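The dependence of the p-value distribution on power can be sketched numerically. The code below is a simplified illustration, not the model used in the paper: it assumes a two-sided z-test whose statistic is normal with unit variance and mean equal to the true effect in standard-error units (`shift`, a hypothetical parameter name), and it ignores publication bias and inflated error rates entirely.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_p_below(threshold, shift):
    """P(p < threshold) for a two-sided z-test when the z statistic ~ N(shift, 1)."""
    z = STD_NORMAL.inv_cdf(1.0 - threshold / 2.0)
    return (1.0 - STD_NORMAL.cdf(z - shift)) + STD_NORMAL.cdf(-z - shift)


def share_just_below_05(shift, low=0.041, high=0.049):
    """Among results with p < 0.05, the share that fall between low and high."""
    in_band = prob_p_below(high, shift) - prob_p_below(low, shift)
    return in_band / prob_p_below(0.05, shift)


# shift = 0 is a true null; larger shifts correspond to higher power.
shares = {}
for shift in (0.0, 1.0, 2.0, 2.8):
    power = prob_p_below(0.05, shift)
    shares[shift] = share_just_below_05(shift)
    print(f"shift={shift:.1f}  power={power:.3f}  "
          f"share of significant p in 0.041-0.049: {shares[shift]:.3f}")
```

Under a true null the p-value is uniform, so the band 0.041 to 0.049 holds 0.008 / 0.05 = 16% of significant results. As power rises, that share shrinks, which is why a shift in average power across the literature can change the observed frequency of p-values just below 0.05 without any change in research practices.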
A study on the American Statistical Association (ASA) policy statement on statistical significance testing and p-values of 2016 was carried out in Tanzania. The purpose of the study was to explore the feelings and reactions of university statistics tutors towards the statement. A sample of 9 statistics tutors from different disciplines was selected from public and private universities via heterogeneous purposive sampling to participate in the study. Respondents had mixed feelings towards the ASA policy statement of 2016. The statement was criticized for being shallow, subjective, and failing to answer the core problems raised against the use of Null Hypothesis Significance Testing (NHST) and the p-value. It was dismissed as a non-event with nothing new to offer. However, despite being shallow, the ASA policy statement on NHST and the p-value is likely to trigger a healthy debate on the shortfalls of NHST and the p-value, and that debate will eventually lead to a breakthrough.