Understanding the Role of P Values and Hypothesis Tests in Clinical Research

P values (the product of significance tests) and hypothesis testing methods are frequently misunderstood and often misused in clinical research.1-3 Despite a large body of literature advising otherwise, reliance on these tools to characterize and interpret scientific findings continues to increase.4 The American Statistical Association5 recently published a consensus white paper attempting to promote a more limited, rational role for the P value in science. That consensus statement was accompanied by 21 individual commentaries from members of the panel, each adding his or her own caveats to the discussion. Our justification for writing yet another article on this surprisingly controversial subject lies in the hope that, by taking a substantially different approach that is more conceptual and less technical, we can enhance understanding of the roles that P values and hypothesis tests are best suited to fill.

Science as Measurement: Understanding the Nature of Clinical Evidence

When we do science, in most situations that means we measure something.6-8 In the context of a therapeutic clinical trial, we can think of the treatment as a metaphorical therapeutic force, an "oomph" effect that pushes the intervention group away from the control group, creating distance or separation between the 2 groups with respect to outcomes of interest.9,10 An ineffective treatment, therefore, has no oomph. To measure the consequences of this therapeutic force, we use 2 complementary concepts, namely, magnitude and precision.

Outcomes measured in individual patients are used to estimate the magnitude or size of the treatment effects produced, both for benefit and for harm as the circumstances dictate. The further the therapy in question "pushes" the treatment group away from its peers in the control group, the larger the average treatment effect is. How large a treatment effect must be to be consequential is a matter for clinical judgment. One of the added complexities of the clinical sciences is that, when a treatment saves a life or prevents a stroke or heart attack, for example, the prevented event is clinically invisible and can be measured only indirectly using appropriate controls. We usually measure therapeutic and other clinical effects in groups of patients using quantitative constructs such as odds ratios, hazard ratios, relative risk reductions, and survival or adverse event rate differences. Statisticians often use the term estimation when summarizing the measurement of cohort-level clinical therapeutic effects.

The second key concept involves the idea of "precision," the amount of spread or variability in the data. At the level of individual patients, precision can be understood in terms of measure-remeasure variability of clinical variables. The more tightly measurements cluster around each other, the more precise (free from "random" or patternless error) the measurement process is thought to be. In much of clinical research, however, variability of the measure-
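To make the magnitude concept concrete, the following is a minimal sketch, in Python, of the cohort-level effect-size measures named above (odds ratio, relative risk, relative risk reduction, and event rate difference), computed for a hypothetical two-arm trial. The event counts are invented for illustration only and do not come from any study discussed here.

```python
# Hypothetical illustration: cohort-level effect-size ("oomph") summaries
# from a 2x2 trial table. The counts below are assumptions, not real data.

def effect_sizes(events_tx, n_tx, events_ctrl, n_ctrl):
    """Return common effect-size summaries for a two-arm trial."""
    risk_tx = events_tx / n_tx        # event rate in the treatment arm
    risk_ctrl = events_ctrl / n_ctrl  # event rate in the control arm

    odds_tx = risk_tx / (1 - risk_tx)
    odds_ctrl = risk_ctrl / (1 - risk_ctrl)

    return {
        "rate_difference": risk_tx - risk_ctrl,            # absolute difference
        "relative_risk": risk_tx / risk_ctrl,              # risk ratio
        "relative_risk_reduction": 1 - risk_tx / risk_ctrl,
        "odds_ratio": odds_tx / odds_ctrl,
    }

# Hypothetical trial: 30/200 events on treatment vs 50/200 on control.
for name, value in effect_sizes(30, 200, 50, 200).items():
    print(f"{name}: {value:.3f}")
```

Each of these summaries answers the same underlying question, how far the treatment "pushed" the intervention group away from the control group, expressed on a different scale.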
P values and hypothesis testing methods are frequently misused in clinical research. Much of this misuse appears to be owing to the widespread, mistaken belief that they provide simple, reliable, and objective triage tools for separating the true and important from the untrue or unimportant.

The primary focus in interpreting therapeutic clinical research data should be on the treatment ("oomph") effect, a metaphorical force that moves patients given an effective treatment to a different clinical state relative to their control counterparts. This effect is assessed using 2 complementary types of statistical measures calculated from the data, namely, effect magnitude or size and precision of the effect size. In a randomized trial, effect size is often summarized using constructs such as odds ratios, hazard ratios, relative risks, or adverse event rate differences. How large a treatment effect must be to be consequential is a matter for clinical judgment. The precision of the effect size (conceptually related to the amount of spread in the data) is usually addressed with confidence intervals. P values (significance tests) were first proposed as an informal heuristic to help assess how "unexpected" the observed effect size would be if the true state of nature were no effect or no difference. Hypothesis testing was a modification of the significance test approach that envisioned controlling the false-positive rate of study results over many (hypothetical) repetitions of the experiment of interest. Both can be helpful but, by themselves, provide only a tunnel-vision perspective on study results that ignores the clinical effects the study was conducted to measure.
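As a hedged illustration of how these three quantities relate, the sketch below computes, for the same kind of hypothetical trial counts used above, the effect magnitude (risk difference), its precision (a Wald 95% confidence interval), and a P value from a two-proportion z-test under the assumption of no true difference. The numbers are assumptions chosen for illustration; the point is that the P value alone says nothing about how large, or how clinically consequential, the measured effect is.

```python
# Hypothetical illustration: effect size, precision, and a P value for the
# same two-arm trial counts. Counts are invented for illustration only.

import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def summarize(events_tx, n_tx, events_ctrl, n_ctrl):
    p_tx, p_ctrl = events_tx / n_tx, events_ctrl / n_ctrl
    rd = p_tx - p_ctrl  # effect magnitude: risk difference

    # Precision: Wald 95% confidence interval for the risk difference.
    se = math.sqrt(p_tx * (1 - p_tx) / n_tx + p_ctrl * (1 - p_ctrl) / n_ctrl)
    ci = (rd - 1.96 * se, rd + 1.96 * se)

    # Significance test: how "unexpected" is this difference if the true
    # difference were zero? (two-proportion z-test with pooled variance)
    pooled = (events_tx + events_ctrl) / (n_tx + n_ctrl)
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n_tx + 1 / n_ctrl))
    z = rd / se0
    p_value = 2 * (1 - normal_cdf(abs(z)))

    return rd, ci, p_value

rd, ci, p = summarize(30, 200, 50, 200)
print(f"risk difference: {rd:.3f}, 95% CI: ({ci[0]:.3f}, {ci[1]:.3f}), P = {p:.3f}")
```

In this illustration the confidence interval conveys both the size of the effect and its precision, whereas the P value answers only the narrower question of how surprising the data would be if the treatment had no effect at all.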