Low agreement among reviewers evaluating the same NIH grant applications
Obtaining grant funding from the National Institutes of Health (NIH) is increasingly competitive, as funding success rates have declined over the past decade. To allocate relatively scarce funds, scientific peer reviewers must differentiate the very best applications from comparatively weaker ones. Despite the importance of this determination, little research has explored how reviewers assign ratings to the applications they review and whether there is consistency in the reviewers' evaluation of the same application. Replicating all aspects of the NIH peer-review process, we examined 43 individual reviewers' ratings and written critiques of the same group of 25 NIH grant applications. Results showed no agreement among reviewers regarding the quality of the applications in either their qualitative or quantitative evaluations. Although all reviewers received the same instructions on how to rate applications and format their written critiques, we also found no agreement in how reviewers "translated" a given number of strengths and weaknesses into a numeric rating. It appeared that the outcome of the grant review depended more on the reviewer to whom the grant was assigned than the research proposed in the grant. This research replicates the NIH peer-review process to examine in detail the qualitative and quantitative judgments of different reviewers examining the same application, and our results have broad relevance for scientific grant peer review.

peer review | social sciences | interrater reliability | linear mixed-effects models

In the past decade, funding at the National Institutes of Health (NIH) has increased at a much slower rate (1) than the number of grant applications (2), and consequently, success rates have steadily declined (3).
There are more deserving grant applications than there are available funds, so it is critical to ensure that the process responsible for awarding such funds, grant peer review, reliably differentiates the very best applications from the comparatively weaker ones. Research on grant peer review is inconclusive: Some studies suggest that it is unreliable (4-13) and potentially biased (14-17), whereas others show the validity of review systems and final outcomes (18-20). However, even if peer review effectively discriminates the good applications from the bad, it is now imperative to empirically assess whether, in this culture of decreasing funding rates, it can discriminate the good from the excellent within a pool of high-quality applications. As Chubin and Hackett (21) argue, intensified competition for resources harms peer review because funding decisions rely on an evaluation process that is not designed to distinguish among applications of similar quality, a scenario that they argue is most prevalent at the NIH. Indeed, the findings in the present paper suggest that reviewers are unable to differentiate excellent applications (i.e., those funded by the NIH in the first round) from good applications (i.e., those unfunded but later funded by the NIH after subsequent revisions).

Because the grant peer-review process at NIH is confidential, the only way to systematically examine it is to replicate the process outside of the NIH in a highly realistic manner. This is precisely what we did in the research reported in this paper. We recruited 43 oncology researchers from across the United States to participate in one of four peer-review panels (called "study sections" at NIH), each composed of 8-12 reviewers. Fig. 1 presents a deidentified image from one study section meeting. We solicited 25 oncology grant applications submitted to NIH as R01s (the most competitive and highly sought-after type of grant at NIH) between 1 and 4 y before our study.
Sixteen of these were funded in the first round (i.e., the best applications), whereas 9 of these were funded only after subsequent resubmission (i.e., the good applications).

The NIH uses a two-stage review process. In the first stage, two to five reviewers individually evaluate each grant application by assigning a preliminary rating using the NIH's reverse 9-point scale (1 = exceptional, 9 = poor) and writing a critique describing the application's strengths and weaknesses. Most typically, three reviewers are assigned to an application: a primary, a secondary, and a tertiary reviewer, ranked in order of the relevance of their expertise. Reviewers then convene in study section meetings, where they discuss the applications that received preliminary ratings in the top half of all applications evaluated. After sharing their preliminary ratings and critiques, the two to five assigned reviewers discuss the application with all other study section members, all of whom assign a final rating to the application. This final rating from all members is averaged into a final "priority score." In the second stage, members of NIH's advisory councils use this priority score and the written critiques to make

Significance

Scientific grant peer reviewers must differentiate the very best applications from comparatively weaker ones. Despite the importance of this determination in allocating funding, little research has explored how reviewers derive their assigned ratings for the applications they review or whether this assessment is consistent when the same application is evaluated by different sets of reviewers. We replicated the NIH peer-review process to examine the qualitative and quantitative judgments of different reviewers examining the same grant application. We found no agreement among reviewers in evaluating the same application.
These findings highlight the subjectivity in reviewers' evaluations of grant applications and underscore the difficulty in comparing the evaluations of different applications from different reviewers, which is how peer review actually unfolds.
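
To make the first-stage arithmetic of the two-stage review concrete, the following minimal Python sketch walks through it: assigned reviewers give preliminary ratings on the reverse 9-point scale, the better-rated half of applications goes to discussion, and all panel members' final ratings are averaged into a priority score. The application labels, the ratings, the function names, and the handling of ties or odd counts are all hypothetical choices made for illustration; none of them come from the study's data or from NIH procedure documents.

```python
from statistics import mean


def select_for_discussion(preliminary):
    """Return the IDs of the better-rated half of applications.

    `preliminary` maps an application ID to the preliminary ratings from its
    assigned reviewers (1 = exceptional, 9 = poor), so a lower mean is better.
    Assumption: with an odd number of applications the half is rounded down.
    """
    ranked = sorted(preliminary, key=lambda app: mean(preliminary[app]))
    return ranked[: len(ranked) // 2]


def priority_score(final_ratings):
    """Average the final ratings given by all panel members."""
    return mean(final_ratings)


def rating_spread(ratings):
    """Crude disagreement indicator: worst rating minus best rating."""
    return max(ratings) - min(ratings)


# Hypothetical preliminary ratings from three assigned reviewers each.
preliminary = {
    "A": [2, 3, 2],
    "B": [5, 4, 6],
    "C": [1, 4, 7],
    "D": [3, 3, 4],
}

discussed = select_for_discussion(preliminary)

# Hypothetical final ratings from an eight-member panel for the discussed set.
final = {
    "A": [2, 2, 3, 2, 3, 2, 2, 3],
    "D": [3, 4, 3, 3, 4, 3, 4, 4],
}
scores = {app: priority_score(final[app]) for app in discussed}

# Application C is rated 1 by one reviewer and 7 by another.
spread_c = rating_spread(preliminary["C"])

print(discussed, scores, spread_c)
```

In this made-up example, application C misses the discussion cutoff even though one of its three reviewers rated it exceptional, which illustrates how an outcome can hinge on which reviewers happen to be assigned.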