Evaluating Meta-analyses in the General Surgical Literature: A Critical Appraisal

Abstract

Objective:

To assess the methodologic quality of meta-analyses of general surgery topics published in peer-reviewed journals.

Summary Background Data:

Systematic reviews and meta-analyses are used to seek, summarize, and interpret primary studies on a given topic. Accordingly, systematic reviews and meta-analyses of high-quality primary studies may be the highest level of evidence for issues of prevention and treatment in evidence-based medicine. However, not all published meta-analyses are rigorously performed.

Methods:

We searched MEDLINE (from January 1, 1997, to September 1, 2002) and reference lists, and we solicited general surgery specialists to identify relevant meta-analyses. Inclusion criteria were use of meta-analytic methods to pool the results of primary studies in general surgery on issues of diagnosis, causation, prognosis, or treatment. Our search strategies identified 487 potentially relevant articles. After articles were excluded based on a priori criteria, 51 meta-analyses fulfilled the eligibility criteria. Two reviewers independently assessed the quality of these meta-analyses using a 10-item index, the Overview Quality Assessment Questionnaire.

Results:

Overall concordance between the 2 independent reviewers was good (interobserver agreement 81%; κ = 0.62, 95% CI 0.55–0.69). Of the 51 relevant articles, 38 were published in surgical journals. Most studies had major methodologic flaws (median score of 3 and mean score of 3.33 on a scale of 1–7). Factors associated with low overall scientific quality included the absence of prior meta-analysis publications by the authors and production of the meta-analysis by members of a surgical department without external collaboration.

Conclusions:

This critical appraisal of meta-analyses published in the general surgery literature demonstrates frequent methodologic flaws. The quality of these reports limits the validity of the findings and the inferences that can be made about the primary studies reviewed. To improve the quality of future meta-analyses, we recommend following guidelines for the optimal conduct and reporting of meta-analyses in general surgery.


The range in quality of meta-analyses published over the last 5 years on topics dealing with general surgery is wide. The average overall quality is low; this may impair the validity of the results obtained.

The profusion of publications in scientific and biomedical journals makes it difficult for busy clinicians, educators, and investigators to keep abreast of new developments. For this reason, systematic reviews and meta-analyses are used to seek, summarize, and interpret primary studies on a given topic. These publications have been used to inform clinical practice, aid teaching, direct health policy, guide future research, and serve as a foundation for practice guidelines.1 Accordingly, systematic reviews and meta-analyses of high-quality primary studies are the highest level of evidence for issues of prevention and treatment in evidence-based medicine.2 Indeed, the Oxford Centre for Evidence Based Medicine (www.indigojazz.uk/cebm) ranks systematic reviews/meta-analyses as level 1a evidence.

However, not all published meta-analyses are rigorously performed.3–13 Moreover, the results of meta-analyses have been criticized because sometimes they differ from the results of subsequent large randomized trials.5,14–17 The discordance between meta-analyses and subsequent randomized trials may be in part due to these shortcomings in meta-analysis methodology.3,5,6,8,11,18

The Overview Quality Assessment Questionnaire (OQAQ) was developed as a tool for the critical appraisal of the quality of meta-analyses.19 Its operating characteristics have been validated, including interrater reliability, face validity, and construct validity as measured against 7 a priori hypotheses about how the instrument should perform if it adequately measures the scientific quality of systematic reviews.19–21 The OQAQ has been used to assess publications in both the emergency medicine and anesthesia literature. Using the OQAQ, our objective was to assess the quality of meta-analyses published in the general surgery literature. Our primary goals were to identify the general surgery topics reviewed using the technique of meta-analysis, to evaluate the rigor of several specific steps in the conduct of meta-analyses, and to identify areas of weakness to target for improvement in future reviews.

MATERIALS AND METHODS

Inclusion Criteria and Search Strategy

We defined a “meta-analysis” as a systematic review that includes “a statistical analysis of the results from independent studies, which generally aims to produce a single estimate of effect.”22 We performed a literature review to identify all meta-analyses of general surgery topics published in peer-reviewed paper-based journals. The literature search was performed with the PubMed (MEDLINE) search engine in November 2002 and covered the period January 1, 1997, to September 1, 2002 (search strategy outlined in Fig. 1). The search term was surgery (276,268 articles), restricted with the PubMed “limits” to meta-analysis publication type (528 articles), human studies only (526 articles), and English language (487 articles). One investigator then reviewed all abstracts to identify studies for inclusion. Inclusion criteria were (1) use of meta-analytic methodology to pool the results of primary studies; (2) a focus on issues of diagnosis, causation, prognosis, or treatment; and (3) a focus on conditions relating directly to general surgery practice. Publications were excluded if they (1) were neither meta-analyses nor systematic reviews (N = 145); (2) were systematic reviews but not meta-analyses (ie, the primary study results were not pooled statistically) (N = 114); (3) did not address general surgery topics (N = 165: 22 neurosurgery, 22 chemo/radiotherapy, 21 cardiothoracic, 19 orthopedics, 15 anesthesia/analgesia, 10 otolaryngology, 9 obstetrics and gynecology, 8 vascular surgery, 8 gastroenterology, 6 pediatric surgery, 5 ophthalmology, 5 dentistry, 4 plastic surgery, 4 urology, 3 critical care, 2 transplantation, and 2 miscellaneous); or (4) were Cochrane Reviews (N = 12), since prior research has shown these to be of high quality.23 We did not consider duplicate publications, unpublished work, abstracts, or conference proceedings. We identified 51 relevant meta-analyses.
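
The exclusion counts above can be checked arithmetically. The short Python sketch below is offered only as an arithmetic check, not as part of the authors' analysis (which used SAS); it transcribes the counts from the text and confirms that the 17 specialty categories sum to 165 and that removing the 4 exclusion categories from the 487 screened articles leaves 51.

```python
# Counts transcribed from the Methods text.
SCREENED = 487  # "surgery" restricted to meta-analysis type, human, and English

EXCLUSIONS = {
    "neither meta-analysis nor systematic review": 145,
    "systematic review without statistical pooling": 114,
    "not a general surgery topic": 165,
    "Cochrane review": 12,
}

# Breakdown of the off-topic exclusions by specialty.
OFF_TOPIC = {
    "neurosurgery": 22, "chemo/radiotherapy": 22, "cardiothoracic": 21,
    "orthopedics": 19, "anesthesia/analgesia": 15, "otolaryngology": 10,
    "obstetrics and gynecology": 9, "vascular surgery": 8,
    "gastroenterology": 8, "pediatric surgery": 6, "ophthalmology": 5,
    "dentistry": 5, "plastic surgery": 4, "urology": 4, "critical care": 3,
    "transplantation": 2, "miscellaneous": 2,
}


def remaining(screened: int, exclusions: dict[str, int]) -> int:
    """Articles left after subtracting every (mutually exclusive) exclusion category."""
    return screened - sum(exclusions.values())


off_topic_total = sum(OFF_TOPIC.values())
included = remaining(SCREENED, EXCLUSIONS)
print(off_topic_total, included)  # 165 51
```

The sketch assumes, as the text implies, that each excluded article was assigned to exactly 1 exclusion category.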

FIGURE 1. Literature search strategy.

Quality Appraisal of Articles and Overview Methods

All meta-analyses were critically appraised independently and in duplicate (ED, CD) using the OQAQ. This checklist includes 9 items (scored as done, partially done/cannot tell, or not done) and a 10th item requiring a summary evaluation.19–21 The OQAQ has been psychometrically tested and found to be valid and reliable (interrater reliability, face validity, and construct validity).19–21 When scoring items 1 through 9, we scored “partially” if methods were reported incompletely and “cannot tell” if methods were not reported at all. These items were scored as “yes” or “no” only when the criterion was explicitly met or not met.19 The 10th item is an overall assessment of scientific quality on a scale of 1 to 7, from 1 (extensive flaws with major risk of bias) to 7 (minimal flaws with minor risk of bias). This score is based on the results of the preceding 9 items, and we followed the published recommendations for scoring. If a meta-analysis scored “cannot tell” on 1 or more of the 9 core items, we considered it to have minor flaws at best, and it received a score of 4 or lower. If a meta-analysis scored “no” on question 2, 4, 6, or 8, we considered it to have, at a minimum, major flaws, and it received a score of 3 or lower.19 Final scores were obtained by consensus of the 2 reviewers. Before consensus, overall concordance was good, with an interobserver agreement rate of 81% and a κ coefficient of 0.62 (95% CI 0.55–0.69). When the 2 reviewers could not reach consensus (as occurred for 2 meta-analyses), a third reviewer (MH, FS) adjudicated the points of disagreement.
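
The score ceilings described above, and the agreement statistics reported for the 2 reviewers, can be expressed as a minimal Python sketch. The function names and example ratings are hypothetical. The ceiling logic encodes only the 2 rules stated in the text (the reviewers still chose the final score at or below the ceiling), and the κ shown is the unweighted Cohen's κ; that choice is an assumption, because the text does not state which form of κ was used.

```python
from collections import Counter

CRITICAL_ITEMS = {2, 4, 6, 8}  # a "no" on any of these implies major flaws


def max_overall_score(responses: dict[int, str]) -> int:
    """Upper bound on the 1-7 overall score (item 10) implied by items 1-9.

    `responses` maps item number (1-9) to "yes", "partially", "cannot tell", or "no".
    """
    ceiling = 7
    if any(r == "cannot tell" for r in responses.values()):
        ceiling = min(ceiling, 4)  # minor flaws at best
    if any(responses.get(i) == "no" for i in CRITICAL_ITEMS):
        ceiling = min(ceiling, 3)  # major flaws at a minimum
    return ceiling


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> tuple[float, float]:
    """Return (observed agreement, unweighted Cohen's kappa) for paired labels.

    Kappa is undefined (division by zero) if chance agreement equals 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions, summed over labels.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n**2
    return observed, (observed - expected) / (1 - expected)


# Hypothetical item responses for one meta-analysis.
items = {i: "yes" for i in range(1, 10)}
items[5] = "cannot tell"
print(max_overall_score(items))  # 4
items[6] = "no"
print(max_overall_score(items))  # 3

# Hypothetical paired ratings from 2 reviewers.
reviewer_1 = ["yes", "yes", "no", "partially", "yes", "no", "yes", "cannot tell"]
reviewer_2 = ["yes", "no", "no", "partially", "yes", "no", "yes", "yes"]
agreement, kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"agreement {agreement:.0%}, kappa {kappa:.2f}")  # agreement 75%, kappa 0.61
```

The example also shows why κ is lower than raw agreement: part of the 75% agreement would be expected by chance given how often each reviewer used each response.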

To identify published evidence of expertise in meta-analytic methods, we also searched PubMed for all listed authors of the included meta-analyses, using the PubMed limit “meta-analysis,” to determine the number of meta-analyses each had previously published. When possible, we abstracted the number of patients and studies included in each meta-analysis. We recorded the department producing the meta-analysis. Finally, we obtained the impact factor of the host journal for the year of publication (Institute for Scientific Information).

Analysis

Associations between covariates and the summary score were examined using Spearman rank correlation coefficients. Summary effect sizes were extracted from each meta-analysis. When available, we used odds ratios (ORs) for mortality; if raw data were reported, we calculated the ORs ourselves. Otherwise, we abstracted relative risks (RRs) as the main summary statistic. All metrics were converted so that values (OR or RR) greater than 1 favored the experimental treatment over the control. Statistical analysis was done using SAS software.
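
The 2 computations described above can be sketched in Python under stated assumptions. The 2 × 2 cell layout, the treatment of mortality as a harmful outcome whose OR is inverted so that values above 1 favor the experimental arm, and the hand-written Spearman routine (average ranks for ties, then a Pearson correlation of the ranks) are illustrative choices, not the authors' SAS code. Zero cells, which would require a continuity correction, are not handled.

```python
import math


def odds_ratio(exp_events: int, exp_total: int, ctl_events: int, ctl_total: int) -> float:
    """Odds ratio (experimental vs control) from raw counts; assumes no zero cells."""
    a, b = exp_events, exp_total - exp_events
    c, d = ctl_events, ctl_total - ctl_events
    return (a * d) / (b * c)


def orient(ratio: float, outcome_is_harmful: bool) -> float:
    """Rescale an OR or RR so that values > 1 favor the experimental treatment.

    For a harmful outcome such as death, a ratio below 1 favors the experimental
    arm, so the ratio is inverted.
    """
    return 1.0 / ratio if outcome_is_harmful else ratio


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    var_x = sum((p - mx) ** 2 for p in rx)
    var_y = sum((q - my) ** 2 for q in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical trial: 10/100 deaths with the experimental treatment, 20/100 with control.
or_death = odds_ratio(10, 100, 20, 100)
print(f"{or_death:.2f} -> {orient(or_death, outcome_is_harmful=True):.2f}")  # 0.44 -> 2.25

# Hypothetical OSQ scores and oriented effect sizes for 5 meta-analyses.
print(spearman_rho([1, 2, 3, 4, 5], [2.0, 1.8, 1.5, 1.2, 1.1]))  # -1.0
```

A negative ρ in this setup corresponds to the pattern discussed later, in which lower-quality meta-analyses report larger effects.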

RESULTS

In Table 1, we present the meta-analyses and their characteristics. In Table 2, we present the component scores (9 core questions) of all 51 articles reviewed. Sixty-five percent of all meta-analyses used comprehensive search methods, and 67% clearly reported their search strategies. Inclusion criteria were described in 70% of the studies reviewed, whereas 10% did not describe inclusion criteria at all. Between 14% and 43% of the meta-analyses did not adequately avoid bias in the selection of studies. These publications were weakest with regard to the reporting of explicit inclusion criteria, the description of the validity criteria used to select studies for the meta-analysis, and the appropriateness of these criteria. Seventy-one percent of the articles had methodologic flaws in the description of validity criteria, and 39% to 70% lacked or did not use appropriate validity criteria. Items 7 and 8 of the OQAQ address the reporting and the appropriateness of the methods used to combine studies. Sixty-seven percent of the articles reported how the results of the individual studies were combined, and 65% combined the results appropriately. Question 9 assesses whether the conclusions are supported by the results, in whole or in part. The meta-analyses scored highest on this item; 78% had conclusions supported by the data reported.

TABLE 1. References Included in Review


TABLE 2. Summary of Questions 1–9: Core Index Questions

For question 10 (overall scientific quality, OSQ), the median score was 3 and the mean score was 3.33. The scores for the 51 meta-analyses were distributed as follows: score of 1, 7 (14%); score of 2, 16 (31%); score of 3, 11 (22%); score of 4, 4 (8%); score of 5, 4 (8%); score of 6, 1 (2%); and score of 7, 8 (16%).
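
The summary statistics can be recovered from the reported distribution. This Python sketch, an arithmetic check only, expands the counts into 51 individual scores; the median is the 26th ordered value (3), and the mean is 170/51, or about 3.33.

```python
from statistics import mean, median

# Number of meta-analyses at each overall scientific quality (OSQ) score, from the text.
distribution = {1: 7, 2: 16, 3: 11, 4: 4, 5: 4, 6: 1, 7: 8}

# Expand the counts into one entry per meta-analysis.
scores = [score for score, count in distribution.items() for _ in range(count)]
n = len(scores)

print(n)                       # 51
print(median(scores))          # 3
print(round(mean(scores), 2))  # 3.33
for score, count in distribution.items():
    print(f"score {score}: {count} ({count / n:.0%})")
```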

Table 3 shows that each meta-analysis included a mean of 27 studies (range 3–420) and 11,853 patients (range 303–327,523). The mean impact factor of the journal of publication was 2.684 (range 0.502–6.674). PubMed was searched using the name of each author of the article, and the totals were summed per publication (eg, if a publication had 5 authors, 4 of whom had previously published 1 meta-analysis each and the fifth of whom had published 2, the total for that paper was 6 prior meta-analysis publications). When authors on a publication had coauthored a prior meta-analysis together, it was counted only once. A summary “density” score was then calculated by dividing the total number of prior meta-analyses by the number of authors on the paper. The mean density score for the 51 articles was 1.48 (range 0.2–8.3). On average, the authors of each publication had collectively published 4.78 other meta-analyses (range 0–25). The following factors were not significantly correlated with OSQ: the impact factor of the journal in which the meta-analysis was published, the number of patients and studies analyzed, and the summary measure of effect (OR or RR) (Table 3). A significant positive correlation was detected between the OSQ and the number of previous meta-analyses published by the authors of the index meta-analysis. Analysis by the number of other meta-analyses published by all the authors of a given paper is presented in Table 4. In most papers (40 of 51), at least 1 author had previously published a meta-analysis. Papers produced by authors with a prior meta-analysis publication had a significantly higher mean OSQ than those produced by first-time authors (3.55 versus 2.55).
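
The counting rule for prior publications can be sketched as a set union: each author's prior meta-analyses form a set, a meta-analysis shared by coauthors is counted once, and the density is the size of the union divided by the number of authors. The identifiers below are hypothetical placeholders (in practice they might be PubMed IDs), and the first example reproduces the worked example in the text under the assumption that none of the 6 prior meta-analyses were shared.

```python
def prior_meta_analysis_scores(prior_by_author: dict[str, set[str]]) -> tuple[int, float]:
    """Return (total distinct prior meta-analyses, density = total / number of authors).

    A prior meta-analysis coauthored by several authors of the paper is counted once,
    so the total is the size of the union of the authors' publication sets.
    """
    distinct = set().union(*prior_by_author.values())
    total = len(distinct)
    return total, total / len(prior_by_author)


# Worked example from the text: 5 authors, 4 with 1 prior meta-analysis each and
# 1 with 2 (hypothetical identifiers, no overlap).
paper = {
    "author1": {"ma-A"},
    "author2": {"ma-B"},
    "author3": {"ma-C"},
    "author4": {"ma-D"},
    "author5": {"ma-E", "ma-F"},
}
print(prior_meta_analysis_scores(paper))  # (6, 1.2)

# If 2 authors share the same prior meta-analysis, it is counted only once.
paper["author2"] = {"ma-A"}
print(prior_meta_analysis_scores(paper))  # (5, 1.0)
```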

TABLE 3. Miscellaneous Characteristics of Publications and Correlation Between Factors

TABLE 4. Analysis of Factors Determining Overall Quality of Publications

Various subgroup analyses are outlined in Table 4. The number of meta-analyses published per year ranged from 7 to 11, and mean OSQ did not differ significantly by year of publication. The number of publications per journal is listed in decreasing order of mean OSQ. The Annals of Surgery published the most meta-analyses of any single journal (6); notably, it also had the highest mean OSQ (4.67). Journals publishing at least 2 meta-analyses were analyzed separately. Surgical journals that published only 1 meta-analysis were grouped together (mean OSQ, 3.250), as were all nonsurgical journals (“all other publications,” mean OSQ, 3.308). Quality varied substantially by journal of publication. We also categorized articles by topic; in decreasing order of mean OSQ, the topics were laparoscopy, surgical technique, wound closure, miscellaneous, oncology, anticoagulation, and antibiotic therapy. Finally, articles were categorized according to the department(s) that produced the publication; the number of publications and the mean OSQ for each category are listed in Table 4. In decreasing order of mean OSQ, the departments of publication were Public Health/Epidemiology, Other, Surgery and Public Health/Epidemiology, and Surgery.
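
The journal stratification can be sketched in Python as follows. The journal names and scores are hypothetical, and the rule that every nonsurgical journal falls into “all other publications” regardless of how many meta-analyses it published is an assumption based on the wording above rather than a stated algorithm.

```python
from collections import Counter, defaultdict
from statistics import mean


def group_label(journal: str, is_surgical: bool, counts: Counter) -> str:
    """Stratum used for the journal subgroup analysis (assumed rule)."""
    if not is_surgical:
        return "all other publications"
    if counts[journal] >= 2:
        return journal  # surgical journals with >= 2 meta-analyses stand alone
    return "surgical journals with 1 meta-analysis"


def mean_osq_by_group(records: list[tuple[str, bool, int]]) -> dict[str, float]:
    """Mean OSQ per stratum, ordered from highest to lowest."""
    counts = Counter(journal for journal, _, _ in records)
    groups: dict[str, list[int]] = defaultdict(list)
    for journal, is_surgical, osq in records:
        groups[group_label(journal, is_surgical, counts)].append(osq)
    means = {label: mean(scores) for label, scores in groups.items()}
    return dict(sorted(means.items(), key=lambda item: item[1], reverse=True))


# Hypothetical records: (journal, is a surgical journal, OSQ score).
records = [
    ("Journal A", True, 5), ("Journal A", True, 4),
    ("Journal B", True, 2), ("Journal C", True, 3),
    ("Journal D", False, 2), ("Journal D", False, 4),
]
print(mean_osq_by_group(records))
```

In this example, Journal A stands alone (mean 4.5), Journals B and C are pooled as single-publication surgical journals (mean 2.5), and Journal D joins the nonsurgical group (mean 3).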

DISCUSSION

To the best of our knowledge, this is the first review of meta-analyses published on general surgical topics. Despite recognition of the importance of strategies to limit bias in the assembly and analysis of primary studies included in meta-analyses,19,24 our review demonstrates that this knowledge has not been well implemented in the general surgical literature. When conducted rigorously, systematic reviews may usefully guide practice, teaching, research, and health policy.2 However, when bias in the primary studies and/or the review methods is not minimized, the results and conclusions of meta-analyses may not be valid.5,25 Dissemination of meta-analyses with significant methodologic flaws may lead readers to abandon systematic reviews altogether or to misinterpret or misuse them.

Overall, we found that the quality of these surgical meta-analyses is low. We identified factors associated with both high and low overall quality scores. Two factors correlated with the overall score: the department producing the publication and the authors' experience in publishing meta-analyses. Studies produced by groups in which at least 1 author was a member of a department of public health or an epidemiology unit had the highest scores, whereas meta-analyses produced entirely by members of a surgical department had the lowest. In addition, meta-analyses by authors with at least 1 previously published meta-analysis had higher mean overall scores than other meta-analyses. We suggest that meta-analyses should be authored by a group of individuals with both clinical and methodologic expertise. Where none of the authors has prior methodologic expertise or experience, an expert should be consulted and the QUOROM guidelines should be followed.

We found that mean OSQ by journal of publication varied widely, from 1.500 to 4.670. Although the mean summary score did not significantly correlate with the impact factor of the journal, 2 of the 3 journals with the highest mean scores are high-impact surgical journals (Annals of Surgery and the British Journal of Surgery [BJS]). The BJS instructions to authors refer those submitting a systematic review/meta-analysis to the QUOROM criteria.9 To some degree, it is the responsibility of peer reviewers and editors to ensure that careful steps are taken to produce a valid meta-analysis with transparent reporting of systematic review methods. Editorial interest in these issues may partially explain the high quality of meta-analyses in certain journals.

The methodologic areas of weakness we identified include validity assessment, selection bias, reporting of search strategies, and pooling of data. Kelly et al26 have outlined strategies that can be used both to assess and to improve the scientific quality of meta-analyses/systematic reviews; these are summarized below. Validity assessment can be improved by using a validated scoring system to grade the quality of the studies included in the meta-analysis. Selection and ascertainment bias can be minimized by the use of multiple assessors, blinding of assessors, adjudication, and measurement reporting. Comprehensive search strategies, together with explicit reporting of those strategies, should decrease bias. A comprehensive search includes MEDLINE, EMBASE, and other bibliographic databases; a search for unpublished literature; contact with authors; hand searching; and inclusion of non-English-language publications. The use of methodologic guidelines such as those incorporated in the QUOROM statement should help to minimize bias and poor reporting of meta-analyses.

By comparison, the surgical meta-analyses in this review fall somewhat short of those in the anesthesia literature; 41.5% of the systematic reviews in a recent anesthesia report had minor or minimal flaws,1 and the mean OSQ in that review was 4.3. The emergency medicine literature has also been reviewed using the OQAQ; only 13% of those reviews had minor flaws,26 and the mean overall score was 2.7, significantly lower than the mean score of 3.33 in our present review. Thus, although we found significant shortcomings in meta-analyses in the field of general surgery, the surgical literature is in keeping with the quality of published meta-analyses in other fields of medicine. Despite these findings, there are some encouraging trends. Figure 2 shows the mean OSQ and impact factor by year of publication. Over time, meta-analyses have been published in journals with increasingly high impact factors. Possible explanations for this change include dissemination and acceptance of the value of meta-analytic methods by both researchers and peer reviewers, and an increase in the impact factors of surgical journals. The trend in OSQ over time is less robust and may reflect a slow improvement in overall quality.

FIGURE 2. Characteristics of meta-analysis by year of publication. White line represents mean OSQ score by year (left vertical axis). Black line represents mean impact factors by year (right vertical axis). Horizontal axis represents year of publication, with n equal to number of articles published for a given year.

Our analysis also suggests a potential relationship between the overall quality of published meta-analyses and both the direction and the magnitude of treatment effect estimates. Estimates of effect (ORs and relative risks) showed a weak correlation with OSQ that was not statistically significant (Table 3): as the quality of the meta-analysis decreased, the magnitude of effect increased (correlation coefficient = −0.12). This overestimation of effect has been demonstrated in prior studies examining both randomized trials and meta-analyses,10,11 with overestimates as high as 41%. Results from studies of low quality therefore may not be valid and, more importantly, may demonstrate systematic error, that is, a bias favoring the experimental treatment.

Overall, the scientific quality of meta-analyses published on topics pertaining to general surgery is low, and the majority have methodologic flaws. This may impair the validity of these publications and thus limit their use for clinical, educational, research, and policy purposes. In the future, more attention to rigorous systematic review methods by authors, constructively critical suggestions by peer reviewers, and attention to the QUOROM statement recommendations by editors should lead to improvement in these important publications in the field of surgery.

ACKNOWLEDGMENTS

Dr. D. Cook is a Research Chair of the Canadian Institutes for Health Research.

Footnotes

Reprints: Christopher James Doig, MD, MSc, University of Calgary, Foothills Hospital, Division of Critical Care, 1403–29th Street NW, Calgary, Alberta, Canada T2N 2T9. E-mail: Chip.Doig@CalgaryHealthRegion.ca.

REFERENCES