Heading down the wrong pathway: on the influence of correlation within gene sets

Daniel M Gatti et al. BMC Genomics. 2010.

Abstract

Background: Analysis of microarray experiments often involves testing for the overrepresentation of pre-defined sets of genes among lists of genes deemed individually significant. Most popular gene set testing methods assume the independence of genes within each set, an assumption that is seriously violated, as extensive correlation between genes is a well-documented phenomenon.

Results: We conducted a meta-analysis of over 200 datasets from the Gene Expression Omnibus in order to demonstrate the practical impact of strong gene correlation patterns that are highly consistent across experiments. We show that a common independence assumption-based gene set testing procedure produces very high false positive rates when applied to data sets for which treatment groups have been randomized, and that gene sets with high internal correlation are more likely to be declared significant. A reanalysis of the same datasets using an array resampling approach properly controls false positive rates, leading to more parsimonious and high-confidence gene set findings, which should facilitate pathway-based interpretation of the microarray data.
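The contrast described in the Results can be illustrated with a small simulation. The sketch below is not the authors' pipeline: it assumes a hypergeometric over-representation test as the representative independence-based method, a latent-factor model to create within-set correlation, and a simple label-permutation scheme standing in for array resampling. All function names, the correlation `rho`, and the cut-off `t_cut` (approximately the two-sided 5% critical value of a t distribution with 10 degrees of freedom) are hypothetical choices made for illustration.

```python
import math
import random


def t_stat(x, y):
    """Pooled-variance two-sample t statistic for two lists of values."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    pooled = (ssx + ssy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(pooled * (1 / nx + 1 / ny))


def hypergeom_tail(k, n_genes, set_size, n_sig):
    """P(X >= k) when n_sig genes are drawn at random from n_genes, of which
    set_size belong to the gene set. This null assumes independent genes."""
    total = math.comb(n_genes, n_sig)
    upper = min(set_size, n_sig)
    hits = sum(math.comb(set_size, i) * math.comb(n_genes - set_size, n_sig - i)
               for i in range(k, upper + 1))
    return hits / total


def set_p_value(data, labels, gene_set, t_cut):
    """Hypergeometric p-value for over-representation of gene_set among
    genes whose |t| exceeds t_cut, given a 0/1 group label per array."""
    g0 = [j for j, lab in enumerate(labels) if lab == 0]
    g1 = [j for j, lab in enumerate(labels) if lab == 1]
    sig = [abs(t_stat([row[j] for j in g0], [row[j] for j in g1])) > t_cut
           for row in data]
    k = sum(sig[g] for g in gene_set)
    return hypergeom_tail(k, len(data), len(gene_set), sum(sig))


def simulate_null(n_genes, set_size, n_per_group, rho, rng):
    """Expression matrix (genes x arrays) with no treatment effect. Genes
    0..set_size-1 share a latent factor, so they are pairwise correlated."""
    n = 2 * n_per_group
    factor = [rng.gauss(0, 1) for _ in range(n)]
    data = []
    for g in range(n_genes):
        if g < set_size:
            row = [math.sqrt(rho) * factor[j]
                   + math.sqrt(1 - rho) * rng.gauss(0, 1) for j in range(n)]
        else:
            row = [rng.gauss(0, 1) for _ in range(n)]
        data.append(row)
    return data


def false_positive_rates(n_sim=100, n_perm=50, n_genes=60, set_size=12,
                         n_per_group=6, rho=0.8, t_cut=2.228, alpha=0.05,
                         seed=1):
    """Fraction of null datasets in which the correlated set is called
    significant by (a) the hypergeometric test, (b) label permutation."""
    rng = random.Random(seed)
    labels = [0] * n_per_group + [1] * n_per_group
    gene_set = range(set_size)
    fp_indep = fp_resamp = 0
    for _ in range(n_sim):
        data = simulate_null(n_genes, set_size, n_per_group, rho, rng)
        p_obs = set_p_value(data, labels, gene_set, t_cut)
        fp_indep += p_obs < alpha
        # Shuffle group labels across whole arrays, so the gene-gene
        # correlation structure is preserved in every resample.
        hits = 0
        for _ in range(n_perm):
            perm = labels[:]
            rng.shuffle(perm)
            hits += set_p_value(data, perm, gene_set, t_cut) <= p_obs
        p_perm = (hits + 1) / (n_perm + 1)
        fp_resamp += p_perm < alpha
    return fp_indep / n_sim, fp_resamp / n_sim
```

Calling `false_positive_rates()` returns the two estimated false positive rates as a pair. Based on the findings reported above, the independence-based rate would be expected to exceed `alpha` for a strongly correlated set while the resampling rate stays near it, although a simulation this small is noisy and the exact values depend on the assumed parameters.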

Conclusions: These findings call into question many of the gene set testing results in the literature and argue strongly for the adoption of resampling-based gene set testing criteria in the peer-reviewed biomedical literature.

Figures

Figure 1

Publications that assume independence between genes (light grey) greatly outnumber publications that use array resampling methods (dark grey). Panel (a) shows the cumulative number of publications and panel (b) shows the number of publications using each method per year. Year of publication is displayed on the horizontal axis.

Figure 2

The inter-gene correlation within GO categories is consistent across experiments and platforms. Mean correlation among genes in GO categories (a,b) and KEGG pathways (c,d) on two human (a,c) and two mouse (b,d) microarray platforms. The correlation of all transcripts with all transcripts on each platform is shown in red. Spearman correlations between the mean correlations are shown in the upper right. Crosses represent ±1 standard error on each axis.

Figure 3

False positive rates are greatly increased when independence assumption methods are applied to GO Biological Process categories. The proportion of permutations in which at least one GO Biological Process category is called significant using an independence assumption method with a Bonferroni correction (α = 0.05), the Benjamini & Hochberg FDR (α = 0.05, 0.10), and the resampling approach described in this manuscript. Red lines mark the 5% and 10% levels.
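For readers unfamiliar with the two multiplicity corrections named in this caption, the following is a minimal sketch of the standard Bonferroni and Benjamini & Hochberg step-up rules. The function names and the example p-values are hypothetical, and the source does not say how the authors implemented these corrections.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini & Hochberg step-up rule: find the largest rank k with
    p_(k) <= k * alpha / m and reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Hypothetical p-values for five gene sets from one permuted dataset.
p = [0.001, 0.012, 0.025, 0.045, 0.5]
bonf = bonferroni_reject(p)
bh = benjamini_hochberg_reject(p)
# The event tallied per permutation in the figure: at least one set called.
any_called = any(bh)
```

Applying either function to each permuted dataset and recording whether any category is rejected gives the kind of proportion plotted in the figure.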

Figure 4

Variance inflation due to correlation of gene expression increases the false positive rate, even when using a Bonferroni correction. The percentage of permutations in which at least one GO Biological Process category was called significant is shown versus the variance of the gene set statistic, for two human (a,b) and two mouse (c,d) arrays. Spearman correlations are in the upper left of each panel.
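A standard identity shows why correlation inflates the variance of a gene set statistic. As an illustrative assumption (the source does not define the statistic here), take the set statistic to be the sum of m gene-level statistics T_1, ..., T_m, each with variance σ² and a common pairwise correlation ρ:

```latex
\operatorname{Var}\!\left(\sum_{i=1}^{m} T_i\right)
  = \sum_{i=1}^{m} \operatorname{Var}(T_i)
    + \sum_{i \neq j} \operatorname{Cov}(T_i, T_j)
  = m\sigma^2 + m(m-1)\rho\sigma^2
  = m\sigma^2 \left[ 1 + (m-1)\rho \right].
```

Under independence (ρ = 0) the variance is mσ², so the factor 1 + (m − 1)ρ is the inflation relative to what an independence-based test assumes. For example, with m = 50 genes and ρ = 0.1 the factor is 5.9, so even modest correlation makes extreme values of the set statistic much more likely by chance than the independence null predicts.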

Figure 5

GO Biological Process categories that are called significant by chance under permutation are likely to be called significant in the observed data. The proportion of times that a category is declared significant under permutation is plotted versus the proportion of times it is called significant in the observed data. Spearman correlations are shown in the upper left corner.
