Discovering statistically significant pathways in expression profiling studies

Lu Tian et al. Proc Natl Acad Sci U S A. 2005.

Abstract

Accurate and rapid identification of perturbed pathways through the analysis of genome-wide expression profiles facilitates the generation of biological hypotheses. We propose a statistical framework for determining whether a specified group of genes for a pathway has a coordinated association with a phenotype of interest. Several issues on proper hypothesis-testing procedures are clarified. In particular, it is shown that the differences in the correlation structure of each set of genes can lead to a biased comparison among gene sets unless a normalization procedure is applied. We propose statistical tests for two important but different aspects of association for each group of genes. This approach has more statistical power than currently available methods and can result in the discovery of statistically significant pathways that are not detected by other methods. This method is applied to data sets involving diabetes, inflammatory myopathies, and Alzheimer's disease, using gene sets we compiled from various public databases. In the case of inflammatory myopathies, we have correctly identified the known cytotoxic T lymphocyte-mediated autoimmunity in inclusion body myositis. Furthermore, we predicted the presence of dendritic cells in inclusion body myositis and of an IFN-alpha/beta response in dermatomyositis, neither of which was previously described. These predictions have been subsequently corroborated by immunohistochemistry.

Figures

Fig. 1.

Outline of the methodology. An extensive collection of pathway information is assembled from various databases; a statistical test is applied to find relationships between the expression levels and the phenotype, and then two different testing procedures are used to find statistically significant pathways. Proper adjustments for correlation structure and multiple testing are critical.
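The outline above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the Welch t-statistic as the per-gene association measure, the mean as the gene-set summary, and all names (`t_statistic`, `set_score`, `permutation_p_value`, the toy `expr` and `labels` data) are assumptions made for the example.

```python
import random
import statistics


def t_statistic(a, b):
    """Welch two-sample t-statistic between two lists of expression values."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def set_score(expr, labels, gene_set):
    """Mean per-gene t-statistic over the genes in gene_set (assumed summary)."""
    stats = []
    for gene in gene_set:
        case = [x for x, lab in zip(expr[gene], labels) if lab == 1]
        ctrl = [x for x, lab in zip(expr[gene], labels) if lab == 0]
        stats.append(t_statistic(case, ctrl))
    return statistics.mean(stats)


def permutation_p_value(expr, labels, gene_set, n_perm=500, seed=0):
    """Two-sided p-value obtained by shuffling the sample labels."""
    rng = random.Random(seed)
    observed = set_score(expr, labels, gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(set_score(expr, shuffled, gene_set)) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 6 cases (label 1) followed by 6 controls (label 0).
labels = [1] * 6 + [0] * 6
expr = {
    "geneA": [5.1, 4.9, 5.3, 4.8, 5.2, 5.0, 1.0, 1.2, 0.9, 1.1, 0.8, 1.3],
    "geneB": [4.2, 4.6, 4.4, 4.1, 4.5, 4.3, 0.7, 0.9, 1.1, 0.6, 1.0, 0.8],
}
observed, p_value = permutation_p_value(expr, labels, ["geneA", "geneB"])
```

Shuffling sample labels keeps the gene-gene correlation within each set intact, which is one way to respect the correlation structure the caption refers to. When many gene sets are tested, the resulting p-values would still need a multiple-testing adjustment, which this sketch omits.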

Fig. 2.

A scatterplot of the SDs of null distributions for the ES vs. the observed ES for the diabetes data. Each point represents a gene set. The Pearson correlation coefficient is 0.55. Without proper normalization among different gene sets, a high score may be due to its wide null distribution, which depends on the size and correlation structure of the gene set.
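To make the normalization point concrete, the sketch below standardizes a raw gene-set score by the mean and SD of that set's own null distribution. It is a hedged illustration under simple assumptions, not the exact procedure of the paper: `normalized_score` and the tiny hand-picked null samples are hypothetical, and in practice the null scores would come from many permutations.

```python
import statistics


def normalized_score(observed, null_scores):
    """Standardize an observed gene-set score against its own permutation null."""
    center = statistics.mean(null_scores)
    spread = statistics.stdev(null_scores)
    return (observed - center) / spread


# Hypothetical null scores for two gene sets that share the same raw score.
narrow_null = [-1.0, 0.0, 1.0]   # a set whose null distribution is narrow
wide_null = [-2.0, 0.0, 2.0]     # a set whose null distribution is wide
raw_score = 2.0

narrow_z = normalized_score(raw_score, narrow_null)
wide_z = normalized_score(raw_score, wide_null)
```

Here the same raw score of 2.0 is twice as extreme relative to the narrow null as it is relative to the wide one, which is why ranking gene sets on unnormalized scores can favor sets with wide null distributions.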
