Theme discovery from gene lists for identification and viewing of multiple functional groups

Comparative Study

Petri Pehkonen et al. BMC Bioinformatics. 2005.

Abstract

Background: High-throughput methods of the genome era produce vast amounts of data in the form of gene lists. These lists are large and difficult to interpret without advanced computational or bioinformatic tools. Most existing methods analyse a gene list as a single entity, although it is composed of multiple gene groups associated with separate biological functions. Therefore, it is imperative to define and visualize gene groups with unique functionality within gene lists.

Results: To analyse the functional heterogeneity within a gene list, we have developed a method that clusters genes into groups with homogeneous functionalities. The method uses Non-negative Matrix Factorization (NMF) to create several clustering results with varying numbers of clusters. The obtained clustering results are combined into a simple graphical presentation showing the functional groups over-represented in the analysed gene list. We demonstrate its performance on two data sets and show results that improve upon existing methods. The comparison also shows that our method creates a simpler view that aids the discovery of biological themes within the list and discards less informative classes from the results.
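The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary gene-by-class matrix as input, uses the standard multiplicative-update form of NMF, and assigns each gene to the factor with the largest weight. The function names `nmf`, `cluster_labels` and `cluster_layers`, and the iteration count, are hypothetical choices made for illustration.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factorize a non-negative n x m matrix V into W (n x k) and H (k x m)
    with multiplicative updates that reduce the squared error ||V - WH||^2."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

def cluster_labels(W):
    """Assumption: a gene belongs to the factor with its largest weight in W."""
    return [max(range(len(row)), key=row.__getitem__) for row in W]

def cluster_layers(V, max_k):
    """One clustering solution per number of clusters k = 2..max_k."""
    return {k: cluster_labels(nmf(V, k)[0]) for k in range(2, max_k + 1)}

# Toy input: six genes (rows) annotated to four functional classes (columns).
V = [[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3
layers = cluster_layers(V, 3)
```

Because each value of k is factorized independently, the resulting layers need not be nested, which matches the non-nested tree described in the figure captions.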

Conclusion: The presented method and associated software are useful for the identification and interpretation of biological functions associated with gene lists and are especially useful for the analysis of large lists.

Figures

Figure 1

Graphical results from the analysis of the H2O2 dataset. The figure shows the non-nested hierarchical clustering tree obtained from GENERATOR with the H2O2 dataset. Each layer presents one clustering solution and each box a single cluster. Boxes show the two best-scoring functional classes, and the colour of the box corresponds to the over-representation of the best-scoring functional class. The best-correlating clusters in consecutive clustering layers are connected with lines. A thicker line indicates a stronger correlation. The correlation value is indicated beside each line. The lines between the first and second levels (marked with asterisks) do not show a value because the correlation measure is not defined there. Section A presents a view in which the two functional classes that contributed most to the cluster formation are shown for each cluster. Section B shows a more informative visualization, the default view of GENERATOR, in which the two classes that were most over-represented in both the original sample list and the cluster in question are shown. Note the conserved clusters across the different clustering results, which are marked with Roman numerals.
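The linking of clusters between consecutive layers can be sketched as follows. The caption does not say which quantities are correlated, so this example assumes Pearson correlation between binary cluster-membership vectors; that choice, and the names `pearson`, `membership` and `link_layers`, are illustrative assumptions rather than the method used by GENERATOR.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation of two equal-length vectors, or None if undefined."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return None  # a constant vector has no defined correlation
    return cov / sqrt(var_u * var_v)

def membership(labels, cluster):
    """Binary indicator: 1 for genes in the given cluster, else 0."""
    return [1 if label == cluster else 0 for label in labels]

def link_layers(parent_labels, child_labels):
    """For each cluster in the lower layer, return (best parent, correlation)."""
    parents = sorted(set(parent_labels))
    links = {}
    for child in sorted(set(child_labels)):
        child_vec = membership(child_labels, child)
        best = (parents[0], None)
        for parent in parents:
            r = pearson(membership(parent_labels, parent), child_vec)
            if r is not None and (best[1] is None or r > best[1]):
                best = (parent, r)
        links[child] = best
    return links

# Toy layers over six genes: two clusters above, three clusters below.
links = link_layers([0, 0, 0, 1, 1, 1], [0, 0, 2, 1, 1, 1])
# A single top-level cluster gives a constant membership vector.
top_links = link_layers([0] * 6, [0, 0, 0, 1, 1, 1])
```

Under this assumption, a top layer consisting of one cluster yields a constant membership vector with zero variance, so no correlation value can be computed for its links, which is consistent with the unlabelled lines between the first and second levels.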

Figure 2

Replications of the non-nested hierarchical clustering tree with the H2O2 dataset. The figure presents four replications of the non-nested hierarchical clustering graph for the H2O2 dataset. The conserved gene clusters are marked with the same Roman numerals as in Figure 1. Note that most clusters (especially I, II and III) can be observed over several levels in each cluster tree.

Figure 3

Graphical results from the analysis of the itraconazole dataset. The figure shows the non-nested hierarchical clustering tree obtained from GENERATOR with the itraconazole dataset. Section A shows the tree with the functional classes that contributed most to the formation of each cluster. Section B shows the default view of GENERATOR, with the most over-represented functional classes in the original list and in the cluster in question. The details of the presentation are explained in the text for Figure 1. In this figure, too, some conserved clusters are highlighted with Roman numerals.

Figure 4

Flow diagram of the method. The gene associations with the GO functional classes in the sample and reference gene lists are transformed into binary matrices (A and a) and a sum vector (b). The sample set is clustered with an NMF-based method (B) into a varying number of sub-groups, producing a non-nested hierarchical tree (C). The contents of the clusters are described by the classes over-represented within them (c and D).
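The data transformation in the first step can be sketched as below. This is one plausible reading of the caption, not the authors' code: each gene list becomes a binary gene-by-class matrix, and the column sums of such a matrix give the number of genes per class. The gene and class labels, and the names `binary_matrix` and `class_counts`, are hypothetical.

```python
def binary_matrix(annotations, genes, classes):
    """Rows are genes, columns are functional classes; an entry is 1 when
    the gene is annotated to the class and 0 otherwise."""
    return [[1 if c in annotations.get(g, ()) else 0 for c in classes]
            for g in genes]

def class_counts(matrix):
    """Sum vector: the number of genes annotated to each class."""
    return [sum(column) for column in zip(*matrix)]

# Hypothetical gene-to-class annotations (placeholder labels, not real GO terms).
annotations = {
    "geneA": {"class1", "class2"},
    "geneB": {"class1"},
    "geneC": {"class3"},
}
classes = ["class1", "class2", "class3"]
sample = binary_matrix(annotations, ["geneA", "geneB"], classes)
reference_counts = class_counts(binary_matrix(annotations, ["geneC"], classes))
```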

Figure 5

The measures for studying over-representation of classes. Over-representation of classes is measured by using A) the whole sample gene list as the sample and the reference gene list as the remainder population, O.log(p); B) a single cluster as the sample and the rest of the sample gene list together with the reference gene list as the remainder population, C.log(p); and C) a single cluster as the sample and the rest of the sample gene list as the remainder population, excluding the reference gene list, S.log(p). In each situation, Fisher's exact test f(x, M, n, k) [16] is used to determine the over-representation. O.log(p) represents the original over-representation in the sample gene list without clustering. C.log(p) highlights the classes that are over-represented both in the original sample gene list and in the individual cluster. S.log(p) reports the contribution of a class to the formation of the cluster structure.
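The three measures differ only in which genes form the sample and which form the remainder population, as the following sketch shows for a single functional class. It is a minimal illustration under stated assumptions: the one-sided Fisher's exact test is computed as a hypergeometric upper tail, the sample and reference lists are treated as disjoint, and the scores are reported as -log10(p), since the caption gives neither the log base nor the sign. The argument order of `overrep_p` is chosen for this example and is not claimed to match f(x, M, n, k); all function names are hypothetical.

```python
from math import comb, log10

def overrep_p(x, sample_size, class_total, population_size):
    """One-sided p-value P(X >= x): the chance of drawing at least x class
    members when sample_size genes are taken from a population of
    population_size genes containing class_total class members."""
    upper = min(sample_size, class_total)
    tail = sum(comb(class_total, i)
               * comb(population_size - class_total, sample_size - i)
               for i in range(x, upper + 1))
    return tail / comb(population_size, sample_size)

def three_measures(cluster, sample, reference):
    """Each argument is (genes in the class, total genes) for that gene set.
    The cluster is a subset of the sample list; the reference is disjoint."""
    (c_x, c_n), (s_x, s_n), (r_x, r_n) = cluster, sample, reference
    p_values = {
        "O": overrep_p(s_x, s_n, s_x + r_x, s_n + r_n),  # sample vs reference
        "C": overrep_p(c_x, c_n, s_x + r_x, s_n + r_n),  # cluster vs all others
        "S": overrep_p(c_x, c_n, s_x, s_n),              # cluster vs rest of sample
    }
    return p_values, {name: -log10(p) for name, p in p_values.items()}

# Toy counts: a cluster of 3 genes, all in the class; a sample list of 6 genes
# with 3 in the class; a reference list of 6 genes with none in the class.
p_values, scores = three_measures((3, 3), (6, 6)[0:1] + (6,) if False else (3, 6), (0, 6))
```

In this toy case the class is concentrated in one cluster, so the cluster-based measures give smaller p-values than the measure computed on the whole sample list.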

References

    1. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO. Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000;11:4241–4257.
    2. Thorpe GW, Fong CS, Alic N, Higgins VJ, Dawes IW. Cells have distinct mechanisms to maintain protection against different reactive oxygen species: oxidative-stress-response genes. Proc Natl Acad Sci U S A. 2004;101:6564–6569. doi: 10.1073/pnas.0305888101.
    3. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998;95:14863–14868. doi: 10.1073/pnas.95.25.14863.
    4. Weng S, Dong Q, Balakrishnan R, Christie K, Costanzo M, Dolinski K, Dwight SS, Engel S, Fisk DG, Hong E, Issel-Tarver L, Sethuraman A, Theesfeld C, Andrada R, Binkley G, Lane C, Schroeder M, Botstein D, Cherry JM. Saccharomyces Genome Database (SGD) provides biochemical and structural information for budding yeast proteins. Nucleic Acids Res. 2003;31:216–218. doi: 10.1093/nar/gkg054.
    5. Mewes HW, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B. MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 2000;28:37–40. doi: 10.1093/nar/28.1.37.