GATHER: a systems approach to interpreting genomic signatures

Journal Article

Department of Molecular Genetics and Microbiology, Duke Institute for Genome Sciences and Policy, Duke University Medical Center

Durham, NC 27710, USA

Joseph R. Nevins

Department of Molecular Genetics and Microbiology, Duke Institute for Genome Sciences and Policy, Duke University Medical Center

Durham, NC 27710, USA

*To whom correspondence should be addressed.

Revision received:

23 August 2006

Accepted:

11 September 2006

Abstract

Motivation: Understanding the full meaning of the biology captured in molecular profiles, within the context of the entire biological system, cannot be achieved with a simple examination of the individual genes in the signature. To facilitate such an understanding, we have developed GATHER, a tool that integrates various forms of available data to elucidate biological context within molecular signatures produced from high-throughput post-genomic assays.

Results: Analyzing the Rb/E2F tumor suppressor pathway, we show that GATHER identifies critical features of the pathway. We further show that GATHER identifies common biology in a series of otherwise unrelated gene expression signatures that each predict breast cancer outcome. We quantify the performance of GATHER and find that it successfully predicts 90% of the functions over a broad range of gene groups. We believe that GATHER provides an essential tool for extracting the full value from molecular signatures generated from genome-scale analyses.

Availability: GATHER is available at Author Webpage

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Classical genetic experiments establish relationships between single genes or small groups of genes and observable traits. This approach, when coupled with molecular biology techniques, represents the paradigm for much of 20th century biology: an ability to link gene function with biological phenotype. This paradigm has been extended with various computational tools that aid analyses of gene structure and function. One widely used tool, BLAST, infers the function of a single gene or protein from the functions of its homologs (Altschul et al., 1990). Nevertheless, these approaches still represent the paradigm of single gene analyses, although now in a high-throughput fashion.

With the development of complete genome sequences, the study of biology has been transformed in two ways. The first, and perhaps most critical, has been the development of technologies that enable high-throughput assays of gene activity. When focused on a biological process of interest, such assays identify groups of molecules that cooperatively effect or signify a process. In fact, many biological processes are evident only when analyzing the general patterns manifested by coordinate gene activity (Mootha et al., 2003). Such signatures describe a phenotype as a snapshot of gene activity in a cell or tissue sample at a given instant and have been developed to distinguish classes of leukemia using gene expression (Golub et al., 1999), to predict ovarian cancer using serum proteins (Petricoin et al., 2002) and to recognize deregulation of oncogenes in tumors (Bild et al., 2006). Such assays are becoming commonplace; the GEO repository of gene expression data is growing at an average rate of >20 million data points a month (Barrett et al., 2005). High-throughput genomic signatures are transforming biology from an observational molecular science to a data-intensive quantitative genomic science.

Second, databases of genomic data, such as gene or protein sequences, transcription factor-binding localizations, microRNA regulation, protein interaction networks or pathways, provide opportunities to study a signature in the context of various biological systems. By linking multiple systems together and understanding the signature with respect to transcriptional or post-transcriptional regulation, protein complexes or its relationship to other pathways, there is an opportunity to develop novel insight and a deeper understanding of biological processes (Kirschner, 2005).

Typically, analyses of gene expression profiles focus on the function of single genes within the profile, searching for aspects of function that might be relevant in the context of the profiling experiment. Such a strategy, however, often focuses on a few genes within the profile that seem logical while ignoring the vast majority of genes, which might provide the most relevant context. As such, single gene annotations may well miss important associations with pathways or other biological contexts that are most relevant. Critically important is the recognition that biological processes most likely involve the concerted action of multiple genes. The identification of profiles or signatures through microarray experiments provides an opportunity to reveal these events but requires methods that can evaluate groups of genes in a biologically relevant manner.

To address this changing paradigm for functional discovery and provide a mechanism to extract the function from a comprehensive view of a signature, we have developed GATHER, a Gene Annotation Tool to Help Explain Relationships. It is accessible using a web-based interface that allows entry of a list of genes from a high-throughput genomic experiment and can analyze the signature against a series of data sources.

To illustrate the use of GATHER, we have analyzed gene expression signatures that reflect the activity of the Rb/E2F pathway. In addition to the Rb/E2F analysis, we provide a quantitative evaluation of the accuracy with which GATHER can describe gene signatures. At the core of the GATHER system is a statistical foundation that quantifies the significance of functional associations. Methodologically, it builds upon previous methods that annotate genes with functional descriptors from Gene Ontology (GO) (Zeeberg et al., 2003; Al-Shahrour et al., 2004) or other sources (Zhang et al., 2005). These methods quantified the association of an annotation with a group of genes using the _P_-value from the hypergeometric distribution or chi-square test (Allison et al., 2006). Their results are essentially regarded as definitive, and coherence in such functional annotations has been used to verify experimentally derived networks (Rual et al., 2005). The application and goals motivating GATHER, however, are most closely related to Gene Set Enrichment Analysis, in that both algorithms find subtle functional patterns within the context of a molecular signature, although GATHER is not limited to signatures from gene expression (Subramanian et al., 2005).

GATHER extends previous work in three ways paramount for extracting maximal information from molecular signatures. First, GATHER has the capacity to discover novel functions of gene groups by integrating annotations from evolutionary homologs and other genes related through protein interaction or literature networks (Rual et al., 2005; Jenssen et al., 2001; Hoffmann and Valencia, 2004). Second, GATHER annotates the characteristics of the genes with respect to datasets from multiple systems, helping synthesize evidence to develop or reinforce hypotheses. A third advancement is the development of a Bayesian statistical model, which we show increases the accuracy with which GATHER can infer novel functions of signatures.

2 METHODS

2.1 Representations of annotations

We defined an annotation as any discrete or discretized attribute that could be associated with a gene. The distribution of an annotation across two gene groups was represented in a 2 × 2 contingency table $X$, where the rows $X_{1,\cdot}$ and $X_{2,\cdot}$ contained the counts for gene groups 1 and 2, respectively, and the columns $X_{\cdot,1}$ and $X_{\cdot,2}$ contained the numbers of genes that were or were not associated with the annotation.

In such an annotation table, gene group 1 contained the genes specified by the user and gene group 2 contained, conceptually, the remaining genes in the genome. In practice, however, many genes were not annotated (55% of the human genes in Entrez Gene had no GO codes), so we discarded those from the groups.
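As an illustration of the table described above, the following minimal sketch builds the 2 × 2 counts for one annotation from plain Python sets. The function name `contingency_table` and its arguments are hypothetical and are not part of GATHER; the sketch assumes that genes without any annotation are simply dropped from both groups, as the text describes.

```python
def contingency_table(user_genes, annotated_genes, genes_with_annotation):
    """Return [[x11, x12], [x21, x22]] for a single annotation.

    Row 1 is the user's gene group, row 2 is the remaining annotated genes.
    Column 1 counts genes carrying the annotation, column 2 those without it.
    All names here are illustrative, not GATHER's own API.
    """
    background = set(annotated_genes)        # genes lacking any annotation are discarded
    group1 = set(user_genes) & background    # user genes that survive the filter
    group2 = background - group1             # conceptually, the rest of the genome
    hits = set(genes_with_annotation) & background
    x11 = len(group1 & hits)
    x21 = len(group2 & hits)
    return [[x11, len(group1) - x11], [x21, len(group2) - x21]]


# Toy example with made-up gene labels; "Z" is unannotated and is dropped.
table = contingency_table({"A", "B", "Z"}, {"A", "B", "C", "D", "E"}, {"A", "C"})
```

In this toy example, group 1 keeps two genes (one with the annotation) and group 2 keeps three (one with the annotation).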

2.2 Statistical significance: Bayes factor

We quantified the evidence supporting the association between a gene group and an annotation using a Bayes factor (Gelman et al., 2003). This assessed the hypothesis that the distribution of annotations varied across gene groups against the hypothesis that the distribution was identical. A positive Bayes factor indicated that the evidence supported the association, while a negative one indicated no association. Its magnitude corresponded to the strength of the evidence for the association, where higher values were stronger. Because the strength of evidence was quantified on a continuous scale and required no significance cutoff, when testing multiple hypotheses, there was no cutoff to adjust. The Bayes factor, the strength of evidence for an association, did not depend on what other associations were also tested. We modeled the distribution of an annotation across groups of genes as a binomial process:

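$$\text{Bayes factor} = \frac{P(H_1 \mid \text{data})\,P(H_2)}{P(H_2 \mid \text{data})\,P(H_1)}$$

$$\begin{aligned}
P(H_1 \mid \text{data}) &\propto \mathrm{Bin}(\Theta_1 \mid X_{1,1}, X_{1,2})\,\mathrm{Bin}(\Theta_1 \mid \alpha, \beta) \times \mathrm{Bin}(\Theta_2 \mid X_{2,1}, X_{2,2})\,\mathrm{Bin}(\Theta_2 \mid \alpha, \beta) \\
&\propto \mathrm{Beta}(X_{1,1}+\alpha,\, X_{1,2}+\beta)\,\mathrm{Beta}(X_{2,1}+\alpha,\, X_{2,2}+\beta)
\end{aligned}$$

$$\begin{aligned}
P(H_2 \mid \text{data}) &\propto \mathrm{Bin}(\Theta_3 \mid X_{1,1}+X_{2,1},\, X_{1,2}+X_{2,2})\,\mathrm{Bin}(\Theta_3 \mid \alpha, \beta) \\
&\propto \mathrm{Beta}(X_{1,1}+X_{2,1}+\alpha,\, X_{1,2}+X_{2,2}+\beta)\,\mathrm{Beta}(\alpha, \beta)
\end{aligned}$$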

We assumed that Θ had a uniform prior distribution. We used non-informative priors, setting the hyperparameters α and β to one, and P(H1) and P(H2) to 50%. We imposed non-informative priors and distributions because, in the general case, there was no information about the strength of an association of an unknown annotation with an unknown gene list. However, for contexts where the expected degree and confidence of such an association may be known, they could be quantified as priors in this Equation.
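The Beta-function form of the two hypotheses can be evaluated directly. The sketch below is one possible reading, not the authors' implementation: it treats Beta(·,·) as the Beta function, takes α = β = 1 and equal prior probabilities (so the prior odds cancel), and reports the natural logarithm of the ratio. The log scale is an assumption, chosen because the text speaks of positive and negative Bayes factors; the function names are hypothetical.

```python
from math import lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor(table, alpha=1.0, beta=1.0):
    """Log Bayes factor for H1 (groups differ) versus H2 (groups identical).

    `table` is [[x11, x12], [x21, x22]]. Equal priors P(H1) = P(H2) = 0.5
    are assumed, so the prior terms drop out of the ratio.
    """
    (x11, x12), (x21, x22) = table
    # H1: each gene group has its own rate of carrying the annotation.
    log_h1 = log_beta(x11 + alpha, x12 + beta) + log_beta(x21 + alpha, x22 + beta)
    # H2: both groups share one rate, so the counts are pooled.
    log_h2 = log_beta(x11 + x21 + alpha, x12 + x22 + beta) + log_beta(alpha, beta)
    return log_h1 - log_h2


enriched = log_bayes_factor([[10, 0], [0, 10]])   # annotation confined to group 1
balanced = log_bayes_factor([[5, 5], [50, 50]])   # same proportion in both groups
```

Under these assumptions the value is positive when the annotation is concentrated in one group and negative when both groups carry it in the same proportion, which matches the sign convention described in the text.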

2.3 Sources of annotations

GATHER supports multiple types of annotations representing different biological systems. For each type of annotation, we obtained or created an index that mapped an annotation to an Entrez Gene ID. From the index, we could associate each annotation with a gene group using the contingency table representation described previously and calculate its significance. We included annotations from GO, MEDLINE abstracts, MeSH terms, a gene network derived from the literature, KEGG pathways, a protein interaction network, microRNA regulation and transcription factor-binding sites. A detailed description of the methods to obtain these annotations is provided in the Supplementary materials.

2.4 Inferring annotations

We created three networks where the nodes represented genes (or proteins) and the edges connected genes that were related according to different criteria. In the first network, the protein interaction network, we connected the genes that were shown to bind in at least one of two large-scale screens of protein binding in humans (Rual et al., 2005; Stelzl et al., 2005). In the second network, the literature network, we connected the genes that co-occurred in at least 10 MEDLINE records. The final network, the homology network, connected genes from different organisms if they were documented as orthologs in the Homologene database (Wheeler et al., 2001).
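The co-occurrence rule for the literature network can be sketched as follows. This is only an illustration of the thresholding step: it assumes each MEDLINE record has already been reduced to the set of genes it mentions, which the text does not detail, and the function name `literature_network` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def literature_network(records, min_cooccurrence=10):
    """Return the set of gene pairs mentioned together in enough records.

    `records` is an iterable of gene collections, one per record. Each edge
    is a sorted tuple, so (a, b) and (b, a) count as the same pair.
    """
    pair_counts = Counter()
    for genes in records:
        for pair in combinations(sorted(set(genes)), 2):
            pair_counts[pair] += 1
    return {pair for pair, n in pair_counts.items() if n >= min_cooccurrence}


# Made-up records: one pair co-occurs 10 times, another only 9 times.
records = [{"E2F1", "RB1"}] * 10 + [{"E2F1", "MYB"}] * 9
edges = literature_network(records)
```

Only the pair that reaches the threshold of 10 records becomes an edge.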

Using one of the three networks, GATHER inferred annotations for a group of genes by adding to the group all genes that were connected in the network to at least one gene from the original group. Then, GATHER performed its analysis based on the expanded group of genes.
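The expansion step amounts to adding the direct neighbors of the original genes. A minimal sketch, assuming the network is given as an iterable of undirected gene pairs (the name `expand_group` is hypothetical):

```python
def expand_group(genes, edges):
    """Add every gene directly connected to at least one original gene.

    Only one hop is taken: neighbors of newly added genes are not followed.
    """
    original = set(genes)
    expanded = set(original)
    for a, b in edges:
        if a in original:
            expanded.add(b)
        if b in original:
            expanded.add(a)
    return expanded


# Toy network: A-B, B-C and D-E. Starting from A, only B is added.
toy_edges = {("A", "B"), ("B", "C"), ("D", "E")}
expanded = expand_group({"A"}, toy_edges)
```

Membership is checked against the original group rather than the growing one, so the expansion does not cascade through the network.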

2.5 Structural-based gene groups with functional significance

To evaluate the accuracy of functional inference, we merged 1277 genes into 407 groups based on structural superfamilies in the 1.69 (July 2005) release of the Structural Classification of Proteins (SCOP) database (Murzin et al., 1995). We linked the proteins to Entrez Gene identifiers through the SWISS-PROT database (Bairoch and Boeckmann, 1991). For genes that were assigned to multiple groups, we kept only the assignment to the group with which the gene shared the highest average sequence identity.

3 RESULTS

3.1 The Rb/E2F pathway

To illustrate the use of and evaluate GATHER, we analyzed the Rb/E2F pathway (Fig. 1A; genes and analysis at Author Webpage). This pathway is central to the control of cell proliferation, providing the signal transduction events that link stimulation of growth with entry to the cell cycle. Deregulation of the Rb/E2F pathway is common to the development of an oncogenic state and in cancer cells, E2Fs are overexpressed (Sherr, 1996; Rhodes et al., 2005). The retinoblastoma (Rb) protein represses the activity of the E2Fs, a family of eight transcription factors that bind a single consensus site. E2Fs are transcriptional regulators necessary for proper cell cycle control and DNA synthesis. Recent experiments have also established a role in a broad range of activities, such as mitosis, DNA repair and apoptosis. Thus, the roles of E2F in normal function and in tumorigenesis are growing, and its functions and regulatory mechanisms are areas of intense investigation (Nevins, 1998; Dimova and Dyson, 2005). We assembled a gene expression signature of 231 genes representing E2F activation from standard biochemical experiments (DeGregori et al., 1995; Ohtani et al., 1996; Leone et al., 1998; Moroni et al., 2001) and gene expression microarrays (Ishida et al., 2001; Muller et al., 2001; Black et al., 2003). The datasets reflected perturbed E2F activity under various cell cycle states (quiescence, synchronized cycles or asynchronous cycles) in cultured cells. Because the assays captured the entire transcriptional response, the compiled signature included genes both directly regulated by E2F (i.e. E2F transactivated or repressed the gene through interactions with its promoter) as well as those regulated indirectly (i.e. E2F activity resulted in altered expression of the gene) in a variety of conditions.

Fig. 1

(A) The Rb/E2F Pathway. (B) We applied GATHER to annotate a signature of the Rb/E2F pathway compiled from various data sources. We show the top ten most significant annotations found in the analysis for each type of data. (C) This matrix shows the common genes between each pair of E2F signatures (Ishida et al., 2001; Muller et al., 2001; Ren et al., 2002; Polager et al., 2002). Each value indicates the number of genes shared between a pair of signatures. The values on the diagonal represent the number of genes in a signature. (D) This shows the significance (Bayes factor) of selected annotations for each signature. Missing values indicate that the evidence from the genes in the signature did not support an association with an annotation.

Given the role of E2Fs as transcription factors and thus expecting that the signature should include direct targets of E2F, we began by analyzing the signature with the TRANSFAC component of GATHER to assess the significance of the presence of potential transcription factor-binding sites within the promoters of genes. As shown in Figure 1B, we found strong evidence that the proximal promoter regions contained E2F binding motifs. The most significant transcription factor annotations were all variants of the E2F binding site with significant Bayes factors. This agreed with previous genome-wide chromatin immunoprecipitation experiments that showed that E2F directly regulated a large number of genes (Ren et al., 2002).

Next, to develop an understanding of the functions reflected in the activation of E2F, we applied each of the functional annotation tools provided by GATHER to the signature. As shown in Figure 1B, the annotations revealed strong associations with DNA replication, cell cycle, cyclins, cyclin-dependent kinase, cell proliferation, DNA metabolism and mitosis. This was seen in the GO analysis and supported by annotations from MEDLINE keywords and MeSH. These annotations coincided with a wealth of studies that have shown a role for E2F activity in the control of the G1/S transition, interaction with and regulation of cyclins and cyclin-dependent kinases, and the induction of S phase (DeGregori et al., 1995; Helin, 1998). Further validation was provided by KEGG pathway annotations that identified cell cycle and purine and pyrimidine metabolism as the most significant occurrences, again, emphasizing cell cycle and DNA replication. Taken together, these annotations portrayed a very clear picture of regulation involving the cell cycle with an emphasis on the control of DNA replication.

Furthermore, we saw evidence relating E2F activity with mitosis, consistent with recent data extending the role of E2F activity beyond S phase (Neufeld et al., 1998). Annotations related to mitosis and M phase were evident with high statistical significance. The GO annotations included mitotic activities, such as nuclear division and cytokinesis, the MeSH annotations pointed to M phase regulatory genes CDC2 and cyclin B, and TRANSFAC detected widespread presence of NFY binding sites, which have previously been found to control many genes regulated at G2/M (Linhart et al., 2005). Further establishing connection to M phase, recent biochemical experiments have shown that E2F transcriptionally activates Myb, a regulator of M phase genes (Zhu et al., 2004). From these results, we predicted that, in addition to direct targets of E2F, the targets of Myb, a downstream effector of E2F, would also appear in the signature. Indeed, we found the binding site of Myb in the promoters of 66 genes in the signature. Of those 66 genes, the GO term for M phase was present with high statistical significance. Thus, GATHER produced annotations from the E2F signature that reflected an established mechanism connecting E2F to the control of mitosis, namely that E2F linked G1/S to mitosis through regulation of Myb.

To investigate further potential functional relationships of E2F, we activated the functional inference component of GATHER by selecting the Infer from Network option on the website. This increased the extent of annotations based on an analysis of a network of genes in the literature to reveal a wider scope of functions not immediately evident in the signature. This component revealed significant associations with more pathways in KEGG, including the TGF-beta and MAPK signaling pathways and apoptosis. Overlap in genes between the E2F and TGF-beta pathway was previously observed in data from microarray assays (Muller et al., 2001; Young et al., 2003). Similarly, we also saw annotations for MAPK signaling, which in KEGG included the AKT pathway. This predicted association was consistent with the experimental observation that E2F induced expression of Gab2, an activator of the AKT pathway (Chaussepied and Ginsberg, 2004). Moreover, activation of the AKT pathway suppressed E2F1-induced apoptosis during the cell cycle (Hallstrom and Nevins, 2003). Apoptosis was also linked to the E2F pathway in GATHER via evidence in MEDLINE words, MeSH terms, GO terms and the apoptosis KEGG pathway. In addition, the Literature Network revealed that apoptosis related genes MDM2 and NOL3 appeared significantly associated with the signature. Finally, GATHER confirmed that E2F1 could bind TOPBP1, whose interaction mediated repression of E2F1-induced apoptosis (Liu et al., 2004). In this example, the functional inference algorithm clearly was able to deepen the analysis and reveal associations with pathways that were well supported by experiments in the published literature.

Next, we used GATHER to explore the molecular mechanism by which the various individual E2F proteins could regulate specific distinct groups of target genes, despite the fact that each of the E2F proteins recognizes identical consensus sites. Previous work has pointed to a combinatorial control model whereby different co-activator proteins physically interact with individual E2F proteins to facilitate the binding to the promoters of specific targets (Attwooll et al., 2004). Establishing evidence for this model required the combination of multiple sources of information in GATHER, specifically to establish that a potential transcriptional regulatory partner must both bind to E2F and its binding site must occur in the proximal promoter of a portion of the genes in the E2F signature. Looking for this pattern using the Protein Interaction and TRANSFAC search modules in GATHER (see Supplementary material), we found that YY1 could bind E2F-2 and -3 and had regulatory motifs on 23 gene promoters; and that the E-box binding factor TFE3 could bind E2F3, and E-boxes were found in the promoters of 138 genes. Indeed, these interactions were confirmed in the literature. E2F-2 or -3 formed complexes with RYBP/YY1 to activate the CDC6 replication protein (Schlisio et al., 2002) and TFE3 formed a complex with E2F3 (and not the others) to regulate the 68kDa subunit of DNA polymerase delta (Giangrande et al., 2003). Another example of combinatorial control established in the literature, that Sp1 and E2F1, -2 or -3 bound cooperatively to the TK promoter, was not evident in GATHER because TRANSFAC lacked high quality matrices to detect SP1 binding sites (Karlseder et al., 1996).
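The two-part criterion described above (a factor must bind an E2F protein, and its motif must occur in promoters of signature genes) can be expressed as an intersection of two data sources. The sketch below is an illustration under stated assumptions: the function name, the input shapes and the placeholder identifiers are all hypothetical, and it assumes interaction partners and motif names share one naming scheme.

```python
def cofactor_candidates(e2f_proteins, interactions, motif_promoters, signature):
    """Find factors that bind an E2F protein and have motifs in signature promoters.

    interactions: iterable of (protein_a, protein_b) binding pairs.
    motif_promoters: dict mapping a factor name to the genes whose proximal
        promoter contains that factor's binding motif.
    Returns {factor: (bound E2F proteins, signature genes with the motif)}.
    """
    e2fs = set(e2f_proteins)
    signature = set(signature)
    partners = {}
    for a, b in interactions:
        if a in e2fs and b not in e2fs:
            partners.setdefault(b, set()).add(a)
        elif b in e2fs and a not in e2fs:
            partners.setdefault(a, set()).add(b)
    result = {}
    for factor, bound in partners.items():
        targets = set(motif_promoters.get(factor, ())) & signature
        if targets:  # keep only factors meeting both conditions
            result[factor] = (bound, targets)
    return result


# Placeholder identifiers, not real data.
candidates = cofactor_candidates(
    {"E2F_a", "E2F_b"},
    [("E2F_a", "TF_x"), ("TF_y", "E2F_b"), ("TF_z", "TF_x")],
    {"TF_x": {"g1", "g2", "g9"}, "TF_z": {"g1"}},
    {"g1", "g2", "g3"},
)
```

Here `TF_y` binds an E2F but has no motif data, and `TF_z` has a motif but no E2F interaction, so only `TF_x` satisfies both conditions.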

Finally, to examine the robustness with which GATHER could identify relevant biological features underlying the signatures, we analyzed four independent E2F expression signatures and compared the annotations, focusing on GO, KEGG and TRANSFAC. The signatures were collected under different experimental conditions in human and mouse cells, and the stringency used to distinguish genes in the signatures varied. Thus, the number of genes shared between signatures was low; the most similar signatures, Ishida and Polager, shared only 10 genes out of 77 (Fig. 1C). Nevertheless, cell cycle, DNA replication and related activities were recurring themes across signatures, as was evidence for transcriptional regulation by E2F and NFY (Fig. 1D). Other annotations were also evident, potentially reflecting the distinctions in how the signatures were generated. Interestingly, the Muller signature exhibited only weak evidence for cell cycle regulation, despite strong evidence for E2F regulation. However, this signature contained an exceptionally large number of genes, and that breadth may have diluted the coherence of the principal functions in the signature. In any case, this analysis showed that a comprehensive approach to analyzing genomic signatures could identify distinguishing functions, regardless of the presence of particular single genes of interest.
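The pairwise overlap matrix of Figure 1C is a simple set computation. A minimal sketch with made-up gene labels (the function name and the toy signatures are hypothetical, not the published gene lists):

```python
def overlap_matrix(signatures):
    """Count shared genes for every pair of signatures.

    `signatures` maps a signature name to a collection of gene IDs.
    Diagonal entries equal the size of each signature.
    """
    sets = {name: set(genes) for name, genes in signatures.items()}
    return {a: {b: len(sets[a] & sets[b]) for b in sets} for a in sets}


# Toy signatures with placeholder gene labels.
toy = {"sig_1": {"g1", "g2", "g3"}, "sig_2": {"g3", "g4"}, "sig_3": {"g5"}}
matrix = overlap_matrix(toy)
```

The matrix is symmetric, so only one triangle needs to be displayed.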

3.2 Biological processes in breast cancer survival

To further explore the value of GATHER in elucidating biological function, we analyzed gene expression signatures that have been developed to predict breast cancer survival. Initial work identified a profile of 70 genes that could predict the survival of a breast cancer patient (van't Veer et al., 2002). Subsequent work has shown that multiple 70 gene signatures can be identified from the same dataset that equally well predict survival (Ein-Dor et al., 2005). Interestingly, there was only 17% overlap of genes between signatures, although each of them was highly correlated with survival.

To examine the functional composition of the signatures, we analyzed the 10 signatures from Ein-Dor as well as the original signature reported in van't Veer. We annotated each gene list for significant associations with GO processes, KEGG pathways and TRANSFAC binding sites. The most significant annotations are shown in Table 1.

Table 1

Biological processes in survival. This table shows the 30 most significant annotations (two redundant ones omitted) associated with breast cancer survival signatures. Each row contains a type of annotation, and each column contains a signature from (van't Veer et al., 2002; Ein-Dor et al., 2005). The table contains the Bayes factors quantifying the significance of the association between an annotation and a signature. Empty values indicate no evidence of association

Source	Annotation	van't Veer	Sig 1	Sig 2	Sig 3	Sig 4	Sig 5
Cell cycle
KEGG pathway	Cell cycle	0.5	1.0	4.0	9.0	1.2
Gene ontology	Cell cycle	4.4	1.6	4.8	2.9
Gene ontology	Regulation of cell cycle	0.7	4.5
E2F
TRANSFAC	E2F (E2F1_Q4)	3.0	2.1	2.6	5.2	3.4	1.3
TRANSFAC	E2F (E2F_Q3_01)	3.0	1.0	0.5	2.1	4.3	3.2
TRANSFAC	E2F (E2F1DP1RB_01)	1.5	4.9	3.8
TRANSFAC	E2F (E2F_Q3_01)	0.0	2.0	4.3	1.8	0.3
Mitosis
Gene ontology	Mitosis	11.9	1.4	5.1
Gene ontology	Mitotic cell cycle	11.9	1.0	4.5	5.4
Gene ontology	M phase	9.9	2.7	3.7
Gene ontology	M phase of mitotic cell cycle	11.8	1.3	5.0
Gene ontology	Nuclear division	10.2	2.9	3.9
Gene ontology	Regulation of mitosis	4.9
Metabolism
Gene ontology	Amine metabolism	6.2	3.4	1.3
Gene ontology	Amine catabolism	1.2	0.6	4.4
Gene ontology	Amino acid and derivative metabolism	0.1	4.9	2.0	1.9
Gene ontology	Amino acid metabolism	0.4	5.8	2.5	2.6
Gene ontology	Amino acid catabolism	1.3	0.7	4.5
Gene ontology	Organic acid metabolism	2.0	4.4	2.2	3.2
Gene ontology	Carboxylic acid metabolism	2.0	4.5	2.2	3.2
Gene ontology	Regulation of lipid metabolism	0.2	4.9
Gene ontology	L-serine biosynthesis	4.5	0.4
Gene ontology	Glutamine metabolism	2.7	2.7	6.0
Gene ontology	Glutamine family amino acid metabolism	4.5	1.3	1.9	3.8
KEGG pathway	Glutamate metabolism	7.0	3.0
Other
Gene ontology	Regulation of cellular physiological process	4.4	0.2	0.7
Gene ontology	Cell migration	4.3
TRANSFAC	Pax-3	4.9

Although there was little overlap in the individual genes comprising the eleven 70-gene signatures, it was evident that a common function was represented in these signatures. In particular, with the exception of signature 8, each 70-gene profile was annotated with cell cycle, E2F or mitosis. Given the critical role of E2F proteins in the control of genes encoding DNA replication and mitotic activities, these three annotations can be viewed as representing the same biological process. Additionally, the term metabolism scored high with signatures 1, 5 and 9, and glutamine metabolism specifically with signature 4. These observations suggest that, despite the lack of common genes across the signatures, there is in fact common biology, represented in the form of cell cycle gene control, which characterizes breast cancer survival.

3.3 Accuracy of function inference

To quantify the accuracy with which GATHER could discover novel annotations, we created a dataset of 407 functionally homologous groups of human genes. Although GATHER contained many types of annotations, we chose to evaluate annotations from GO because this data source was actively curated and was of considerable general interest. In addition, its curators documented the evidence (computational or experimental) substantiating each annotation, allowing distinct analyses of different types of annotations.

To annotate the gold standard, we applied GATHER to each of the 407 gene groups and retained the GO annotations associated with each group with BF ≥ 0. Then, we repeated the analysis with two modifications: (1) we removed the annotations associated with each gene in a gene group, and (2) we applied GATHER to the unannotated gene groups using only the annotations inferred from homologs, a protein interaction network or a literature network. Comparing the results of the second analysis against the first revealed the ability of GATHER to discover annotations not previously related to the genes.

First, we quantified the ability of the algorithm to recover the correct annotations with different parameters. For each analysis, we ranked the resulting annotations by decreasing significance, either Bayes factor or _P_-value, and calculated the precision and recall at each rank in the list (Fig. 2A). At a specific rank, the recall quantified the portion of the gold standard recovered and the precision quantified the errors found in the list. The calculation of recall and precision is described in the Supplementary materials. When including annotations from both homologs and the literature network, GATHER recovered 90% of the original functions. Of the original functions, 83% had a BF of at least 0, and 54% were highly significant with a BF of at least 6. The literature network recovered more annotations than the protein networks (72 and 59%, respectively) and at higher levels of significance. Many of the missed annotations were not highly prevalent in the original gene sets. When we filtered out the annotations associated with only a single gene in a set, GATHER recovered 99% of the original annotations, 78% of them with a BF of at least 6. GATHER was better able to recover annotations with more evidence.
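Precision and recall at each rank can be sketched as below. The exact calculation used in the paper is given in its Supplementary materials, which are not reproduced here, so this sketch assumes the standard definitions: precision is the fraction of the top-k predictions that are in the gold standard, and recall is the fraction of the gold standard found in the top k. The function name is hypothetical.

```python
def precision_recall_at_ranks(ranked_annotations, gold_standard):
    """Return a list of (precision, recall) pairs, one per rank.

    `ranked_annotations` is ordered from most to least significant.
    """
    gold = set(gold_standard)
    if not gold:
        raise ValueError("gold standard must not be empty")
    hits = 0
    curve = []
    for k, annotation in enumerate(ranked_annotations, start=1):
        if annotation in gold:
            hits += 1
        curve.append((hits / k, hits / len(gold)))
    return curve


# Toy ranking: two of three gold annotations are recovered.
curve = precision_recall_at_ranks(["a", "x", "b"], {"a", "b", "c"})
```

Reading the curve at the rank where the significance drops below a chosen cutoff gives the precision and recall for that cutoff.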

Fig. 2. Accuracy of functional predictions. (A) We assembled 407 groups of functionally related genes with their GO annotations removed. GATHER then annotated each group, predicting the missing annotations (All Annotations) using information from homologs, a network of protein binding, a network extracted from the literature, and the combination of homologs and literature network. The table indicates the portion of annotations correctly recovered. For Frequent Annotations, we retained only the annotations that were associated with more than one gene in each gene group. (B) We compared the precision and recall for annotations predicted using different methods. Here, Network refers to the literature network. As a baseline, we plotted a curve showing the significance of annotations when calculated as _P_-values from the hypergeometric distribution.

Next, we examined the rate at which GATHER produced incorrect annotations (Fig. 2B). At a BF threshold of six, at which GATHER recalled 54% of all known annotations, the precision (the portion of the predicted annotations present in the gold standard) was 32%. Comparing the precision at the same level of recall showed that inferring annotations from homologs alone was most precise. However, this strategy recovered 23% fewer correct annotations than when combined with the literature network. We also compared the Bayes factor statistic against the _P_-value calculated by the popular Fisher's exact test/hypergeometric distribution and found that ranking annotations by Bayes factor achieved higher precision at all levels of recall.
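For readers unfamiliar with the baseline, the hypergeometric _P_-value for a single annotation can be sketched as follows. This illustrates only the baseline statistic, not the Bayes factor used by GATHER, which is not defined in this section. The function name, parameter names and example counts are assumptions made for illustration.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: number of genes in the background.
    K: number of background genes carrying the annotation.
    n: number of genes in the gene group.
    k: number of genes in the group carrying the annotation.
    """
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible draws contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 2 of 4 group genes are annotated,
# against 5 annotated genes in a background of 20.
p_value = hypergeom_pvalue(k=2, n=4, K=5, N=20)
print(round(p_value, 4))
```

Ranking annotations by increasing _P_-value then gives the baseline ordering against which a Bayes factor ranking can be compared.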

Examining the estimate of the precision more closely, 68% of the annotations identified with a BF cutoff of six did not appear to be associated with the gene group. There were two possible explanations: either the genes did not perform the function (the annotation was incorrect), or they did perform it (the annotation was correct) but this had not yet been discovered or annotated. To distinguish between these two cases, we collected seven successive versions of the GO annotation database, from the earliest available release in August 2002 to the latest in October 2005. We repeated the annotation procedure described above on the earliest database and checked later databases to determine whether the annotations of the gene groups would have appeared in future analyses (Fig. 3A). In the first version, there were 3975 unverified annotations, of which 269 were verified within three years. The rate of verification was steady, strongly suggesting that more functions would be discovered. We examined the method by which functions were discovered by dividing the annotations into Experimental and Computational groups based on the type of evidence supporting the annotation. The annotations predicted by GATHER that appeared in later releases of the GO databases were verified with computational methods at a similar rate as with experimental ones (Fig. 3B). Ten percent of the computationally verified annotations were later reconfirmed experimentally.
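The bookkeeping behind this release-by-release check can be sketched as below. This is an illustrative reconstruction under simple assumptions, not the authors' code: each annotation is represented as a (gene group, term) pair, and a prediction counts as verified once it appears in any later release. All names and example values are hypothetical.

```python
def cumulative_verified(unverified_predictions, later_releases):
    """Count how many initially unverified predictions appear in later releases.

    unverified_predictions: (gene_group, term) pairs predicted on the earliest
        release but absent from its gold standard.
    later_releases: (label, annotations) tuples in chronological order, where
        annotations is a collection of (gene_group, term) pairs.
    Returns a list of (label, cumulative number of verified predictions).
    """
    pending = set(unverified_predictions)
    verified = set()
    timeline = []
    for label, annotations in later_releases:
        newly_verified = pending & set(annotations)
        verified |= newly_verified
        pending -= newly_verified
        timeline.append((label, len(verified)))
    return timeline


# Hypothetical predictions and releases.
predictions = {("group_1", "term_a"), ("group_1", "term_b"), ("group_2", "term_c")}
releases = [
    ("release_2", {("group_1", "term_a")}),
    ("release_3", {("group_1", "term_a"), ("group_2", "term_c")}),
]
timeline = cumulative_verified(predictions, releases)
print(timeline)
```

For scale, the counts reported in the text (269 of 3975 unverified annotations) correspond to about 6.8% verified over the period examined.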

Fig. 3. (A) We predicted the functions for gene groups using the annotations available in the GO annotation database in August 2002. We then analyzed the gene groups using subsequent releases of the database. Annotations that appeared in analyses of later releases were classified as either Experimental or Computational according to the type of evidence used to justify the annotation in the database. (B) This plot shows the rate at which Experimental and Computational annotations were verified. Day 0 refers to the August 2002 release of the database. On average, one annotation was verified computationally every 7 days and one was verified experimentally every 9 days.

4 DISCUSSION

Biological investigations in the post-genomic era often leverage genome-wide assays to uncover relationships between processes and groups of genes. Extracting the full measure of biological understanding from this scale of data requires the development of a novel class of tool that can interpret comprehensively the biology represented in gene signatures within the context of multi-level biological systems. We have developed GATHER as a tool to address this challenge.

We used GATHER to analyze the Rb/E2F pathway. Its annotations corroborated known functions of E2F, including DNA synthesis, mitosis, nucleotide metabolism and apoptosis. The synthesis of multiple types of evidence was pivotal in establishing strong evidence of phenomena such as the mechanism for combinatorial control of E2F regulation. Furthermore, the functional discovery component of GATHER revealed deeper relationships between Rb/E2F and other processes, such as the TGF-beta pathway and apoptosis. Although these pathways were regulated predominantly through post-translational activities, E2F could regulate them through transcriptional mechanisms (E2F could regulate AKT through Gab2 and apoptosis through TOPBP1), which may have manifested as signals in the expression profile that could be identified using the functional inference component. These results demonstrated the utility of the inference algorithm and its ability to dig deeper into the biology of the signature. The analysis of the Rb/E2F pathway showed that a tool such as GATHER could support and reinforce discovery through the synthesis of information across biological systems.

We then evaluated the accuracy of GATHER across different groups of genes. We found that it could recover nearly all currently known annotations, although the extent to which its predicted annotations might be verified in the future remained undetermined. Furthermore, it was unclear whether novel annotations not currently covered by our evaluation might be more difficult to predict. Nevertheless, it was apparent that annotation databases were slowly accumulating ever-increasing amounts of knowledge about genes.

From the same analysis, the rate of growth of database annotations revealed two striking observations. First, it was intriguing that the number of computational annotations had not increased at a rate significantly greater than that of annotations from experimental methodologies. Second, it was also interesting that few annotations predicted from computational models had been verified with experiments. These findings implied that computational and experimental annotations were complementary, and that both would be necessary for continued development of a comprehensive database of gene functions.

Perhaps the most revealing indication of the value of GATHER in interpreting genome-scale measures of gene expression was the breast cancer analysis. The fact that multiple gene expression signatures can predict outcome, with little evidence of overlap in the genes constituting the signatures, can be worrisome, possibly reflecting the dangers of large-scale analyses such as this and the opportunity for false discovery. In reality, such a result should not be surprising given the exceedingly high complexity of the cancer phenotype. Thus, one view of this result is that it simply reflects the ability of complex expression signatures to match the heterogeneity and complexity of the disease, emphasizing the power of whole genome expression analysis. But, in the absence of an ability to interpret this complexity, one is left with a predictive tool but no insight into the biology. The use of GATHER to find common aspects of biology in these otherwise dissimilar expression profiles highlights both the power of expression profiling to tap into the complexity and the power of GATHER to reveal the common structure in this complexity.

Finally, based on the results presented, we propose the following recommendations for the use of GATHER. First, although it is often sufficient to analyze gene groups without functional inference, inferred annotations reveal a more complete assessment of potential functions of the genes. When inferring annotations, those using only functions from homologs have higher precision than those from other networks. For more complete (but lower quality) annotations, we recommend including annotations from both homologs and the literature network. The accuracy of predicted annotations is related to the Bayes factor, which should be interpreted as the strength of the evidence supporting a function. If an explicit cutoff is desired, we recommend six, as it maximizes the harmonic mean of the recall and precision.
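The cutoff recommendation rests on the harmonic mean of precision and recall, often called the F1 score. The sketch below shows how such a cutoff could be chosen from a table of (cutoff, precision, recall) values. Only the row for a Bayes factor of six uses numbers reported in the text (precision 32%, recall 54%); the other rows are made-up placeholders, and the function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_cutoff(curve):
    """Return the cutoff whose (precision, recall) pair has the highest F1."""
    return max(curve, key=lambda row: f1_score(row[1], row[2]))[0]


# (Bayes factor cutoff, precision, recall)
curve = [
    (0, 0.15, 0.80),   # placeholder values, not from the paper
    (6, 0.32, 0.54),   # values reported in the text
    (12, 0.50, 0.20),  # placeholder values, not from the paper
]
f1_at_six = f1_score(0.32, 0.54)
chosen = best_cutoff(curve)
print(round(f1_at_six, 3), chosen)
```

With the reported values, the harmonic mean at a cutoff of six is about 0.40.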

The authors would like to thank Rob Wagner, Carlos Carvalho and Mark DeLong for their technical help, Andrea Bild, Holly Dressman, Eran Andrechek, Seiichi Mori, Steven Angus and Guang Yao for helpful discussions and critical evaluations of GATHER, Liat Ein-Dor for providing gene signatures and anonymous reviewers for their insightful comments. J.T.C. is supported by postdoctoral fellowship #PF-05-047-01-GMC from the American Cancer Society; J.R.N. is supported by NIH 5-U54-CA112952-03. Finally, the authors thank Kaye Culler for her assistance in the preparation of the manuscript.

Conflict of Interest: none declared.

REFERENCES

et al., FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes, Bioinformatics, 2004 (pg. 578–580)

et al., Microarray data analysis: from disarray to consolidation and consensus, Nat. Rev. Genet., 2006

et al., Basic local alignment search tool, J. Mol. Biol., 1990, vol. 215 (pg. 403–410)

et al., The E2F family: specific functions and overlapping interests, EMBO J., 2004 (pg. 4709–4716)

The SWISS-PROT protein sequence data bank, Nucleic Acids Res., 1991 (pg. 2247–2249)

et al., NCBI GEO: mining millions of expression profiles—database and tools, Nucleic Acids Res., 2005 (pg. D562–D566)

et al., Oncogenic pathway signatures in human cancers as a guide to targeted therapies, Nature, 2006, vol. 439 (pg. 353–357)

et al., Distinct gene expression phenotypes of cells lacking Rb and Rb family members, Cancer Res., 2003 (pg. 3716–3723)

Transcriptional regulation of AKT activation by E2F, Mol. Cell, 2004 (pg. 831–837)

et al., Cellular targets for activation by the E2F1 transcription factor include DNA synthesis- and G1/S regulatory genes, Mol. Cell. Biol., 1995 (pg. 4215–4224)

The E2F transcriptional network: old acquaintances with new faces, Oncogene, 2005 (pg. 2810–2826)

et al., Outcome signature genes in breast cancer: is there a unique set?, Bioinformatics, 2005 (pg. 171–178)

et al., Bayesian Data Analysis, 2003, Boca Raton, FL, USA, CRC Press LLC

et al., Identification of E-box factor TFE3 as a functional partner for the E2F3 transcription factor, Mol. Cell. Biol., 2003 (pg. 3707–3720)

et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 1999, vol. 286 (pg. 531–537)

Specificity in the activation and control of transcription factor E2F-dependent apoptosis, Proc. Natl Acad. Sci. USA, 2003, vol. 100 (pg. 10848–10853)

Regulation of cell proliferation by the E2F transcription factors, Curr. Opin. Genet. Dev., 1998

A gene network for navigating the literature, Nat. Genet., 2004, pg. 664

et al., Role for E2F in control of both DNA replication and mitotic functions as revealed from DNA microarray analysis, Mol. Cell. Biol., 2001 (pg. 4684–4699)

et al., A literature network of human genes for high-throughput analysis of gene expression, Nat. Genet., 2001

et al., Interaction of Sp1 with the growth- and cell cycle-regulated transcription factor E2F, Mol. Cell. Biol., 1996 (pg. 1659–1667)

The meaning of systems biology, Cell, 2005, vol. 121 (pg. 503–504)

et al., E2F3 activity is regulated during the cell cycle and is required for the induction of S phase, Genes Dev., 1998 (pg. 2120–2130)

et al., Deciphering transcriptional regulatory elements that encode specific cell cycle phasing by comparative genomics analysis, Cell Cycle, 2005 (pg. 1788–1797)

et al., Regulation of E2F1 by BRCT domain-containing protein TopBP1, Mol. Cell. Biol., 2003 (pg. 3287–3304)

et al., PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes, Nat. Genet., 2003 (pg. 267–273)

et al., Apaf-1 is a transcriptional target for E2F and p53, Nat. Cell Biol., 2001 (pg. 552–558)

et al., E2Fs regulate the expression of genes involved in differentiation, development, proliferation, and apoptosis, Genes Dev., 2001 (pg. 267–285)

et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures, J. Mol. Biol., 1995, vol. 247 (pg. 536–540)

et al., Coordination of growth and cell division in the Drosophila wing, Cell, 1998 (pg. 1183–1193)

Toward an understanding of the functional complexity of the E2F and retinoblastoma families, Cell Growth Differ., 1998 (pg. 585–593)

et al., Expression of the HsOrc1 gene, a human ORC1 homolog, is regulated by cell proliferation via the E2F transcription factor, Mol. Cell. Biol., 1996 (pg. 6977–6984)

et al., Use of proteomic patterns in serum to identify ovarian cancer, Lancet, 2002, vol. 359 (pg. 572–577)

et al., E2Fs up-regulate expression of genes involved in DNA replication, DNA repair and mitosis, Oncogene, 2002 (pg. 437–446)

et al., E2F integrates cell cycle progression with DNA repair, replication, and G2/M checkpoints, Genes Dev., 2002 (pg. 245–256)

et al., Mining for regulatory programs in the cancer transcriptome, Nat. Genet., 2005 (pg. 579–583)

et al., Towards a proteome-scale map of the human protein–protein interaction network, Nature, 2005, vol. 437 (pg. 1173–1178)

et al., Interaction of YY1 with E2Fs, mediated by RYBP, provides a mechanism for specificity of E2F function, EMBO J., 2002 (pg. 5775–5786)

Cancer cell cycles, Science, 1996, vol. 274 (pg. 1672–1677)

et al., A human protein–protein interaction network: a resource for annotating the proteome, Cell, 2005, vol. 122 (pg. 957–968)

et al., Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles, Proc. Natl Acad. Sci. USA, 2005, vol. 102 (pg. 15545–15550)

et al., Gene expression profiling predicts clinical outcome of breast cancer, Nature, 2002, vol. 415 (pg. 530–536)

et al., Database resources of the National Center for Biotechnology Information, Nucleic Acids Res., 2001

et al., Mechanisms of transcriptional regulation by Rb-E2F segregate by biological pathway, Oncogene, 2003 (pg. 7209–7217)

et al., GoMiner: a resource for biological interpretation of genomic and proteomic data, Genome Biol., 2003, pg. R28

et al., WebGestalt: an integrated system for exploring gene sets in various biological contexts, Nucleic Acids Res., 2005 (pg. W741–W748)

et al., E2Fs link the control of G1/S and G2/M transcription, EMBO J., 2004 (pg. 4615–4626)

Author notes

Associate Editor: Jonathan Wren