GATHER: a systems approach to interpreting genomic signatures

Journal Article

Department of Molecular Genetics and Microbiology, Duke Institute for Genome Sciences and Policy, Duke University Medical Center

Durham, NC 27710, USA

Department of Molecular Genetics and Microbiology, Duke Institute for Genome Sciences and Policy, Duke University Medical Center

Durham, NC 27710, USA

*To whom correspondence should be addressed.

Revision received:

23 August 2006

Accepted:

11 September 2006

Abstract

Motivation: Understanding the full meaning of the biology captured in molecular profiles, within the context of the entire biological system, cannot be achieved with a simple examination of the individual genes in the signature. To facilitate such an understanding, we have developed GATHER, a tool that integrates various forms of available data to elucidate biological context within molecular signatures produced from high-throughput post-genomic assays.

Results: Analyzing the Rb/E2F tumor suppressor pathway, we show that GATHER identifies critical features of the pathway. We further show that GATHER identifies common biology in a series of otherwise unrelated gene expression signatures that each predict breast cancer outcome. We quantify the performance of GATHER and find that it successfully predicts 90% of the functions over a broad range of gene groups. We believe that GATHER provides an essential tool for extracting the full value from molecular signatures generated from genome-scale analyses.

Availability: GATHER is available at Author Webpage

Contact: j.nevins@duke.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Classical genetic experiments establish relationships between single or small groups of genes and observable traits. This approach, when coupled with the use of molecular biology techniques, represents the paradigm for much of 20th century biology: an ability to link gene function with biological phenotype. This paradigm has been extended with the use of various computational tools that aid analyses of gene structure and function. One widely used tool, BLAST, infers the function of a single gene or protein based on the functions of its homologs (Altschul et al., 1990). Nevertheless, these approaches still represent the paradigm of single gene analyses, although now in a high-throughput fashion.

With the development of complete genome sequences, the study of biology has been transformed in two ways. The first, and perhaps most critical, has been the development of technologies to enable high-throughput assays of gene activity. When focused on a biological process of interest, such assays identify groups of molecules that cooperatively effect or signify a process. In fact, many biological processes are evident only when analyzing the general patterns manifested by coordinate gene activity (Mootha et al., 2003). Such signatures describe a phenotype as a snapshot of gene activity in a cell or tissue sample at a given instant of time and have been developed to distinguish classes of leukemia using gene expression (Golub et al., 1999), to predict ovarian cancer using serum proteins (Petricoin et al., 2002) and to recognize deregulation of oncogenes in tumors (Bild et al., 2006). Such assays are becoming commonplace; the GEO repository of gene expression data is growing at an average rate of >20 million data points a month (Barrett et al., 2005). High-throughput genomic signatures are transforming biology from an observational molecular science to a data-intensive quantitative genomic science.

Second, databases of genomic data, such as gene or protein sequences, transcription factor-binding localizations, microRNA regulation, protein interaction networks or pathways, provide opportunities to study a signature in the context of various biological systems. By linking multiple systems together and understanding the signature with respect to transcriptional or post-transcriptional regulation, protein complexes or its relationship to other pathways, there is an opportunity to develop novel insight and a deeper understanding of biological processes (Kirschner, 2005).

Typically, analyses of gene expression profiles focus on the function of single genes within the profile, searching for aspects of function that might be relevant in the context of the profiling experiment. However, such a strategy often focuses on a few genes within the profile that might be logical while ignoring the vast majority of genes that might provide the most relevant context. As such, single gene annotations may well miss important associations with pathways or other biological contexts that are most relevant. Critically important is the recognition that biological processes most likely involve the concerted action of multiple genes. The identification of profiles or signatures through microarray experiments provides an opportunity to reveal these events but requires methods that can evaluate groups of genes in a biologically relevant manner.

To address this changing paradigm for functional discovery and provide a mechanism to extract the function from a comprehensive view of a signature, we have developed GATHER, a Gene Annotation Tool to Help Explain Relationships. It is accessible using a web-based interface that allows entry of a list of genes from a high-throughput genomic experiment and can analyze the signature against a series of data sources.

To illustrate the use of GATHER, we have analyzed gene expression signatures that reflect the activity of the Rb/E2F pathway. In addition to the Rb/E2F analysis, we provide a quantitative evaluation of the accuracy with which GATHER can describe gene signatures. At the core of the GATHER system is a statistical foundation that quantifies the significance of functional associations. Methodologically, it builds upon previous methods that annotate genes with functional descriptors from Gene Ontology (GO) (Zeeberg et al., 2003; Al-Shahrour et al., 2004) or other sources (Zhang et al., 2005). These methods quantified the association of an annotation with a group of genes using the _P_-value from the hypergeometric distribution or chi-square test (Allison et al., 2006). Their results are essentially regarded as definitive and coherence in such functional annotations has been used to verify experimentally derived networks (Rual et al., 2005). The application and goals motivating GATHER, however, are most closely related to Gene Set Enrichment Analysis, in that both algorithms find subtle functional patterns within the context of a molecular signature, although GATHER is not limited to signatures from gene expression (Subramanian et al., 2005).

GATHER extends previous work in three ways paramount for extracting maximal information from molecular signatures. First, GATHER has the capacity to discover novel functions of gene groups by integrating annotations from evolutionary homologs and other genes related through protein interaction or literature networks (Rual et al., 2005; Jenssen et al., 2001; Hoffmann and Valencia, 2004). Second, GATHER annotates the characteristics of the genes with respect to datasets from multiple systems, helping synthesize evidence to develop or reinforce hypotheses. A third advancement is the development of a Bayesian statistical model, which we show increases the accuracy with which GATHER can infer novel functions of signatures.

2 METHODS

2.1 Representations of annotations

We defined an annotation as any discrete or discretized attribute that could be associated with a gene. The distribution of an annotation across two gene groups was represented in a 2 × 2 contingency table $X$, where the rows $X_{1,\cdot}$ and $X_{2,\cdot}$ contained the counts for gene groups 1 and 2, respectively, and the columns $X_{\cdot,1}$ and $X_{\cdot,2}$ indicated the number of genes that either are or are not associated with the annotation.

In such an annotation table, gene group 1 contained the genes specified by the user and gene group 2 contained, conceptually, the remaining genes in the genome. In practice, however, many genes were not annotated (55% of the human genes in Entrez Gene had no GO codes), so we discarded those from the groups.
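The table construction described above can be sketched in Python as follows. This is a minimal illustration, not the GATHER implementation: the function name `contingency_table` and its three arguments are hypothetical, and the inputs are assumed to be plain collections of gene identifiers.

```python
def contingency_table(user_genes, annotated_genes, genes_with_annotation):
    """Build the 2 x 2 table X for one annotation (illustrative sketch).

    Row 1: genes supplied by the user; row 2: the remaining genes.
    Column 1: genes with the annotation; column 2: genes without it.
    Genes that carry no annotation at all are discarded from both groups.
    """
    background = set(annotated_genes)
    group1 = set(user_genes) & background   # drop unannotated user genes
    group2 = background - group1            # rest of the annotated genome
    hits = set(genes_with_annotation)
    return [
        [len(group1 & hits), len(group1 - hits)],
        [len(group2 & hits), len(group2 - hits)],
    ]


# Toy example: gene "Z" has no annotations and is dropped from group 1.
table = contingency_table(
    user_genes={"A", "B", "Z"},
    annotated_genes={"A", "B", "C", "D", "E"},
    genes_with_annotation={"A", "C"},
)
```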

2.2 Statistical significance: Bayes factor

We quantified the evidence supporting the association between a gene group and an annotation using a Bayes factor (Gelman et al., 2003). This assessed the hypothesis that the distribution of annotations varied across gene groups against the hypothesis that the distribution was identical. A positive Bayes factor indicated that the evidence supported the association, while a negative one indicated no association. Its magnitude corresponded to the strength of the evidence for the association, where higher values were stronger. Because the strength of evidence was quantified on a continuous scale and required no significance cutoff, when testing multiple hypotheses, there was no cutoff to adjust. The Bayes factor, the strength of evidence for an association, did not depend on what other associations were also tested. We modeled the distribution of an annotation across groups of genes as a binomial process:

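$$\text{Bayes factor} = \frac{P(H_1 \mid \text{data})\,P(H_2)}{P(H_2 \mid \text{data})\,P(H_1)}$$

$$P(H_1 \mid \text{data}) \propto \mathrm{Bin}(\Theta_1 \mid X_{1,1}, X_{1,2})\,\mathrm{Bin}(\Theta_1 \mid \alpha, \beta) \times \mathrm{Bin}(\Theta_2 \mid X_{2,1}, X_{2,2})\,\mathrm{Bin}(\Theta_2 \mid \alpha, \beta)$$

$$\propto \mathrm{Beta}(X_{1,1}+\alpha,\, X_{1,2}+\beta)\,\mathrm{Beta}(X_{2,1}+\alpha,\, X_{2,2}+\beta)$$

$$P(H_2 \mid \text{data}) \propto \mathrm{Bin}(\Theta_3 \mid X_{1,1}+X_{2,1},\, X_{1,2}+X_{2,2})\,\mathrm{Bin}(\Theta_3 \mid \alpha, \beta)$$

$$\propto \mathrm{Beta}(X_{1,1}+X_{2,1}+\alpha,\, X_{1,2}+X_{2,2}+\beta)\,\mathrm{Beta}(\alpha, \beta)$$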

We assumed that Θ had a uniform prior distribution. We used non-informative priors, setting the hyperparameters α and β to one, and P(H1) and P(H2) to 50%. We imposed non-informative priors and distributions because, in the general case, there was no information about the strength of an association of an unknown annotation with an unknown gene list. However, for contexts where the expected degree and confidence of such an association may be known, they could be quantified as priors in this equation.
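As a rough numerical illustration of the Beta-function expressions for H1 and H2, the ratio can be evaluated in log space. This sketch rests on several assumptions: the function names are hypothetical, the natural logarithm is used (the text implies a log scale, since the Bayes factor can be negative, but does not state the base), and the prior probabilities of H1 and H2 are taken as equal so that they cancel.

```python
from math import lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor(x11, x12, x21, x22, alpha=1.0, beta=1.0):
    """Log Bayes factor for a 2 x 2 annotation table (illustrative sketch).

    H1: the two gene groups have different annotation rates.
    H2: both groups share a single annotation rate.
    Equal prior probabilities for H1 and H2 are assumed, so they cancel.
    """
    log_h1 = (log_beta(x11 + alpha, x12 + beta)
              + log_beta(x21 + alpha, x22 + beta))
    log_h2 = (log_beta(x11 + x21 + alpha, x12 + x22 + beta)
              + log_beta(alpha, beta))
    return log_h1 - log_h2


# Annotation found in 20 of 30 user genes versus 100 of 1000 other genes.
enriched = log_bayes_factor(20, 10, 100, 900)
# Same annotation rate (10%) in both groups.
unrelated = log_bayes_factor(10, 90, 100, 900)
```

Under these assumptions, a table with a clear difference in annotation rates gives a positive value, while a table with identical rates gives a negative one, because the simpler hypothesis H2 explains the data equally well.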

2.3 Sources of annotations

GATHER supports multiple types of annotations representing different biological systems. For each type of annotation, we obtained or created an index that mapped an annotation to an Entrez Gene ID. From the index, we could associate each annotation to a gene group using the contingency table representation described previously and calculate its significance. We included annotations from GO, MEDLINE abstracts, MeSH terms, a gene network derived from the literature, KEGG pathways, a protein interaction network, microRNA regulation and transcription factor-binding sites. A detailed description of the methods to obtain these annotations is provided in the Supplementary materials.

2.4 Inferring annotations

We created three networks where the nodes represented genes (or proteins) and the edges connected genes that were related according to different criteria. In the first network, the protein interaction network, we connected the genes that were shown to bind in at least one of two large-scale screens of protein binding in humans (Rual et al., 2005; Stelzl et al., 2005). In the second network, the literature network, we connected the genes that co-occurred in at least 10 MEDLINE records. The final network, the homology network, connected genes from different organisms if they were documented as orthologs in the Homologene database (Wheeler et al., 2001).
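The co-occurrence rule for the literature network can be illustrated with a short Python sketch. It assumes that gene mentions have already been extracted from each MEDLINE record, a step the text defers to the Supplementary materials. The function name `literature_network` and the `min_records` parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations


def literature_network(records, min_records=10):
    """Return gene pairs that co-occur in at least `min_records` records.

    `records` is an iterable of collections of gene identifiers, one per
    record. Each edge is returned as an alphabetically ordered tuple.
    """
    counts = Counter()
    for genes in records:
        # Count each unordered pair once per record.
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_records}


# Toy input: one pair co-occurs in 10 records, another in only 9.
toy_records = [{"E2F1", "RB1"}] * 10 + [{"E2F1", "MYB"}] * 9
edges = literature_network(toy_records)
```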

Using one of the three networks, GATHER inferred annotations for a group of genes by adding to the group all genes that were connected to at least one gene from the original group in the network. Then, GATHER performed its analysis based on the expanded group of genes.
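The expansion step amounts to adding the direct neighbors of the original genes. A minimal sketch, assuming the network is available as a collection of undirected gene pairs (the function name `expand_group` is hypothetical):

```python
def expand_group(genes, edges):
    """Add every gene directly connected to at least one gene in `genes`.

    `edges` is an iterable of (gene_a, gene_b) pairs, treated as undirected.
    Only one step is taken: neighbors of neighbors are not added.
    """
    group = set(genes)
    expanded = set(group)
    for a, b in edges:
        if a in group:
            expanded.add(b)
        if b in group:
            expanded.add(a)
    return expanded


# Toy network: MDM2 is two steps from E2F1, so it is not added.
toy_edges = {("E2F1", "RB1"), ("RB1", "MDM2")}
expanded = expand_group({"E2F1"}, toy_edges)
```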

2.5 Structure-based gene groups with functional significance

To evaluate the accuracy of functional inference, we merged 1277 genes into 407 groups based on structural superfamilies in the 1.69 (July 2005) release of the Structural Classification of Proteins (SCOP) database (Murzin et al., 1995). We linked the proteins to Entrez Gene identifiers through the SWISS-PROT database (Bairoch and Boeckmann, 1991). For genes that were assigned to multiple groups, we kept only the assignment to the group with which it shared the highest average sequence identity.

3 RESULTS

3.1 The Rb/E2F pathway

To illustrate the use of and evaluate GATHER, we analyzed the Rb/E2F pathway (Fig. 1A; genes and analysis at Author Webpage). This pathway is central to the control of cell proliferation, providing the signal transduction events that link stimulation of growth with entry to the cell cycle. Deregulation of the Rb/E2F pathway is common to the development of an oncogenic state and in cancer cells, E2Fs are overexpressed (Sherr, 1996; Rhodes et al., 2005). The retinoblastoma (Rb) protein represses the activity of the E2Fs, a family of eight transcription factors that bind a single consensus site. E2Fs are transcriptional regulators necessary for proper cell cycle control and DNA synthesis. Recent experiments have also established a role in a broad range of activities, such as mitosis, DNA repair and apoptosis. Thus, the roles of E2F in normal function and in tumorigenesis are growing, and its functions and regulatory mechanisms are areas of intense investigation (Nevins, 1998; Dimova and Dyson, 2005). We assembled a gene expression signature of 231 genes representing E2F activation from standard biochemical experiments (DeGregori et al., 1995; Ohtani et al., 1996; Leone et al., 1998; Moroni et al., 2001) and gene expression microarrays (Ishida et al., 2001; Muller et al., 2001; Black et al., 2003). The datasets reflected perturbed E2F activity under various cell cycle states (quiescence, synchronized cycles or asynchronous cycles) in cultured cells. Because the assays captured the entire transcriptional response, the compiled signature included genes both directly regulated by E2F (i.e. E2F transactivated or repressed the gene through interactions with its promoter) as well as those regulated indirectly (i.e. E2F activity resulted in altered expression of the gene) in a variety of conditions.

Fig. 1

(A) The Rb/E2F Pathway. (B) We applied GATHER to annotate a signature of the Rb/E2F pathway compiled from various data sources. We show the top ten most significant annotations found in the analysis for each type of data. (C) This matrix shows the common genes between each pair of E2F signatures (Ishida et al., 2001; Muller et al., 2001; Ren et al., 2002; Polager et al., 2002). Each value indicates the number of genes shared between a pair of signatures. The values on the diagonal represent the number of genes in a signature. (D) This shows the significance (Bayes factor) of selected annotations for each signature. Missing values indicate that the evidence from the genes in the signature did not support an association with an annotation.

Given the role of E2Fs as transcription factors and thus expecting that the signature should include direct targets of E2F, we began by analyzing the signature with the TRANSFAC component of GATHER to assess the significance of the presence of potential transcription factor-binding sites within the promoters of genes. As shown in Figure 1B, we found strong evidence that the proximal promoter regions contained E2F binding motifs. The most significant transcription factor annotations were all variants of the E2F binding site with significant Bayes factors. This agreed with previous genome-wide chromatin immunoprecipitation experiments that showed that E2F directly regulated a large number of genes (Ren et al., 2002).

Next, to develop an understanding of the functions reflected in the activation of E2F, we applied each of the functional annotation tools provided by GATHER to the signature. As shown in Figure 1B, the annotations revealed strong associations with DNA replication, cell cycle, cyclins, cyclin-dependent kinase, cell proliferation, DNA metabolism and mitosis. This was seen in the GO analysis and supported by annotations from MEDLINE keywords and MeSH. These annotations coincided with a wealth of studies that have shown a role for E2F activity in the control of the G1/S transition, interaction with and regulation of cyclins and cyclin-dependent kinases, and the induction of S phase (DeGregori et al., 1995; Helin, 1998). Further validation was provided by KEGG pathway annotations that identified cell cycle and purine and pyrimidine metabolism as the most significant occurrences, again, emphasizing cell cycle and DNA replication. Taken together, these annotations portrayed a very clear picture of regulation involving the cell cycle with an emphasis on the control of DNA replication.

Furthermore, we saw evidence relating E2F activity with mitosis, consistent with recent data extending the role of E2F activity beyond S phase (Neufeld et al., 1998). Annotations related to mitosis and M phase were evident with high statistical significance. The GO annotations included mitotic activities, such as nuclear division and cytokinesis, the MeSH annotations pointed to M phase regulatory genes CDC2 and cyclin B, and TRANSFAC detected widespread presence of NFY binding sites, which have previously been found to control many genes regulated at G2/M (Linhart et al., 2005). Further establishing connection to M phase, recent biochemical experiments have shown that E2F transcriptionally activates Myb, a regulator of M phase genes (Zhu et al., 2004). From these results, we predicted that, in addition to direct targets of E2F, the targets of Myb, a downstream effector of E2F, would also appear in the signature. Indeed, we found the binding site of Myb in the promoters of 66 genes in the signature. Of those 66 genes, the GO term for M phase was present with high statistical significance. Thus, GATHER produced annotations from the E2F signature that reflected an established mechanism connecting E2F to the control of mitosis, namely that E2F linked G1/S to mitosis through regulation of Myb.

To investigate further potential functional relationships of E2F, we activated the functional inference component of GATHER by selecting the Infer from Network option on the website. This increased the extent of annotations based on an analysis of a network of genes in the literature to reveal a wider scope of functions not immediately evident in the signature. This component revealed significant associations with more pathways in KEGG, including the TGF-beta and MAPK signaling pathways and apoptosis. Overlap in genes between the E2F and TGF-beta pathway was previously observed in data from microarray assays (Muller et al., 2001; Young et al., 2003). Similarly, we also saw annotations for MAPK signaling, which in KEGG included the AKT pathway. This predicted association was consistent with the experimental observation that E2F induced expression of Gab2, an activator of the AKT pathway (Chaussepied and Ginsberg, 2004). Moreover, activation of the AKT pathway suppressed E2F1-induced apoptosis during the cell cycle (Hallstrom and Nevins, 2003). Apoptosis was also linked to the E2F pathway in GATHER via evidence in MEDLINE words, MeSH terms, GO terms and the apoptosis KEGG pathway. In addition, the Literature Network revealed that apoptosis related genes MDM2 and NOL3 appeared significantly associated with the signature. Finally, GATHER confirmed that E2F1 could bind TOPBP1, whose interaction mediated repression of E2F1-induced apoptosis (Liu et al., 2004). In this example, the functional inference algorithm clearly was able to deepen the analysis and reveal associations with pathways that were well supported by experiments in the published literature.

Next, we used GATHER to explore the molecular mechanism by which the various individual E2F proteins could regulate specific distinct groups of target genes, despite the fact that each of the E2F proteins recognizes identical consensus sites. Previous work has pointed to a combinatorial control model whereby different co-activator proteins physically interact with individual E2F proteins to facilitate the binding to the promoters of specific targets (Attwooll et al., 2004). Establishing evidence for this model required the combination of multiple sources of information in GATHER, specifically to establish that a potential transcriptional regulatory partner both binds to E2F and has its binding site in the proximal promoter of a portion of the genes in the E2F signature. Looking for this pattern using the Protein Interaction and TRANSFAC search modules in GATHER (see Supplementary material), we found that YY1 could bind E2F-2 and -3 and had regulatory motifs on 23 gene promoters; and that the E-box binding factor TFE3 could bind E2F3, and E-boxes were found in the promoters of 138 genes. Indeed, these interactions were confirmed in the literature. E2F-2 or -3 formed complexes with RYBP/YY1 to activate the CDC6 replication protein (Schlisio et al., 2002) and TFE3 formed a complex with E2F3 (and not the others) to regulate the 68 kDa subunit of DNA polymerase delta (Giangrande et al., 2003). Another example of combinatorial control established in the literature, that Sp1 and E2F1, -2 or -3 bound cooperatively to the TK promoter, was not evident in GATHER because TRANSFAC lacked high quality matrices to detect SP1 binding sites (Karlseder et al., 1996).
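The two-source criterion used in this analysis (a factor must interact with an E2F protein and have its motif in the promoters of signature genes) can be sketched as a simple intersection. This is an illustration only: the function name, the data structures and the placeholder target genes are hypothetical, and the sketch checks only for the presence of motifs, not for their statistical significance.

```python
def candidate_coregulators(e2f_proteins, interactions, motif_hits, signature):
    """Find factors that bind an E2F protein and have motifs in the signature.

    interactions: iterable of (protein_a, protein_b) pairs (undirected).
    motif_hits:   dict mapping a factor to the genes whose promoters
                  contain its binding motif.
    Returns {factor: (E2F partners, signature genes carrying the motif)}.
    """
    e2fs = set(e2f_proteins)
    sig = set(signature)
    partners = {}
    for a, b in interactions:
        if a in e2fs and b not in e2fs:
            partners.setdefault(b, set()).add(a)
        if b in e2fs and a not in e2fs:
            partners.setdefault(a, set()).add(b)
    result = {}
    for factor, bound in partners.items():
        targets = set(motif_hits.get(factor, ())) & sig
        if targets:  # keep factors with at least one motif in the signature
            result[factor] = (bound, targets)
    return result


# Toy data: SP1 interacts with E2F1 but has no motif hits, so it is dropped.
found = candidate_coregulators(
    e2f_proteins={"E2F1", "E2F2", "E2F3"},
    interactions={("YY1", "E2F2"), ("YY1", "E2F3"),
                  ("TFE3", "E2F3"), ("SP1", "E2F1")},
    motif_hits={"YY1": {"geneA", "geneX"}, "TFE3": {"geneB"}},
    signature={"geneA", "geneB"},
)
```

In this toy run, a factor without usable motif data is silently dropped even though it interacts with an E2F protein, which mirrors the limitation noted for Sp1.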

Finally, to examine the robustness with which GATHER could identify relevant biological features underlying the signatures, we analyzed four independent E2F expression signatures and compared the annotations, focusing on GO, KEGG and TRANSFAC. The signatures were collected under different experimental conditions in human and mouse cells, and the stringency used to distinguish genes in the signatures varied. Thus, the number of genes shared between signatures was low; the most similar signatures, Ishida and Polager, shared only 10 genes out of 77 (Fig. 1C). Nevertheless, cell cycle, DNA replication and related activities were recurring themes across signatures, as well as evidence for transcriptional regulation by E2F and NFY (Fig. 1D). Other annotations also were evident, potentially reflecting the distinctions in how the signatures were generated. Interestingly, the Muller signature exhibited only weak evidence for cell cycle regulation, despite strong evidence for E2F regulation. However, this signature contained an exceptionally large number of genes and that breadth may have diluted the coherence of the principal functions in the signature. In any case, this analysis showed that a comprehensive approach to analyzing genomic signatures could identify distinguishing functions, regardless of the presence of particular single genes of interest.
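A pairwise overlap matrix of the kind shown in Fig. 1C can be computed directly from the gene lists. The sketch below uses made-up signature names and genes; the function name `overlap_matrix` is hypothetical.

```python
def overlap_matrix(signatures):
    """Count shared genes for every pair of signatures.

    `signatures` maps a signature name to a collection of gene identifiers.
    The diagonal holds the size of each signature.
    """
    names = list(signatures)
    sets = {name: set(genes) for name, genes in signatures.items()}
    matrix = [[len(sets[a] & sets[b]) for b in names] for a in names]
    return names, matrix


# Toy signatures (placeholder gene names, not the published gene lists).
names, matrix = overlap_matrix({
    "sig_a": ["g1", "g2", "g3"],
    "sig_b": ["g2", "g3", "g4", "g5"],
    "sig_c": ["g6"],
})
```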

3.2 Biological processes in breast cancer survival

To further explore the value of GATHER in elucidating biological function, we analyzed gene expression signatures that have been developed to predict breast cancer survival. Initial work identified a profile of 70 genes that could predict the survival of a breast cancer patient (van't Veer et al., 2002). Subsequent work has shown that multiple 70-gene signatures can be identified from the same dataset that predict survival equally well (Ein-Dor et al., 2005). Interestingly, there was only 17% overlap of genes between signatures, although each of them was highly correlated with survival.

To examine the functional composition of the signatures, we analyzed the 10 signatures from Ein-Dor as well as the original signature reported in van't Veer. We annotated each gene list for significant associations with GO processes, KEGG pathways and TRANSFAC binding sites. The most significant annotations are shown in Table 1.

Table 1

Biological processes in survival. This table shows the 30 most significant annotations (two redundant ones omitted) associated with breast cancer survival signatures. Each row contains a type of annotation, and each column contains a signature from van't Veer et al. (2002) or Ein-Dor et al. (2005). The table contains the Bayes factors quantifying the significance of the association between an annotation and a signature. Empty values indicate no evidence of association.

Source Annotation van't Veer Sig 1 Sig 2 Sig 3 Sig 4 Sig 5 Sig 6 Sig 7 Sig 8 Sig 9 Sig 10
Cell cycle
KEGG pathway Cell cycle 0.5 1.0 4.0 9.0 1.2
Gene ontology Cell cycle 4.4 1.6 4.8 2.9
Gene ontology Regulation of cell cycle 0.7 4.5
E2F
TRANSFAC E2F (E2F1_Q4) 3.0 2.1 2.6 5.2 3.4 1.3
TRANSFAC E2F (E2F_Q3_01) 3.0 1.0 0.5 2.1 4.3 3.2
TRANSFAC E2F (E2F1DP1RB_01) 1.5 4.9 3.8
TRANSFAC E2F (E2F_Q3_01) 0.0 2.0 4.3 1.8 0.3
Mitosis
Gene ontology Mitosis 11.9 1.4 5.1
Gene ontology Mitotic cell cycle 11.9 1.0 4.5 5.4
Gene ontology M phase 9.9 2.7 3.7
Gene ontology M phase of mitotic cell cycle 11.8 1.3 5.0
Gene ontology Nuclear division 10.2 2.9 3.9
Gene ontology Regulation of mitosis 4.9
Metabolism
Gene ontology Amine metabolism 6.2 3.4 1.3
Gene ontology Amine catabolism 1.2 0.6 4.4
Gene ontology Amino acid and derivative metabolism 0.1 4.9 2.0 1.9
Gene ontology Amino acid metabolism 0.4 5.8 2.5 2.6
Gene ontology Amino acid catabolism 1.3 0.7 4.5
Gene ontology Organic acid metabolism 2.0 4.4 2.2 3.2
Gene ontology Carboxylic acid metabolism 2.0 4.5 2.2 3.2
Gene ontology Regulation of lipid metabolism 0.2 4.9
Gene ontology L-serine biosynthesis 4.5 0.4
Gene ontology Glutamine metabolism 2.7 2.7 6.0
Gene ontology Glutamine family amino acid metabolism 4.5 1.3 1.9 3.8
KEGG pathway Glutamate metabolism 7.0 3.0
Other
Gene ontology Regulation of cellular physiological process 4.4 0.2 0.7
Gene ontology Cell migration 4.3
TRANSFAC Pax-3 4.9

Although there was little overlap in the individual genes comprising the eleven 70-gene signatures, it was evident that there was common function represented in these signatures. In particular, with the exception of signature 8, each 70-gene profile was annotated with cell cycle, E2F or mitosis. Given the critical role of E2F proteins in the control of genes encoding DNA replication and mitotic activities, these three annotations can be viewed as representing the same biological process. Additionally, the term metabolism scored high with signatures 1, 5 and 9, and glutamine metabolism specifically with signature 4. These observations suggest that, despite the lack of common genes across the signatures, there is in fact common biology, represented in the form of cell cycle gene control, which characterizes breast cancer survival.

3.3 Accuracy of function inference

To quantify the accuracy with which GATHER could discover novel annotations, we created a dataset of 407 functionally homologous groups of human genes. Although GATHER contained many types of annotations, we chose to evaluate annotations from GO because this data source was actively curated and was of considerable general interest. In addition, its curators documented the evidence (computational or experimental) substantiating each annotation, allowing distinct analyses of different types of annotations.

To annotate the gold standard, we applied GATHER to each of the 407 gene groups and retained the GO annotations associated with each group with BF ≥ 0. Then, we repeated the analysis with two modifications: (1) we removed the annotations associated with each gene in a gene group, and (2) we applied GATHER to the unannotated gene groups using only the annotations inferred from homologs, a protein interaction network or a literature network. Comparing the results of the second analysis against the first revealed the ability of GATHER to discover annotations not previously related to the genes.

First, we quantified the ability of the algorithm to recover the correct annotations with different parameters. For each analysis, we ranked the resulting annotations by decreasing significance, either Bayes factor or _P_-value, and calculated the precision and recall at each rank in the list (Fig. 2A). At a specific rank, the recall quantified the portion of the gold standard recovered and the precision quantified the errors found in the list. The calculation of recall and precision is described in the Supplementary materials. When including annotations from both homologs and the literature network, GATHER recovered 90% of the original functions; 83% had a BF of at least 0, and 54% were highly significant, with a BF of at least 6. The literature network recovered more annotations than the protein network (72 and 59%, respectively) and at higher levels of significance. Many of the missed annotations were not highly prevalent in the original gene sets. When we filtered out the annotations associated with only a single gene in a set, GATHER recovered 99% of the original annotations, 78% of them with a BF of at least 6. GATHER was thus better able to recover annotations supported by more evidence.
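The rank-wise evaluation described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the procedure given in the Supplementary materials: the function name `precision_recall_at_ranks` and the toy annotations are hypothetical, and precision and recall are given their standard definitions (hits divided by rank, and hits divided by the size of the gold standard).

```python
def precision_recall_at_ranks(ranked, gold):
    """Return a list of (precision, recall) pairs, one per rank.

    ranked: predicted annotations ordered by decreasing significance
            (e.g. by Bayes factor or by P-value)
    gold:   gold-standard annotations for the same gene group
    """
    gold = set(gold)
    hits = 0
    curve = []
    for rank, annotation in enumerate(ranked, start=1):
        if annotation in gold:
            hits += 1
        precision = hits / rank                        # correct among the top `rank`
        recall = hits / len(gold) if gold else 0.0     # gold standard recovered so far
        curve.append((precision, recall))
    return curve


# Toy input (hypothetical, not data from the paper).
ranked = ["mitosis", "cell migration", "M phase", "amine metabolism"]
gold = {"mitosis", "M phase", "nuclear division"}
curve = precision_recall_at_ranks(ranked, gold)
```

Plotting the pairs in `curve` for each inference method would give one precision-recall curve per method, which is the kind of comparison the text draws between homologs, networks and their combination.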

Fig. 2

Accuracy of functional predictions. (A) We assembled 407 groups of functionally related genes with their GO annotations removed. GATHER then annotated each group, predicting the missing annotations (All Annotations) using information from homologs, a network of protein binding, a network extracted from the literature, or homologs combined with the literature network. The table indicates the portion of annotations correctly recovered. For Frequent Annotations, we retained only the annotations that were associated with more than one gene in each gene group. (B) We compared the precision and recall for annotations predicted using different methods. Here, Network refers to the literature network. As a baseline, we plotted a curve showing the significance of annotations when calculated as _P_-values from the hypergeometric distribution.

Next, we examined the rate at which GATHER produced incorrect annotations (Fig. 2B). At a BF threshold of six, at which GATHER recalled 54% of all known annotations, the precision (the portion of the predicted annotations present in the gold standard) was 32%. Comparing the precision at the same level of recall showed that inferring annotations from homologs alone was the most precise strategy. However, this strategy recovered 23% fewer correct annotations than when combined with the literature network. We also compared the Bayes factor statistic against the _P_-value calculated by the popular Fisher's exact test (hypergeometric distribution) and found that ranking annotations by Bayes factor achieved higher precision at all levels of recall.
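For reference, the hypergeometric baseline mentioned above can be computed as a one-sided tail probability. The sketch below is an assumption-laden illustration rather than GATHER's own code: the function name `hypergeom_pvalue` and its parameter names are hypothetical, and the Bayes factor itself is not reproduced here because its definition is not given in this section.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """One-sided hypergeometric P-value, P(X >= k).

    N: number of genes in the background population
    K: number of background genes carrying the annotation
    n: number of genes in the query group
    k: number of query genes carrying the annotation
    """
    upper = min(n, K)  # X cannot exceed the group size or the annotated total
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so infeasible terms drop out of the sum on their own.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy example (hypothetical counts): 2 of 2 query genes carry an annotation
# that is attached to 3 of 10 background genes.
p = hypergeom_pvalue(2, 2, 3, 10)
```

Ranking annotations by increasing `p` gives the baseline ordering; the comparison in the text replaces this ordering with one by decreasing Bayes factor.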

Examined more closely, the precision estimate implied that 68% of the annotations identified with a BF cutoff of six did not appear to be associated with the gene group. There were two possible explanations: either the gene did not perform the function (the annotation was incorrect), or the gene performed the function (the annotation was correct) but this had not yet been discovered or annotated. To distinguish between these two cases, we collected seven successive versions of the GO annotation database, from the earliest available release in August 2002 to the latest in October 2005. We repeated the annotations as above on the earliest database and checked later databases to determine whether the annotations of the gene groups would have appeared in future analyses (Fig. 3A). In the first version, there were 3975 unverified annotations; 269 of these were verified within three years. The rate of verification was steady, strongly suggesting that more functions would be discovered. We examined the method by which functions were discovered by dividing the annotations into Experimental and Computational groups based on the type of evidence supporting the annotation. The annotations predicted by GATHER that appeared in later releases of the GO databases were verified with computational methods at a similar rate to experimental ones (Fig. 3B). Ten percent of the computationally verified annotations were later reconfirmed experimentally.
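The bookkeeping behind this retrospective check can be sketched as below. It is a simplified illustration, not the authors' pipeline: the function name `verified_by_release`, the release labels and the toy annotation pairs are all hypothetical. Only the counts 3975 and 269 come from the text.

```python
def verified_by_release(predictions, releases):
    """Cumulative number of predicted annotations found in later releases.

    predictions: set of (gene_group, term) pairs predicted from the earliest
                 database but absent from it (the unverified annotations)
    releases:    list of (label, annotations) in chronological order, where
                 annotations is a set of (gene_group, term) pairs
    """
    verified = set()
    counts = []
    for label, annotations in releases:
        verified |= predictions & annotations  # predictions confirmed so far
        counts.append((label, len(verified)))
    return counts


# Toy input (hypothetical, not data from the paper).
predictions = {("g1", "mitosis"), ("g1", "M phase"), ("g2", "cell migration")}
releases = [
    ("release_1", {("g1", "mitosis")}),
    ("release_2", {("g1", "mitosis"), ("g2", "cell migration")}),
]
counts = verified_by_release(predictions, releases)

# Fraction verified over the period studied, from the counts reported in the text.
fraction_verified = 269 / 3975  # about 0.068
```

Splitting `annotations` by evidence type before the call would give separate Experimental and Computational curves.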

Fig. 3

(A) We predicted the functions for gene groups using the annotations available in the GO annotation database in August 2002. We then analyzed the gene groups using subsequent releases of the database. For annotations that appeared in analyses of future releases, we assigned them as either Experimental or Computational according to the type of evidence used to justify the annotation in the database. (B) This plot shows the rate at which either Experimental or Computational annotations were verified. Day 0 refers to the August 2002 release of the database. An annotation was verified computationally at an average rate of one every 7 days, while an annotation was verified experimentally every 9 days.

4 DISCUSSION

Biological investigations in the post-genomic era often leverage genome-wide assays to uncover relationships between processes and groups of genes. Extracting the full measure of biological understanding from this scale of data requires the development of a novel class of tool that can interpret comprehensively the biology represented in gene signatures within the context of multi-level biological systems. We have developed GATHER as a tool to address this challenge.

We used GATHER to analyze the Rb/E2F pathway. Its annotations corroborated known functions of E2F, including DNA synthesis, mitosis, nucleotide metabolism and apoptosis. The synthesis of multiple types of evidence was pivotal in establishing strong evidence of phenomena such as the mechanism for combinatorial control of E2F regulation. Furthermore, the functional discovery component of GATHER revealed deeper relationships between Rb/E2F and other processes, such as the TGF-beta pathway and apoptosis. Although these pathways were regulated predominantly through post-translational activities, E2F could regulate them through transcriptional mechanisms (E2F could regulate AKT through Gab2 and apoptosis through TOPBP1), which may have manifested as signals in the expression profile that could be identified using the functional inference component. These results demonstrated the utility of the inference algorithm and its ability to dig deeper into the biology of the signature. The analysis of the Rb/E2F pathway showed that a tool such as GATHER could support and reinforce discovery through the synthesis of information across biological systems.

We then evaluated the accuracy of GATHER across different groups of genes. We found that it could recover nearly all currently known annotations, although the extent to which its predicted annotations might be confirmed in the future remained undetermined. Furthermore, it was unclear whether novel annotations not currently covered by our evaluation might be more difficult to predict. Nevertheless, it was apparent that annotation databases were slowly accumulating ever-increasing amounts of knowledge about genes.

From the same analysis, the rate of growth of database annotations revealed two striking observations. First, it was intriguing that the number of computational annotations had not increased at a rate significantly greater than that from experimental methodologies. Second, it was also interesting that few annotations predicted from computational models had been verified with experiments. These findings implied that computational and experimental based annotations were complementary, and that both would be necessary for continued development of a comprehensive database of gene functions.

Perhaps the most revealing indication of the value of GATHER in interpreting genome-scale measures of gene expression was the breast cancer analysis. The fact that multiple gene expression signatures can predict outcome, with little evidence of overlap in the genes constituting the signatures, can be worrisome, possibly reflecting the dangers of large-scale analyses such as this and the opportunity for false discovery. In reality, such a result should not be surprising given the exceedingly high complexity of the cancer phenotype. Thus, one view of this result is simply that complex expression signatures can match and reflect the heterogeneity and complexity of the disease, emphasizing the power of whole genome expression analysis. But, in the absence of an ability to interpret this complexity, one is left with a predictive tool but no insight into the biology. The use of GATHER to find common aspects of biology in these otherwise dissimilar expression profiles highlights both the power of expression profiling to tap into the complexity and the power of GATHER to reveal the common structure in this complexity.

Finally, based on the results presented, we propose the following recommendations for the use of GATHER. First, although it is often sufficient to analyze gene groups without functional inference, inferred annotations reveal a more complete assessment of potential functions of the genes. When inferring annotations, those using only functions from homologs have higher precision than those from other networks. For more complete (but lower quality) annotations, we recommend including annotations from both homologs and the literature network. The accuracy of predicted annotations is related to the Bayes factor, which should be interpreted as the strength of the evidence supporting a function. If an explicit cutoff is desired, we recommend six, as it maximizes the harmonic mean of the recall and precision.
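The cutoff recommendation rests on the harmonic mean of precision and recall (the F1 score). A minimal sketch of that calculation follows, assuming the standard definition; the helper names `f1_score` and `best_cutoff` are hypothetical, and only the precision (32%) and recall (54%) at a BF of six are taken from Section 3.3.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_cutoff(points):
    """Return the cutoff whose (precision, recall) pair maximizes F1.

    points: iterable of (cutoff, precision, recall) tuples, one per
            candidate Bayes factor threshold
    """
    return max(points, key=lambda p: f1_score(p[1], p[2]))[0]


# Precision and recall reported at a BF cutoff of six.
f1_at_bf6 = f1_score(0.32, 0.54)  # about 0.40
```

Given precision and recall measured at several thresholds, `best_cutoff` would select the threshold with the highest harmonic mean, which is the criterion the recommendation describes.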

The authors would like to thank Rob Wagner, Carlos Carvalho and Mark DeLong for their technical help, Andrea Bild, Holly Dressman, Eran Andrechek, Seiichi Mori, Steven Angus and Guang Yao for helpful discussions and critical evaluations of GATHER, Liat Ein-Dor for providing gene signatures and anonymous reviewers for their insightful comments. J.T.C. is supported by postdoctoral fellowship #PF-05-047-01-GMC from the American Cancer Society; J.R.N. is supported by NIH 5-U54-CA112952-03. Finally, the authors thank Kaye Culler for her assistance in the preparation of the manuscript.

Conflict of Interest: none declared.

Author notes

Associate Editor: Jonathan Wren

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
