Exploring the human genome with functional maps

Curtis Huttenhower et al. Genome Res. 2009 Jun.

Abstract

Human genomic data of many types are readily available, but the complexity and scale of human molecular biology make it difficult to integrate this body of data, understand it from a systems level, and apply it to the study of specific pathways or genetic disorders. An investigator could best explore a particular protein, pathway, or disease if given a functional map summarizing the data and interactions most relevant to his or her area of interest. Using a regularized Bayesian integration system, we provide maps of functional activity and interaction networks in over 200 areas of human cellular biology, each including information from approximately 30,000 genome-scale experiments pertaining to approximately 25,000 human genes. Key to these analyses is the ability to efficiently summarize this large data collection from a variety of biologically informative perspectives: prediction of protein function and functional modules, cross-talk among biological processes, and association of novel genes and pathways with known genetic disorders. In addition to providing maps of each of these areas, we also identify biological processes active in each data set. Experimental investigation of five specific genes, AP3B1, ATP6AP1, BLOC1S1, LAMP2, and RAB11A, has confirmed novel roles for these proteins in the proper initiation of macroautophagy in amino acid-starved human fibroblasts. Our functional maps can be explored using HEFalMp (Human Experimental/Functional Mapper), a web interface allowing interactive visualization and investigation of this large body of information.

Figures

Figure 1.

Overview and performance of genomic data integration for functional mapping. (A) Data from ∼30,000 genome-scale experiments (∼15,000 microarray conditions and ∼15,000 interaction and sequence-based assays) were organized into 656 related data sets (Supplemental Table 1). These data sets were used as inputs for 229 process-specific naïve Bayesian classifiers, each trained to predict functional relationships specific to a particular biological area, and one process-independent global classifier. Mutual information was calculated between each pair of data sets and used to regularize these classifiers and prevent overconfident predictions. Each classifier was used to infer a predicted functional relationship network for a particular biological process. These networks were then analyzed to find statistically significant sets of functional relationships spanning gene groups of interest. This results in functional maps focusing on individual genes, groups of genes, biological processes, or genetic disorders. Each map provides an informative summary of the genomic data collection focused on the current biological entity of interest. (B) Performance of predicted functional relationship networks in recapitulating known biology. To confirm that the predicted functional relationships underlying our functional maps were accurate, we scored their ability to recover information from a held-out portion (25% of genes) of our gold standard. This evaluation includes the global process-independent network tested on all genes and on the hold-out set, a process-aware global mean of the process-specific networks tested on all genes and on the hold-out set, and an unregularized global process-independent network tested on all genes. Functionally related gene pairs are ranked by comparing predicted probabilities based on data integration with the known relationships in the held-out test set. Results for individual process-specific networks appear in Supplemental Figure 1 and Supplemental Table 3. Precision is well above baseline, and since naïve classifiers are generally robust to overfitting, performance on the hold-out set is only slightly below that on the entire genome. Bayesian regularization provides a large performance increase at low recall by preventing overconfident predictions.
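
The naïve Bayesian integration step described in this caption can be illustrated with a minimal sketch. It assumes each data set has been discretized into bins and that conditional probability tables have already been learned from a gold standard. The function name, argument layout, and example numbers are hypothetical and are not taken from the paper.

```python
import math

def naive_bayes_posterior(prior, cpts, observations):
    """Posterior probability that a gene pair is functionally related.

    prior        -- P(related) before seeing any data (hypothetical value)
    cpts         -- one (P(bin | related), P(bin | unrelated)) pair per data set
    observations -- one observed bin index per data set, or None if missing
    """
    log_rel = math.log(prior)
    log_unrel = math.log(1.0 - prior)
    for (p_rel, p_unrel), obs in zip(cpts, observations):
        if obs is None:
            # A data set with no value for this gene pair adds no evidence.
            continue
        # Naive assumption: data sets are independent given the class.
        log_rel += math.log(p_rel[obs])
        log_unrel += math.log(p_unrel[obs])
    # Normalize in log space to avoid underflow with many data sets.
    top = max(log_rel, log_unrel)
    rel = math.exp(log_rel - top)
    unrel = math.exp(log_unrel - top)
    return rel / (rel + unrel)

# Hypothetical example: one two-bin data set, gene pair observed in bin 1.
example_cpts = [([0.3, 0.7], [0.8, 0.2])]
example_posterior = naive_bayes_posterior(0.1, example_cpts, [1])
```

Because the independence assumption multiplies evidence from every data set, many weakly redundant data sets can push the posterior toward 0 or 1, which is the overconfidence that the regularization in the caption is meant to counter.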

Figure 2.

Results of functional mapping from data integration. Functional maps derived from experimental data integration provide information on groups of genes, including cross-talk between pathways, processes, and genes associated with genetic disorders. In all figure parts, thicker edges indicate stronger associations. (A) The process-specific functional relationship networks underlying functional maps can themselves provide information on individual genes' and modules' behavior in the underlying genomic data. Focusing on ALOX5AP, a membrane protein participating in leukotriene synthesis, highlights a predicted association with the process of chemotaxis in leukocytes, driven by multiple predicted relationships with known chemotaxis proteins. This represents an instance of functional under-annotation; while ALOX5AP has not been formally cataloged as participating in chemotaxis, its immediate biosynthetic product LTB4 is a known activator of chemotaxis (Peters-Golden and Brock 2003). (B) Associations between genetic disorders and biological processes. To validate functional mapping's ability to discover disease/process associations from data, we focused on ovarian cancer, which is known to be influenced by at least seven genes (Online Mendelian Inheritance in Man 2008); we predict associations with the cell cycle, cell proliferation, and hormone stimulus, as well as with several other cancers. These associations are each based on relationships among individual genes predicted from integrated genomic data; directed arrows point to the gene group in which the background connectivity was calculated. As above, additional novel predictions can be explored online using HEFalMp. (C) Visualization of a functional map generated by querying a custom gene set. We chose to focus on the known autophagy proteins ATG7, BECN1, and MAP1LC3B, in addition to genes of interest LAMP2, RAB11A, and VAMP7, in the context of autophagy. This extracts two clear clusters of predicted autophagy-specific functional relationships, one consisting mainly of known autophagy proteins and one enriched for ER/Golgi and vesicular trafficking proteins (including the three test genes). This led us to experimentally test and confirm the hypothesis that LAMP2 and RAB11A (as well as AP3B1, ATP6AP1, and BLOC1S1) are involved in macroautophagy in amino acid-starved human fibroblasts.

Figure 3.

Impaired autophagosome formation confirms the predicted involvement of AP3B1, ATP6AP1, BLOC1S1, LAMP2, and RAB11A in human macroautophagy. Our functional maps predict AP3B1, ATP6AP1, and BLOC1S1 to be involved in autophagy; an early version also predicted the involvement of LAMP2, RAB11A, and VAMP7 in this process, which recycles cellular biomass so that the cell can survive conditions of starvation or stress. While VAMP7 knockdowns showed no effect (see Discussion), siRNA knockdowns of the other five proteins inhibited normal autophagy. (A) Measurement of the MAP1LC3-I and autophagosome-bound MAP1LC3-II isoforms by immunoblotting. Under a control condition (luciferase siRNA), starvation (+) induces autophagy in human fibroblasts and up-regulates the autophagy marker MAP1LC3-II; this up-regulation is generally inhibited by knockdown of proteins required for autophagy, e.g., ATG5. (B) Quantification of MAP1LC3-II band intensities. Intensities for each condition are calculated relative to GAPDH using the ImageJ software. Replicates (e.g., controls run on multiple gels) have been averaged when available. (C) Quantification of punctate autophagosome formation. The numbers of fluorescent puncta (MAP1LC3-II-labeled autophagosomes) per cell were averaged over counts from three independent investigators in 10 images per normal (−) or starvation (+) condition, unlabeled and randomized (80 images total; see Supplemental Fig. 5 for standard errors). The resulting distribution of puncta frequencies is low under all nonstarved conditions and significantly increased under a negative control (luciferase) condition. It is only slightly increased for the ATG5 positive control and for the AP3B1, ATP6AP1, BLOC1S1, LAMP2, and RAB11A predictions. (D) Punctate localization of fluorescent GFP-LC3 to the autophagosome during autophagy. Under normal conditions (−), MAP1LC3-I is localized diffusely through the cytoplasm; starvation (+) induces autophagy and localization to the autophagosome membrane. Knockdowns of ATG5 (positive control) or the five validated genes abrogate this localization, indicating that these proteins are required for successful macroautophagy.
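
The arithmetic behind panel (B) can be sketched as follows: each MAP1LC3-II band intensity is divided by the GAPDH band from the same lane, and replicate ratios are averaged. This is only an illustration of that normalization. The function names and intensity values are hypothetical, and the actual measurements were made in ImageJ.

```python
from statistics import mean

def relative_intensity(target_band, loading_control_band):
    """Band intensity expressed relative to a loading control such as GAPDH."""
    return target_band / loading_control_band

def averaged_relative_intensity(replicates):
    """Average the normalized intensity over replicate lanes.

    replicates -- list of (target, loading control) raw intensity pairs,
                  for example the same control run on several gels
    """
    return mean(relative_intensity(t, c) for t, c in replicates)

# Hypothetical raw intensities for one condition measured on two gels.
example_ratio = averaged_relative_intensity([(50.0, 100.0), (30.0, 100.0)])
```

Normalizing within each lane before averaging keeps differences in gel loading from being mistaken for differences in MAP1LC3-II abundance.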

Figure 4.

The HEFalMp tool for functional mapping. We have provided a web interface, the Human Experimental/Functional Mapper (HEFalMp), at http://function.princeton.edu/hefalmp for interactively exploring our predicted functional maps. A user can focus on a gene, gene set, biological process, or genetic disorder of interest and investigate its predicted associations with other genes, processes, or diseases. These predictions are presented using a variety of visualizations, and all data are downloadable for further analysis. (A) Associating a gene with biological processes. An investigator wishes to study which biological processes the TROAP protein is predicted to participate in. (B) Associating a gene with genetic disorders. In the context of one of TROAP's most likely biological processes, chromosome segregation, it is predicted to be particularly associated with genes causing melanomas and breast cancer. (C) Visualizing a predicted functional relationship network for specific genes. Focusing on a gene set consisting of TROAP, two of its most likely relationship partners (UBE2C and TPX2), and two of its most likely partners in chromosome segregation (TOP2A and NCAPH) retrieves a predicted functional relationship network specific to the area of chromosome segregation. (D) Viewing genomic data contributing to a prediction. Clicking on a predicted functional relationship or specifically focusing on TROAP's relationship with CDC25C displays the genomic data used to generate the prediction. Here, TROAP is predicted to relate to CDC25C, a highly conserved mitotic regulator, due to very high correlation between the genes' expression in a variety of microarray conditions. Taken together, this evidence suggests that TROAP is strongly cell cycle regulated and may play an as-yet-uncharacterized role in mitosis.

Figure 5.

Overview of hierarchically clustered mutual information (MI) between genomic data sets. We used MI among 656 genomic data sets to perform regularization of the parameters of our 230 Bayesian classifiers (229 process-specific and one global). Data sets with a greater proportion of shared information were more heavily mixed with a uniform prior, resulting in the overall up-weighting of particularly unique and informative data. Additionally, a global view of the mutual information scores reveals structure in the data. Primarily platform-based effects can be observed among the expression data sets we obtained from GEO (Barrett et al. 2005), most of which use Affymetrix arrays; tissue type, cell type, and array normalization algorithms can all cause small amounts of information to be shared between many data sets. For example, Robust MultiArray (RMA) normalization causes a noticeable shift in the information shared among HG-U133A arrays. While the amount of MI between any two data sets is generally low (this figure saturates at one bit of shared information), an accumulation of many small overlaps can result in overconfidence during Bayesian data integration, accounting for the success of parameter regularization.
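
The two ingredients named in this caption, mutual information between discretized data sets and mixing a probability table with a uniform prior, can be sketched as below. How the paper converts shared information into a mixing weight is not stated here, so the `weight` argument is left as a free parameter and is an assumption of this sketch, as are the function names.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        p_indep = (px[x] / n) * (py[y] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

def mix_with_uniform(probs, weight):
    """Shrink a probability table toward the uniform distribution.

    weight -- value in [0, 1]; 0 leaves the table unchanged, 1 makes it
              uniform. Choosing it from the shared information is an
              assumption of this sketch.
    """
    k = len(probs)
    return [(1.0 - weight) * p + weight / k for p in probs]

# Two identical binary data sets share one full bit of information.
shared_bits = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
# A confident table is pulled halfway toward uniform.
regularized = mix_with_uniform([0.9, 0.1], 0.5)
```

A table mixed heavily with the uniform prior assigns similar probabilities to related and unrelated pairs, so a redundant data set contributes little evidence during integration.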

References

    1. Advani R.J., Yang B., Prekeris R., Lee K.C., Klumperman J., Scheller R.H. VAMP-7 mediates vesicular transport from endosomes to lysosomes. J. Cell Biol. 1999;146:765–776.
    2. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., et al. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29.
    3. Barrett T., Suzek T.O., Troup D.B., Wilhite S.E., Ngau W.C., Ledoux P., Rudnev D., Lash A.E., Fujibuchi W., Edgar R. NCBI GEO: Mining millions of expression profiles–database and tools. Nucleic Acids Res. 2005;33:D562–D566.
    4. Carpenter A.E., Jones T.R., Lamprecht M.R., Clarke C., Kang I.H., Friman O., Guertin D.A., Chang J.H., Lindquist R.A., Moffat J., et al. CellProfiler: Image analysis software for identifying and quantifying cell phenotypes. Genome Biol. 2006;7:R100. doi: 10.1186/gb-2006-7-10-r100.
    5. Chapuy B., Tikkanen R., Muhlhausen C., Wenzel D., von Figura K., Honing S. AP-1 and AP-3 mediate sorting of melanosomal and lysosomal membrane proteins into distinct post-Golgi trafficking pathways. Traffic. 2008;9:1157–1172.
