Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists

Da Wei Huang, Brad T. Sherman, Richard A. Lempicki*

Laboratory of Immunopathogenesis and Bioinformatics, Clinical Services Program, SAIC-Frederick, Inc., National Cancer Institute at Frederick, Frederick, MD 21702, USA

*To whom correspondence should be addressed. Tel: +1 301 846 5093; Fax: +1 301 846 6762; Email: rlempicki@mail.nih.gov

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Received: 10 September 2008
Revision received: 24 October 2008
Accepted: 03 November 2008
Published: 25 November 2008

Da Wei Huang, Brad T. Sherman, Richard A. Lempicki, Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists, Nucleic Acids Research, Volume 37, Issue 1, 1 January 2009, Pages 1–13, https://doi.org/10.1093/nar/gkn923

Abstract

Functional analysis of large gene lists, derived in most cases from emerging high-throughput genomic, proteomic and bioinformatics scanning approaches, is still a challenging and daunting task. The gene-annotation enrichment analysis is a promising high-throughput strategy that increases the likelihood for investigators to identify biological processes most pertinent to their study. Approximately 68 bioinformatics enrichment tools that are currently available in the community are collected in this survey. Tools are uniquely categorized into three major classes, according to their underlying enrichment algorithms. The comprehensive collections, unique tool classifications and associated questions/issues will provide a more comprehensive and up-to-date view regarding the advantages, pitfalls and recent trends in a simpler tool-class level rather than by a tool-by-tool approach. Thus, the survey will help tool designers/developers and experienced end users understand the underlying algorithms and pertinent details of particular tool categories/tools, enabling them to make the best choices for their particular research interests.

INTRODUCTION

Traditional biological research approaches typically study one gene or a few genes at a time. In contrast, high-throughput genomic, proteomic and bioinformatics scanning approaches (such as expression microarray, promoter microarray, proteomics, ChIP-on-chip, etc.) are emerging as alternative technologies that allow investigators to simultaneously measure genome-wide changes in, and regulation of, genes under certain biological conditions. These high-throughput technologies usually generate large ‘interesting’ gene lists as their final outputs. However, the biological interpretation of large, ‘interesting’ gene lists (ranging in size from hundreds to thousands of genes) is still a challenging and daunting task. Over the last few decades, bioinformatics methods that use the biological knowledge accumulated in public databases [e.g. Gene Ontology (1)] have made it possible to systematically dissect large gene lists in an attempt to assemble a summary of the most enriched and pertinent biology. A number of high-throughput enrichment tools, including, but not limited to, Onto-Express, MAPPFinder, GoMiner, DAVID, EASE, GeneMerge and FuncAssociate (2–10), were independently developed during 2002 and 2003 as initial studies to address the challenge of functionally analyzing large gene lists. Since then, the enrichment analysis field has been very productive, resulting in many more similar tools becoming publicly available. In 2005, approximately 14 such tools were collected and reviewed by Khatri et al. (11) and by Curtis et al. (12), respectively. Activity in the field has continued to grow as the number of new enrichment tools (with distinct new ideas and features) has significantly increased. Approximately 68 such tools have been collected in this survey (2–10,13–73) (Table 1 and Supplementary Data 1).

Table 1.

List of 68 enrichment tools

Enrichment tool name | Year of release | Key statistical method | Category
FunSpec | 2002 | Hypergeometric | Class I
Onto-express | 2002 | Fisher's exact; hypergeometric; binomial; chi-square | Class I
EASE | 2003 | Fisher's exact (modified as EASE score) | Class I
FatiGO/FatiWise/FatiGO+ | 2003 | Fisher's exact | Class I
FuncAssociate | 2003 | Fisher's exact | Class I
GARBAN | 2003 | Hypergeometric | Class I
GeneMerge | 2003 | Hypergeometric | Class I
GoMiner | 2003 | Fisher's exact | Class I
MAPPFinder | 2003 | _Z_-score; hypergeometric | Class I
CLENCH | 2004 | Hypergeometric; chi-square; binomial | Class I
GO::TermFinder | 2004 | Hypergeometric | Class I
GOAL | 2004 | Permutation | Class I
GOArray | 2004 | Hypergeometric; _Z_-score; permutation | Class I
GOStat | 2004 | Fisher's exact; chi-square | Class I
GoSurfer | 2004 | Chi-square | Class I
OntologyTraverser | 2004 | Hypergeometric; Fisher's exact | Class I
THEA | 2004 | Hypergeometric | Class I
BiNGO | 2005 | Hypergeometric; binomial | Class I
FACT | 2005 | Adopt GeneMerge and GO::TermFinder statistical modules | Class I
gfinder | 2005 | Fisher's exact | Class I
Gobar | 2005 | Hypergeometric | Class I
GOCluster | 2005 | Hypergeometric | Class I
GOSSIP | 2005 | Fisher's exact | Class I
L2L | 2005 | Binomial; hypergeometric | Class I
WebGestalt | 2005 | Hypergeometric | Class I
BayGO | 2006 | Bayesian; Goodman and Kruskal's gamma factor | Class I
eGOn/GeneTools | 2006 | Fisher's exact | Class I
Gene Class Expression | 2006 | _Z_-statistics | Class I
GOALIE | 2006 | Hidden Kripke model | Class I
GOFFA | 2006 | Fisher's inverse chi-square | Class I
GOLEM | 2006 | Hypergeometric | Class I
JProGO | 2006 | Fisher's exact; Kolmogorov–Smirnov test; Student's _t_-test; Wilcoxon's test; hypergeometric | Class I
PageMan | 2006 | Fisher's exact; chi-square; Wilcoxon | Class I
STEM | 2006 | Hypergeometric | Class I
WEGO | 2006 | Chi-square | Class I
EasyGO | 2007 | Hypergeometric; chi-square; binomial | Class I
g:Profiler | 2007 | Hypergeometric | Class I
ProbCD | 2007 | Yule's Q; Goodman-Kruskal's gamma; Cramer's T | Class I
GOEAST | 2008 | Hypergeometric | Class I
GOHyperGAll | 2008 | Hypergeometric | Class I
CatMap | 2004 | Permutations | Class II
Godist | 2004 | Kolmogorov–Smirnov test | Class II
GO-Mapper | 2004 | Gaussian distribution; EQ-score | Class II
iGA | 2004 | Permutations; hypergeometric; _t_-test; _Z_-score | Class II
GSEA | 2005 | Kolmogorov–Smirnov-like statistic | Class II
MEGO | 2005 | _Z_-score | Class II
PAGE | 2005 | _Z_-score | Class II
T-profiler | 2005 | _t_-Test | Class II
FuncCluster | 2006 | Fisher's exact | Class II
FatiScan | 2007 | Fisher's exact | Class II
FINA | 2007 | Fisher's exact | Class II
GAzer | 2007 | _Z_-statistics; permutation | Class II
GeneTrail | 2007 | Hypergeometric; Kolmogorov–Smirnov | Class II
MetaGP | 2007 | _Z_-score | Class II
Ontologizer | 2004 | Fisher's exact | Class III
POSOC | 2004 | POSET (a discrete math: finite partially ordered set) | Class III
topGO | 2006 | Fisher's exact | Class III
GO-2D | 2007 | Hypergeometric; binomial | Class III
GENECODIS | 2007 | Hypergeometric; chi-square | Class III
GOSim | 2007 | Resnik's similarity | Class III
PalS | 2008 | Percent | Class III
ProfCom | 2008 | Greedy heuristics | Class III
GOTM | 2004 | Hypergeometric | Class I,II
ermineJ | 2005 | Permutations; Wilcoxon rank-sum test | Class I,II
DAVID | 2003 | Fisher's exact (modified as EASE score) | Class I,III
GOToolBox | 2004 | Hypergeometric; Fisher's exact; binomial | Class I,III
ADGO | 2006 | _Z_-statistic | Class II,III
FunNet | 2008 | Unclear | Unclear

During the past several years, bioinformatics enrichment tools have played a very important and successful role contributing to the gene functional analysis of large gene lists for various high-throughput biological studies, which is clearly evidenced by thousands of publications citing these tools (based on Google Scholar as of September 2008). However, these bioinformatics enrichment tools are still in an actively growing and improving stage, without unified methods or one ‘gold’ standard. As more enrichment tools emerge in the scientific community, the individual tool-developing group or end user finds it more and more difficult to comprehensively track the usefulness of all of the existing works to his or her research. This confusing plethora of tools has resulted in several issues: (i) difficulty in comprehensively comparing and remembering the algorithms/features in a tool-by-tool manner among the overwhelmingly large number of tools available (approximately 68 current tools); (ii) a chance that some good work may be overlooked; (iii) redundant efforts in developing ideas that already exist, because of developers’ difficulties in grasping the breadth of the field; (iv) out-of-date ideas being used in newly released tools because of the developers’ lack of awareness of the latest methods; and (v) difficulties for end users in deciding, among so many overwhelming choices, which enrichment tools are most suitable to their analytic needs.

This survey includes four sections to address the situations listed earlier: First, it will identify 68 enrichment tools that are currently available, and further describe the rationales behind them. That way, the tool designers, developers and end users will be made aware of most, if not all, of the existing tools. Secondly, tools will be uniquely classified, according to their underlying algorithms, into three major categories. Thus, readers can more easily and quickly grasp the key spirit of the 68 tools by following the categorical logic instead of trying to search through a tool-by-tool layout. Thirdly, the paper will focus on several important, but largely unanswered, questions and issues associated with the field. We hope that the questions/issues to be discussed will drive more attention, independent thinking, and discussion in the field, thereafter leading to better solutions in the near future. Finally, the paper will conclude with the current status and trends in the field.

GENERAL PRINCIPLE OF ENRICHMENT ANALYSIS AND 68 AVAILABLE TOOLS

A biological process is typically carried out by a group of genes, as opposed to an individual gene alone. The principal foundation of enrichment analysis is that if a biological process is abnormal in a given study, the co-functioning genes should have a higher (enriched) potential to be selected as a relevant group by the high-throughput screening technologies. Such a rationale moves the analysis of large gene lists from an individual gene-oriented view to a relevant gene group-based analysis. Because the analytic conclusion is based on a group of relevant genes instead of on an individual gene, it increases the likelihood for investigators to identify the correct biological processes most pertinent to the biological phenomena under study. For example, suppose that 10% of the user's genes selected by a microarray experiment are kinases, whereas only 1% of the genes in the human genome (the gene population background) are kinases. The enrichment can then be quantitatively measured by common and well-known statistical methods, including Chi-square, Fisher's exact test, Binomial probability and Hypergeometric distribution (the enrichment _P_-value is discussed further in a later section of this paper). Thus, a conclusion may be obtained for this particular example: kinases are enriched in the user's study and therefore play an important role in the study. Fortunately, annotation databases such as Gene Ontology (GO) (1), which collect biological knowledge in a gene-to-annotation format, are very suitable for high-throughput bioinformatics scanning for the enrichment analysis. The tools systematically map a large number of interesting genes in a list to the associated biological annotation terms (e.g. GO Terms or Pathways), and then statistically examine the enrichment of gene members for each of the annotation terms by comparing the outcome to the control (or reference) background. Thereafter, the annotation terms with enriched gene members can be identified from tens of thousands of other annotation terms in a high-throughput fashion (11,12). The enriched annotation terms associated with the large gene list will give important insights that allow investigators to understand the biological themes behind the large gene list.
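As a hedged illustration of the kinase example, the enrichment _P_-value under a hypergeometric model can be written as below. The symbols are assumptions introduced here rather than notation from this survey: N is the number of genes in the reference background, K the number of background genes annotated to the term, n the size of the user's gene list, and k the number of list genes annotated to the term.

```latex
P(X \ge k) \;=\; \sum_{i=k}^{\min(n,K)} \frac{\binom{K}{i}\,\binom{N-K}{n-i}}{\binom{N}{n}}
```

With illustrative numbers chosen only to match the 10% versus 1% proportions (N = 30 000, K = 300, n = 300, k = 30), the expected number of kinases in the list under the null model is nK/N = 3, so observing 30 corresponds to a very small upper-tail probability. This one-sided tail is the same quantity computed by a one-sided Fisher's exact test on the corresponding 2 × 2 table.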

Approximately 68 bioinformatics tools (Table 1 and Supplementary Data 1) (2–10,13–73), aligned with the above analytic scenarios and purposes, are collected in this study. Regardless of their distinct features, the general procedure of the tools can be described as having three major layers: data support (backend annotation database); data mining (algorithm and statistics); and result presentation (interface and exploration) (Figure 1). Each of the layers may greatly impact the comprehensiveness of analytic results, as discussed in later sections of this paper. The general features associated with each tool, such as tool home page, publication link, general database scope [see SerbGO (74), which searches detailed annotation coverage across tools], pathway presentation, etc., can be found in Supplementary Data 1, in order to help end users/developers look up tools for their research interests. Moreover, the capability, sensitivity and backend databases can be very different from tool to tool. It is not uncommon for users to try multiple tools with similar analytic capability on the same dataset in order to obtain the most satisfactory analytic results (75).
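The three layers can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not the implementation of any tool listed in Table 1: the gene identifiers, the two annotation terms, the 20-gene background and the function names (`upper_tail_pvalue`, `enrich`, `report`) are all hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

# Layer 1: data support. A toy gene-to-term annotation "database".
# All gene names and terms here are hypothetical.
ANNOTATION = {
    "kinase activity": {"g1", "g2", "g3", "g4"},
    "transport": {"g5", "g6", "g7", "g8", "g9", "g10"},
}
BACKGROUND = {f"g{i}" for i in range(1, 21)}  # 20 genes as reference background


def upper_tail_pvalue(k, n, K, N):
    """P(X >= k) for a hypergeometric draw of n genes from N, K of them annotated."""
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return favourable / comb(N, n)


# Layer 2: data mining. Test every term against the same background.
def enrich(gene_list, annotation, background):
    genes = set(gene_list) & background
    rows = []
    for term, members in annotation.items():
        members = members & background
        hits = genes & members
        p = upper_tail_pvalue(len(hits), len(genes), len(members), len(background))
        rows.append((term, len(hits), p))
    return sorted(rows, key=lambda row: row[2])  # most enriched term first


# Layer 3: result presentation. A simple linear text report.
def report(rows):
    return "\n".join(f"{term}\t{hits}\t{p:.3g}" for term, hits, p in rows)


results = enrich(["g1", "g2", "g3", "g5", "g11"], ANNOTATION, BACKGROUND)
print(report(results))
```

Keeping the annotation data, the statistic and the report as separate pieces mirrors the point made above: changing the backend database or the reference background changes the result even when the statistical test stays the same.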

Figure 1.

The infrastructure of typical enrichment tools. Even though the enrichment analysis tools have distinct features, they can be generally described as three major layers: backend annotation database; data mining; and result presentation. Each of the layers, rather than statistical methods alone, greatly influences the analytic results.

CLASSIFICATION OF ENRICHMENT TOOLS

When a tool developer or end user is searching for particular features among the many tools available, it is not easy to digest the features of all 68 tools without an appropriate classification. Based on differences in their algorithms, this survey classifies the 68 current enrichment tools into three classes: singular enrichment analysis (SEA); gene set enrichment analysis (GSEA); and modular enrichment analysis (MEA). A complete list of tools and their defining classes can be found in Table 1 and Supplementary Data 1. Notably, some tools with diverse capabilities belong to more than one class. The general features and limitations associated with each class are discussed in the following sections and are compared in Table 2.

Table 2.

Categorization of enrichment analysis tools

Class I: singular enrichment analysis (SEA)
Description: Enrichment _P_-value is calculated on each term from the pre-selected interesting gene list. Then, enriched terms are listed in a simple linear text format. This strategy is the most traditional algorithm. It is still dominantly used by most of the enrichment analysis tools.
Indication and limitation: Capable of analyzing any gene list, which could be selected from any high-throughput biological studies/technologies (e.g. Microarray, ChIP-on-CHIP, ChIP-on-sequence, SNP array, EXON array, large scale sequence, etc.). However, the deeper inter-relationships among the terms may not be fully captured in a linear format report.
Sub-types of algorithms (methods; example tools):
(i) Global reference background (Fisher's exact, hypergeometric, chi-square, binomial; GoStat, GoMiner, GOTM, BinGO, GOtoolBox, GFinder, etc.)
(ii) Local reference background (Fisher's exact, hypergeometric, chi-square, binomial; DAVID, Onto-Express, GARBAN, FatiGO, etc.)
(iii) Neural network (Bayesian; BayGO)

Class II: gene set enrichment analysis (GSEA)
Description: Entire genes (without pre-selection) and associated experimental values are considered in the enrichment analysis. The unique features of this strategy are: (i) No need to pre-select interesting genes, as opposed to Classes I and III; (ii) Experimental values integrated into _P_-value calculation.
Indication and limitation: Suitable for pair-wise biological studies (e.g. disease versus control). Currently, may be difficult to apply to the diverse data structures derived from a complex experimental design and some of the new technologies (e.g. SNP, EXON, Promoter arrays).
Sub-types of algorithms (methods; example tools):
(i) Based on ranked gene list (Kolmogorov–Smirnov-like; GSEA, CatMap, etc.)
(ii) Based on continuous gene values (_t_-Test, permutation, _Z_-score; FatiScan, ADGO, ermineJ, PAGE, iGA, GO-Mapper, GOdist, FINA, T-profiler, MetaGP, etc.)

Class III: modular enrichment analysis (MEA)
Description: This strategy inherits the key spirit of SEA. However, the term–term/gene–gene relationships are incorporated into the enrichment _P_-value calculation. The advantage of this strategy is that a term–term/gene–gene relationship might contain unique biological meaning that is not held by a single term or gene. Such network/modular analysis is closer to the nature of biological data structure.
Indication and limitation: Capable of analyzing any gene list, which could be selected from any high-throughput biological studies/technologies, like Class I. Emphasis on network relationships during analysis. ‘Orphan’ genes/terms (with few relationships to other genes/terms), which could sometimes be very interesting too, may be left out of the analysis.
Sub-types of algorithms (methods; example tools):
(i) Composite annotations (measure enrichment on joint terms; ADGO, GeneCodis, ProfCom, etc.)
(ii) DAG structure (measure enrichment by considering parent–child relationships; topGO, Ontologizer, POSOC, etc.)
(iii) Global annotation relationship (measure term–term global similarity with Kappa statistics, Czekanowski-Dice, Pearson's correlation; DAVID, GoToolBox, etc.)
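To make the contrast between Class I and the ranked-list sub-type of Class II concrete, the following is a minimal sketch of an unweighted Kolmogorov–Smirnov-like running sum over a ranked gene list. It is an assumption-laden toy, not the exact statistic of GSEA or of any other tool in Table 1: the function name `running_sum_score`, the step sizes and the gene identifiers are hypothetical, and significance testing (for example by permutation) is deliberately left out.

```python
def running_sum_score(ranked_genes, gene_set):
    """Largest deviation from zero of an unweighted running sum.

    Walk down the ranked list (most interesting gene first); step up by
    1/n_hit for each gene in the set and down by 1/n_miss otherwise.
    No pre-selection of 'interesting' genes is needed.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the signed extreme of the walk
    return best


# Hypothetical ranked list of ten genes, e.g. ordered by fold change.
ranked = [f"g{i}" for i in range(1, 11)]
top_score = running_sum_score(ranked, {"g1", "g2", "g3"})
bottom_score = running_sum_score(ranked, {"g8", "g9", "g10"})
print(top_score, bottom_score)
```

A set concentrated at the top of the ranking yields a positive score and a set concentrated at the bottom a negative one, whereas a Class I test would first need a cutoff to decide which genes enter the list at all.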
Tool category Description Indication and limitation Sub-type of algorithms Methods Example tool
Class I: singular enrichment analysis (SEA) Enrichment _P_-value is calculated on each term from the pre-selected interesting gene list. Then, enriched terms are listed in a simple linear text format. This strategy is the most traditional algorithm. It is still dominantly used by most of the enrichment analysis tools. Capable of analyzing any gene list, which could be selected from any high-throughput biological studies/technologies (e.g. Microarray, ChIP-on-CHIP, ChIP-on-sequence, SNP array, EXON array, large scale sequence, etc.). However, the deeper inter-relationships among the terms may not be fully captured in linear format report. Global reference background Local reference background Neural network Fisher's exact hypergeometric chi-square binomial Fisher's Exact hypergeometric chi-square binomial Bayesian GoStat, GoMiner, GOTM, BinGO, GOtoolBox, GFinder, etc. DAVID, Onto-Express, GARBAN, FatiGO, etc. BayGO
Class II: gene set enrichment analysis (GSEA) Entire genes (without pre-selection) and associated experimental values are considered in the enrichment analysis. The unique features of this strategy are: (i) No need to pre-select interesting genes, as opposed to Classes I and II; (ii) Experimental values integrated into _P_-value calculation. Suitable for pair-wide biological studies (e.g. disease versus control). Currently, may be difficult to be applied to the diverse data structures derived by a complex experimental design and some of the new technologies (e.g. SNP, EXON, Promoter arrays). Based on ranked gene list Based on continuous gene values Kolmogorov–Smirnov-like _t_-Test permutation _Z_-score GSEA, CapMap, etc. FatiScan, ADGO, ermineJ, PAGE, iGA, GO-Mapper, GOdist, FINA, T-profiler, MetaGP, etc.
Class III: modular enrichment analysis (MEA) This strategy inherits key spirit of SEA. However, the term–term/gene–gene relationships are considered into enrichment _P_-value calculation. The advantage of this strategy is that term–term/gene–gene relationship might contain unique biological meaning that is not held by a single term or gene. Such network/modular analysis is closer to the nature of biological data structure. Capable of analyzing any gene lists, which could be selected from any high-throughput biological studies/technologies, like Class I. Emphasis on network relationships during analysis. ‘Orphan’ gene/term (with little relationships to other genes/terms), that sometimes could be very interesting, too, may be left out from the analysis. Composite annotations DAG Structure Global annotation relationship Measure enrichment on joint terms Measure enrichment by considering parents-child relationships Measure term–term global similarity with Kappa Statistics Czekanowski-Dice Pearson's correlation ADGO, GeneCodis, ProfCom, etc. topGO, Ontologizer, POSOC, etc. DAVID, GoToolBox, etc.

Table 2.

Categorization of enrichment analysis tools

| Tool category | Description | Indication and limitation | Sub-type of algorithms | Methods | Example tools |
| --- | --- | --- | --- | --- | --- |
| Class I: singular enrichment analysis (SEA) | Enrichment _P_-value is calculated on each term from the pre-selected interesting gene list. Then, enriched terms are listed in a simple linear text format. This strategy is the most traditional algorithm. It is still dominantly used by most of the enrichment analysis tools. | Capable of analyzing any gene list, which could be selected from any high-throughput biological studies/technologies (e.g. Microarray, ChIP-on-CHIP, ChIP-on-sequence, SNP array, EXON array, large scale sequence, etc.). However, the deeper inter-relationships among the terms may not be fully captured in a linear format report. | Global reference background | Fisher's exact, hypergeometric, chi-square, binomial | GoStat, GoMiner, GOTM, BinGO, GOtoolBox, GFinder, etc. |
| | | | Local reference background | Fisher's exact, hypergeometric, chi-square, binomial | DAVID, Onto-Express, GARBAN, FatiGO, etc. |
| | | | Neural network | Bayesian | BayGO |
| Class II: gene set enrichment analysis (GSEA) | Entire genes (without pre-selection) and associated experimental values are considered in the enrichment analysis. The unique features of this strategy are: (i) No need to pre-select interesting genes, as opposed to Classes I and III; (ii) Experimental values integrated into _P_-value calculation. | Suitable for pair-wise biological studies (e.g. disease versus control). Currently, may be difficult to apply to the diverse data structures derived from a complex experimental design and some of the new technologies (e.g. SNP, EXON, Promoter arrays). | Based on ranked gene list | Kolmogorov–Smirnov-like | GSEA, CapMap, etc. |
| | | | Based on continuous gene values | _t_-Test, permutation, _Z_-score | FatiScan, ADGO, ermineJ, PAGE, iGA, GO-Mapper, GOdist, FINA, T-profiler, MetaGP, etc. |
| Class III: modular enrichment analysis (MEA) | This strategy inherits the key spirit of SEA. However, the term–term/gene–gene relationships are incorporated into the enrichment _P_-value calculation. The advantage of this strategy is that term–term/gene–gene relationships might contain unique biological meaning that is not held by a single term or gene. Such network/modular analysis is closer to the nature of biological data structure. | Capable of analyzing any gene list, which could be selected from any high-throughput biological studies/technologies, like Class I. Emphasis on network relationships during analysis. ‘Orphan’ genes/terms (with few relationships to other genes/terms), which sometimes could be very interesting too, may be left out from the analysis. | Composite annotations | Measure enrichment on joint terms | ADGO, GeneCodis, ProfCom, etc. |
| | | | DAG structure | Measure enrichment by considering parent–child relationships | topGO, Ontologizer, POSOC, etc. |
| | | | Global annotation relationship | Measure term–term global similarity with Kappa statistics, Czekanowski-Dice, Pearson's correlation | DAVID, GoToolBox, etc. |

Class 1: Singular enrichment analysis (SEA)

The most traditional strategy for enrichment analysis is to take the user's preselected ‘interesting’ genes (e.g. differentially expressed genes selected between experimental and control samples by _t_-test with a _P_-value ≤0.05 and fold change ≥1.5), and then iteratively test the enrichment of each annotation term one by one in a linear mode. Thereafter, the individual enriched annotation terms passing the enrichment _P_-value threshold are reported in a tabular format ordered by the enrichment probability (enrichment _P_-value). The enrichment _P_-value calculation, i.e. comparing the number of genes in the list that hit a given biology class with the number expected by pure random chance, can be performed with the aid of some common and well-known statistical methods (11,12,76), including Chi-square, Fisher's exact test, Binomial probability and Hypergeometric distribution (Table 1). More discussion regarding the enrichment _P_-value can be found in a later section of this paper.
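As a minimal sketch of the per-term calculation described above, the following computes a one-sided hypergeometric enrichment _P_-value (equivalent to a one-sided Fisher's exact test on the 2×2 table). The function name, argument names and the counts in the usage lines are hypothetical and chosen only for illustration; individual tools may differ in the test used and in how the background is defined.

```python
from math import comb


def hypergeom_enrichment_p(list_hits, list_size, pop_hits, pop_size):
    """Probability of seeing at least `list_hits` annotated genes when
    `list_size` genes are drawn without replacement from a background of
    `pop_size` genes, `pop_hits` of which carry the annotation term."""
    max_hits = min(list_size, pop_hits)
    total = comb(pop_size, list_size)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 30 of 300 list genes carry the term (10%),
# against 300 of 30,000 background genes (1%).
p_value = hypergeom_enrichment_p(30, 300, 300, 30000)
print(f"enrichment P-value: {p_value:.3g}")
```

In an SEA-style tool this calculation would simply be repeated for every annotation term, and the terms then sorted by the resulting _P_-values.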

Even though the strategy and output format of SEA are simple, SEA is indeed a very efficient way to extract the major biological meaning behind large gene lists, which may be generated from any type of high-throughput genomic studies or bioinformatics software packages. Most of the earlier tools (such as GoMiner, Onto-Express, DAVID and EASE) and a lot of the recently released tools (such as GOEAST and GFinder), adopted this strategy and demonstrated significant success in many genomic studies. However, the common weakness of tools in this class is that the linear output of terms can be very large and overwhelming (from hundreds to thousands). Therefore, the data analyst's focus and interrelationships of relevant terms can be diluted. For example, relevant GO terms like apoptosis, programmed cell death, induction of apoptosis, anti-apoptosis, regulation of apoptosis, etc., are spread out at different positions in a large linear output. It is difficult to focus on interrelationships of relevant biology terms among hundreds or thousands of other terms. In addition, the quality of pre-selected gene lists could largely impact the enrichment analysis, which makes SEA analysis unstable to a certain degree when using different statistical methods or cutoff thresholds.

Class 2: Gene set enrichment analysis (GSEA)

GSEA carries the core spirit of SEA, but uses a distinct algorithm to calculate enrichment _P_-values (35). The GSEA strategy has attracted great attention and expectation in the field. The unique idea of GSEA is its ‘no-cutoff’ strategy, which takes all genes from a microarray experiment without selecting significant genes (e.g. genes with _P_-value ≤0.05 and fold change ≥1.5). This strategy benefits the enrichment analysis in two respects: (i) it reduces the arbitrary factors in the typical gene selection step that could impact the traditional enrichment analysis; and (ii) it uses all information obtained from microarray experiments by allowing the minimally changing genes, which cannot pass the selection threshold, to contribute to the enrichment analysis in differing degrees. The maximum enrichment score (MES) is calculated from the rank order of all gene members in the annotation category. Thereafter, enrichment _P_-values can be obtained by matching the MES to randomly shuffled MES distributions (a Kolmogorov–Smirnov-like statistic) (35). Other enrichment tools in the GSEA class using the ‘no-cutoff’ strategy, such as ErmineJ (31), FatiScan (55), MEGO (36), PAGE (29), MetaGP, Go-Mapper (22) and ADGO (45), employ parametric or permutation-based statistical approaches such as _z_-score, _t_-test and permutation analysis. These approaches directly take experimental values (e.g. fold change) of all genes into the calculation for each annotation term. Collectively, recent GSEA tools that integrate the total experimental values into the functional data mining are an interesting trend with a lot of potential as a complement to traditional SEA (47,77–79).
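The rank-based score and its shuffled null distribution can be illustrated with a simplified sketch. It assumes an unweighted Kolmogorov–Smirnov-like running sum and shuffles the gene order to build the null distribution; the published GSEA method differs in its details (for example in weighting and in what is permuted), so the function names and the toy data below are illustrative assumptions only.

```python
import random


def max_enrichment_score(ranked_genes, gene_set):
    """Unweighted KS-like running sum over a ranked list of unique genes.
    Returns the signed deviation from zero with the largest magnitude."""
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    step_up, step_down = 1.0 / n_hit, 1.0 / n_miss
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += step_up if gene in members else -step_down
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical P-value: how often a shuffled ranking scores at least as
    extreme as the observed ranking (simplified gene-order permutation)."""
    rng = random.Random(seed)
    observed = max_enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(max_enrichment_score(shuffled, gene_set)) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


# Toy ranking of 20 hypothetical genes, ordered by decreasing fold change.
ranking = [f"g{i}" for i in range(20)]
score, p_value = permutation_p(ranking, {"g0", "g1", "g2", "g3"})
print(score, p_value)
```

Because every gene contributes a step to the running sum, no selection threshold is needed, which is the ‘no-cutoff’ property described above.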

However, tools in the GSEA class are also associated with some common limitations. First, the ‘no-cutoff’ strategy is the key advantage of GSEA, but is also becoming its major limitation in many biological studies. The GSEA method requires a summarized biological value (e.g. fold change) for each of the genome-wide genes as input. Sometimes, it is a difficult task to summarize many biological aspects of a gene into one meaningful value when the biological study and genomic platform are complex. For example, each gene derived from a SNP microarray could associate with a set of SNPs, which vary in size, _P_-values, physical distances, disease regions, LD (Linkage Disequilibrium) strength and SNP-gene locations (e.g. in exon, or in intron) from gene to gene. It is still a very experimental procedure to summarize such diverse aspects of biology into one comprehensive value. Similar challenges may be found in many of the emerging genomic platforms (e.g. SNP, Exon, Promoter microarray). The situations in these examples fully or partially fail to meet the input data structure that GSEA requires. As another example, many clinical microarray studies involve multiple factors/variants simultaneously, such as disease/normal, ages, sex, drug treatment/control, reagent batch effects, animal batch effect, etc. In such complex situations, sophisticated statistical methods, like ANOVA, time series analysis, survival analysis, etc., will be more powerful to handle multi-variances, multiple time points and batch effects, etc. simultaneously for data-mining interesting gene lists. In many similar cases, the upstream data processing and comprehensive gene selection statistics cannot be simply avoided or replaced by GSEA. Moreover, the genes ranked in higher positions (usually with higher differences, e.g. fold change) are the major force driving (highly weighted) the enrichment _P_-values in GSEA. Thus, the underlying assumption is that the genes with large regulations (e.g. fold changes) are contributing more to the biology. Obviously, this is not always true in real biology. Biologists know that small changes of some signal transduction genes can result in larger downstream biological consequences. In contrast, some big changes in metabolic genes may be just a consequence of other small, but important, signal regulation events. Depending on the questions that the researcher is asking, the mildly changed signal transduction genes may be more interesting/important than those largely regulated genes.

The GSEA and SEA methods have been available in the community for many years. Surprisingly, no comprehensive and systematic side-by-side comparisons are available yet. A recent study ran the same datasets with DAVID methods (a SEA/MEA method) versus ErmineJ (a GSEA method) (60). As expected, the results from both methods were highly consistent with each other. The consistency makes sense because the major driving force of the enrichment calculation in GSEA is the largely changing genes. In addition, those genes most likely have better chances to be selected in the traditional gene selection procedures, thus resulting in very similar results between the SEA and GSEA methods.

Class 3: Modular enrichment analysis (MEA)

MEA inherits the basic enrichment calculation found in SEA and incorporates extra network discovery algorithms by considering the term-to-term relationships. Recent tools, such as Ontologizer (69), topGO (41), GENECODIS (59), ADGO (45) and ProfCom (68), claimed to improve discovery sensitivity and specificity by considering inter-relationships of GO terms in the enrichment calculations, i.e. using genes of composite (joint) annotation terms as a reference background. The key advantage of this approach is that the researcher can take advantage of term–term relationships, in which joint terms may contain unique biological meaning for a given study, not held by individual terms. Moreover, when using heterogeneous annotation content, the annotation terms are highly redundant, and also have strong interrelationships regarding different aspects for the same biological process. Building such relationships is one step closer to the true nature of biology during data mining. GoToolBox (18) developed functions to cluster related GO terms or genes, which provides the gene functional annotation in a network context. However, the functions only work for a small scope and only for GO terms. DAVID (60,61) recently provided a new tool that is able to organize and condense a wide range of heterogeneous annotation content, such as GO terms, protein domains, pathways and so on, into term or gene classes. This organization is accomplished by using Kappa statistics to mine the complex biological co-occurrences found in multiple heterogeneous annotation content. Combined with traditional enrichment _P_-value calculations, the new approach allows the enrichment analysis to progress from term-centric or gene-centric to biological module-centric analysis. These methods take into account the redundant and networked nature of biological annotation content in order to concentrate on building the larger biological picture rather than focusing on an individual term or gene. 
Such data-mining logic seems closer to the nature of biology in that a biological process works in a network manner. However, the obvious limitation of MEA is that ‘orphan’ terms or genes (without strong relationships to neighbor terms/genes) could be left out from the analysis. Thus, it is important to examine those terms or genes that are left out during analysis when using MEA (60). In addition, the quality of the pre-selected gene list impacts the analytic results, just as it does in SEA analysis.
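The term–term relationship measured with Kappa statistics can be sketched as Cohen's kappa computed on the gene memberships of two annotation terms. This is a generic illustration under the assumption that each term is represented as a set of genes drawn from a common gene universe; it does not reproduce the clustering procedure of any particular tool, and all names and gene labels are hypothetical.

```python
def term_kappa(genes, term_a, term_b):
    """Cohen's kappa for the agreement between two annotation terms,
    each given as the set of genes it annotates, over a gene universe."""
    n = len(genes)
    both = sum(1 for g in genes if g in term_a and g in term_b)
    only_a = sum(1 for g in genes if g in term_a and g not in term_b)
    only_b = sum(1 for g in genes if g in term_b and g not in term_a)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    p_a = (both + only_a) / n
    p_b = (both + only_b) / n
    # Agreement expected by chance from the marginal frequencies.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; treated as 0 in this sketch
    return (observed - expected) / (1 - expected)


universe = [f"g{i}" for i in range(10)]
apoptosis = {"g0", "g1", "g2", "g3"}            # hypothetical memberships
programmed_cell_death = {"g0", "g1", "g2", "g4"}
print(term_kappa(universe, apoptosis, programmed_cell_death))
```

Terms whose pairwise kappa exceeds a chosen threshold could then be grouped into a module, whereas a term with low kappa to every other term is the kind of ‘orphan’ that may be left out.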

REMAINING QUESTIONS AND CHALLENGES IN THE FIELD

1. Realistically positioning the role of enrichment _P_-values in the current data-mining environment

The high-throughput enrichment data-mining environment is extremely complicated. Variations of the user gene list size, the deviation of the number of genes associated with each annotation, the gene overlap between annotations, the incompleteness of annotation content, the strong connectivity/dependency among genes, unbalanced distributions of annotation content, and high/low frequency of annotation content are examples of sources leading to this complexity and variation. None of the statistical methods mentioned in Table 1 is perfectly suitable for all situations. The complex situations found in the biological data-mining environment determine the discovery sensitivity and specificity (1 − false-positive rate) of those statistical methods, which are not yet in an optimal state, as discussed by Goeman et al. (73,80,81). Therefore, in real-life practice, many data analysts may treat the resulting enrichment _P_-values as a scoring system that plays an advisory role, i.e. ranking and suggesting possibly relevant annotation terms, as opposed to an absolute, decision-making role (82). The analysts themselves still play critical roles in making the final decisions about the most relevant enriched annotation terms highlighted by the enrichment analysis tool. Even though annotation terms may be associated with very significant enrichment _P_-values, it is not uncommon that analysts discard/ignore some of the enriched annotation terms (such as terms with enrichment _P_-values <0.001) because they are not ‘making sense’ to a given study, based on a priori biological knowledge. An analogous situation is a Google search, which returns some results that are not relevant to the user's original query. It is up to the user, based on his or her knowledge of the situation, to make the final judgment about the results.
Collectively, current enrichment analysis is more of an exploratory procedure, with the aid of enrichment _P_-value, rather than a pure statistical solution. The notion that the enriched terms should make sense based on a priori biological knowledge of the study is the most important guideline to help users in adjusting analytic thresholds and thereby answering questions such as, ‘Should my enrichment _P_-value cutoff be 0.05 or 0.01?’ or ‘Should I always consider the term with a significant enrichment _P_-value like 0.001?’ or ‘Which enrichment tool(s) could be more sensitive to my dataset?’

The most popular and traditional statistical methods used in the enrichment calculation are Fisher's exact test, Chi-square, Hypergeometric distribution and Binomial distribution, as collected in Table 1 and Supplementary Data 1. In principle, Binomial probability is considered good for analysis with a large population background. The Fisher's exact test, Chi-square test and the Hypergeometric distribution are better for analysis with a smaller population background (12) (see subsection #4 for more discussion about population background). Given the weakness of the typical statistical methods, some alternative mathematical approaches were recently proposed in an attempt to improve the enrichment _P_-value calculations. These approaches include (but are not limited to) mid-_P_-value by Rivals et al. (76), finite partially ordered set approach (POSET) by POSOC (83,84), hidden Kripke model (HKM) by GOAlie, greedy heuristics by ProfCom (68), Fisher's inverse chi-squared by GOFFA (50), master-target test/mutually exclusive target–target/intersecting target–target tests by GeneTools (42), EASE Score by EASE (8), Yule's Q by ProbCD (73), Fold Change by GoMiner (39) and Bayesian by BayGO (52). However, it is still too early to state definitively whether some of the improved alternative statistical methods really stand out over the traditional statistical approaches. Given the very complex data-mining environments discussed throughout the manuscript, all current statistical methods are working largely at the edge of their intended capability. Indeed, the specificity of enrichment analysis is more impacted by non-statistical layers than it is by statistical methods alone. In this sense, it is not realistic to guide users to choose enrichment tools simply according to the purely statistical advantages/disadvantages of their methods.
Thus, we do not extensively discuss the differences between statistical methods, since such a discussion could potentially mislead a user's judgment. It is in the user's best interests to try many statistical methods on the same dataset and to compare the results whenever possible. Obviously, the need for new, more robust statistical methods to overcome the limitations of the current methods is still in high demand by the field.
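The remark above that Binomial probability suits a large population background, while the Hypergeometric distribution suits a smaller one, can be checked numerically. The sketch below is only an illustration with hypothetical counts: it holds the list size, hit count and background proportion fixed and compares the two upper-tail probabilities for a small and a large background.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k): n genes drawn without replacement from N, K annotated."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)


def binom_tail(k, n, p):
    """P(X >= k): n independent draws, each annotated with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical case: 10 hits in a 50-gene list, 10% of the background annotated.
binomial = binom_tail(10, 50, 0.10)
small_background = hypergeom_tail(10, 50, 20, 200)
large_background = hypergeom_tail(10, 50, 2000, 20000)
print(binomial, small_background, large_background)
```

With the large background the two calculations are expected to nearly coincide, because sampling without replacement barely changes the proportion of annotated genes; with the small background the difference becomes noticeable.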

2. Understanding the limitation of multiple testing correction on enrichment _P_-values

According to standard statistical principles, the more annotations that are tested, the greater the family-wise false-positive rate (85,86). To control the family-wise false-positive rate in the result list, the review articles by Khatri et al. (11,12) indicate that multiple test correction of enrichment _P_-values must be performed on the functional annotation categories being tested at the same time. Indeed, the majority of the tools perform such corrections with methods such as Bonferroni, Benjamini–Hochberg, Holm, Q-value, Permutation, etc. (Supplementary Data 1). Given the extremely complicated gene functional data-mining environment discussed in the previous section, a critical question is how much of an improvement in discovery sensitivity and specificity (1 − false-positive rate) is achieved by applying such corrections in real-life practice.

Even though many enrichment tools implement such corrections, only a few tools systematically provide evidence regarding the improvements of discovery results with and without such corrections in real-life analytic environments, rather than believing the benefits based on the statistical principle alone. Recently, GOSSIP (27) comprehensively compared the discovery sensitivity and specificity across various correction techniques provided by various tools with real-life datasets. It was concluded that the common multiple testing correction techniques, known to be overly conservative approaches if there are thousands or even more annotation terms involved in the analysis, may not improve specificity as much as people had believed those techniques would. In fact, the sensitivity may actually be negatively affected because of the conservative nature of these corrections (27).
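To make the conservative nature of these corrections concrete, the following sketch implements two of the corrections named above, Bonferroni and Benjamini–Hochberg, on a list of raw enrichment _P_-values. The input values are hypothetical, and the functions are a plain illustration rather than the implementation used by any specific tool.

```python
def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values (step-up, monotone)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical enrichment P-values
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

With thousands of annotation terms, the Bonferroni factor alone multiplies every _P_-value by the number of terms tested, which illustrates why sensitivity can suffer.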

Given the complexity of biological data-mining environments, the enrichment _P_-values derived from the common statistical methods can be very fragile, and are influenced not only by the statistical methods themselves, but also greatly by the algorithms, data sources, the individual biological process itself and so on. The specificity of the discovery is indeed greatly impacted by the non-statistical layers, which cannot be simply fixed by multiple test corrections. Great efforts regarding sensitivity and specificity issues involved in the enrichment analysis may require that improvements are made on the fundamental, non-statistical layers first (Figure 1). Then, the power of various statistical approaches including the multiple test correction can be utilized fully in the enrichment analysis. More than a dozen of the enrichment tools, including recent ones such as EasyGO (66) and g:Profiler (64), as well as the earlier ones such as GoMiner (10), have not implemented multiple test corrections (Supplementary Data 1), but are still widely used by the community in real-life data-mining projects. In summary, the multiple test correction is only a partial solution, not a resolution of the specificity problem in current enrichment analysis platforms.

3. Cross-comparing enrichment analysis results derived from multiple gene lists

A larger gene list can have higher statistical power, resulting in a higher sensitivity (more significant _P_-values) to slightly enriched terms, as well as to more specific terms. On the other hand, the sensitivity is decreased toward largely enriched terms and broader terms. Thus, the size of the gene list impacts the absolute enrichment _P_-values, making it difficult to directly compare the absolute enrichment _P_-values across gene lists. Regardless of the challenges, cross-comparisons sometimes are necessary and important when studying the changes/trends among multiple time course datasets. Tools, such as GOBar (32), Go-Mapper (22), GOAlie, PageMan (51), high-throughput GoMiner (39), and the most recent, GOEAST (70), are intended to provide some of these capabilities to display multiple time course datasets simultaneously. However, users should keep the _P_-value comparison issue in mind when using these tools. The issue is even more critical, particularly when the sizes of gene lists are dramatically different from each other. More comprehensive and appropriate algorithms regarding the comparisons are still in high demand in the field.

4. Setting up the ‘right’ gene reference background

As noted in our previous example, 10% of the user's genes selected by a microarray experiment are kinases, as opposed to 1% of the genes in the human genome (this is the gene population background) that are kinases. The enrichment can therefore be quantitatively measured. A conclusion may be obtained for the particular example, that is, kinases are enriched in the user's study, and therefore play important roles in the study. However, 10% alone cannot lead to such a conclusion without comparison to the gene reference background (i.e. 1%). Thus, the different gene reference background settings may greatly impact the enrichment _P_-values, even when using the same statistical method and annotation content (12). For example, tools such as GOToolBox (18), GOstat (14), GoMiner (10), FatiGO (13) and GOTM (24), use the total genes in the genome as a global reference background. They tend to give more significant _P_-values, as compared to the tools (e.g. Onto-Express) using a narrowed-down set of genes (e.g. genes only existing on a microarray) as a gene reference background. In addition, DAVID (61) tends to be more conservative by using genes existing on the array and found to be associated with terms in the corresponding annotation categories, as the gene reference background. Many tools further allow users to upload a customized gene list as a gene reference background (Supplementary Data 1). Even though there is no ‘gold’ standard for the reference background, a general guideline is to set up the reference background as the pool of genes that could be selected for the studied annotation category (12). For example, the total genes found on a microarray chip seem to be the ‘right’ reference background, if the analysis gene list is derived from a microarray study conducted with the given chip. 
However, it is not perfect, since some genes on the chip could have little or no chance to be selected during the study, due to a low expression level that falls below the microarray detection range, and/or ‘bad’ probe design, etc. Even though the gene reference background directly impacts enrichment _P_-value, it will impact the _P_-values of all terms in a relatively similar manner within the same analysis. For the same dataset, analyzed with different gene reference backgrounds, the output rank/order of the enrichment terms will remain relatively the same, even though the terms may be associated with different _P_-values. Such stable order/rank of enrichment terms in the output is more important than their absolute _P_-values so that the annotation exploration and conclusion on the same dataset will be similar and comparable when using different gene reference backgrounds. In this sense, another important principle of setting a gene reference background is to use a consistent gene reference background within the same analysis.
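The effect of the reference background on absolute _P_-values and on term order can be illustrated with a small sketch. All counts below are hypothetical: the same gene list is scored once against a genome-wide background and once against a narrower array background using a hypergeometric upper tail, and the resulting term order is compared.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k): n genes drawn without replacement from N, K annotated."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)


LIST_SIZE = 200
GENOME_SIZE, ARRAY_SIZE = 30000, 10000  # hypothetical background sizes

# term -> (hits in list, annotated genes in genome, annotated genes on array)
terms = {
    "kinase": (20, 300, 150),
    "apoptosis": (10, 600, 250),
    "transport": (8, 1500, 600),
}

genome_p = {t: hypergeom_tail(h, LIST_SIZE, g, GENOME_SIZE)
            for t, (h, g, a) in terms.items()}
array_p = {t: hypergeom_tail(h, LIST_SIZE, a, ARRAY_SIZE)
           for t, (h, g, a) in terms.items()}

for name in sorted(terms, key=genome_p.get):
    print(f"{name}: genome {genome_p[name]:.3g}, array {array_p[name]:.3g}")
```

In this toy setting the absolute _P_-values shift between backgrounds while the order of the terms is preserved, which mirrors the point made above about using one background consistently within an analysis.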

5. Extending backend annotation databases

Due to its enriched content and suitable data structure for high-throughput data mining, GO (1) is the only backend data source used in most, if not all, of the earlier enrichment tools, as well as in some of the more recent tools (Supplementary Data 1). However, many different biological aspects are being maintained and annotated by different independent resources; these aspects have not only a significant amount of overlapping information, but also a significant amount of unique data, due to the differing focus of the specialized groups. No one, single source is able to maintain all of the biological aspects, such as GO for the biological process, molecular functions or cellular components; Pfam for protein domains; BIND for protein–protein interactions; KEGG for pathways; TRANSFAC for gene regulations; GNF for gene–tissue expressions; OMIM for gene–disease associations; and so on (65,87,88). In this sense, a comprehensive backend database integrated with diverse and heterogeneous data sources will allow the enrichment tools to more comprehensively mine the large gene lists on broad-based annotation content covering different biological aspects, rather than on GO content alone. Obviously, the improvement of the annotation database alone can significantly improve the comprehensiveness of the data mining. Otherwise, the power of advanced data-mining algorithms and statistics cannot be fully utilized in the enrichment analysis.

Many tools are still using GO as the only backend database in the enrichment analysis (Supplementary Data 1). However, some recent tools or new releases of early-generation tools, such as Onto-Express (62), DAVID (61), WebGestalt (40), FatiGO+ (56), FACT (30), g:Profiler (64), GAzer (63) and GeneTrail (57), extended their backend bio-databases by integrating wide-range heterogeneous data content (e.g. GO, KEGG pathways, protein domains, disease association, tissue expression, etc.) in order to increase the comprehensiveness of the enrichment analytic results. The WebGestalt, DAVID and Onto-Express groups independently reported their efforts in detail, with the resulting collections including GeneKeyDB, the DAVID Knowledgebase and OT, respectively (65,87,88). Each group described the steps involved in integrating and constructing such large bio-databases, particularly for the purposes of high-throughput gene functional analysis. Moreover, the databases of L2L (34) and DAVID (61) include gene expression data from publicly available SAGE, EST and microarray studies. Thus, the user's dataset may be aligned with these data under similar conditions during functional analysis. Regarding species coverage, although the backend databases of several of the enrichment tools may cover a wide range of species, the support for a less popular species (e.g. rice) may not be as robust as that for more popular species (e.g. human, mouse, rat, yeast, fly). Given this situation, several enrichment tools were specifically designed for these less popular species, such as WEGO for rice (54); easyGO for crops (66); FINA for prokaryotes (58); CLENCH for Arabidopsis (21); JProGo for prokaryotes (48); and BayGO for Xylella fastidiosa (52). Collectively, the quality, integration, and coverage of databases designed for high-throughput gene functional analysis have recently made notable progress, compared to that in earlier works.
While the database improvement is an endless task, the current improvements have already significantly benefited individual groups and tools, as well as provided better backend bio-sources to the field for future tool development (65,87,88). The tools that still use GO as their only backend database should consider the integration of a wider collection of bio-databases in order to reflect the need and progress of the field.

6. Efficiently mapping users’ input gene identifiers to the available annotation

If the gene identifier (ID) cannot be efficiently mapped to its corresponding annotation content, the subsequent data mining will be largely impaired. Thus, the comprehensiveness of mapping ID-to-ID and ID-to-annotation content in the database is essential as the first step to maximally translate gene lists into possible annotation content for further high-throughput enrichment analysis algorithms (12). However, this is not a simple or trivial issue when the identifiers representing genes/proteins are highly redundant and are maintained by independent bioinformatics organizations. Even though the identifier cross-mapping issues were effectively addressed within each major bioinformatics organization, such as NCBI Entrez Gene (89), UniProt UniRef (90) and PIR-NREF (91), the weaker referencing capability across organizations still exists. For example, UniProt does not cover RefSeq IDs and NCBI Entrez Gene does not reference PIR IDs at all. Because different annotation databases each adopt one system as their major gene identifier system, e.g. GeneRIF adopts NCBI IDs as major associated identifiers and InterPro uses UniProt/SwissProt as major associated identifiers (65), some annotation content does not favor certain types of user input IDs. Thus, for a given type of ID, without special attention to this issue, important annotation content could be easily left out of the high-throughput analysis without the user's awareness, resulting in an incomplete or even failed enrichment analysis. Unfortunately, the enrichment tools, in general, have poorly documented how they handle the ID-to-ID and ID-to-annotation mapping issues. Most of the tools have likely adopted the existing work of another major group such as the NCBI Entrez Gene database (89). In such a case, although a tool may claim to support many ID systems, it does not mean that all types of IDs are fully integrated into the backend annotation database, due to the cross-organization issues discussed earlier.
Some recent efforts, such as Onto-Translate (62), MatchMiner (92), IDConverter (93) and DAVID ID Converter (61), have made large improvements in an effort to help the ID-to-ID and ID-to-annotation mapping issue. With these aforementioned works, users may easily translate one type of ID to another. Moreover, they not only provide the improved cross-referencing capability but also enrich annotation content. For example, after gene IDs were re-agglomerated by a procedure called the DAVID Gene Concept, 10–20% more GO terms were able to be assigned to corresponding genes in the DAVID Knowledgebase, as compared to annotations in each individual source (65).
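
The silent loss of annotation described above can be made visible with a simple coverage check before any enrichment is run. The following is only a minimal sketch in Python, not the procedure of any particular tool: the cross-reference table, the annotation table, the identifiers (`R1`, `G1`, etc.) and the function names `build_index` and `mapping_coverage` are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: none of these identifiers are real accessions.
ID_PAIRS = [("R1", "G1"), ("R2", "G2"), ("R2", "G3")]    # input ID -> gene ID
ANNOTATIONS = {"G1": {"term_a", "term_b"}, "G3": set()}  # gene ID -> terms


def build_index(pairs):
    """Index a cross-reference table as source ID -> set of target gene IDs."""
    index = defaultdict(set)
    for source_id, gene_id in pairs:
        index[source_id].add(gene_id)
    return index


def mapping_coverage(input_ids, id_index, annotations):
    """Split input IDs into annotated, mapped-but-unannotated, and unmapped."""
    annotated, unannotated, unmapped = [], [], []
    for input_id in input_ids:
        gene_ids = id_index.get(input_id, set())
        if not gene_ids:
            # The ID system of this input is not covered by the mapping table.
            unmapped.append(input_id)
        elif any(annotations.get(gene_id) for gene_id in gene_ids):
            annotated.append(input_id)
        else:
            # The ID maps to a gene, but no annotation content is attached.
            unannotated.append(input_id)
    return annotated, unannotated, unmapped


if __name__ == "__main__":
    user_list = ["R1", "R2", "R4"]
    groups = mapping_coverage(user_list, build_index(ID_PAIRS), ANNOTATIONS)
    for label, ids in zip(("annotated", "unannotated", "unmapped"), groups):
        print(f"{label}: {len(ids)}/{len(user_list)} {ids}")
```

Reporting the unannotated and unmapped groups separately distinguishes a weak ID-to-ID cross-reference from a gap in the ID-to-annotation content, which are the two failure modes discussed in this section.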

7. Enhancing the exploratory capability and graphical presentation

Due to the limitations of current enrichment analysis, the analysis of large gene lists is, in the authors’ opinion, still more of an exploratory procedure than a single statistical solution. Data analysts still play the most important role in interpreting the analytic results and in collecting information from different views to decide which enriched annotation categories/biology are most relevant to the study in question. Such decisions are usually made with the aid of the enrichment _P_-values derived from the enrichment analysis, prior knowledge of the biology expected in the experiments and, more importantly, the various data collected through exploration of the genes and annotation categories.

Flexibility in allowing users to define the analytic scope, e.g. GO levels, can make the analysis more focused on a user's interests. Many tools, such as GoMiner (10), Onto-Express (62), DAVID (61) and FatiGO (56), support this type of flexibility. In addition, many tools provide comprehensive links to primary annotation resources for annotation categories or gene reports, allowing users to quickly and efficiently gather relevant information about items of interest. The GO annotation terms are organized as a Directed Acyclic Graph (DAG) (1). Even though all tools adopt GO in their enrichment analysis, most tools break down the structured nodes into flat terms during the calculation of enrichment _P_-values, and thereafter list the results in an easily readable tabular format. This simplified linear format, which organizes the data for easy interpretation, is used by most of the enrichment tools. Moreover, a number of tools, such as Onto-Express (62), EasyGO (66), GoMiner (10), eGOn (42), GoSurfer (25), GOFFA (50) and GeneTrail (57), are able to display the enrichment analysis results on the DAG or a tree structure so that users may easily explore the enrichment results in neighboring nodes. Onto-Express further provides recalculation functions for ‘drill down’ analysis of a particular branch of the DAG. In contrast, POSOC (83) makes the important point that the DAG, as a structure, holds GO orientations but lacks power for biological inference, since many functionally related terms may be maintained in different DAG branches. Thus, more and more recent tools, such as Onto-Express (62), DAVID (61), POSOC (83), BayGO (52), FatiGO+ (56), MAPPFinder (7), FuncCluster (43) and FunNet, have started to integrate BioCarta, KEGG or other pathway visualizations in order to examine the user's genes more efficiently in a network context.
In addition, some high-throughput pathway visualization tools, such as PathMAPA, Pathway Miner, Pathway Processor, ArrayXPath, Pathway Express, PathwayExplorer, KOBAS and VAMPIRE, are very useful, but are not included in this review because they focus on pathway analysis alone. Interestingly, biological modules/classes of annotation terms, provided by PaLS (67), DAVID (61) and GoToolBox (18), present heterogeneous annotation terms or genes in a group scope. This focuses the analysis on the larger biological picture and reduces the effort involved in mining too many individual and redundant terms or genes. In addition, DAVID provides a simple 2D view visualization (61) that is able to efficiently display the related and heterogeneous many-genes-to-many-terms relationships, identified by the DAVID classification functions (60), on one well-organized page. Using such visualizations, users can efficiently examine the inter-relationships of highly related heterogeneous annotations and genes to pinpoint important commonalities and differences.
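
To illustrate what breaking the DAG into flat terms can look like in practice, the sketch below first propagates each gene's annotations to all ancestor terms and then scores every term independently, producing a flat ranked table. It is a hedged illustration in Python rather than the implementation of any tool named above: the one-sided hypergeometric test is assumed here as one common choice of enrichment statistic, and the `parents` mapping, the gene and term names and all function names are hypothetical.

```python
from math import comb


def ancestors(term, parents):
    """Return all ancestor terms of `term` in a DAG given as term -> parents."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(gene_terms, parents):
    """Annotate each gene with its direct terms plus all of their ancestors."""
    full = {}
    for gene, terms in gene_terms.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def flat_enrichment(gene_list, background, gene_terms):
    """Score every term on its own and return rows sorted by P-value."""
    background = set(background)
    study = set(gene_list) & background
    term_genes = {}
    for gene in background:
        for term in gene_terms.get(gene, ()):
            term_genes.setdefault(term, set()).add(gene)
    rows = []
    for term, genes in term_genes.items():
        k = len(genes & study)
        if k:  # skip terms with no hit in the user's list
            p = hypergeom_upper_tail(k, len(study), len(genes), len(background))
            rows.append((term, k, len(genes), p))
    return sorted(rows, key=lambda row: row[3])


if __name__ == "__main__":
    # Hypothetical three-level DAG and direct annotations.
    parents = {"child": ["mid"], "mid": ["root"], "other": ["root"]}
    direct = {"g1": {"child"}, "g2": {"child"}, "g3": {"other"}, "g4": {"other"}}
    table = flat_enrichment(["g1", "g2"], direct.keys(), propagate(direct, parents))
    for term, k, K, p in table:
        print(f"{term}\t{k}/{K}\tP={p:.3f}")
```

In this toy run the terms `child` and `mid` receive identical scores because they cover exactly the same genes, a small instance of the redundancy among related terms that the flat tabular format cannot express and that DAG displays and term grouping aim to reduce.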

8. Evaluating the analytic capability of new enrichment tools

Sixty-eight enrichment tools, and potentially more that are missing from this collection, have already made the field very crowded. Many of the tool publications present minimal cross-comparisons to other tools. An appropriate standard evaluation procedure would make the analytic capability more comparable among tools, particularly for new tools. In addition, a good standard could make some new tools really stand out, as well as prevent redundant work from appearing in publications. Such standards should include, but not be limited to: a set of common datasets (gene lists) with expected and known biology at different levels of difficulty for analysis; important aspects (e.g. backend database, enrichment _P_-values, speed, exploratory capability, graphic presentation, etc.) for cross-comparisons; emphasis on differences and advantages over other competing methods; etc. There is no detailed proposal as yet, but a standard is clearly needed in the field.
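
One ingredient of such a standard could be a shared score computed on the common benchmark gene lists. The sketch below, in Python, assumes a benchmark in which the expected terms for a gene list are known in advance and each tool returns its terms ranked by significance; it then reports the fraction of expected terms recovered in the top k results. The metric (recall at k), the tool names and the term labels are illustrative assumptions, not part of any published proposal.

```python
def recall_at_k(ranked_terms, expected_terms, k):
    """Fraction of the expected terms found among a tool's top-k ranked terms."""
    expected = set(expected_terms)
    if not expected:
        raise ValueError("the benchmark must list at least one expected term")
    return len(set(ranked_terms[:k]) & expected) / len(expected)


def compare_tools(tool_outputs, expected_terms, k=10):
    """Score every tool on the same benchmark list with the same cutoff."""
    return {
        tool: recall_at_k(ranked, expected_terms, k)
        for tool, ranked in tool_outputs.items()
    }


if __name__ == "__main__":
    # Hypothetical ranked outputs of two tools for one benchmark gene list.
    outputs = {
        "tool_a": ["t1", "t2", "t3"],
        "tool_b": ["t3", "t1", "t9"],
    }
    for tool, score in compare_tools(outputs, {"t1", "t3"}, k=2).items():
        print(f"{tool}: recall@2 = {score:.2f}")
```

A single number of this kind covers only the enrichment ranking; the other aspects listed above, such as backend database coverage, speed and exploratory capability, would need their own measures.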

9. Choosing the most appropriate enrichment tools from the various choices

Choosing the most suitable enrichment tool or tools largely depends on the users’ research needs, IT experience and the questions being asked. A precise guideline is most likely not possible, since research goals are very diverse from project to project. Before choosing a tool, a user may ask questions such as, ‘Is the GO data source enough or are more (such as pathway, protein domain, protein–protein interactions, etc.) needed?’; ‘Is the SEA linear enrichment report enough or do I really need MEA to look into inter-relationships?’; ‘Is my experimental design simple enough to fit into the GSEA input requirement or is a comprehensive statistical method necessary for gene selection?’; ‘What is my IT capability to handle R, standalone tools, or web tools?’; etc. Thereafter, tools that maximally meet the user's requirements can be logically selected. Table 2 compares the strengths and limitations of each tool class. Instead of looking up individual tools among the overwhelming choices, it is recommended that researchers first locate the desired tool class (i.e. SEA, GSEA or MEA), then narrow down to individual tools within that class. Supplementary Data 1 lists, for every tool, some of the aspects that users may be interested in. In addition, a protocol paper regarding enrichment analysis by Huang et al. (82) could be useful for beginning users. SerbGO is a good site to search and compare detailed features and annotation coverage among tools. It is not recommended that researchers choose tools simply according to the underlying enrichment statistical methods. As discussed in previous sections, most statistical methods in current enrichment tools operate under large uncertainties.

Moreover, successful analyses in high-quality publications could serve as important examples to guide end users in the choice of ‘well-used’ tools and in following analytic procedures for similar situations. Importantly, it is not unusual for different tools with similar capabilities and functions to output very different results, due to variations in how the various important aspects are implemented. Thus, it is recommended that the user test multiple tools, even ones that offer similar analytic capability, in order to obtain the most satisfactory results (75).
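
When several tools are run on the same gene list, the degree of disagreement can be quantified rather than judged by eye. The following Python sketch compares the sets of significant terms reported by each tool using the Jaccard index (size of the intersection divided by size of the union). The choice of the Jaccard index, the tool names and the term labels are assumptions made for illustration, and the terms are assumed to have been exported from each tool with comparable identifiers.

```python
from itertools import combinations


def jaccard(terms_a, terms_b):
    """Jaccard index of two term collections; two empty sets count as identical."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_overlap(results):
    """Jaccard index for every pair of tools, keyed by the sorted tool names."""
    return {
        (x, y): jaccard(results[x], results[y])
        for x, y in combinations(sorted(results), 2)
    }


if __name__ == "__main__":
    # Hypothetical significant terms reported by three tools for one gene list.
    significant = {
        "tool_a": {"t1", "t2"},
        "tool_b": {"t2", "t3"},
        "tool_c": set(),
    }
    for (x, y), score in pairwise_overlap(significant).items():
        print(f"{x} vs {y}: {score:.2f}")
```

A low overlap does not by itself indicate which tool is wrong; it mainly flags gene lists for which the choice of tool matters and the results deserve closer manual inspection.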

CONCLUSIONS AND PERSPECTIVES

Due to the complexity of biological data-mining situations, the analysis of large gene lists with the current enrichment tools is still more of an exploratory data-mining procedure than a pure statistical solution. The best analytic conclusions are made with the aid of the investigator's biological knowledge, integrated annotation databases, computing algorithms and the enrichment _P_-values derived from statistical methods.

A large, linear list of enriched annotation terms in output reports may not satisfy researchers as much as it did years ago. The next generation of enrichment tools will strive for an integrative and comprehensive data-mining environment that will not only provide a more efficient means to identify the individual enriched annotations with improved databases, algorithms and statistical methods, but also comprehensively address the internal relationships of many enriched heterogeneous annotations. Tools with such capabilities could make the analysis more focused and understandable in a network context. Many of the most recently reported tools fall into the class II and III categories, which suggests such a trend in the field (Table 1 and Supplementary Data 1).

Finally, it can be expected that the activity and enthusiasm in developing new enrichment tools will continue, due to the unmet needs and limitations of current enrichment analytic methods. A standard for evaluating new tools will facilitate the growth of the field.

FUNDING

National Institute of Allergy and Infectious Diseases; National Institutes of Health (NO1-CO-56000). Funding for open access charge: same source as above.

Conflict of interest statement. The annotation of this tool and publication do not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the United States Government.

ACKNOWLEDGEMENTS

Thanks go to Dr Xin Zheng and Ms Jun Yang in the Laboratory of Immunopathogenesis and Bioinformatics (LIB) group for biological and bioinformatics discussion. We also thank Bill Wilton and Mike Tartakovsky for information technology and network support.

REFERENCES

1. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nat. Genet., 2000, vol. 25 (pg. 25-29)

2. Profiling gene expression using onto-express, Genomics, 2002, vol. 79 (pg. 266-270)

3. FunSpec: a web-based cluster interpreter for yeast, BMC Bioinformatics, 2002, vol. 3 pg. 35

4. Characterizing gene sets with FuncAssociate, Bioinformatics, 2003, vol. 19 (pg. 2502-2504)

5. GeneMerge—post-genomic analysis, data mining, and hypothesis testing, Bioinformatics, 2003, vol. 19 (pg. 891-892)

6. DAVID: Database for Annotation, Visualization, and Integrated Discovery, Genome Biol., 2003, vol. 4 pg. P3

7. MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data, Genome Biol., 2003, vol. 4 pg. R7

8. Identifying biological themes within lists of genes with EASE, Genome Biol., 2003, vol. 4 pg. R70

9. GARBAN: genomic analysis and rapid biological annotation of cDNA microarray and proteomic data, Bioinformatics, 2003, vol. 19 (pg. 2158-2160)

10. et al. GoMiner: a resource for biological interpretation of genomic and proteomic data, Genome Biol., 2003, vol. 4 pg. R28

11. Pathways to the analysis of microarray data, Trends Biotechnol., 2005, vol. 23 (pg. 429-435)

12. Ontological analysis of gene expression data: current tools, limitations, and open problems, Bioinformatics, 2005, vol. 21 (pg. 3587-3595)

13. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes, Bioinformatics, 2004, vol. 20 (pg. 578-580)

14. GOstat: find statistically overrepresented Gene Ontologies within a group of genes, Bioinformatics, 2004, vol. 20 (pg. 1464-1465)

15. GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes, Bioinformatics, 2004, vol. 20 (pg. 3710-3715)

16. Iterative Group Analysis (iGA): a simple tool to enhance sensitivity and facilitate interpretation of microarray experiments, BMC Bioinformatics, 2004, vol. 5 pg. 34

17. Comparing functional annotation analyses with Catmap, BMC Bioinformatics, 2004, vol. 5 pg. 193

18. GOToolBox: functional analysis of gene datasets based on Gene Ontology, Genome Biol., 2004, vol. 5 pg. R101

19. GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis, and mining, Nucleic Acids Res., 2004, vol. 32 (pg. W293-300)

20. THEA: ontology-driven analysis of microarray data, Bioinformatics, 2004, vol. 20 (pg. 2636-2643)

21. CLENCH: a program for calculating Cluster ENriCHment using the Gene Ontology, Bioinformatics, 2004, vol. 20 (pg. 1196-1197)

22. GO-Mapper: functional analysis of gene expression data using the expression level as a score to evaluate Gene Ontology terms, Bioinformatics, 2004, vol. 20 (pg. 2618-2625)

23. GOAL: automated Gene Ontology analysis of expression profiles, Nucleic Acids Res., 2004, vol. 32 (pg. W492-499)

24. GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies, BMC Bioinformatics, 2004, vol. 5 pg. 16

25. GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space, Appl. Bioinformatics, 2004, vol. 3 (pg. 261-264)

26. BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments, Nucleic Acids Res., 2005, vol. 33 (pg. W460-464)

27. Biological profiling of gene groups utilizing Gene Ontology, Genome Inform., 2005, vol. 16 (pg. 106-115)

28. T-profiler: scoring the activity of predefined groups of genes using gene expression data, Nucleic Acids Res., 2005, vol. 33 (pg. W592-595)

29. PAGE: parametric analysis of gene set enrichment, BMC Bioinformatics, 2005, vol. 6 pg. 144

30. FACT–a framework for the functional interpretation of high-throughput experiments, BMC Bioinformatics, 2005, vol. 6 pg. 161

31. ErmineJ: tool for functional analysis of gene expression data sets, BMC Bioinformatics, 2005, vol. 6 pg. 269

32. GObar: a gene ontology based analysis and visualization tool for gene sets, BMC Bioinformatics, 2005, vol. 6 pg. 189

33. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks, Bioinformatics, 2005, vol. 21 (pg. 3448-3449)

34. L2L: a simple tool for discovering the hidden significance in microarray expression data, Genome Biol., 2005, vol. 6 pg. R81

35. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles, Proc. Natl Acad. Sci. USA, 2005, vol. 102 (pg. 15545-15550)

36. MEGO: gene functional module expression based on gene ontology, Biotechniques, 2005, vol. 38 (pg. 277-283)

37. goCluster integrates statistical analysis and functional interpretation of microarray expression data, Bioinformatics, 2005, vol. 21 (pg. 3575-3577)

38. OntologyTraverser: an R package for GO analysis, Bioinformatics, 2005, vol. 21 (pg. 275-276)

39. et al. High-throughput GoMiner, an ‘industrial-strength’ integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID), BMC Bioinformatics, 2005, vol. 6 pg. 168

40. WebGestalt: an integrated system for exploring gene sets in various biological contexts, Nucleic Acids Res., 2005, vol. 33 (pg. W741-748)

41. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure, Bioinformatics, 2006, vol. 22 (pg. 1600-1607)

42. GeneTools—application for functional annotation and statistical hypothesis testing, BMC Bioinformatics, 2006, vol. 7 pg. 470

43. Clustering biological annotations and gene expression data to identify putatively co-regulated biological processes, J. Bioinform. Comput. Biol., 2006, vol. 4 (pg. 833-852)

44. Grouping Gene Ontology terms to improve the assessment of gene set enrichment in microarray data, BMC Bioinformatics, 2006, vol. 7 pg. 426

45. ADGO: analysis of differentially expressed gene sets using composite GO annotation, Bioinformatics, 2006, vol. 22 (pg. 2249-2253)

46. Gene class expression: analysis tool of Gene Ontology terms with gene expression data, Genet. Mol. Res., 2006, vol. 5 (pg. 108-114)

47. Circumventing the cut-off for enrichment analysis, Brief Bioinform., 2006, vol. 7 (pg. 202-203)

48. et al. JProGO: a novel tool for the functional interpretation of prokaryotic microarray data using Gene Ontology information, Nucleic Acids Res., 2006, vol. 34 (pg. W510-515)

49. GOLEM: an interactive graph-based gene-ontology navigation and analysis tool, BMC Bioinformatics, 2006, vol. 7 pg. 443

50. GOFFA: Gene Ontology For Functional Analysis – A FDA Gene Ontology tool for analysis of genomic and proteomic data, BMC Bioinformatics, 2006, vol. 7 pg. S23

51. et al. PageMan: an interactive ontology tool to generate, display, and annotate overview graphs for profiling experiments, BMC Bioinformatics, 2006, vol. 7 pg. 535

52. BayGO: Bayesian analysis of ontology term enrichment in microarray data, BMC Bioinformatics, 2006, vol. 7 pg. 86

53. A categorization approach to automated ontological function annotation, Protein Sci., 2006, vol. 15 (pg. 1544-1549)

54. et al. WEGO: a web tool for plotting GO annotations, Nucleic Acids Res., 2006, vol. 34 (pg. W293-297)

55. From genes to functional classes in the study of biological systems, BMC Bioinformatics, 2007, vol. 8 pg. 114

56. FatiGO+: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments, Nucleic Acids Res., 2007, vol. 35 (pg. W91-96)

57. GeneTrail—advanced gene set enrichment analysis, Nucleic Acids Res., 2007, vol. 35 (pg. W186-192)

58. FIVA: Functional Information Viewer and Analyzer extracting biological knowledge from transcriptome data of prokaryotes, Bioinformatics, 2007, vol. 23 (pg. 1161-1163)

59. GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists, Genome Biol., 2007, vol. 8 pg. R3

60. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists, Genome Biol., 2007, vol. 8 pg. R183

61. et al. DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists, Nucleic Acids Res., 2007, vol. 35 (pg. W169-W175)

62. Onto-Tools: new additions and improvements in 2006, Nucleic Acids Res., 2007, vol. 35 (pg. W206-W211)

63. GAzer: gene set analyzer, Bioinformatics, 2007, vol. 23 (pg. 1697-1699)

64. g:Profiler—a web-based toolset for functional profiling of gene lists from large-scale experiments, Nucleic Acids Res., 2007, vol. 35 (pg. W193-200)

65. DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis, BMC Bioinformatics, 2007, vol. 8 pg. 426

66. EasyGO: Gene Ontology-based annotation and functional enrichment analysis tool for agronomical species, BMC Genomics, 2007, vol. 8 pg. 246

67. PaLS: filtering common literature, biological terms and pathway information, Nucleic Acids Res., 2008, vol. 36 (pg. W364-W367)

68. ProfCom: a web tool for profiling the complex functionality of gene groups identified from high-throughput data, Nucleic Acids Res., 2008, vol. 36 (pg. W347-W351)

69. Ontologizer 2.0 - A multifunctional tool for GO term enrichment analysis and data exploration, Bioinformatics, 2008, vol. 24 (pg. 1650-1651)

70. GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis, Nucleic Acids Res., 2008, vol. 36 (pg. W358-W363)

71. GOSim—an R-package for computation of information theoretic GO similarities between terms and gene products, BMC Bioinformatics, 2007, vol. 8 pg. 166

72. GO-2D: identifying 2-dimensional cellular-localized functional modules in Gene Ontology, BMC Genomics, 2007, vol. 8 pg. 30

73. ProbCD: enrichment analysis accounting for categorization uncertainty, BMC Bioinformatics, 2007, vol. 8 pg. 383

74. SerbGO: searching for the best GO tool, Nucleic Acids Res., 2008, vol. 36 (pg. W368-371)

75. Use and misuse of the gene ontology annotations, Nat. Rev. Genet., 2008, vol. 9 (pg. 509-515)

76. Enrichment or depletion of a GO category within a class of genes: which test?, Bioinformatics, 2007, vol. 23 (pg. 401-407)

77. Threshold-free high-power methods for the ontological analysis of genome-wide gene-expression studies, Genome Biol., 2007, vol. 8 pg. R74

78. et al. Gaining confidence in biological interpretation of the microarray data: the functional consistence of the significant GO categories, Bioinformatics, 2008, vol. 24 (pg. 265-271)

79. Extensions to gene set enrichment, Bioinformatics, 2007, vol. 23 (pg. 306-313)

80. Analyzing gene expression data in terms of gene sets: methodological issues, Bioinformatics, 2007, vol. 23 (pg. 980-987)

81. Enrichment analysis in high-throughput genomics - accounting for dependency in the NULL, Brief Bioinform., 2007, vol. 8 (pg. 71-77)

82. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources, Nat. Protoc., 2008, doi: 10.1038/nprot.2008.211

83. The gene ontology categorizer, Bioinformatics, 2004, vol. 20 (pg. i169-177)

84. How to decide which are the most pertinent overly-represented features during gene set enrichment analysis, BMC Bioinformatics, 2007, vol. 8 pg. 332

85. Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. R. Stat. Soc. B, 1995, vol. 57 (pg. 289-300)

86. Multiple hypothesis testing in microarray experiments, Stat. Sci., 2003, vol. 18 (pg. 71-103)

87. Babel's tower revisited: a universal resource for cross-referencing across annotation databases, Bioinformatics, 2006, vol. 22 (pg. 2934-2939)

88. GeneKeyDB: a lightweight, gene-centric, relational database to support data mining environments, BMC Bioinformatics, 2005, vol. 6 pg. 72

89. Entrez Gene: gene-centered information at NCBI, Nucleic Acids Res., 2007, vol. 35 (pg. D26-D31)

90. The UniProt Consortium. The universal protein resource (UniProt), Nucleic Acids Res., 2008, vol. 36 (pg. D190-D195)

91. et al. The protein information resource, Nucleic Acids Res., 2003, vol. 31 (pg. 345-347)

92. MatchMiner: a tool for batch navigation among gene and gene product identifiers, Genome Biol., 2003, vol. 4 pg. R27

93. IDconverter and IDClight: conversion and annotation of gene and protein IDs, BMC Bioinformatics, 2007, vol. 8 pg. 9

© 2008 The Author(s)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
