Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists

Journal Article

Laboratory of Immunopathogenesis and Bioinformatics, Clinical Services Program, SAIC-Frederick, Inc., National Cancer Institute at Frederick, Frederick, MD 21702, USA

*To whom correspondence should be addressed. Tel: +1 301 846 5093; Fax: +1 301 846 6762; Email: rlempicki@mail.nih.gov

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Received: 10 September 2008

Revision received: 24 October 2008

Accepted: 03 November 2008

Published: 25 November 2008

Da Wei Huang, Brad T. Sherman, Richard A. Lempicki, Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists, Nucleic Acids Research, Volume 37, Issue 1, 1 January 2009, Pages 1–13, https://doi.org/10.1093/nar/gkn923

Abstract

Functional analysis of large gene lists, derived in most cases from emerging high-throughput genomic, proteomic and bioinformatics scanning approaches, is still a challenging and daunting task. The gene-annotation enrichment analysis is a promising high-throughput strategy that increases the likelihood for investigators to identify biological processes most pertinent to their study. Approximately 68 bioinformatics enrichment tools that are currently available in the community are collected in this survey. Tools are uniquely categorized into three major classes, according to their underlying enrichment algorithms. The comprehensive collections, unique tool classifications and associated questions/issues will provide a more comprehensive and up-to-date view regarding the advantages, pitfalls and recent trends in a simpler tool-class level rather than by a tool-by-tool approach. Thus, the survey will help tool designers/developers and experienced end users understand the underlying algorithms and pertinent details of particular tool categories/tools, enabling them to make the best choices for their particular research interests.

INTRODUCTION

The traditional biological research approaches typically study one gene or a few genes at a time. In contrast, high-throughput genomic, proteomic and bioinformatics scanning approaches (such as expression microarray, promoter microarray, proteomics, ChIP-on-CHIPs, etc.) are emerging as alternative technologies that allow investigators to simultaneously measure the changes and regulation of genome-wide genes under certain biological conditions. Those high-throughput technologies usually generate large ‘interesting’ gene lists as their final outputs. However, the biological interpretation of large, ‘interesting’ gene lists (ranging in size from hundreds to thousands of genes) is still a challenging and daunting task. Over the last few decades, bioinformatics methods, using the biological knowledge accumulated in public databases [e.g. Gene Ontology (1)], make it possible to systematically dissect large gene lists in an attempt to assemble a summary of the most enriched and pertinent biology. A number of high-throughput enrichment tools, including, but not limited to Onto-Express, MAPPFinder, GoMiner, DAVID, EASE, GeneMerge and FuncAssociate, etc. (2–10), were independently developed during 2002 and 2003 as initial studies to address the challenge of functionally analyzing large gene lists. Since then, the enrichment analysis field has been very productive, resulting in more, similar tools becoming publicly available. In 2005, approximately 14 such tools were collected and reviewed by Khatri et al. (11) and by Curtis et al. (12), respectively. The activity in the field has continually grown stronger as the number of new enrichment tools (with distinct new ideas and features) has significantly increased. Approximately 68 such tools have been collected in this survey (2–10,13–73) (Table 1 and Supplementary Data 1).

Table 1.

List of 68 enrichment tools

Enrichment tool name	Year of release	Key statistical method	Category
FunSpec	2002	Hypergeometric	Class I
Onto-Express	2002	Fisher's exact; hypergeometric; binomial; chi-square	Class I
EASE	2003	Fisher's exact (modified as EASE score)	Class I
FatiGO/FatiWise/FatiGO+	2003	Fisher's exact	Class I
FuncAssociate	2003	Fisher's exact	Class I
GARBAN	2003	Hypergeometric	Class I
GeneMerge	2003	Hypergeometric	Class I
GoMiner	2003	Fisher's exact	Class I
MAPPFinder	2003	_Z_-score; hypergeometric	Class I
CLENCH	2004	Hypergeometric; chi-square; binomial	Class I
GO::TermFinder	2004	Hypergeometric	Class I
GOAL	2004	Permutation	Class I
GOArray	2004	Hypergeometric; _Z_-score; permutation	Class I
GOStat	2004	Fisher's exact; chi-square	Class I
GoSurfer	2004	Chi-square	Class I
OntologyTraverser	2004	Hypergeometric; Fisher's exact	Class I
THEA	2004	Hypergeometric	Class I
BiNGO	2005	Hypergeometric; binomial	Class I
FACT	2005	Adopt GeneMerge and GO::TermFinder statistical modules	Class I
gfinder	2005	Fisher's exact	Class I
Gobar	2005	Hypergeometric	Class I
GOCluster	2005	Hypergeometric	Class I
GOSSIP	2005	Fisher's exact	Class I
L2L	2005	Binomial; hypergeometric	Class I
WebGestalt	2005	Hypergeometric	Class I
BayGO	2006	Bayesian; Goodman and Kruskal's gamma factor	Class I
eGOn/GeneTools	2006	Fisher's exact	Class I
Gene Class Expression	2006	_Z_-statistics	Class I
GOALIE	2006	Hidden Kripke model	Class I
GOFFA	2006	Fisher's inverse chi-square	Class I
GOLEM	2006	Hypergeometric	Class I
JProGO	2006	Fisher's exact; Kolmogorov–Smirnov test; Student's _t_-test; Wilcoxon's test; hypergeometric	Class I
PageMan	2006	Fisher's exact; chi-square; Wilcoxon	Class I
STEM	2006	Hypergeometric	Class I
WEGO	2006	Chi-square	Class I
EasyGO	2007	Hypergeometric; chi-square; binomial	Class I
g:Profiler	2007	Hypergeometric	Class I
ProbCD	2007	Yule's Q; Goodman-Kruskal's gamma; Cramer's T	Class I
GOEAST	2008	Hypergeometric	Class I
GOHyperGAll	2008	Hypergeometric	Class I
CatMap	2004	Permutations	Class II
Godist	2004	Kolmogorov–Smirnov test	Class II
GO-Mapper	2004	Gaussian distribution; EQ-score	Class II
iGA	2004	Permutations; hypergeometric; _t_-test; _Z_-score	Class II
GSEA	2005	Kolmogorov–Smirnov-like statistic	Class II
MEGO	2005	_Z_-score	Class II
PAGE	2005	_Z_-score	Class II
T-profiler	2005	_t_-Test	Class II
FuncCluster	2006	Fisher's exact	Class II
FatiScan	2007	Fisher's exact	Class II
FINA	2007	Fisher's exact	Class II
GAzer	2007	_Z_-statistics; permutation	Class II
GeneTrail	2007	Hypergeometric; Kolmogorov–Smirnov	Class II
MetaGP	2007	_Z_-score	Class II
Ontologizer	2004	Fisher's exact	Class III
POSOC	2004	POSET (finite partially ordered set, from discrete mathematics)	Class III
topGO	2006	Fisher's exact	Class III
GO-2D	2007	Hypergeometric; binomial	Class III
GENECODIS	2007	Hypergeometric; chi-square	Class III
GOSim	2007	Resnik's similarity	Class III
PalS	2008	Percent	Class III
ProfCom	2008	Greedy heuristics	Class III
GOTM	2004	Hypergeometric	Class I,II
ermineJ	2005	Permutations; Wilcoxon rank-sum test	Class I,II
DAVID	2003	Fisher's exact (modified as EASE score)	Class I,III
GOToolBox	2004	Hypergeometric; Fisher's exact; binomial	Class I,III
ADGO	2006	_Z_-statistic	Class II,III
FunNet	2008	Unclear	Unclear

During the past several years, bioinformatics enrichment tools have played a very important and successful role contributing to the gene functional analysis of large gene lists for various high-throughput biological studies, which is clearly evidenced by thousands of publications citing these tools (based on Google Scholar as of September 2008). However, these bioinformatics enrichment tools are still in an actively growing and improving stage, without unified methods or one ‘gold’ standard. As more enrichment tools emerge in the scientific community, the individual tool-developing group or end user finds it more and more difficult to comprehensively track the usefulness of all of the existing works to his or her research. This confusing plethora of tools has resulted in several issues: (i) difficulty in comprehensively comparing and remembering the algorithms/features in a tool-by-tool manner among the overwhelmingly large number of tools available (approximately 68 current tools); (ii) a chance that some good work may be overlooked; (iii) redundant efforts in developing ideas that already exist, because of developers’ difficulties in grasping the breadth of the field; (iv) out-of-date ideas being used in newly released tools because of the developers’ lack of awareness of the latest methods; and (v) difficulties for end users in deciding, among so many overwhelming choices, which enrichment tools are most suitable to their analytic needs.

This survey includes four sections to address the situations listed earlier: First, it will identify 68 enrichment tools that are currently available, and further describe the rationales behind them. That way, the tool designers, developers and end users will be made aware of most, if not all, of the existing tools. Secondly, tools will be uniquely classified, according to their underlying algorithms, into three major categories. Thus, readers can more easily and quickly grasp the key spirit of the 68 tools by following the categorical logic instead of trying to search through a tool-by-tool layout. Thirdly, the paper will focus on several important, but largely unanswered, questions and issues associated with the field. We hope that the questions/issues to be discussed will drive more attention, independent thinking, and discussion in the field, thereafter leading to better solutions in the near future. Finally, the paper will conclude with the current status and trends in the field.

GENERAL PRINCIPLE OF ENRICHMENT ANALYSIS AND 68 AVAILABLE TOOLS

A biological process is typically carried out by a group of genes rather than by an individual gene. The principal foundation of enrichment analysis is that if a biological process is abnormal in a given study, the co-functioning genes should have a higher (enriched) potential to be selected as a relevant group by the high-throughput screening technologies. This rationale moves the analysis of large gene lists from an individual gene-oriented view to a relevant gene group-based analysis. Because the analytic conclusion is based on a group of relevant genes instead of on an individual gene, it increases the likelihood for investigators to identify the correct biological processes most pertinent to the biological phenomena under study. For example, suppose that 10% of the user's genes selected by a microarray experiment are kinases, whereas only 1% of the genes in the human genome (the gene population background) are kinases. The enrichment can then be quantitatively measured by common and well-known statistical methods, including Chi-square, Fisher's exact test, Binomial probability and Hypergeometric distribution (enrichment _P_-values are discussed further in a later section of this paper). If the difference is statistically significant, one may conclude for this example that kinases are enriched in the user's study and therefore may play an important role in it. Fortunately, annotation databases, such as Gene Ontology (GO) (1), which collect biological knowledge in a gene-to-annotation format, are very suitable for high-throughput bioinformatics scanning for the enrichment analysis. The tools systematically map a large number of interesting genes in a list to the associated biological annotation terms (e.g. GO Terms or Pathways), and then statistically examine the enrichment of gene members for each of the annotation terms by comparing the outcome to the control (or reference) background. Thereafter, the annotation terms with enriched gene members can be identified from tens of thousands of other annotation terms in a high-throughput fashion (11,12). The enriched annotation terms associated with the large gene list will give important insights that allow investigators to understand the biological themes behind the large gene list.
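The kinase example can be made concrete with a small calculation. The following is a minimal sketch, not the method of any particular tool: it computes a one-sided hypergeometric enrichment _P_-value using only the Python standard library. The 10% and 1% rates come from the example above; the absolute sizes (a 300-gene list and a 30 000-gene background) and the function name `hypergeom_upper_tail` are assumptions chosen purely for illustration.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """Return P(X >= k) when n genes are drawn without replacement
    from a background of N genes, K of which carry the annotation."""
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    # Exact integer arithmetic; converted to a float only at the end.
    return favourable / total


# Hypothetical sizes: only the 10% and 1% proportions come from the text.
population_size = 30000     # N: genes in the background
population_kinases = 300    # K: 1% of the background are kinases
list_size = 300             # n: genes in the user's list
list_kinases = 30           # k: 10% of the user's list are kinases

fold_enrichment = (list_kinases / list_size) / (population_kinases / population_size)
p_value = hypergeom_upper_tail(list_kinases, list_size,
                               population_kinases, population_size)
print(f"fold enrichment = {fold_enrichment:.1f}, P = {p_value:.3g}")
```

Under these assumed sizes only about three kinases would be expected in the list by chance, so observing 30 yields a very small _P_-value. With a much shorter gene list the same 10% versus 1% contrast would be far less significant, which is why the absolute counts matter and not only the percentages.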

Approximately 68 bioinformatics tools (Table 1 and Supplementary Data 1) (2–10,13–73), aligned with the above analytic scenarios and purposes, are collected in this study. Regardless of their distinct features, the general procedure of the tools can be described as having three major layers: data support (backend annotation database); data mining (algorithm and statistics); and result presentation (interface and exploration) (Figure 1). Each of the layers may greatly impact the comprehensiveness of analytic results, as discussed in later sections of this paper. The general features associated with each tool, such as tool home page, publication link, general database scope [see SerbGO (74), which searches detailed annotation coverage across tools], pathway presentation, etc., can be found in Supplementary Data 1, in order to help end users/developers look up tools for their research interests. Moreover, the capability, sensitivity and backend databases can be very different from tool to tool. It is not uncommon for users to try multiple tools with similar analytic capability for the same dataset in order to obtain maximum satisfactory analytic results (75).
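The three layers can be sketched in a few lines of code. This is an illustrative toy under stated assumptions, not the implementation of any tool in Table 1: the annotation "database" is a plain dictionary, the statistic is a one-sided hypergeometric test, the result presentation is a sorted list, and all names (`enrich`, `upper_tail`, the example terms and gene identifiers) are hypothetical.

```python
from math import comb


def upper_tail(k, n, K, N):
    """One-sided hypergeometric P(X >= k); see the data-mining layer below."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)


def enrich(gene_list, annotations, background):
    """Test every annotation term for over-representation in gene_list.

    annotations: dict mapping a term name to the set of genes it annotates
                 (data-support layer).
    background:  iterable of all genes that could have been selected.
    Returns a list of (term, hits, p_value), most significant first
    (result-presentation layer).
    """
    background = set(background)
    genes = set(gene_list) & background
    results = []
    for term, members in annotations.items():
        members = set(members) & background
        hits = len(genes & members)
        if hits == 0:
            continue  # nothing to report for this term
        # Data-mining layer: enrichment P-value for this single term.
        p_value = upper_tail(hits, len(genes), len(members), len(background))
        results.append((term, hits, p_value))
    results.sort(key=lambda row: row[2])
    return results


# Toy data, invented for illustration only.
background = [f"g{i}" for i in range(100)]
annotations = {
    "kinase": {f"g{i}" for i in range(0, 10)},
    "transport": {f"g{i}" for i in range(10, 60)},
}
gene_list = [f"g{i}" for i in range(0, 5)] + [f"g{i}" for i in range(10, 15)]
results = enrich(gene_list, annotations, background)
```

In this toy case both terms receive five hits, but "kinase" annotates only 10% of the background whereas "transport" annotates 50%, so only the former ranks as enriched. Changing the annotation dictionary or the background changes the outcome even when the statistic is held fixed, which illustrates why every layer, and not the statistical method alone, influences the result.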

Figure 1.

The infrastructure of typical enrichment tools. Even though the enrichment analysis tools have distinct features, they can be generally described as three major layers: backend annotation database; data mining; and result presentation. Each of the layers, rather than statistical methods alone, greatly influences the analytic results.

CLASSIFICATION OF ENRICHMENT TOOLS

When the tool developer or end user is searching for particular features among the many tools available, it is not an easy task to digest the features for all 68 tools without appropriate classification. Based on the difference of algorithms, this survey classifies the 68 current enrichment tools into three classes: singular enrichment analysis (SEA); gene set enrichment analysis (GSEA); and modular enrichment analysis (MEA). A complete list of tools and their defining classes can be found in Table 1 and Supplementary Data 1. Notably, some tools with diverse capabilities belong to more than one class. The general features and limitations associated with each class are discussed in the following sections and are compared in Table 2.

Table 2.

Categorization of enrichment analysis tools

Tool category	Description	Indication and limitation	Sub-type of algorithms	Methods	Example tools
Class I: singular enrichment analysis (SEA)	An enrichment _P_-value is calculated for each term from the pre-selected interesting gene list. Enriched terms are then listed in a simple linear text format. This is the most traditional strategy and is still used by most enrichment analysis tools.	Capable of analyzing any gene list, which could be selected from any high-throughput biological study/technology (e.g. microarray, ChIP-on-CHIP, ChIP-on-sequence, SNP array, EXON array, large-scale sequencing, etc.). However, the deeper inter-relationships among the terms may not be fully captured in a linear-format report.	(a) Global reference background; (b) Local reference background; (c) Neural network	(a) Fisher's exact, hypergeometric, chi-square, binomial; (b) Fisher's exact, hypergeometric, chi-square, binomial; (c) Bayesian	(a) GOStat, GoMiner, GOTM, BiNGO, GOToolBox, gfinder, etc.; (b) DAVID, Onto-Express, GARBAN, FatiGO, etc.; (c) BayGO
Class II: gene set enrichment analysis (GSEA)	All genes (without pre-selection) and their associated experimental values are considered in the enrichment analysis. The unique features of this strategy are: (i) no need to pre-select interesting genes, as opposed to Classes I and III; (ii) experimental values are integrated into the _P_-value calculation.	Suitable for pair-wise biological studies (e.g. disease versus control). Currently, it may be difficult to apply to the diverse data structures derived from complex experimental designs and some of the new technologies (e.g. SNP, EXON, promoter arrays).	(a) Based on ranked gene list; (b) Based on continuous gene values	(a) Kolmogorov–Smirnov-like; (b) _t_-test, permutation, _Z_-score	(a) GSEA, CatMap, etc.; (b) FatiScan, ADGO, ermineJ, PAGE, iGA, GO-Mapper, Godist, FINA, T-profiler, MetaGP, etc.
Class III: modular enrichment analysis (MEA)	This strategy inherits the key spirit of SEA. However, term–term/gene–gene relationships are taken into account in the enrichment _P_-value calculation. The advantage of this strategy is that term–term/gene–gene relationships might contain unique biological meaning that is not held by a single term or gene. Such network/modular analysis is closer to the nature of the biological data structure.	Capable of analyzing any gene list, which could be selected from any high-throughput biological study/technology, like Class I. Emphasis is on network relationships during analysis. 'Orphan' genes/terms (with few relationships to other genes/terms), which can sometimes be very interesting too, may be left out of the analysis.	(a) Composite annotations; (b) DAG structure; (c) Global annotation relationship	(a) Measure enrichment on joint terms; (b) Measure enrichment by considering parent–child relationships; (c) Measure term–term global similarity with Kappa statistics, Czekanowski–Dice, Pearson's correlation	(a) ADGO, GENECODIS, ProfCom, etc.; (b) topGO, Ontologizer, POSOC, etc.; (c) DAVID, GOToolBox, etc.
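Two of the strategies summarized in Table 2 can be illustrated with short sketches. The first function is a simplified, unweighted Kolmogorov–Smirnov-like running sum over a ranked gene list (Class II); it is not the GSEA tool's own implementation and it omits the permutation step needed to obtain a _P_-value. The second computes a kappa statistic between the gene memberships of two annotation terms, as one possible measure of term–term global similarity (Class III). The function names and the toy data are assumptions made for illustration.

```python
def running_sum_score(ranked_genes, gene_set):
    """Unweighted KS-like score: walk down the ranked list, stepping up
    on members of gene_set and down otherwise, and return the running
    sum's largest deviation from zero (sign preserved)."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene_set must contain some, but not all, ranked genes")
    running = 0.0
    extreme = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


def kappa(term_a, term_b, universe):
    """Kappa agreement between two terms' gene memberships over a universe."""
    universe = set(universe)
    a_genes = set(term_a) & universe
    b_genes = set(term_b) & universe
    n = len(universe)
    if n == 0:
        raise ValueError("universe must not be empty")
    both = len(a_genes & b_genes)
    neither = n - len(a_genes | b_genes)
    observed = (both + neither) / n
    # Agreement expected by chance from the two terms' sizes.
    expected = (len(a_genes) * len(b_genes)
                + (n - len(a_genes)) * (n - len(b_genes))) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Toy data, invented for illustration only.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]   # most to least up-regulated
top_score = running_sum_score(ranked, {"g1", "g2"})
bottom_score = running_sum_score(ranked, {"g5", "g6"})

universe = [f"g{i}" for i in range(10)]
same = kappa(universe[:5], universe[:5], universe)
opposite = kappa(universe[:5], universe[5:], universe)
```

A gene set concentrated at the top of the ranking gives a score near +1 and one concentrated at the bottom a score near -1, without any prior cut-off on the gene list. Likewise, two terms annotating the same genes give a kappa of 1, which is the kind of redundancy a modular analysis can exploit to group related terms.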


Class 1: Singular enrichment analysis (SEA)

The most traditional strategy for enrichment analysis is to take the user's preselected (e.g. differentially expressed genes selected between experimental versus control samples by _t_-test with a _P_-value ≤0.05 and fold change ≥1.5) ‘interesting’ genes, and then iteratively test the enrichment of each annotation term one-by-one in a linear mode. Thereafter, the individual, enriched annotation terms passing the enrichment _P_-value threshold are reported in a tabular format ordered by the enrichment probability (enrichment _P_-value). The enrichment _P_-value calculation, i.e. number of genes in the list that hit a given biology class as compared to pure random chance, can be performed with the aid of some common and well-known statistical methods (11,12,76), including Chi-square, Fisher's exact test, Binomial probability and Hypergeometric distribution, etc. (Table 1). More discussion regarding the enrichment _P_-value can be found in a later section of this paper.
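As an illustration of the calculation described above, the following is a minimal sketch of a per-term enrichment _P_-value based on the upper tail of the Hypergeometric distribution (equivalent to a one-sided Fisher's exact test). The function name, the term names and all counts are hypothetical and serve only to show the one-term-at-a-time logic of SEA; they are not taken from any particular tool.

```python
from math import comb


def hypergeom_enrichment_p(list_hits, list_size, pop_hits, pop_size):
    """Upper-tail hypergeometric P-value, P(X >= list_hits).

    list_hits: genes in the user's list annotated with the term
    list_size: total genes in the user's list
    pop_hits:  genes in the reference background annotated with the term
    pop_size:  total genes in the reference background
    """
    max_hits = min(list_size, pop_hits)
    total = comb(pop_size, list_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: (genes in list with term, genes in background with term).
LIST_SIZE, POP_SIZE = 300, 20000
term_counts = {
    "term_A": (30, 200),
    "term_B": (6, 250),
    "term_C": (12, 400),
}

# Test each term one by one, then report in order of enrichment P-value.
results = sorted(
    (hypergeom_enrichment_p(hits, LIST_SIZE, pop_hits, POP_SIZE), term)
    for term, (hits, pop_hits) in term_counts.items()
)
for p_value, term in results:
    print(f"{term}\t{p_value:.3g}")
```

The output is the flat, _P_-value-ordered table typical of SEA tools; nothing in it records how the terms relate to one another.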

Even though the strategy and output format of SEA are simple, SEA is indeed a very efficient way to extract the major biological meaning behind large gene lists, which may be generated from any type of high-throughput genomic studies or bioinformatics software packages. Most of the earlier tools (such as GoMiner, Onto-Express, DAVID and EASE) and a lot of the recently released tools (such as GOEAST and GFinder), adopted this strategy and demonstrated significant success in many genomic studies. However, the common weakness of tools in this class is that the linear output of terms can be very large and overwhelming (from hundreds to thousands). Therefore, the data analyst's focus and interrelationships of relevant terms can be diluted. For example, relevant GO terms like apoptosis, programmed cell death, induction of apoptosis, anti-apoptosis, regulation of apoptosis, etc., are spread out at different positions in a large linear output. It is difficult to focus on interrelationships of relevant biology terms among hundreds or thousands of other terms. In addition, the quality of pre-selected gene lists could largely impact the enrichment analysis, which makes SEA analysis unstable to a certain degree when using different statistical methods or cutoff thresholds.

Class 2: Gene set enrichment analysis (GSEA)

GSEA carries the core spirit of SEA, but uses a distinct algorithm to calculate enrichment _P_-values (35). The GSEA strategy has attracted considerable attention and expectation in the field. The unique idea of GSEA is its ‘no-cutoff’ strategy, which takes all genes from a microarray experiment without selecting significant genes (e.g. genes with _P_-value ≤0.05 and fold change ≥1.5). This strategy benefits the enrichment analysis in two respects: 1) it reduces the arbitrary factors in the typical gene selection step that could impact the traditional enrichment analysis; and 2) it uses all information obtained from microarray experiments by allowing the minimally changing genes, which cannot pass the selection threshold, to contribute to the enrichment analysis in differing degrees. The maximum enrichment score (MES) is calculated from the rank order of all gene members in the annotation category. Thereafter, enrichment _P_-values can be obtained by matching the MES to randomly shuffled MES distributions (a Kolmogorov–Smirnov-like statistic) (35). Other enrichment tools in the GSEA class using the ‘no-cutoff’ strategy, such as ErmineJ (31), FatiScan (55), MEGO (36), PAGE (29), MetaGP, GO-Mapper (22) and ADGO (45), employ statistical approaches such as _z_-score, _t_-test and permutation analysis. These approaches directly take experimental values (e.g. fold change) of all genes into the calculation for each annotation term. Collectively, recent GSEA tools, which integrate the total experimental values into the functional data mining, represent an interesting trend with a lot of potential as a complement to traditional SEA (47,77–79).
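The rank-based idea can be sketched as follows. This is a simplified, unweighted Kolmogorov–Smirnov-like running sum, not the published GSEA implementation: the actual method weights genes by their correlation with the phenotype and typically permutes sample labels, whereas this sketch shuffles gene ranks for brevity. All function names and gene identifiers are hypothetical.

```python
import random


def max_enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list and return the running sum's largest deviation from zero."""
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Step up at a gene-set member, step down otherwise.
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical P-value: how often a shuffled ranking scores at least as extreme."""
    rng = random.Random(seed)
    observed = max_enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(max_enrichment_score(shuffled, gene_set)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical ranking of 20 genes (e.g. by fold change); the set sits at the top.
ranking = [f"g{i}" for i in range(20)]
score, p_value = permutation_p(ranking, ["g0", "g1", "g2", "g3", "g4"])
print(f"MES = {score:.2f}, permutation P = {p_value:.4f}")
```

Note that no gene is discarded by a cutoff: every gene in the ranking moves the running sum, which is the defining feature of this class.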

However, tools in the GSEA class are also associated with some common limitations. First, the ‘no-cutoff’ strategy is the key advantage of GSEA, but it is also becoming its major limitation in many biological studies. The GSEA method requires a summarized biological value (e.g. fold change) for each of the genome-wide genes as input. Sometimes, it is a difficult task to summarize many biological aspects of a gene into one meaningful value when the biological study and genomic platform are complex. For example, each gene derived from a SNP microarray could associate with a set of SNPs, which vary in size, _P_-values, physical distances, disease regions, LD (linkage disequilibrium) strength and SNP-gene locations (e.g. in exon or in intron) from gene to gene. It is still a very experimental procedure to summarize such diverse aspects of biology into one comprehensive value. Similar challenges may be found in many of the emerging genomic platforms (e.g. SNP, Exon, Promoter microarray). The data in such examples fully or partially fail to meet the input data structure required by GSEA. As another example, many clinical microarray studies involve multiple factors/variables simultaneously, such as disease/normal, age, sex, drug treatment/control, reagent batch effects, animal batch effects, etc. In such complex situations, sophisticated statistical methods, like ANOVA, time series analysis, survival analysis, etc., will be more powerful for handling multiple variables, multiple time points and batch effects simultaneously when data-mining interesting gene lists. In many similar cases, the upstream data processing and comprehensive gene selection statistics cannot be simply avoided or replaced by GSEA. Moreover, the genes ranked in higher positions (usually with larger differences, e.g. fold change) are the major force driving (highly weighted) the enrichment _P_-values in GSEA. Thus, the underlying assumption is that the genes with large regulations (e.g. fold changes) are contributing more to the biology. Obviously, this is not always true in real biology. Biologists know that small changes of some signal transduction genes can result in larger downstream biological consequences. In contrast, some big changes in metabolic genes may be just a consequence of other small, but important, signal regulation events. Depending on the questions that the researcher is asking, the mildly changed signal transduction genes may be more interesting/important than those largely regulated genes.

The GSEA and SEA methods have been available in the community for many years. Surprisingly, no comprehensive and systematic side-by-side comparisons are available yet. A recent study ran the same datasets with DAVID methods (a SEA/MEA method) versus ErmineJ (a GSEA method) (60). As expected, the results from both methods were highly consistent with each other. The consistency makes sense because the major driving force of the enrichment calculation in GSEA is the largely changing genes. In addition, those genes most likely have better chances to be selected in the traditional gene selection procedures, thus resulting in very similar results between the SEA and GSEA methods.

Class 3: Modular enrichment analysis (MEA)

MEA inherits the basic enrichment calculation found in SEA and incorporates extra network discovery algorithms by considering the term-to-term relationships. Recent tools, such as Ontologizer (69), topGO (41), GENECODIS (59), ADGO (45) and ProfCom (68), claimed to improve discovery sensitivity and specificity by considering inter-relationships of GO terms in the enrichment calculations, i.e. using genes of composite (joint) annotation terms as a reference background. The key advantage of this approach is that the researcher can take advantage of term–term relationships, in which joint terms may contain unique biological meaning for a given study, not held by individual terms. Moreover, when using heterogeneous annotation content, the annotation terms are highly redundant, and also have strong interrelationships regarding different aspects for the same biological process. Building such relationships is one step closer to the true nature of biology during data mining. GoToolBox (18) developed functions to cluster related GO terms or genes, which provides the gene functional annotation in a network context. However, the functions only work for a small scope and only for GO terms. DAVID (60,61) recently provided a new tool that is able to organize and condense a wide range of heterogeneous annotation content, such as GO terms, protein domains, pathways and so on, into term or gene classes. This organization is accomplished by using Kappa statistics to mine the complex biological co-occurrences found in multiple heterogeneous annotation content. Combined with traditional enrichment _P_-value calculations, the new approach allows the enrichment analysis to progress from term-centric or gene-centric to biological module-centric analysis. These methods take into account the redundant and networked nature of biological annotation content in order to concentrate on building the larger biological picture rather than focusing on an individual term or gene. 
Such data-mining logic seems closer to the nature of biology in that a biological process works in a network manner. However, the obvious limitation of MEA is that ‘orphan’ terms or genes (without strong relationships to neighbor terms/genes) could be left out from the analysis. Thus, it is important to examine those terms or genes that are left out during analysis when using MEA (60). In addition, the quality of the pre-selected gene list impacts the analytic results, just as it does in SEA analysis.
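To make the term–term relationship concrete, the sketch below computes Cohen's kappa between two annotation terms from their gene memberships, one way of quantifying how far two terms co-occur beyond chance. It is an illustrative assumption about how such a similarity could be computed, not DAVID's actual implementation; the function name, term names and gene identifiers are hypothetical.

```python
def term_kappa(genes_a, genes_b, all_genes):
    """Cohen's kappa for the agreement of two terms over a shared gene universe."""
    universe = set(all_genes)
    a = set(genes_a) & universe
    b = set(genes_b) & universe
    n = len(universe)
    both = len(a & b)                   # genes annotated with both terms
    neither = len(universe - (a | b))   # genes annotated with neither term
    observed = (both + neither) / n
    expected = (len(a) * len(b) + (n - len(a)) * (n - len(b))) / (n * n)
    if expected == 1.0:
        # Kappa is undefined here; returning 0.0 is a convention of this sketch.
        return 0.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical universe of 10 genes and three terms.
universe = [f"g{i}" for i in range(10)]
terms = {
    "term_A": {"g0", "g1", "g2", "g3"},
    "term_B": {"g0", "g1", "g2", "g4"},
    "term_C": {"g7", "g8"},
}

names = sorted(terms)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        k = term_kappa(terms[first], terms[second], universe)
        print(f"{first} vs {second}: kappa = {k:.2f}")
```

Terms whose pairwise kappa exceeds a chosen threshold could then be grouped into one module. A term such as the hypothetical `term_C`, which agrees with no other term, is exactly the kind of ‘orphan’ that may drop out of a module-centric report.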

REMAINING QUESTIONS AND CHALLENGES IN THE FIELD

1. Realistically positioning the role of enrichment _P_-values in the current data-mining environment

The high-throughput enrichment data-mining environment is extremely complicated. Variations of the user gene list size, the deviation of the number of genes associated with each annotation, the gene overlap between annotations, the incompleteness of annotation content, the strong connectivity/dependency among genes, unbalanced distributions of annotation content, and high/low frequency of annotation content are examples of sources leading to this complexity and variation. None of the statistical methods mentioned in Table 1 is perfectly suitable for all situations. The complex situations found in the biological data-mining environment determine the discovery sensitivity and specificity (1 − false-positive rate) of those statistical methods, which are not yet in an optimal state, as discussed by Goeman et al. (73,80,81). Therefore, in real-life practice, many data analysts may treat the resulting enrichment _P_-values as a scoring system that plays an advisory role, i.e. to rank and suggest possibly relevant annotation terms, as opposed to an absolute, decision-making role (82). The analysts themselves still play critical roles in making the final decisions about the most relevant enriched annotation terms highlighted by the enrichment analysis tool. Even though annotation terms may be associated with very significant enrichment _P_-values, it is not uncommon that analysts discard/ignore some of the enriched annotation terms (such as terms with enrichment _P_-values <0.001) because they do not ‘make sense’ for a given study, based on a priori biological knowledge. An analogous situation is a Google search, which returns some results that are not relevant to the user's original query. It is up to the user, based on his or her knowledge of the situation, to make the final judgment about the results.
Collectively, current enrichment analysis is more of an exploratory procedure, with the aid of enrichment _P_-value, rather than a pure statistical solution. The notion that the enriched terms should make sense based on a priori biological knowledge of the study is the most important guideline to help users in adjusting analytic thresholds and thereby answering questions such as, ‘Should my enrichment _P_-value cutoff be 0.05 or 0.01?’ or ‘Should I always consider the term with a significant enrichment _P_-value like 0.001?’ or ‘Which enrichment tool(s) could be more sensitive to my dataset?’

The most popular and traditional statistical methods used in the enrichment calculation are Fisher's exact test, Chi-square, Hypergeometric distribution and Binomial distribution, as collected in Table 1 and Supplementary Data 1. In principle, Binomial probability is considered suitable for analysis with a large population background, whereas Fisher's exact test, the Chi-square test and the Hypergeometric distribution are better for analysis with a smaller population background (12) (see subsection #4 for more discussion about population background). Given the weaknesses of the typical statistical methods, some alternative mathematical approaches were recently proposed in an attempt to improve the enrichment _P_-value calculations. These approaches include (but are not limited to) mid-_P_-value by Rivals et al. (76), finite partially ordered set approach (POSET) by POSOC (83,84), hidden Kripke model (HKM) by GOAlie, greedy heuristics by ProfCom (68), Fisher's inverse chi-squared by GOFFA (50), master-target test/mutually exclusive target–target/intersecting target–target tests by GeneTools (42), EASE Score by EASE (8), Yule's Q by ProbCD (73), Fold Change by GoMiner (39) and Bayesian by BayGO (52). However, it is still too early to state definitively whether some of the improved alternative statistical methods really stand out over the traditional statistical approaches. Given the very complex data-mining environments discussed throughout the manuscript, all current statistical methods are working largely at the edge of their intended capability. Indeed, the specificity of enrichment analysis is more impacted by non-statistical layers than it is by statistical methods alone. In this sense, it is not realistic to guide users to choose enrichment tools simply according to statistical methods that are based purely on statistical advantages/disadvantages.
Thus, we do not extensively discuss the differences between statistical methods, since such a discussion could potentially mislead a user's judgment. It is in the user's best interests to try many statistical methods on the same dataset and to compare the results whenever possible. Obviously, the need for new, more robust statistical methods to overcome the limitations of the current methods is still in high demand by the field.
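The remark about population background size can be checked numerically. The sketch below, with hypothetical counts and function names, compares the Binomial upper tail (sampling with replacement) with the Hypergeometric upper tail (sampling without replacement) for the same list and the same 10% term frequency, once against a small background and once against a large one.

```python
from math import comb


def hypergeom_tail(list_hits, list_size, pop_hits, pop_size):
    """P(X >= list_hits) when the list is drawn without replacement."""
    total = comb(pop_size, list_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
        for k in range(list_hits, min(list_size, pop_hits) + 1)
    )
    return tail / total


def binomial_tail(list_hits, list_size, term_frequency):
    """P(X >= list_hits) when each gene independently carries the term."""
    return sum(
        comb(list_size, k)
        * term_frequency ** k
        * (1.0 - term_frequency) ** (list_size - k)
        for k in range(list_hits, list_size + 1)
    )


# Hypothetical list: 20 genes, 6 of them annotated with a term of 10% frequency.
LIST_HITS, LIST_SIZE = 6, 20
p_binomial = binomial_tail(LIST_HITS, LIST_SIZE, 0.10)

for pop_size in (100, 100000):
    pop_hits = pop_size // 10
    p_hyper = hypergeom_tail(LIST_HITS, LIST_SIZE, pop_hits, pop_size)
    print(f"background={pop_size}: hypergeometric={p_hyper:.5f}, binomial={p_binomial:.5f}")
```

With the large background the two tails nearly coincide, while with the small background they diverge, because drawing 20 genes out of 100 noticeably depletes the pool.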

2. Understanding the limitation of multiple testing correction on enrichment _P_-values

According to standard statistical principles, the more annotations that are tested, the greater the family-wise false-positive rate becomes (85,86). To control the family-wise false-positive rate in the result list, the review articles by Khatri et al. (11,12) indicate that the multiple test correction of enrichment _P_-values must be performed on the functional annotation categories being tested at the same time. Indeed, the majority of the tools perform such corrections with methods such as Bonferroni, Benjamini–Hochberg, Holm, Q-value, Permutation, etc. (Supplementary Data 1). Given the extremely complicated gene functional data-mining environment discussed in the previous section, a critical question is how much improvement in discovery sensitivity and specificity (1 − false-positive rate) is actually achieved by applying such corrections in real-life practice.
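For reference, two of the corrections named above can be sketched in a few lines. These are the textbook Bonferroni and Benjamini–Hochberg procedures written as plain functions; the function names and the example _P_-values are hypothetical, and individual tools may implement variants.

```python
def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical enrichment P-values for four annotation terms.
raw = [0.01, 0.04, 0.03, 0.005]
print("Bonferroni:", bonferroni(raw))
print("Benjamini-Hochberg:", benjamini_hochberg(raw))
```

Because the Bonferroni factor grows with the number of terms tested, a term must reach a far smaller raw _P_-value to survive when thousands of annotation terms are examined at once.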

Even though many enrichment tools implement such corrections, only a few tools systematically provide evidence regarding the improvements of discovery results with and without such corrections in real-life analytic environments, rather than believing the benefits based on the statistical principle alone. Recently, GOSSIP (27) comprehensively compared the discovery sensitivity and specificity across various correction techniques provided by various tools with real-life datasets. It was concluded that the common multiple testing correction techniques, known to be overly conservative approaches if there are thousands or even more annotation terms involved in the analysis, may not improve specificity as much as people had believed those techniques would. In fact, the sensitivity may actually be negatively affected because of the conservative nature of these corrections (27).

Given the complexity of biological data-mining environments, the enrichment _P_-values derived from the common statistical methods can be very fragile, and are influenced not only by the statistical methods themselves, but also greatly by the algorithms, data sources, the individual biological process itself and so on. The specificity of the discovery is indeed greatly impacted by the non-statistical layers, which cannot be simply fixed by multiple test corrections. Great efforts regarding sensitivity and specificity issues involved in the enrichment analysis may require that improvements are made on the fundamental, non-statistical layers first (Figure 1). Then, the power of various statistical approaches including the multiple test correction can be utilized fully in the enrichment analysis. More than a dozen of the enrichment tools, including recent ones such as EasyGO (66) and g:Profiler (64), as well as the earlier ones such as GoMiner (10), have not implemented multiple test corrections (Supplementary Data 1), but are still widely used by the community in real-life data-mining projects. In summary, the multiple test correction is only a partial solution, not a resolution of the specificity problem in current enrichment analysis platforms.

3. Cross-comparing enrichment analysis results derived from multiple gene lists

A larger gene list can have higher statistical power, resulting in a higher sensitivity (more significant _P_-values) to slightly enriched terms, as well as to more specific terms. On the other hand, the sensitivity is decreased toward largely enriched terms and broader terms. Thus, the size of the gene list impacts the absolute enrichment _P_-values, making it difficult to directly compare the absolute enrichment _P_-values across gene lists. Regardless of the challenges, cross-comparisons sometimes are necessary and important when studying the changes/trends among multiple time course datasets. Tools, such as GOBar (32), Go-Mapper (22), GOAlie, PageMan (51), high-throughput GoMiner (39), and the most recent, GOEAST (70), are intended to provide some of these capabilities to display multiple time course datasets simultaneously. However, users should keep the _P_-value comparison issue in mind when using these tools. The issue is even more critical, particularly when the sizes of gene lists are dramatically different from each other. More comprehensive and appropriate algorithms regarding the comparisons are still in high demand in the field.
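The dependence of the absolute _P_-value on list size can be illustrated with a small calculation. In this sketch, which uses hypothetical counts and a hypothetical function name, the term always covers 10% of the user's list and 1% of the background; only the list size changes.

```python
from math import comb


def upper_tail_p(list_hits, list_size, pop_hits, pop_size):
    """Hypergeometric P(X >= list_hits)."""
    total = comb(pop_size, list_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
        for k in range(list_hits, min(list_size, pop_hits) + 1)
    )
    return tail / total


# Hypothetical background: 10000 genes, 100 of which (1%) carry the term.
POP_SIZE, POP_HITS = 10000, 100

for list_size in (50, 200, 500):
    list_hits = list_size // 10  # the term covers 10% of every list
    p_value = upper_tail_p(list_hits, list_size, POP_HITS, POP_SIZE)
    print(f"list size {list_size}: {list_hits} hits, P = {p_value:.3g}")
```

The fold enrichment is identical in all three cases, yet the _P_-values differ by many orders of magnitude, which is why absolute _P_-values from lists of very different sizes should not be compared directly.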

4. Setting up the ‘right’ gene reference background

As noted in our previous example, 10% of the user's genes selected by a microarray experiment are kinases, as opposed to 1% of the genes in the human genome (this is the gene population background) that are kinases. The enrichment can therefore be quantitatively measured. A conclusion may be obtained for the particular example, that is, kinases are enriched in the user's study, and therefore play important roles in the study. However, 10% alone cannot lead to such a conclusion without comparison to the gene reference background (i.e. 1%). Thus, the different gene reference background settings may greatly impact the enrichment _P_-values, even when using the same statistical method and annotation content (12). For example, tools such as GOToolBox (18), GOstat (14), GoMiner (10), FatiGO (13) and GOTM (24), use the total genes in the genome as a global reference background. They tend to give more significant _P_-values, as compared to the tools (e.g. Onto-Express) using a narrowed-down set of genes (e.g. genes only existing on a microarray) as a gene reference background. In addition, DAVID (61) tends to be more conservative by using genes existing on the array and found to be associated with terms in the corresponding annotation categories, as the gene reference background. Many tools further allow users to upload a customized gene list as a gene reference background (Supplementary Data 1). Even though there is no ‘gold’ standard for the reference background, a general guideline is to set up the reference background as the pool of genes that could be selected for the studied annotation category (12). For example, the total genes found on a microarray chip seem to be the ‘right’ reference background, if the analysis gene list is derived from a microarray study conducted with the given chip. 
However, it is not perfect, since some genes on the chip could have little or no chance to be selected during the study, due to a low expression level that falls below the microarray detection range, and/or ‘bad’ probe design, etc. Even though the gene reference background directly impacts enrichment _P_-value, it will impact the _P_-values of all terms in a relatively similar manner within the same analysis. For the same dataset, analyzed with different gene reference backgrounds, the output rank/order of the enrichment terms will remain relatively the same, even though the terms may be associated with different _P_-values. Such stable order/rank of enrichment terms in the output is more important than their absolute _P_-values so that the annotation exploration and conclusion on the same dataset will be similar and comparable when using different gene reference backgrounds. In this sense, another important principle of setting a gene reference background is to use a consistent gene reference background within the same analysis.
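The effect of the reference background can be sketched with the kinase example. The list below contains 10% kinases in both runs; it is tested once against a genome-wide background in which 1% of genes are kinases and once against a narrower, array-like background. The array counts, the gene list size and the function name are hypothetical.

```python
from math import comb


def enrichment_p(list_hits, list_size, pop_hits, pop_size):
    """Hypergeometric P(X >= list_hits) against a given reference background."""
    total = comb(pop_size, list_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
        for k in range(list_hits, min(list_size, pop_hits) + 1)
    )
    return tail / total


# User's list: 300 genes, 30 of them kinases (10%).
LIST_SIZE, LIST_HITS = 300, 30

backgrounds = {
    # name: (kinases in background, genes in background)
    "whole genome (1% kinases)": (300, 30000),
    "array only (hypothetical 2.5% kinases)": (250, 10000),
}

for name, (pop_hits, pop_size) in backgrounds.items():
    p_value = enrichment_p(LIST_HITS, LIST_SIZE, pop_hits, pop_size)
    print(f"{name}: P = {p_value:.3g}")
```

The same list and the same statistic give a markedly more significant _P_-value against the global background, which is why a consistent background should be used within one analysis.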

5. Extending backend annotation databases

Due to its enriched content and suitable data structure for high-throughput data mining, GO (1) is the only backend data source used in most, if not all, of the earlier enrichment tools, as well as in some of the more recent tools (Supplementary Data 1). However, many different biological aspects are being maintained and annotated by different independent resources; these aspects have not only a significant amount of overlapping information, but also a significant amount of unique data, due to the differing focus of the specialized groups. No one, single source is able to maintain all of the biological aspects, such as GO for the biological process, molecular functions or cellular components; Pfam for protein domains; BIND for protein–protein interactions; KEGG for pathways; TRANSFAC for gene regulations; GNF for gene–tissue expressions; OMIM for gene–disease associations; and so on (65,87,88). In this sense, a comprehensive backend database integrated with diverse and heterogeneous data sources will allow the enrichment tools to more comprehensively mine the large gene lists on broad-based annotation content covering different biological aspects, rather than on GO content alone. Obviously, the improvement of the annotation database alone can significantly improve the comprehensiveness of the data mining. Otherwise, the power of advanced data-mining algorithms and statistics cannot be fully utilized in the enrichment analysis.

Many tools are still using GO as the only backend database in the enrichment analysis (Supplementary Data 1). However, some recent tools or new releases of early-generation tools, such as Onto-Express (62), DAVID (61), WebGestalt (40), FatiGO+ (56), FACT (30), g:Profiler (64), GAzer (63) and GeneTrail (57), have extended their backend bio-databases by integrating wide-range heterogeneous data content (e.g. GO, KEGG pathways, protein domains, disease association, tissue expression, etc.) in order to increase the comprehensiveness of the enrichment analytic results. The WebGestalt, DAVID and Onto-Express groups independently reported their efforts in detail, with the resulting collections including GeneKeyDB, the DAVID Knowledgebase and OT, respectively (65,87,88). Each group described the steps involved in integrating and constructing such large bio-databases, particularly for the purposes of high-throughput gene functional analysis. Moreover, the databases of L2L (34) and DAVID (61) include gene expression data from publicly available SAGE, EST and microarray studies. Thus, the user's dataset may be aligned with data obtained under similar conditions during functional analysis. Regarding species coverage, although the backend databases of several of the enrichment tools may cover a wide range of species, the support for a less popular species (e.g. rice) may not be as robust as that for more popular species (e.g. human, mouse, rat, yeast, fly). Given this situation, several enrichment tools were specifically designed for these less popular species, such as WEGO for rice (54); easyGO for crops (66); FINA for prokaryotes (58); CLENCH for Arabidopsis (21); JProGo for prokaryotes (48); and BayGO for Xylella fastidiosa (52). Collectively, the quality, integration, and coverage of databases designed for high-throughput gene functional analysis have recently made notable progress, compared to that in earlier works.
While the database improvement is an endless task, the current improvements have already significantly benefited individual groups and tools, as well as provided better backend bio-sources to the field for future tool development (65,87,88). The tools that still use GO as their only backend database should consider the integration of a wider collection of bio-databases in order to reflect the need and progress of the field.

6. Efficiently mapping users’ input gene identifiers to the available annotation

If a gene identifier (ID) cannot be mapped efficiently to its corresponding annotation content, the subsequent data mining will be largely impaired. Comprehensive ID-to-ID and ID-to-annotation mapping in the backend database is therefore an essential first step in translating gene lists into annotation content for the downstream high-throughput enrichment algorithms (12). This is not a trivial issue, because the identifiers that represent genes and proteins are highly redundant and are maintained by independent bioinformatics organizations. Identifier cross-mapping has been addressed effectively within each major organization, e.g. NCBI Entrez Gene (89), UniProt UniRef (90) and PIR-NREF (91), but cross-referencing across organizations remains weaker. For example, UniProt does not cover RefSeq IDs and NCBI Entrez Gene does not reference PIR IDs at all. Moreover, each annotation database tends to adopt one system as its major gene identifier, e.g. GeneRIF adopts NCBI IDs and InterPro uses UniProt/SwissProt IDs (65), so some annotation content does not favor certain types of user input IDs. For a given type of ID, important annotation content can therefore be left out of the high-throughput analysis without the user's awareness, resulting in an incomplete or even failed enrichment analysis. Unfortunately, enrichment tools in general document poorly how they handle ID-to-ID and ID-to-annotation mapping. Most tools have likely adopted the existing work of another major group, such as the NCBI Entrez Gene database (89). In that case, although a tool may claim to support many ID systems, not all types of IDs are necessarily fully integrated into its backend annotation database, because of the cross-organization issues discussed above.

Some recent efforts, such as Onto-Translate (62), MatchMiner (92), IDConverter (93) and DAVID ID Converter (61), have substantially improved ID-to-ID and ID-to-annotation mapping. With these tools, users can easily translate one type of ID into another. They not only improve cross-referencing capability but also enrich annotation content. For example, after gene IDs were re-agglomerated by a procedure called the DAVID Gene Concept, 10–20% more GO terms could be assigned to the corresponding genes in the DAVID Knowledgebase than in each individual annotation source (65).
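The mapping problem described above can be made concrete with a small sketch. The cross-reference table, the annotation table and every identifier below are hypothetical placeholders rather than records from any real database, and the function `map_ids` is an illustrative name, not part of any tool surveyed here. The sketch shows only the general idea of routing a user's IDs through a gene-centric key and reporting how many could not be mapped, so that silent loss of annotation becomes visible.

```python
from collections import defaultdict

# Hypothetical cross-reference table: (ID type, ID) -> gene-centric key.
# All identifiers are made-up placeholders for illustration.
XREF = {
    ("refseq", "NM_0001"): "GENE_A",
    ("uniprot", "P00001"): "GENE_A",
    ("pir", "A00001"): "GENE_A",
    ("refseq", "NM_0002"): "GENE_B",
}

# Hypothetical annotation table keyed by the same gene-centric key.
ANNOTATION = {
    "GENE_A": {"TERM_1", "TERM_2"},
    "GENE_B": {"TERM_3"},
}


def map_ids(id_type, user_ids):
    """Map user IDs to annotation terms and report the IDs that were lost."""
    mapped = defaultdict(set)
    unmapped = []
    for uid in user_ids:
        gene = XREF.get((id_type, uid))
        if gene is None:
            # No cross-reference exists for this ID type, so its
            # annotation would silently drop out of the analysis.
            unmapped.append(uid)
            continue
        mapped[uid] |= ANNOTATION.get(gene, set())
    coverage = (len(user_ids) - len(unmapped)) / len(user_ids) if user_ids else 0.0
    return dict(mapped), unmapped, coverage


mapped, unmapped, coverage = map_ids("refseq", ["NM_0001", "NM_0002", "NM_9999"])
print(f"coverage={coverage:.2f}, unmapped={unmapped}")
```

Reporting the unmapped IDs and the coverage fraction alongside the enrichment results is one simple way for a tool to document how its ID mapping behaved for a particular input list.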

7. Enhancing the exploratory capability and graphical presentation

Because of the limitations of current enrichment analysis, the analysis of large gene lists is, in the authors’ opinion, still more of an exploratory procedure than a single statistical solution. Data analysts still play the most important role in interpreting the analytic results and in collecting information from different views to decide which enriched annotation categories/biology are most relevant to the study in question. Such decisions are usually made with the aid of the enrichment _P_-values, prior knowledge of the biology expected in the experiment and, more importantly, the various data collected by exploring the genes and annotation categories.

Flexibility in allowing users to define the analytic scope, e.g. GO levels, can focus the analysis on a user's interests. Many tools, such as GoMiner (10), Onto-Express (62), DAVID (61) and FatiGO (56), support this type of flexibility. In addition, many tools provide comprehensive links to primary annotation resources for annotation categories or gene reports, which allows users to gather relevant information on items of interest quickly and efficiently.

A Directed Acyclic Graph (DAG) maintains the structure of GO annotation terms (1). Even though all tools adopt GO in their enrichment analysis, most break the structured nodes down into flat terms when calculating enrichment _P_-values and then list the results in an easily readable tabular format. This simplified linear format, which organizes the data for easy interpretation, is used by most of the enrichment tools. Moreover, a number of tools, such as Onto-Express (62), EasyGO (66), GoMiner (10), eGOn (42), GoSurfer (25), GOFFA (50) and GeneTrail (57), can display the enrichment results on the DAG or a tree structure so that users may easily explore the results in neighboring nodes. Onto-Express further provides recalculation functions for ‘drill down’ analysis of a particular branch of the DAG. In contrast, the authors of POSOC made the important observation that the DAG, as a structure, captures the orientation of GO terms but lacks power for biological inference, since many functionally related terms may reside in different DAG branches (83). Thus, more and more recent tools, such as Onto-Express (62), DAVID (61), POSOC (83), BayGO (52), FatiGO+ (56), MAPPFinder (7), FuncCluster (43) and FunNet, have started to integrate BioCarta, KEGG or other pathway visualizations in order to examine the user's genes more efficiently in a network context. In addition, some high-throughput pathway visualization tools, such as PathMAPA, Pathway Miner, Pathway Processor, ArrayXPath, Pathway Express, PathwayExplorer, KOBAS and VAMPIRE, are very useful but are not included in this review because they focus on pathway analysis alone.

Interestingly, biological modules/classes of annotation terms, provided by PaLS (67), DAVID (61) and GOToolBox (18), present heterogeneous annotation terms or genes at the group level. This focuses the analysis on the larger biological picture and reduces the effort involved in mining many individual and redundant terms or genes. In addition, DAVID provides a simple 2D view (61) that efficiently displays the related and heterogeneous many-genes-to-many-terms relationships identified by the DAVID classification functions (60) on one well-organized page. Using such visualizations, users can examine the inter-relationships of highly related heterogeneous annotations and genes to pinpoint important commonalities and differences.
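The step of breaking the GO DAG down into flat terms, mentioned earlier in this section, can be illustrated with a minimal sketch. It assumes that genes annotated to a term are also counted for all of its ancestor terms, which is the usual reading of the GO hierarchy but is not a detail that this survey specifies for any particular tool. The toy DAG, the gene names and the function names are hypothetical.

```python
# Toy DAG: child term -> set of parent terms (a term may have several parents).
PARENTS = {
    "root": set(),
    "metabolism": {"root"},
    "signaling": {"root"},
    "lipid_metabolism": {"metabolism"},
    "lipid_signaling": {"signaling", "lipid_metabolism"},
}

# Direct gene-to-term annotations (hypothetical genes).
DIRECT = {
    "g1": {"lipid_signaling"},
    "g2": {"lipid_metabolism"},
    "g3": {"signaling"},
}


def ancestors(term):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def flatten(direct):
    """Build a flat term -> genes table, propagating genes up the DAG."""
    table = {term: set() for term in PARENTS}
    for gene, terms in direct.items():
        for term in terms:
            for t in {term} | ancestors(term):
                table[t].add(gene)
    return table


flat = flatten(DIRECT)
for term in sorted(flat):
    print(term, sorted(flat[term]))
```

Each row of the resulting table can be tested for enrichment independently, which is what makes the tabular report simple to read. The cost is that the parent-child links between rows are no longer visible, which is why displaying results back on the DAG helps exploration of neighboring nodes.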

8. Evaluating the analytic capability of new enrichment tools

Sixty-eight enrichment tools, and potentially more that are missing from this collection, have already made the field very crowded. Many tool publications present only minimal cross-comparisons with other tools. An appropriate standard evaluation procedure would make analytic capability more comparable among tools, particularly new ones. A good standard could also make genuinely novel tools stand out and keep redundant work out of the literature. Such a standard should include, but not be limited to: a set of common datasets (gene lists) with expected and known biology at different levels of analytic difficulty; important aspects for cross-comparison (e.g. backend database, enrichment _P_-values, speed, exploratory capability and graphic presentation); and an emphasis on differences and advantages over competing methods. No detailed proposal exists as yet, but a standard is clearly needed in the field.
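As one illustration of how a common dataset with known biology could be used, the sketch below scores each tool by the fraction of expected terms that appear among its top-ranked results. This is only one possible metric, chosen here for simplicity; it is not a standard proposed by this survey, and the tool names, term lists and the function `recall_at_k` are hypothetical.

```python
def recall_at_k(ranked_terms, expected_terms, k):
    """Fraction of the expected terms found among the top-k reported terms."""
    if not expected_terms:
        raise ValueError("expected_terms must not be empty")
    top_k = set(ranked_terms[:k])
    return len(top_k & set(expected_terms)) / len(expected_terms)


# Hypothetical benchmark: the biology expected for one common gene list.
EXPECTED = {"apoptosis", "cell cycle"}

# Hypothetical ranked reports (most significant first) from two tools.
REPORTS = {
    "tool_x": ["apoptosis", "translation", "cell cycle", "transport"],
    "tool_y": ["translation", "apoptosis", "transport", "cell cycle"],
}

scores = {tool: recall_at_k(terms, EXPECTED, k=3) for tool, terms in REPORTS.items()}
for tool, score in sorted(scores.items()):
    print(f"{tool}: recall@3 = {score:.2f}")
```

A fuller evaluation would repeat such a score over several gene lists of differing difficulty and report it next to the other aspects listed above, such as speed and backend database coverage.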

9. Choosing the most appropriate enrichment tools from the various choices

Choosing the most suitable enrichment tool or tools depends largely on the user’s research needs, IT experience and the questions being asked. A precise guideline is most likely not possible, since research goals vary widely from project to project. Before choosing a tool, a user may ask questions such as: ‘Is the GO data source enough, or are more sources (such as pathways, protein domains, protein–protein interactions, etc.) needed?’; ‘Is the SEA linear enrichment report enough, or do I really need MEA to look into inter-relationships?’; ‘Is my experimental design simple enough to fit the GSEA input requirement, or is a comprehensive statistical method necessary for gene selection?’; ‘What is my IT capability to handle R, standalone tools or web tools?’; etc. Tools that best meet the user's requirements can then be selected logically. Table 2 compares the strengths and limitations of each tool class. Instead of looking up individual tools among the overwhelming choices, researchers are advised to locate the desired tool class (i.e. SEA, GSEA or MEA) first and then narrow down to individual tools within that class. Supplementary Data 1 lists, for every tool, some of the aspects that users may be interested in. In addition, a protocol paper on enrichment analysis by Huang et al. (82) could be useful for beginning users. SerbGO is a good site for searching and comparing detailed features and annotation coverage among tools. Researchers are not advised to choose tools simply according to the underlying enrichment statistical methods: as discussed in previous sections, most statistical methods in current enrichment tools operate under large uncertainties.

Moreover, successful analyses in higher-quality publications can serve as important examples that guide end users toward ‘well-used’ tools and toward analytic procedures suited to similar situations. Importantly, it is not unusual for different tools with similar capabilities and functions to output very different results because of variations in how important aspects are implemented. Thus, it is recommended that users test multiple tools, even ones that offer similar analytic capability, in order to obtain the most satisfactory results (75).
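When several tools are run on the same gene list, a simple way to see how much their results agree is to compare the sets of terms each one reports as significant. The sketch below uses the Jaccard index for pairwise agreement and a plain set intersection for the consensus; the cutoff of 0.05, the tool names and the _P_-values are illustrative assumptions, not values taken from any tool discussed here.

```python
from itertools import combinations

ALPHA = 0.05  # illustrative significance cutoff, an assumption of this sketch

# Hypothetical reports: term -> enrichment P-value, one dict per tool.
REPORTS = {
    "tool_a": {"apoptosis": 0.001, "cell cycle": 0.02, "transport": 0.30},
    "tool_b": {"apoptosis": 0.004, "cell cycle": 0.20, "immune response": 0.01},
    "tool_c": {"apoptosis": 0.010, "cell cycle": 0.03},
}


def significant_terms(report, alpha=ALPHA):
    """Terms whose P-value is at or below the cutoff."""
    return {term for term, p in report.items() if p <= alpha}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


sig = {tool: significant_terms(report) for tool, report in REPORTS.items()}

# Pairwise agreement between tools.
overlaps = {
    (a, b): jaccard(sig[a], sig[b]) for a, b in combinations(sorted(sig), 2)
}
for pair, value in overlaps.items():
    print(pair, f"{value:.2f}")

# Terms reported as significant by every tool.
consensus = set.intersection(*sig.values())
print("consensus:", sorted(consensus))
```

Terms in the consensus set are reasonable starting points for interpretation, while terms reported by only one tool are worth inspecting for differences in backend annotation or gene background before they are trusted or dismissed.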

CONCLUSIONS AND PERSPECTIVES

Due to the complexity of biological data mining, the analysis of large gene lists with current enrichment tools is still more of an exploratory data-mining procedure than a pure statistical solution. The best analytic conclusions are made with the aid of the investigator's biological knowledge, integrated annotation databases, computing algorithms and the enrichment _P_-values derived from statistical methods.

A large, linear list of enriched annotation terms in output reports may not satisfy researchers as much as it did years ago. The next generation of enrichment tools will strive for an integrative and comprehensive data-mining environment that will not only provide a more efficient means to identify the individual enriched annotations with improved databases, algorithms and statistical methods, but also comprehensively address the internal relationships of many enriched heterogeneous annotations. Tools with such capabilities could make the analysis more focused and understandable in a network context. Many of the most recently reported tools fall into the class II and III categories, which suggests such a trend in the field (Table 1 and Supplementary Data 1).

Finally, the development of new enrichment tools can be expected to continue with enthusiasm, given the unmet needs and limitations of current enrichment analytic methods. A standard for evaluating new tools will facilitate the growth of the field.

FUNDING

National Institute of Allergy and Infectious Diseases; National Institutes of Health (NO1-CO-56000). Funding for open access charge: same source as above.

Conflict of interest statement. The annotation of this tool and publication do not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the United States Government.

ACKNOWLEDGEMENTS

Thanks go to Dr Xin Zheng and Ms Jun Yang in the Laboratory of Immunopathogenesis and Bioinformatics (LIB) group for biological and bioinformatics discussion. We also thank Bill Wilton and Mike Tartakovsky for information technology and network support.

REFERENCES

1. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000.
2. Profiling gene expression using onto-express. Genomics 2002 (pg. 266–270).
3. FunSpec: a web-based cluster interpreter for yeast. BMC Bioinformatics 2002.
4. Characterizing gene sets with FuncAssociate. Bioinformatics 2003 (pg. 2502–2504).
5. GeneMerge–post-genomic analysis, data mining, and hypothesis testing. Bioinformatics 2003 (pg. 891–892).
6. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003.
7. MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol. 2003.
8. Identifying biological themes within lists of genes with EASE. Genome Biol. 2003 (pg. R70).
9. GARBAN: genomic analysis and rapid biological annotation of cDNA microarray and proteomic data. Bioinformatics 2003 (pg. 2158–2160).
10. et al. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol. 2003 (pg. R28).
11. Pathways to the analysis of microarray data. Trends Biotechnol. 2005 (pg. 429–435).
12. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005 (pg. 3587–3595).
13. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 2004 (pg. 578–580).
14. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004 (pg. 1464–1465).
15. GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 2004 (pg. 3710–3715).
16. Iterative Group Analysis (iGA): a simple tool to enhance sensitivity and facilitate interpretation of microarray experiments. BMC Bioinformatics 2004.
17. Comparing functional annotation analyses with Catmap. BMC Bioinformatics 2004 (pg. 193).
18. GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol. 2004 (pg. R101).
19. GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis, and mining. Nucleic Acids Res. 2004 (pg. W293–300).
20. THEA: ontology-driven analysis of microarray data. Bioinformatics 2004 (pg. 2636–2643).
21. CLENCH: a program for calculating Cluster ENriCHment using the Gene Ontology. Bioinformatics 2004 (pg. 1196–1197).
22. GO-Mapper: functional analysis of gene expression data using the expression level as a score to evaluate Gene Ontology terms. Bioinformatics 2004 (pg. 2618–2625).
23. GOAL: automated Gene Ontology analysis of expression profiles. Nucleic Acids Res. 2004 (pg. W492–499).
24. GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies. BMC Bioinformatics 2004.
25. GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space. Appl. Bioinformatics 2004 (pg. 261–264).
26. BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments. Nucleic Acids Res. 2005 (pg. W460–464).
27. Biological profiling of gene groups utilizing Gene Ontology. Genome Inform. 2005 (pg. 106–115).
28. T-profiler: scoring the activity of predefined groups of genes using gene expression data. Nucleic Acids Res. 2005 (pg. W592–595).
29. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics 2005 (pg. 144).
30. FACT–a framework for the functional interpretation of high-throughput experiments. BMC Bioinformatics 2005 (pg. 161).
31. ErmineJ: tool for functional analysis of gene expression data sets. BMC Bioinformatics 2005 (pg. 269).
32. GObar: a gene ontology based analysis and visualization tool for gene sets. BMC Bioinformatics 2005 (pg. 189).
33. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 2005 (pg. 3448–3449).
34. L2L: a simple tool for discovering the hidden significance in microarray expression data. Genome Biol. 2005 (pg. R81).
35. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 2005, vol. 102 (pg. 15545–15550).
36. MEGO: gene functional module expression based on gene ontology. Biotechniques 2005 (pg. 277–283).
37. goCluster integrates statistical analysis and functional interpretation of microarray expression data. Bioinformatics 2005 (pg. 3575–3577).
38. OntologyTraverser: an R package for GO analysis. Bioinformatics 2005 (pg. 275–276).
39. et al. High-throughput GoMiner, an ‘industrial-strength’ integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID). BMC Bioinformatics 2005 (pg. 168).
40. WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res. 2005 (pg. W741–748).
41. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 2006 (pg. 1600–1607).
42. GeneTools–application for functional annotation and statistical hypothesis testing. BMC Bioinformatics 2006 (pg. 470).
43. Clustering biological annotations and gene expression data to identify putatively co-regulated biological processes. J. Bioinform. Comput. Biol. 2006 (pg. 833–852).
44. Grouping Gene Ontology terms to improve the assessment of gene set enrichment in microarray data. BMC Bioinformatics 2006 (pg. 426).
45. ADGO: analysis of differentially expressed gene sets using composite GO annotation. Bioinformatics 2006 (pg. 2249–2253).
46. Gene class expression: analysis tool of Gene Ontology terms with gene expression data. Genet. Mol. Res. 2006 (pg. 108–114).
47. Circumventing the cut-off for enrichment analysis. Brief Bioinform. 2006 (pg. 202–203).
48. et al. JProGO: a novel tool for the functional interpretation of prokaryotic microarray data using Gene Ontology information. Nucleic Acids Res. 2006 (pg. W510–515).
49. GOLEM: an interactive graph-based gene-ontology navigation and analysis tool. BMC Bioinformatics 2006 (pg. 443).
50. GOFFA: Gene Ontology For Functional Analysis – A FDA Gene Ontology tool for analysis of genomic and proteomic data. BMC Bioinformatics 2006 (pg. S23).
51. et al. PageMan: an interactive ontology tool to generate, display, and annotate overview graphs for profiling experiments. BMC Bioinformatics 2006 (pg. 535).
52. BayGO: Bayesian analysis of ontology term enrichment in microarray data. BMC Bioinformatics 2006.
53. A categorization approach to automated ontological function annotation. Protein Sci. 2006 (pg. 1544–1549).
54. et al. WEGO: a web tool for plotting GO annotations. Nucleic Acids Res. 2006 (pg. W293–297).
55. From genes to functional classes in the study of biological systems. BMC Bioinformatics 2007 (pg. 114).
56. FatiGO+: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res. 2007 (pg. W91).
57. GeneTrail–advanced gene set enrichment analysis. Nucleic Acids Res. 2007 (pg. W186–192).
58. FIVA: Functional Information Viewer and Analyzer extracting biological knowledge from transcriptome data of prokaryotes. Bioinformatics 2007 (pg. 1161–1163).
59. GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists. Genome Biol. 2007.
60. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007 (pg. R183).
61. et al. DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res. 2007 (pg. W169–W175).
62. Onto-Tools: new additions and improvements in 2006. Nucleic Acids Res. 2007 (pg. W206–W211).
63. GAzer: gene set analyzer. Bioinformatics 2007 (pg. 1697–1699).
64. g:Profiler–a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res. 2007 (pg. W193–200).
65. DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. BMC Bioinformatics 2007 (pg. 426).
66. EasyGO: Gene Ontology-based annotation and functional enrichment analysis tool for agronomical species. BMC Genomics 2007 (pg. 246).
67. PaLS: filtering common literature, biological terms and pathway information. Nucleic Acids Res. 2008 (pg. W364–W367).
68. ProfCom: a web tool for profiling the complex functionality of gene groups identified from high-throughput data. Nucleic Acids Res. 2008 (pg. W347–W351).
69. Ontologizer 2.0 – A multifunctional tool for GO term enrichment analysis and data exploration. Bioinformatics 2008 (pg. 1650–1651).
70. GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis. Nucleic Acids Res. 2008 (pg. W358–W363).
71. GOSim–an R-package for computation of information theoretic GO similarities between terms and gene products. BMC Bioinformatics 2007 (pg. 166).
72. GO-2D: identifying 2-dimensional cellular-localized functional modules in Gene Ontology. BMC Genomics 2007.
73. ProbCD: enrichment analysis accounting for categorization uncertainty. BMC Bioinformatics 2007 (pg. 383).
74. SerbGO: searching for the best GO tool. Nucleic Acids Res. 2008 (pg. W368–371).
75. Use and misuse of the gene ontology annotations. Nat. Rev. Genet. 2008 (pg. 509–515).
76. Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics 2007 (pg. 401–407).
77. Threshold-free high-power methods for the ontological analysis of genome-wide gene-expression studies. Genome Biol. 2007 (pg. R74).
78. et al. Gaining confidence in biological interpretation of the microarray data: the functional consistence of the significant GO categories. Bioinformatics 2008 (pg. 265–271).
79. Extensions to gene set enrichment. Bioinformatics 2007 (pg. 306–313).
80. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007 (pg. 980–987).
81. Enrichment analysis in high-throughput genomics - accounting for dependency in the NULL. Brief Bioinform. 2007.
82. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2008, doi: 10.1038/nprot.2008.211.
83. The gene ontology categorizer. Bioinformatics 2004 (pg. i169–177).
84. How to decide which are the most pertinent overly-represented features during gene set enrichment analysis. BMC Bioinformatics 2007 (pg. 332).
85. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 1995 (pg. 289–300).
86. Multiple hypothesis testing in microarray experiments. Stat. Sci. 2003 (pg. 103).
87. Babel's tower revisited: a universal resource for cross-referencing across annotation databases. Bioinformatics 2006 (pg. 2934–2939).
88. GeneKeyDB: a lightweight, gene-centric, relational database to support data mining environments. BMC Bioinformatics 2005.
89. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2007 (pg. D26–D31).
90. The UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res. 2008 (pg. D190–D195).
91. et al. The protein information resource. Nucleic Acids Res. 2003 (pg. 345–347).
92. MatchMiner: a tool for batch navigation among gene and gene product identifiers. Genome Biol. 2003 (pg. R27).
93. IDconverter and IDClight: conversion and annotation of gene and protein IDs. BMC Bioinformatics 2007.


This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
