Enrichment or depletion of a GO category within a class of genes: which test?
Journal Article
Isabelle Rivals*, Léon Personnaz, Lieng Taing, Marie-Claude Potier
*To whom correspondence should be addressed.
Laboratoire de Neurobiologie et Diversité Cellulaire, École Supérieure de Physique et de Chimie Industrielles (ESPCI), 10 rue Vauquelin, 75005 Paris, France
Revision received: 11 December 2006
Accepted: 11 December 2006
Published: 20 December 2006
Cite: Isabelle Rivals, Léon Personnaz, Lieng Taing, Marie-Claude Potier, Enrichment or depletion of a GO category within a class of genes: which test?, Bioinformatics, Volume 23, Issue 4, February 2007, Pages 401–407, https://doi.org/10.1093/bioinformatics/btl633
Abstract
Motivation: A number of available program packages determine the significant enrichments and/or depletions of GO categories among a class of genes of interest. Whereas a correct formulation of the problem leads to a single exact null distribution, these GO tools use a large variety of statistical tests whose denominations often do not clarify the underlying _P_-value computations.
Summary: We review the different formulations of the problem and the tests they lead to: the binomial, χ2, equality of two probabilities, Fisher's exact and hypergeometric tests. We clarify the relationships existing between these tests, in particular the equivalence between the hypergeometric test and Fisher's exact test. We recall that the other tests are valid only for large samples, the test of equality of two probabilities and the χ2-test being equivalent. We discuss the appropriateness of one- and two-sided _P_-values, as well as some discreteness and conservatism issues.
Contact: isabelle.rivals@espci.fr
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
A common problem in functional genomic studies is to detect significant enrichments and/or depletions of Gene Ontology (GO) categories within a class of genes of interest, typically the class of significantly differentially expressed (DE) genes. Many GO processing tools perform this task using various statistical tests referred to as the binomial test, the χ2-test, the equality of two probabilities test, Fisher's exact test and the hypergeometric test (see Table 1). The authors of some packages claim advantages for the test(s) they propose, often seemingly contradicting each other. For example, Zeeberg et al. (2003) favor Fisher's exact test: ‘Unlike the _Z_-statistics with the hypergeometric distribution, and tests based on it, Fisher's exact test is appropriate even for categories containing a small number of genes’, whereas for Martin et al. (2004) the hypergeometric test is most appropriate: ‘On the average, the hypergeometric distribution seems to be both the most adapted model and the most powerful test’. Moreover, even though the most recent review papers use a number of criteria to exhaustively compare the different existing tools (Khatri and Draghici, 2005), they do not discuss in detail the identity and approximation relationships existing between the different tests. This is precisely the aim of the present paper.
Table 1
Reviewed GO processing tools
GO tool | Statistical tests | Reference |
---|---|---|
BINGO | Hypergeometric | Maere et al., 2005 |
CLENCH | Hypergeometric, binomial, χ2 | Shah and Fedorov, 2004 |
DAVID | Fisher | Dennis et al., 2003 |
EASEonline | Fisher | Hosack et al., 2003 |
eGOn | Fisher | http://www.genetools.microarray.ntnu.no/help/help\_egon.php?egon=1#intro |
FatiGO | Fisher | Al-Shahrour et al., 2004 |
FuncAssociate | Fisher | Author Webpage |
FunSpec | Hypergeometric | Robinson et al., 2002 |
GeneMerge | Hypergeometric | Castillo-Davis and Hartl, 2003 |
GFINDer | Hypergeometric, binomial, Fisher (a) | Masseroli et al., 2004 |
GoMiner | Fisher | Zeeberg et al., 2003 |
GOstat | χ2, Fisher | Beißbarth and Speed, 2004 |
GoSurfer | χ2 | Zhong et al., 2004 |
GO TermFinder (CPAN) | Hypergeometric | Boyle et al., 2004 |
GO TermFinder (SGD) | Binomial | Author Webpage |
GOTM | Hypergeometric | Zhang et al., 2004 |
GOToolBox | Hypergeometric, binomial, Fisher | Martin et al., 2004 |
L2L | Binomial | Newman and Weiner, 2005 |
NetAffx GO Mining Tool | χ2 | Cheng et al., 2004 |
Onto-Express | Binomial, χ2, Fisher | Khatri et al. 2002; Draghici et al., 2003 |
Ontology Traverser | Hypergeometric | Young et al., 2005 |
STEM | Hypergeometric | Ernst et al., 2005 |
THEA | Hypergeometric, binomial | Pasquier et al., 2004 |
(a) For GFINDer, the website now proposes 3 additional tests, but they are not documented.
2 PROBLEM STATEMENT
We consider a total population of genes, e.g. the genes expressed in a microarray experiment, and we are interested in the property of a gene to belong to a specific GO category. The aim is to establish whether the class of the DE genes presents an enrichment and/or a depletion of the GO category of interest with respect to the total gene population.
3 CANDIDATE FORMULATIONS
Let H0 denote the null hypothesis that the property for a gene to belong to the GO category of interest and that to be DE are independent, or equivalently that the DE genes are picked at random from the total gene population. We consider successively the hypergeometric, the comparison of two probabilities, and the 2 × 2 contingency table formulations of the above problem, and introduce the exact or approximate null distributions they lead to.
Notations (see Table 2): the total number of genes is denoted by n, the total number of genes belonging to the GO category of interest by n+1, the number of DE genes by n1+: n, n+1 and n1+ are hence fixed by the experiment. The number of DE genes belonging to the GO category is denoted by n11.
Table 2
Classification of the genes expressed in a microarray experiment
Class | Category 1 (∈ GO category) | Category 2 (∉ GO category) | Total |
---|---|---|---|
Class 1 (DE) | n11 | n12 | n1+ |
Class 2 (not DE) | n21 | n22 | n2+ |
Total | n+1 | n+2 | n |
3.1 Hypergeometric formulation
The hypergeometric formulation is directly derived from the problem statement.
3.1.1 Exact null distribution
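If H0 is true, the random variable N11, whose realization is the observed value n11, has a hypergeometric distribution with parameters n, n1+ and n+1, which we denote by N11 ∼ Hyper(n, n1+, n+1), with:
$$P(N_{11} = n_{11}) = \frac{\binom{n_{1+}}{n_{11}} \binom{n - n_{1+}}{n_{+1} - n_{11}}}{\binom{n}{n_{+1}}} \qquad (1)$$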
Note that Hyper(n, n1+, n+1) ≡ Hyper(n, n+1, n1+).
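As a minimal sketch of this exact null distribution, the probabilities of Hyper(n, n1+, n+1) can be computed directly from binomial coefficients. Python is used here purely for illustration; the function name `hyper_pmf` and its argument names are hypothetical, and the counts are those of the small sample of Section 6.1.

```python
from math import comb

def hyper_pmf(k, n, n_de, n_go):
    """P(N11 = k) when n_de DE genes are drawn at random from n genes,
    n_go of which belong to the GO category (illustrative helper)."""
    # Outside the support, the probability is zero.
    if k < max(0, n_de + n_go - n) or k > min(n_de, n_go):
        return 0.0
    return comb(n_de, k) * comb(n - n_de, n_go - k) / comb(n, n_go)

n, n_de, n_go = 20, 7, 6  # small sample of Section 6.1
support = range(max(0, n_de + n_go - n), min(n_de, n_go) + 1)

pmf = [hyper_pmf(k, n, n_de, n_go) for k in support]
# Same distribution with the roles of n1+ and n+1 exchanged.
pmf_swapped = [hyper_pmf(k, n, n_go, n_de) for k in support]

total = sum(pmf)
max_gap = max(abs(a - b) for a, b in zip(pmf, pmf_swapped))
print(total, max_gap)
```

The two parameter orderings give the same probabilities up to rounding, which illustrates the identity Hyper(n, n1+, n+1) ≡ Hyper(n, n+1, n1+).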
3.1.2 Approximate null distribution
For a large sample, N11 has approximately a binomial distribution with parameters n1+ and n+1/n: N11 ∼ Bi(n1+, n+1/n). Note that if n1+ n+1/n is also large, the binomial approximation can further be approximated by a Gaussian distribution.
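The quality of the binomial approximation can be inspected numerically. The sketch below, with hypothetical helper names, compares Hyper(n, n1+, n+1) with Bi(n1+, n+1/n) for the counts of the large sample of Section 6.2; it is an illustration under these assumptions, not the authors' code.

```python
from math import comb

def hyper_pmf(k, n, n_de, n_go):
    """Exact null probability P(N11 = k) under Hyper(n, n_de, n_go)."""
    if k < max(0, n_de + n_go - n) or k > min(n_de, n_go):
        return 0.0
    return comb(n_de, k) * comb(n - n_de, n_go - k) / comb(n, n_go)

def binom_pmf(k, size, prob):
    """Approximate null probability P(N11 = k) under Bi(size, prob)."""
    return comb(size, k) * prob**k * (1.0 - prob) ** (size - k)

n, n_de, n_go = 800, 40, 100  # large sample of Section 6.2

exact = [hyper_pmf(k, n, n_de, n_go) for k in range(n_de + 1)]
approx = [binom_pmf(k, n_de, n_go / n) for k in range(n_de + 1)]

# Largest pointwise discrepancy between the two probability functions.
max_abs_diff = max(abs(e - a) for e, a in zip(exact, approx))
print(max_abs_diff)
```

The discrepancy is small but not zero, because the binomial model ignores that the DE genes are drawn without replacement from a finite population.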
3.2 Comparison of two probabilities formulation
In a second formulation, we consider two samples: that of the DE genes, of size n1+, among which n11 genes belong to the GO category of interest, and that of the not DE genes, of size n2+, among which n21 genes belong to the GO category. The proportions of genes belonging to the GO category in the two samples are thus f1 = n11/n1+ (DE genes) and f2 = n21/n2+ (not DE genes). Let p1 and p2 denote the probabilities to belong to the GO category in the two samples; then N11 ∼ Bi(n1+, p1) and N21 ∼ Bi(n2+, p2). In this formulation, the null hypothesis H0 is the equality of the two probabilities p1 = p2 = p, i.e. there is neither enrichment nor depletion in the sample of DE genes with respect to that of the not DE genes.
3.2.1 Approximate null distribution
The case of large samples arises frequently. Then, the binomial distributions can be approximated with Gaussian distributions. Under H0, n1+ and n2+ being large, the probability p can be correctly estimated with f = (n11 + n21)/(n1+ + n2+) = n+1/n, leading to the approximately normally distributed variable:
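$$Z = \frac{F_1 - F_2}{\sqrt{F(1 - F)\left(\dfrac{1}{n_{1+}} + \dfrac{1}{n_{2+}}\right)}} \sim \mathcal{N}(0, 1) \qquad (2)$$
where F1 = N11/n1+, F2 = N21/n2+ and F = (N11 + N21)/n are the random variables whose realizations are f1, f2 and f.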
This distribution is approximate for two reasons: (1) the replacement of the binomial distributions by Gaussian distributions holds only for large samples (both n1+ and n2+ must be large), and (2) it has not been taken into account that, according to our problem statement, the sum N11 + N21, the total number of genes belonging to the GO category, is fixed and equal to n+1.
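The realization z of this statistic is easily computed from the four quantities n, n1+, n+1 and n11. The following sketch uses a hypothetical function name and the counts of the large sample of Section 6.2, for which the article reports z = 2.45.

```python
from math import sqrt

def z_statistic(n11, n, n_de, n_go):
    """Realization z of the approximately normal statistic for the
    comparison of two probabilities (illustrative helper)."""
    n_not_de = n - n_de          # n2+
    n21 = n_go - n11             # not DE genes in the GO category
    f1 = n11 / n_de              # proportion among DE genes
    f2 = n21 / n_not_de          # proportion among not DE genes
    f = n_go / n                 # pooled estimate of p under H0
    return (f1 - f2) / sqrt(f * (1.0 - f) * (1.0 / n_de + 1.0 / n_not_de))

z = z_statistic(10, 800, 40, 100)
print(round(z, 2))
```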
3.2.2 Exact null distribution
Without approximating the binomial distribution, and taking into account that N11 + N21 = n+1, we naturally obtain N11 ∼ Hyper(n, n1+, n+1); see Fisher (1935) and Lehmann (1986) for the complete computation with the binomial distribution conditionally on N11 + N21 = n+1. Hence, the exact distribution of N11 under H0 is, as before, the hypergeometric distribution.
3.3 Contingency table formulation
A third formulation is based on Table 2 seen as a 2 × 2 contingency table. Let again H0 denote the hypothesis that the property to belong to the GO category of interest and that to be DE are independent.
3.3.1 Approximate null distribution
The case of a large sample is frequently considered where, if H0 is true, the following variable is asymptotically χ2 distributed with one degree of freedom (Mood et al., 1974):
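$$D^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{\left(N_{ij} - n_{i+} n_{+j} / n\right)^2}{n_{i+} n_{+j} / n} \qquad (3)$$
We denote by d2 its realization, obtained with the observed counts nij.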
Note that d2 is the square of the realization z of the normal variable Z given by Equation (2):
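$$d^2 = \frac{n \left(n_{11} n_{22} - n_{12} n_{21}\right)^2}{n_{1+}\, n_{2+}\, n_{+1}\, n_{+2}} = z^2 \qquad (4)$$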
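The χ2 statistic can be checked numerically in two ways: as a sum over the four cells of the table, and with the usual closed form for a 2 × 2 table. The sketch below uses hypothetical function names and the table of the large sample of Section 6.2 (n11 = 10, n12 = 30, n21 = 90, n22 = 670), for which the article reports d2 = 6.015.

```python
def chi2_statistic(n11, n12, n21, n22):
    """Sum over the four cells of (observed - expected)^2 / expected."""
    table = [[n11, n12], [n21, n22]]
    rows = [n11 + n12, n21 + n22]   # n1+, n2+
    cols = [n11 + n21, n12 + n22]   # n+1, n+2
    n = rows[0] + rows[1]
    d2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            d2 += (table[i][j] - expected) ** 2 / expected
    return d2

def chi2_closed_form(n11, n12, n21, n22):
    """Equivalent closed form for a 2 x 2 table."""
    n = n11 + n12 + n21 + n22
    margins = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    return n * (n11 * n22 - n12 * n21) ** 2 / margins

d2_sum = chi2_statistic(10, 30, 90, 670)
d2_closed = chi2_closed_form(10, 30, 90, 670)
print(round(d2_sum, 3), round(d2_closed, 3))
```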
3.3.2 Exact null distribution
Whatever the sample size, Fisher's formula gives the probability of the observed configuration of the contingency table under H0 (Fisher, 1935; Mood et al., 1974; Agresti, 2002):
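$$P(n_{11}, n_{12}, n_{21}, n_{22}) = \frac{n_{1+}!\; n_{2+}!\; n_{+1}!\; n_{+2}!}{n!\; n_{11}!\; n_{12}!\; n_{21}!\; n_{22}!} \qquad (5)$$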
It is easy to show that N11 ∼ Hyper(n, n1+, n+1):
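$$P(n_{11}, n_{12}, n_{21}, n_{22}) = \frac{\dfrac{n_{1+}!}{n_{11}!\, n_{12}!} \; \dfrac{n_{2+}!}{n_{21}!\, n_{22}!}}{\dfrac{n!}{n_{+1}!\, n_{+2}!}} = \frac{\binom{n_{1+}}{n_{11}} \binom{n_{2+}}{n_{21}}}{\binom{n}{n_{+1}}} = P(N_{11} = n_{11}) \qquad (6)$$
with n2+ = n − n1+ and n21 = n+1 − n11.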
As expected, the exact distribution of N11 under H0 is again the hypergeometric distribution, see Equation (1).
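The identity between Fisher's formula and the hypergeometric probability can be verified with exact rational arithmetic. This is a small sketch with hypothetical function names, applied to the table of the small sample of Section 6.1 (n11 = 4, n12 = 3, n21 = 2, n22 = 11).

```python
from fractions import Fraction
from math import comb, factorial

def fisher_formula(n11, n12, n21, n22):
    """Probability of the observed 2 x 2 table, written with factorials."""
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    n = row1 + row2
    numerator = (factorial(row1) * factorial(row2)
                 * factorial(col1) * factorial(col2))
    denominator = (factorial(n) * factorial(n11) * factorial(n12)
                   * factorial(n21) * factorial(n22))
    return Fraction(numerator, denominator)

def hyper_probability(n11, n, n_de, n_go):
    """Same probability, written with binomial coefficients."""
    return Fraction(comb(n_de, n11) * comb(n - n_de, n_go - n11),
                    comb(n, n_go))

p_fisher = fisher_formula(4, 3, 2, 11)
p_hyper = hyper_probability(4, 20, 7, 6)
print(p_fisher, p_hyper)
```

Using `Fraction` avoids any floating-point rounding, so the two expressions can be compared for strict equality.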
3.4 Summary
Under H0, i.e. assuming the independence of the property to belong to the GO category of interest and of the property to be DE, or equivalently assuming p1 = p2 where p1 is the probability of the DE genes to belong to the GO category, and p2 the probability of the not DE genes to belong to the GO category, the exact distribution of N11 is the hypergeometric distribution N11 ∼ Hyper(n, n1+, n+1) which, if n is large, can be approximated with the binomial distribution Bi(n1+, n+1/n). If the two samples are large, it is also possible to exhibit an approximately normal variable Z or its square D2 = Z2, the latter being hence approximately χ2 distributed with one degree of freedom.
4 TESTS AND _P_-VALUES
Generally, when performing the test of a null hypothesis H0 against some alternative hypothesis Ha, one has at hand a realization x of a random variable X with known distribution under H0, the null distribution. One chooses a priori a probability α of type I error (the error of rejecting H0 when it is true) that must not be exceeded, also called the significance level, the decision to reject H0 being taken when x falls in the critical region. In this context, the _P_-value is the minimum significance level for which H0 would be rejected, or equivalently, it is the probability, under H0, of the minimal critical region containing x.
The choice of a critical region in order to maximize the power of the test, and hence the choice of the corresponding _P_-value, depends on the alternative hypothesis Ha, which may be ‘enrichment’ (_p_1 > _p_2, one-sided test, critical region right), ‘depletion’ (_p_1 < _p_2, one-sided test, critical region left), or ‘enrichment or depletion’ (_p_1 ≠ _p_2, two-sided test, critical region left and right). Enrichment, depletion and enrichment or depletion are later denoted by E, D, and E/D, respectively.
4.1 One-sided tests
The one-sided _P_-value is defined as:
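$$p_{\mathrm{one}}(n_{11}) = \begin{cases} P(N_{11} \ge n_{11}) & \text{if } H_a = \mathrm{E} \\ P(N_{11} \le n_{11}) & \text{if } H_a = \mathrm{D} \end{cases} \qquad (7)$$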
In the case of a discrete distribution, like the exact hypergeometric distribution or the approximate binomial distribution, it is not possible to guarantee any given value of the significance level with the rule ‘reject H0 if pone(n11) ≤ α’. Due to the discreteness, the actual significance level (or size of the test) is generally smaller than the nominal (desired) significance level α, which results in a loss of power.
To minimize this loss, a good remedy is the use of mid-_P_-values (Agresti and Min, 2001; Agresti, 2002). The one-sided mid-_P_-value, which we denote by πone, is defined as:
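$$\pi_{\mathrm{one}}(n_{11}) = \begin{cases} P(N_{11} > n_{11}) + \tfrac{1}{2} P(N_{11} = n_{11}) & \text{if } H_a = \mathrm{E} \\ P(N_{11} < n_{11}) + \tfrac{1}{2} P(N_{11} = n_{11}) & \text{if } H_a = \mathrm{D} \end{cases} \qquad (8)$$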
It must be noted that the actual significance level, i.e. the actual probability of type I error, is no longer guaranteed to be smaller than the nominal significance level. However, it is rarely much greater (Agresti, 2002).
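The one-sided _P_-value and mid-_P_-value defined above can be sketched as follows. The function names and the `alternative` argument are hypothetical, and the counts are those of the small sample of Section 6.1.

```python
from math import comb

def hyper_pmf(k, n, n_de, n_go):
    """Exact null probability P(N11 = k) under Hyper(n, n_de, n_go)."""
    if k < max(0, n_de + n_go - n) or k > min(n_de, n_go):
        return 0.0
    return comb(n_de, k) * comb(n - n_de, n_go - k) / comb(n, n_go)

def one_sided_p_values(n11, n, n_de, n_go, alternative="enrichment"):
    """Return (P-value, mid-P-value) of the one-sided exact test."""
    k_min, k_max = max(0, n_de + n_go - n), min(n_de, n_go)
    if alternative == "enrichment":
        tail = range(n11, k_max + 1)      # values >= n11
    elif alternative == "depletion":
        tail = range(k_min, n11 + 1)      # values <= n11
    else:
        raise ValueError("alternative must be 'enrichment' or 'depletion'")
    p_value = sum(hyper_pmf(k, n, n_de, n_go) for k in tail)
    # The mid-P-value counts only half the probability of the observed value.
    mid_p_value = p_value - 0.5 * hyper_pmf(n11, n, n_de, n_go)
    return p_value, mid_p_value

p_enrich, mid_enrich = one_sided_p_values(4, 20, 7, 6, "enrichment")
p_deplete, mid_deplete = one_sided_p_values(4, 20, 7, 6, "depletion")
print(p_enrich, mid_enrich)
```

Note that the two one-sided mid-_P_-values sum to one, whereas the two ordinary one-sided _P_-values sum to more than one, because the probability of the observed value is counted in both tails.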
Another remedy is randomization, with which any desired significance level can be achieved. However in practice, randomization having nothing to do with the data does not make much sense (Lehmann, 1986; Agresti, 2002).
If the approximately normal variable Z is considered, we have:
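$$p_{\mathrm{one}}(z) = \begin{cases} P(Z \ge z) = 1 - \Phi(z) & \text{if } H_a = \mathrm{E} \\ P(Z \le z) = \Phi(z) & \text{if } H_a = \mathrm{D} \end{cases} \qquad (9)$$
where Φ denotes the cumulative distribution function of the standard normal distribution.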
If the approximately χ2 distributed variable D2 is used, a one-sided test cannot be performed, since both enrichment (large observed n11) and depletion (small observed n11) lead to a large value of D2, i.e. there is a single critical region.
4.2 Two-sided tests
In the case of a two-sided test i.e. Ha = E/D, and of a discrete null distribution, there are several popular definitions of the _P_-value, see (Agresti, 1992, 2002). A first approach defines the two-sided _P_-value as twice the one-sided _P_-value:
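$$p_{\mathrm{two}}^{\mathrm{d}}(n_{11}) = 2 \min\left[ P(N_{11} \ge n_{11}),\; P(N_{11} \le n_{11}) \right] \qquad (10)$$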
Yates and Fisher himself were in favor of this ‘doubling’ approach (Yates, 1984). A second approach, which after Gibbons and Pratt (1975) we name the ‘minimum-likelihood’ approach, defines the _P_-value as the sum of the probabilities of the values of N11 whose probability is smaller than or equal to that of the observed value n11, as recommended for example in (Mehta and Patel, 1998):
$$p_{\mathrm{two}}^{\mathrm{ml}}(n_{11}) = \sum_{k \,:\, P(N_{11} = k) \,\le\, P(N_{11} = n_{11})} P(N_{11} = k) \qquad (11)$$
The minimum-likelihood approach is the only one we have encountered in the GO tools of Table 1. A third approach defines the _P_-value as the sum of the probabilities of the values of N11 that are at least as extreme (with respect to the mathematical expectation of N11) as the observed one (Gibbons and Pratt, 1975; Yates, 1984; Agresti, 2002). A fourth approach defines the two-sided _P_-value as min[P(N11 ≥ n11), P(N11 ≤ n11)] plus an attainable probability in the other tail that is as close as possible to, but not greater than, that one-tailed probability (Agresti, 2002).
These definitions lead to equal _P_-values in the case of symmetric distributions, i.e. when n1+ = n2+; otherwise, they possibly lead to different _P_-values and corresponding test results, each of them having advantages and disadvantages, due to the discreteness and skewness of the hypergeometric distribution. The problem is also that these _P_-values do not correspond to any well-defined two-sided test. This issue is discussed for example in (Dunne et al., 1996), where a two-sided _P_-value based on a uniformly most powerful unbiased test is proposed. However, this _P_-value is obtained with an iterative procedure, which makes this approach inadequate for the screening of hundreds of different GO categories.
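Thus, if a single simple and computationally light (see subsection 6.3) procedure were to be recommended, we would advise the doubling approach, against which there is no strong argument, and using the mid-_P_-value, in order to reduce the discreteness and conservatism effects:
$$\pi_{\mathrm{two}}^{\mathrm{d}}(n_{11}) = 2 \min\left[ P(N_{11} > n_{11}) + \tfrac{1}{2} P(N_{11} = n_{11}),\; P(N_{11} < n_{11}) + \tfrac{1}{2} P(N_{11} = n_{11}) \right] \qquad (12)$$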
A mid-_P_-value can also be defined for the minimum-likelihood approach, as the sum of the probabilities that are smaller than the probability of the observed value n11, plus half the sum of the probabilities equal to it:
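$$\pi_{\mathrm{two}}^{\mathrm{ml}}(n_{11}) = \sum_{k \,:\, P(N_{11} = k) \,<\, P(N_{11} = n_{11})} P(N_{11} = k) \;+\; \frac{1}{2} \sum_{k \,:\, P(N_{11} = k) \,=\, P(N_{11} = n_{11})} P(N_{11} = k) \qquad (13)$$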
However, we must again emphasize that the actual probability of type I error may exceed the nominal significance level.
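The doubling and minimum-likelihood definitions, with and without the mid-_P_ correction, can be sketched in a few lines. Several choices below are assumptions made for this illustration rather than prescriptions of the text: the function and key names, the cap of the doubling _P_-value at 1, and the small relative tolerance used to decide when two probabilities are equal in floating-point arithmetic. The counts are those of the small sample of Section 6.1.

```python
from math import comb

def hyper_pmf(k, n, n_de, n_go):
    """Exact null probability P(N11 = k) under Hyper(n, n_de, n_go)."""
    if k < max(0, n_de + n_go - n) or k > min(n_de, n_go):
        return 0.0
    return comb(n_de, k) * comb(n - n_de, n_go - k) / comb(n, n_go)

def two_sided_p_values(n11, n, n_de, n_go, rel_tol=1e-7):
    """Two-sided exact P-values and mid-P-values (illustrative helper)."""
    k_min, k_max = max(0, n_de + n_go - n), min(n_de, n_go)
    probs = {k: hyper_pmf(k, n, n_de, n_go) for k in range(k_min, k_max + 1)}
    p_obs = probs[n11]

    lower = sum(p for k, p in probs.items() if k <= n11)
    upper = sum(p for k, p in probs.items() if k >= n11)
    tail = min(lower, upper)

    # Values whose probability does not exceed that of the observed value.
    not_larger = sum(p for p in probs.values() if p <= p_obs * (1.0 + rel_tol))
    # Values whose probability equals that of the observed value.
    ties = sum(p for p in probs.values() if abs(p - p_obs) <= rel_tol * p_obs)

    return {
        "doubling": min(1.0, 2.0 * tail),          # cap at 1 is an assumption
        "doubling_mid": 2.0 * (tail - 0.5 * p_obs),
        "min_lik": not_larger,
        "min_lik_mid": not_larger - 0.5 * ties,
    }

result = two_sided_p_values(4, 20, 7, 6)
for name, value in result.items():
    print(name, value)
```

For this asymmetric distribution, the two definitions give clearly different values, and each mid-_P_-value is smaller than the corresponding _P_-value.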
If the approximately normal variable Z is considered (a continuous and symmetrically distributed variable), we have:
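$$p_{\mathrm{two}}(z) = P(|Z| \ge |z|) = 2\left[1 - \Phi(|z|)\right] \qquad (14)$$
where Φ denotes the cumulative distribution function of the standard normal distribution.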
If the approximately χ2 distributed variable D2 is considered, the _P_-value is computed as:
$$p_{\mathrm{two}}(d^2) = P(D^2 \ge d^2) \qquad (15)$$
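Both large-sample _P_-values can be obtained from the complementary error function, since a χ2 variable with one degree of freedom is the square of a standard normal variable. The sketch below uses hypothetical function names and the statistic d2 = 6.015 reported in Section 6.2.

```python
from math import erfc, sqrt

def normal_p_values(z):
    """One- and two-sided P-values for a standard normal statistic z."""
    return {
        "enrichment": 0.5 * erfc(z / sqrt(2.0)),    # P(Z >= z)
        "depletion": 0.5 * erfc(-z / sqrt(2.0)),    # P(Z <= z)
        "two_sided": erfc(abs(z) / sqrt(2.0)),      # P(|Z| >= |z|)
    }

def chi2_one_dof_p_value(d2):
    """P(D2 >= d2) for a chi-squared variable with one degree of freedom."""
    return erfc(sqrt(d2 / 2.0))

d2 = 6.015            # value reported in Section 6.2
z = sqrt(d2)          # positive root, since an enrichment was observed

p_normal = normal_p_values(z)
p_chi2 = chi2_one_dof_p_value(d2)
print(p_normal["two_sided"], p_chi2)
```

The two-sided normal _P_-value and the χ2 _P_-value coincide, which is why the χ2-test cannot distinguish enrichment from depletion.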
4.3 One versus two-sided tests?
Consider a dataset consisting of tissues in a pathological condition and of normal tissues, and a GO category whose genes are directly affected by the condition, i.e. the genes belonging to this GO category are DE (either over- or under-expressed). Such a GO category is likely to be over-represented among the DE genes, i.e. an enrichment is expected. Thus, detecting an enrichment is desirable. On the other hand, consider a GO category such that the normal expression of the corresponding genes is necessary for the condition to develop, i.e. the genes belonging to this GO category are not DE. Such a GO category is likely to be under-represented among the DE genes, i.e. a depletion is expected. Thus, detecting a depletion is also desirable, even if there is a risk of detecting the depletion of a GO category corresponding to genes whose normal expression is necessary to the mere survival of the species.
Thus, both enrichments and depletions of GO categories are potentially of interest. Hence, unless there is a specific reason not to consider enrichment or depletion, the adequate alternative hypothesis is Ha = E/D, i.e. two-sided tests are appropriate.
5 SUMMARY AND DISCUSSION
To summarize, there is a single exact null distribution of N11, the hypergeometric distribution, but different exact tests (exact in the sense that they are based on the exact null distribution), one or two-sided, and with several definitions of the _P_-value in the latter case. These tests can equally be called hypergeometric or Fisher's exact tests. Thus, it is not justified to claim, as Masseroli et al. (2004) do, that ‘the χ2 and Fisher's exact tests have less power than the hypergeometric and binomial distribution tests’. GFINDer and GOToolBox propose the hypergeometric test and Fisher's exact test as two alternative options: GFINDer indeed provides the same results for the two options (one-sided tests), but strangely enough, GOToolBox gives different results, whereas they should be identical for the same choice of a _P_-value (incorrect results given by some GO tools are detailed in the Appendix, which is available as supplementary data).
The available GO tools often do not explicitly state which _P_-value is computed. For example, BINGO calls the test it performs ‘hypergeometric test’ (Maere et al. 2005), without saying that it is two-sided with the minimum-likelihood approach. According to both references and websites, we could establish that FuncAssociate, GFINDer and THEA provide only one-sided tests in both directions, that FunSpec, EASEonline, GO TermFinder (CPAN), GO TermFinder (SGD), GOTM, L2L, Ontology Traverser and STEM provide only one-sided enrichment tests, and that BINGO, DAVID, eGOn, FatiGO, GeneMerge, GoMiner, GOstat, GoSurfer, NetAffx and Onto-Express provide two-sided tests, the _P_-values being computed according to the minimum-likelihood approach when a discrete distribution is used.
As discussed in section 4.3, two-sided tests are usually most appropriate. Be it with the doubling or the minimum-likelihood approach to the _P_-value, the discreteness and conservatism effects can be efficiently dealt with using mid-_P_-values, a possibility that is not offered by any of the GO tools of Table 1.
6 NUMERICAL ILLUSTRATIONS
6.1 Small sample
We consider a small sample with n = 20, n1+ = 7, n+1 = 6 and n11 = 4, i.e. f1 = 0.57 and f2 = 0.15. The null distribution of N11 ∼ Hyper(n, n1+, n+1) is shown in Figure 1. The sample being very small, we consider only the tests based on this exact distribution.
Fig. 1
Hypergeometric null distribution Hyper(20, 7, 6) (crosses). The observed value is n11 = 4 (circle).
6.1.1 One-sided test
For illustration purposes, let us first consider a one-sided test (suppose one is interested in enrichments only). The corresponding one-sided (right-tail) _P_-value equals P(N11 ≥ 4) = 7.77 × 10−2.
The one-sided mid-_P_-value is P(N11 > 4) + P(N11 = 4)/2 = 4.24 × 10−2.
There is a substantial difference between the _P_-value and the mid-_P_-value. With a significance level α = 5%, the mid-_P_-value leads to reject H0, whereas the _P_-value does not: the use of a mid-_P_-value corresponds to a less conservative test. However, the actual significance level is no longer guaranteed to be smaller than the nominal significance level 5%.
6.1.2 Two-sided tests
The two-sided doubling _P_-value equals 2 P(N11 ≥ 4) = 1.55 × 10−1.
The two-sided doubling mid-_P_-value equals 8.49 × 10−2.
As for the one-sided test, there is a substantial difference between the two values. Also, with a significance level α = 5%, a two-sided test does not reject H0.
The two-sided minimum-likelihood _P_-value equals P(N11 = 0) + P(N11 ≥ 4) = 1.22 × 10−1.
The hypergeometric distribution being here asymmetric, the doubling and minimum-likelihood _P_-values are quite different.
The two-sided minimum-likelihood mid-_P_-value equals 8.67 × 10−2.
It is always smaller than the _P_-value, and hence corresponds to a less conservative test.
6.2 Large sample
We now consider a sample whose size is analogous to that of samples encountered when testing enrichment of GO categories among DE genes on dedicated microarrays. We have n = 800, n1+ = 40, n+1 = 100, and observe n11 = 10, i.e. f1 = 0.25 and f2 = 0.12. The alternative hypothesis is Ha = E/D (two-sided test).
- The exact two-sided doubling _P_-value obtained with the hypergeometric distribution is 3.95 × 10−2, and the corresponding two-sided mid-_P_-value is 2.66 × 10−2. With the minimum-likelihood approach, the _P_-value is 2.39 × 10−2, and the two-sided mid-_P_-value is 1.74 × 10−2. Note that, the null distribution being asymmetric, there is a noticeable difference between the two approaches, and, though the sample is quite large, between the _P_-values and the corresponding mid-_P_-values.
- The approximate binomial test leads to a doubling _P_-value of 4.54 × 10−2, and to a doubling mid-_P_-value of 3.11 × 10−2, to a minimum-likelihood _P_-value of 2.75 × 10−2, and to a minimum-likelihood mid-_P_-value of 2.03 × 10−2. Note that though the sample is not small, there is quite a difference with the exact distribution.
- The approximate test of equality of two probabilities leads to the value of an approximately normal statistic z = 2.45, and to a two-sided _P_-value of ptwo(z) = 1.42 × 10−2. This value is even less accurate than that obtained with the binomial approximation, because the DE sample is too small (n1+ = 40).
- The χ2-test indeed leads to a statistic value d2 = 6.015 = z2, and hence to the same two-sided _P_-value.
In the case of larger samples, obtained with mouse or human pangenomic microarrays, typically with n of the order of 25 000:
- The approximate binomial test leads to (mid-) _P_-values that are very close to those of the exact hypergeometric test. However, with today's computing means, there is no decisive advantage in performing this approximation (see next section).
- The approximate test of equality of two probabilities becomes closer to the exact one only if the number of DE genes is large, which is not necessarily the case. There is thus no reason to use this test.
- This is hence also true for the equivalent χ2 test.
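The exact and binomial figures quoted above for n = 800 can be recomputed with a short sketch. The function and key names are hypothetical, and the cap of the doubling _P_-value at 1 and the relative tolerance used to detect equal probabilities are assumptions of this illustration.

```python
from math import comb

def two_sided_from_pmf(pmf, observed, rel_tol=1e-7):
    """Two-sided (mid-)P-values from a probability list indexed by k."""
    p_obs = pmf[observed]
    lower = sum(pmf[: observed + 1])    # P(N11 <= observed)
    upper = sum(pmf[observed:])         # P(N11 >= observed)
    tail = min(lower, upper)
    not_larger = sum(p for p in pmf if p <= p_obs * (1.0 + rel_tol))
    ties = sum(p for p in pmf if abs(p - p_obs) <= rel_tol * p_obs)
    return {
        "doubling": min(1.0, 2.0 * tail),
        "doubling_mid": 2.0 * (tail - 0.5 * p_obs),
        "min_lik": not_larger,
        "min_lik_mid": not_larger - 0.5 * ties,
    }

n, n_de, n_go, n11 = 800, 40, 100, 10   # large sample of Section 6.2
prob = n_go / n

# Exact hypergeometric null distribution, k = 0..n1+.
hyper = [comb(n_de, k) * comb(n - n_de, n_go - k) / comb(n, n_go)
         for k in range(n_de + 1)]
# Binomial approximation Bi(n1+, n+1/n).
binom = [comb(n_de, k) * prob**k * (1.0 - prob) ** (n_de - k)
         for k in range(n_de + 1)]

exact_p = two_sided_from_pmf(hyper, n11)
approx_p = two_sided_from_pmf(binom, n11)
for name in exact_p:
    print(f"{name}: exact {exact_p[name]:.3g}, binomial {approx_p[name]:.3g}")
```

The same function serves both distributions because the four definitions only depend on the list of null probabilities and on the observed value.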
6.3 Implementation with R and computational issues
All the exact tests can be implemented ‘by hand’ with the hypergeometric cumulative distribution function ‘phyper’ and the probability mass function ‘dhyper’, and the binomial approximations with ‘pbinom’ and ‘dbinom’.
The default implementation of the exact test with R provides the two-sided minimum-likelihood _P_-value. The corresponding instruction is ‘fisher.test(c)’, where the matrix c is the 2 × 2 contingency table [n11 n12; n21 n22]. The one-sided enrichment test is obtained with ‘fisher.test(c, alternative = "greater")’, the one-sided depletion test with ‘fisher.test(c, alternative = "less")’.
In order to evaluate the computation time of the two-sided tests, let us consider the case of a microarray with n = 25 000 genes, n1+ = 1000 DE genes, and 500 different GO categories. We take n+1 uniformly distributed in [0, n], and n11 uniformly distributed in [max(0, n+1 + n1+ − n), min(n1+, n+1)]. With R 2.1.0 running under Mac OS X on a 2 GHz two processor Macintosh (PowerPC 970 2.2), we obtain the following total elapsed times (mean and standard error on 20 runs) for the doubling approach:
- hypergeometric doubling _P_-values, computed with the functions ‘dhyper’ and ‘phyper’: 0.17 ± 0.02 s, and 0.20 ± 0.02 s for the mid-_P_-values.
- binomial doubling _P_-values, computed with the functions ‘dbinom’ and ‘pbinom’: 0.16 ± 0.02 s, and 0.19 ± 0.02 s for the mid-_P_-values.
Hence, the gain in time obtained by using the binomial approximation to the hypergeometric distribution is negligible.
For the minimum-likelihood approach, the R function ‘fisher.test’ (which does not only compute a _P_-value) is much slower than a computation ‘by hand’:
- hypergeometric minimum-likelihood _P_-values, computed with the function ‘fisher.test’: 17.15 ± 0.21 s.
- hypergeometric minimum-likelihood _P_-values, computed with the functions ‘dhyper’ and ‘phyper’: 1.83 ± 0.04 s, and 2.10 ± 0.05 s for the mid-_P_-values.
The computation time is hence an argument in favor of the doubling approach to the two-sided _P_-value.
7 CONCLUSION
The correct statement of the enrichment and/or depletion testing problem leads to a unique exact null distribution of the number of DE genes belonging to the GO category of interest, given the total gene number and the total number of genes belonging to the GO category. This distribution is the hypergeometric one, whose values are equivalently given by Fisher's formula for a 2 × 2 contingency table. Since both enrichments and depletions of GO categories are potentially of interest, two-sided tests are generally most appropriate. With the doubling or the popular minimum-likelihood definitions of the _P_-value, a loss of power due to the discreteness of the hypergeometric distribution is efficiently dealt with using mid-_P_-values, the doubling _P_-value involving lighter computations than the minimum-likelihood _P_-value. Finally, since many dedicated microarrays involve small data sets, and given the currently available algorithms and computing means, there is no strong argument in favor of the approximate large sample tests.
Funding to pay the Open Access publication charges for this article was provided by the CNRS and the city of Paris.
Conflict of Interest: none declared.
REFERENCES
A survey of exact inference for contingency tables, Stat. Sci., 1992, vol. 7 (pg. 131-177)
On small-sample confidence intervals for parameters in discrete distributions, Biometrics, 2001, vol. 57 (pg. 963-971)
Categorical Data Analysis, 2002, 2nd edn, Hoboken, New Jersey, John Wiley & Sons, Inc.
Reducing conservatism of exact small-sample methods of inference for discrete data, 2006, Compstat 2006, 17th Symposium of the IASC, 28 August–1 September 2006, Rome
et al. FatiGO: A web tool for finding significant associations of Gene Ontology terms with groups of genes, Bioinformatics, 2004, vol. 20 (pg. 578-580)
GOstat: find statistically overrepresented Gene Ontologies within a group of genes, Bioinformatics, 2004, vol. 20 (pg. 1464-1465)
et al. GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes, Bioinformatics, 2004, vol. 20 (pg. 3710-3715)
GeneMerge–post-genomics analysis, data mining, and hypothesis testing, Bioinformatics, 2003, vol. 19 (pg. 891-892)
et al. NetAffx Gene Ontology Mining Tool: a visual approach for microarray data analysis, Bioinformatics, 2004, vol. 20 (pg. 1462-1463)
et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery, Genome Biol., 2003, vol. 4 (pg. R60)
et al. Global functional profiling of gene expression, Genomics, 2003, vol. 81 (pg. 98-104)
et al. Two-sided _P_-values from discrete asymmetric distributions based on uniformly most powerful unbiased tests, The Statistician, 1996, vol. 45 (pg. 397-405)
eGOn Reference Manual, 2004
et al. Clustering short time series gene expression data, Bioinformatics, 2005, vol. 21, Suppl. 1 (pg. i159-i168)
The logic of inductive inference, J. Royal Stat. Soc., 1935, vol. 98 (pg. 39-54)
_P_-values: interpretation and methodology, Am. Stat., 1975, vol. 29 (pg. 20-25)
et al. Identifying biological themes within lists of genes with EASE, Genome Biol., 2003, vol. 4 (pg. R70)
et al. Profiling gene expression utilizing onto-express, Genomics, 2002, vol. 79 (pg. 266-270)
Ontological analysis of gene expression data: current tools, limitations, and open problems, Bioinformatics, 2005, vol. 21 (pg. 3587-3595)
Testing Statistical Hypotheses, 1986, 2nd edn, Springer-Verlag New York, LLC
et al. BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in Biological Networks, Bioinformatics, 2005, vol. 21 (pg. 3448-3449)
et al. GOToolbox: functional analysis of gene datasets based on Gene Ontology, Genome Biol., 2004, vol. 5 (pg. R101)
et al. GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis and mining, Nucleic Acids Res., 2004, vol. 32 (pg. W293-W300)
Exact inference for categorical data, Encyclopedia of Biostatistics, 1998, vol. 2, Chichester, UK, Wiley (pg. 1411-1422)
et al. Introduction to the Theory of Statistics, 1974, 3rd edn, McGraw-Hill, International Edition
L2L: a simple tool for discovering the hidden significance in microarray expression data, Genome Biol., 2005, vol. 6 (pg. R8)
et al. THEA: ontology-driven analysis of microarray data, Bioinformatics, 2004, vol. 20 (pg. 2636-2643)
et al. FunSpec: a web-based cluster interpreter for yeast, BMC Bioinformatics, 2002, vol. 3 (pg. 35)
CLENCH: a program for calculating Cluster ENriCHment using the Gene Ontology, Bioinformatics, 2004, vol. 20 (pg. 1196-1197)
Test of significance for 2 × 2 contingency tables, J. Royal Stat. Soc. Series A, 1984, vol. 147 (pg. 426-463)
et al. Ontology Traverser: an R package for GO analysis, Bioinformatics, 2005, vol. 21 (pg. 275-276)
et al. GoMiner: a resource for biological interpretation of genomic and proteomic data, Genome Biol., 2003, vol. 4 (pg. R28)
et al. GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies, BMC Bioinformatics, 2004, vol. 5 (pg. 16)
et al. GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in gene ontology space, Appl. Bioinformatics, 2004, vol. 3 (pg. 261-264)
1 Random variables and their realizations are denoted respectively by uppercase and lowercase letters.
2 As a matter of fact, Fisher (1935) describes a one-sided test in the direction of the observed departure from the null hypothesis.
3 The code of the R functions can be found at the R project site https://svn.r-project.org/R/trunk/src/nmath/. The best known and most complete software for contingency table methods in general is StatXact (Agresti, 2006).
Author notes
Associate Editor: Jonathan Wren
© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.