PACK: Profile Analysis using Clustering and Kurtosis to find molecular classifiers in cancer (original) (raw)

Abstract

Motivation: Elucidating the molecular taxonomy of cancers and finding biological and clinical markers from microarray experiments is problematic due to the large number of variables being measured. Feature selection methods that can identify relevant classifiers or that can remove likely false positives prior to supervised analysis are therefore desirable.

Results: We present a novel feature selection procedure based on a mixture model and a non-gaussianity measure of a gene's expression profile. The method can be used to find genes that define either small outlier subgroups or major subdivisions, depending on the sign of kurtosis. The method can also be used as a filtering step, prior to supervised analysis, in order to reduce the false discovery rate. We validate our methodology using six independent datasets by rediscovering major classifiers in ER negative and ER positive breast cancer and in prostate cancer. Furthermore, our method finds two novel subtypes within the basal subgroup of ER negative breast tumours, associated with apoptotic and immune response functions respectively, and with statistically different clinical outcome.

Availability: An _R_-function pack that implements the methods used here has been added to vabayelMix, available from (Author Webpage).

Contact: aet21@cam.ac.uk

Supplementary information: Supplementary information is available at Bioinformatics online.

INTRODUCTION

Over the last few years there have been numerous microarray experiments that have profiled considerable number of tumours in an attempt to elucidate the molecular taxonomy of cancers and to derive various biological or clinical markers. A well-known problem associated with deriving such markers is caused by the relatively small sample sizes, which leads to derived sets of candidate markers containing a significant number of false positives. Applying methods to estimate false discovery rates (FDR) may help determine the approximate number of false positives, but they don't help in separating the false positives from the true positives. Another common problem associated with finding relevant biological markers is that, very often candidate markers are derived from heterogeneous groups of tumours.

In view of these difficulties it is of great importance to complete the molecular taxonomy of cancers and to apply pattern recognition methods that can pick out the true biological classifiers from others whose association with biological, histological or clinical factors is spurious or much less certain. In this context, a fundamental assumption, which can be validated a posteriori, is that relevant biological classifiers distinguish two (or more) distinct populations of samples. Mathematically, this is tantamount to saying that the expression profile of a classifier is a mixture of at least two distributions. Conversely, false positives are, in spite of their association with a biological or clinical feature, more likely to be described by a single distribution. Thus, application of a pattern recognition method that can infer the number of clusters in individual gene expression profiles may help to identify the relevant markers among the larger set of positives, effectively and significantly reducing the FDR.

Recently, a simple model-free implementation of such a pattern recognition method called COPA (Cancer Outlier Profile Analysis) was applied to several prostate cancer datasets, which detected relevant prostate cancer markers (Tomlins et al., 2005). COPA wasmotivated by a desire to find small subclasses of tumours with ‘outlier’ gene expression, and it is very likely that such outlier subclasses (or clusters) can be captured by a pattern recognition algorithm that explicitly infers the clusters.

We propose here a more general model-based pattern recognition method for feature selection, which we call PACK (Profile Analysis using Clustering and Kurtosis). Depending on the application it can be used to either find major molecular classifiers, or genes defining small outlier subgroups, or simply as a filtering step prior to supervised analysis to remove likely false positives. The general approach is based on two steps.1 First, we learn, for each gene separately, the number of clusters present in its one dimensional expression profile.2 Second, for those gene profiles that predict two clusters we compute a measure of non-gaussianity (kurtosis) and rank them according to this measure. This measure can be used to pick out genes that are either discriminating approximately sized subgroups or genes that are defining small ‘outlier’ subgroups. Finally, association of the features with a binary biological, histological or clinical phenotype may be determined using Fisher’s exact test, or for multi-class phenotypes using a χ2-test.

Applying this methodology to six cancer microarray datasets we are able to rediscover well-known classifiers in ER negative and ER positive breast cancer and in prostate cancer, thus validating our approach, as well as finding novel classifiers in ER negative breast tumours. Furthermore, by clustering over the selected features we are able to provide more compact subcategorizations of ER negative tumours.

SYSTEMS AND METHODS

Data preparation

For breast cancer we used the microarray sets of 295 tumors (226 ER+ and 69 ER−) from van de Vijver et al. (2002), 285 tumors (208 ER+ and 77 ER−) from Wang et al. (2005) and an internal dataset of 133 primary breast tumors3 (93 ER+ and 40 ER−). For prostate cancer, we used the microarray sets of 103 samples from Laponite et al. (2004), 32 samples from Welsh et al. (2001) and 102 samples from Singh et al. (2002). In all cases we used the normalized data, which are freely available in the public domain (see References). The retrieved datasets were further normalized, if necessary, by transforming them onto a common log2-scale and shifting the median of each array to zero.

For breast cancer we considered the two clinical subgroups, ER+ and ER−, based on the expression or absence of expression of the estrogen receptor gene _ER_α as determined by immunohistochemistry. Since we were interested in finding novel classifiers in ER- breast cancer and sample sizes for this subgroup were relatively small, the breast cancer sets were also integrated together using first the Z-score transformation (Cheadle et al., 2003) for each gene in each dataset and then fusing the Z-score transformed gene data together across all relevant studies. The Z-score transformation for each gene in each study was done using only the samples within the clinical subgroup considered. This resulted in integrated matrices over 1131 common genes and 527 ER+ and 186 ER− samples.

Feature selection strategy

The hypothesis is that genes that are true biological or clinical markers have expression profiles, which are generated by a mixture of two or more underlying distributions, while spurious features are more likely to have profiles generated by a single distribution. Similar ideas were put forward in Dozmorov et al. (2004) using F-statistics but where no explicit mixture modelling was attempted, and in Zhao et al. (2004) where mixture modelling and kurtosis were used but in an explicit supervised context.

Step 1: learning the number of clusters

Given a gene’s expression profile x=(x1,…, xn) we regard this as a random sample of a univariate random variable X whose density function is a mixture of gaussians, i.e.

p(xi∣θ)=∑k=1CMπkG(xi∣μk,σk),

(1)

where πk are the weights of the components, (μk,σk) are the mean and standard deviation of the univariate gaussian k and θ denotes the set of all parameters. In the above, CM denotes the number of clusters, which can be inferred using one of various approaches. One possibility is to use the EM algorithm to learn the parameters for a wide range of different CM and perform model selection (finding the optimal CM⁠, denoted C) using the Bayesian Information Criterion (BIC) score (Schwarz, 1978; Yeung et al., 2001). Alternatively, the optimal number of clusters, C, can be inferred using a lower bound on the model evidence as provided by a variational Bayesian approach (Attias, 1999; MacKay, 1995; Teschendorff et al., 2005). In this approach, C is treated as an internal parameter and is inferred together with the means and variances of the clusters. The results we report here were obtained using the EM+BIC algorithm for model selection. Very similar results, however, were obtained using the variational Bayesian framework.4

Two notes with this clustering are in order: (1) To safeguard us from picking out genes that predicted structure only because of a very small number of ‘outlier’ samples we built in a threshold on the size of the smallest predicted cluster, so that genes for which the minimum cluster size was less than 5 were excluded from further analysis. (2) We didn't consider models with more than two clusters. While this might miss profiles that are best described by two major clusters and one very small outlier cluster of say 1 or 2 samples, we verified that these constituted a very small number of cases.

Step 2: using kurtosis to characterize features and find relevant classifiers

For a gaussian E[(X−X¯)4]=3E[(X−X¯)2]2⁠, so that K(X)=0⁠. Most non-gaussian distributions necessarily have either K>0⁠, in which case they are called supergaussian or leptokurtic, or K<0⁠, in which case they are called subgaussian or platykurtic. It is clear that many of the genes selected in step 1 above must either have significantly positive or significantly negative kurtosis values, since they are described by bi-modal, and hence non-gaussian, distributions. Specifically, a mixture of two approximately equal mass gaussians must have negative kurtosis since the two modes on either side of the center of mass effectively flatten out the distribution (Fig. 1A). Thus, genes that define major subdivisions are described by distributions with negative kurtosis. Similarly, a mixture of two gaussians with highly unequal masses (⁠π1≪π2 say) must have positive kurtosis since the small mass gaussian effectively extends or lengthens the tail of the more dominant gaussian (Fig. 1B). It follows that genes that define small ‘outlier’ subgroups are described by distributions with positive kurtosis.

(A) A bi-modal distribution where the modes have approximately equal mass has negative kurtosis. (B) A bi-modal distribution where one of the modes has small relative mass and hence constitutes an ‘outlier’ subgroup has positive kurtosis. (C) Kurtosis profile as a function of effect size (e=μ/σ, μ= difference between cluster means, σ = standard deviation of clusters) and relative cluster mass (expressed in percentage i.e πmin∗100, where πmin=min{π1,π2}).

Fig. 1

(A) A bi-modal distribution where the modes have approximately equal mass has negative kurtosis. (B) A bi-modal distribution where one of the modes has small relative mass and hence constitutes an ‘outlier’ subgroup has positive kurtosis. (C) Kurtosis profile as a function of effect size (⁠e=μ/σ⁠, μ= difference between cluster means, σ = standard deviation of clusters) and relative cluster mass (expressed in percentage i.e πmin∗100⁠, where πmin=min{π1,π2}⁠).

For a gene described by a mixture of two gaussians the kurtosis, K, is a function of two parameters5: the effect size of the gene, as defined by the effective separation between the two clusters (i.e. e=μ/σ⁠), and the ratio of cluster weights (i.e. R=π1/(1−π1)⁠). Specifically, a short computation reveals that

K(e,R)=e4R(R−a)(R−b)(1+R)4(1+R(1+R)2e2)2,

(3)

where a and b are the quadratic roots 2±3⁠. The variation of kurtosis as a function of these two parameters is shown in Figure 1C. Clearly, as the effect size is increased the kurtosis decreases monotonically, asymptotically approaching the lower bound −2, for ‘platykurtic’ genes (i.e. genes with πmin≥0.3⁠), while the kurtosis increases monotonically without bound for ‘leptokurtic’ genes (i.e genes with πmin≤0.15⁠). Thus, kurtosis provides a useful measure for ranking genes based on either how leptokurtic or platykurtic their profiles are.

Given a gene’s expression profile x=(x1, …, xn) an unbiased estimate for the kurtosis (Snedecor and Cochran, 1967) is

K(x)=^n(n+1)∑i=1n (xi−x¯)4(n−1)(n−2)(n−3)σ4−3(n−1)2(n−2)(n−3)

(4)

where x¯ and σ are the mean and standard deviation estimates of the distribution. Only for very large sample sizes (⁠n>1000⁠) can the above estimator be approximated by a normal distribution of approximate standard error 24/n (Snedecor and Cochran, 1967).

RESULTS AND DISCUSSION

Unsupervised context: finding major classifiers and subclasses

(A) Meta-analysis

We first applied our methodology individually to each study and performed a meta-analysis to see whether we would be able to rediscover the major classifiers within ER negative breast tumours and prostate cancer. Due to the smaller samples sizes within individual studies we applied the version without prior clustering (PAK) as well as PACK.

ER negative breast tumours. The top 20 platykurtic genes as predicted by PAK and ranked according to their weighted average kurtosis6 over the three cohorts are shown in Table 1 (see Supplementary Table S1 for the full list). It is clear that many of the genes in Table 1 play an already well-established and important role in the classification of breast tumours. For example, ERBB2 (=HER2)(ranked 13-th) is a well-known classifier with overexpression defining a subgroup of ER− tumours with significantly worse prognosis than ER+ samples (Sorlie et al., 2003; Sotiriou et al., 2003). Similarly, CDKN2A (=p16) (ranked 3rd) is a well-known marker of the basal subtype, which constitutes the other main subclass of ER− tumours (Sorlie et al., 2003; Sotiriou et al., 2003). Interestingly, the androgen receptor gene AR, which has recently been suggested as defining a larger class of ER− tumours (the apocrine subgroup) that includes ER−/HER2+ tumours (Farmer et al., 2005), was also highly ranked (15-th). Furthermore, we noted that kurtosis values were fairly stable across the independent cohorts. In contrast, the clustering step was less stable across studies (Table 1). Thus, even though PACK ranked ERBB2 higher (4-th), it did not predict structure for AR and CDKN2A in Naderi’s study (Table 1), presumably as a result of its smaller cohort size.

Table 1

Top 20 platykurtic genes out of a total of 1131 overlapping genes in ER negative subgroup as predicted by PAK.

Gene WAK Vijver Wang Naderi Vijver:PACK Wang:PACK Naderi:PACK Merged Cluster size (%)
FOXA1 −1.37 −1.46 −1.48 −0.99 Y Y Y −1.35 48
GABRP −1.1 −1.32 −1.17 −0.6 Y Y Y −1.08 50
CDKN2A −1.08 −1.18 −1.18 −0.71 Y Y N −1.06 45
TFAP2B −1.03 −0.68 −1.36 −0.98 Y Y N −1.01 42
CXCL9 −1 −0.73 −1.11 −1.22 N Y Y −0.97 50
CRYAB −0.97 −1.1 −0.98 −0.71 Y Y N −0.95 34
BANP −0.94 −0.88 −1.06 −0.81 Y Y N −0.92 36
S100A8 −0.88 −0.81 −0.91 −0.96 Y Y Y −0.86 43
PER2 −0.87 −0.69 −0.96 −1 N Y N −0.84 39
UBD −0.86 −1.13 −0.73 −0.63 Y N N −0.83 18
MIA −0.83 −1.42 −0.79 0.12 Y N Y −0.81 39
SOX11 −0.81 −0.76 −0.59 −1.32 N N Y −0.78 39
ERBB2 −0.81 −1.22 −0.12 −1.43 Y Y Y −0.78 28
G1P2 −0.79 −0.96 −0.9 −0.29 Y N N −0.77 38
HLA−F −0.78 −1 −0.71 −0.57 N N N −0.76 N
AR −0.76 −0.82 −0.65 −0.87 Y Y N −0.74 39
DUSP6 −0.75 −0.73 −1.01 −0.29 N Y N −0.73 37
MX1 −0.75 −0.43 −0.94 −0.93 N N Y −0.72 37
EGR1 −0.73 −0.36 −0.97 −0.9 N N Y −0.7 N
KIAA1324 −0.72 −0.68 −0.75 −0.74 N N N −0.7 N
Gene WAK Vijver Wang Naderi Vijver:PACK Wang:PACK Naderi:PACK Merged Cluster size (%)
FOXA1 −1.37 −1.46 −1.48 −0.99 Y Y Y −1.35 48
GABRP −1.1 −1.32 −1.17 −0.6 Y Y Y −1.08 50
CDKN2A −1.08 −1.18 −1.18 −0.71 Y Y N −1.06 45
TFAP2B −1.03 −0.68 −1.36 −0.98 Y Y N −1.01 42
CXCL9 −1 −0.73 −1.11 −1.22 N Y Y −0.97 50
CRYAB −0.97 −1.1 −0.98 −0.71 Y Y N −0.95 34
BANP −0.94 −0.88 −1.06 −0.81 Y Y N −0.92 36
S100A8 −0.88 −0.81 −0.91 −0.96 Y Y Y −0.86 43
PER2 −0.87 −0.69 −0.96 −1 N Y N −0.84 39
UBD −0.86 −1.13 −0.73 −0.63 Y N N −0.83 18
MIA −0.83 −1.42 −0.79 0.12 Y N Y −0.81 39
SOX11 −0.81 −0.76 −0.59 −1.32 N N Y −0.78 39
ERBB2 −0.81 −1.22 −0.12 −1.43 Y Y Y −0.78 28
G1P2 −0.79 −0.96 −0.9 −0.29 Y N N −0.77 38
HLA−F −0.78 −1 −0.71 −0.57 N N N −0.76 N
AR −0.76 −0.82 −0.65 −0.87 Y Y N −0.74 39
DUSP6 −0.75 −0.73 −1.01 −0.29 N Y N −0.73 37
MX1 −0.75 −0.43 −0.94 −0.93 N N Y −0.72 37
EGR1 −0.73 −0.36 −0.97 −0.9 N N Y −0.7 N
KIAA1324 −0.72 −0.68 −0.75 −0.74 N N N −0.7 N

‘WAK’ gives the weighted average kurtosis over the three cohorts. ‘Vijver’, ‘Wang’ and ‘Naderi’ columns give the gene kurtosis values in the respective studies. The ‘PACK’ columns state whether the clustering step predicted structure (‘Y’ = Yes, ‘N’ = No) in the gene’s expression profile. ‘Merged’ gives kurtosis value over the integrated cohort while ‘Cluster Size’ gives the cluster size (in percentage of the total cohort size) of the smallest subgroup predicted by PACK. ‘N’ indicates that PACK predicted no structure in that gene’s expression profile.

Table 1

Top 20 platykurtic genes out of a total of 1131 overlapping genes in ER negative subgroup as predicted by PAK.

Gene WAK Vijver Wang Naderi Vijver:PACK Wang:PACK Naderi:PACK Merged Cluster size (%)
FOXA1 −1.37 −1.46 −1.48 −0.99 Y Y Y −1.35 48
GABRP −1.1 −1.32 −1.17 −0.6 Y Y Y −1.08 50
CDKN2A −1.08 −1.18 −1.18 −0.71 Y Y N −1.06 45
TFAP2B −1.03 −0.68 −1.36 −0.98 Y Y N −1.01 42
CXCL9 −1 −0.73 −1.11 −1.22 N Y Y −0.97 50
CRYAB −0.97 −1.1 −0.98 −0.71 Y Y N −0.95 34
BANP −0.94 −0.88 −1.06 −0.81 Y Y N −0.92 36
S100A8 −0.88 −0.81 −0.91 −0.96 Y Y Y −0.86 43
PER2 −0.87 −0.69 −0.96 −1 N Y N −0.84 39
UBD −0.86 −1.13 −0.73 −0.63 Y N N −0.83 18
MIA −0.83 −1.42 −0.79 0.12 Y N Y −0.81 39
SOX11 −0.81 −0.76 −0.59 −1.32 N N Y −0.78 39
ERBB2 −0.81 −1.22 −0.12 −1.43 Y Y Y −0.78 28
G1P2 −0.79 −0.96 −0.9 −0.29 Y N N −0.77 38
HLA−F −0.78 −1 −0.71 −0.57 N N N −0.76 N
AR −0.76 −0.82 −0.65 −0.87 Y Y N −0.74 39
DUSP6 −0.75 −0.73 −1.01 −0.29 N Y N −0.73 37
MX1 −0.75 −0.43 −0.94 −0.93 N N Y −0.72 37
EGR1 −0.73 −0.36 −0.97 −0.9 N N Y −0.7 N
KIAA1324 −0.72 −0.68 −0.75 −0.74 N N N −0.7 N
Gene WAK Vijver Wang Naderi Vijver:PACK Wang:PACK Naderi:PACK Merged Cluster size (%)
FOXA1 −1.37 −1.46 −1.48 −0.99 Y Y Y −1.35 48
GABRP −1.1 −1.32 −1.17 −0.6 Y Y Y −1.08 50
CDKN2A −1.08 −1.18 −1.18 −0.71 Y Y N −1.06 45
TFAP2B −1.03 −0.68 −1.36 −0.98 Y Y N −1.01 42
CXCL9 −1 −0.73 −1.11 −1.22 N Y Y −0.97 50
CRYAB −0.97 −1.1 −0.98 −0.71 Y Y N −0.95 34
BANP −0.94 −0.88 −1.06 −0.81 Y Y N −0.92 36
S100A8 −0.88 −0.81 −0.91 −0.96 Y Y Y −0.86 43
PER2 −0.87 −0.69 −0.96 −1 N Y N −0.84 39
UBD −0.86 −1.13 −0.73 −0.63 Y N N −0.83 18
MIA −0.83 −1.42 −0.79 0.12 Y N Y −0.81 39
SOX11 −0.81 −0.76 −0.59 −1.32 N N Y −0.78 39
ERBB2 −0.81 −1.22 −0.12 −1.43 Y Y Y −0.78 28
G1P2 −0.79 −0.96 −0.9 −0.29 Y N N −0.77 38
HLA−F −0.78 −1 −0.71 −0.57 N N N −0.76 N
AR −0.76 −0.82 −0.65 −0.87 Y Y N −0.74 39
DUSP6 −0.75 −0.73 −1.01 −0.29 N Y N −0.73 37
MX1 −0.75 −0.43 −0.94 −0.93 N N Y −0.72 37
EGR1 −0.73 −0.36 −0.97 −0.9 N N Y −0.7 N
KIAA1324 −0.72 −0.68 −0.75 −0.74 N N N −0.7 N

‘WAK’ gives the weighted average kurtosis over the three cohorts. ‘Vijver’, ‘Wang’ and ‘Naderi’ columns give the gene kurtosis values in the respective studies. The ‘PACK’ columns state whether the clustering step predicted structure (‘Y’ = Yes, ‘N’ = No) in the gene’s expression profile. ‘Merged’ gives kurtosis value over the integrated cohort while ‘Cluster Size’ gives the cluster size (in percentage of the total cohort size) of the smallest subgroup predicted by PACK. ‘N’ indicates that PACK predicted no structure in that gene’s expression profile.

Prostate tumours. We hypothesized again that major prostate classifiers could be found by focusing on those genes with most negative kurtosis values, in particular, we sought to rediscover the ERG transcription factor, which was shown in Tomlins et al. (2005) to define an overexpressed subgroup of prostate tumours at a relatively low centile of ∼75%. Strikingly, applying PAK in a meta-analysis of three prostate cohorts (Lapointe et al., 2004; Welsh et al., 2001; Singh et al., 2002) showed that, overall, ERG was ranked first out of 4360 overlapping genes (Supplementary Table S2). Similarly, PACK ranked ERG first out of a total of 512 overlapping genes with bi-modal profiles (Table 2 and Supplementary Table S2). Other well-known important classifiers in prostate cancer were ranked by PACK in the top 5% of genes (Table 2). For example, the important role that ERBB2 plays in prostate tumours has been highlighted before (Ullen et al., 2005; Agus et al., 2005). Similarly, VEGF and the _VEGF-_receptor FLT1 are important to prostate tumour angiogenesis (Caine et al., 2004) and proapoptopic gene therapy (Kaliberov et al., 2004). The implications of aberrant expression of MUC1 in prostate cancer have also been extensively studied (Cozzi et al., 2005; Singh et al., 2006). Finally, CD44 is an important differentially methylated promoter in prostate cancer cell lines (Wang et al., 2005) with implications in invasion and metastasis (Harrison et al., 2006).

Table 2

PACK in prostate cancer: selected platykurtic genes among the top 5% of a total of 512 probes that predicted structure in all three studies, together with their kurtosis values in the individual cohorts and the WAK over all three cohorts

Gene WAK Lapointe Welsh Singh Rank
ERG −1.44 −1.68 −1.16 −1.28 1
FLT1 −0.74 −0.55 0.02 −1.17 9
CD44 −0.69 −0.83 0.1 −0.81 16
MUC1 −0.68 −0.87 −0.93 −0.4 18
ERBB2 −0.59 −0.75 −1.04 −0.28 24
VEGF −0.59 −0.69 −0.15 −0.62 25
Gene WAK Lapointe Welsh Singh Rank
ERG −1.44 −1.68 −1.16 −1.28 1
FLT1 −0.74 −0.55 0.02 −1.17 9
CD44 −0.69 −0.83 0.1 −0.81 16
MUC1 −0.68 −0.87 −0.93 −0.4 18
ERBB2 −0.59 −0.75 −1.04 −0.28 24
VEGF −0.59 −0.69 −0.15 −0.62 25

The last column gives the PACK rank position of the gene (as given by WAK) out of 512 genes.

Table 2

PACK in prostate cancer: selected platykurtic genes among the top 5% of a total of 512 probes that predicted structure in all three studies, together with their kurtosis values in the individual cohorts and the WAK over all three cohorts

Gene WAK Lapointe Welsh Singh Rank
ERG −1.44 −1.68 −1.16 −1.28 1
FLT1 −0.74 −0.55 0.02 −1.17 9
CD44 −0.69 −0.83 0.1 −0.81 16
MUC1 −0.68 −0.87 −0.93 −0.4 18
ERBB2 −0.59 −0.75 −1.04 −0.28 24
VEGF −0.59 −0.69 −0.15 −0.62 25
Gene WAK Lapointe Welsh Singh Rank
ERG −1.44 −1.68 −1.16 −1.28 1
FLT1 −0.74 −0.55 0.02 −1.17 9
CD44 −0.69 −0.83 0.1 −0.81 16
MUC1 −0.68 −0.87 −0.93 −0.4 18
ERBB2 −0.59 −0.75 −1.04 −0.28 24
VEGF −0.59 −0.69 −0.15 −0.62 25

The last column gives the PACK rank position of the gene (as given by WAK) out of 512 genes.

(B) Integrated analysis

To validate our methodology further and to facilitate the discovery of novel tumour subtypes we next applied the algorithms to the merged datasets in ER− and ER+ breast cancer. We first validated the integration based on the Z-score by clustering the breast cancer ER+ and ER− meta-cohorts across the common genes into 2, 3, 4, 5 and 6 clusters (using k-means), which showed that samples did not cluster according to study (Kruskal–Wallis P>0.1 in all cases).

ER negative breast tumours. Based on past studies, ER− tumours can be subdivided into those that express ERBB2 and those that don't and instead that express genes characteristic of basal cells (Sorile et al., 2003; Sotiriou et al., 2003). The taxonomy of the basal subgroup has however not been completely elucidated. The clustering-step of PACK resulted in a total of 312 genes whose expression profiles significantly divided the meta-cohort into two or more clusters and with the smallest cluster containing at least five samples (Supplementary Table S1). To estimate the FDR under PACK we considered a null-hypothesis model in which the 1131 expression profiles are drawn from a single gaussian and found an average of 7 ± 2 false positives, leading to a small FDR (<0.05). The 312 genes were then ranked according to their platykurtic measure (negative-kurtosis). The top 17 platykurtic genes were among the top 20 obtained by the meta-analysis of PAK. As expected, almost all of these top genes defined major subgroups with cluster sizes in the range 30−50% of the total cohort size. Expression profiles across the meta-cohort with the predicted clusterings and kernel density estimates for some of these genes are shown in Figure 2. In all we found 51 genes, each predicting two major clusters and all with negative kurtosis values. Hierarchical clustering over these 51 genes revealed significant correlations between many of these yielding three main clusters of genes and samples (Supplementary Figure S1). One of the main clusters was dominated by higher expression of HER2 and AR as well as further steroid response genes. The other two clusters were characterized by low expression of steroid response genes (including AR) and high expression of genes characteristic of basal cells (basal group). One of the basal subclasses was dominated by higher expression of genes related to cell proliferation and fatty-acid metabolism CDKN2A, GABRP, PLA2G4A, FABP7, EN1, while the other had additional overexpression of immune response genes IGLC2, PLAC8, CXCL9, C1QA. To obtain higher confidence between the gene clusters and functional associations we clustered all previously derived 312 genes. This reproduced the clustering structure based on only the 51 platykurtic genes yielding well-defined ER−/AR+ and ER−/AR− groups. Moreover, it confirmed that the ER−/AR− class is subdivided further into two subclasses, one overexpressing genes related to apoptosis and cell-proliferation (⁠P<0.01⁠) [after correction for multiple testing using GOTM (Zhang et al., 2004)] functions and another overexpressing genes related to immune response (GOTM, P<0.01)⁠.

Predicted clustering of samples for selected major platykurtic genes within ER− subgroup together with histograms and kernel density estimation. Clusters labelled by symbols, colours label the different studies: van de Vijver (grey), Wang (black) and Naderi (darkgrey).

Fig. 2

Predicted clustering of samples for selected major platykurtic genes within ER− subgroup together with histograms and kernel density estimation. Clusters labelled by symbols, colours label the different studies: van de Vijver (grey), Wang (black) and Naderi (darkgrey).

Association of the platykurtic genes with clinical outcome was tested using Fisher’s exact test (treating clinical outcome as a binary phenotype). Interestingly, for two of the genes in the immune response cluster, C1QA and IGLC2, we found significant association with outcome (P < 0.05) (Fig. 3). Moreover, samples in the immune response cluster overexpressing these genes had better overall outcome relative to the steroid response and cell-proliferation subclasses (Kruskal–Wallis, P < 0.05). Therefore, it appears that within the ER− class there is a subclass of ER−/AR− tumours that overexpress immune response genes and that has better prognosis.

Predicted clustering of samples for two platykurtic genes, C1QA and IGLC2, within ER− breast tumours, defining a good outcome basal subtype. Predicted clusters are labelled by symbols. Grey = good outcome, black = poor outcome).

Fig. 3

Predicted clustering of samples for two platykurtic genes, C1QA and IGLC2, within ER− breast tumours, defining a good outcome basal subtype. Predicted clusters are labelled by symbols. Grey = good outcome, black = poor outcome).

ER positive breast tumours. As shown in (Sorlie et al., 2003; Sotiriou et al., 2003), the ER+ or luminal type can be subdivided into two major luminal subclasses, A and B, with the luminal-A subclass having significantly better outcome. Applying PACK yielded 647 genes with bi-modal profiles (with a FDR < 0.05), which we then ranked according to their negative kurtosis (Supplementary Table S3). Among the top 60 ranked genes were NAT1, SCUBE2, NPY1R, TFF1, TFAP2B and AREG, all of which clustered together with overexpression defining one subtype (luminal-A), and for two of which (TFAP2B and SCUBE2) there was significant association with outcome (Fisher test, P<0.05⁠). Similarly, there were also genes for which overexpression defined a much smaller luminal-B subtype, notably, LAPTM4B, ZWINT and SQLE. We verified our definitions of luminal A and B subtypes by dividing the samples, using k-means clustering over the top 60 genes, into two groups and testing for association with outcome, which showed that luminal-type A had significantly better outcome (⁠P<0.05⁠). Moreover, the genes defining the luminal A and B clusters here are similar to those noted previously (Sorile et al., 2003).

(C) Comparison with F-statistics

Given that the kurtosis measure proved useful in picking out important markers in both breast and prostate cancer, we next asked how it compared with an alternative measure, such as the _F_-statistic [see e.g. Dozmorov et al. (2004)]. We therefore computed the _F_-statistic (ratio of variance between clusters and variance within clusters) for all the genes that predicted structure in the clustering step, and then ranked the genes according to the significance of their _F_-values. While the _F_-statistic also ranked the major known classifiers in ER− breast cancer at the top, a drawback is the presence of top-ranked genes with positive kurtosis profiles (Supplementary Table S4 and Figure S2). The presence of genes defining outlier subgroups was most notable in prostate cancer (Lapointe’s and Singh’s cohorts) where <50% of the top 100 probes had negative kurtosis profiles. Thus, kurtosis provided a more direct and robust measure of bi-modality that is relevant to identifying major subclasses within tumours.

Supervised context (PAC): a clustering criterion for feature selection

Another of our hypotheses is that features that exhibit structure in their gene expression profiles and that correlate with a phenotype are less likely to be false positives than those that only correlate with the phenotype. This is similar but not identical to the hypothesis, proved in Bair and Tibshirani (2004), that selecting the top ranked features of the first supervised principal component improves the FDR over just selecting the top correlated features. Since a supervised principal component captures the features with most variance within the subset of features that correlate with a phenotype, it is likely that selecting variables based on whether they predict structure should also improve the FDR. Moreover, important features with small effect sizes may not be easily captured using variance maximization even if the latter is done on a preselected set of relevant features. To test our hypothesis we used the same procedure and context as used in Bair and Tibshirani (2004). Thus, we first ranked features according to their Cox-scores in univariate Cox regression using a training set.7 Next, we computed the mean Cox score for the top n ranked genes in an independent test set for different values of n. Similarly, we computed the mean Cox score for the top n ranked genes, that also predicted structure in the training set, in the test set and for the same values of n. To avoid any bias associated with a specific choice of training set we repeated the analysis for 50 different choices of training and test sets. In all cases training and test set sizes where chosen as 60 and 40% of the cohort size. We found, consistently across all three studies, that clustering expression profiles to select prognostic features that have at least two clusters identified features that performed better in independent test sets, thus effectively reducing the FDR (Fig. 4).

Mean Cox score of the top n prognostic genes, derived from a training set, in an independent test set, for each breast cancer cohort separately. Data points represent averages over 50 different choices of training and test sets. In all cases, training-test set sizes were 60% and 40% of the cohort size. Data points for top prognostic genes identified with a prior cluster-number learning step are shown in black and without prior cluster-number learning in grey.

Fig. 4

Mean Cox score of the top n prognostic genes, derived from a training set, in an independent test set, for each breast cancer cohort separately. Data points represent averages over 50 different choices of training and test sets. In all cases, training-test set sizes were 60% and 40% of the cohort size. Data points for top prognostic genes identified with a prior cluster-number learning step are shown in black and without prior cluster-number learning in grey.

CONCLUSIONS

We have presented a novel strategy for feature selection in gene expression analysis that picks out genes with bi- or multi-modal profiles. We have validated our methodology on a meta-cohort of three major ER+ and ER− breast cancer datasets as well as three prostate cancer sets, by showing that it can rederive some of the most important classifiers in these tumours. Our results clearly confirm the presence of two main subtypes within ER− and ER+ tumours. Moreover, using PACK we were able to divide the ER− basal subtype further into apoptotic and immune response subtypes, the latter group having significantly better outcome relative to the rest of ER− samples.

In principle, PACK could also be used to find genes defining small subgroups with outlier expression by ranking genes according to their positive kurtosis. For example, by applying PACK with positive kurtosis as the ranking measure to Lapointe et al. (2004) and Welsh et al. (2001), we were able to pick out ETV1 (Tomlins et al., 2005) as ranked among the top 0.5% at specific high centiles (i.e. it was highly ranked when compared with genes defining subgroups of the same size). On the other hand, when considered over all centiles it was only ranked among the top 10% in each study. This suggests that (positive) kurtosis, because it is unbounded from above and its variability increases with magnitude, may be a less useful measure for picking out genes defining small outlier subgroups. It is possible however that applying PACK in a meta-analysis of many prostate cancer sets would identify ETV1 and other novel genes as important classifiers in prostate cancer. We leave this question open for a future investigation.

It is clear that in the case of small cohort sizes the clustering algorithm can often fail to capture true structure leading to an unacceptably large false negative rate. In such cases, PAK may be preferable. On the other hand, for larger sample sizes (≥150) we recommend PACK, since PACK benefits from a much lower FDR (kurtosis estimates have a large variance and negative kurtosis is not uncommon for samples drawn from a single gaussian). Furthermore, clustering is always necessary to learn the precise subgroup sizes and gene effect size, which in turn characterizes a gene’s expression profile, while kurtosis alone does not provide a full characterization.

In summary, we believe that the PACK methodology presented here can prove very valuable for finding new biological and/or clinical classifiers in microarray data, helping to further characterize the molecular taxonomy of cancers. In particular, we have shown that genes with bi-modal negative kurtosis expression profiles play an important role as major classifiers in cancer and that PACK allows easy identification of such genes.

The authors thank Paul Edwards for discussions. Research in the Cancer Genomics Program is funded by grants from Cancer Research UK. A.E.T. is supported by the Isaac Newton Trust. N.L.B.M. is supported by Fellowship PRAXIS XXI SFRH/BD/2914/2000 from Fundação para a Ciência e a Tecnologia (Portugal)/European Social Fund (3rd Framework Programme).

Conflict of Interest: none declared.

1In addition to PACK we also consider two simpler versions, PAK and PAC, where the clustering and kurtosis steps are absent, respectively.

2In the work we report here model selection is carried out over two models, one in which there is only one cluster and the other where there are at least two.

3A gene-expression signature to predict survival across independent data sets. A. Naderi, A.E. Teschendorff et al., Oncogene (In press).

4While the variational Bayesian model is generally more accurate in capturing structure, for the purposes of this work EM+BIC was preferable since it yielded the same solution under repeated runs and is a faster implementation.

5We assume in the following that the clusters are of equal variance σ2 and are separated by a distance μ.

6Weights were chosen proportional to the number of samples in each study.

7This analysis was done for each ER positive breast cancer dataset separately to avoid confounding by differences between the cohorts clinical data.

REFERENCES

et al.

Targeting ligand-activated erbb2 signaling inhibits breast and prostate tumor growth

,

Cancer Cell

,

2002

, vol.

2

(pg.

127

-

137

)

Inferring parameters and structure of latent variable models by variational bayes

,

1999

Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence

San Francisco, CA

Morgan Kaufmann

(pg.

21

-

30

)

Semi-supervised methods to predict patient survival from gene expression data

,

PLoS Biol.

,

2004

, vol.

2

pg.

E108

Balanda and MacGillivray

Kurtosis: a critical review

,

Am. Stat.

,

1988

, vol.

42

(pg.

111

-

119

)

et al.

Platelet-derived vegf, flt-1, angiopoietin-1 and p-selectin in breast and prostate cancer: further evidence for a role of platelets in tumour angiogenesis

,

Ann. Med.

,

2004

, vol.

36

(pg.

273

-

277

)

et al.

Analysis of microarray data using Z score transformation

,

J. Mol. Diagn.

,

2003

, vol.

5

(pg.

73

-

81

)

et al.

Muc1, muc2, muc4, muc5ac and muc6 expression in the progression of prostate cancer

,

Clin. Exp. Metastasis

,

2005

, vol.

22

(pg.

565

-

573

)

et al.

Hypervariable genes–experimental error or hidden dynamics

,

Nucleic Acids Res.

,

2004

, vol.

32

pg.

e147

et al.

Identification of molecular apocrine breast tumours by microarray analysis

,

Oncogene

,

2005

, vol.

24

(pg.

4660

-

4671

)

et al.

The influence of cd44v3-v10 on adhesion, invasion and mmp-14 expression in prostate cancer cells

,

Oncol. Rep.

,

2006

, vol.

15

(pg.

199

-

206

)

et al.

Adenovirus-mediated flt1-targeted proapoptotic gene therapy of human prostate cancer

,

Mol. Ther.

,

2004

, vol.

10

(pg.

1059

-

1070

)

et al.

Gene expression profiling identifies clinically relevant subtypes of prostate cancer

,

Proc. Natl Acad. Sci. USA

,

2004

, vol.

101

(pg.

811

-

816

)

Developments in probabilistic modelling with neural networks-ensemble learning

,

Neural Networks: Artificial Intelligence and Industrial Applications. Proceedings of the 3rd Annual Symposium on Neural Networks. Nijmengen

,

1995

(pg.

191

-

198

)

Springer

Estimating the dimension of a model

,

Annls. Stat.

,

1978

, vol.

6

(pg.

461

-

464

)

et al.

Aberrant expression of transmembrane mucins, muc1 and muc4, in human prostate carcinomas

,

Prostate

,

2006

, vol.

66

(pg.

421

-

429

)

et al.

Gene expression correlates of clinical prostate cancer behavior

,

Cancer Cell

,

2002

, vol.

1

(pg.

203

-

209

)

,

Statistical Methods

,

1967

Ames, Iowa

Iowa State University Press

et al.

Repeated observation of breast tumor subtypes in independent gene expression data sets

,

Proc. Natl Acad. Sci. USA

,

2003

, vol.

100

(pg.

8418

-

8423

)

et al.

Breast cancer classification and prognosis based on gene expression profiles from a population-based study

,

Proc. Natl Acad. Sci. USA

,

2003

, vol.

100

(pg.

10393

-

10398

)

et al.

A variational bayesian mixture modelling framework for cluster analysis of gene-expression data

,

Bioinformatics

,

2005

, vol.

21

(pg.

3025

-

3033

)

et al.

Recurrent fusion of tmprss2 and ets transcription factor genes in prostate cancer

,

Science

,

2005

, vol.

310

(pg.

644

-

648

)

et al.

Prostate cancer cell lines lack amplification: overexpression of her2

,

Acta. Oncol.

,

2005

, vol.

44

(pg.

490

-

495

)

et al.

A gene-expression signature as a predictor of survival in breast cancer

,

N Engl. J. Med.

,

2002

, vol.

347

(pg.

1999

-

2009

)

et al.

Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer

,

Lancet

,

2005

, vol.

365

(pg.

671

-

679

)

et al.

Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer

,

Cancer Res.

,

2001

, vol.

61

(pg.

5974

-

5978

)

et al.

Model-based clustering and data transformations for gene expression data

,

Bioinformatics

,

2001

, vol.

17

(pg.

977

-

987

)

et al.

Gotree machine (gotm): a web-based platform for interpreting sets of interesting genes using gene ontology hierarchies

,

BMC Bioinformatics

,

2004

, vol.

5

(pg.

1

-

8

)

et al.

Identification of differentially expressed genes with multivariate outlier analysis

,

J. Biopharm. Stat.

,

2004

, vol.

14

(pg.

629

-

646

)

Author notes

Associate Editor: Martin Bishop

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org