Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays

U Alon et al. Proc Natl Acad Sci U S A. 1999.

Abstract

Oligonucleotide arrays can provide a broad picture of the state of the cell, by monitoring the expression level of thousands of genes at the same time. It is of interest to develop techniques for extracting useful information from the resulting data sets. Here we report the application of a two-way clustering method for analyzing a data set consisting of the expression patterns of different cell types. Gene expression in 40 tumor and 22 normal colon tissue samples was analyzed with an Affymetrix oligonucleotide array complementary to more than 6,500 human genes. An efficient two-way clustering algorithm was applied to both the genes and the tissues, revealing broad coherent patterns that suggest a high degree of organization underlying gene expression in these tissues. Coregulated families of genes clustered together, as demonstrated for the ribosomal proteins. Clustering also separated cancerous from noncancerous tissue and cell lines from in vivo tissues on the basis of subtle distributed patterns of genes even when expression of individual genes varied only slightly between the tissues. Two-way clustering thus may be of use both in classifying genes into functional groups and in classifying tissues based on gene expression.
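To make the idea of two-way clustering concrete, the sketch below splits the rows (genes) and then the columns (tissues) of a small matrix into two groups each. It is only an illustration: the abstract does not describe the authors' algorithm in detail, so a plain 2-means split is used as a stand-in, and the function names and toy values are assumptions made up for this example.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def two_means(vectors, iterations=20):
    """Split vectors into two groups and return a 0/1 label per vector.

    Stand-in technique (2-means), not the clustering method of the paper.
    """
    n = len(vectors)
    # Deterministic start: use the two most distant vectors as centres.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: sq_dist(vectors[p[0]], vectors[p[1]]))
    centers = [vectors[i0], vectors[j0]]
    labels = [0] * n
    for _ in range(iterations):
        # Assign every vector to its nearest centre.
        labels = [0 if sq_dist(v, centers[0]) <= sq_dist(v, centers[1]) else 1
                  for v in vectors]
        # Move each centre to the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = centroid(members)
    return labels

def two_way_split(matrix):
    """Cluster rows (genes) and columns (tissues) of the same matrix."""
    gene_labels = two_means(matrix)
    tissue_labels = two_means([list(col) for col in zip(*matrix)])
    return gene_labels, tissue_labels

# Toy matrix (invented values): 3 genes x 4 tissues.
toy = [
    [9.0, 8.5, 1.0, 1.2],
    [8.8, 9.1, 0.9, 1.1],
    [1.0, 1.3, 9.2, 8.7],
]
gene_labels, tissue_labels = two_way_split(toy)
```

Applying the same split recursively within each group would produce a binary tree for genes and another for tissues, which is the general shape of the result the abstract describes.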

Figures

Figure 1

Correlation between pairs of genes across the 62 tissue samples. ○, tumor tissues; □, normal tissue; line, best fit (least-mean squares) with correlation coefficient r. (A) Correlation between 60S ribosomal protein L22 (EST number T47584) and ribosomal protein L3 (T57630). (B) 60S ribosomal protein L22 and her2 (M11730). Intensities are a measure of the mRNA concentration with 100 intensity units equal to roughly 10 messages/cell (8). (C) Probability histogram of correlation coefficients between pairs of genes. All pairs within the 2,000 genes with highest minimal intensity across the tissues were used. Dashed line, correlation coefficient for data where identity of tissues was randomized. Shaded regions, correlation with statistical significance P < 10⁻³. On average each gene scores such a significant correlation with about 30 other genes, and such an anticorrelation with about 10 other genes.
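The quantities in this caption can be sketched in a few lines. The example below computes the Pearson correlation coefficient r for every pair of genes and builds a randomized control. The helper names are hypothetical, and the control shown here (permuting each gene's values independently across tissues) is one assumed reading of "identity of tissues was randomized".

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_correlations(matrix):
    """Correlation for every unordered pair of rows (genes)."""
    return [pearson(matrix[i], matrix[j])
            for i in range(len(matrix))
            for j in range(i + 1, len(matrix))]

def shuffle_tissues(matrix, rng):
    """Assumed control: permute each gene's values across tissues."""
    return [rng.sample(row, len(row)) for row in matrix]

# Toy intensities (invented): 3 genes measured in 5 tissues.
genes = [
    [120.0, 150.0, 180.0, 210.0, 260.0],
    [60.0, 80.0, 85.0, 110.0, 130.0],
    [300.0, 240.0, 250.0, 190.0, 170.0],
]
observed = pairwise_correlations(genes)
control = pairwise_correlations(shuffle_tissues(genes, random.Random(0)))
```

Histogramming `observed` and `control` over all gene pairs gives the two kinds of curve that panel C compares.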

Figure 2

Data set of intensities of 2,000 genes in 22 normal and 40 tumor colon tissues. The genes chosen are the 2,000 genes with highest minimal intensity across the samples. The vertical axis corresponds to genes, and the horizontal axis to tissues. Each gene was normalized so its average intensity across the tissues is 0, and its SD is 1. The color code used is indicated in the adjoining scale. (A) Unclustered data set. (B) Clustered data. The 62 tissues are arranged on the vertical axis according to the ordered tree of Fig. 3. The 2,000 genes are arranged on the horizontal axis according to their ordered tree. (C) Unclustered randomized data, where the original data set was randomized (the location of each number in the matrix was randomly shifted). (D) Clustered randomized data, subjected to the same clustering algorithm as in B. The data and the clustering program are available at http://www.molbio.princeton.edu/colondata.
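The preprocessing described in this caption (choosing genes by highest minimal intensity, normalizing each gene to mean 0 and SD 1, and randomizing the matrix) might look roughly like the sketch below. Function names are hypothetical, the caption does not say whether the SD is the population or sample SD (the population SD is assumed here), and the toy values are invented.

```python
import random
import statistics

def top_by_min_intensity(matrix, k):
    """Indices of the k rows (genes) with the largest minimum value."""
    order = sorted(range(len(matrix)), key=lambda i: min(matrix[i]), reverse=True)
    return order[:k]

def standardize(row):
    """Rescale one gene to mean 0 and SD 1 (population SD, an assumption)."""
    mean = statistics.fmean(row)
    sd = statistics.pstdev(row)
    return [(v - mean) / sd for v in row]

def shuffle_entries(matrix, rng):
    """Randomized control: move every number to a random matrix position."""
    flat = [v for row in matrix for v in row]
    rng.shuffle(flat)
    width = len(matrix[0])
    return [flat[i:i + width] for i in range(0, len(flat), width)]

# Toy data (invented): 3 genes x 3 tissues; keep the 2 genes whose
# lowest intensity is highest, then standardize each kept gene.
raw = [[10.0, 500.0, 40.0], [90.0, 120.0, 150.0], [60.0, 65.0, 300.0]]
kept = top_by_min_intensity(raw, 2)
normalized = [standardize(raw[i]) for i in kept]
randomized = shuffle_entries(normalized, random.Random(1))
```

Selecting by minimal rather than mean intensity favors genes that are reliably detected in every sample, which is presumably why that criterion was used.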

Figure 3

(A) Expanded view of clustered data set of 2,000 genes in 22 normal and 40 tumor colon tissues. The genes chosen are the 2,000 genes with highest minimal intensity across the samples. Tumor tissues are marked with arrows on the left. Normal tissues are unmarked. Note the separation of normal and tumor tissues. Thin black vertical arrows on the bottom mark ESTs homologous to ribosomal proteins (see Table 1). Note that where these genes cluster the arrows group together and resemble a thick arrow. (B) Same as A but with EB and EB1 colon carcinoma cell lines (17) added to the data set (marked with ∗∗). Note the clustering of cell lines into a separate group with expression patterns markedly different from both tumor and normal in vivo tissues.

Figure 4

Clustering tree for the tissue samples. Tumors (T) and normal tissues (n) are numbered such that tumor and normal tissues with the same serial number originate from the same patient. Tissue T18 is a tumor and tissue T19 is a metastasis from the same patient. The muscle index for each tissue is shown. The muscle index was defined as the average intensity of the ESTs on the array that are homologous to the following 17 smooth muscle genes: (D42054) human ORF (smooth muscle myosin-related), complete cds; (U37019) human smooth muscle cell calponin mRNA, complete cds; (T61597, R01216, T78485) caldesmon, smooth muscle (Gallus gallus); (T60155) actin, aortic smooth muscle (human); (M95787) smooth muscle protein 22-alpha (human); (J02854) myosin regulatory light chain 2, smooth muscle isoform (human); (T97948) calponin h2, smooth muscle (Sus scrofa); (R16199, R42761, R50839, H30638, T55741) myosin light chain kinase, smooth muscle (Gallus gallus); (T96548) actin, gamma-enteric smooth muscle (human); (X12369) tropomyosin alpha chain, smooth muscle (human); (H20709) myosin light chain alkali, smooth-muscle isoform (human). The index is normalized to vary between 0.0 and 1.0. The horizontal distance between tree nodes was determined by the relative value of β at which splitting occurred in the clustering algorithm (see Materials and Methods).
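As a rough illustration of the muscle index defined above, the sketch below averages the intensities of a set of marker ESTs within each tissue and rescales the result to the range 0.0 to 1.0. The caption does not state how the normalization was done, so min-max rescaling is an assumption, as is the function name. Only two of the 17 accession numbers are used, and their intensity values are invented.

```python
def muscle_index(intensities, marker_ids):
    """Per-tissue mean intensity of the marker ESTs, rescaled to [0, 1].

    intensities: dict mapping EST accession -> list of per-tissue values.
    Min-max rescaling is an assumed reading of "normalized to vary
    between 0.0 and 1.0".
    """
    rows = [intensities[m] for m in marker_ids]
    n_tissues = len(rows[0])
    raw = [sum(r[t] for r in rows) / len(rows) for t in range(n_tissues)]
    lo, hi = min(raw), max(raw)
    return [(v - lo) / (hi - lo) for v in raw]

# Invented intensities for two of the listed ESTs across three tissues.
toy_intensities = {
    "D42054": [10.0, 20.0, 30.0],
    "U37019": [30.0, 40.0, 70.0],
}
index = muscle_index(toy_intensities, ["D42054", "U37019"])
```

With these toy values the per-tissue means are 20, 30 and 50, so the tissue with the least smooth muscle signal maps to 0.0 and the one with the most maps to 1.0.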

Figure 5

Separation of tumor and normal tissues by clustering over a set of 500 genes. Genes were sorted by statistical significance (t test) of the difference between normal and tumor tissues. Tissues were clustered by using a window of 500 genes selected from the sorted genes. The vertical axis denotes the fraction of tumors in the tumor-rich cluster (|T − N|/(T + N), where N and T are the numbers of normal and tumor tissues, respectively). Dashed line indicates separation in a randomized data set. The horizontal axis denotes the starting point of the 500-gene window, so that at the left-hand side the most significant 500 genes are used, and at the right the least significant 500 genes.
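The gene-ranking and windowing steps of this analysis can be sketched as follows. This is a simplified illustration with hypothetical function names: genes are ranked by the absolute Welch t statistic as a proxy for t-test significance, the clustering step itself is not reproduced, and the separation score assumes the caption's formula reads |T − N|/(T + N).

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two samples (each needs >= 2 values)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(
        va / len(a) + vb / len(b))

def rank_genes(matrix, is_tumor):
    """Gene indices sorted from most to least tumor/normal difference."""
    def score(row):
        tumor = [v for v, t in zip(row, is_tumor) if t]
        normal = [v for v, t in zip(row, is_tumor) if not t]
        return abs(welch_t(tumor, normal))  # |t| as a proxy for significance
    return sorted(range(len(matrix)), key=lambda i: score(matrix[i]),
                  reverse=True)

def window(ranked, start, size):
    """Genes in the sliding window that begins at position `start`."""
    return ranked[start:start + size]

def separation(n_tumor, n_normal):
    """Assumed score |T - N| / (T + N) for one cluster's tissue counts."""
    return abs(n_tumor - n_normal) / (n_tumor + n_normal)

# Toy data (invented): 2 genes x 4 tissues, first two tissues are tumors.
toy_matrix = [[5.0, 5.2, 5.1, 4.9], [9.0, 9.2, 1.0, 1.3]]
toy_is_tumor = [True, True, False, False]
ranked = rank_genes(toy_matrix, toy_is_tumor)
```

Under the assumed formula a cluster containing only tumors scores 1.0 and an evenly mixed cluster scores 0.0. Sliding `start` from 0 upward and clustering the tissues on each window would trace a curve like the one in the figure.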

References

    1. Lockhart D, Dong H, Byrne M, Follettie M, Gallo M, Chee M, Mittmann M, Wang C, Kobayashi M, Horton H, Brown E. Nat Biotechnol. 1996;14:1675–1680.
    2. DeRisi J, Penland L, Brown P, Bittner M, Meltzer P, Ray M, Chen Y, Su Y, Trent J. Nat Genet. 1996;14:457–460.
    3. Pietu G, Alibert O, Guichard V, Lamy B, Bois F, Leroy E, Mariage-Sampson R, Houlgatte R, Soularue P, Auffray C. Genome Res. 1996;6:492–503.
    4. Wodicka L, Dong H, Mittmann M, Ho M, Lockhart D. Nat Biotechnol. 1997;15:1359–1367.
    5. DeRisi J, Iyer V, Brown P. Science. 1997. 680–686.
