Coexpression of Neighboring Genes in the Genome of Arabidopsis thaliana (original) (raw)

Abstract

Large-scale analyses of expression data of eukaryotic organisms are now becoming increasingly routine. The data sets are revealing interesting and novel patterns of genomic organization, which provide insight both into molecular evolution and how structure and function of a genome interrelate. Our study investigates, for the first time, how genome organization affects expression of a gene in the Arabidopsis genome. The analyses show that neighboring genes are coexpressed. This pattern has been found for all eukaryotic genomes studied so far, but as yet, it remains unclear whether it is due to selective or nonselective influences. We have investigated reasons for coexpression of neighboring genes in Arabidopsis, and our evidence suggests that orientation of gene pairs plays a significant role, with potential sharing of regulatory elements in divergently transcribed genes. Using the data available in the KEGG database, we find evidence that genes in the same pathway are coexpressed, although this is not a major cause for the coexpression of neighboring genes.

Several large-scale analyses of expression data in higher eukaryotes have shown that neighboring genes tend to have similar expression patterns. Regional similarity in expression has been found in humans (Caron et al. 1995; Lercher et al. 2002), Drosophila (Cohen et al. 2000; Boutanaev et al. 2002; Spellman and Rubin 2002;), yeast (Cohen et al. 2000), and Caenorhabditis elegans (Lercher et al. 2003).

There are a number of potential causes for neighboring genes in a genome to have similar expression patterns. First, duplicated genes often remain neighbors for significant periods of evolutionary time, and given their common ancestry, are likely to have similar expression patterns. Second, neighboring genes in prokaryotic genomes, particularly those that are functionally related, are often found in operons. To date, operons have been found in Caenorhabditis elegans (Blumenthal et al. 2002), and there are also several examples of polycistronic genes in the human genome (Reiss et al. 1998; Gray et al. 1999). It is possible that genes involved in a particular metabolic pathway that requires coordinate regulation will be found to be clustered in other higher eukaryotes. For example, recent studies on Arabidopsis thaliana have identified clustered genes in relation to root development (Birnbaum et al. 2003) and mitochondrial function (Elo et al. 2003). Third, even in the absence of coordinate regulation, the close proximity of neighboring genes in eukaryotic genomes could lead to sharing of _cis_-regulatory elements such as enhancers or insulators, leading to a similarity in their expression patterns. Fourth, there may be a selective advantage for coexpressed genes to be in the same chromosomal domain.

The observations on coexpression of neighboring genes have been based on data gained from a variety of experimental techniques. These have included Serial Analysis of Gene Expression (SAGE; Lercher et al. 2002), DNA microarray data (Spellman and Rubin 2002), and data derived from gene annotation, such as Gene Ontology (GO terms; Spellman and Rubin 2002) and pathway assignation (Lee and Sonnhammer 2003).

Increasingly, data sets from DNA microarrays, which enable large numbers of genes to be analyzed simultaneously in a single experiment, are used for bioinformatics analysis. However, there are several different microarray technologies currently in use, including cDNA, oligo, and Affymetrix arrays. It is unclear as yet whether quantitative comparison of data sets from these different technologies is feasible. An example of this difficulty is illustrated in Kuo et. al. (2002), where a comparison of human microarray data sets using cDNA and Affymetrix technologies found no direct correlation.

This study describes the first analysis of the Arabidopsis genome to determine whether neighboring genes are coexpressed. Gene expression in Arabidopsis has been studied in-depth worldwide, and there are publicly available data sets for both cDNA and Affymetrix microarrays. This gives the added opportunity to directly compare the impact of these two technologies on the analysis. Our results from a pairwise comparison, show that coexpression of neighboring genes does exist in the Arabidopsis genome. There is significant disparity in the conclusions that can be drawn from data derived from the two different microarray technologies. The causes of coexpression have been explored, and evidence is provided to suggest that neither gene duplication nor common functionality are the main cause for coexpression of neighboring genes in the Arabidopsis genome.

RESULTS

Neighboring Genes Are Coexpressed

The data sets used for this analysis were derived from cDNA and Affymetrix microarrays. For each data set, as shown in Figure 1, the mean Pearsons correlation coefficient (R) of all pairs of neighboring genes was calculated to give a measure of the similarity in their expression pattern. The significance of this value was confirmed using a Monte-Carlo simulation, which compares the value obtained to a distribution of random mean R-values derived from the same set of data. Surprisingly, the mean R from the random distribution was positive rather than being zero, as would be expected by chance. A possible explanation for this effect may be the influence of housekeeping genes showing common patterns of expression in many different tissues and experimental conditions, thereby shifting the mean value into the positive. There was clear evidence for significant coexpression of neighboring genes across the genome. This was obtained for data sets from both cDNA and Affymetrix microarrays (cDNA arry: P < 0.0001, +4.99 standard deviations from the random mean, Affymetrix array, +23.1 standard deviations; Fig. 1A,C). Tandem duplicates, defined as gene pairs with a BLAST e-value <0.2 and within 10 genes of one another on the chromosome, were found to have a higher degree of coexpression than that of neighboring genes that were not tandem duplicates. This was obtained using a Mann-Whitney U-test (both data sets: _P_ < 0.0001, Table 1). The result suggested that tandem duplicates could be a significant cause of coexpression of neighboring genes. Therefore, to determine the extent of this effect, one member of each pair of tandem duplicates was removed, and the mean coexpression was recalculated and again compared with randomized data sets. The results of these analyses are shown in Figure 1, B and D, and clearly demonstrate that the impact of tandem duplicates on the coexpression of neighboring genes is different between data obtained from the two technologies. The cDNA array data set free of tandem duplicates showed no evidence of coexpression of neighboring genes (_n_ = 2109; _P_ > 0.10, +1.08 standard deviations; Fig. 1B), whereas the Affymetrix data set continued to show a significant pattern (n = 1367; P < 0.0001, +18.6 standard deviations; Fig. 1D).

Figure 1.

Histogram of mean R generated from 10,000 randomized pairs of genes. For each randomized genome, the mean R of pairwise comparisons was calculated. The Mean R calculated from the original set of neighboring gene pairs is marked with an arrow. For both data sets, including tandem duplicates, there was a significant degree of coexpression of neighboring pairs. After the removal of tandem duplicates only, the Affymetrix data set showed evidence that neighboring genes were more likely to be coexpressed. (A) cDNA array including tandem duplicates. Mean R = 0.04397, σ = 0.00365. (B) cDNA array not including tandem duplicates. Mean R = 0.04574, σ = 0.00394. (C) Affymetrix array including tandem duplicates. Mean R = 0.03275, σ = 0.00307. (D) Affymetrix array not including tandem duplicates. Mean R = 0.03510, σ = 0.00328.

Table 1.

Descriptive Statistics for the Pairwise Comparisons of Neighboring Gene Pairs

Number of gene pairs (missing values)	Mean R ± se	Median R
Stanford cDNA array	All genes +td	2497 (95)	0.0622 ± 0.0042	0.0226
All genes -td	2109 (71)	0.050 ± 0.0042	0.01769
Tandem duplicates	140 (6)	0.240 ± 0.026	0.1441
NASC Affymetrix array	All genes +td	18908 (8656)	0.1039 ± 0.0033	0.08070
All genes -td	14959 (6304)	0.096 ± 0.0035	0.07387
Tandem duplicates	1307 (787)	0.271 ± 0.016	0.2638

To investigate whether the correlation continues beyond neighboring gene pairs into clusters of increasing size, nonoverlapping blocks of three to 20 genes were compared, and the results are shown in Figure 2. Previous analyses in Drosophila suggested that blocks of genes up to 20 in size showed significant clustering of coexpressed genes. Data from only the Affymetrix arrays minus tandem duplicates are shown. The difference in degree of coexpression between real and randomized data sets remained significant for all block sizes. For nonoverlapping blocks of three to 10 genes, there is a clear, gradual decrease in coexpression. Beyond this, there is no further decrease in coexpression, and this continued for block sizes of up 20 genes. This implies that in the Arabidopsis genome, there may be clusters of up to 20 genes that are coexpressed, with an overall median cluster size of 100 kb. It was possible that the statistical significance of these results was inflated by genes that are only one, two, or three genes apart. To investigate this possibility, the randomizations were repeated, but rather than randomizing single genes in each block, groups of three genes were used. When these additional analyses were carried out, the mean R for the randomized data sets increased, but as shown in Figure 3, no random data set produced a higher mean R value than the real data set. This confirmed the significance of the finding that blocks of genes are coexpressed in the genome.

Figure 2.

Using the Affymetrix data set lacking tandem duplicates, the mean R for nonoverlapping windows of neighboring genes (three to 20 genes in size) was plotted against cluster size (blue line). The Mean R from 100 random sets of gene clusters (three to 20 genes in size) was also plotted (red line).

Figure 3.

Gene pairs up to 12 kb apart were binned according to their intergenic distance for both the data set containing and lacking tandem duplicates. The Mean R for all pairs within each bin was calculated using the cDNA microarray data and the Affymetrix data. (Red) Results obtained using Affymetrix data set; (blue) results obtained using cDNA microarray data. Data points using triangles are those obtained including tandem duplicates; circles indicate results obtained after removal of tandem duplicates. The regression lines are plotted, full lines those with all genes included, dashed lines those without tandem duplicates. Also plotted is the mean R value for all gene pairs (dashed lines with dots).

It was interesting to determine whether there was a direct correlation between distance and degree of coexpression. Thus, each pair of genes was placed in bins according to their intergenic distance (0–1 kb, 1–2 kb, 2–3 kb, etc.). If there is a relationship between proximity and degree of coexpression, then it could be expected that genes that are closer together would have a greater degree of coexpression than genes that are further away. For the Affymetrix data set, as shown in Figure 3, a significant correlation was observed between coexpression and intergenic distance of gene pairs up to 12 kb apart (with tandem duplicates: R2 = 0.73; P < 0.005, without tandem duplicates: R2 = 0.69; _P_ < 0.005). Interestingly, when gene pairs in intergenic blocks >12 kb were considered, the correlation between coexpression and gene distance was no longer found to be significant. No correlation was observed for the cDNA array data sets, with or without tandem duplicates. Given this lack of correlation, it is unclear whether the quantitative results from cDNA microarrays are useful for bioinformatic analysis, and therefore, further work focused only on the Affymetrix data sets.

Genes Thought to Be Involved in the Same Biological Process Are Coexpressed

The KEGG database defines genes that are thought to function in the same biological process, such as in a metabolic or regulatory pathway. Recently, a study of several genomes, including that of Arabidopsis, has used the KEGG database to demonstrate that genes functioning in the same pathway are often clustered in the genome (Lee and Sonnhammer 2003). It was important, therefore, to determine whether genes in the same pathway are coexpressed and whether this could be the causal reason for the coexpression of neighboring genes. Currently, 1891 genes in the Arabidopsis genome are assigned to pathways listed in the KEGG database. Of these, 912 gene pairs can be defined as near neighbors (within 10 genes of each other). The mean R is three times higher for gene pairs assigned to the same pathway, compared with those that were not (Mann-Whitney; P < 0.001; Table 2). On removing those gene pairs that were in the same pathway from the remainder, there continued to be significant coexpression of neighboring genes (Monte Carlo simulation: P < 0.0001). Thus, using the limited data currently available, coexpression of neighboring gene pairs is not only caused by clustering of genes in the same pathway.

Table 2.

Descriptive Statistics for Gene Pairs in the Same Metabolic Pathway and Those Not in the Same Metabolic Pathway

R (Pearsons correlation coefficient)	Intergenic distance (bp)
Gene pairs	N	Mean R ± se	Median R	Mean bp ± se	Median bp
In same metabolic pathway	72	0.2268 ± 0.0448	0.2566	19115 ± 1441	19180
Not known to be in same metabolic pathway	840	0.0756 ± 0.0111	0.0422	19160 ± 464	17614

The mean R value, that is, degree of coexpression, was calculated for genes in each pathway listed in the KEGG database. The results are shown in Table 3, and illustrate several interesting features. First, the degree of coexpression shows considerable variation between different pathways. Second, the degree of coexpression is extremely high for some pathways, particularly those in which there is a known molecular interaction between gene products, such as components of the proteosome, ribosome, and replicon. Third, genes encoding enzymes of metabolic pathways are not so highly coexpressed, with some exceptions, such as those involved in the TCA cycle and fatty acid biosynthesis.

Table 3.

Degree of Coexpression of Genes Within the Same Pathway as Defined by the KEGG Database

Pathway no.	Pathway id	Pathway description	Total comparisons	R	No. genes
1	*ath03050*	Proteasome	946	0.436	47
2	*ath03010*	Ribosome	24504	0.385	249
3	ath00580	Phospholipid degradation	25	0.378	9
4	*ath03030*	DNA polymerase	89	0.360	16
5	ath00960	Alkaloid biosynthesis II	10	0.349	5
6	*ath03032*	Replication complex	36	0.300	10
7	ath00020	Citrate cycle (TCA cycle)	670	0.264	39
8	ath00860	Porphyrin and chlorophyll metabolism	190	0.254	22
9	*ath00061*	Fatty acid biosynthesis (path 1)	78	0.240	13
10	*ath03020*	RNA polymerase	491	0.214	37
11	ath00720	Reductive carboxylate cycle (CO2 fixation)	170	0.213	20
12	*ath00195*	Photosynthesis	1646	0.198	63
13	ath00510	N-Glycans biosynthesis	262	0.189	25
14	ath00521	Streptomycin biosynthesis	21	0.189	7
15	*ath00193*	ATP synthesis	525	0.184	37
16	ath03022	Basal transcription factors	326	0.175	34
17	ath00970	Aminoacyl-tRNA biosynthesis	629	0.174	38
18	*ath03014*	Other translation factors	15	0.171	7
19	ath00150	Androgen and estrogen metabolism	10	0.167	6
20	*ath03034*	Other replication, recombination and repair factors	66	0.158	14
21	ath00300	Lysine biosynthesis	66	0.143	13
22	ath00360	Phenylalanine metabolism	2502	0.138	78
23	ath00760	Nicotinate and nicotinamide metabolism	2894	0.136	86
24	ath00400	Phenylalanine, tyrosine and tryptophan biosynthesis	629	0.135	40
25	ath00632	Benzoate degradation via CoA ligation	3208	0.132	89
75	ath00750	Vitamin B6 metabolism	10	0.050	5
76	ath00561	Glycerolipid metabolism	1032	0.049	48
77	ath00511	N-Glycan degradation	35	0.048	9
78	ath00252	Alanine and aspartate metabolism	494	0.045	36
79	ath00330	Arginine and proline metabolism	560	0.041	36
80	ath00910	Nitrogen metabolism	378	0.037	29
81	ath00220	Urea cycle and metabolism of amino groups	136	0.036	20
82	ath00410	β-Alanine metabolism	120	0.034	17
83	ath00670	One carbon pool by folate	78	0.034	16
84	ath00362	Benzoate degradation via hydroxylation	36	0.033	10
85	ath00340	Histidine metabolism	91	0.029	15
86	ath00472	D-Arginine and D-ornithine metabolism	21	0.027	8
87	ath00251	Glutamate metabolism	818	0.025	43
88	ath00351	1,1,1-Trichloro-2.2-bis(4-chlorophenyl)ethane (DDT) degradation	21	0.022	8
89	ath00361	γ-Hexachlorocyclohexane degradation	587	0.021	40
90	ath00053	Ascorbate and aldarate metabolism	836	0.020	47
91	ath00100	Sterol biosynthesis	449	0.018	34
92	ath03060	Protein export	377	0.018	31
93	ath00530	Aminosugars metabolism	153	0.018	20
94	ath00628	Fluorene degradation	587	0.017	40
95	ath04710	Circadian rhythm	496	0.014	32
96	ath00120	Bile acid biosynthesis	91	0.006	14
97	ath00900	Terpenoid biosynthesis	95	0.002	17
98	ath00460	Cyanoamino acid metabolism	65	-0.007	15
99	ath02052	Other ion-coupled transporters	45	-0.074	10
100	ath00550	Peptidoglycan biosynthesis	10	-0.172	5

The Effect of Gene Orientation on Coexpression of Neighboring Genes

Genes in a genome can be transcribed in one of two directions and therefore pairs of genes can be orientated in three alternative combinations as follows: divergent transcription (← →), convergent transcription (→ ←), or parallel transcription (→ →/← ←). Using the Affymetrix data set minus tandem duplicates, those pairs of genes with divergent (← →) or parallel (→ →/← ←) orientation were found to have a higher degree of coexpression than those genes with convergent (→ ←) orientation oftranscription (Table 4; Kruskal-Wallis, P < 0.0001). Interestingly, the pairs of genes with convergent orientation were found to have shorter intergenic distance than those with divergent or parallel orientation (Table 4; Kruskal-Wallis, P < 0.0001).

Table 4.

Descriptive Statistics for Pairwise Comparison of Neighboring Genes According to Orientation of Transcription

R (Pearson correlation coefficient)	Intergenic distance (bp)
Orientation	N	Mean R ± se	Median R	Mean bp ± se	Median bp
Complete dataset without tandem duplicates (Affymetrix data)	←→	2212	0.106 ± 0.007	0.08866	2770 ± 65.3	1872
→→/←←	4201	0.104 ± 0.0051	0.07831	2093.7 ± 33.7	1351
→←	2241	0.071 ± 0.0068	0.05515	1147.6 ± 37.4	597
Tandem duplicates only (Affymetrix data)	←→	38	0.391 ± 0.055	0.4734	7758 ± 600	7621
→→/←←	445	0.27 ± 0.018	0.2563	5427 ± 142	4625
→←	36	0.158 ± 0.064	0.2519	5377 ± 491	4879

The above analysis excluded tandem duplicates. The same analysis was performed on a data set of neighboring genes that consisted only of tandem duplicates. As a basis for this analysis, the transcriptional orientation of tandem duplicates was first investigated, and as predicted, most were found to be in the parallel (→→/←←) orientation (χ2 test, P < 0.0001). However, it was the tandem duplicates existing in the divergent (← →) orientation of transcription that showed the greatest degree of coexpression (Table 4; Kruskal-Wallis, P < 0.05).

DISCUSSION

Many technologies are now available to determine the different patterns of gene expression exhibited in cells and tissues of an organism. Often, the entire genomes of these organisms have also been sequenced. This provides the opportunity to analyze gene expression in the context of genome organization. For A. thaliana, the genome sequencing program was completed in 2000 (The Arabidopsis Genome Initiative 2000), and it is fast becoming routine to apply a variety of microarray technologies to the model plant to define global patterns of gene expression. Despite the availability of these data, very few detailed global gene expression analyses have been published on this organism. This study has explored, for the first time, the possibility of coexpression of neighboring genes in Arabidopsis and the reasons that this might occur.

Our results show that neighboring genes in the Arabidopsis genome are indeed coexpressed. We have observed this coexpression from two different sources of data for the statistical analysis, Affymetrix and cDNA microarray technologies. Tandem duplicates were found to have a higher degree of coexpression than other neighboring genes in our analysis, but interestingly, the impact of their removal was found to be different when the data from the two technologies were compared. Only the Affymetrix data set continued to show a significant pattern of coexpression. The loss of significance from the cDNA microarray data sets can readily be understood given the known problem of cross-hybridization arising from highly homologous genes such as tandem duplicates. This leads to a higher overall level of noise and unreliability when using cDNA arrays. In contrast, the Affymetrix technology bypasses this problem by using multiple oligonucleotides unique for each gene.

A further difference shown by the analyses of the data sets from the two technologies relates to the effect of intergenic distance, as one could predict that genes closer together would have a greater degree of coexpression than those that are more distant in the genome. A significant correlation between distance and coexpression was only found for the Affymetrix data set, either with or without the inclusion of tandem duplicates. This finding also questions the general utility of cDNA microarrays for this type of quantitative analysis. Some discrepancies have been found previously between cDNA and Affymetrix data sets, such as, for example, in the study of gene expression patterns in 56 cell lines from the National Cancer Institute (Kuo et al 2002), as well as in a study using human neuroblastoma cells (Li et al. 2002). Given the potential problems associated with data sets from cDNA microarrays and the inherent problems of gene duplication in Arabidopsis, all further analyses in this study used only Affymetrix data sets omitting tandem duplicates.

We have addressed several possible explanations for the observed coexpression of neighboring genes. For example, MARS are thought to influence gene expression through changing chromatin conformation patterns (Mishra and Karch 1999; Gerasimova and Corces 2001). To explore this possibility, we used bioinformatic tools to identify MARS in the Arabidopsis genome (Glazko et al. 2000). However, this approach has considerable limitations, as there is no experimental certainty that the MARS identified are functional or operational under the conditions of plant growth and development used to gain expression data. Within these limitations, we found no positive or negative evidence that the presence of MARS correlates with coexpression of neighboring gene pairs (E.J.B. Williams and D.J. Bowles, unpubl.).

Gene orientation has been examined in a number of studies for its relationship to degree of coexpression. Studies on yeast have shown that divergently transcribed genes have a higher degree of coexpression than genes in convergent orientation (Kruglyak and Tang 2000). It has been suggested that the underlying cause for these observations may be due to sharing of common regulatory elements. Although several bidirectional promoters have been found in mammalian genomes (Adachi and Lieber 2002), few examples have been found in plants. Significantly, recent experimental data from Capsicum annuum have discovered two coexpressed homologous genes that are neighbors and are divergently transcribed (Shin et al. 2003). The authors demonstrated that a single promoter, situated between the genes, is responsible for driving their expression. In our analysis of Arabidopsis, we found clear evidence that gene pairs transcribed in divergent or parallel orientations showed a higher degree of coexpression than those gene pairs in the convergent orientation. Interestingly, tandem duplicates in the divergent orientation have a higher coexpression than those in the parallel orientation, despite most tandem duplicates being in the parallel orientation. These findings may indicate that bidirectional promoters may be more common than expected in plant genomes and may be particularly important in the coexpression of duplicate gene pairs. Our data provide the basis for undertaking experimental studies to investigate how the expression of defined gene pairs is regulated.

Coexpression of neighboring genes could arise through the genes sharing a common function. For example, one could readily predict that genes encoding enzymes in a common metabolic pathway may be coordinately regulated and therefore coexpressed, particularly if the entire pathway is responsive to environmental or developmental cues. To gain an insight into the role of shared function in coexpression, we used the KEGG database to analyze gene expression in the context of gene function (Kanehisa 2002). The database encompasses genes of annotated function, which currently only represents a small subset of genes in the Arabidopsis genome. An additional problem with the KEGG database is subjectivity of the annotation. Within these constraints, high degrees of coexpression were observed between pairs of genes in Arabidopsis thought to share a role in common biological processes (PATHWAYS, as defined by the KEGG database), implying that commonality of function does explain some degree of the coexpression observed. When these pairs of genes were removed and the analysis repeated, neighboring genes in the genome continued to be coexpressed. Thus, on the basis of the limited information currently available, the data suggest that the phenomenon of coexpression of neighboring genes in the Arabidopsis genome does not rely only on genes functioning in a common biological process. However, as the KEGG database is not comprehensive, not all pairs of genes involved in the same pathway can be definitively removed. It would be interesting to repeat these analyses at a later date when more information is available and a far greater proportion of genes in the Arabidopsis genome have been assigned a definite function.

Interestingly, when coexpression of genes across the entire genome was analyzed in the context of the KEGG database, particularly high degrees of correlation were observed for genes encoding proteins that are known to function in multicomponent complexes, such as the proteosome, ribosome, and replicon. Often, these complexes contain a high level of protein–protein interactions and our conclusions from the Arabidopsis data are supported by studies in yeast, in which genes encoding interacting proteins tend to be coexpressed (Ge et al. 2001; Grigoriev 2001; Jansen et al. 2002). In contrast, the degree of coexpression of genes encoding enzymes in metabolic pathways in Arabidopsis was low, with the exception of several key primary metabolic pathways, such as the TCA cycle and fatty acid metabolism. Given these findings, it is an interesting possibility that the number of interactions between proteins may be an important predictor of the degree of coexpression between their corresponding genes. Additionally, as more data emerges from studies of gene function in Arabidopsis, it will be important to determine whether protein–protein interactions play a role in the coexpression of neighboring genes.

METHODS

Data Sources

Microarray Data

Data was collected from two sources. The Stanford data set is a collection of microarray experiments using cDNA microarrays. The data was downloaded from the Stanford Web site (ftp://genome-ftp.stanford.edu/pub/smd/organisms/AT). A total of 233 experiments were used and the total number of genes across all experiments was 7627 genes. Not all genes were present in each array. As an indicator of the expression level, the normalized ratio was used (channel 1/channel 2 ratio normalized). The Affymetrix data was obtained using the Nottingham Arabidopsis Stock Centre (NASC) Affywatch service (http://arabidopsis.info/prototype/; Craigon et al 2004). The data set contained 175 experiments, 28 of which used 8300 chips; the remainder were full-genome chips. Expression level was defined as the normalized signal values where the detection call was 1, indicating that the signal value was statistically significant. If any one gene was represented more than once on a chip, then the mean expression level across the chip for that gene was used. Both sets of data contained experiments using various tissue types and sample sources.

Detecting Local Similarity in Expression

The level of coexpression between two genes was defined as the Pearson's correlation coefficient (R) of the expression level for these genes across all experiments.

To test for pairwise local similarity in expression in the Arabidopsis genome, the mean R (Pearson's correlation coefficient) of the expression profiles for neighboring pairs of genes was calculated for both the affymetrix and cDNA data sets. Neighbors were defined as genes that were immediately adjacent in the Arabidopsis genome according to each gene's AGI name, that is, gene pairs with an AGI name (of the form At[chr]g[xxxxx]), differing by 10 or less (e.g., At1g10020 and At1g10030 are defined as neighbors). The mean R calculated from the real data set was then compared with the mean R calculated from 10,000 data sets, in which the order of genes in the Arabidopsis genome was randomized. To ensure that the R-value calculated was statistically valid for each pairwise comparison, there had to be at least 10 experiments in which both genes had valid values. For the Affymetrix data in particular, this resulted in many comparisons being rejected, due to an insufficient number of experiments in which the transcript was identified. The number of gene pair comparisons was conserved between the randomized and the real data sets (Stanford n = 2498; NASC n = 7388).

When analyzing blocks of genes, the mean of all possible comparisons within the block was used as the level of coexpression for that block. Therefore, for a block of five genes, 10 different correlations were carried out, and the mean R was used as a measure of the level of coexpression for that particular block. The mean R was then compared with means calculated from randomized data sets. One hundred randomizations were carried out for each simulation. Where sub-blocks were used, the number of genes in a randomized block were varied. For example, when there were three genes in a sub-block, the Arabidopsis genome was split into blocks of three ordered neighboring genes. These blocks were then randomized. For each random distribution, the genes were split into blocks of 15 genes, from which the mean Pearson correlation coefficient was calculated using the Affymetrix array data. Tandem duplicates were excluded.

Distance between genes was defined as the distance in base-pairs between the last coding position, on either strand, of the first gene to the first coding position of the second gene.

Removal of Tandem Duplicates

All Arabidopsis protein sequences from the May 2003 build were downloaded from MIPS (http://mips.gsf.de/proj/thal/db/index.html). The protein sequences were compared using an allagainst-all BLAST algorithm. Any pair of genes within 100 genes of each other that showed sequence similarity (e-value cut off = 0.2) was counted as a tandem duplicate. This cut off value removes ∼90% of related genes from a data set, and has a false positive rate of about 10% (Lercher et al. 2002). One member of each pair of tandem duplicates was removed from the analysis. This gave 8890 pairs of tandem duplicates in the entire Arabidopsis genome. This compares favorably with the 17% of the Arabidopsis genome claimed to be in tandem arrays quoted in the Arabidopsis genome paper (The Arabidopsis Genome Initiative 2000).

Identification of Genes in the Same Metabolic Pathway

The KEGG database (http://www.genome.ad.jp/kegg/), downloaded August 2003, was used to assign 1891 genes to 117 PATHWAYS, resulting in 4048 gene-PATHWAY assignations. The KEGG database annotates only a small proportion of the Arabidopsis genome, and the ontology is biased toward mammalian metabolic pathways. For each pair of nonduplicate genes in which there was known pathway information and Affymetrix data associated with both genes (n = 912), where the pair was within 10 genes of each other, a Pearson's correlation coefficient was calculated. Of these, 101 pairs were classified as neighboring genes; eight of these were pairs classified as being in the same metabolic pathway. To increase the number of gene pairs for comparison against neighboring pairs of genes that were not in the same metabolic pathway, all gene pairs for which data had been calculated were used in the analysis (Nmetabolic = 72, Nnot metabolic = 840).

PERL scripts that carry out the methods described in this work are available from the authors on request.

Acknowledgments

We thank Yi Li, Kathryn Madagan, Fabian Vaistij, Eng-Kiat Lim, and Chris Winefield for their helpful comments and discussion. E.J.B.W. is funded by the BBSRC Exploiting Genomics Initiative (grant no. EGA16205).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2131104.

Footnotes

[Supplemental material is available online at www.genome.org. All of the raw microarray data and metabolic pathway data will be made available as additional information. Also, all programs used to analyze data will be made available on request as well as any other data used in the analyses.]

References

Adachi, N. and Lieber, M.R. 2002. Bidirectional gene organization: A common architectural feature of the human genome. Cell 109**:** 807–809. [DOI] [PubMed] [Google Scholar]
The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408**:** 796–815. [DOI] [PubMed] [Google Scholar]
Birnbaum, K., Shasha, D.E., Wang, J.Y., Jung, J.W., Lambert, G.M., Galbraith, D.W., and Benfey, P.N. 2003. A gene expression map of the Arabidopsis root. Science 302**:** 1956–1960. [DOI] [PubMed] [Google Scholar]
Blumenthal, T., Evans, D., Link, C.D., Guffanti, A., Lawson, D., Thierry-Mieg, J., Thierry-Mieg, D., Chiu, W.L., Duke, K., Kiraly, M., et al. 2002. A global analysis of Caenorhabditis elegans operons. Nature 417**:** 851–854. [DOI] [PubMed] [Google Scholar]
Boutanaev, A.M., Kalmykova, A.I., Shevelyov, Y.Y., and Nurminsky, D.I. 2002. Large clusters of co-expressed genes in the Drosophila genome. Nature 420**:** 666–669. [DOI] [PubMed] [Google Scholar]
Caron, H., Peter, M., Vansluis, P., Speleman, F., Dekraker, J., Laureys, G., Michon, J., Brugieres, L., Voute, P.A., Westerveld, A., et al. 1995. Evidence for 2 tumor-suppressor loci on chromosomal bands-1p35–36 involved in neuroblastoma—one probably imprinted, another associated with n-myc amplification. Hum. Mol. Genet. 4**:** 535–539. [DOI] [PubMed] [Google Scholar]
Cohen, B.A., Mitra, R.D., Hughes, J.D., and Church, G.M. 2000. A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression. Nat. Genet. 26**:** 183–186. [DOI] [PubMed] [Google Scholar]
Craigon, D.J., James, N., Okyere, J., Higgins, J., Jotham, J., and May, S. 2004. NASCArrays: A repository for microarray data generated by NASC's transcriptomics service. Nucleic Acids Res. 32**:** D575–D577. [DOI] [PMC free article] [PubMed] [Google Scholar]
Elo, A., Lyznik, A., Gonzalez, D.O., Kachman, S.D., and Mackenzie, S.A. 2003. Nuclear genes that encode mitochondrial proteins for DNA and RNA metabolism are clustered in the Arabidopsis genome. Plant Cell 15**:** 1619–1631. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ge, H., Liu, Z., Church, G.M., and Vidal, M. 2001. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet. 29**:** 482–486. [DOI] [PubMed] [Google Scholar]
Gerasimova, T.I. and Corces, V.G. 2001. Chromatin insulators and boundaries: Effects on transcription and nuclear organization. Annu. Rev. Genet. 35**:** 193–208. [DOI] [PubMed] [Google Scholar]
Glazko, G.V., Rogozin, I.B., and Glazkov, M.V. 2000. Computer prediction of DNA sites of attachment to different nuclear matrix elements. Mol. Biol. 34**:** 1–5. [PubMed] [Google Scholar]
Gray, T.A., Saitoh, S., and Nicholls, R.D. 1999. An imprinted, mammalian bicistronic transcript encodes two independent proteins. Proc. Natl. Acad. Sci. 96**:** 5616–5621. [DOI] [PMC free article] [PubMed] [Google Scholar]
Grigoriev, A. 2001. A relationship between gene expression and protein interactions on the proteome scale: Analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res. 29**:** 3513–3519. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jansen, R., Greenbaum, D., and Gerstein, M. 2002. Relating whole-genome expression data with protein–protein interactions. Genome Res. 12**:** 37–46. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kanehisa, M. 2002. The KEGG database. Novartis. Found. Symp. 247**:** 91–101. [PubMed] [Google Scholar]
Kruglyak, S. and Tang, S. 2000. Regulation of adjacent yeast genes. Trends Genet. 16**:** 109–111. [DOI] [PubMed] [Google Scholar]
Kuo, W.P., Jenssen, T.K., Butte, A.J., Ohno-Machado, L., and Kohane, I.S. 2002. Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics. 18**:** 405–412. [DOI] [PubMed] [Google Scholar]
Lee, J.M. and Sonnhammer, E.L. 2003. Genomic gene clustering analysis of pathways in eukaryotes. Genome Res. 13**:** 875–882. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lercher, M.J., Urrutia, A.O., and Hurst, L.D. 2002. Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat. Genet. 31**:** 180–183. [DOI] [PubMed] [Google Scholar]
Lercher, M.J., Blumenthal, T., and Hurst, L.D. 2003. Coexpression of neighboring genes in Caenorhabditis elegans is mostly due to operons and duplicate genes. Genome Res. 13**:** 238–243. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li, J., Pankratz, M., and Johnson, J.A. 2002. Differential gene expression patterns revealed by oligonucleotide versus long cDNA arrays. Toxicol. Sci. 69**:** 383–390. [DOI] [PubMed] [Google Scholar]
Mishra, R.K. and Karch, F. 1999. Boundaries that demarcate structural and functional domains of chromatin. J. Biosci. 24**:** 377–399. [Google Scholar]
Reiss, J., Cohen, N., Dorche, C., Mandel, H., Mendel, R.R., Stallmeyer, B., Zabot, M.T., and Dierks, T. 1998. Mutations in a polycistronic nuclear gene associated with molybdenum cofactor deficiency. Nat. Genet. 20**:** 51–53. [DOI] [PubMed] [Google Scholar]
Shin, R., Kim, M.J., and Paek, K.H. 2003. The CaTin1 (Capsicum annuum TMV-induced Clone 1) and CaTin1-2 genes are linked head-to-head and share a bidirectional promoter. Plant Cell Physiol. 44**:** 549–554. [DOI] [PubMed] [Google Scholar]
Spellman, P.T. and Rubin, G.M. 2002. Evidence for large domains of similarly expressed genes in the Drosophila genome. J. Biol. 1**:** 5. [DOI] [PMC free article] [PubMed] [Google Scholar]

WEB SITE REFERENCES

ftp://genome-ftp.stanford.edu/pub/smd/organisms/AT; Stanford database.
http://arabidopsis.info/prototype; NASC Affymetrix database.
http://mips.gsf.de/proj/thal/db/index.html; MIPS Web site.
http://www.genome.ad.jp/kegg/; Kegg database.