Genome analysis: RNA recognition motif (RRM) and K homology (KH) domain RNA-binding proteins from the flowering plant Arabidopsis thaliana

Abstract

Regulation of gene expression at the post-transcriptional level is mainly achieved by proteins containing well-defined sequence motifs involved in RNA binding. The most widely spread motifs are the RNA recognition motif (RRM) and the K homology (KH) domain. In this article, we survey the complete Arabidopsis thaliana genome for proteins containing RRM and KH RNA-binding domains. The Arabidopsis genome encodes 196 RRM-containing proteins, a more complex set than found in Caenorhabditis elegans and Drosophila melanogaster. In addition, the Arabidopsis genome contains 26 KH domain proteins. Most of the Arabidopsis RRM-containing proteins can be classified into structural and/or functional groups, based on similarity with either known metazoan or Arabidopsis proteins. Approximately 50% of Arabidopsis RRM-containing proteins do not have obvious homologues in metazoa, and for most of those that are predicted to be orthologues of metazoan proteins, no experimental data exist to confirm this. Additionally, the function of most Arabidopsis RRM proteins and of all KH proteins is unknown. Based on the data presented here, it is evident that among all eukaryotes, only those RNA-binding proteins that are involved in the most essential processes of post-transcriptional gene regulation are preserved in structure and, most probably, in function. However, the higher complexity of RNA-binding proteins in Arabidopsis, as evident in groups of SR splicing factors and poly(A)-binding proteins, may account for the observed differences in mRNA maturation between plants and metazoa. This survey provides a first systematic analysis of plant RNA-binding proteins, which may serve as a basis for functional characterisation of this important protein group in plants.

INTRODUCTION

In the last decade, it has become increasingly apparent that post-transcriptional control of gene expression in eukaryotes is very important. In particular, post-transcriptional regulatory events are of crucial importance for development. The levels of regulation include pre-mRNA splicing, polyadenylation, mRNA transport, translation and stability/decay. Regulation is mainly achieved either directly by RNA-binding proteins or indirectly, whereby RNA-binding proteins modulate the function of other regulatory factors. The large variety of possible RNA targets implies the existence of a large number of RNA-binding proteins with different binding specificities.

The most abundant nuclear RNA-binding proteins in human cells are collectively termed heterogeneous nuclear ribonucleoproteins (hnRNPs), according to their association with nascent RNA polymerase II transcripts. Molecular cloning of genes encoding hnRNPs led to the discovery of several motifs involved in RNA binding (1,2). The most widely spread is the consensus sequence RNA-binding domain (CS-RBD), also known as RNA recognition motif (RRM). The RRM contains two short consensus sequences, RNP1 (octamer) and RNP2 (hexamer) embedded in a structurally, but not sequence, conserved region of approximately 80 amino acids. RRMs are present not only in hnRNP proteins, but also in a large variety of other RNA-binding proteins involved in all post-transcriptional processes, whereby the number of RRMs per protein varies from one to four copies (1,2). The K homology (KH) motif, first identified in the human hnRNP K protein (3), is the second most frequently found RNA-binding domain (1). The KH domain is approximately 60 amino acids long with a characteristic pattern of hydrophobic residues and with the most conserved consensus sequence VIGXXGXXI mapping to the middle of the domain (1). One protein can possess up to 15 copies of the KH domain. Both RRM and KH domains seem to be ancient protein structures as they have been found in organisms ranging from bacteria to humans (1).
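The KH core consensus quoted above, VIGXXGXXI, can be illustrated with a simple pattern scan. The following is a minimal sketch only: it assumes X stands for any residue, the function name and the toy sequence are hypothetical, and a strict pattern match is far cruder than the similarity searches used in this survey, so it would miss diverged KH domains.

```python
import re

# Core KH consensus from the text, VIGXXGXXI, with X taken to mean any residue.
# The lookahead lets overlapping matches be reported as well.
KH_CORE = re.compile(r"(?=(VIG..G..I))")


def find_kh_core(seq):
    """Return (1-based start, matched text) for each VIGXXGXXI-like hit."""
    return [(m.start() + 1, m.group(1)) for m in KH_CORE.finditer(seq.upper())]


# Hypothetical sequence, made up for illustration only.
toy = "MSTKELRVIGKKGSTIKELQE"
print(find_kh_core(toy))
```

A hit from such a scan is only a hint that a KH domain may be present; the characteristic pattern of hydrophobic residues across the whole domain is not checked here.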

In this article, we surveyed the complete Arabidopsis thaliana genome for RNA-binding proteins containing RRM and KH domains. Proteins containing zinc fingers, zinc knuckles, RGG boxes or DEAD boxes as the sole possible RNA-binding motifs were not included as none of these motifs can be used to exclusively predict an RNA-binding function.

MATERIALS AND METHODS

Protein sequences of known plant or metazoan RRM-containing proteins were used to query the nucleotide sequence of the Arabidopsis genome using TBLASTN and the annotated set of predicted Arabidopsis proteins using BLASTP (4) on the TAIR server (http://www.arabidopsis.org/Blast/). All identified proteins were used for searching the non-redundant protein database at the National Center for Biotechnology Information (NCBI, Bethesda, MD) using the BLAST and PSI-BLAST programs (http://www.ncbi.nlm.nih.gov:80/BLAST/). The BLOSUM62 matrix (5) was used for scoring, and E values <0.001 were used as a threshold for inclusion into the final protein list. Final alignments presented in Figures 3–6 were generated using ClustalW (6) (http://www2.ebi.ac.uk/clustalw/) and shaded on the BoxShade server (http://www.ch.embnet.org/software/BOX_form.html).
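The E value threshold described above can be sketched as a filter over search results. This is an illustration under stated assumptions, not the procedure actually used: the searches here were run on web servers, whereas the sketch assumes hits exported as 12-column tab-separated rows (query, subject, ..., E value, bit score), as produced by the tabular output of command-line BLAST. The function name and all example identifiers and scores are hypothetical.

```python
import csv
import io
from collections import defaultdict

E_VALUE_CUTOFF = 1e-3  # threshold stated in the text


def hits_below_cutoff(handle, cutoff=E_VALUE_CUTOFF):
    """Group subject IDs by query for rows whose E value is below cutoff.

    Assumes 12-column tab-separated rows with the query ID in column 1,
    the subject ID in column 2 and the E value in column 11.
    """
    kept = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue < cutoff:
            kept[query].append(subject)
    return dict(kept)


# Hypothetical rows; identifiers and scores are invented.
example = io.StringIO(
    "queryA\tprot1\t62.5\t80\t30\t0\t1\t80\t10\t89\t2e-25\t105\n"
    "queryA\tprot2\t31.0\t58\t40\t0\t5\t62\t3\t60\t0.5\t28.1\n"
    "queryB\tprot3\t48.7\t78\t40\t0\t2\t79\t12\t89\t4e-12\t66.2\n"
)
print(hits_below_cutoff(example))
```

Counting the entries kept per query would be one way to express a requirement for several significant hits before a protein is accepted.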

Figure 3.


Sequence analysis of Arabidopsis hnRNP A/B-like proteins. Alignment of the first (top) and the second (bottom) RRM of Arabidopsis hnRNP A/B proteins with metazoan members of the hnRNP A/B group of proteins. Sequences were aligned using the ClustalW program and shaded by the BoxShade server. Amino acids identical or similar in 50% of the sequences are shaded with a black or grey background, respectively. Conserved secondary structure elements are indicated between alignments of the two RRMs. Asterisks indicate the position of residues located in the conserved hydrophobic core (52). Residues involved in formation of inter-RRM salt bridges are indicated with blue squares. The two acidic amino acids in the second RRM that are possibly involved in salt bridges in Arabidopsis proteins are indicated with purple squares. The RNP1 and RNP2 motifs are indicated with red boxes. Consensus sequences at the bottom of each alignment indicate residues that are conserved in 10/13 sequences for the whole alignments or in 5/6 sequences for Arabidopsis proteins. Six groups of similar amino acids are indicated as follows: B = H, K, R; J = I, L, M, V; O = F, W, Y; U = S, T; X = A, G; Z = D, E. Hs, Homo sapiens; Xl, X.laevis; Dm, D.melanogaster; Sa, Schistocerca americana; Ce, C.elegans. Accession numbers of proteins used in alignment are as follows: HsA1, SWISS-PROT P09651; HsB1, SWISS-PROT M29064; XlA1a, SWISS-PROT M31041; Dmhrp36, SWISS-PROT P48810; Dmhrp48, SWISS-PROT P48809; SaA1, SWISS-PROT P21522; CeA1, SWISS-PROT D10877.
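The consensus rule in this legend (a position is reported when it is conserved in at least a set number of sequences, with six groups of similar amino acids) can be sketched as follows. This is one possible reading, not the exact procedure behind the figure: how ties and gaps were handled is not stated, so the choices below are assumptions, and the function name and the toy alignment are hypothetical.

```python
from collections import Counter

# Similarity groups exactly as defined in the Figure 3 legend.
GROUPS = {"B": "HKR", "J": "ILMV", "O": "FWY", "U": "ST", "X": "AG", "Z": "DE"}
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def consensus(aligned, min_count):
    """Consensus line for equal-length aligned sequences.

    A column gets the residue letter if one residue occurs in at least
    min_count sequences, otherwise a group letter if one similarity group
    reaches min_count, otherwise '.'. Gaps ('-') are ignored (assumption).
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must have equal length")
    out = []
    for column in zip(*aligned):
        residues = Counter(c for c in column if c != "-")
        groups = Counter(RESIDUE_TO_GROUP[c] for c in column if c in RESIDUE_TO_GROUP)
        symbol = "."
        if residues and residues.most_common(1)[0][1] >= min_count:
            symbol = residues.most_common(1)[0][0]
        elif groups and groups.most_common(1)[0][1] >= min_count:
            symbol = groups.most_common(1)[0][0]
        out.append(symbol)
    return "".join(out)


# Hypothetical four-sequence alignment, invented for illustration.
toy = ["KLFVG", "RLFIG", "KIYVG", "HLFLA"]
print(consensus(toy, min_count=3))
```

With the thresholds from the legend, the same function would be called with min_count=10 on the 13 aligned sequences and with min_count=5 on the six Arabidopsis sequences.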

Figure 6.


Alignment of RRMs of eight Arabidopsis 30K-RRM proteins. Details as in Figure 2. Consensus sequence at the bottom of alignment indicates residues that are conserved in 6/8 sequences.

The identification of KH domain-containing proteins in Arabidopsis followed the same procedure as described above for RRM-containing proteins.

A curated webpage containing results shown in Tables 1 and 2 will be available at http://www.at.embnet.org/bch/arabidopsis.htm.

Table 1. Summary of Arabidopsis RRM-containing proteins.


Table 2. Summary of Arabidopsis proteins containing KH domain.


RESULTS AND DISCUSSION

RNA recognition motif (RRM) proteins

The Arabidopsis genome encodes 196 different RRM-containing proteins, which is more than those found in Caenorhabditis elegans [100 (7)] and Drosophila melanogaster [117 (8)]. Table 1 summarises our analysis, which comprises only RRM proteins containing both RNP1 and RNP2 submotifs. The final decision whether a protein fulfills the criteria of an RRM-containing protein was made by BLASTP and PSI-BLAST searches against the entire gene set in GenBank. Only those proteins producing several hits with E values <0.001 were included in further analysis. This also allowed the prediction of the closest homologues in organisms other than plants (Table 1). Approximately 50% of Arabidopsis RRM proteins contain more than one RRM domain. In addition to RRM domains, 11 proteins contain C2HC-type zinc knuckles, seven proteins contain C3H-type zinc fingers and four proteins contain C4-type zinc (ring) fingers. Moreover, in nine proteins, RRMs were found in combination with the nuclear transport factor-like (NTF-like) domain (9). The Arabidopsis orthologue of SF1/BBP (BAA97393) possesses one RRM, two C2HC-type zinc knuckles and one KH domain. One protein contains, in addition to an RRM, a C2HC-type zinc knuckle and a cyclophilin domain (AAG51976). This is a unique domain combination found only in plants. In another protein (AAF79856), an RRM was found in combination with the homeobox domain (Table 1). Domain compositions of the major types of Arabidopsis RRM-containing proteins are depicted schematically in Figure 1.

Figure 1.


Schematic representation of the modular structure of Arabidopsis RRM-containing proteins. Only major types of domain combinations are shown. Individual modules are identified by different shapes and colours. Different types of domains (RNA-binding, auxiliary domains and other distinctive regions of proteins) are listed at the bottom.

Of the 196 RRM proteins encoded by the Arabidopsis genome, ∼50% have not yet been described. However, it should not be concluded that 50% of Arabidopsis RRM proteins have an assigned function, as most of the described proteins are published or deposited in GenBank without any functional implications. In general, from our analyses it is clear that plants (Arabidopsis) express a complex set of RRM-containing proteins. At least 50% of them do not have obvious orthologues in metazoa, and among them are also some proteins that belong to protein families present in all higher eukaryotes (such as plant-specific SR proteins) (10). The Arabidopsis genome often encodes two or three closely related RRM proteins. This has already been noted for other large protein groups in Arabidopsis (11,12); indeed, ∼44% of such cases result from large intra- and inter-chromosomal duplications found in the Arabidopsis genome (12). We have not made any effort to determine how many Arabidopsis RRM proteins resulted from such events. Significantly, based on comparison with available cDNA sequences, we have noted that ∼33% (22 out of 53 analysed proteins) of Arabidopsis RRM-protein coding genes have wrongly predicted intron–exon boundaries (Table 1). Consequently, the protein sizes indicated in Table 1 should be handled with caution. This is particularly important when cloning Arabidopsis cDNAs based on current information in the Arabidopsis genome database.

The function for some groups of RRM proteins can be predicted based on the similarity with their metazoan counterparts. This is true for poly(A)-binding proteins (PABPs) (13,14), at least some Arabidopsis SR proteins (15–20), snRNP and spliceosome-associated RRM proteins (21–25), CstF-64 (cleavage stimulation factor of 64 kDa, a protein involved in polyadenylation), nucleolin, S19 ribosomal protein, and translation initiation factor 3 (TIF3). FCA (26) and FPA (27) are the only plant RRM proteins for which a function has been implicated based on the Arabidopsis mutant phenotype. We have identified an additional FCA-like protein (Table 1); whether this protein has a similar function in controlling flowering time in Arabidopsis remains to be established.

In the following paragraphs, we describe and discuss in more detail particular groups of Arabidopsis RRM-containing proteins.

Poly(A)-binding proteins

Poly(A) tails of eukaryotic mRNAs are bound by PABPs, and this interaction was shown to be essential for stimulation of polyadenylation, control of the poly(A) tail length, translation initiation, and for mRNA degradation (28–30). In yeast, all these functions are carried out by a single protein, Pab1p, which is an essential protein present in both the nucleus and the cytoplasm. Mammalian cells contain two distinct PABPs, a cytoplasmic PABP1 which is an orthologue of the yeast Pab1p, and a nuclear PAB2 (PABP2). Consistent with their cellular localisation, PABP1 is a mammalian protein involved in translation and cytoplasmic mRNA stability, whereas PAB2 is involved in polyadenylation (28–30). In contrast to yeast and human, which possess one and two PABPs, respectively, the Arabidopsis genome encodes 12 different PABPs (Table 1). Nine of the 12 Arabidopsis PABPs are homologous to the yeast and mammalian Pab1p, and are likewise composed of four consecutive RRMs (Fig. 1 and Table 1). However, PABP8 and PABP9 are highly diverged members of this protein group (Fig. 2). The other three Arabidopsis PABPs consist of an acidic N-terminal domain followed by one RRM (Fig. 1), which is reminiscent of the mammalian PAB2 protein. These three proteins are also highly similar to PAB2 at the primary sequence level (Fig. 2); we therefore named them AtPAB2a, AtPAB2b and AtPAB2c (Table 1). Despite the quite strong sequence divergence between individual PABPs in Arabidopsis (13,14), PABP2, PABP3 and PABP5 were capable of rescuing a Pab1p-deficient yeast strain (31–33). In a complementation assay, PABP2, which is the most diverged member of this protein family in Arabidopsis, was shown to participate in many of the same post-transcriptional processes identified for yeast Pab1p (32). As some Arabidopsis PABPs are differentially expressed, it has been hypothesised that individual PABPs regulate polyadenylation and deadenylation during different stages of plant development (13,31,34).

Figure 2.


Dendrogram of Arabidopsis PABPs and their orthologues from yeast and human. Dendrogram and sequence alignments were generated with the PileUp program, using default parameters (Genetics Computer Group, Madison, WI).

Based on the sequence similarity, one additional protein, PABP-like (Table 1), can be assigned to this group. In contrast to PABPs which contain four RRMs (Fig. 1), this protein is predicted to have only three RRMs, and experimental data are necessary to define it as a genuine PABP. Furthermore, in Chlamydomonas reinhardtii, a nucleus-encoded PABP (RB47) is required for translational regulation of the chloroplast psbA gene (35). We were not able to unambiguously predict an Arabidopsis orthologue of RB47.

Ser/Arg (SR) and spliceosomal RRM-containing proteins

SR proteins are essential splicing factors identified in all eukaryotes except yeast. They consist of one or two N-terminally positioned RRMs and a C-terminal domain rich in SR dipeptides (Fig. 1), hence the name SR proteins. In metazoa, SR proteins play an important role in constitutive and alternative splicing by promoting interactions across intronic and exonic sequences during early steps of spliceosome assembly, thereby helping in selection of splice sites (36,37). The genes encoding most of the Arabidopsis SR proteins have already been cloned (10,15,17–20), and evidence exists that at least some of them have similar activity in pre-mRNA splicing as their metazoan counterparts (15,19,20,25). In total, the Arabidopsis genome encodes 18 different SR proteins (Table 1), which is more than found in human cells (10 different human SR proteins have been characterised so far) (37). Of the 18 Arabidopsis SR proteins, clear orthologues of human SF2/ASF (atSRp34), SC35 (atSC35) and 9G8 (atRSZp21, atRSZp22 and atRSZp22a) have been identified. In addition to the previously characterised atSRp34, we have identified two novel Arabidopsis proteins having strong similarity with atSRp34 and human SF2/ASF (Table 1, atSRp34a and atSRp34b). In the TAIR database these two proteins are annotated as SF2-like, but due to the wrong prediction of the last three exons they did not contain an SR domain. The other SR proteins (RSp31, RSp40, RSp41, RSZ32 and RSZ33; Table 1) seem to be plant specific (10,18; S.Lopato and A.Barta, unpublished data). Particularly interesting are RSZ32 and RSZ33 which, in addition to one RRM, contain two consecutive C2HC-type zinc knuckles (Fig. 1 and Table 1), a situation not found in any metazoan SR protein. It is worth noting that Arabidopsis expresses three orthologues each of the human SF2/ASF and 9G8 proteins. Except RSp31 and SR45, all other plant SR proteins are represented by pairs of very similar proteins. These close homologues in Arabidopsis may have partially redundant functions; however, evidence exists that pairs of homologous genes are differentially expressed (19; M.Kalyna and A.Barta, unpublished data). This indicates that they may modulate splicing during different stages of plant development, and/or that the target pre-mRNAs which are regulated by such pairs of proteins are different.
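The defining feature mentioned above, a C-terminal domain rich in SR dipeptides, can be made concrete with a small count. This is a rough sketch under assumptions that are not part of the survey: the C-terminal region is taken to be the last half of the sequence, both SR and RS dipeptides are counted, and the function name and the toy sequence are hypothetical. No cut-off for calling a protein an SR protein is implied.

```python
def sr_dipeptide_stats(seq, cterm_fraction=0.5):
    """Count SR and RS dipeptides in the C-terminal part of a sequence.

    Returns (count, fraction), where fraction is the share of overlapping
    dipeptide positions in the C-terminal region that read SR or RS.
    cterm_fraction is an arbitrary choice made for this sketch.
    """
    seq = seq.upper()
    start = int(len(seq) * (1 - cterm_fraction))
    tail = seq[start:]
    count = sum(tail[i:i + 2] in ("SR", "RS") for i in range(len(tail) - 1))
    positions = max(len(tail) - 1, 1)  # avoid division by zero for short input
    return count, count / positions


# Hypothetical sequence, invented for illustration.
toy = "MKVYVGNLRSRSRSRS"
print(sr_dipeptide_stats(toy))
```

Such a count says nothing about the misannotated SF2-like entries mentioned above until the gene model, and therefore the C-terminal sequence, has been corrected.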

Spliceosomal proteins containing an RRM domain are easily found in Arabidopsis (Table 1). This is consistent with the observation that spliceosome composition is in general highly conserved between yeast, plants and metazoa (25,38; Z.J.Lorković and A.Barta, unpublished data). Therefore, the observed differences in intron processing between plants and metazoa must occur at the early steps of intron recognition (25,39). This is supported by the multitude of SR proteins expressed in Arabidopsis, some of which seem to be plant specific. However, additional, not yet experimentally identified, plant-specific RNA-binding proteins could also contribute to plant intron recognition (see also below).

UBP1, RBP45, RBP47, UBA1 and UBA2—oligouridylate-specific RRM proteins

A common feature of this group of nuclear RRM proteins is their specificity for oligouridylates. UBP1, RBP45 and RBP47 are also structurally related; they consist of three RRMs and a glutamine-rich N-terminus (40,41) (Fig. 1). At the primary sequence level as well as at the biological level, RBP45 and RBP47 proteins are clearly different from UBP1 (41). Protoplast transfection experiments have indicated that UBP1 from Nicotiana plumbaginifolia functions in nuclear pre-mRNA maturation by stimulating splicing efficiency of suboptimal introns and increasing the steady-state level of reporter RNAs (40). Neither RBP45 nor RBP47 from N.plumbaginifolia affected splicing and accumulation of reporter RNAs in plant protoplasts (41). The mechanism by which UBP1 increases splicing efficiency is unclear, whereas enhanced accumulation of RNA is apparently due to UBP1 interacting with the 3′-UTR and protecting mRNA from exonucleolytic degradation (40). RBP45/RBP47 and UBP1 are most similar to yeast Nam8p and metazoan TIA-1 proteins, respectively. Nam8p and TIA-1 are components of U1 snRNP (42,43), and stabilise interaction of U1 snRNP with the pre-mRNAs containing introns with suboptimal 5′-splice sites (42–44). Although direct evidence for an association of UBP1 with U1 snRNP is missing, it is possible that its effects on splicing occur in a similar way.

UBA1a and UBA2a were identified as proteins interacting with UBP1 in a yeast two-hybrid system (M.H.L.Lambermon, Y.Fu, D.A.Wieczorek Kirk, M.Dupasquier, W.Filipowicz and Z.J.Lorković, manuscript submitted for publication). Like UBP1, both UBA1a and UBA2a increased the steady-state levels of reporter RNAs when overexpressed in protoplasts, but unlike UBP1 neither protein stimulated pre-mRNA splicing. It has been suggested that UBP1, UBA1 and UBA2 proteins may act as components of a complex recognising U-rich sequences in plant 3′-UTRs, resulting in mRNA stabilisation in the nucleus (M.H.L.Lambermon, Y.Fu, D.A.Wieczorek Kirk, M.Dupasquier, W.Filipowicz and Z.J.Lorković, manuscript submitted for publication). Neither UBA1 nor UBA2 seems to have orthologues in metazoan genomes.

Arabidopsis proteins with homology to metazoan hnRNPs

In human cells, 20 different hnRNP proteins or groups of proteins have been characterised. hnRNP proteins have been identified in C.elegans, Xenopus laevis and D.melanogaster; in the latter organism 12 major proteins, termed hrp36 to hrp75, were identified and most of them have a strong sequence similarity to the human hnRNP A/B proteins (45–48). Metazoan hnRNP A/B proteins are composed of two adjacent N-terminally positioned RRMs and a glycine-rich C-terminal auxiliary domain (2,49,50). At the biological level, hnRNP A/B proteins are involved in alternative splicing by promoting usage of distal 5′-splice sites, thereby antagonising the alternative splicing activity of splicing factors SF2/ASF and SC35 (49–51).

Previous analysis of the Arabidopsis genome revealed six genes whose predicted protein sequences had a domain organisation typical of hnRNP A/B proteins (25). Sequence analysis of predicted Arabidopsis proteins, named AtRNP A/B_1 to 6, revealed that all of them, like metazoan proteins, are composed of two RRMs followed by a C-terminal auxiliary domain. However, the C-terminal domain in only two Arabidopsis proteins is glycine-rich, as in the metazoan hnRNP A/B proteins, whereas in the other four proteins this domain is rather equally enriched in glycine, asparagine and serine residues (Fig. 1). Comparison of the RNA-binding domains with their metazoan counterparts revealed strong sequence conservation that results in an ungapped alignment of RRM2 and one amino acid gap in loop five of the RRM1 (Fig. 3). Pairwise comparison of RRMs of AtRNP A/B proteins with metazoan hnRNP A/B RRMs resulted in identity and similarity scores of 44–50% and 50–60%, respectively, which is greater than the scores obtained with RRMs from other metazoan or plant RRM-containing proteins. In addition to conserved positions within the RRMs of most RRM proteins (core consensus in Fig. 3) (52), this alignment reveals positions that are highly conserved in RRMs of both Arabidopsis and metazoan hnRNP A/B proteins, and in Arabidopsis proteins alone (consensus and At consensus lanes in Fig. 3). In terms of molecular weight, isoelectric point and amino acid composition of the C-terminal auxiliary domains, Arabidopsis proteins are most similar to D.melanogaster hrp proteins. As is the case for the D.melanogaster hrp proteins (45–47), we were unable to unambiguously identify vertebrate orthologues of the Arabidopsis hnRNP A/B proteins.
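Pairwise identity and similarity scores of the kind quoted above can be computed from two aligned sequences as sketched below. The scoring scheme behind the reported percentages is not specified, so this sketch makes its own assumptions: similar residues are defined by the six groups from the Figure 3 legend, columns with a gap in either sequence are skipped, and the function name and the toy fragments are hypothetical.

```python
# Similarity groups borrowed from the Figure 3 legend (an assumption here).
SIMILAR = ["HKR", "ILMV", "FWY", "ST", "AG", "DE"]
GROUP_OF = {aa: i for i, members in enumerate(SIMILAR) for aa in members}


def identity_similarity(a, b):
    """Percent identity and percent similarity of two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped. Similarity
    counts identical residues plus residues from the same group.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0, 0.0
    identical = sum(x == y for x, y in pairs)
    similar = sum(
        x == y or (x in GROUP_OF and GROUP_OF.get(x) == GROUP_OF.get(y))
        for x, y in pairs
    )
    n = len(pairs)
    return 100.0 * identical / n, 100.0 * similar / n


# Hypothetical aligned fragments, invented for illustration.
print(identity_similarity("KLFVGGLS", "RLFIGNLS"))
```

Because similarity includes identity in this definition, the second number can never be lower than the first; other definitions of similarity would give different values.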

In spite of the strong similarity between Arabidopsis and metazoan hnRNP proteins, certain differences do exist. The linker region (IRL) between the two RRMs, which is conserved in metazoan proteins in length (13 amino acids) and primary sequence (53), is variable in Arabidopsis proteins (11–19 amino acids). The length of the IRL is known to be important for the spatial arrangement of the two RRMs and for alternative splicing activity of the human hnRNP A1 protein (53). Another difference concerns the residues involved in formation of two salt bridges, responsible for holding the two RRMs in close contact (Fig. 3, blue squares) (53–55). The two arginines (mostly lysines in Arabidopsis proteins) in RRM1 are conserved in terms of charge, whereas only the second of the pair of acidic residues (aspartic acid 157) seems to be conserved in RRM2 of Arabidopsis proteins (Fig. 3, blue squares). Interestingly, there is another pair of acidic amino acids just one position upstream (Fig. 3, purple squares), which could potentially take over the function of Asp155 and Asp157 in making the salt bridges.

In addition to putative homologues of hnRNP A/B proteins, it seems that the Arabidopsis genome also encodes homologues of hnRNP H/F and hnRNP I proteins (Table 1) (25). hnRNP A, B, H, F and I are all involved in splicing regulation, particularly in alternative splicing (49,50). It is important to note that the splicing factors SF2/ASF and SC35 that antagonise hnRNP A1 activity in usage of alternative 5′-splice sites are also conserved in plants (15,18,19) (Table 1; see also above). CUG-BP (or CELF) proteins that antagonise the activity of hnRNP I in 3′-splice site selection (56) are likewise conserved in Arabidopsis (Table 1). In the light of increasing evidence that alternative splicing is important in regulating gene expression in plants, it is interesting that these hnRNP proteins are conserved. Consequently, analysis of factors involved in this process, including Arabidopsis hnRNP proteins, is necessary for a better understanding of this important aspect of post-transcriptional regulation of gene expression in plants.

Database searches with the complete set of Arabidopsis RRM proteins presented in Table 1 did not reveal possible homologues of other human RRM-containing hnRNPs.

Chloroplast RRM-containing proteins (cpRNPs)

A group of nucleus-encoded, RRM-containing RNA-binding proteins has been described in the chloroplasts of higher plants (57–61). They have a characteristic domain organisation: an N-terminal transit peptide, which is necessary for import into chloroplasts, is followed by an acidic domain at the N-terminus of the mature protein and two consecutive RRMs at the C-terminus (Fig. 1). The Arabidopsis genome encodes eight cpRNPs and, according to sequence homology to previously described cpRNPs from different plant species (57–59), we named them cpRNP28, cpRNP29, cpRNP31 and cpRNP33. Again, as shown for SR proteins, each cpRNP is represented by two closely related proteins (Table 1). In tobacco, cpRNPs are abundant stromal proteins that exist as complexes with ribosome-free mRNAs (62). Evidence also exists that cpRNPs are involved in chloroplast mRNA 3′-end formation and RNA stabilisation (63,64); furthermore, as shown recently, some cpRNPs are involved in chloroplast RNA editing (65). A more detailed description of cpRNPs can be found in other reviews (60,61).

Glycine-rich and small RRM-containing proteins

This group comprises 27 Arabidopsis proteins. A common feature of these proteins is their similar domain organisation; they all contain one RRM at the N-terminus and a C-terminal extension. Based on differences in the C-terminal part, we divided them into two subgroups: (i) glycine-rich RNA-binding proteins (GR-RBPs) and (ii) small RNA-binding proteins (S-RBPs) (Table 1 and Fig. 1). To distinguish between cell wall-localised glycine-rich proteins (GRPs) that do not contain RRMs and RRM-containing GRPs, we propose renaming the RRM-containing GRPs as GR-RBPs (glycine-rich RNA-binding proteins). GR-RBPs are represented by eight members; all have been previously reported from different plant species (61). They have been implicated in responses to various environmental stresses (66–68) and rRNA processing (69,70), and some of them seem to be regulated by a circadian clock (61,67,71,72). Alignment of GR-RBPs revealed strong primary sequence conservation in their RRMs (Fig. 4), indicating that this is a homogenous group of proteins with similar RNA-binding specificities and possibly related functions. Furthermore, based on the sequence similarity in their RRMs, three additional proteins can be assigned to this subgroup. In contrast to GR-RBPs, the C-terminus in these three proteins is rather arginine/aspartate/glutamate-rich, with RD (BAB02203; NP_196048) or RD/RE (AAB71977) repeats. Moreover, these proteins have a C2HC-type zinc knuckle inserted between the RRM and the C-terminal domain (Fig. 1), which is not found in GR-RBPs and S-RBPs. They most probably represent orthologues of the N.silvestris RZ-1 protein, which has been found in a nucleoplasmic 60S RNP complex and in association with nuclear poly(A)+ RNA (41,73). We have found one additional protein with the same domain organisation (Table 1; RZ-1_like; AAG51392), but its C-terminal domain contains basic and acidic patches instead of RD or RD/RE repeats. Moreover, phylogenetic analysis revealed that this protein is more similar to some GR-RBPs and S-RBPs than to the three AtRZ-1 proteins. Because RZ-1 and RZ-1_like proteins are encoded in genomes of different plant species, but not in metazoan genomes, we conclude that they are plant specific.

Figure 4.


Alignment of RRMs of eight Arabidopsis GR-RBPs. The three Arabidopsis RZ-1 orthologues and RZ-1_like protein are not included. Details as in Figure 2. Consensus sequence at the bottom of alignment indicates residues that are conserved in 7/8 sequences.

The 15 S-RBPs are grouped together based only on their predicted molecular weight. In contrast to GR-RBPs, alignment of their RRMs revealed that they are a rather heterogeneous group of proteins with low sequence homology outside the RNP1 and RNP2 submotifs (Fig. 5). BLASTP searches with individual members of this subgroup resulted in limited homology with various plant and metazoan RRM-containing proteins. The most common hits were plant GR-RBPs, human and X.laevis CIRP protein, and human RBM3 proteins which are induced by cold shock (74–77). It remains to be established whether S-RBPs also respond to cold stress or other environmental conditions.

Figure 5.


Alignment of RRMs of 15 Arabidopsis S-RBPs. Details as in Figure 2. Consensus sequence at the bottom of alignment indicates residues that are conserved in 11/15 sequences.

GR-RBPs and S-RBPs have been found in organisms ranging from Cyanobacteria to humans (61,74–78). Cyanobacterial and metazoan GR-RBPs are, like some plant GR-RBPs, induced by cold-shock (77,78 and references therein); however, the exact function of these proteins remains largely unknown. The eight GR-RBPs together with 15 S-RBPs with similarity to plant and metazoan GR-RBPs (Table 1) constitute the largest group of RRM-containing proteins in Arabidopsis. It seems that genes encoding this group of proteins became highly amplified during evolution of land plants. This may not be a surprise, because unlike metazoa, plants are sessile organisms which are constantly exposed to changes in their environment. Amplification of this gene family and subsequent acquisition of differential expression could be a way to regulate RNA metabolism under different environmental conditions.

30K-RRM proteins

This is a very homogenous group of eight proteins containing one RRM with strong sequence homology in the entire RRM region (Fig. 6), which may indicate a common ancestor or otherwise very similar functions. Another common feature of these proteins is their similar molecular weight of ∼30 kDa, hence the name 30K-RRM proteins. As in GR-RBPs, the RRMs of 30K-RRM proteins are located in the N-terminal half of the protein, followed by a C-terminal extension with rather unusual amino acid compositions (rich in proline, glutamine, histidine, glycine, serine and acidic amino acids) (Fig. 1). These domains could possibly play a role in protein–protein interactions. The best metazoan match found with any of these plant sequences in BLASTP searches was the human SEB4D protein whose function is not known. However, in spite of the sequence similarity with SEB4D, which is limited to the RRM regions, 30K-RRM proteins do not seem to have orthologues in metazoa. The function of all 30K-RRM proteins remains to be determined.

Arabidopsis RRM proteins containing an NTF-like domain

The NTF domain was first identified in NTF2 protein which is involved in nuclear protein import (9). Later, a related factor, p15 (or NXT1) involved in nuclear protein export was found to possess an NTF-like domain (79). Meanwhile, NTF-like domains have been found in a large variety of proteins, including nucleocytoplasmic transport factor TAP (Mex67p in yeast), some plant MAP kinases and a protein G3BP implicated in the Ras signal transduction pathway (80).

Of the nine Arabidopsis proteins containing an NTF-like domain in combination with an RRM, the domain organisation of three proteins resembles that of human G3BP: an N-terminally positioned NTF domain is followed by one RRM and an RGG box at the very C-terminus (80,81) (Fig. 1). None of the NTF-RRM Arabidopsis proteins has previously been described. By analogy with metazoan proteins it is likely that they are involved in nucleocytoplasmic trafficking of RNA and/or proteins. Alternatively, they could be involved in some signal transduction pathways in plants, as is human G3BP.

Messenger RNAs are exported from the nucleus as large ribonucleoprotein complexes (82,83). A protein complex that associates with mRNA during splicing and distinguishes spliced from unspliced mRNAs has recently been identified (82,83). Among the proteins identified in this complex are two RRM proteins, REF (84) and Y14 (85), which are highly conserved in Arabidopsis (Table 1). As in mouse (84), four different REF proteins are expressed in Arabidopsis (Table 1). Only one REF homologue, Yra1p, exists in yeast (80,86), whereas Y14 does not seem to be present in yeast at all. Other components of this protein complex (DEK, SRm160 and RNPS1) are less conserved in Arabidopsis. It is interesting to note, however, that the Arabidopsis genome does not encode a protein corresponding to TAP/Mex67p, a component of the mRNA export machinery found to interact directly with REF (84). TAP was also shown to interact through its NTF-like domain with p15 (NXT1) (80), which is likewise absent from the Arabidopsis genome, and this interaction is required for efficient export of mRNA to the cytoplasm (87,88). The multitude of RRM proteins containing an NTF-like domain in Arabidopsis makes it possible that at least some of these proteins take over the function of TAP in mRNA export in plants, while the function of p15 could be mediated by one of the three Arabidopsis NTF2 homologues.

Other Arabidopsis RRM-containing proteins

This group consists of 69 proteins; ∼25% of these proteins have already been mentioned in the general description of Arabidopsis RRM-containing proteins, or are discussed in connection with proteins from other groups. In addition, three other members of this group, and proteins similar to them, are described below. The remaining proteins, which are listed at the end of Table 1, show only limited similarities to RRM-containing proteins from other organisms (best scores are indicated in Table 1), and are therefore not discussed further. Establishing their possible functional relationships to metazoan proteins will require experimentation.

We have identified five novel proteins highly related to the previously published Arabidopsis RBP37 (Table 1) which is expressed in dividing cells during development (89). This group of proteins does not have obvious orthologues in metazoa, and their functional targets are still to be determined.

AtRBP1 protein, which consists of two N-terminal RRMs and an extension at the C-terminus, was found to be expressed in rapidly dividing tissues (90). The RRMs of this protein are most similar to those of metazoan Musashi proteins (91–93 and references therein). RRM proteins belonging to the Musashi family are specifically expressed in the nervous system, particularly in stem cells and neural progenitor cells; however, their roles are poorly understood (91–93 and references therein). We have identified three additional proteins having similarity to AtRBP1 and Musashi. Two of those proteins (BAB08520; NP_173208) contain a glycine-rich C-terminal domain which makes them similar to hnRNP A/B and D proteins (49,50). Indeed, in BLASTP searches the best scores obtained with both proteins were Musashi and hnRNP D proteins, whereas the best Arabidopsis scores were proteins designated as AthnRNP A/B proteins (Table 1). Experimental data are required to establish whether the function of these proteins is similar or equivalent to Musashi, hnRNP D or hnRNP A/B proteins.

In Schizosaccharomyces pombe, Mei2p has been shown to be required for both induction of premeiotic DNA synthesis and promotion of the first meiotic division (94). The Arabidopsis protein AML1, which is highly similar to the S.pombe Mei2p, was cloned by functional complementation of a fission yeast pheromone receptor-deficient strain (95). We have identified four novel Mei2p-like proteins in Arabidopsis. It remains to be determined whether they also participate in the regulation of meiosis.

Arabidopsis proteins containing KH domain

In metazoa, proteins containing KH domains have been implicated in transcription, mRNA stability, translational silencing and mRNA localisation (50,96,97). Mutations in KH domain proteins very often result in developmental defects. For example, the D.melanogaster how gene encodes a single KH-domain protein essential for tendon cell differentiation (98), whereas murine quaking protein is required for maturation of Schwann cells into myelin-forming cells in the peripheral nervous system (99). Quaking and two other KH domain-containing proteins, FMRP and MCG10, have been shown to induce apoptosis (100–102). FMRP is encoded by the FMR gene; transcriptional silencing of this gene or a mutation in the C-terminal KH domain leads to the fragile X syndrome (103,104). In addition, some metazoan KH proteins have been shown to be autoantigens associated with certain tumour types (105–107).

In Arabidopsis we have found 26 proteins containing KH domains (Table 2). In contrast to the 27 Drosophila KH domain proteins (8), most Arabidopsis KH proteins possess more than one KH domain (Table 2). An alignment of 60 KH domains can be found at http://www.embnet.org/bch/arabidopsis.htm. In addition to KH domains, two Arabidopsis proteins possess C3H-type zinc fingers (gene IDs At5g06770; At3g12130), whereas the Arabidopsis homologue of splicing factor SF1/BBP contains two C2HC-type zinc knuckles (gene ID At5g51300). Large KH domain proteins, such as chicken vigilin, which possesses 15 KH domains, were not found in the Arabidopsis genome. Vigilin homologues have been found in human, X.laevis, D.melanogaster, C.elegans, S.pombe and Saccharomyces cerevisiae, and evidence exists for their involvement in the control of cell ploidy (108), heterochromatin structure (109) and possibly RNA stabilisation (110). Despite the highly conserved primary structure and function of vigilins in other eukaryotes, an extensive search of the Arabidopsis genome failed to reveal homologous proteins. The only Arabidopsis KH proteins that could unambiguously be predicted as orthologues of yeast or metazoan KH proteins are the splicing factor SF1/BBP (gene ID At5g51300) and the homologue of the yeast KRR1p (AtKRR1p; gene ID At5g08420). KRR1p and its metazoan orthologues (human Rip1 and Drosophila Dribble proteins) are nucleolar proteins implicated in rRNA processing (111,112 and references therein). Strong conservation of SF1/BBP and AtKRR1p is in accordance with the high conservation of pre-mRNA and rRNA processing machineries in all eukaryotes. BLASTP, PSI-BLAST and FASTA searches with Arabidopsis KH domain-containing proteins resulted in limited similarities with diverse metazoan KH proteins, but these similarities are restricted to the KH domains only. For example, the KH domains of five Arabidopsis proteins (gene IDs At1g09660, At2g38610, At5g56140, At3g08620 and At2g26480) show limited similarity to the respective domains of mammalian quaking proteins (Table 2), whereas the other regions of these proteins do not show significant similarity. Experimental data will be required to establish whether these proteins have similar functions. The same applies to possible Arabidopsis homologues of metazoan hnRNP K and E proteins. It seems that plants have evolved KH proteins with entirely different domain organisations, resulting most probably in different binding specificities and biological functions. Given that many metazoan KH proteins are involved in cell differentiation and development, this may not be a surprise. Plant development, despite following some common themes found in metazoa in pattern formation, requires plant-specific protein functions. This is best illustrated by the existence of a large number of plant-specific transcription factors (11).

It is noteworthy that none of the Arabidopsis KH-domain proteins has been characterised so far.

CONCLUSION

Plants have evolved a large number of kingdom-specific RNA-binding proteins. Only those proteins required for basic mechanisms in post-transcriptional regulation of gene expression have been preserved in all eukaryotic lineages during evolution. The function of Arabidopsis RRM and particularly KH domain proteins is largely unknown. In the years to come, the great task will be to characterise Arabidopsis RNA-binding proteins using DNA microarray technology and reverse genetics. This, together with biochemical characterisation, will aid in understanding observed differences in post-transcriptional regulation of gene expression between plants and metazoa. Such analyses will help to place individual RRM or KH proteins into a complex network regulating plant development and plant–environment interactions.

ACKNOWLEDGEMENTS

We are grateful to Maria Kalyna and Tim Skern for helpful and critical comments on the manuscript. We thank Joachim Seipelt for help with the webpage. This work was supported by a grant (SFB17 Nos 1710 and 1711) from the Österreichischer Fonds zur Förderung der Wissenschaftlichen Forschung to A.B.

REFERENCES