Domain mobility in proteins: functional and evolutionary implications
Journal Article
Malay Kumar Basu presently is a Senior Bioinformatics Engineer at the J. Craig Venter Institute (Rockville MD, USA).
Eugenia Poliakov is a staff scientist at the Laboratory of Retinal Cell and Molecular Biology, National Eye Institute, National Institutes of Health (Bethesda MD, USA).
Igor B. Rogozin is a staff scientist at the National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health (Bethesda MD, USA).
Received: 04 September 2008
Revision received: 08 December 2008
Published: 16 January 2009
Malay Kumar Basu, Eugenia Poliakov, Igor B. Rogozin, Domain mobility in proteins: functional and evolutionary implications, Briefings in Bioinformatics, Volume 10, Issue 3, May 2009, Pages 205–216, https://doi.org/10.1093/bib/bbn057
Abstract
A substantial fraction of eukaryotic proteins contains multiple domains, some of which show a tendency to occur in diverse domain architectures and can be considered mobile (or ‘promiscuous’). These promiscuous domains are typically involved in protein–protein interactions and play crucial roles in interaction networks, particularly those contributing to signal transduction. They also play a major role in creating diversity of protein domain architecture in the proteome. It is now apparent that promiscuity is a volatile and relatively fast-changing feature in evolution, and that only a few domains retain their promiscuity status throughout evolution. Many such domains attained their promiscuity status independently in different lineages. Only recently, we have begun to understand the diversity of protein domain architectures and the role the promiscuous domains play in evolution of this diversity. However, many of the biological mechanisms of protein domain mobility remain shrouded in mystery. In this review, we discuss our present understanding of protein domain promiscuity, its evolution and its role in cellular function.
PROTEIN DOMAINS
Protein domains are the structural and functional units of proteins. It is now well established that proteins carry out their functions primarily through their constituent domains, and a protein can acquire a new function by gaining a domain. Domains are, therefore, considered to be the units through which proteins evolve. In structural biology, domains are defined as independent folding units in a protein. In sequence analysis, however, domains are generally identified as highly conserved regions of the protein sequence. This apparent contradiction in the definition of a protein domain disappears upon scrutiny: domains identified by sequence conservation alone have been shown to have distinct structural identity [1, 2]. Numerous sequence- and structure-based domain databases enable protein domain detection with very high accuracy, such as Pfam [3], SMART [4], CDD [5], INTERPRO [6], SCOP [7], ProDom [8], DALI [9] and CATH [10]. These databases use either sequence- or structure-based methods to identify regions in protein sequences that belong to specific domain families.
Despite decades of study, the biological mechanisms shaping the domain architecture of proteins are largely unknown. However, it is now known that domains differ in their propensity to form multidomain proteins. While some domains are present only in specific combinations, others participate in diverse domain architectures. Domains of the latter type are called ‘promiscuous’ or mobile domains, and are very important in creating the observed diversity in protein domain architectures. They play a major role in the signaling networks of the cell by bringing together domains with different functionalities into one protein sequence, thus promoting crosstalk in signaling. Their central role in evolution cannot be overemphasized, but only recently have we begun to understand the role of selection in shaping domain promiscuity. In this review, we discuss our current understanding of protein domain promiscuity, its evolution and its role in cellular function. After briefly discussing the multidomain architecture of proteins, we discuss how promiscuous domains are identified and how domain promiscuity can be measured. Finally, we discuss the functional and evolutionary significance of promiscuous domains.
DOMAIN STRUCTURE OF PROTEINS
The number of unique domains in an organism is roughly proportional to its genome size. In unicellular eukaryotes, such as apicomplexans, diplomonads and protozoans, the number of unique domains is ∼1000, whereas in plants, fungi and animals, the number can be as high as ∼3000. The average size of a domain is ∼100 amino acids [11]. The number of domains per gene (modularity) follows a power-law distribution (see below) [12], and it has been shown that tissue-specific genes have higher modularity [12, 13].
The estimated frequency of multidomain proteins in the three superkingdoms of life (bacteria, archaea and eukaryotes) varies with the methodologies and databases used [14–18], but the emerging consensus is that prokaryotes have fewer multidomain proteins than eukaryotes. The tendency to form multidomain proteins increases from archaea to bacteria to eukaryotes [1, 19]. Within eukaryotes, particularly in animals, there is a distinct tendency towards the formation of multidomain proteins (39% of metazoan proteins contain more than one Pfam domain, whereas the corresponding number for unicellular eukaryotes is smaller, 32% [20]). Nevertheless, a large fraction of the proteins in all three superkingdoms of life contain 0–1 domain [2, 18, 20, 21]. We have to keep in mind that poor description of domains in some lineages may create problems for this analysis. Proteins with zero domains may actually lack domains, or they may contain domains that are as yet unknown. However, it was suggested that the differences between evolutionary lineages are unlikely to be due to differences in annotation coverage [20]. As shown by Ekman and co-workers [17], the Pfam domain coverage is similar for archaea, bacteria and eukaryota: in each group about 70% of the proteins have at least one Pfam domain. In agreement with this conclusion, analyses by Tordai and co-workers [20] have also shown that Pfam coverage is similar for bacteria, archaea, protozoa, plants, fungi and metazoa. It is, therefore, reasonable to infer that the differences in the number of multidomain proteins in archaea, bacteria and eukaryotes are real.
The propensity of protein domains to form multidomain architectures increases with organismal complexity. Though complexity is a contentious issue in evolution, here we define it as the number of cell types in an organism. The phenomenon that organisms with higher complexity tend to acquire more multidomain proteins is called ‘domain accretion’ [22], which could translate into increasing interaction amongst the domains. This may be one of the explanations of the apparent lack of correlation between complexity and the number of genes in a genome (G-value paradox): flies have fewer genes than nematodes; humans have fewer genes than rice [23]. Increasing modularity through domain accretion can, at least in theory, overcome the shortcoming posed by fewer genes in the genome. The biological mechanisms dictating domain accretion are not known. However, there is evidence that domains involved in the same functional pathway tend to come together in one protein sequence [24]. This phenomenon has been used to determine the functions of unknown domains in proteins, in what is called the ‘Rosetta Stone’ approach [24].
Given the large number of domains present in an organism, the number of possible combinatorial arrangements is enormous. However, in eukaryotic genomes domains are present only in a limited set of arrangements in multidomain proteins. This suggests that evolutionary constraints play an important role in the selection of the domain architectures observed in multidomain proteins [2]. Indeed, domain arrangements, even the domain ordering in multidomain proteins, determine their three-dimensional arrangements, and therefore might affect function [25]. Earlier studies showed that most of the domain combinations in multidomain proteins were formed only once in evolution, and that domain combinations are inherited rather than formed through convergent evolution [14, 26]. However, in a recent study, Forslund and co-workers claimed that convergent evolution is more prevalent than previously thought [27]. They investigated the prevalence of domain architecture reinvention in 96 genomes with a novel domain tree-based method that uses maximum parsimony for inferring ancestral protein architectures. They detected multiple origins for 12.4% of the architectures. This result indicates that domain architecture reinvention is a much more common phenomenon than previously thought [27]. Thus, it is possible that the process of convergent domain architecture evolution is driven by functional necessity.
PROMISCUOUS DOMAINS
Domains are present in various combinations in multidomain proteins. While some domains are present in stable configurations, others are present in many different domain milieus. Promiscuous or mobile domains are domains that reside in many different domain combinations [20, 24, 28]. The term promiscuity carries several connotations when applied to a protein domain. In the scientific literature, promiscuity can signify domains with a higher degree of mobility (as described above), or domains that physically interact with many other domains (protein–protein interactions), or domains that bind different types of molecules. In this article, the term promiscuous domain will be used to mean mobile domains.
Although the reasons why some domains are mobile and others are static are largely unknown, some recent studies indicate the possible properties and thereby hint at the reasons. It has been shown that domains in multidomain proteins are generally smaller than those that are present as single domains [20]. This phenomenon is attributed to the fact that domains present in different protein environments need to fold independently, and their smaller size facilitates independent folding [20]. It has also been shown that the mobility of domains may have a large functional dependence: those required for specific functions tend to become mobile in specific lineages [28].
It has been recently shown that promiscuous domains evolve more slowly than non-promiscuous ones [28]. It has also been shown that promiscuous domains identified solely by their co-occurrence in a single polypeptide also tend to show a higher number of physical domain–domain interactions [28]. This is true even for promiscuous domains (e.g. SH3 and PDZ, Table 1) that do not bind to other globular domains, but instead to short linear sequence motifs or covalent protein modifications present in the interaction partner. Taking these observations together, it appears that because promiscuous domains need to participate in many different kinds of protein–protein interactions, they tend to evolve more slowly than domains that participate in specific interactions, where compensatory mutations in both interaction partners could relax the selection pressure on the sequence.
Table 1:
Ten promiscuous domains with the highest average promiscuity in the majority of eukaryotes
Domain (ID) | Average promiscuity (a) | Description |
---|---|---|
PH (smart00233) | 680.07 | Protein–protein interactions; various signaling processes, in particular, inositol phosphate signaling |
AAA+ (smart00382) | 637.38 | ATPase involved in various functions, including chaperone roles and various forms of signal transduction |
SH3 (smart00326) | 587.36 | Protein–protein interactions; various forms of signaling |
C1 (smart00109) | 442 | Small-molecule binding and protein–protein interaction domains present, primarily in protein kinases; various forms of signaling |
GATase (pfam00117) | 424.69 | Glutamine amidotransferase domain found in a variety of metabolic enzymes |
PHD (smart00249) | 420.38 | Protein–protein interactions, primarily in chromatin |
PDZ (smart00228) | 418.74 | Protein–protein interactions; various forms of signaling |
Biotin_lipoyl (pfam00364) | 371.68 | Coenzyme-binding domain of various metabolic enzymes |
RING (smart00184) | 364.35 | Ubiquitin signaling: E3 component of ubiquitin ligases |
EGF (smart00181) | 323.56 | Epidermal growth factor domain; various forms of extracellular signaling |
(a) Average promiscuity is defined as the mean promiscuity value calculated over 28 eukaryotic species in reference [28].
DOMAIN CO-OCCURRENCE NETWORK
If we plot the frequency distribution of domains in an organism, the plot roughly follows a power-law (Figure 1) [1, 29, 30]. In a power-law, the frequency of an event, f(x), is related to its rank i by $f(x) \sim i^{-\gamma}$, where γ is a parameter. The power-law has been identified in numerous biological, physical and social contexts, such as hypertext links on the Internet, the population distribution in towns, the number of reactions in which a particular metabolite is involved, the number of pseudogenes in a particular gene family, and many others [31–39]. Two very common versions of the power-law are Zipf's law, which describes the frequency distribution of words in a text [40], and the Pareto distribution, which describes the distribution of people by wealth [41]. The Pareto distribution also led to the famous Pareto principle, which says ‘few contain many and most contain few’, the so-called 80-20 rule. Examples of such a rule are that 20% of a company's products generate 80% of its returns, that 20% of defects cause 80% of problems, and many others.
Figure 1:
Power-law distribution of domains in the human genome. (A) The rank of each domain, after sorting by frequency in the genome, is plotted on the _X_-axis against its frequency on the _Y_-axis. (B) Log–log plot of the same data, with domain rank on the _X_-axis and frequency on the _Y_-axis.
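The straight line expected in a log–log plot can be turned into a rough estimate of γ. The following is a minimal sketch, assuming made-up domain counts and a plain least-squares fit on log-transformed rank and frequency. The function name `fit_power_law_exponent` and the toy data are illustrative only and are not taken from the cited studies; a least-squares fit on log–log axes is a crude estimator, and more careful analyses generally prefer maximum-likelihood approaches.

```python
import math

def fit_power_law_exponent(frequencies):
    """Estimate gamma in f ~ rank**(-gamma) by least squares on log-log axes."""
    # Rank 1 is the most frequent item; zero counts cannot be log-transformed.
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The slope of log(f) against log(rank) is -gamma.
    return -slope

# Hypothetical domain counts that follow f = 1000 / rank exactly.
toy_counts = [1000 / r for r in range(1, 51)]
gamma = fit_power_law_exponent(toy_counts)
print(f"estimated gamma: {gamma:.3f}")
```

With real domain counts the points scatter around the line, especially in the tail, so the fitted exponent should be read as a summary of the trend rather than a precise parameter.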
The power-law distribution has special mathematical properties related to a type of network called ‘scale-free’, where the frequency distribution of node degrees (number of nodes to which a given node is connected) follows a power-law [33, 34]. Many biological networks are scale-free in nature, such as metabolic networks, protein–protein interaction networks, and many others [36, 42].
Domain co-occurrence networks also fall under the scale-free category [21]. These networks are graphs in which each node represents a domain, and two nodes are connected by an edge only if they are present in a single protein sequence [20, 21, 43, 44]. In a scale-free network, a few nodes are highly connected, but the majority have low connectivity. Additionally, in a scale-free network, the features of the network and the underlying distribution do not change as the number of nodes increases. In a protein domain co-occurrence network, promiscuous mobile domains are highly connected nodes or hubs (Figure 2). This type of connectivity distribution is very different from that of a random network, where the connectivity is largely uniform. Moreover, the scale-free nature of such a network is largely assumed to arise from ‘preferential attachment’, which dictates that the probability of a node acquiring new connections is proportional to its degree (the number of nodes to which a given node is connected). Thus such connectivity in a domain co-occurrence network implies that domain combinations in proteins are not random and that promiscuous domains have a tendency to become more promiscuous during evolution.
Figure 2:
The partial domain co-occurrence graph of the promiscuous domains PH, SH3 and S_TKc in the human genome. The nodes represent domains; two nodes are connected by an edge only when the connecting domains are present next to each other on the same protein sequence.
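A co-occurrence network of this kind can be sketched in a few lines. The example below is a minimal illustration, assuming a toy proteome in which every protein is a list of domain names; the architectures are invented for the example and the function name `cooccurrence_graph` is hypothetical. It links any two domains found in the same protein, which is the definition used in the text; restricting edges to adjacent domains, as in Figure 2, would need only neighbouring pairs.

```python
from itertools import combinations

def cooccurrence_graph(proteins):
    """Map each domain to the set of other domains it shares a protein with."""
    neighbours = {}
    for architecture in proteins:
        distinct = set(architecture)
        for domain in distinct:
            # Register every domain so single-domain proteins appear with degree 0.
            neighbours.setdefault(domain, set())
        for a, b in combinations(distinct, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours

# Hypothetical architectures, not real annotation data.
toy_proteome = [
    ["SH3", "SH2", "S_TKc"],
    ["PH", "SH3"],
    ["PH", "S_TKc"],
    ["PH", "RhoGEF"],
    ["PDZ", "SH3"],
    ["C2"],
]

graph = cooccurrence_graph(toy_proteome)
degree = {domain: len(partners) for domain, partners in graph.items()}

# Hubs are the nodes with the highest degree; ties are broken alphabetically.
hubs = sorted(degree, key=lambda d: (-degree[d], d))
for domain in hubs:
    print(domain, degree[domain])
```

In this toy data SH3 has the highest degree and C2 has none, which mirrors the hub-and-periphery picture described above on a very small scale.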
HOW NEW DOMAIN COMBINATIONS ARE CREATED
To attain promiscuity status, a domain needs to form new domain combinations. It is, therefore, important to understand how new domain combinations are created in proteins. Although the biological mechanisms that give rise to new domain combinations are largely unknown, several mechanisms have been proposed with anecdotal evidence. Examples of such mechanisms are gene fusion and fission, de novo creation of genes from non-coding elements, and recruitment of mobile genetic elements [45]. Domains are frequently gained by proteins through insertions at the N or C terminus [46, 47]. Repeated domains can also arise through duplication [48]. Novel structures can also arise through circular permutation of existing domains [49].
It has been shown that the domain boundaries in animal genomes, particularly in the extracellular portions of animal membrane proteins, coincide with the boundaries of the exons in which the domains reside [50–53]. The idea is that exon-bordering domains may move in the genome as ‘cassette-exons’. The existence of cassette-exons could be explained as a by-product of exon-shuffling, a process in which new genes evolve by shuffling of existing exons. Exon-shuffling has been put forward as evidence for the ‘intron-early’ theory, which proposes that introns were present in the Last Universal Common Ancestor (LUCA) of all extant organisms, and were later lost in prokaryotes. In contrast, ‘intron-late’ proponents believe that introns were a late innovation in eukaryotes and that prokaryotes never had them. Evidence of exon-shuffling has been found in animals [54, 55], whereas in plants and fungi there is no evidence of exon-shuffling [50, 56].
The present diversity of domain combinations in proteins does not differ significantly from stochastic birth, death and innovation models (BDIMs) [1, 30, 39, 57]. These models predict the presence of an equilibrium state of the domain distribution, which is reached exponentially; the death of a domain must be counteracted by ‘innovation’ or creation of new domains. BDIMs ignore completely the individuality of gene families and the selective forces that make some of them expendable and others indispensable. Despite this obvious over-simplification, BDIMs accurately reproduce the observed family size distributions, suggesting that genome evolution might be largely a stochastic process, which is modulated by natural selection [1, 19].
A QUANTITATIVE MEASUREMENT OF PROMISCUITY
To identify promiscuous domains, one needs to consider several parameters. Some of these parameters are as follows: (a) the other domains that co-occur with a particular domain in one protein sequence, (b) the number of different multidomain architectures in which a domain participates and (c) the abundance of a domain in the genome. Earlier work relied on parameter (a) to find promiscuous domains. These studies used the connectivity of the domain co-occurrence network to find promiscuous domains [21, 44]. Note that by definition promiscuous domains co-occur with more domains, and are therefore highly connected nodes or hubs in the domain co-occurrence network. Studies that relied on connectivity simply identified these highly connected nodes. But relying solely on connectivity is largely misleading, because it is known that many domains, though participating in large multidomain architectures, in fact exist in fewer local contexts [20]. It is, therefore, necessary to consider immediate domain neighbors (domains adjacent to a given domain on a polypeptide sequence) to correctly identify promiscuous domains. In a later study, Tordai and co-workers [20] took this fact into account and identified promiscuous domains by considering ‘domain triplets’, three domains next to each other on a protein sequence. This study identified promiscuous domains as those that participate in many of these triplets. This is akin to using parameter (b). But even this study, which took the local environment into account, largely ignored the abundance of the domain in the genome [20], a very important criterion for determining domain promiscuity correctly. Promiscuity involves duplication and insertion of a given domain in a new location. Thus it is imperative to differentiate domains that are highly abundant in the genome, and participate in a large number of combinations as a result of that abundance, from the truly promiscuous domains.
This is illustrated by the following example. Suppose domain A is present twice in a genome, with domains B and C, in the combinations AB and AC. Now consider another domain P, which is present three times in the genome, twice as PQ and only once as PR, where Q and R are two other domains. A calculation that ignores abundance will assign A and P the same promiscuity. But, in reality, the promiscuity of A should be higher than that of P because, in spite of its lower abundance, domain A participates in the same number of distinct combinations, and thus in more combinations per occurrence.
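The A/P example can be checked with a short calculation. This is a minimal sketch that counts distinct adjacent partners and occurrences for a domain; the ratio of the two is used here only as an illustrative normalisation and is an assumption of this example, not the statistic defined in reference [28]. The function name `partners_and_occurrences` is hypothetical.

```python
def partners_and_occurrences(architectures, domain):
    """Return (number of distinct adjacent partners, number of occurrences)."""
    occurrences = 0
    partners = set()
    for arch in architectures:
        for i, current in enumerate(arch):
            if current != domain:
                continue
            occurrences += 1
            if i > 0:
                partners.add(arch[i - 1])      # left neighbour
            if i + 1 < len(arch):
                partners.add(arch[i + 1])      # right neighbour
    return len(partners), occurrences

# The genome from the example: AB, AC, PQ, PQ, PR.
genome = [("A", "B"), ("A", "C"), ("P", "Q"), ("P", "Q"), ("P", "R")]

a_partners, a_count = partners_and_occurrences(genome, "A")
p_partners, p_count = partners_and_occurrences(genome, "P")

# Ignoring abundance, A and P look identical (two partners each).
print("distinct partners:", a_partners, p_partners)
# Dividing by abundance separates them.
a_ratio = a_partners / a_count
p_ratio = p_partners / p_count
print("partners per occurrence:", round(a_ratio, 2), round(p_ratio, 2))
```

Both domains have two distinct partners, but A reaches that number with two copies and P needs three, so any abundance-aware score should place A above P.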
Recently, we developed a method to objectively measure the mobility/promiscuity of a protein domain [28], taking the abundance of the domain into consideration. The method uses techniques from computational linguistics to measure promiscuity from domain co-occurrence. The technique, called ‘bigram analysis’, is generally used to find words with more semantic importance in a language [58]. It has also been employed to find words that are semantically linked to each other. The idea is to count the number of times a pair of words (a bigram) occurs in a text (corpus). If a pair occurs less frequently than expected from the background distribution, it carries more semantic information than the others. Additionally, this analysis also points out the words that, by nature, tend to participate in many bigrams and are, therefore, promiscuous.
We treated the whole genome sequence as the text (corpus), each protein as a sentence and each domain as a word, and used the same bigram analysis to statistically identify domains that participate in many bigrams and are therefore promiscuous [28]. This method generates a promiscuity value for each domain in the genome. Using this method, we calculated the promiscuity values for each domain in 28 eukaryotic species spanning all the major branches of the eukaryotic tree (see Supplementary Data for details) [28].
It was recently shown that there is a relationship across genomes between the promiscuity of a given domain and its frequency [59]. However, the strength of this relationship differs for different domains. A new index, the ‘domain versatility index’ (DVI), was suggested. DVI was defined as the strength of the relationship between the number of occurrences of a domain (N) and the number of bigrams (NN) in which this domain participates. More precisely, the logarithmic regression of NN over N was calculated, and the linear coefficient was taken as DVI. The authors explored links between the versatility of a domain, when unlinked from abundance, and its biological properties. The results suggested that domains occurring as single-domain proteins and domains appearing frequently at protein termini have a higher DVI. This is consistent with previous observations that the evolution of domain re-arrangements is primarily driven by fusion of pre-existing arrangements and single domains, as well as by loss of domains at protein termini. Contrary to previous studies, versatility was found to be lower in eukaryotes. It was suggested that a random attachment process is sufficient to explain the observed distribution of domain arrangements [59]. There was also a very high correlation (88%) between promiscuity values calculated by DVI and by bigram analysis [59].
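The corpus, sentence and word analogy maps directly onto a counting procedure. The sketch below covers only the first step, assuming a toy proteome and treating bigrams as ordered adjacent pairs (an assumption of this example). It does not reproduce the statistical weighting used in reference [28], and the function names are hypothetical.

```python
from collections import Counter

def domain_bigrams(proteome):
    """Count ordered adjacent domain pairs (bigrams) over all proteins."""
    counts = Counter()
    for protein in proteome:                 # each protein is a 'sentence'
        for left, right in zip(protein, protein[1:]):
            counts[(left, right)] += 1       # each domain is a 'word'
    return counts

def bigram_types_per_domain(bigram_counts):
    """Count the distinct bigram types in which each domain takes part."""
    types = Counter()
    for left, right in bigram_counts:
        types[left] += 1
        if right != left:                    # a self-pair counts once
            types[right] += 1
    return types

# Hypothetical proteome (the 'corpus'); architectures are invented.
corpus = [
    ["PH", "SH3", "S_TKc"],
    ["PH", "SH3"],
    ["SH3", "PDZ"],
    ["PDZ", "PDZ", "SH3"],
]

counts = domain_bigrams(corpus)
types = bigram_types_per_domain(counts)
print(counts.most_common())
print(types.most_common())
```

The raw number of bigram types still depends on how often a domain occurs, which is why a statistical comparison against the background distribution is needed before calling a domain promiscuous.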
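The regression behind an index of this kind can be illustrated as follows. This sketch assumes one possible reading of the description above: for a single domain, log(NN) is regressed on log(N) across genomes and the slope is taken as the index. The exact formulation in reference [59] may differ, and the per-genome numbers and the function name `log_log_slope` are invented for the example.

```python
import math

def log_log_slope(occurrences, bigram_types):
    """Least-squares slope of log(NN) against log(N) across genomes."""
    xs = [math.log(n) for n in occurrences]
    ys = [math.log(nn) for nn in bigram_types]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical values for one domain in four genomes:
# N  = number of occurrences of the domain in each genome
# NN = number of distinct bigrams the domain participates in
n_per_genome = [4, 16, 64, 256]
nn_per_genome = [2, 4, 8, 16]

slope = log_log_slope(n_per_genome, nn_per_genome)
print(f"slope of log(NN) on log(N): {slope:.2f}")
```

A steeper slope means the domain gains new partners quickly as its copy number grows, which is the abundance-independent property such an index is meant to capture.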
FUNCTIONAL SIGNIFICANCE OF PROMISCUOUS DOMAINS
The lists of the identified promiscuous domains differ according to the identification methods. However, regardless of identification method it is apparent that the majority of promiscuous domains are involved in signaling [20, 21, 28, 44]. Some domains like PH, SH3, EGF and PDZ are present in the top promiscuous domains in all of these studies. All these domains are involved in cellular signaling one way or another.
According to their functions, promiscuous domains can be classified predominantly into five categories: (a) transcription; (b) signal transduction; (c) extracellular structures/cell–cell signaling; (d) post-translational modification/chaperones/protein turnover and (e) cytoskeleton [28]. Among these categories, signal transduction and extracellular structures/cell–cell signaling are the most frequent. If we calculate the number of promiscuous domains in these five categories in all the major branches of eukaryotes (Figure 3), we find that, except for the category of post-translational modification/chaperones/protein turnover (Figure 3B), the other four categories increase non-linearly with the number of domains in the genome (Figure 3A and C–E). The linear increase in Figure 3B is largely due to the fact that the post-translational modification category includes ubiquitination-related domains [28]. It has been recently shown that these domains are predominantly promiscuous in all branches of eukaryotes [28], and therefore show a uniform increase in promiscuity throughout the eukaryotes. In the other categories (Figure 3A and C–E), there is an initial lag period of low promiscuity, followed by an exponential increase in promiscuity. This entry into the exponential phase of higher promiscuity coincides with the appearance of specific clades. In the case of extracellular structures/cell–cell signaling (Figure 3D), the entry into exponential promiscuity coincides with the appearance of multicellularity. In the other categories (Figure 3A, C and E), the entry into the exponential phase coincides with the appearance of animals. As described in the previous study [28], promiscuity appears to be a feature that has a strong functional component and might be largely dictated by the functional requirements of an organism.
Figure 3:
Increase in promiscuous domains in 28 eukaryotic organisms (see [28] for a detailed list of the organisms). The organisms are sorted by increasing number of domain types in the genome and plotted on the _X_-axis. The number of promiscuous domains belonging to each of the five major categories in each organism is plotted on the _Y_-axis. Each plot represents one category; the category is given at the top of each plot. The goodness-of-fit measures for both linear and non-linear fits are also given at the top of each plot.
ROLE OF PROMISCUOUS DOMAINS IN EVOLUTION
If we observe the distribution of promiscuous domains in the three major branches of eukaryotes (animals, plants and fungi), we find that there is a small set of core domains that are promiscuous in all three branches. These core domains are largely involved in biological features that are fundamental to eukaryotic cells, such as chromatin remodeling (PHD, SET, BROMO, CHROMO, BRCT and in part AAA+ ATPase) and ubiquitin signaling (RING, UBQ, UCH and UBA) [28]. Moreover, most of these core promiscuous domains are involved in signaling processes in cells (Table 1) [28]. Additionally, there are domains that are promiscuous in specific lineages. Domains that are required for specific biological functions in a specific lineage tend to become more promiscuous. Prominent examples are EGF, a domain involved in various forms of extracellular signaling, which is promiscuous in animals, and fCBD, a domain involved in cellulose binding, which is promiscuous in fungi [28].
Promiscuity values of protein domains can be used as an evolutionary character in eukaryotes. Using parsimony, we reconstructed the evolutionary scenario of promiscuity in the major eukaryotic lineages [28]. We found that promiscuity is a volatile character in evolution. Some evolutionarily conserved combinations of domains act as a reservoir from which new lineage-specific domain combinations are created [28]. Overall, very few domains have retained their promiscuity status during evolution. Using the unikont–bikont tree topology [60], we found that two domains, AAA+ ATPase and BROMO, were likely to be promiscuous in the last common ancestor of all the analyzed eukaryotic species (LECA; Figure 4). The major gain of promiscuity happened at the base of animals, where 22 domains became promiscuous. In general, there is a tendency towards increasing promiscuity during eukaryotic evolution [28]. Domain promiscuity can also be used as a genome-level feature to reconstruct phylogenetic trees at the genome scale. A phylogenetic tree constructed using promiscuity bears a strong resemblance to the existing phylogenetic trees, with minor differences [28].
Figure 4:
Ancestral reconstruction of domain promiscuity in 28 eukaryotes. The tree topology is from the unikont–bikont tree [60], and the ancestral reconstruction was created using parsimony with a binary character of promiscuity for each domain. Each node is marked with a pie diagram showing gain of promiscuity in black and loss of promiscuity in white; gain and loss are relative to the parent node. Each pie diagram shows the fraction of domains that gained or lost promiscuity status. Additionally, each branch is colored according to the overall gain or loss in that branch: thick black lines indicate branches that gained promiscuous domains, and thick grey lines indicate branches that lost promiscuous domains.
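To make the idea of a parsimony reconstruction with a binary promiscuity character concrete, the following is a minimal sketch using Fitch parsimony on a rooted, strictly binary tree. Fitch parsimony is chosen here only as a common, simple variant; the text does not state which parsimony algorithm or software was used. The tree, the species labels, the state assignments and the function names (`reconstruct`, `_bottom_up`, `_top_down`) are hypothetical, and ambiguous root states are arbitrarily resolved to 0 (not promiscuous).

```python
from typing import Dict, Tuple, Union

# A tree is either a leaf name or a pair of subtrees (rooted, strictly binary).
Tree = Union[str, Tuple["Tree", "Tree"]]


def _bottom_up(node, leaf_states):
    """Fitch pass 1: compute the candidate state set of each node, leaves first."""
    if isinstance(node, str):
        return ({leaf_states[node]}, ())
    left, right = (_bottom_up(child, leaf_states) for child in node)
    shared = left[0] & right[0]
    # Intersection if the children agree on something, otherwise the union.
    return (shared if shared else left[0] | right[0], (left, right))


def _top_down(annotated, parent_state, counts):
    """Fitch pass 2: fix one state per node and tally changes along branches."""
    candidates, children = annotated
    if parent_state in candidates:
        state = parent_state          # no change needed on this branch
    else:
        state = min(candidates)       # assumption: ties at the root resolve to 0
    if parent_state is not None and state != parent_state:
        counts["gains" if state == 1 else "losses"] += 1
    for child in children:
        _top_down(child, state, counts)
    return state


def reconstruct(tree: Tree, leaf_states: Dict[str, int]) -> Dict[str, int]:
    """Return the inferred root state and the number of gains (0->1) and losses (1->0)."""
    counts = {"gains": 0, "losses": 0}
    root_state = _top_down(_bottom_up(tree, leaf_states), None, counts)
    return {"root": root_state, **counts}


# Hypothetical example: one domain, promiscuous (1) in two animals only.
toy_tree = ((("human", "fly"), "yeast"), "plant")
toy_states = {"human": 1, "fly": 1, "yeast": 0, "plant": 0}
print(reconstruct(toy_tree, toy_states))  # {'root': 0, 'gains': 1, 'losses': 0}
```

Running such a reconstruction once per domain and summing the gains and losses on each branch would give per-node fractions of the kind summarized by the pie diagrams, under the assumptions stated above.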
CONCLUSIONS
Domain combinations in protein sequences are important biological and evolutionary features. We have only very recently begun to understand the evolution of protein domain architecture. Despite the evidence of domain gain and loss in various organisms, the mechanisms through which these dynamics are achieved are largely unknown. Analysis of promiscuous/mobile domains might elucidate the biological mechanisms by which domains are gained in proteins.
Several genetic mechanisms can create new domain combinations: genetic recombination, exon shuffling, involvement of transposable elements, etc. However, we have little direct evidence for the involvement of any of these mechanisms, and their relative contributions are unknown. The probability of joining one given domain type to another largely depends on the probability of a genetic change leading to the new combination and the probability of fixation of that combination [20]. A minor but significant portion (up to 12%, depending on the methods used) [26, 27] of domain combinations in the genome has been shown to be created through convergent evolution, which suggests that selection does play a role in shaping domain combinations. Moreover, we now have moderately good evidence of the functional role of new domain combinations in a lineage-specific manner, and therefore it is not unreasonable to conclude that newly gained domains are fixed through natural selection. More studies are needed before any comprehensive theory of protein domain combination can be reached. Two independent studies, one from our group [28] and one from Weiner and co-workers [59], both taking domain abundance into consideration, arrived at similar lists of promiscuous domains. This suggests that the identification of promiscuous domains is reliable. However, contradictions remain. For example, Weiner and co-workers found that, contrary to previously reported findings, versatility is lower in eukaryotes. The difference is small, but statistically significant [59].
The identification of promiscuous domains has practical applications for comparative and evolutionary genomics. In particular, the presence of these domains may be taken into account in sequence comparisons aimed at identifying clusters of orthologous genes, in order to avoid errors in ortholog assignment. For example, the sequences of these domains can be masked. By introducing objective, quantitative measures of domain promiscuity, a rational basis for such a filtering procedure can be designed.
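The masking step mentioned above can be illustrated with a short sketch. It assumes that domain hits are already available as (domain name, start, end) tuples with 1-based inclusive coordinates, and that a set of promiscuous domain names has been chosen by some promiscuity cutoff; the function name `mask_promiscuous`, the toy sequence, the coordinates and the domain list are all hypothetical and are not taken from any published pipeline.

```python
from typing import Iterable, Set, Tuple


def mask_promiscuous(
    sequence: str,
    domain_hits: Iterable[Tuple[str, int, int]],
    promiscuous: Set[str],
    mask_char: str = "X",
) -> str:
    """Replace residues covered by promiscuous domains with mask_char.

    domain_hits holds (domain_name, start, end) with 1-based inclusive
    coordinates; hits to domains not in `promiscuous` are left untouched.
    """
    residues = list(sequence)
    for name, start, end in domain_hits:
        if name not in promiscuous:
            continue
        # Clamp to the sequence so out-of-range coordinates cannot raise.
        for i in range(max(start, 1) - 1, min(end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)


# Hypothetical protein with one promiscuous and one non-promiscuous domain hit.
toy_sequence = "MKTAYIAKQRQISFVK"
toy_hits = [("PHD", 3, 6), ("Kinase", 10, 14)]
masked = mask_promiscuous(toy_sequence, toy_hits, promiscuous={"PHD"})
print(masked)  # MKXXXXAKQRQISFVK
```

The masked sequences could then be passed to whatever similarity search is used for ortholog clustering, so that matches driven only by a shared promiscuous domain contribute less to the assignment.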
- Protein domain promiscuity is a volatile feature in evolution and plays specific functional roles in different phylogenetic lineages.
- Promiscuous domains are, typically, involved in protein–protein interactions and play crucial roles in interaction networks, particularly those that contribute to signal transduction.
- The genetic mechanisms shaping domain promiscuity are largely unknown, but we have strong evidence of natural selection shaping promiscuity.
Acknowledgements
We thank Kasturi Mitra, Dr Susan Gentleman and Charlie Shenitz for carefully reading and proof-reading the manuscript. This work was supported in part by the Intramural Research Program of the National Institutes of Health/DHHS.
SUPPLEMENTARY DATA
Lists of promiscuous domains in major eukaryotic lineages and other information are available at http://www.ncbi.nlm.nih.gov/CBBresearch/Koonin/resources/malay/bib2008/.
References
1. The structure of the protein universe and genome evolution, Nature, 2002, vol. 420 (pg. 218-23)
2. The multiplicity of domains in proteins, Ann Rev Biochem, 1995, vol. 64 (pg. 287-314)
3. et al. The Pfam protein families database, Nucleic Acids Res, 2008, vol. 36 (pg. D281-8)
4. et al. SMART, a simple modular architecture research tool: identification of signaling domains, Proc Natl Acad Sci USA, 1998, vol. 95 (pg. 5857-64)
5. et al. CDD: a conserved domain database for interactive domain family analysis, Nucleic Acids Res, 2007, vol. 35 (pg. D237-40)
6. et al. New developments in the InterPro database, Nucleic Acids Res, 2007, vol. 35 (pg. D224-8)
7. et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures, J Mol Biol, 1995, vol. 247 (pg. 536-40)
8. et al. ProDom: automated clustering of homologous domains, Brief Bioinform, 2002, vol. 3 (pg. 246-51)
9. Dictionary of recurrent domains in protein structures, Proteins, 1998, vol. 33 (pg. 88-96)
10. et al. CATH–a hierarchic classification of protein domain structures, Structure, 1997, vol. 5 (pg. 1093-108)
11. Domain size distributions can predict domain boundaries, Bioinformatics, 2000, vol. 16 (pg. 613-8)
12. Modular genes with metazoan-specific domains have increased tissue specificity, Trends Genet, 2005, vol. 21 (pg. 210-13)
13. Compactness of human housekeeping genes: selection for economy or genomic design?, Trends Genet, 2004, vol. 20 (pg. 248-53)
14. Domain combinations in archaeal, eubacterial and eukaryotic proteomes, J Mol Biol, 2001, vol. 310 (pg. 311-25)
15. CHOP proteins into structural domain-like fragments, Proteins, 2004, vol. 55 (pg. 678-88)
16. Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins, Protein Sci, 1998, vol. 7 (pg. 445-56)
17. et al. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions, J Mol Biol, 2005, vol. 348 (pg. 231-43)
18. et al. Distribution of protein folds in the three superkingdoms of life, Genome Res, 1999, vol. 9 (pg. 17-26)
19. Biological applications of the theory of birth-and-death processes, Brief Bioinform, 2006, vol. 7 (pg. 70-85)
20. et al. Modules, multidomain proteins and organismic complexity, FEBS J, 2005, vol. 272 (pg. 5064-78)
21. Scale-free behavior in protein domain networks, Mol Biol Evol, 2001, vol. 18 (pg. 1694-1702)
22. The impact of comparative genomics on our understanding of evolution, Cell, 2000, vol. 101 (pg. 573-6)
23. The G-value paradox, Evol Dev, 2002, vol. 4 (pg. 73-5)
24. et al. Detecting protein function and protein–protein interactions from genome sequences, Science, 1999, vol. 285 (pg. 751-3)
25. The geometry of domain combination in proteins, J Mol Biol, 2002, vol. 315 (pg. 927-39)
26. Convergent evolution of domain architectures (is rare), Bioinformatics, 2005, vol. 21 (pg. 1464-71)
27. et al. Domain tree-based analysis of protein architecture evolution, Mol Biol Evol, 2008, vol. 25 (pg. 254-64)
28. et al. Evolution of protein domain promiscuity in eukaryotes, Genome Res, 2008, vol. 18 (pg. 449-61)
29. Computational and Statistical Approaches to Genomics, 2002, Kluwer, Boston
30. et al. Birth and death of protein domains: a simple model of evolution explains power law behavior, BMC Evol Biol, 2002, vol. 2 (pg. 18)
31. Emergence of scaling in random networks, Science, 1999, vol. 286 (pg. 509-512)
32. Topological properties of citation and metabolic networks, Phys Rev E Stat Nonlin Soft Matter Phys, 2001, vol. 64 (pg. 036106)
33. Linked: The New Science of Networks, 2002, Perseus Publishing
34. Statistical mechanics of complex networks, Rev Mod Phys, 2002, vol. 74 (pg. 47-97)
35. Scale invariance in biology: coincidence or footprint of a universal mechanism?, Biol Rev Camb Philos Soc, 2001, vol. 76 (pg. 161-209)
36. et al. The large-scale organization of metabolic networks, Nature, 2000, vol. 407 (pg. 651-4)
37. The frequency distribution of gene family sizes in complete genomes, Mol Biol Evol, 1998, vol. 15 (pg. 583-9)
38. et al. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties, Genome Biol, 2002, vol. 3, RESEARCH0040
39. Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model, J Mol Biol, 2001, vol. 313 (pg. 673-681)
40. Human Behaviour and the Principle of Least Effort, 1949, Addison-Wesley, Boston
41. Cours d’Economie Politique, 1897, Rouge et Cie, Paris
42. et al. Lethality and centrality in protein networks, Nature, 2001, vol. 411 (pg. 41-2)
43. Evolutionary cores of domain co-occurrence networks, BMC Evol Biol, 2005, vol. 5 (pg. 24)
44. Comparative analysis of protein domain organization, Genome Res, 2004, vol. 14 (pg. 343-53)
45. et al. The origin of new genes: glimpses from the young and old, Nat Rev Genet, 2003, vol. 4 (pg. 865-75)
46. et al. Domain rearrangements in protein evolution, J Mol Biol, 2005, vol. 353 (pg. 911-23)
47. Domain deletions and substitutions in the modular protein evolution, FEBS J, 2006, vol. 273 (pg. 2037-47)
48. Expansion of protein domain repeats, PLoS Comput Biol, 2006, vol. 2 (pg. e114)
49. Evolution of circular permutations in multidomain proteins, Mol Biol Evol, 2006, vol. 23 (pg. 734-43)
50. Genome evolution and the evolution of exon-shuffling – a review, Gene, 1999, vol. 238 (pg. 103-14)
51. Modular assembly of genes and the evolution of new functions, Genetica, 2003, vol. 118 (pg. 217-31)
52. et al. Significant expansion of exon-bordering protein domains during animal proteome evolution, Nucleic Acids Res, 2005, vol. 33 (pg. 95-105)
53. et al. Analysis of evolution of exon–intron structure of eukaryotic genes, Brief Bioinform, 2005, vol. 6 (pg. 118-34)
54. The exon theory of genes, Cold Spring Harb Symp Quant Biol, 1987, vol. 52 (pg. 901-5)
55. On the ancient nature of introns, Gene, 1993, vol. 135 (pg. 137-44)
56. Exons – original building blocks of proteins?, Bioessays, 1991, vol. 13 (pg. 187-92)
57. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome, Bioinformatics, 2001, vol. 17 (pg. 988-96)
58. Foundations of Statistical Natural Language Processing, 1999, MIT Press, Cambridge, MA
59. Just how versatile are domains?, BMC Evol Biol, 2008, vol. 8 (pg. 285)
60. The root of the eukaryote tree pinpointed, Curr Biol, 2003, vol. 13 (pg. R665-66)
Published by Oxford University Press 2009.