Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs

Comparative Study

Genome Res. 2004 Jun;14(6):1107-18.

doi: 10.1101/gr.1774904.

Haiyuan Yu et al. Genome Res. 2004 Jun.

Abstract

Proteins function mainly through interactions, especially with DNA and other proteins. While some large-scale interaction networks are now available for a number of model organisms, their experimental generation remains difficult. Consequently, interolog mapping (the transfer of interaction annotation from one organism to another using comparative genomics) is of significant value. Here we quantitatively assess the degree to which interologs can be reliably transferred between species as a function of the sequence similarity of the corresponding interacting proteins. Using interaction information from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Helicobacter pylori, we find that protein-protein interactions can be transferred when a pair of proteins has a joint sequence identity >80% or a joint E-value <10^-70. (These "joint" quantities are the geometric means of the identities or E-values for the two pairs of interacting proteins.) We generalize our interolog analysis to protein-DNA binding, finding such interactions are conserved at specific thresholds between 30% and 60% sequence identity depending on the protein family. Furthermore, we introduce the concept of a "regulog": a conserved regulatory relationship between proteins across different species. We map interologs and regulogs from yeast to a number of genomes with limited experimental annotation (e.g., Arabidopsis thaliana) and make these available through an online database at http://interolog.gersteinlab.org. Specifically, we are able to transfer approximately 90,000 potential protein-protein interactions to the worm. We test a number of these in two-hybrid experiments and are able to verify 45 overlaps, which we show to be statistically significant.
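
The "joint" quantities defined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the log-space calculation, and the handling of an E-value reported as 0 are assumptions made for illustration. Only the geometric-mean definition and the two thresholds (joint identity >80%, joint E-value <10^-70) come from the abstract.

```python
import math


def joint_identity(identity_a: float, identity_b: float) -> float:
    """Geometric mean of two percent identities (A vs. A' and B vs. B')."""
    return math.sqrt(identity_a * identity_b)


def joint_e_value(e_a: float, e_b: float) -> float:
    """Geometric mean of two E-values, computed in log space.

    Assumption: an E-value reported as 0 (below reporting precision)
    makes the joint E-value 0 as well.
    """
    if e_a <= 0 or e_b <= 0:
        return 0.0
    return 10 ** ((math.log10(e_a) + math.log10(e_b)) / 2)


def is_transferable(identity_a, identity_b, e_a, e_b,
                    min_joint_identity=80.0, max_joint_e=1e-70):
    """Apply the abstract's thresholds: joint identity >80% OR joint E-value <10^-70."""
    return (joint_identity(identity_a, identity_b) > min_joint_identity
            or joint_e_value(e_a, e_b) < max_joint_e)
```

For example, two homolog pairs with 90% and 40% identity have a joint identity of 60%, which falls below the 80% threshold even though one of the pairs is highly similar.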

Copyright 2004 Cold Spring Harbor Laboratory Press

Figures

Figure 1

Schematic illustration of protein–protein interologs and the mapping methods. (A) Original interolog mapping. Theoretically, A-A′ and B-B′ should be orthologs between the two organisms. Operationally, only best-matching homologs are required. (B) Generalized interolog mapping. Proteins A1′, A2′, A3′, and A4′ in the target organism are all homologs of protein A in the source organism. These proteins form the A′ family. Likewise, protein B's homologs (B1′, B2′, B3′) form the B′ family in the target organism. If we know that protein A interacts with B, we can predict that the A′ family and the B′ family are interacting families. All possible pairs between these two families are considered as the generalized interologs (shown as black, dashed lines with arrows). (C) Comparison with the gold standards. After the interactions in the source organism are mapped onto the target organism, the predictions (i.e., generalized interologs) are compared with the gold standard positives and negatives. True positives are the predictions that overlap with the gold standard positives. False positives are those that overlap with the gold standard negatives. (D) Schematic illustration of protein–DNA interologs and regulogs. In the source organism, TF A binds to its binding site (SA) and regulates the downstream gene B. To perform the regulog mapping, TF A′ in the target organism needs to be the ortholog of A. Proteins B and B′ should also be orthologs. The DNA sequence upstream of gene B′ needs to contain the same motif (SA′) as SA. However, practically TF A and A′ only need to share ≥30% identity. The interaction between TF A′ and SA′ is the protein–DNA interolog of that between A and SA. The regulatory relationships between A → B and A′ → B′ are regulogs.
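
Panels (B) and (C) of this caption describe a procedure that can be sketched in a few lines of Python. This is an illustrative sketch, not the published implementation: the function names, the dictionary that maps each source protein to its target-organism homologs, and the treatment of interactions as unordered pairs are all assumptions.

```python
from itertools import product


def _pair(x, y):
    """Represent an interaction as an unordered pair (assumption: undirected)."""
    return tuple(sorted((x, y)))


def generalized_interologs(source_interactions, homologs):
    """Predict all pairs between the A' family and the B' family.

    source_interactions: iterable of (A, B) pairs known to interact in the source organism.
    homologs: dict mapping a source protein to its homologs in the target organism.
    """
    predictions = set()
    for a, b in source_interactions:
        # Every member of the A' family is paired with every member of the B' family.
        for a_prime, b_prime in product(homologs.get(a, ()), homologs.get(b, ())):
            predictions.add(_pair(a_prime, b_prime))
    return predictions


def compare_with_gold_standard(predictions, gold_positives, gold_negatives):
    """Return (true positives, false positives) as overlap counts with the gold standards."""
    positives = {_pair(x, y) for x, y in gold_positives}
    negatives = {_pair(x, y) for x, y in gold_negatives}
    return len(predictions & positives), len(predictions & negatives)
```

With the families drawn in panel (B), four homologs of A and three homologs of B, a single source interaction A-B yields 4 × 3 = 12 generalized interologs.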

Figure 2

Conservation of protein–protein interactions between homologous protein pairs. (A,B) Relationships between V and J_I. (C,D) Relationships between V and J_E. (E,F) Relationships between L and J_E. (A,C,E) Calculated based on the results from worm-yeast mapping. (B,D,F) The weighted average obtained when the interactions in all four organisms (i.e., S. cerevisiae, C. elegans, D. melanogaster, and H. pylori) were mapped onto yeast. (A) Low: J_I ≤ 10%; Medium: 20% ≤ J_I ≤ 30%; High: J_I ≥ 40%. (C,D) Low: 10^-40 ≤ J_E ≤ 10^-10; Medium: 10^-100 ≤ J_E ≤ 10^-50; High: J_E ≤ 10^-110. Error bars represent 95% CI calculated by a resampling algorithm (see Supplemental material).

Figure 3

Distribution of the number of generalized interologs as a function of joint E-value (J_E). The dashed line represents the number of all predictions above a given J_E, that is, G(J). The solid line represents the number of true positives above a given J_E, that is, TP.

Figure 4

Comparison of generalized interolog mapping with PIE. In this figure, the plot (TP/|P′| versus TP/FP) is analogous to an ROC plot (TP/P vs. FP/N). Based on this curve, the performance of our method is comparable to that of the large-scale experimental data sets.

Figure 5

Conservation of protein–DNA interactions between homologous TFs. The conservation is measured as the relationships between V and I. The legend appears as an inset on the graph. The red, bold curve was calculated for all TFs in the source data sets (see Supplementary material). Error bars represent 95% CI calculated by the resampling algorithm.

Figure 6

Percentage of the overlaps between the predictions and different groups. (All) All experimentally determined interaction pairs; (Proteasome) interaction pairs involved in the 26S proteasome; (DDR) interaction pairs involved in DNA-damage repair; (Vulval-dev) interaction pairs involved in vulval development; (Others) interaction pairs involved in germ line, meiosis, metazoan, mitotic machinery, dauer formation, Chromosome III, chromatin remodeling, pharynx, and immunity. The P-values measuring the statistical significance of the overlaps between different groups and the predictions are given on top of each bar, which are calculated using the hypergeometric models (see Supplementary material).
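
The exact hypergeometric model is given only in the Supplementary material, so the following Python sketch shows a generic one-sided hypergeometric test rather than the authors' calculation. The function name and all four parameters are hypothetical: `population` is the number of candidate pairs, `successes` the number of those that are predictions, `draws` the number of experimentally determined pairs in a group, and `observed` the overlap between the two.

```python
from math import comb


def hypergeom_p_value(population: int, successes: int, draws: int, observed: int) -> float:
    """P(X >= observed) when drawing `draws` items without replacement
    from `population` items, of which `successes` are predictions."""
    upper = min(successes, draws)
    total = comb(population, draws)
    # Sum the upper tail of the hypergeometric distribution using exact integers.
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total
```

A small P-value from such a test indicates that an overlap of the observed size would be unlikely if the predictions were unrelated to the experimentally determined pairs.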

Figure 7

Screenshot of the interolog/regulog database.

WEB SITE REFERENCE

    1. http://interolog.gersteinlab.org
