More than 1,001 problems with protein domain databases: transmembrane regions, signal peptides and the issue of sequence homology

Wing-Cheong Wong et al. PLoS Comput Biol. 2010.

Abstract

Large-scale genome sequencing gained general importance for life science because functional annotation of otherwise experimentally uncharacterized sequences is made possible by the theory of biomolecular sequence homology. Historically, the paradigm of similarity of protein sequences implying common structure, function and ancestry was generalized based on studies of globular domains. Having the same fold imposes strict conditions over the packing in the hydrophobic core requiring similarity of hydrophobic patterns. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended for them. This appears especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs) where sequence similarity is necessarily a consequence of physical requirements rather than common ancestry. Thus, matching of SPs/TMs creates the illusion of matching hydrophobic cores. Therefore, inclusion of SPs/TMs into domain models can give rise to wrong annotations. More than 1001 domains among the 10,340 models of Pfam release 23 and 18 domains of SMART version 6 (out of 809) contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits limited to solely the SP/TM part among clearly unrelated proteins. More worryingly, we show explicit examples that the scores of clearly false-positive hits, even in global-mode searches, can be elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74 using conservative criteria, we find that at least between 2.1% and 13.6% of its annotated Pfam hits appear unjustified for a set of validated domain models. Thus, false-positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors where the hit has nothing in common with the problematic domain model except the SP/TM region itself. 
We suggest a workflow of flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users.
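The flagging idea can be illustrated with a minimal sketch. The abstract does not spell out the workflow, so everything here is an assumption made for illustration: the `Hit` record, the model names, the coordinates and the 50% coverage threshold are all hypothetical. The sketch marks any hit from an SP/TM-containing model for review and separates out hits whose aligned region lies mostly inside predicted SP/TM segments of the query.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    model: str   # domain model identifier (hypothetical names below)
    start: int   # first aligned query residue (1-based, inclusive)
    end: int     # last aligned query residue (inclusive)

def covered_fraction(hit, segments):
    """Fraction of the hit's residues lying inside predicted SP/TM segments."""
    covered = set()
    for seg_start, seg_end in segments:
        lo, hi = max(hit.start, seg_start), min(hit.end, seg_end)
        covered.update(range(lo, hi + 1))  # empty when there is no overlap
    return len(covered) / (hit.end - hit.start + 1)

def flag_hit(hit, sp_tm_models, segments, min_fraction=0.5):
    """Label a hit for reconsideration; min_fraction is an assumed threshold."""
    if hit.model not in sp_tm_models:
        return "ok"
    if covered_fraction(hit, segments) >= min_fraction:
        return "review: hit confined mostly to SP/TM"
    return "review: model contains SP/TM"

# Hypothetical inputs: one SP/TM-containing model and one predicted TM segment.
sp_tm_models = {"model_with_tm"}
query_segments = [(12, 34)]

short_hit = flag_hit(Hit("model_with_tm", 10, 40), sp_tm_models, query_segments)
long_hit = flag_hit(Hit("model_with_tm", 10, 200), sp_tm_models, query_segments)
other_hit = flag_hit(Hit("globular_model", 10, 40), sp_tm_models, query_segments)
```

In this toy input the short hit overlaps the TM segment over most of its length and gets the stronger flag, while the long hit is flagged only because its model contains an SP/TM region.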

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Cumulative plots of SMART version 6 and Pfam release 23 problematic domains.

In SMART version 6, the total number of domains with predicted SP/TM segments peaks at 18, which makes up 2.2% of the 809 SMART domains (top panel). Red triangles mark the time points for the years 1998, 2002 and 2009, when the total number of domain models was 86, 600 and 809 respectively. In Pfam, the total number of problematic domains peaks at 1214, which makes up 11.8% of the 10340 Pfam domains (bottom panel). Likewise, red triangles mark the years 1999, 2002 and 2008, with 1465, 3360 and 10340 Pfam entries respectively.

Figure 2. Histograms of average log probability per predicted transmembrane helix and per predicted signal peptide in Pfam release 23.

The top part shows the histogram of average log probability per predicted transmembrane helix; the bottom part shows the same per predicted signal peptide. The log probability on the x-axis is calculated with equations 5 and 6. At the TM cutoff of ≥−12 (false-positive rate 4.67%) and the SP cutoff of ≥−1 (false-positive rate 4.02%), the numbers of predicted TM helices and signal peptides are 3849 and 164 respectively.

Figure 3. Average log probability plot of transmembrane helix and signal peptide predictions per domain.

The top part shows the average log probability per predicted transmembrane helix calculated per domain; the bottom part shows the same per predicted signal peptide. The y-axis shows the log probability in accordance with equation 6 applied over all predicted segments for a given domain, and the x-axis represents their cumulative length. At the TM cutoff of ≥−12 and the SP cutoff of ≥−1 (horizontal dashed lines), the numbers of problematic TM and SP domains are 1079 and 164 respectively. The total number of problematic domains is 1214 (1050 TM only, 135 SP only and 29 with both TM and SP).
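As a minimal sketch of how the two cutoffs partition the domain models, the snippet below classifies domains by their average log probabilities and then checks the arithmetic behind the totals quoted in the caption. The per-domain scores and domain names are made up for illustration, and the scoring itself (equations 5 and 6 of the paper) is not reproduced here.

```python
from collections import Counter

TM_CUTOFF = -12.0  # caption: TM cutoff of >= -12
SP_CUTOFF = -1.0   # caption: SP cutoff of >= -1

def classify_domain(tm_score, sp_score):
    """Return 'TM', 'SP', 'TM+SP' or 'none' for one domain model.

    A score of None means no segment of that type was predicted.
    """
    has_tm = tm_score is not None and tm_score >= TM_CUTOFF
    has_sp = sp_score is not None and sp_score >= SP_CUTOFF
    if has_tm and has_sp:
        return "TM+SP"
    if has_tm:
        return "TM"
    if has_sp:
        return "SP"
    return "none"

# Hypothetical (TM score, SP score) pairs; names and values are invented.
toy_scores = {
    "domA": (-8.5, None),
    "domB": (None, -0.4),
    "domC": (-11.0, -0.9),
    "domD": (-15.2, -3.0),
}
counts = Counter(classify_domain(tm, sp) for tm, sp in toy_scores.values())

# Consistency check of the caption's totals.
tm_only, sp_only, both = 1050, 135, 29
tm_domains = tm_only + both          # domains above the TM cutoff
sp_domains = sp_only + both          # domains above the SP cutoff
total = tm_only + sp_only + both     # each domain counted once
```

The 29 domains with both kinds of segment are counted in the TM total and in the SP total, which is why 1079 and 164 add up to more than 1214.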

Figure 4. Examples of domain architectures of false-positive HMM hits caused by TM helices in the fragment-mode search.

We show illustrative examples for six Pfam release 23 models: Herpes_glycop_D (PF01537.9), CDC50 (PF03381.7), Cation_ATPase_N (PF00690.18), GSPII_F (PF00482.11), PAP2 (PF01569.13) and HCV_NS4b (PF01001.11). The black boxes denote the problematic domain annotations in the respective sequences. Additional material such as hmmpfam outputs and alignments is available at the associated BII WWW site for this work. Domain architecture illustrations were created with DOG 1.5.

Figure 5. Examples of domain architectures of false-positive HMM hits caused by TM helices/signal peptides in the global-mode search.

Findings for nine Pfam release 23 models are shown: Pig-P (PF08510.4), PAP2 (PF01569.13), EMP24_GP25L (PF01105.15), PTPLA (PF04387.6), Lamp (PF01299.9), MttA_Hcf106 (PF02416.8), HAMP (PF00672.17), Nodulin_late (PF07127.3) and GRP (PF07172.3). The black boxes denote the problematic domain annotations in the respective sequences. Additional material such as hmmpfam outputs and alignments is available at the associated BII WWW site for this work. Domain architecture illustrations were created with DOG 1.5.

Figure 6. Relationship between the gathering score and the corresponding E-value threshold for Pfam domain library release 23.

The y-axis shows the gathering score threshold (GA) for the global-mode search, and the x-axis shows the corresponding E-value threshold (in decimal log scale), calculated for this score with the domain-specific extreme-value function using the parameters provided in the corresponding HMM file (for an NR database size of 7365651 sequences). The upper plot represents the distribution for the 9126 domains without a detected SP/TM region; the middle plot shows the same for the 1214 domains with SP/TM problems. Effectively, there is no clear correlation between gathering score and E-value threshold. If E-values close to 0.1 are considered significant, all dots should be close to the “−1” line (horizontal dashed lines) in this graph and, indeed, there is some agglomeration of data points in that area; yet, there are numerous outliers. Note that the E-values are computed using the equation

$$E = N\left[1 - \exp\left(-e^{-\lambda(\mathrm{GA} - \mu)}\right)\right]$$

where $N$ is the database size, and $\lambda$ and $\mu$ are the extreme value distribution (EVD) parameters of the domain model. The bottom plot depicts the histogram of the 10340 domains in Pfam release 23. The median of all log E-values corresponding to the domain-specific GAs is −1.16, which translates to an E-value of 0.07.
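The E-value conversion described in this caption can be sketched as follows. The sketch assumes the Gumbel-type extreme-value tail commonly used for HMMER 2 scores, E = N·(1 − exp(−exp(−λ(score − μ)))). That functional form, the function name `evd_evalue` and the example parameter values are assumptions for illustration and are not taken from the page. Only the database size and the median log E-value come from the caption.

```python
import math

def evd_evalue(score, mu, lam, db_size):
    """E-value of a score under an assumed Gumbel-type EVD.

    E = N * (1 - exp(-exp(-lam * (score - mu))))
    """
    # -expm1(-t) equals 1 - exp(-t) and stays accurate for very small t.
    tail_probability = -math.expm1(-math.exp(-lam * (score - mu)))
    return db_size * tail_probability

# Median log10 E-value from the caption, converted back to an E-value.
median_evalue = 10 ** -1.16

# Hypothetical EVD parameters and gathering score, for illustration only.
NR_SIZE = 7365651  # database size quoted in the caption
example_evalue = evd_evalue(score=25.0, mu=-10.0, lam=0.7, db_size=NR_SIZE)
log10_example = math.log10(example_evalue)
```

Because the tail probability falls roughly exponentially with the score, two models with the same gathering score but different μ and λ can land on very different E-value thresholds, which is the lack of correlation the caption points out.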

Figure 7. Histograms of average log probability per predicted transmembrane helix for SCOP v1.75 α-proteins class and membrane protein class.

The top histogram (average log probability per predicted transmembrane helix for the SCOP v1.75 α-proteins class) and the bottom histogram (the same for the SCOP v1.75 membrane protein class) represent the false-positive and true-positive distributions for TM predictions respectively. The total numbers of predicted structural and membrane helices are 2293 and 5592 respectively.

Figure 8. Histograms of average log probability per predicted signal peptide for SCOP v1.75 α- and membrane protein class and SMART version 6.

The top histogram (average log probability per predicted signal peptide for the SCOP v1.75 α- and membrane protein classes) and the bottom histogram (the same for SMART version 6) represent the false-positive and true-positive distributions for the SP predictions respectively. The total numbers of predicted signal peptides for SCOP α- and membrane proteins are 193 and 379 respectively, while the total number for SMART is 45. All except SM00817 Amelin (no available structure) were validated against their respective PDB entries.
