A new protein linear motif benchmark for multiple sequence alignment software

Emmanuel Perrodou et al. BMC Bioinformatics. 2008.

Abstract

Background: Linear motifs (LMs) are abundant short regulatory sites used for modulating the functions of many eukaryotic proteins. They play important roles in post-translational modification, cell compartment targeting, docking sites for regulatory complex assembly and protein processing and cleavage. Methods for LM detection are now being developed that are strongly dependent on scores for motif conservation in homologous proteins. However, most LMs are found in natively disordered polypeptide segments that evolve rapidly, unhindered by structural constraints on the sequence. These regions of modular proteins are difficult to align using classical multiple sequence alignment programs that are specifically optimised to align the globular domains. As a consequence, poor motif alignment quality is hindering efforts to detect new LMs.

Results: We have developed a new benchmark, as part of the BAliBASE suite, designed to assess the ability of standard multiple alignment methods to detect and align LMs. The reference alignments are organised into different test sets representing real alignment problems and contain examples of experimentally verified functional motifs, extracted from the Eukaryotic Linear Motif (ELM) database. The benchmark has been used to evaluate and compare a number of multiple alignment programs. With distantly related proteins, the worst alignment program correctly aligns 48% of LMs compared to 73% for the best program. However, the performance of all the programs is adversely affected by the introduction of other sequences containing false positive motifs. The ranking of the alignment programs based on LM alignment quality is similar to that observed when considering full-length protein alignments; however, little correlation was observed between LM and overall alignment quality for individual alignment test cases.

Conclusion: We have shown that none of the programs currently available is capable of reliably aligning LMs in distantly related sequences and we have highlighted a number of specific problems. The results of the tests suggest possible ways to improve program accuracy for difficult, divergent sequences.

Figures

Figure 1

BAliBASE Reference Set 9 construction protocol. Flow-chart showing the 3 major steps of the protocol used to construct the BAliBASE Reference Set 9.

Figure 2

Example alignment in BAliBASE Reference Set 9. Part of an alignment, showing the MOD_PKM_1 ELM regular expression (R.R..[ST]...), with examples of true positive, false positive and false negative motifs. The last three sequences do not contain any examples of the motif and cannot be aligned unambiguously in the region of the motif instance.
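
As a rough illustration of how a motif regular expression of this kind is scanned against a protein sequence, the following minimal Python sketch uses the pattern quoted in the Figure 2 caption, written without internal whitespace. The function name and the two example sequences are hypothetical and are not taken from the benchmark.

```python
import re

# Pattern as quoted in the Figure 2 caption, written without whitespace.
MOTIF = re.compile(r"R.R..[ST]...")


def find_motif_matches(sequence, pattern=MOTIF):
    """Return (start, matched_text) for every match, including overlapping ones.

    A zero-width lookahead is used so that a match starting inside
    another match is still reported.
    """
    lookahead = re.compile(f"(?=({pattern.pattern}))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence)]


# Hypothetical sequences, for illustration only.
with_motif = "MKRARSLSAPGEQ"
without_motif = "MKAAASLSAPGEQ"

print(find_motif_matches(with_motif))     # [(2, 'RARSLSAPG')]
print(find_motif_matches(without_motif))  # []
```

A pattern match on its own cannot distinguish a true positive from a false positive: that distinction depends on experimental validation of the motif instance, which is why the reference alignments draw on annotated instances from ELM.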

Figure 3

Program SPS scores at different levels of overall sequence similarity. a) Box plots of the SPS scores obtained by the different alignment programs in subset 1, showing the extreme observations (stars or circles), lower quartile, median, upper quartile, and largest observation in each similarity category. b) Execution times in seconds required to construct all the multiple alignments in subset 1. Programs are displayed in the order of the Friedman test using the SPS scores for group V11 (additional file 1), with the highest scoring program on the left.
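
The SPS score is not defined in this excerpt. Assuming it follows the usual sum-of-pairs definition (the fraction of residue pairs aligned in the reference alignment that are also aligned in the test alignment), a minimal Python sketch could look like the one below. The function names, the optional restriction to selected reference columns (for example the motif columns) and the toy alignments are all assumptions made for illustration, not the benchmark's own scoring code.

```python
from itertools import combinations


def aligned_pairs(alignment, columns=None):
    """Collect aligned residue pairs from a gapped alignment.

    alignment: list of equal-length strings using '-' for gaps.
    columns:   optional set of column indices to restrict to.
    Each pair is ((seq_index, residue_index), (seq_index, residue_index)),
    where residue_index counts ungapped positions within that sequence.
    """
    pairs = set()
    counters = [0] * len(alignment)
    for col in range(len(alignment[0])):
        present = []
        for s, seq in enumerate(alignment):
            if seq[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        if columns is None or col in columns:
            pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(reference, test, ref_columns=None):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(reference, ref_columns)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Toy alignments, for illustration only.
reference = ["AC-GT", "ACAGT"]
test = ["ACG-T", "ACAGT"]

print(sum_of_pairs_score(reference, test))       # 0.75
print(sum_of_pairs_score(reference, test, {3}))  # 0.0
```

Restricting the score to a few reference columns shows how an alignment can score well overall (0.75 here) while misaligning the region of interest entirely (0.0 for column 3).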

Figure 4

Program SPS scores depending on different motif characteristics. Box plots of the SPS scores obtained by the different alignment programs in subset 1, group V11 (<20% identity) under different conditions. The boxplots indicate the extreme observations (stars), lower quartile, median, upper quartile, and largest observation. Significant differences, according to a Wilcoxon signed ranks test (p < 0.05), are indicated by an asterisk on the x-axis. P-values for the Wilcoxon tests are available in additional file 1, table 2. a) SPS scores for motifs found in globular domains versus disordered regions. b) SPS scores for motifs with a conserved residue versus variable motifs.

Figure 5

Program SPS scores after inclusion of sequences without validated motifs. Box plots of the SPS scores obtained by the different alignment programs under different conditions, showing the extreme observations (stars or circles), lower quartile, median, upper quartile, and largest observation. Significant differences, according to a Wilcoxon signed ranks test (p < 0.05), are indicated by an asterisk on the x-axis. P-values for the Wilcoxon tests are available in additional file 1, table 3. a) SPS scores for alignments of sequences with validated motifs only compared to alignments including sequences with errors. b) SPS scores for alignments of sequences with validated motifs only compared to alignments including sequences containing false positive (FP) motifs. c) SPS scores for alignments of sequences with validated motifs only compared to alignments including sequences that do not contain any examples of the motif.

Figure 6

Program NorMD scores at different levels of overall sequence similarity. Box plots of the NorMD scores, representing the alignment quality of the full length alignment, obtained by the different alignment programs for the different similarity categories in subset 1. The boxplots indicate the extreme observations (stars), lower quartile, median, upper quartile, and largest observation. Programs are displayed in the order of the Friedman test using the NorMD scores for group V11 (additional file 1), with the highest scoring program on the left.
