Word decoding of protein amino acid sequences with availability analysis: a linguistic approach

Kenta Motomura et al. PLoS One. 2012.

Abstract

The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information relates to structure and function is still enigmatic. In this study, we show that at least part of the sequence information can be extracted by treating amino acid sequences of proteins as a collection of English words, based on the working hypothesis that amino acid sequences of proteins are composed of short constituent amino acid sequences (SCSs), or "words". We first confirmed that English text very likely follows Zipf's law, a special case of a power law. We found that the rank-frequency plot of SCSs in proteins exhibits a similar distribution when low-rank tails are excluded. Compared with natural English and "compressed" English without spaces between words, amino acid sequences of proteins show larger linear ranges and smaller exponents with heavier low-rank tails, demonstrating that the SCS distribution in proteins is largely scale-free. The distribution pattern of SCSs in proteins is similar among species, but species-specific features are also present. Based on the availability scores of SCSs, we found that sequence motifs are enriched in high-availability sites (i.e., "key words") and vice versa. In fact, the highest availability peak within a given protein sequence often corresponds directly to a sequence motif. The amino acid composition of high-availability sites within motifs differs from that of entire motifs and of all protein sequences, suggesting the possible functional importance of specific SCSs and their constituent amino acids within motifs. We anticipate that our availability-based word decoding approach will complement sequence alignment approaches in predicting functionally important sites of unknown proteins from their amino acid sequences.
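The rank-frequency analysis described in the abstract can be illustrated with a minimal Python sketch. It assumes that SCSs are taken as overlapping fixed-length windows (here triplets) slid along each sequence; this record does not state how the authors extracted SCSs, so the windowing scheme, the function names `count_scs` and `rank_frequency`, and the toy sequences are all illustrative assumptions rather than the published method.

```python
from collections import Counter


def count_scs(sequences, k=3):
    """Count every overlapping length-k window ("word") across all sequences.

    Assumption: overlapping sliding windows; the source does not specify this.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, frequency) pairs, with rank 1 as the most frequent word."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


# Toy input made up for illustration; not real protein data.
toy_sequences = ["MKVLAAGMKV", "AAGMKVLA"]
counts = count_scs(toy_sequences, k=3)
pairs = rank_frequency(counts)
```

Plotting the logarithm of frequency against the logarithm of rank for `pairs` would give the kind of rank-frequency plot the abstract refers to; on a real protein database the number of distinct words is of course far larger than in this toy case.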

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Distributions of English letters and protein amino acid sequences.

The relationships between rank (R) and number (frequency) of words (N) are shown in the log-log plots. The background red lines indicate Y = A – bX, where b varies with an interval of 0.05. The thick red line indicates the best-fit least squares line. The green dots and lines indicate the observed results. (a) Natural English, following Zipf's law (left), and distribution of English word lengths (middle and right). The word lengths peak at 3. (b) Compressed English. (c) Protein amino acid sequences.
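The best-fit line Y = A – bX in the log-log plot can be sketched as an ordinary least-squares fit of log N against log R. The following Python sketch assumes base-10 logarithms and a fit over all supplied points; the caption does not state the logarithm base or which rank range was fitted, and `fit_power_law` is a hypothetical helper name. The exponent b and the correlation coefficient r do not depend on the base.

```python
import math


def fit_power_law(ranks, freqs):
    """Least-squares fit of log10(N) = A - b * log10(R).

    Returns (A, b, r), where r is the Pearson correlation coefficient
    of the log-transformed data. Base 10 is an assumption.
    """
    xs = [math.log10(rank) for rank in ranks]
    ys = [math.log10(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, -slope, r


# Synthetic ideal Zipf data (N = 1000 / R), used only to check the fit.
ranks = list(range(1, 51))
freqs = [1000.0 / rank for rank in ranks]
A, b, r = fit_power_law(ranks, freqs)
```

For this synthetic input the fit recovers an exponent close to 1 and an intercept close to 3, with r close to -1, as expected for an exact Zipf distribution.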

Figure 2. The exponent and linear widths of compressed English and proteins in the rank-frequency plot.

(a) The exponent b and correlation coefficient r in compressed English and proteins. (b) The linear width in compressed English and proteins along both the _X_- and _Y_-axes.

Figure 3. Relationship between remaining top rank sample (%) and discriminant R value.

(a) nr-aa (all). (b) H. sapiens. (c) D. melanogaster. (d) A. thaliana. (e) E. coli.

Figure 4. The rank-frequency relationship of proteins in four model organisms.

Boundaries at the top 50% rank samples are indicated by a blue dotted line. (a) H. sapiens. (b) D. melanogaster. (c) A. thaliana. (d) E. coli.

Figure 5. The exponents and linear widths of proteins from four model organisms in the rank-frequency plots.

(a) The exponent b. (b) The correlation coefficient r. (c) The linear width along the _X_-axis. (d) The linear width along the _Y_-axis.

Figure 6. The availability plots of two examples of amino acid sequences of proteins, peptidase M10 (top) and ribonuclease T2 (bottom).

The _X_- and _Y_-axes indicate the position in the protein amino acid sequence and the availability score, respectively.
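An availability plot of this kind assigns a score to each position along a sequence. This record does not give the formula for the availability score, so the Python sketch below substitutes a simple stand-in: the score of a position is assumed to be the number of times the pentat starting there occurs in a reference set. The functions `kmer_counts`, `availability_profile`, and `highest_peak`, and all sequences, are hypothetical and made up for illustration; they are not the authors' definition.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count overlapping length-k words in a reference set of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def availability_profile(seq, reference_counts, k=5):
    """Stand-in score (assumption): reference count of the word at each position."""
    return [reference_counts.get(seq[i:i + k], 0)
            for i in range(len(seq) - k + 1)]


def highest_peak(profile):
    """Return (position, score) of the first maximum in the profile."""
    best = max(range(len(profile)), key=profile.__getitem__)
    return best, profile[best]


# Toy reference set and query; invented sequences, not database entries.
reference = ["AAHELGHAA", "GGHELGHGG", "TTHELGHTT"]
query = "MKHELGHLL"
ref_counts = kmer_counts(reference, k=5)
profile = availability_profile(query, ref_counts, k=5)
peak = highest_peak(profile)
```

In this toy case the only pentat shared between the query and the reference set is `HELGH`, so the profile has a single peak at that position, which mirrors the idea that the highest peak can coincide with a motif.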

Figure 7. The percentages of correspondence between sequence motifs and SCSs (i.e., triplets, quartets, and pentats).

The blue bars indicate _rA_ = 100%, whereas the red bars indicate _rA_ ≥ 50%.

Figure 8. The characterization of high-availability sites within motifs.

(a) The distribution of the pentat availability scores within motifs and random fragments. (b) The rank order analysis of the amino acid composition of full sequences, motifs, and high-availability areas of pentats, quartets, and triplets within motifs. The rank order distance (ROD) scores are shown beside the horizontal bars. The numbers at the top of each amino acid indicate the rank difference from the full sequences.

Cited by

The Compressed Vocabulary of Microbial Life.
Caetano-Anollés G. Front Microbiol. 2021 Jul 7;12:655990. doi: 10.3389/fmicb.2021.655990. eCollection 2021. PMID: 34305827 Free PMC article.
Computer-Aided Design of Antimicrobial Peptides: Are We Generating Effective Drug Candidates?
Cardoso MH, Orozco RQ, Rezende SB, Rodrigues G, Oshiro KGN, Cândido ES, Franco OL. Front Microbiol. 2020 Jan 22;10:3097. doi: 10.3389/fmicb.2019.03097. eCollection 2019. PMID: 32038544 Free PMC article. Review.
The estimation of probability distribution for factor variables with many categorical values.
Lee M, Kang YS, Seok J. PLoS One. 2018 Aug 24;13(8):e0202547. doi: 10.1371/journal.pone.0202547. eCollection 2018. PMID: 30142178 Free PMC article.
Quantiprot - a Python package for quantitative analysis of protein sequences.
Konopka BM, Marciniak M, Dyrka W. BMC Bioinformatics. 2017 Jul 17;18(1):339. doi: 10.1186/s12859-017-1751-4. PMID: 28716000 Free PMC article.

Grants and funding

This work was supported by funds from University of the Ryukyus, Okinawa, Japan. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
