Predicting the functional effect of amino acid substitutions and indels

Yongwook Choi et al. PLoS One. 2012.

Abstract

As next-generation sequencing projects generate massive genome-wide sequence variation data, bioinformatics tools are being developed to provide computational predictions on the functional effects of sequence variations and narrow down the search for causal variants for disease phenotypes. Different classes of sequence variations at the nucleotide level are involved in human diseases, including substitutions, insertions, deletions, frameshifts, and nonsense mutations. Frameshifts and nonsense mutations are likely to cause a negative effect on protein function. Existing prediction tools primarily focus on studying the deleterious effects of single amino acid substitutions through examining amino acid conservation at the position of interest among related sequences, an approach that is not directly applicable to insertions or deletions. Here, we introduce a versatile alignment-based score as a new metric to predict the damaging effects of variations not limited to single amino acid substitutions but also in-frame insertions, deletions, and multiple amino acid substitutions. This alignment-based score measures the change in sequence similarity of a query sequence to a protein sequence homolog before and after the introduction of an amino acid variation to the query sequence. Our results showed that the scoring scheme performs well in separating disease-associated variants (n = 21,662) from common polymorphisms (n = 37,022) for UniProt human protein variations, and also in separating deleterious variants (n = 15,179) from neutral variants (n = 17,891) for UniProt non-human protein variations. In our approach, the area under the receiver operating characteristic curve (AUC) for the human and non-human protein variation datasets is ∼0.85. We also observed that the alignment-based score correlates with the deleteriousness of a sequence variation.
In summary, we have developed a new algorithm, PROVEAN (Protein Variation Effect Analyzer), which provides a generalized approach to predict the functional effects of protein sequence variations including single or multiple amino acid substitutions, and in-frame insertions and deletions. The PROVEAN tool is available online at http://provean.jcvi.org.

Conflict of interest statement

Competing Interests: The authors have the following competing interests: The authors have developed a new algorithm, PROVEAN (Protein Variation Effect Analyzer), which provides a generalized approach to predict the functional effects of protein sequence variations including single or multiple amino acid substitutions, and in-frame insertions and deletions. The PROVEAN tool is available online at http://provean.jcvi.org. There are no further patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials, as detailed online in the guide for authors.

Figures

Figure 1. Examples of computing and interpreting delta alignment scores for six different known variations, (A) deleterious substitution (MIM:151623), (B) neutral substitution (dbSNP:rs1042522), (C) deleterious deletion (MIM:219700), (D) neutral deletion (dbSNP:rs72471101), (E) deleterious insertion (MIM:164200), and (F) neutral insertion (dbSNP:rs10625857) with respect to the selected homologous proteins.

The amino acid residue replaced, deleted, or inserted is indicated by an arrow, and the difference between two alignments is indicated by a rectangle. Low delta scores are interpreted as deleterious, and high delta scores are interpreted as neutral. The BLOSUM62 substitution matrix and gap penalties of 10 for opening and 1 for extension were used.
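
The delta alignment score described in this caption can be illustrated with a minimal sketch. It assumes a global alignment with affine gaps (the caption does not say whether the alignment is global or local), assumes that a gap of length k costs 10 + (k - 1) × 1, and uses a hypothetical `toy_score` function (+4 for identity, -1 otherwise) as a stand-in for BLOSUM62. The peptide strings in the usage note are invented for illustration and are not taken from the paper.

```python
NEG_INF = float("-inf")


def toy_score(a, b):
    """Hypothetical stand-in for BLOSUM62: +4 for identity, -1 otherwise."""
    return 4 if a == b else -1


def align_score(query, subject, score=toy_score, gap_open=10, gap_extend=1):
    """Best global alignment score with affine gaps (Gotoh recurrences).

    Assumed convention: a gap of length k costs
    gap_open + (k - 1) * gap_extend.
    """
    n, m = len(query), len(subject)
    # M: alignment ends in a residue pair
    # X: alignment ends with a query residue opposite a gap
    # Y: alignment ends with a subject residue opposite a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(query[i - 1], subject[j - 1])
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


def delta_score(query, variant, subject, **kwargs):
    """Alignment score of the variant minus that of the unmodified query."""
    return align_score(variant, subject, **kwargs) - align_score(query, subject, **kwargs)
```

With the invented query `MKTAYIAK` and an identical homolog, the substitution variant `MKTAHIAK` gives a delta of -5 and the deletion variant `MKTAIAK` gives -14. Against a homolog that already carries H at that position, the same substitution gives +5, which matches the caption's reading that high delta scores point to neutral changes.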

Figure 2. An overview of the PROVEAN procedure.

Figure 3. PROVEAN score distribution for deleterious and neutral human protein variations.

For all classes of variations including substitutions, indels, and replacements, the distribution shows a distinct separation between the deleterious and neutral variations.

Figure 4. PROVEAN score distribution of deletions and insertions collected from the Human Gene Mutation Database (HGMD) and the 1000 Genomes Project.

Only mutations annotated as “disease-causing” were collected from the HGMD. The distribution shows a distinct separation between the two datasets.

Figure 5. ROC curves of four different prediction tools for single amino acid substitutions found in human and non-human proteins.

All tools show a similar predictive ability with the AUC value of ∼0.85.
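
As a reminder of what the AUC summarizes, the sketch below computes it directly from two lists of scores. It relies on the standard equivalence between the area under the ROC curve and the probability that a randomly chosen deleterious variant scores lower than a randomly chosen neutral one, with ties counted as one half. The function name `auc` and the example scores are illustrative only, and the quadratic loop is meant for clarity rather than for datasets of the size reported in the abstract.

```python
def auc(deleterious_scores, neutral_scores):
    """Area under the ROC curve when LOW scores indicate deleterious variants.

    Computed as the fraction of (deleterious, neutral) pairs in which the
    deleterious variant has the lower score; ties count as half a pair.
    """
    if not deleterious_scores or not neutral_scores:
        raise ValueError("both score lists must be non-empty")
    favorable = 0.0
    for d in deleterious_scores:
        for n in neutral_scores:
            if d < n:
                favorable += 1.0
            elif d == n:
                favorable += 0.5
    return favorable / (len(deleterious_scores) * len(neutral_scores))
```

For example, `auc([-5, -3, -1], [-2, 0, 1])` evaluates to 8/9, because eight of the nine pairs are ordered correctly.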

Figure 6. ROC curves of different prediction tools for single amino acid substitutions in (A) E. coli lac repressor protein and (B) human TP53 tumor suppressor protein.

The AUC values are shown in the legend. The top two performers for LacI were PolyPhen-2 and PROVEAN, and those for TP53 were Mutation Assessor and PROVEAN.

Figure 7. PROVEAN score distribution of TP53 variations binned into 15 classes based on transactivation levels.

For each class, a box plot is shown. The vertical line shows the whole range of delta scores, the thick horizontal line shows the median, and the gray rectangle shows the interquartile range (25%–75%). The PROVEAN score increases and correlates with median transactivation level for the non-functional and partially functional classes of variations.
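
The box-plot elements named in this caption (full range, median, and the 25%–75% interquartile range) can be computed per class as in the following sketch. The helper names `box_summary` and `summarize_by_class` are hypothetical, and the choice of the "inclusive" quantile method is an assumption, since the caption does not say how the quartiles were estimated.

```python
from statistics import median, quantiles


def box_summary(scores):
    """Return the five numbers drawn in a box plot for one class of variants.

    Requires at least two scores, because statistics.quantiles does.
    """
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {
        "min": min(scores),
        "q1": q1,
        "median": median(scores),
        "q3": q3,
        "max": max(scores),
    }


def summarize_by_class(scores_by_class):
    """Map each transactivation class label to its box-plot summary."""
    return {label: box_summary(scores) for label, scores in scores_by_class.items()}
```

Calling `box_summary([-5, -4, -3, -2, -1])` returns a minimum of -5, quartiles of -4.0 and -2.0, a median of -3, and a maximum of -1.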

Figure 8. Correlation of cholesterol efflux values with the PROVEAN scores for ABCA1 variations.

In general, an increase in score (i.e. less deleterious effects) correlates with an increase in cholesterol efflux activity.
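
One way to quantify a monotonic relationship of this kind is a rank correlation. The sketch below implements Spearman's coefficient from scratch. The caption does not state which correlation measure the authors used, so the choice of Spearman is an assumption, and the numbers in the usage note are invented rather than taken from the ABCA1 data.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

With invented scores `[-6, -4, -2, 0]` and efflux values `[10, 20, 40, 80]`, `spearman` returns 1.0, since the efflux rises with every increase in score.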

Figure 9. Number of supporting sequences used for the UniProt human proteins carrying neutral or deleterious variants.

(A) Distribution of the 11,990 human proteins based on the number of supporting sequences used for PROVEAN prediction. (B) Prediction accuracy achieved with respect to the number of supporting sequences. The observed accuracy is consistently above 73%, except in cases when the number of supporting sequences drops below 50.
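
The analysis in panel (B) amounts to grouping predictions by the number of supporting sequences and computing accuracy within each group. A minimal sketch follows, assuming each record is a hypothetical `(n_supporting, predicted_label, true_label)` tuple and that bins are 50 sequences wide. The bin width is chosen only to mirror the threshold of 50 mentioned above and is not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_support(records, bin_width=50):
    """Accuracy per bin of supporting-sequence count.

    records: iterable of (n_supporting, predicted_label, true_label).
    Returns {bin_lower_bound: fraction of correct predictions}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for n_supporting, predicted, actual in records:
        lower = (n_supporting // bin_width) * bin_width
        total[lower] += 1
        correct[lower] += int(predicted == actual)
    return {lower: correct[lower] / total[lower] for lower in sorted(total)}
```

For instance, five invented records with 10, 30, 60, 75, and 120 supporting sequences, of which only the second is misclassified, yield `{0: 0.5, 50: 1.0, 100: 1.0}`.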

Figure 10. Computing the PROVEAN score.

For simplicity, only the top three clusters were included in building the supporting sequence set in this example.
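
The combination step can be sketched as follows, under two assumptions that this caption does not spell out: that delta scores are first averaged within each cluster of supporting sequences and then averaged across clusters, and that a variant is called deleterious when the final score is at or below a cutoff of -2.5. The function names `provean_like_score` and `classify` are hypothetical, and the delta scores in the usage note are invented.

```python
def provean_like_score(clusters):
    """Average of per-cluster mean delta scores.

    clusters: list of lists, one inner list of delta scores per cluster.
    Averaging within clusters first keeps a large cluster of near-identical
    sequences from dominating the result (assumed rationale).
    """
    cluster_means = [sum(c) / len(c) for c in clusters if c]
    if not cluster_means:
        raise ValueError("at least one non-empty cluster is required")
    return sum(cluster_means) / len(cluster_means)


def classify(score, cutoff=-2.5):
    """Label a score; the -2.5 cutoff is an assumption, not from this caption."""
    return "deleterious" if score <= cutoff else "neutral"
```

With three invented clusters, `[[-4.0, -6.0], [-2.0], [-3.0, -3.0, -3.0]]`, the cluster means are -5, -2, and -3, so the score is about -3.33 and `classify` labels it deleterious.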
