SIFT: predicting amino acid changes that affect protein function

Journal Article

*To whom correspondence should be addressed. Tel: +1 2066674515; Fax: +1 2066675889; Email: steveh@fhcrc.org

Pauline C. Ng, Steven Henikoff, SIFT: predicting amino acid changes that affect protein function, Nucleic Acids Research, Volume 31, Issue 13, 1 July 2003, Pages 3812–3814, https://doi.org/10.1093/nar/gkg509

Abstract

Single nucleotide polymorphism (SNP) studies and random mutagenesis projects identify amino acid substitutions in protein-coding regions. Each substitution has the potential to affect protein function. SIFT (Sorting Intolerant From Tolerant) is a program that predicts whether an amino acid substitution affects protein function so that users can prioritize substitutions for further study. We have shown that SIFT can distinguish between functionally neutral and deleterious amino acid changes in mutagenesis studies and on human polymorphisms. SIFT is available at http://blocks.fhcrc.org/sift/SIFT.html.

Received January 4, 2003; Revised and Accepted February 28, 2003

INTRODUCTION

Single nucleotide polymorphisms (SNPs) are used as markers in linkage and association studies to detect which regions in the human genome may be involved in disease. SNPs in coding and regulatory regions may be implicated in disease themselves. Non-synonymous SNPs that lead to an amino acid change in the protein product are of major interest, because amino acid substitutions currently account for approximately half of the known gene lesions responsible for human inherited disease (1). SIFT (Sorting Intolerant From Tolerant) uses sequence homology to predict whether an amino acid substitution will affect protein function and hence, potentially alter phenotype (2,3).

SIFT has been applied to human variant databases and was able to distinguish mutations involved in disease from neutral polymorphisms (3). Assuming that disease-causing amino acid substitutions are damaging to protein function, we applied SIFT to a database of missense substitutions associated with or involved in disease (4). SIFT predicted 69% to be damaging. When SIFT was applied to the non-synonymous SNPs in dbSNP (5), a database of putative SNPs, 25% of the variants were predicted to be deleterious. This was similar to SIFT's 20% false positive error, which suggested that most non-synonymous SNPs are functionally neutral. Furthermore, a subset of the dbSNP variants predicted to affect function were involved in disease, which confirmed SIFT's sensitivity.

The SIFT algorithm relies solely on sequence for prediction, yet performs similarly to tools that use structure (3,6–8). An advantage of not requiring structure is that predictions can be made for a larger number of substitutions. Of the non-synonymous SNPs identified by the SNP Consortium, 74% were sufficiently similar to homologs in protein sequence databases for SIFT prediction. The number of substitutions for which SIFT can make predictions is expected to increase as more genomes are sequenced and more protein sequences become available.

SIFT PREDICTION METHOD

SIFT presumes that important amino acids will be conserved in the protein family, and so changes at well-conserved positions tend to be predicted as deleterious. For example, if a position in an alignment of a protein family only contains the amino acid isoleucine, it is presumed that substitution to any other amino acid is selected against and that isoleucine is necessary for protein function. Therefore, a change to any other amino acid will be predicted to be deleterious to protein function. If a position in an alignment contains the hydrophobic amino acids isoleucine, valine and leucine, then SIFT assumes, in effect, that this position can only contain amino acids with hydrophobic character. At this position, changes to other hydrophobic amino acids are usually predicted to be tolerated but changes to other residues (such as charged or polar) will be predicted to affect protein function.

To predict whether an amino acid substitution in a protein will affect protein function, SIFT considers the position at which the change occurred and the type of amino acid change. Given a protein sequence, SIFT chooses related proteins and obtains an alignment of these proteins with the query. Based on the amino acids appearing at each position in the alignment, SIFT calculates the probability that an amino acid at a position is tolerated conditional on the most frequent amino acid being tolerated. If this normalized value is less than a cutoff, the substitution is predicted to be deleterious (2). The SIFT algorithm and software have been described previously (2,3).
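The normalization and cutoff described above can be illustrated with a minimal Python sketch. It is not the SIFT implementation: the function names are hypothetical, and residue probabilities are estimated here from raw column counts with a small uniform pseudocount (an assumption made for illustration), whereas the estimation SIFT actually uses is the one described in (2,3). Only the scaling by the most frequent amino acid and the comparison against a cutoff follow the text.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalized_scores(column, pseudocount=0.1):
    """Scale each residue's estimated probability by that of the most frequent residue.

    `column` is one alignment position as a string of residues. The uniform
    pseudocount is an illustrative assumption, not SIFT's estimator.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    probs = {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
    top = max(probs.values())
    return {aa: p / top for aa, p in probs.items()}


def is_deleterious(column, new_residue, cutoff=0.05):
    """Predict deleterious when the normalized value falls below the cutoff."""
    return normalized_scores(column)[new_residue] < cutoff


# A hydrophobic position containing isoleucine, valine and leucine.
column = "I" * 20 + "V" * 10 + "L" * 10
print(is_deleterious(column, "V"))  # change to a residue seen at this position
print(is_deleterious(column, "D"))  # change to a charged residue never seen here
```

In this toy column the most frequent residue always receives a normalized value of 1, so the cutoff is applied on the same scale at every position regardless of how many sequences are aligned.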

SIFT WEBSITE

Input

Users can obtain predictions for amino acid changes of interest at http://www.blocks.fhcrc.org/sift/SIFT.html. From this page, there are links to three submission pages which allow users different levels of involvement in order to control the quality of their predictions.

For minimal involvement, users can simply submit their protein sequences and amino acid substitutions. In its fully automated mode, SIFT will search for protein sequences homologous to the query protein and based on these sequences, calculate probabilities for each possible amino acid change. Users can select from among SWISS-PROT, SWISS-PROT/TrEMBL, or NCBI's non-redundant protein databases for SIFT to search (4,9).

Although SIFT can choose sequences automatically, better prediction results may be obtained when all of the sequences that are provided are orthologous to the query protein. This is because inclusion of paralogous sequences confounds prediction at residues conserved only among the orthologues. If a user already has sequences that are thought to be functionally similar to the protein of interest, these sequences can be directly submitted and SIFT's step for choosing sequences skipped. Given the query protein and homologous sequences, SIFT obtains the alignment.

If regions are misaligned, SIFT will not recognize conserved positions and will therefore miss potentially damaging substitutions. For best prediction quality, a third mode of operation allows users to submit their own alignments.

Output

Predictions are given for all 20 possible amino acid changes at each position in the protein. The alignment is also returned so that users can examine the sequences used for prediction and modify them for resubmission. This option is also useful for removing uncertain, erroneous and misaligned sequences from alignment output generated by SIFT in its automatic mode.

For amino acid substitutions submitted by the user, a more detailed synopsis is provided (Fig. 1). The score is the normalized probability that the amino acid change is tolerated. SIFT predicts substitutions with scores less than 0.05 as deleterious. Some SIFT users have found that substitutions with scores less than 0.1 provide better sensitivity for detecting deleterious SNPs (Cornelia Ulrich, personal communication and 10). The quantitative score allows users to prioritize their amino acid changes by ranking them from the lowest scores to the highest.
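As an illustration of how the scores might be used downstream, the following Python sketch ranks substitutions from the lowest score to the highest and labels them against a cutoff. The function name and the example substitutions and scores are hypothetical; only the 0.05 default and the alternative 0.1 cutoff come from the text.

```python
def prioritize(scores, cutoff=0.05):
    """Rank substitutions by ascending score and label each against the cutoff.

    `scores` maps a substitution label to its normalized probability of
    being tolerated. Scores below `cutoff` are labeled deleterious.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [
        (substitution, score, "deleterious" if score < cutoff else "tolerated")
        for substitution, score in ranked
    ]


# Hypothetical scores, for illustration only.
example = {"V102L": 0.62, "I45D": 0.01, "K7R": 0.08}

for row in prioritize(example):
    print(row)

# The more permissive cutoff trades specificity for sensitivity.
for row in prioritize(example, cutoff=0.1):
    print(row)
```

With these made-up values, the substitution scoring 0.08 is labeled tolerated at the default cutoff and deleterious at 0.1, which is the kind of borderline case the choice of cutoff affects.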

Confidence in a substitution predicted to be deleterious depends on the diversity of the sequences in the alignment. If the sequences used for prediction are closely related, then many positions will appear conserved and SIFT will predict most substitutions to affect protein function. This leads to a high false positive error where functionally neutral substitutions are predicted to be deleterious.

To alert the user to these situations, SIFT calculates the median conservation value, which measures the diversity of the sequences in the alignment. Conservation, as measured by information content (11), is calculated for each position in the alignment and the median of these values is obtained. Conservation ranges from log2(20) (= 4.32), when a position is completely conserved and only one amino acid is observed, to zero, when all 20 amino acids are observed at a position. By default, SIFT builds alignments with a median conservation value of 3.0. Alignments with higher median conservation values are less diverse, and predictions based on them will have a higher false positive error (Fig. 2).
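A minimal Python sketch of the median conservation calculation follows. It assumes that information content at a position is log2(20) minus the Shannon entropy of the observed residue frequencies, which reproduces the stated range (4.32 for a fully conserved position, zero when all 20 amino acids are equally frequent). The function names are hypothetical, the alignment is assumed to be gap-free with sequences of equal length, and any sequence weighting or other corrections SIFT may apply are not modeled.

```python
import math
from collections import Counter
from statistics import median


def conservation(column):
    """Information content of one alignment column, in bits."""
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return math.log2(20) - entropy


def median_conservation(alignment):
    """Median of the per-column conservation values of a gap-free alignment."""
    return median(conservation(column) for column in zip(*alignment))


# Toy alignment of four sequences, three positions each (illustrative only).
alignment = ["IIA", "IVC", "ILD", "IIE"]
print(round(conservation("IIII"), 2))  # fully conserved column
print(round(median_conservation(alignment), 2))
```

A user could compare the value returned for an alignment against the default of 3.0 to judge whether the sequences are diverse enough for the predictions to be trusted.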

Even if there are few homologous sequences available, SIFT performs better than simply predicting non-conservative amino acid substitutions as deleterious, where non-conservative changes are defined as having negative scores in an amino acid substitution scoring matrix. We have shown that with only one sequence homologous to the test protein, SIFT can predict twice as many neutral substitutions correctly compared to a substitution scoring matrix (2). Even with few homologous sequences, there will be positions that differ between the test protein and the other sequences. Depending on the amino acids appearing at these positions, SIFT may predict these positions to be unimportant for protein function. This additional information can eliminate functionally neutral substitutions and increase selectivity to deleterious substitutions.

In summary, a large number of substitutions can be obtained from mutagenesis projects, SNP datasets, and changes between closely related organisms. When it is not feasible to conduct experiments on all substitutions, SIFT and other similar prediction tools (13) may be useful in prioritizing which changes affect protein function and may contribute to phenotypic differences.

ACKNOWLEDGEMENTS

We thank Jorja Henikoff for advice and encouragement. This work was supported by a grant from NIH (GM29009).

Figure 1. An example of SIFT prediction on amino acid changes in a protein. Substitutions with score less than 0.05 are predicted to affect protein function. In the last prediction, the median conservation of the sequences does not meet the threshold so a warning is issued.

Figure 2. Prediction depends on the diversity of the sequences used in the alignment. Percentage of substitutions correctly predicted is based on over 4000 substitutions that were assayed throughout the LacI protein of Escherichia coli (2,12). When the sequences in the alignment used for prediction are closely related (high median conservation) then many positions appear conserved and important for function. In this situation, prediction accuracy on deleterious substitutions is high but many functionally neutral substitutions are erroneously predicted to be deleterious. To obtain an alignment with a specified median conservation, the LacI protein sequence of E.coli was submitted to the SIFT website and the median conservation setting adjusted. Because the homologous sequences available are distantly related to E.coli LacI, alignments with higher median conservation values could not be obtained. In order to obtain alignments with median conservation values more than 3.25, closely related sequences were simulated by starting with an alignment of identical E.coli LacI sequences. A position and a sequence were randomly selected from the LacI alignment with median conservation 2.75. The amino acid corresponding to this location was substituted in the starting alignment. Amino acids continued to be randomly selected and substituted until the desired median conservation was met. The simulated alignment was then evaluated for its performance as previously described (2) and the plotted value is the average performance of 100 simulated alignments.
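The simulation procedure in the Figure 2 legend can be sketched in Python as follows. This is one reading of the legend, not the code used to produce the figure: the function names are hypothetical, "until the desired median conservation was met" is interpreted as the median falling to or below the target, conservation is taken as log2(20) minus the Shannon entropy of each gap-free column, and the toy sequences stand in for the LacI alignment.

```python
import math
import random
from collections import Counter
from statistics import median


def median_conservation(alignment):
    """Median over columns of log2(20) minus the column's Shannon entropy."""
    values = []
    for column in zip(*alignment):
        n = len(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
        values.append(math.log2(20) - entropy)
    return median(values)


def simulate_alignment(query, diverse, target, rng):
    """Start from identical copies of `query` and copy in randomly chosen
    residues from `diverse` until the median conservation reaches `target`."""
    if any(len(seq) != len(query) for seq in diverse):
        raise ValueError("all sequences must have the same length as the query")
    if median_conservation(diverse) > target:
        raise ValueError("target is lower than the diverse alignment allows")
    simulated = [list(query) for _ in diverse]
    while median_conservation(simulated) > target:
        row = rng.randrange(len(diverse))  # randomly selected sequence
        col = rng.randrange(len(query))    # randomly selected position
        simulated[row][col] = diverse[row][col]
    return ["".join(seq) for seq in simulated]


# Toy stand-in for a diverse alignment (illustrative only).
diverse = ["IKDL", "VRDL", "LKEL", "IREM", "VKDL", "LRDI"]
result = simulate_alignment("IKDL", diverse, target=3.8, rng=random.Random(0))
print(result, round(median_conservation(result), 2))
```

Averaging performance over many such alignments, as the legend describes for 100 simulations, would amount to calling the function repeatedly with differently seeded generators.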

References

1. Krawczak,M., Ball,E.V., Fenton,I., Stenson,P.D., Abeysinghe,S., Thomas,N. and Cooper,D.N. (2000) Human gene mutation database-a biomedical information and research resource. Hum. Mutat., –51.

2. Ng,P.C. and Henikoff,S. (2001) Predicting deleterious amino acid substitutions. Genome Res., 863–874.

3. Ng,P.C. and Henikoff,S. (2002) Accounting for human polymorphisms predicted to affect protein function. Genome Res., 436–446.

4. Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., –48.

5. Sherry,S.T., Ward,M.H., Kholodov,M., Baker,J., Phan,L., Smigielski,E.M. and Sirotkin,K. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 308–311.

6. Sunyaev,S., Ramensky,V., Koch,I., Lathe,W.,III, Kondrashov,A.S. and Bork,P. (2001) Prediction of deleterious human alleles. Hum. Mol. Genet., 591–597.

7. Chasman,D. and Adams,R.M. (2001) Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J. Mol. Biol., 307, 683–706.

8. Saunders,C.T. and Baker,D. (2002) Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J. Mol. Biol., 322, 891–901.

9. Wheeler,D.L., Church,D.M., Lash,A.E., Leipe,D.D., Madden,T.L., Pontius,J.U., Schuler,G.D., Schriml,L.M., Tatusova,T.A., Wagner,L. and Rapp,B.A. (2002) Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res., –16.

10. Leabman,M.K., Huang,C.C., DeYoung,J., Carlson,E.J., Taylor,T., de la Cruz,M., Johns,S.J., Stryke,D., Kawamoto,M., Urban,T.J., et al. (2003) Natural variation in human membrane transporter genes reveals evolutionary and functional constraints. Proc. Natl Acad. Sci. USA, in press.

11. Schneider,T.D., Stormo,G.D., Gold,L. and Ehrenfeucht,A. (1986) Information content of binding sites on nucleotide sequences. J. Mol. Biol., 188, 415–431.

12. Pace,H.C., Kercher,M.A., Lu,P., Markiewicz,P., Miller,J.H., Chang,G. and Lewis,M. (1997) Lac repressor genetic map in real space. Trends Biochem. Sci., 334–339.

13. Ramensky,V., Bork,P. and Sunyaev,S. (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res., 3894–3900.
