SIFT: predicting amino acid changes that affect protein function

Pauline C. Ng, Steven Henikoff*

*To whom correspondence should be addressed. Tel: +1 2066674515; Fax: +1 2066675889; Email: steveh@fhcrc.org


Pauline C. Ng, Steven Henikoff, SIFT: predicting amino acid changes that affect protein function, Nucleic Acids Research, Volume 31, Issue 13, 1 July 2003, Pages 3812–3814, https://doi.org/10.1093/nar/gkg509

Abstract

Single nucleotide polymorphism (SNP) studies and random mutagenesis projects identify amino acid substitutions in protein-coding regions. Each substitution has the potential to affect protein function. SIFT (Sorting Intolerant From Tolerant) is a program that predicts whether an amino acid substitution affects protein function so that users can prioritize substitutions for further study. We have shown that SIFT can distinguish between functionally neutral and deleterious amino acid changes in mutagenesis studies and on human polymorphisms. SIFT is available at http://blocks.fhcrc.org/sift/SIFT.html.

Received January 4, 2003; Revised and Accepted February 28, 2003

INTRODUCTION

Single nucleotide polymorphisms (SNPs) are used as markers in linkage and association studies to detect which regions in the human genome may be involved in disease. SNPs in coding and regulatory regions may be implicated in disease themselves. Non-synonymous SNPs that lead to an amino acid change in the protein product are of major interest, because amino acid substitutions currently account for approximately half of the known gene lesions responsible for human inherited disease (1). SIFT (Sorting Intolerant From Tolerant) uses sequence homology to predict whether an amino acid substitution will affect protein function and hence, potentially alter phenotype (2,3).

SIFT has been applied to human variant databases and was able to distinguish mutations involved in disease from neutral polymorphisms (3). Assuming that disease-causing amino acid substitutions are damaging to protein function, we applied SIFT to a database of missense substitutions associated with or involved in disease (4). SIFT predicted 69% of these substitutions to be damaging. When SIFT was applied to the non-synonymous SNPs in dbSNP (5), a database of putative SNPs, 25% of the variants were predicted to be deleterious. This was similar to SIFT's 20% false positive error, which suggested that most non-synonymous SNPs are functionally neutral. Furthermore, a subset of the variants from dbSNP predicted to affect function were involved in disease, which confirmed SIFT's sensitivity.

The SIFT algorithm relies solely on sequence for prediction, yet performs similarly to tools that use structure (3,6–8). An advantage of not requiring structure is that predictions can be made for a larger number of substitutions. Of the non-synonymous SNPs identified by the SNP Consortium, 74% were sufficiently similar to homologs in protein sequence databases for SIFT prediction. The number of substitutions for which SIFT can make predictions is expected to increase as more genomes are sequenced and more protein sequences become available.

SIFT PREDICTION METHOD

SIFT presumes that important amino acids will be conserved in the protein family, and so changes at well-conserved positions tend to be predicted as deleterious. For example, if a position in an alignment of a protein family only contains the amino acid isoleucine, it is presumed that substitution to any other amino acid is selected against and that isoleucine is necessary for protein function. Therefore, a change to any other amino acid will be predicted to be deleterious to protein function. If a position in an alignment contains the hydrophobic amino acids isoleucine, valine and leucine, then SIFT assumes, in effect, that this position can only contain amino acids with hydrophobic character. At this position, changes to other hydrophobic amino acids are usually predicted to be tolerated but changes to other residues (such as charged or polar) will be predicted to affect protein function.

To predict whether an amino acid substitution in a protein will affect protein function, SIFT considers the position at which the change occurred and the type of amino acid change. Given a protein sequence, SIFT chooses related proteins and obtains an alignment of these proteins with the query. Based on the amino acids appearing at each position in the alignment, SIFT calculates the probability that an amino acid at a position is tolerated conditional on the most frequent amino acid being tolerated. If this normalized value is less than a cutoff, the substitution is predicted to be deleterious (2). The SIFT algorithm and software have been described previously (2,3).
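The normalization step described above can be illustrated with a minimal Python sketch. It assumes raw observed frequencies in a single alignment column stand in for the tolerance probabilities; the published algorithm (2) is more involved than plain frequency counting, so the function names `normalized_scores` and `predict` and the counting scheme here are illustrative assumptions rather than SIFT's implementation.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalized_scores(column):
    """Score each amino acid in one alignment column.

    Assumption: the probability of an amino acid is its raw observed
    frequency in the column. Each probability is divided by that of
    the most frequent amino acid, so the most frequent residue scores 1.0.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)  # skip gaps
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains no amino acids")
    probs = {aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS}
    top = max(probs.values())
    return {aa: p / top for aa, p in probs.items()}


def predict(column, new_residue, cutoff=0.05):
    """Return (score, call) for a substitution to new_residue."""
    score = normalized_scores(column)[new_residue]
    call = "deleterious" if score < cutoff else "tolerated"
    return score, call


if __name__ == "__main__":
    conserved = "IIIIIIIIII"    # only isoleucine observed
    hydrophobic = "IIIIVVVLLL"  # isoleucine, valine and leucine observed
    print(predict(conserved, "V"))
    print(predict(hydrophobic, "V"))
    print(predict(hydrophobic, "D"))
```

With raw counts, any amino acid that is never observed scores zero. That is a limitation of this simplified sketch and not a property of SIFT, which in the hydrophobic example above usually tolerates other hydrophobic residues even if they do not appear in the alignment.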

SIFT WEBSITE

Input

Users can obtain predictions for amino acid changes of interest at http://www.blocks.fhcrc.org/sift/SIFT.html. From this page, there are links to three submission pages which allow users different levels of involvement in order to control the quality of their predictions.

For minimal involvement, users can simply submit their protein sequences and amino acid substitutions. In its fully automated mode, SIFT will search for protein sequences homologous to the query protein and based on these sequences, calculate probabilities for each possible amino acid change. Users can select from among SWISS-PROT, SWISS-PROT/TrEMBL, or NCBI's non-redundant protein databases for SIFT to search (4,9).

Although SIFT can choose sequences automatically, better prediction results may be obtained when all of the sequences that are provided are orthologous to the query protein. This is because inclusion of paralogous sequences confounds prediction at residues conserved only among the orthologues. If a user already has sequences that are thought to be functionally similar to the protein of interest, these sequences can be directly submitted and SIFT's step for choosing sequences skipped. Given the query protein and homologous sequences, SIFT obtains the alignment.

If regions are misaligned, SIFT will not recognize conserved positions and will therefore miss potentially damaging substitutions. For best prediction quality, a third mode of operation allows users to submit their own alignments.

Output

Predictions are given for all 20 possible amino acid changes at each position in the protein. The alignment is also returned so that users can examine the sequences used for prediction and modify them for resubmission. This option is also useful for removing uncertain, erroneous and misaligned sequences from alignment output generated by SIFT in its automatic mode.

For amino acid substitutions submitted by the user, a more detailed synopsis is provided (Fig. 1). The score is the normalized probability that the amino acid change is tolerated. SIFT predicts substitutions with scores less than 0.05 as deleterious. Some SIFT users have found that a cutoff of 0.1 provides better sensitivity for detecting deleterious SNPs (Cornelia Ulrich, personal communication and 10). The quantitative score allows users to prioritize their amino acid changes by ranking them from the lowest scores to the highest.
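As a small illustration of ranking by score, the following Python sketch sorts a set of substitutions from lowest to highest score and applies a cutoff. The substitution names and scores are made up for the example, and `prioritize` is a hypothetical helper, not part of the SIFT software.

```python
def prioritize(scores, cutoff=0.05):
    """Rank substitutions from lowest to highest score.

    scores: dict mapping a substitution label to its SIFT-style score.
    Returns a list of (label, score, call) tuples, lowest score first.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [
        (label, score, "deleterious" if score < cutoff else "tolerated")
        for label, score in ranked
    ]


if __name__ == "__main__":
    # Hypothetical substitutions and scores, for illustration only.
    example = {"A123V": 0.32, "R77C": 0.00, "G45D": 0.07}
    for row in prioritize(example):               # default cutoff of 0.05
        print(row)
    for row in prioritize(example, cutoff=0.1):   # more sensitive cutoff
        print(row)
```

Raising the cutoff from 0.05 to 0.1 changes only the calls, not the ranking, so the ordered list remains useful whichever threshold is chosen.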

Confidence in a substitution predicted to be deleterious depends on the diversity of the sequences in the alignment. If the sequences used for prediction are closely related, then many positions will appear conserved and SIFT will predict most substitutions to affect protein function. This leads to a high false positive error where functionally neutral substitutions are predicted to be deleterious.

To alert the user to these situations, SIFT calculates the median conservation value, which measures the diversity of the sequences in the alignment. Conservation, as measured by information content (11), is calculated for each position in the alignment and the median of these values is obtained. Conservation ranges from log₂ 20 (= 4.32), when a position is completely conserved and only one amino acid is observed, to zero, when all 20 amino acids are observed at a position. By default, SIFT builds alignments with a median conservation value of 3.0. Alignments with higher median conservation values are less diverse, and predictions based on them will have a higher false positive error (Fig. 2).
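A minimal Python sketch of a median conservation calculation follows. It assumes conservation at a position is log2(20) minus the Shannon entropy of the observed amino acid frequencies, which matches the range stated above (4.32 for a fully conserved position, zero when all 20 amino acids are equally observed). Any corrections or weighting used by SIFT itself are not reproduced, and the function names are illustrative.

```python
import math
from collections import Counter
from statistics import median

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def information_content(column):
    """Conservation of one alignment column, in bits.

    Assumption: conservation = log2(20) - Shannon entropy of the raw
    amino acid frequencies; gap characters are ignored.
    """
    residues = [c for c in column if c in AMINO_ACIDS]
    n = len(residues)
    if n == 0:
        raise ValueError("column contains no amino acids")
    counts = Counter(residues)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return math.log2(20) - entropy


def median_conservation(alignment):
    """Median per-position conservation for equal-length aligned sequences."""
    columns = zip(*alignment)  # transpose rows (sequences) into columns
    return median(information_content("".join(col)) for col in columns)


if __name__ == "__main__":
    print(round(information_content("IIII"), 2))       # fully conserved
    print(round(information_content(AMINO_ACIDS), 2))  # all 20 observed once
    toy_alignment = ["AIK", "AIR", "AVE"]               # hypothetical sequences
    print(round(median_conservation(toy_alignment), 2))
```

A value well above the default of 3.0 for a user-supplied alignment would suggest that the sequences are closely related and that predicted deleterious substitutions should be treated with more caution.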

Even if there are few homologous sequences available, SIFT performs better than simply predicting non-conservative amino acid substitutions as deleterious, where non-conservative changes are defined as having negative scores in an amino acid substitution scoring matrix. We have shown that with only one sequence homologous to the test protein, SIFT can predict twice as many neutral substitutions correctly compared to a substitution scoring matrix (2). Even with few homologous sequences, there will be positions that differ between the test protein and the other sequences. Depending on the amino acids appearing at these positions, SIFT may predict these positions to be unimportant for protein function. This additional information can eliminate functionally neutral substitutions and increase selectivity to deleterious substitutions.

In summary, a large number of substitutions can be obtained from mutagenesis projects, SNP datasets, and changes between closely related organisms. When it is not feasible to conduct experiments on all substitutions, SIFT and other similar prediction tools (13) may be useful in prioritizing which changes affect protein function and may contribute to phenotypic differences.

ACKNOWLEDGEMENTS

We thank Jorja Henikoff for advice and encouragement. This work was supported by a grant from NIH (GM29009).

Figure 1. An example of SIFT prediction on amino acid changes in a protein. Substitutions with score less than 0.05 are predicted to affect protein function. In the last prediction, the median conservation of the sequences does not meet the threshold so a warning is issued.

Figure 2. Prediction depends on the diversity of the sequences used in the alignment. Percentage of substitutions correctly predicted is based on over 4000 substitutions that were assayed throughout the LacI protein of Escherichia coli (2,12). When the sequences in the alignment used for prediction are closely related (high median conservation) then many positions appear conserved and important for function. In this situation, prediction accuracy on deleterious substitutions is high but many functionally neutral substitutions are erroneously predicted to be deleterious. To obtain an alignment with a specified median conservation, the LacI protein sequence of E.coli was submitted to the SIFT website and the median conservation setting adjusted. Because the homologous sequences available are distantly related to E.coli LacI, alignments with higher median conservation values could not be obtained. In order to obtain alignments with median conservation values more than 3.25, closely related sequences were simulated by starting with an alignment of identical E.coli LacI sequences. A position and a sequence were randomly selected from the LacI alignment with median conservation 2.75. The amino acid corresponding to this location was substituted in the starting alignment. Amino acids continued to be randomly selected and substituted until the desired median conservation was met. The simulated alignment was then evaluated for its performance as previously described (2) and the plotted value is the average performance of 100 simulated alignments.

References

1. Krawczak,M., Ball,E.V., Fenton,I., Stenson,P.D., Abeysinghe,S., Thomas,N. and Cooper,D.N. (2000) Human gene mutation database-a biomedical information and research resource. Hum. Mutat., 15, 45–51.

2. Ng,P.C. and Henikoff,S. (2001) Predicting deleterious amino acid substitutions. Genome Res., 11, 863–874.

3. Ng,P.C. and Henikoff,S. (2002) Accounting for human polymorphisms predicted to affect protein function. Genome Res., 12, 436–446.

4. Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.

5. Sherry,S.T., Ward,M.H., Kholodov,M., Baker,J., Phan,L., Smigielski,E.M. and Sirotkin,K. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.

6. Sunyaev,S., Ramensky,V., Koch,I., Lathe,W.,III, Kondrashov,A.S. and Bork,P. (2001) Prediction of deleterious human alleles. Hum. Mol. Genet., 10, 591–597.

7. Chasman,D. and Adams,R.M. (2001) Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J. Mol. Biol., 307, 683–706.

8. Saunders,C.T. and Baker,D. (2002) Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J. Mol. Biol., 322, 891–901.

9. Wheeler,D.L., Church,D.M., Lash,A.E., Leipe,D.D., Madden,T.L., Pontius,J.U., Schuler,G.D., Schriml,L.M., Tatusova,T.A., Wagner,L. and Rapp,B.A. (2002) Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res., 30, 13–16.

10. Leabman,M.K., Huang,C.C., DeYoung,J., Carlson,E.J., Taylor,T., de la Cruz,M., Johns,S.J., Stryke,D., Kawamoto,M., Urban,T.J., et al. (2003) Natural variation in human membrane transporter genes reveals evolutionary and functional constraints. Proc. Natl Acad. Sci. USA, in press.

11. Schneider,T.D., Stormo,G.D., Gold,L. and Ehrenfeucht,A. (1986) Information content of binding sites on nucleotide sequences. J. Mol. Biol., 188, 415–431.

12. Pace,H.C., Kercher,M.A., Lu,P., Markiewicz,P., Miller,J.H., Chang,G. and Lewis,M. (1997) Lac repressor genetic map in real space. Trends Biochem. Sci., 22, 334–339.

13. Ramensky,V., Bork,P. and Sunyaev,S. (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res., 30, 3894–3900.
