CanPredict: a computational tool for predicting cancer-associated missense mutations (original) (raw)

Journal Article

,

Department of Bioinformatics, Genentech, Inc., South San Francisco, CA 94080, USA

Search for other works by this author on:

,

Department of Bioinformatics, Genentech, Inc., South San Francisco, CA 94080, USA

Search for other works by this author on:

,

Department of Bioinformatics, Genentech, Inc., South San Francisco, CA 94080, USA

Search for other works by this author on:

Department of Bioinformatics, Genentech, Inc., South San Francisco, CA 94080, USA

*To whom correspondence should be addressed. Tel: 650-225-4293; Fax:

650-225-5389

; Email: zemin@gene.com

Search for other works by this author on:

Received:

29 January 2007

Revision received:

17 April 2007

Cite

Joshua S. Kaminker, Yan Zhang, Colin Watanabe, Zemin Zhang, CanPredict: a computational tool for predicting cancer-associated missense mutations, Nucleic Acids Research, Volume 35, Issue suppl_2, 1 July 2007, Pages W595–W598, https://doi.org/10.1093/nar/gkm405
Close

Navbar Search Filter Mobile Enter search term Search

Abstract

Various cancer genome projects are underway to identify novel mutations that drive tumorigenesis. While these screens will generate large data sets, the majority of identified missense changes are likely to be innocuous passenger mutations or polymorphisms. As a result, it has become increasingly important to develop computational methods for distinguishing functionally relevant mutations from other variations. We previously developed an algorithm, and now present the web application, CanPredict (http://www.canpredict.org/ or http://www.cgl.ucsf.edu/Research/genentech/canpredict/), to allow users to determine if particular changes are likely to be cancer-associated. The impact of each change is measured using two known methods: Sorting Intolerant From Tolerant (SIFT) and the Pfam-based LogR.E-value metric. A third method, the Gene Ontology Similarity Score (GOSS), provides an indication of how closely the gene in which the variant resides resembles other known cancer-causing genes. Scores from these three algorithms are analyzed by a random forest classifier which then predicts whether a change is likely to be cancer-associated. CanPredict fills an important need in cancer biology and will enable a large audience of biologists to determine which mutations are the most relevant for further study.

INTRODUCTION

The study of mutations that drive tumorigenesis is a central focus of cancer biology. These mutations disrupt genes that regulate normal cellular processes, thereby providing growth advantages and metastatic capabilities to tumor cells. Understanding how such changes lead to an oncogenic phenotype can provide a deeper understanding of the molecular nature of different cancers while also revealing novel therapeutic targets. There are a number of well-known somatic mutations (1) and germline mutations (2,3) that have been implicated in cancer progression. However, there are likely many more mutations that have not yet been found (4). The identification and study of these additional mutations presents an important opportunity for further understanding of the biological processes and pathways underlying cancer.

Many large-scale screens have been initiated to identify novel cancer-causing mutations (4–7) (http://cancergenome.nih.gov). These efforts have relied on sequence analysis of a few hundred to several thousand genes across multiple tumor and cell line samples. While these screens are extremely important for further understanding of tumorigenesis, the results are difficult to interpret because the majority of identified changes are not cancer-causing. In fact, a recent large-scale survey of mutations in breast and colon cancers indicates that causal mutations likely account for less than 1% of all observed non-synonymous changes (4).

The high level of background signal can be attributed in part to single nucleotide polymorphisms (SNPs) and passenger mutations. SNPs can be distinguished from true cancer mutation data by a variety of methods including identifying the same change in a matched normal tissue sample, or identifying the same, change in a database of known SNPs such as dbSNP. However, such approaches can be complicated by many factors including a lack of matched normal samples for re-sequencing putative cancer mutations. Additionally, known SNP databases are largely incomplete (8) and can contain unreliable records, making it difficult to positively identify a particular change as an SNP.

It is even more difficult to distinguish passenger mutations from true cancer mutations as this usually requires laboratory experimentation. Recently, a method was developed by Sjoblom and colleagues (4) to identify passenger mutations by uncovering those changes that occur at a higher than expected frequency in a set of tumor samples. But, since this method is highly dependant on large numbers of representative tumor samples, well-known oncogenes such as BRAF were not identified due to their low observed frequency in the Sjoblom data. Thus, without methods specifically designed to analyze the mutations generated from these genome-scale screens, it is likely that a large number of true causal mutations will be overlooked.

Different algorithms have been developed to measure the effect a particular mutation might have on protein function. These approaches include Sorting Intolerant From Tolerant (SIFT) (9), the Pfam-based LogR.E-value metric (10), Polyphen (11), LS-SNP (12), statistical geometry methods (13), support vector machine methods (14), decision trees (15) and random forest classifiers (16). Additionally, methods based on the gene ontology such as the Gene Ontology Similarity Score (GOSS) (17) can also provide a measure as to how similar a gene of interest is to other known cancer-causing genes. While these algorithms may provide some indication about the nature of a particular mutation, it remains unclear whether by themselves such methods could be directly applicable in cancer mutation analysis.

Recently, using algorithms described earlier, we found that relevant somatic missense mutations behave differently from SNPs, and based on this distinction we developed a computational method to predict whether a variant is likely to be cancer-causing or not (17). Our algorithm uses a random forest classifier to combine data from the SIFT, LogR.E-value and GOSS metrics to generate a prediction to distinguish relevant mutations from other missense changes. We demonstrated that this approach could be potentially useful in distinguishing causal from passenger mutations (17). While this method was described in detail, its implementation requires a thorough understanding of random forest algorithms and the R programming language, likely impeding a large number of experimental biologists from attempting to classify their mutations. Here, we present a web application, CanPredict, that provides a clean and straightforward interface to our algorithm. Changes identified on a RefSeq protein sequence can be submitted and a prediction is generated as to whether the changes are cancer-associated or not. This application provides the first public interface to an important algorithm that can provide insight into the large amount of mutation data being generated from cancer re-sequencing projects.

METHODS AND IMPLEMENTATION

The algorithm supporting the CanPredict application uses a random forest (RF) classifier to predict whether an amino acid change is likely to be cancer-causing or not. RF classifiers divide a large pool of data into smaller subsets based on characteristics of each datum (18). For the CanPredict application, the three characteristics used to describe each mutation are scores from SIFT, the Pfam-based LogR.E-value and the GOSS metrics. The SIFT algorithm uses similarity between closely related proteins to identify potentially deleterious changes (9). SIFT scores <0.05 are predicted to be deleterious (9) and only SIFT scores with a median information content score <3.25 are included for predictions since higher values likely indicate unreliable SIFT scores (9). Also, because the computation time to generate alignments used by the SIFT algorithm is lengthy, the alignments for all RefSeq protein sequences have been pre-computed and are stored on the server. The Pfam-based logR.E-value score predicts whether a change will alter protein function by determining the difference in fit of a wild-type version of the protein to a particular Pfam model (10). These scores were derived from values provided by the HMMER 2.3.2 software and the ls mode was used to search against the Pfam protein family database. The LogR.E-value score was calculated as: log10(E-valuevariant/E-valuecanonical). Lastly, the GOSS metric uses the gene ontology to measure the similarity of the submitted RefSeq gene to other known cancer-causing genes (17).

The training data set used to construct the classifier is composed of 200 randomly selected known somatic cancer mutations and 800 non-cancer, non-synonymous variants. The cancer mutations were downloaded from data stored in the COSMIC database (1) and the non-cancer variants were selected randomly from SNPs stored in dbSNP with a minor allele frequency >20%. For each mutation in the training data, a score from the SIFT, LogR.E-value, and GOSS algorithms was determined. These values were used to build the classifier using the package randomForest 4.5-16 (http://stat-www.berkeley.edu/users/breiman/RandomForests) for the R statistical environment (http://www.r-project.org). The out-of-bag error, an internal measure of the rate of misclassification of the classifier, was determined to be 3.19% suggesting that the classifier is very effective. The training data are freely available from http://share.gene.com/mutation_classification.

As shown previously (17), data from three different experiments suggest that the predictor can function very well to highlight putative cancer mutations. First, in a cross-validation experiment, the classifier consistently revealed a very low false-positive rate of 1.7% for distinguishing relevant mutations from common SNPs (17). Second, an experiment was performed to distinguish recurrently identified mutations from mutations occurring only one time; causal mutations are more likely than passenger changes to be seen in multiple different tumor samples because they are under positive selection in tumor samples. In this analysis, 58% of variants observed more than 10 times were predicted to be cancer-associated while only 43% of variants occurring only one time were predicted cancer-associated (_P_-value 0.018, two-tailed Fisher Exact test) (17). Third, the classifier was used to analyze recent data from a large-scale screen for cancer mutations performed by Sjoblom and colleagues (4). In the paper by Sjoblom, mutations were grouped into those genes likely to cause cancer and those genes unlikely to cause cancer, CAN genes and non-CAN genes, respectively. The CanPredict classifier revealed that mutations in CAN genes were more likely to be predicted as cancer-associated than mutations in non-CAN genes (26.3% to 13.3%, respectively; _P_-value 8.8e-6; two-tailed Fisher Exact test) (17).

The CanPredict user interface was designed using dynamic AJAX technology. The user-supplied mutations and protein sequence data are validated via a server process, and the analysis status is instantly updated without the user leaving the input page. The results summary page is automatically loaded when the AJAX call detects that the analysis is complete. The Dojo library (www.dojotoolkit.org) implements AJAX calls by providing support for the back and forward buttons, changing the URL in the address bar to allow for bookmarking, and gracefully degrading when AJAX or JavaScript are not fully supported on the client.

RESULTS AND DISCUSSION

The CanPredict application can be used to submit a single full-length RefSeq protein sequence or accession and multiple associated changes (Figure 1). Additionally, from the Batch Submission page, the application will accept multiple RefSeq protein accessions and associated changes. There is no limit to the number of changes that can be analyzed from the Batch Submission page. Changes are validated by the server to ensure that the amino acid specified in the change string occurs in the indicated sequence. For testing the application, users can either enter their own mutations or use the test-it link to submit example mutations. Included in these examples are known cancer-causing mutations in BRAF, KRAS and EGFR.

The home page of the CanPredict application.

Figure 1.

The home page of the CanPredict application.

Results of the analysis are returned to the user in a summary page where they can also access all other submitted changes using links at the top of the summary (Figure 2). There is also a link directing users to a detailed description of the scores produced from each metric. Within the submission summary is a prediction from the classifier indicating likely cancer, likely non-cancer or not determined. The sequence flanking the change is included to allow the user to confirm the precise sequence used in the analysis. Below the submission summary are data from the SIFT, logR.E-value and GOSS analyses. As alignment files used by the SIFT algorithm are time-consuming to produce, they are available for download using the provided link. SIFT scores and median information content are also presented and only scores with a median information content of <3.25 are considered reliable (9) and will be used to generate a prediction from the classifier. The logR.E-value analysis indicates the domain altered by the submitted mutation. If there are multiple domains covering the same mutation, the domain with the most deleterious (largest) logR.E-value score will be selected for display and will be used by the classifier. The GOSS score is indicated last, and will be present only if the submitted change resides in a gene with a gene ontology description. The result pages can be bookmarked, and the associated data are saved in the server for a week. Finally, a link presented on the results summary page allows users to download their results in a tab-delimited format. Results from the batch submission page will be returned in a similar tab-delimited format.

The results summary page of the CanPredict application.

Figure 2.

The results summary page of the CanPredict application.

The CanPredict application provides an easily accessible interface for users to determine if an amino acid change is likely to be cancer-causing. This application will likely be very useful for large-scale cancer genome projects.

ACKNOWLEDGEMENTS

We would like to thank Pete Haverty and Bill Forrest for discussions about the CanPredict algorithm, Shiuh-Ming Luoh, Lawrence Hon, Jerry Tang, Kiran Mukhyala and Reece Hart for helpful discussions, Sarah Kaminker for careful reading and editing of the manuscript and William Wood for guidance and support throughout the project. We would also like to thank the UCSF Computer Graphics Laboratory and Dr. Thomas Ferrin for hosting the CanPredict web application. Funding to pay the Open Access publication charges for this article was provided by Genentech, Inc.

Conflict of interest statement. None decalred.

REFERENCES

1

et al.

Cosmic 2005

,

Br. J. Cancer

,

2006

, vol.

94

(pg.

318

-

322

)

2

et al.

Pituitary adenoma predisposition caused by germline mutations in the AIP gene

,

Science

,

2006

, vol.

312

(pg.

1228

-

1230

)

3

et al.

MC1R germline variants confer risk for BRAF-mutant melanoma

,

Science

,

2006

, vol.

313

(pg.

521

-

522

)

4

et al.

The consensus coding sequences of human breast and colorectal cancers

,

Science

,

2006

, vol.

314

(pg.

268

-

274

)

5

et al.

A screen of the complete protein kinase gene family identifies diverse patterns of somatic mutations in human breast cancer

,

Nat. Genet

,

2005

, vol.

37

(pg.

590

-

592

)

6

et al.

Colorectal cancer: mutations in a signalling pathway

,

Nature

,

2005

, vol.

436

pg.

792

7

et al.

Somatic mutations of the protein kinase gene family in human lung cancer

,

Cancer Res

,

2005

, vol.

65

(pg.

7591

-

7595

)

8

Variation is the spice of life

,

Nat. Genet

,

2001

, vol.

27

(pg.

234

-

236

)

9

Accounting for human polymorphisms predicted to affect protein function

,

Genome Res

,

2002

, vol.

12

(pg.

436

-

446

)

10

Large-scale analysis of non-synonymous coding region single nucleotide polymorphisms

,

Bioinformatics

,

2004

, vol.

20

(pg.

1006

-

1014

)

11

Human non-synonymous SNPs: server and survey

,

Nucleic Acids Res

,

2002

, vol.

30

(pg.

3894

-

3900

)

12

LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources

,

Bioinformatics

,

2005

, vol.

21

(pg.

2814

-

2820

)

13

Statistical geometry approach to the study of functional effects of human nonsynonymous SNPs

,

Hum. Mutat

,

2005

, vol.

26

(pg.

471

-

476

)

14

Identification and analysis of deleterious human SNPs

,

J. Mol. Biol

,

2006

, vol.

356

(pg.

1263

-

1274

)

15

A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function

,

Bioinformatics

,

2003

, vol.

19

(pg.

2199

-

2209

)

16

Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information

,

Bioinformatics

,

2005

, vol.

21

(pg.

2185

-

2190

)

17

et al.

Distinguishing cancer-associated missense mutations from common polymorphisms

,

Cancer Res

,

2007

, vol.

67

(pg.

465

-

473

)

18

Random Forests

,

Machine Learning

,

2001

, vol.

45

(pg.

5

-

32

)

© 2007 The Author(s)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

I agree to the terms and conditions. You must accept the terms and conditions.

Submit a comment

Name

Affiliations

Comment title

Comment

You have entered an invalid code

Thank you for submitting a comment on this article. Your comment will be reviewed and published at the journal's discretion. Please check for further notifications by email.

Citations

Views

Altmetric

Metrics

Total Views 1,883

1,341 Pageviews

542 PDF Downloads

Since 1/1/2017

Month: Total Views:
January 2017 1
February 2017 7
March 2017 5
April 2017 6
May 2017 4
June 2017 8
July 2017 6
August 2017 2
September 2017 3
October 2017 8
November 2017 8
December 2017 27
January 2018 36
February 2018 21
March 2018 22
April 2018 41
May 2018 23
June 2018 16
July 2018 23
August 2018 17
September 2018 15
October 2018 14
November 2018 10
December 2018 15
January 2019 18
February 2019 41
March 2019 24
April 2019 42
May 2019 22
June 2019 23
July 2019 24
August 2019 25
September 2019 21
October 2019 32
November 2019 24
December 2019 58
January 2020 19
February 2020 34
March 2020 28
April 2020 4
May 2020 28
June 2020 19
July 2020 15
August 2020 18
September 2020 5
October 2020 17
November 2020 17
December 2020 10
January 2021 8
February 2021 17
March 2021 12
April 2021 15
May 2021 24
June 2021 26
July 2021 8
August 2021 10
September 2021 14
October 2021 11
November 2021 15
December 2021 15
January 2022 21
February 2022 6
March 2022 18
April 2022 24
May 2022 46
June 2022 8
July 2022 24
August 2022 13
September 2022 20
October 2022 17
November 2022 20
December 2022 17
January 2023 15
February 2023 20
March 2023 15
April 2023 32
May 2023 14
June 2023 29
July 2023 21
August 2023 27
September 2023 16
October 2023 15
November 2023 17
December 2023 39
January 2024 38
February 2024 37
March 2024 41
April 2024 44
May 2024 24
June 2024 20
July 2024 32
August 2024 24
September 2024 28
October 2024 20

Citations

120 Web of Science

×

Email alerts

Citing articles via

More from Oxford Academic