The cancer genome

Author manuscript; available in PMC: 2010 Feb 15.

Published in final edited form as: Nature. 2009 Apr 9;458(7239):719–724. doi: 10.1038/nature07943

Abstract

All cancers arise as a result of changes that have occurred in the DNA sequence of the genomes of cancer cells. Over the past quarter of a century much has been learnt about these mutations and the abnormal genes that operate in human cancers. We are now, however, moving into an era in which it will be possible to obtain the complete DNA sequence of large numbers of cancer genomes. These studies will provide us with a detailed and comprehensive perspective on how individual cancers have developed.


Cancer is responsible for one in eight deaths worldwide1. It encompasses more than 100 distinct diseases with diverse risk factors and epidemiology which originate from most of the cell types and organs of the human body and which are characterized by relatively unrestrained proliferation of cells that can invade beyond normal tissue boundaries and metastasize to distant organs.

Early insights into the central role of the genome in cancer development emerged in the late nineteenth and early twentieth centuries from studies by David von Hansemann2 and Theodor Boveri3. Examining dividing cancer cells under the microscope, they observed the presence of bizarre chromosomal aberrations. This led to the proposal that cancers are abnormal clones of cells characterized by and caused by abnormalities of hereditary material. Following the discovery of DNA as the molecular substrate of inheritance4 and determination of its structure5, this speculation was supported by the demonstration that agents that damage DNA and generate mutations also cause cancer6. Subsequently, increasingly refined analyses of cancer cell chromosomes showed that specific and recurrent genomic abnormalities, such as the translocation between chromosomes 9 and 22 in chronic myeloid leukaemia (known as the ‘Philadelphia’ translocation7,8), are associated with particular cancer types. Finally, it was demonstrated that introduction of total genomic DNA from human cancers into phenotypically normal NIH3T3 cells could convert them into cancer cells9,10. Isolation of the specific DNA segment responsible for this transforming activity led to the identification of the first naturally occurring, human cancer-causing sequence change—the single base G > T substitution that causes a glycine to valine substitution in codon 12 of the HRAS gene11,12. This seminal discovery in 1982 inaugurated an era of vigorous searching for the abnormal genes underlying the development of human cancer that continues today.

Here we review the principles of our current understanding of cancer genomes. We look forward to the explosion of information about cancer genomes that is imminent and the insights into the process of oncogenesis that this promises to generate.

Cancer is an evolutionary process

All cancers are thought to share a common pathogenesis. Each is the outcome of a process of Darwinian evolution occurring among cell populations within the microenvironments provided by the tissues of a multicellular organism. Analogous to Darwinian evolution occurring in the origins of species, cancer development is based on two constituent processes: the continuous acquisition of heritable genetic variation in individual cells by more-or-less random mutation, and natural selection acting on the resultant phenotypic diversity. The selection may weed out cells that have acquired deleterious mutations or it may foster cells carrying alterations that confer the capability to proliferate and survive more effectively than their neighbours. Within an adult human there are probably thousands of minor winners of this ongoing competition, most of which have limited abnormal growth potential and are invisible or manifest as common benign growths such as skin moles. Occasionally, however, a single cell acquires a set of sufficiently advantageous mutations that allows it to proliferate autonomously, invade tissues and metastasize.

The catalogue of somatic mutations in a cancer genome

Like all the cells that constitute the human body, a cancer cell is a direct descendant, through a lineage of mitotic cell divisions, of the fertilized egg from which the cancer patient developed and therefore carries a copy of its diploid genome (Fig. 1). However, the DNA sequence of a cancer cell genome, and indeed of most normal cell genomes, has acquired a set of differences from its progenitor fertilized egg. These are collectively termed somatic mutations to distinguish them from germline mutations that are inherited from parents and transmitted to offspring.

Figure 1. The lineage of mitotic cell divisions from the fertilized egg to a single cell within a cancer showing the timing of the somatic mutations acquired by the cancer cell and the processes that contribute to them.


Mutations may be acquired while the cell lineage is phenotypically normal, reflecting both the intrinsic mutations acquired during normal cell division and the effects of exogenous mutagens. During the development of the cancer other processes, for example DNA repair defects, may contribute to the mutational burden. Passenger mutations do not have any effect on the cancer cell, but driver mutations will cause a clonal expansion. Relapse after chemotherapy can be associated with resistance mutations that often predate the initiation of treatment.

The somatic mutations in a cancer cell genome may encompass several distinct classes of DNA sequence change. These include substitutions of one base by another; insertions or deletions of small or large segments of DNA; rearrangements, in which DNA has been broken and then rejoined to a DNA segment from elsewhere in the genome; copy number increases from the two copies present in the normal diploid genome, sometimes to several hundred copies (known as gene amplification); and copy number reductions that may result in complete absence of a DNA sequence from the cancer genome (Fig. 2).

Figure 2. Figurative depiction of the landscape of somatic mutations present in a single cancer genome.


Part of the catalogue of somatic mutations in the small-cell lung cancer cell line NCI-H2171. Individual chromosomes are depicted on the outer circle followed by concentric tracks for point mutation, copy number and rearrangement data relative to mapping position in the genome. Arrows indicate examples of the various types of somatic mutation present in this cancer genome.

In addition, the cancer cell may have acquired, from exogenous sources, completely new DNA sequences, notably those of viruses such as human papilloma virus, Epstein-Barr virus, hepatitis B virus, human T lymphotropic virus 1 and human herpes virus 8, each of which is known to contribute to the genesis of one or more types of cancer13.

Compared to the fertilized egg, the cancer genome will also have acquired epigenetic changes which alter chromatin structure and gene expression, and which manifest at DNA sequence level by changes in the methylation status of some cytosine residues. Epigenetic changes can be subject to the same Darwinian natural selection as genetic events, provided that there is epigenetic variation in the population of competing cells, that the epigenetic changes are stably heritable from the mother to the daughter cell and that they generate phenotypic effects for selection to act on.

Finally, it should not be forgotten that another genome is harboured within the cancer cell. The thousands of mitochondria present each carry a circular genome of approximately 17 kilobases. Somatic mutations in mitochondrial genomes have been reported in many human cancers, although their role in the development of the disease is not clear14.

Acquisition of somatic mutations in cancer genomes

The mutations found in a cancer cell genome have accumulated over the lifetime of the cancer patient. Some were acquired when ancestors of the cancer cell were biologically normal, showing no phenotypic characteristics of a cancer cell (Fig. 1). DNA in normal cells is continuously damaged by mutagens of both internal and external origins. Most of this damage is repaired. However, a small fraction may be converted into fixed mutations and DNA replication itself has a low intrinsic error rate. Our understanding of somatic mutation rates in normal human cells is still relatively rudimentary. However, it is likely that the mutation rates of each of the various structural classes of somatic mutation differ and that there are differences among cell types too. Mutation rates increase in the presence of substantial exogenous mutagenic exposures, for example tobacco smoke carcinogens, naturally occurring chemicals such as aflatoxins, which are produced by fungi, or various forms of radiation including ultraviolet light. These exposures are associated with increased rates of lung, liver and skin cancer, respectively, and somatic mutations within such cancers often exhibit the distinctive mutational signatures known to be associated with the mutagen15. The rates of the different classes of somatic mutation are also increased in several rare inherited diseases, for example Fanconi anaemia, ataxia telangiectasia, mosaic variegated aneuploidy and xeroderma pigmentosum, each of which is also associated with increased risks of cancer16,17.

The rest of the somatic mutations in a cancer cell genome have been acquired during the segment of the cell lineage in which predecessors of the cancer cell already show phenotypic evidence of neoplastic change (Fig. 1). Whether the somatic mutation rate is always higher during this part of the lineage is controversial18,19. For some cancers this is clearly the case. For example, colorectal and endometrial cancers with defective DNA mismatch repair due to abnormalities in genes such as MLH1 and MSH2 exhibit increased rates of acquisition of single nucleotide changes and small insertions/deletions at polynucleotide tracts20. Other classes of such ‘mutator phenotypes’ may exist, for example leading to abnormalities in chromosome number or increased rates of genomic rearrangement, although these are generally less well characterized20. The merit of an increased somatic mutation rate with respect to the development of cancer is that it increases the DNA sequence diversity on which selection can act. However, it has been suggested that the mutation rates of normal cells may be sufficient to account for the development of some cancers, without the requirement for a mutator phenotype18,19.

The course of mutation acquisition need not be smooth and predecessors of the cancer cell may suddenly acquire a large number of mutations. This is sometimes termed ‘crisis’21, and can occur after attrition of the telomeres that normally cap the ends of chromosomes, with the cell having to substantially reorganize its genome to survive.

Although complex and potentially cryptic to decipher, the catalogue of somatic mutations present in a cancer cell therefore represents a cumulative archaeological record of all the mutational processes the cancer cell has experienced throughout the lifetime of the patient. It provides a rich, and predominantly unmined, source of information for cancer epidemiologists and biologists with which to interrogate the development of individual tumours.

Driver and passenger mutations

Each somatic mutation in a cancer cell genome, whatever its structural nature, may be classified according to its consequences for cancer development. ‘Driver’ mutations confer growth advantage on the cells carrying them and have been positively selected during the evolution of the cancer. They reside, by definition, in the subset of genes known as ‘cancer genes’. The remainder of mutations are ‘passengers’ that do not confer growth advantage, but happened to be present in an ancestor of the cancer cell when it acquired one of its drivers (see Box 1).

Box 1 | Driver and passenger mutations.

All cancers arise as a result of somatically acquired changes in the DNA of cancer cells. That does not mean, however, that all the somatic abnormalities present in a cancer genome have been involved in development of the cancer. Indeed, it is likely that some have made no contribution at all. To embody this concept, the terms ‘driver’ and ‘passenger’ mutation have been coined.

A driver mutation is causally implicated in oncogenesis. It has conferred growth advantage on the cancer cell and has been positively selected in the microenvironment of the tissue in which the cancer arises. A driver mutation need not be required for maintenance of the final cancer (although it often is) but it must have been selected at some point along the lineage of cancer development shown in Fig. 1.

A passenger mutation has not been selected, has not conferred clonal growth advantage and has therefore not contributed to cancer development. Passenger mutations are found within cancer genomes because somatic mutations without functional consequences often occur during cell division. Thus, a cell that acquires a driver mutation will already have biologically inert somatic mutations within its genome. These will be carried along in the clonal expansion that follows and therefore will be present in all cells of the final cancer.

Some somatic mutations may actually impair cell survival. These will usually be subject to negative selection and hence be absent from the cancer genome. The traces of negative selection in cancer genomes are currently limited but it would be surprising if it was not operative.

A central goal of cancer genome analysis is the identification of cancer genes that, by definition, carry driver mutations. A key challenge will therefore be to distinguish driver from passenger mutations. The main strategy generally used exploits a number of structural signatures associated with mutations that are under positive selection. For example, driver mutations cluster in the subset of genes that are cancer genes whereas passenger mutations are more or less randomly distributed. This has been the approach adopted fruitfully in the past to identify most somatically mutated cancer genes in studies targeted at small regions of the genome.

Whole-genome sequencing, however, incorporating analysis of more than 20,000 protein-coding genes and unknown numbers of functional elements in intronic and intergenic DNA, presents a greater challenge, one rendered more daunting by the likelihood that passenger mutations in most cancer genomes substantially outnumber drivers. Because many cancer genes seem to contribute to cancer development in only a small fraction of tumours, large sample sets will have to be analysed to distinguish infrequently mutated cancer genes from genes with random clusters of passenger mutations. Furthermore, it is conceivable that some mutational processes are directed at specific genomic regions and thus generate clusters of passenger mutations that may be mistaken for drivers.

Therefore, all such signatures of positive selection need to be interpreted with caution. In practice, however, used in an informed and critical manner they will remain effective and reliable guides to the identification of cancer genes. Investigation of the biological consequences of putative driver mutations will often consolidate the evidence implicating them in oncogenesis and will provide insight into the subverted biological processes by which they contribute to cancer development.

The number of driver mutations, and hence the number of abnormal cancer genes, in an individual cancer is a central conceptual parameter of cancer development, but is not well established. It is highly likely that most cancers carry more than one driver and that the number varies between cancer types. On the basis of age–incidence statistics it has been suggested that common adult epithelial cancers such as breast, colorectal and prostate require 5–7 rate-limiting events, possibly equating to drivers, whereas cancers of the haematological system may require fewer22. These estimates are supported by experimental studies which show that engineering changes in the functions of at least five or six genes in normal primary human cells is necessary to convert them into cancer cells23. However, recent analyses of somatic mutation data from cancers indicate that the number of drivers might be much higher24. Ultimately, direct estimates of the number of drivers in individual cancers will be provided by identifying all the cancer genes and systematically measuring the prevalence of mutations in them.
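One classical way of reasoning from age–incidence curves to a number of rate-limiting events is the multistage model associated with Armitage and Doll, in which incidence rises roughly as age raised to the power k − 1 for k events. Whether the estimate cited above rests on exactly this form is not stated in the text, so the following Python sketch is only an illustration. The function name and the incidence values are hypothetical; the data are synthetic and generated from the model itself.

```python
import math

def estimate_stage_count(ages, incidences):
    """Estimate k from the log-log slope of incidence against age.

    Assumes incidence is proportional to age**(k - 1), so k = slope + 1.
    """
    xs = [math.log(a) for a in ages]
    ys = [math.log(i) for i in incidences]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope of log(incidence) on log(age).
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope + 1

# Synthetic incidence values following age**5 exactly (that is, k = 6).
ages = [40, 50, 60, 70, 80]
synthetic_incidence = [1e-12 * a ** 5 for a in ages]
k_hat = estimate_stage_count(ages, synthetic_incidence)
print(f"Estimated number of rate-limiting events: {k_hat:.2f}")
```

Real age–incidence data are noisier and flatten at older ages, so a fitted slope of this kind is better read as a rough guide than as a count of driver mutations.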

One important subclass of driver is a mutation that confers resistance to cancer therapy (Fig. 1). These are typically found in recurrences of cancers that have initially responded to treatment but that are now resistant. Resistance mutations often confer limited growth advantage on the cancer cell in the absence of therapy. Some seem to predate initiation of treatment, existing as passengers in minor subclones of the cancer cell population until the selective environment is changed by the initiation of therapy25,26. The passenger is then converted into a driver and the resistant subclone preferentially expands, manifesting as the recurrence.

The repertoire of somatically mutated cancer genes

The identification of driver mutations and the cancer genes that they alter has been a central aim of cancer research for more than a quarter of a century. It has been a remarkably successful endeavour, with at least 350 (1.6%) of the ~22,000 protein-coding genes in the human genome reported to show recurrent somatic mutations in cancer with strong evidence that these contribute to cancer development27 (http://www.sanger.ac.uk/genetics/CGP/Census/). Most were identified by first establishing their physical location in the genome through low-resolution genome-wide screens, in particular cytogenetics for chromosomal translocations in leukaemias and lymphomas. A few were discovered using biological assays for transforming activity of whole cancer cell DNA and others through targeted mutational screens guided by biologically well-informed guesswork. Mutations in ~10% of these genes are also found in the germ line, where they confer an increased risk of developing cancer, and these were often initially identified by genetic linkage analysis of affected families. The size of the full repertoire of human cancer genes is a matter of speculation. However, studies in mice have suggested that more than 2,000 genes, when appropriately altered, may have the potential to contribute to cancer development28.

The known cancer genes run the gamut of tissue specificities and mutation prevalences. Some, for example TP53 and KRAS, are frequently mutated in diverse types of cancer whereas others are rare and/or restricted to one cancer type (http://www.sanger.ac.uk/genetics/CGP/cosmic/). In some cancer types, for example colorectal and pancreatic cancer, abnormalities in several known cancer genes are common. In contrast, in gastric cancer, relatively few mutations in known cancer genes have been reported.

Approximately 90% of the known somatically mutated cancer genes are dominantly acting, that is, mutation of just one allele is sufficient to contribute to cancer development. The mutation in such cases usually results in activation of the encoded protein. Ten per cent act in a recessive manner, requiring mutation of both alleles, and the mutations usually result in abrogation of protein function (these are sometimes known as tumour suppressor genes).

Patterns of mutation differ between dominant and recessive cancer genes. Recessive cancer genes are characterized by diverse mutation types, ranging from single base substitutions to whole gene deletions, which have the common outcome of abolishing the function of the encoded protein. In each dominantly acting cancer gene, however, the repertoire of cancer-causing somatic mutations is usually more constrained, both with respect to the type of mutation and its location in the gene. Missense amino acid changes (often restricted to certain key amino acids), in-frame insertions and deletions, and gene amplification are all common mutational mechanisms for activating dominantly acting cancer genes. Most, however, are activated through genomic rearrangement. This may join the sequences of two different genes to create a fusion gene or it may position the cancer gene adjacent to regulatory elements from elsewhere in the genome, resulting in abnormal expression patterns. Most of the known rearranged cancer genes are operative in the relatively rare subset of cancers constituted by leukaemias, lymphomas and sarcomas. Recently, however, rearranged cancer fusion genes were discovered in more than half of prostate cancer cases29 and in lung adenocarcinomas30. Their late discovery probably reflects the difficulty of identifying them amidst the jumble of passenger rearrangements present in many cancer genomes and hints that there are many more rearranged cancer genes to be found in common cancers.

Much of what we know about the biological pathways and processes that are subverted in cancer has originated from experiments exploring the functions of cancer genes. Certain gene families, notably the protein kinases, feature particularly prominently among cancer genes. Furthermore, cancer genes cluster on certain signalling pathways. For example, in the classical MAPK/ERK pathway31 upstream mutations are found in cell-membrane-bound receptor tyrosine kinases such as EGFR, ERBB2, FGFR1, FGFR2, FGFR3, PDGFRA and PDGFRB and also in the downstream cytoplasmic components NF1, PTPN11, HRAS, KRAS, NRAS and BRAF. Recent exhaustive mutational analyses in gliomas have indicated that almost all cases have a mutation at one of the genes on these critical signalling pathways32.

For some cancers, classification and treatment protocols are now defined by the presence of abnormal cancer genes. Acute myeloid leukaemia, for example, is subclassified on the basis of the presence of abnormalities involving specific cancer genes33. Each subtype has a characteristic gene expression profile, cellular morphology, clinical syndrome, prognosis and opportunity for targeted therapy. Moreover, because cancer cells are dependent on the abnormal proteins encoded by mutated cancer genes, these proteins have become targets for the development of new cancer therapeutics. Flagships for this new generation of treatments include imatinib, an inhibitor of the proteins encoded by the ABL and KIT genes, which are mutated and activated in chronic myeloid leukaemia34 and gastrointestinal stromal tumours35, respectively, and trastuzumab, an antibody directed against the protein encoded by ERBB2 (also known as HER2), which is commonly amplified and overexpressed in breast cancer36.

Early systematic sequencing of cancer genomes

Provision of the reference human genome sequence at the turn of the millennium offered new strategies and opportunities for surveying cancer genomes. Rather than depending on low-resolution maps, the highest possible resolution map, the DNA sequence itself, became available and has empowered investigation of cancer genomes in several ways. For example, much higher-resolution arrays have been developed, allowing finer mapping of copy number changes in cancer genomes leading to the identification of several new amplified cancer genes.

The availability of the human genome sequence has also raised the possibility that DNA sequencing itself could become the primary tool for exploration of cancer genomes. This has prompted several pilot experiments. So far, most have sequenced large numbers of PCR products to detect the base substitutions and small insertions and deletions (collectively termed ‘point’ mutations) present in the coding exons of protein-coding genes32,37-44. Typically, such studies have covered several hundred megabases of cancer genome with designs ranging from hundreds of genes analysed in a few hundred cancers to most of the ~22,000 protein-coding genes in 10–20 examples of a particular cancer class.

Several insights have been provided by these screens. They have brought success in the identification of point-mutated cancer genes including BRAF (ref. 45), PIK3CA (ref. 46), EGFR (ref. 47), HER2 (ref. 48), JAK2 (ref. 49), UTX (ref. 50) and IDH1 (ref. 41). Some of these were unique discoveries, whereas others were simultaneously discovered in targeted mutational screens. Some were previously known cancer genes, but the discovery of point mutations highlighted new mechanisms and cancer types in which they are operative. Some were surprising and highlight the virtue of systematic and comprehensive screens, for example the discovery of the enzyme isocitrate dehydrogenase (IDH1), which constitutes part of the Krebs cycle of oxidative phosphorylation, as a cancer gene mutated in glioma41. Because many are kinases that are activated by the mutations found in cancer, they have prompted a wave of drug discovery to find inhibitors that may serve as anticancer therapeutics51, some of which are already in clinical trials.

Exposing the landscape of the cancer genome

Important insights into the general parameters and patterns of somatic mutation in cancer have also emerged from these early studies. It appears that most somatic point mutations in cancer genomes are passengers39. Although this might have been predicted for mutations in intergenic and intronic DNA, it applies even in protein-coding exons. There is, however, statistical evidence in favour of many more driver mutations than can be accounted for by known cancer genes. These drivers appear to be distributed across a large number of genes, each of which is mutated infrequently, suggesting that the repertoire of somatically mutated human cancer genes is much larger than the ~350 currently catalogued39,44. Conceivably, these infrequently mutated cancer genes confer less selective growth advantage on a clone of cancer cells than more commonly mutated cancer genes, but other explanations can also be invoked. Some analyses also indicate that there may be as many as 20 driver mutations in individual cancers, considerably more than the 5–7 previously predicted24.

Understanding of the prevalence and types of somatic mutation in cancer genomes has been greatly fostered by these studies. Some cancer genomes carry >100,000 point mutations whereas others have fewer than 1,000. Some of this variation can be accounted for by previous heavy mutagenic exposures or the existence of known DNA repair defects. However, in a subset of breast cancers there are large numbers of C-to-G base substitutions, almost always occurring at cytosines that follow a thymine, for which there is no obvious explanation and for which unknown exposures and/or mutator phenotypes are presumably responsible42,43.
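A sequence-context pattern of the kind described above (C-to-G substitutions at cytosines that follow a thymine) can be tallied with a few lines of code. The following Python sketch is not the method used in the cited studies. The function name, the tuple layout and the toy sequence are hypothetical, and for brevity it assumes every mutation is reported on the strand given in the reference string.

```python
from collections import Counter

def count_substitutions_by_context(reference, mutations):
    """Tally substitutions by (preceding base, reference base, mutant base).

    reference: reference sequence as a string (toy input here).
    mutations: iterable of (zero_based_position, mutant_base) tuples.
    """
    counts = Counter()
    for pos, alt in mutations:
        ref = reference[pos]
        if alt == ref:
            continue  # identical base, so not a substitution
        # "N" marks a position with no 5' neighbour in the reference string.
        prev = reference[pos - 1] if pos > 0 else "N"
        counts[(prev, ref, alt)] += 1
    return counts

# Toy data, invented purely for illustration.
reference = "ATCGTCAACGTCA"
mutations = [(2, "G"), (5, "G"), (8, "G"), (11, "T")]

counts = count_substitutions_by_context(reference, mutations)
tpc_c_to_g = counts[("T", "C", "G")]
print(f"C-to-G substitutions at TpC sites: {tpc_c_to_g}")
```

A fuller analysis would also fold in mutations reported on the opposite strand (G-to-C at GpA sites) and normalize by how often each context occurs in the sequenced region.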

The effects of chemotherapy on the cancer genome have also been revealed by systematic sequencing experiments. For example, gliomas that recur after treatment with the DNA alkylating agent temozolomide have been shown to carry huge numbers of mutations with a signature typical of such agents32,52,53. The fact that the mutations could be detected at all indicates that these recurrences are clonal. Thus, these studies indicate that, although temozolomide confers only a short increase in lifespan for the patient, almost all cells in a glioma respond and a single cell that is resistant to the chemotherapy proliferates to form the recurrence. Additional studies guided by these observations led to the identification of the underlying mutated resistance gene52,53.

Beyond point mutations, some investigations have begun to explore the features of genomic rearrangements in common cancers, about which remarkably little is known. Early studies using conventional Sanger sequencing indicated that there is substantial complexity of rearrangement in these genomes54,55. The recent advent of massively parallel, second-generation sequencing technologies has enabled more comprehensive genome-wide screens revealing that some cancer genomes carry hundreds of somatically acquired rearrangements, whereas others carry very few. Moreover, the distinctive patterns of rearrangement found indicate that currently uncharacterized mutational processes may be at work56.

Sequencing of cancer genomes in the future

The large-scale, systematic sequencing studies conducted so far have been constrained by the relatively low throughput and high cost of sequencing. They have therefore generally been restricted to components of the cancer genome (for example, coding exons), to small numbers of cancer samples or to a subset of the mutational classes present. In principle, however, all the structural classes of somatic mutation can be detected genome-wide by randomly fragmenting the cancer genome and sequencing large numbers of fragments such that each base in the reference human genome is covered several times by a sequence generated from the cancer. With a high enough level of coverage, essentially a full catalogue of somatic mutations from an individual cancer genome can be obtained, including all point mutations, rearrangements and copy number changes. Mutations in the accompanying mitochondrial genomes of the cancer will also be collected. With further adaptation this could be extended to include epigenetic alterations and could be applied to the transcriptomes of cancers to investigate the first phenotypic effects of all these changes. This catalogue will include all the driver mutations and hence all the cancer genes operating in that cancer, whether they are protein-coding genes, non-coding RNA genes or more cryptic functional elements of the genome. Indeed, if known or unknown DNA viruses have contributed to oncogenesis these will also be discovered. The catalogue will also include all the passenger mutations that incorporate the signatures of previous exposures, DNA repair defects and other mutational processes the cancer has experienced over the decades during which it was evolving.
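What covering each base 'several times' implies for sequencing depth can be gauged with a simple model. The sketch below assumes that fragments land uniformly at random, so that the depth at any one base is approximately Poisson distributed with mean equal to the average coverage. That idealization is an assumption introduced here, not a claim from the text, and real sequencing libraries show coverage bias. The function name and the minimum depth of 4 are illustrative choices.

```python
import math

def prob_depth_at_least(mean_coverage, min_depth):
    """P(depth >= min_depth) at one base, with depth ~ Poisson(mean_coverage)."""
    # Sum the Poisson probabilities for depths 0 .. min_depth - 1.
    below = sum(
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below

for coverage in (5, 10, 20, 30):
    p = prob_depth_at_least(coverage, 4)
    print(f"{coverage:>2}-fold mean coverage: P(depth >= 4) = {p:.4f}")
```

Under this model the fraction of bases left too thinly covered falls steeply as mean coverage rises, which is one reason deep coverage is needed before a catalogue can be called essentially complete.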

Until recently, this was an unattainable fantasy. However, the arrival of second-generation sequencing technologies promises a new era for cancer genomics. These platforms currently generate billions of bases of DNA sequence per week, yields that are predicted to increase rapidly over the next couple of years (Fig. 3). Several proof-of-principle studies have recently been published applying these technologies to cancer samples. These have demonstrated that the current generation of massively parallel sequencing platforms can identify the full range of somatically acquired genetic alteration in cancer, including point mutations on a genome-wide basis57, insertions and deletions57, copy number changes56 and genomic rearrangements56, as well as characterizing the cancer cell transcriptome40,41. Furthermore, these approaches have the potential to identify subclonal genetic diversity within the population of cancer cells58, with particular relevance to the detection of subclones carrying drug-resistance mutations59. Indeed, one high-coverage cancer genome sequence has recently been reported57 and several others will emerge during the course of 2009.

Figure 3. Improvements in the rate of DNA sequencing over the past 30 years and into the future.


From slab gels to capillary sequencing and second-generation sequencing technologies, the rate of sequence generation has improved more than a million-fold over this time scale.

Even with the remarkable technological advances in sequencing, however, the parameters of experiments to catalogue all somatically acquired variants in a cancer genome are sobering. To obtain a complete catalogue of somatic mutations from an individual human cancer may require 20-fold sequence coverage of the cancer genome, and possibly more. Somatic mutations then have to be distinguished from inherited DNA variants. Although most inherited variants that are common in human populations (>5% allele frequency) have been discovered and are registered in databases, there are myriad rare inherited single nucleotide polymorphisms and structural variants that are not. In most cancer genomes these rare germline variants far outnumber the somatic mutations present. Therefore, for the foreseeable future at least, a high-coverage sequence of the normal genome from the same individual as the cancer will be an inescapable extra burden to allow identification of the somatic changes. Thus, more than 100,000,000,000 base pairs of DNA sequence will probably be required to identify the catalogue of somatic mutations in a single cancer genome.
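The figure of more than 100,000,000,000 base pairs follows from simple arithmetic. The sketch below assumes a reference genome of roughly 3 × 10^9 base pairs (a round number supplied here, not stated in the text) and assumes that the matched normal genome is sequenced to the same 20-fold depth as the cancer genome.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumed approximate size of the human reference genome
FOLD_COVERAGE = 20              # coverage figure quoted in the text
GENOMES_PER_PATIENT = 2         # cancer genome plus matched normal genome

def bases_required(genome_size_bp, fold_coverage, n_genomes):
    """Total raw sequence needed, ignoring unmappable or low-quality reads."""
    return genome_size_bp * fold_coverage * n_genomes

total = bases_required(GENOME_SIZE_BP, FOLD_COVERAGE, GENOMES_PER_PATIENT)
print(f"{total:,} bp")  # 120,000,000,000 bp
```

Any reads that fail quality filters or cannot be mapped would push the raw requirement higher still.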

Subsequently, it will be necessary to distinguish driver mutations from passengers (see Box 1). The power to distinguish clusters of driver mutations in cancer genes from chance clusters of randomly distributed passenger mutations will depend on how frequently a cancer gene is mutated and the prevalence of passenger mutations. To be confident of identifying a cancer gene that is mutated in ~5% of a particular type of cancer will require hundreds of cases to be sequenced. Each of the >100 cancer types will probably require similar sample sizes.
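To see why hundreds of cases may be needed, one can sketch the chance of observing a gene mutated in enough cases to stand out. The example below is a deliberately simplified model, not the statistical method used in the studies cited: it treats each case as an independent draw with mutation frequency `f`, and it assumes a hypothetical threshold `k` of mutated cases needed to rise above the passenger background. Both the threshold and the sample sizes are illustrative assumptions.

```python
from math import comb


def prob_at_least(k: int, n: int, f: float) -> float:
    """P(X >= k) for X ~ Binomial(n, f).

    n: number of cancer cases sequenced
    f: fraction of cases in which the gene carries a driver mutation
    k: hypothetical number of mutated cases needed for the gene to
       stand out from chance clusters of passenger mutations
    """
    below = sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k))
    return 1.0 - below


# Probability of seeing at least 10 mutated cases for a gene mutated in 5% of cases.
for n in (50, 100, 200, 400):
    print(n, round(prob_at_least(k=10, n=n, f=0.05), 3))
```

Under these assumptions the probability is small at 100 cases and approaches certainty only at several hundred, which matches the qualitative point of the text. A realistic analysis would also model the passenger mutation rate, gene length and multiple testing across genes.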

Coordinating the sequencing of cancer genomes

There is, therefore, much work to be done over the next few years. Ideally, it should be organized to maximize use of resources and harmonize the product. This is the mission of the International Cancer Genome Consortium (ICGC, see http://www.icgc.org/home). Building on the success of previous multinational, collaborative initiatives such as the Human Genome Project and the HapMap consortium, the aim of ICGC is to comprehensively characterize somatically acquired genetic events in at least fifty classes of cancer, including those with the highest global incidence and mortality, requiring high-coverage sequencing of 20,000 cancer genomes or more. The full catalogues of somatic mutation from each of these cancers will be integrated with expression and epigenetic profiles of the same cases and correlated with clinical features.

Projects under the ICGC imprimatur will adhere to predetermined standards and procedures for ethical approval, data release, intellectual property, sample quality, clinical annotation, data quality, data storage and sequencing completion. Most importantly, given the demanding nature of the task, the ICGC will coordinate studies to minimize duplication of effort and enable the most parsimonious deployment of resources.

The proposal to sequence large numbers of cancer genomes has generated controversy reminiscent of the debate before sequencing of the reference human genome almost 20 years ago. The experiments will be expensive and, to some extent, we cannot predict what will be found. However, the human genome is finite. Therefore, with further technological advances in DNA sequencing that are already in sight, this is a deliverable project that will comprehensively elucidate central questions relating to the nature of human cancer. The clinical and translational implications of such a body of work are profound. Beyond the identification of further potentially druggable cancer genes, a comprehensive catalogue of somatic mutations in carefully characterized clinical samples will generate new insights into the genetic patterns that underpin disease phenotype, prognosis, drug response and chemotherapy resistance. As the costs of sequencing whole cancer genomes drop towards US$1,000, routine sequencing in a clinical, diagnostic setting will become feasible. Such data may drive individualized therapeutic decision-making through the ability to predict prognosis, to choose therapeutic regimens known to have efficacy for the particular genetic subtype of cancer, to sensitively monitor response to therapy and to identify rare subclones harbouring drug-resistance mutations before therapy is even initiated. Individualized therapeutics will require individualized diagnostics.

The discussion is therefore not about whether to do the experiment, but when and how. In a manner similar to the Human Genome Project, we have to coordinate the work internationally, maximizing use of resources and minimizing duplication of effort, to generate a resource of high quality so that we only have to do it once, empowering cancer research with a lasting legacy for the future.

Forward look

Approximately 100,000 somatic mutations from cancer genomes have been reported in the quarter of a century since the first somatic mutation was found in HRAS. Over the next few years several hundred million more will be revealed by large-scale, complete sequencing of cancer genomes. These data will provide us with a fine-grained picture of the evolutionary processes that underlie our commonest genetic disease, providing new insights into the origins and new directions for the treatment of cancer.

Acknowledgements

We would like to thank N. Rahman for comments on the manuscript, G. Tang and B. Barrell for contributions to the figures and the Kay Kendall Leukaemia Fund and the Wellcome Trust for support.

References