The evolutionary history of 2,658 cancers

Nature. 2020 Feb;578(7793):122-128.

doi: 10.1038/s41586-019-1907-7. Epub 2020 Feb 6.

Moritz Gerstung #, Ignaty Leshchiner #, Stefan C Dentro #, Santiago Gonzalez #, Daniel Rosebrock, Thomas J Mitchell, Yulia Rubanova, Pavana Anur, Kaixian Yu, Maxime Tarabichi, Amit Deshwar, Jeff Wintersinger, Kortine Kleinheinz, Ignacio Vázquez-García, Kerstin Haase, Lara Jerman, Subhajit Sengupta, Geoff Macintyre, Salem Malikic, Nilgun Donmez, Dimitri G Livitz, Marek Cmero, Jonas Demeulemeester, Steven Schumacher, Yu Fan, Xiaotong Yao, Juhee Lee, Matthias Schlesner, Paul C Boutros, David D Bowtell, Hongtu Zhu, Gad Getz, Marcin Imielinski, Rameen Beroukhim, S Cenk Sahinalp, Yuan Ji, Martin Peifer, Florian Markowetz, Ville Mustonen, Ke Yuan, Wenyi Wang, Quaid D Morris; PCAWG Evolution & Heterogeneity Working Group; Paul T Spellman, David C Wedge, Peter Van Loo; PCAWG Consortium

Moritz Gerstung et al. Nature. 2020 Feb.

Abstract

Cancer develops through a process of somatic evolution [1,2]. Sequencing data from a single biopsy represent a snapshot of this process that can reveal the timing of specific genomic aberrations and the changing influence of mutational processes [3]. Here, by whole-genome sequencing analysis of 2,658 cancers as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) [4], we reconstruct the life history and evolution of mutational processes and driver mutation sequences of 38 types of cancer. Early oncogenesis is characterized by mutations in a constrained set of driver genes, and specific copy number gains, such as trisomy 7 in glioblastoma and isochromosome 17q in medulloblastoma. The mutational spectrum changes significantly throughout tumour evolution in 40% of samples. A nearly fourfold diversification of driver genes and increased genomic instability are features of later stages. Copy number alterations often occur in mitotic crises, and lead to simultaneous gains of chromosomal segments. Timing analyses suggest that driver mutations often precede diagnosis by many years, if not decades. Together, these results determine the evolutionary trajectories of cancer, and highlight opportunities for early cancer detection.

Conflict of interest statement

R.B. owns equity in Ampressa Therapeutics. G.G. receives research funds from IBM and Pharmacyclics and is an inventor on patent applications related to MuTect, ABSOLUTE, MutSig, MSMuTect and POLYSOLVER. I.L. is a consultant for PACT Pharma. B.J.R. is a consultant at and has ownership interest (including stock and patents) in Medley Genomics. All other authors declare no competing interests.

Figures

Fig. 1. Timing clonal copy number gains using allele frequencies of point mutations.

a, Principles of timing mutations and copy number gains based on whole-genome sequencing. The number of sequencing reads reporting point mutations can be used to discriminate variants as early or late clonal (green or purple, respectively) in cases of specific copy number gains, as well as clonal (blue) or subclonal (red) in cases without. b, Annotated point mutations in one sample based on VAF (top), copy number (CN) state and structural variants (middle), and resulting timing estimates (bottom). LOH, loss of heterozygosity. c, Overview of the molecular timing distribution of copy number gains across cancer types. Pie charts depict the distribution of the inferred mutation time for a given copy number gain in a cancer type. Green denotes early clonal gains, with a gradient to purple for late gains. The size of each chart is proportional to the recurrence of this event. Abbreviations for each cancer type are defined in Supplementary Table 1. d, Heat maps representing molecular timing estimates of gains on different chromosome arms (x axis) for individual samples (y axis) for selected tumour types. e, Temporal patterns of two near-diploid cases illustrating synchronous gains (top) and asynchronous gains (bottom). f, Left, distribution of synchronous and asynchronous gain patterns across samples, split by WGD status. Uninformative samples have too few or too small gains for accurate timing. Right, the enrichment of synchronous gains in near-diploid samples is shown by systematic permutation tests. g, Proportion of copy number segments (n = 90,387) with secondary gains. Error bars denote 95% credible intervals. ND, near diploid. h, Distribution of the relative latency of n = 824 secondary gains with available timing information, scaled to the time after the first gain and aggregated per chromosome. Source data
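The timing principle in panel a can be illustrated with a deliberately simplified calculation. The sketch below assumes a single gain from 1+1 to 2+1 copies, a constant per-copy mutation rate, and perfect classification of clonal mutations as present on two copies (acquired before the gain, on the duplicated allele) or on one copy. The function name and arguments are hypothetical, and the published analysis is more involved than this point estimate.

```python
def time_single_gain(n_two_copies: int, n_one_copy: int) -> float:
    """Point estimate of when a 1+1 -> 2+1 gain occurred (0 = earliest, 1 = latest).

    Simplified model (an assumption of this sketch): with per-copy rate mu and
    gain time t, mutations on two copies number mu*t, and mutations on one copy
    number mu*t + 3*mu*(1 - t). Solving gives t = 3*n2 / (2*n2 + n1).
    """
    if n_two_copies < 0 or n_one_copy < 0:
        raise ValueError("mutation counts must be non-negative")
    denominator = 2 * n_two_copies + n_one_copy
    if denominator == 0:
        raise ValueError("no informative mutations on this segment")
    estimate = 3 * n_two_copies / denominator
    # Sampling noise can push the raw estimate above 1; clamp it to the valid range.
    return min(estimate, 1.0)


if __name__ == "__main__":
    # 10 mutations on two copies and 40 on one copy place the gain at the midpoint.
    print(time_single_gain(10, 40))
```

A segment with no two-copy mutations yields an estimate of 0, which corresponds to the very early gains described for trisomy 7 and isochromosome 17q.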

Fig. 2. Timing of point mutations shows that recurrent driver gene mutations occur early.

a, Top, distribution of point mutations over different mutation periods in n = 2,778 samples. Middle, timing distribution of driver mutations in the 50 most recurrent lesions across n = 2,583 whitelisted samples from unique donors. Bottom, distribution of driver mutations across cancer types; colour as defined in the inset. b, Relative timing of the 50 most recurrent driver lesions, calculated as the odds ratio of early versus late clonal driver mutations versus background, or clonal versus subclonal. Error bars denote 95% confidence intervals derived from bootstrap resampling. Odds ratios overlapping 1 in less than 5% of bootstrap samples are considered significant (coloured). The underlying number of samples with a given mutation is shown in a. c, Relative timing of TP53 mutations across cancer types, as in b. The number of samples is defined in the x-axis labels. d, Estimated number of unique lesions (genes) contributing 50% of all driver mutations in different timing epochs across n = 2,583 unique samples, containing n = 5,756 driver mutations with available timing information. Error bars denote the range between 0 and 1 pseudocounts; bars denote the average of the two values. NA, not applicable; NS, not significant. Source data
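The odds ratio with bootstrap confidence interval described for panel b can be sketched as follows. This is an illustration only: the function names, the 0.5 pseudocount (used to avoid division by zero), and the choice to resample individual mutations are assumptions of this sketch, not details taken from the paper.

```python
import random


def odds_ratio(driver_early, driver_late, bg_early, bg_late, pseudo=0.5):
    """Odds of early versus late timing in drivers, relative to background."""
    return ((driver_early + pseudo) * (bg_late + pseudo)) / (
        (driver_late + pseudo) * (bg_early + pseudo)
    )


def bootstrap_ci(driver_early, driver_late, bg_early, bg_late, n_boot=2000, seed=0):
    """95% percentile bootstrap interval for the odds ratio."""
    rng = random.Random(seed)
    drivers = [1] * driver_early + [0] * driver_late  # 1 = early, 0 = late
    background = [1] * bg_early + [0] * bg_late
    estimates = []
    for _ in range(n_boot):
        # Resample each group with replacement and recompute the statistic.
        d = rng.choices(drivers, k=len(drivers))
        b = rng.choices(background, k=len(background))
        d_early, b_early = sum(d), sum(b)
        estimates.append(
            odds_ratio(d_early, len(d) - d_early, b_early, len(b) - b_early)
        )
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]


if __name__ == "__main__":
    counts = (30, 10, 100, 100)  # made-up counts, for illustration only
    print(odds_ratio(*counts), bootstrap_ci(*counts))
```

An interval that excludes 1 corresponds to the caption's significance criterion, under which an odds ratio is coloured when it overlaps 1 in less than 5% of bootstrap samples.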

Fig. 3. Aggregating single-sample ordering reveals typical timing of driver mutations.

a, Schematic representation of the ordering process. b–d, Examples of individual patient trajectories (partial ordering relationships), the constituent data for the ordering model process. e–g, Preferential ordering diagrams for colorectal adenocarcinoma (ColoRect–AdenoCA) (e), pancreatic neuroendocrine cancer (Panc–Endocrine) (f) and glioblastoma (CNS–GBM) (g). Probability distributions show the uncertainty of timing for specific events in the cohort. Events with odds above 10 (either earlier or later) are highlighted. The prevalence of the event type in the cohort is displayed as a bar plot on the right. Source data

Fig. 4. Dynamic mutational processes during early and late clonal tumour evolution.

a, Example of tumours with substantial changes between mutation spectra of early (left) and late (right) clonal time points. The attribution of mutations to the most characteristic signatures are shown. b, Example of clonal-to-subclonal mutation spectrum change. c, Fold changes between relative proportions of early and late clonal mutations attributed to individual mutational signatures. Points are coloured by tissue type. Data are shown for samples (n = 530) with measurable changes in their overall mutation spectra and restricted to signatures active in at least 10 samples. Box plots demarcate the first and third quartiles of the distribution, with the median shown in the centre and whiskers covering data within 1.5× the IQR from the box. d, Fold changes between clonal and subclonal periods in samples (n = 729) with measurable changes in their mutation spectra, analogous to c. Source data

Fig. 5. Approximate chronological timing inference suggests a timescale of cancer evolution of several years.

a, Mapping of molecular timing estimates to chronological time under different scenarios of increases in the CpG>TpG mutation rate. A greater increase before diagnosis indicates an inflation of the mutation timescale. b, Median latency between WGDs and the last detectable subclone before diagnosis under different scenarios of CpG>TpG mutation rate increases for n = 569 non-hypermutant cancers with at least 100 informative SNVs, low tumour in normal contamination and at least five samples per tumour histology. c, Median latency between the MRCA and the last detectable subclone before diagnosis for different CpG>TpG mutation rate changes in n = 1,921 non-hypermutant samples with low tumour in normal contamination and at least 5 cases per cancer type. Source data
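The mapping in panel a can be illustrated with a piecewise-linear clock. The sketch below assumes that mutations accumulate at a baseline rate until a hypothetical onset age and at a constant multiple of that rate afterwards. The function name, the `onset_age` parameter and the example numbers are assumptions made for illustration, not values from the paper.

```python
def molecular_to_chronological(tau, age_at_diagnosis, acceleration, onset_age):
    """Convert a molecular time (fraction of mutations, 0..1) into an age in years.

    Assumed clock: rate 1 before `onset_age`, rate `acceleration` afterwards.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie between 0 and 1")
    if not 0.0 <= onset_age <= age_at_diagnosis:
        raise ValueError("onset_age must lie between 0 and age_at_diagnosis")
    if acceleration <= 0:
        raise ValueError("acceleration must be positive")
    # Total mutation time accumulated by diagnosis, in baseline-rate years.
    total = onset_age + acceleration * (age_at_diagnosis - onset_age)
    target = tau * total
    if target <= onset_age:
        return target  # the event falls in the baseline-rate period
    return onset_age + (target - onset_age) / acceleration


if __name__ == "__main__":
    for factor in (1, 2.5, 5, 10):
        age = molecular_to_chronological(0.5, 60, factor, 45)
        print(f"{factor}x: event at {age:.1f} years, latency {60 - age:.1f} years")
```

Under these assumptions, an event at molecular time 0.5 in a patient diagnosed at 60 maps to age 30 with no acceleration but to age 48 with a fivefold increase from age 45, so a stronger late acceleration shortens the inferred latency before diagnosis.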

Fig. 6. Typical timelines of tumour development.

a–d, Timelines representing the length of time, in years, between the fertilized egg and the median age of diagnosis for colorectal adenocarcinoma (a), squamous cell lung cancer (b), ovarian adenocarcinoma (c) and pancreatic adenocarcinoma (d). Real-time estimates for major events, such as WGD and the emergence of the MRCA, are used to define early, variable, late and subclonal stages of tumour evolution approximately in chronological time. The range of chronological time estimates according to varying clock mutation acceleration rates is shown as well, with tick marks corresponding to 1×, 2.5×, 5×, 7.5×, 10× and 20×. Driver mutations and copy number alterations (CNA) are shown in each stage according to their preferential timing, as defined by relative ordering. Mutational signatures (Sigs) that, on average, change over the course of tumour evolution, or are substantially active but not changing, are shown in the epoch in which their activity is greatest. DBS, double base substitution; SBS, single base substitution. Where applicable, lesions with a known timing from the literature are annotated; dagger symbols denote events that were found to have a different timing; asterisk symbols denote events that agree with our timing. Source data

Extended Data Fig. 1. Summary of all results obtained for colorectal adenocarcinoma (n = 60) as an example.

a, Clustered heat maps of mutational timing estimates for gained segments, per patient. Colours as indicated in main text: green represents early clonal events, purple represents late clonal. b, Relative ordering of copy number events and driver mutations across all samples. c, Distribution of mutations across early clonal, late clonal and subclonal stages, for the most common driver genes. A maximum of 10 driver genes are shown. d, Clustered mutational signature fold changes between early clonal and late clonal stages, per patient. Green and purple indicate, respectively, a signature decrease and increase in late clonal from early clonal mutations. Inactive signatures are coloured white. e, As in d but for clonal versus subclonal stages. Blue indicates a signature decrease and red an increase in subclonal from clonal mutations. f, Typical timeline of tumour development. Similar result summaries for all other cancer types can be found in the Supplementary Information (pages 46–77).

Extended Data Fig. 2. Comparison of methods used for timing of individual copy number gains.

a, b, Pairwise comparison of the three approaches for timing individual copy number gains. c, Comparison using simulated data, showing high concordance.

Extended Data Fig. 3. Early copy number gains in brain cancers.

a, Three illustrative examples of glioblastoma with trisomy 7. The red arrow depicts the expected VAF cluster of point mutations preceding trisomy 7, which usually contains fewer than three SNVs. b, Distributions of the number of SNVs preceding trisomy 7 and total number of mutations on chromosome (chr) 7 in n = 34 GBM samples with trisomy 7. c, Medulloblastoma example with isochromosome 17q. d, Distributions of SNVs on 17q in n = 95 samples with isochromosome 17q; 74 out of 95 samples have fewer than 1 SNV preceding the isochromosome. Source data

Extended Data Fig. 4. Validation of relative ordering model reconstruction based on simulated cohorts of whole-genome samples.

a, Relative ordering model (PhylogicNDT LeagueModel) results for a simulated cohort of samples (n = 100) from a single generalized relative order of events (with varied prevalence) showing high concordance with the true trajectory. Probability distributions show the uncertainty of timing for specific events in the cohort. b, Relative ordering model results on a simulated cohort of samples (n = 95) from a complex mixture of trajectories with different order of events showing high concordance with the expected average trajectory. c, Estimation of accuracy of the relative ordering model reconstruction by simulation of a set of 100 cohorts (n(samples) = 100) with random trajectory mixtures and quantifying the distance in log odds early/late from perfect ordering. For the vast majority of events (even with low number of occurrences in the cohort), the log odds error does not exceed 1, confirming that very few events would switch between timing categories. The inset box corresponds to the first and third quartiles of the distribution, the horizontal line indicates the median and whiskers include data within 1.5× the IQR from the box. d, Simulated data show concordant timing in cohorts with WGD (n = 245). Exclusion of samples with WGD (right, n = 242) introduces only a mild drop in accuracy, indicating that WGD is beneficial but not necessary for the reconstruction. Red dot = true rank. e, Estimated log odds in observed data including WGD (left, n = 245) and without (right, n = 242), across different mutation types. The inset box corresponds to the first and third quartiles of the distribution, the horizontal line indicates the median and whiskers include data within 1.5× the IQR from the box.

Extended Data Fig. 5. Correlation between the league model and Bradley–Terry model ordering.

Direct comparison for each tumour type of the league and Bradley–Terry models for determining the order of recurrent somatic mutations and copy number events. Axes indicate the ordered events observed in the respective tumour types. Correlation is quantified by Spearman’s rank correlation coefficient. A total of n = 756 ordered events are shown. Source data
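Spearman's rank correlation, used here to compare the two orderings, can be computed as the Pearson correlation of ranks. The sketch below is a minimal standard-library version with average ranks for ties; the function names and inputs are illustrative and are not taken from the paper's code.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical event positions under two ordering models.
    print(spearman([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))
```

A coefficient near 1 means the two models place events in nearly the same order, regardless of how their underlying scores are scaled.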

Extended Data Fig. 6. Examples of mutation spectrum changes across tumour evolution.

a, Three examples of tumours with substantial changes between mutation spectra of early (top) and late (bottom) clonal time points. b, Three examples of tumours with substantial changes between mutation spectra of clonal (top) and subclonal (bottom) time points. Source data

Extended Data Fig. 7. Overview of early-to-late clonal and clonal-to-subclonal signature changes across tumour types.

a, b, Pie charts representing signature changes per cancer type for early-to-late clonal signature changes (a) and clonal-to-subclonal signature changes (b). Signatures that decrease between early and late are coloured green; signatures that increase are purple. The size of each pie chart represents the frequency of each signature. Signatures are split into three categories: (1) clock-like, comprising the putative clock signatures 1 and 5; (2) frequent, which are signatures present in ten or more cancer types; and (3) cancer-type specific, which are in fewer than ten cancer types and are often limited to specific cohorts.

Extended Data Fig. 8. Age-dependent mutation burden and relapse samples indicate near-normal CpG>TpG mutation rate in cancer, with moderate acceleration during carcinogenesis.

a, Across all cancer samples, a predominantly linear accumulation of CpG>TpG mutations (scaled to copy number) is observed over time, as measured by the age at diagnosis. b, Cancer-specific analysis of the CpG>TpG mutation burden as a function of age at diagnosis for n = 1,978 samples of 34 informative cancer types. The dotted line denotes the median mutations per year (that is, not offset), and shading denotes the 95% credible interval of a hierarchical Bayesian linear regression model across all data points. Slope and intercepts are drawn for each cancer type from a gamma distribution, respectively; inference was done by Hamiltonian Monte Carlo sampling. c, Maximum a posteriori estimates of rate and offset for 34 cancer types with 95% credible intervals as defined in b. d, Mutation rate inferred from cancer as in b and from selected normal tissue sequencing studies of n = 140 normal haematopoietic stem cells, n = 1 normal skin sample, n = 182 samples from normal endometrium, and n = 445 normal colonic crypts; error bars denote the 95% confidence interval. e, Median fraction of mutations attributed to linear age-dependent accumulation, based on estimates from b and the age at diagnosis for each sample. Error bars denote the 95% credible interval. f, g, CpG>TpG mutations per gigabase for ovarian cancer (f) and breast cancer (g) samples with matched primary and relapse samples. h, Increase in CpG>TpG mutation rate inferred from paired primary and relapse samples for six cancer types. Bars denote the range of the rate increase for different scenarios of copy number evolution, assuming ploidy changes have occurred prior (upper value) or posterior (lower value) to the branching between primary and relapse sample. Source data

Extended Data Fig. 9. Real-time estimates indicate long latencies for some samples caused by the absence of early mutations.

a, Time of WGD for n = 571 individual patients, split by tumour type with an estimated mutation rate increase of 5×, except for ovary–adenocarcinoma (7.5×) and CNS (2.5×). Error bars represent 80% confidence intervals, reflecting uncertainty stemming from the number of mutations per segment and onset of the rate increase. Box plots demarcate the quartiles and median of the distribution with whiskers indicating 5% and 95% quantiles. b, Scatter plots showing the time of diagnosis (x axis) and inferred time of WGD (y axis) with error bars as in a. c, Scatter plot of early (co-amplified) CpG>TpG mutations (y axis) as a function of the mutational time estimate of WGD (x axis). The black line denotes a nonlinear loess fit with 95% confidence interval. Colours define the cancer type as in a. d, Total CpG>TpG mutations (y axis) as a function of the mutation time estimate of WGD (x axis). Colours and fit as in c. Early molecular timing is thus caused by a depletion of early CpG>TpG mutations, rather than an inflation of late CpG>TpG mutations. e, Estimated median WGD latency of n = 571 patients as in a for fixed (x axis) versus patient specific rate increases, depending on the observed CpG>TpG mutation burden, allowing for a higher (up to 10×) mutation rate increase in samples with more mutations (y axis). Error bars denote the IQR. f, Timing of subclonal diversification using CpG>TpG mutations in n = 1,953 individual patients. Box plots and error bars for data points as in a. g, Comparison of the median duration of subclonal diversification per cancer type assuming branching and linear phylogenies. Source data

References

1. Cairns J. Mutation selection and the natural history of cancer. Nature. 1975;255:197–200. doi: 10.1038/255197a0.
2. Martincorena I, Campbell PJ. Somatic mutation in cancer and normal cells. Science. 2015;349:1483–1489. doi: 10.1126/science.aab4082.
3. Nik-Zainal S, et al. The life history of 21 breast cancers. Cell. 2012;149:994–1007. doi: 10.1016/j.cell.2012.04.023.
4. The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature. 2020. doi: 10.1038/s41586-020-1969-6.
5. Moore L, et al. The mutational landscape of normal human endometrial epithelium. Preprint at bioRxiv. 2018. doi: 10.1101/505685.
