The Cancer Genome Atlas Pan-Cancer Analysis Project

Author manuscript; available in PMC: 2014 Feb 11.

Published in final edited form as: Nat Genet. 2013 Oct;45(10):1113–1120. doi: 10.1038/ng.2764

Abstract

Cancer can take hundreds of different forms depending on the location, cell of origin and spectrum of genomic alterations that promote oncogenesis and affect therapeutic response. Although many genomic events with direct phenotypic impact have been identified, much of the complex molecular landscape remains incompletely charted for most cancer lineages. For that reason, The Cancer Genome Atlas (TCGA) Research Network has profiled and analyzed large numbers of human tumours to discover molecular aberrations at the DNA, RNA, protein, and epigenetic levels. The resulting rich data provide a major opportunity to develop an integrated picture of commonalities, differences, and emergent themes across tumour lineages. The Pan-Cancer initiative compares the first twelve tumour types profiled by TCGA. Analysis of the molecular aberrations and their functional roles across tumour types will teach us how to extend therapies effective in one cancer type to others with a similar genomic profile.

Molecular Profiling of Single Tumour Types

That cancer is fundamentally a genomic disease is now well established. Early on, large numbers of oncogenes were identified using functional assays on genetic material from tumours in positive selection systems1-3, and a subset of tumour suppressor genes were identified by analyzing loss of heterozygosity4. More recently, systematic cancer genomics projects, including the Cancer Genome Atlas Project (TCGA; Box 1), have applied emerging technologies to the analysis of specific tumour types. That disease-specific focus has identified novel oncogenic drivers, those genes contributing to functional change5-7, established molecular subtypes8-13 and identified new biomarkers based on genomic, transcriptomic and proteomic alterations. Some of those biomarkers have clinical implications14,15. For example, we now view ductal breast cancer as a collection of distinct diseases whose major subtypes (e.g. luminal A, luminal B, HER2, basal-like) are managed differently in the clinic; the outcomes for metastatic melanoma have changed as a result of therapeutic targeting of BRAF V600 mutations16; and the fraction of lung cancers treated with targeted agents is increasing with the discovery of likely driver aberrations in most lung tumours17,18. Large-scale processes that shape cancer genomes have similarly been identified. Chromothripsis19 and chromoplexy20, which involve breakage and rearrangement of chromosomes at multiple loci, and kataegis21, which describes hypermutational processes associated with genomic rearrangements, are providing insight into tumour evolution (see Garraway and Lander (2013)22 for a review).

Analysis Across Tumour Types

Increases in the number of tumour sample data sets enhance our ability to detect and analyze molecular defects in cancers. For example, driver genes can be pinpointed more precisely by narrowing amplifications and deletions to smaller regions of the chromosome using recurrent events across tumour types. Large cohorts have enabled DNA sequencing to uncover a list of recurrent genomic aberrations (mutations, amplifications, deletions, translocations, fusions and other structural variants), both known and novel, as common themes across tumour types23. However, “long tails” in the distributions of aberrations among samples have also been uncovered24. Indeed, a majority of the TCGA samples have distinct alterations not shared with others in their cohort. Despite the apparent uniqueness of each individual tumour in this regard, the set of molecular aberrations often integrates into known biological pathways that are shared by sets of tumour samples. In other cases, rare somatic mutations can be implicated as drivers by aggregating events across tumour types to improve detection of patterns, for example hotspot mutations in protein domains, leading to identification of potential new drug targets.
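The idea of aggregating rare events across tumour types to expose hotspots can be illustrated with a minimal Python sketch. The mutation records, gene names and thresholds below are hypothetical and chosen only for illustration; they are not TCGA data, and this is not the method used by any of the Pan-Cancer analysis groups.

```python
from collections import Counter, defaultdict

# Hypothetical toy mutation calls: (tumour_type, gene, protein_position).
# These records are invented for illustration and are not TCGA data.
MUTATIONS = [
    ("BRCA", "GENE_A", 100),
    ("LUAD", "GENE_A", 100),
    ("COAD", "GENE_A", 100),
    ("UCEC", "GENE_A", 250),
    ("BRCA", "GENE_B", 12),
    ("BRCA", "GENE_B", 12),
    ("OV", "GENE_B", 40),
]


def find_hotspots(mutations, min_count=3, min_tumour_types=2):
    """Return {(gene, position): count} for positions mutated at least
    `min_count` times across at least `min_tumour_types` tumour types."""
    counts = Counter()
    types_seen = defaultdict(set)
    for tumour_type, gene, position in mutations:
        key = (gene, position)
        counts[key] += 1
        types_seen[key].add(tumour_type)
    return {
        key: n
        for key, n in counts.items()
        if n >= min_count and len(types_seen[key]) >= min_tumour_types
    }


hotspots = find_hotspots(MUTATIONS)
print(hotspots)
```

In this toy input, position 100 of the hypothetical GENE_A is mutated only once in each of three tumour types, so it would look unremarkable in any single cohort; it passes the thresholds only once the cohorts are pooled.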

Determining whether the rare aberrations are drivers (oncogenic contributors) or just passengers (clonally propagated with neutral effect), and whether they are clinically actionable, will require further functional evaluation as well as analysis of additional tumours to increase power. The identification of more driver aberrations and acquired vulnerabilities for each individual tumour will undoubtedly boost personalized care. Developing treatments that target the ~140 drivers23 validated to date, however daunting, appears possible; devising one-off therapies for the thousands of aberrations in the “long tail” will be much more challenging.

Although important general principles have emerged from decades of study25,26, until recently most research on the molecular, pathological and clinical nature of cancers has been “silo-ed” by tumour type27. One has only to glance at the directory of oncology departments in any major cancer center to realize that medical and surgical cancer care are, for the most part, also divided by disease as defined by organ-of-origin. That framework has made sense for generations, but molecular analysis is now calling this view into question; cancers of disparate organs reveal many shared features, and, conversely, cancers from the same organ are often quite distinct.

Important similarities among tumour subtypes from different organs have already been identified. For example, TP53 mutations drive high-grade serous ovarian, serous endometrial and basal-like breast carcinomas, all of which share a global transcriptional signature of activation of similar oncogenic pathways10,28. Similarly, ERBB2/HER2 is mutated and/or amplified in subsets of glioblastoma, gastric, serous endometrial, bladder and lung cancers. The result, at least in some cases, is responsiveness to HER2-targeted therapy analogous to that previously observed for HER2-amplified breast cancer. Other commonalities across tumour types include inherited and somatic inactivation of the BRCA1/2 pathway in both serous ovarian and basal-like breast cancer, microsatellite instability in colorectal and endometrial tumours, and the recently identified POLE-mediated ultramutator phenotype characterized by extremely high mutation rates, common to both colon and endometrial cancers12,28,29. Conversely, there are important cases in which the same genetic aberrations have very different effects depending on the organ within which they arise. A prime example is NOTCH, which is inactivated in some squamous cell cancers of the lung, head and neck30, skin31 and cervix32 but activated by mutation in liquid tumours33.

Such examples illustrate the importance of developing a comprehensive perspective across tumours, independent of histopathologic diagnosis; shared molecular patterns will enable etiologic and therapeutic discoveries in one disease that can be applied to another. Importantly, integrative interpretation of the data will help identify how the consequences of mutations vary across tissues, with important therapeutic implications. Relatively rare cancers, such as the childhood malignancies, particularly stand to benefit from such an approach.

We know much more about the molecular details of major cancers than we did just a few years ago, but once a cancer is metastatic it remains incurable, with few exceptions. Only time will tell whether the integration of molecular characteristics with histology, organ site and metastatic location will contribute to an improvement in patient outcomes. But the balance is shifting in that direction. Hence, the goal of the Pan-Cancer Project is to identify and analyze aberrations in the tumour genome and phenotype that define cancer lineages and those that transcend them. This report outlines the scope of the project and introduces the first coordinated set of manuscripts to be published from the enterprise.

The Pan-Cancer Project

To gain analytical breadth – defining commonalities, differences and emergent themes across cancer types and organs of origin – TCGA launched the Pan-Cancer analysis project at a meeting held on October 26-27, 2012 in Santa Cruz, California. Pan-Cancer is a coordinated initiative whose goals are to assemble coherent, consistent TCGA data sets across tumour types, as well as across platforms, and then to analyze and interpret those data (Box 2). Within two months of the launch a data “freeze” was declared, based on the first twelve TCGA tumour types, each profiled using six different genomic, epigenomic, transcriptional and proteomic platforms (Figure 1). Since that time, the aggregated data sets have been quality-controlled, analyzed statistically and interpreted by a consortium of researchers, principally members of the TCGA Research Network.

Figure 1. Integrated data set for the comparison and contrast of multiple tumour types.

The Pan-Cancer project assembled data from thousands of patients with primary tumours occurring in different sites of the body covering twelve tumour types (upper left panel) including glioblastoma multiforme (GBM), acute myeloid leukemia (LAML), head and neck squamous carcinoma (HNSC), lung adenocarcinoma (LUAD), lung squamous carcinoma (LUSC), breast carcinoma (BRCA), kidney renal clear cell carcinoma (KIRC), ovarian carcinoma (OV), bladder carcinoma (BLCA), colon adenocarcinoma (COAD), uterine corpus endometrial carcinoma (UCEC), and rectal adenocarcinoma (READ). Six platforms of omics characterizations were performed, creating a “data stack” (upper right panel) in which data elements across the platforms are linked by the fact that tissue material from the same samples was assayed, thus maximizing the potential of integrative analysis. Use of the data enables the identification of general trends including common pathways (lower panel), revealing master regulatory hubs activated (red) or deactivated (blue) across different tissue types.

The Pan-Cancer project lays the framework for an analytic process that, in the future, will include integration of new tumour types and data from TCGA and other such enterprises. There are currently major consortial efforts in pediatric cancers (TARGET; Therapeutically Applicable Research to Generate Effective Treatments) and adult cancers (ICGC; the International Cancer Genome Consortium), as well as smaller projects by research teams around the world. A critical component will be the functional validation of aberrations in individual genes in team science efforts such as the CTD² (Cancer Target Discovery and Development) and elucidation of pathway and network relationships in programs like the ICBP (Integrative Cancer Biology Program).

A number of major questions in cancer biology that go beyond the single-tumour perspective are being addressed in the collection of Pan-Cancer manuscripts. For example:

Limitations of Pan-Cancer Analysis

Several data integration challenges place unavoidable limitations on the Pan-Cancer analysis at the current time. A key challenge is the integration of data that have been generated on different platforms, or updates of the same platform, as the technologies improve. In the Pan-Cancer studies for example, there have been transitions to much higher density DNA methylation arrays, use of different exome capture technologies, addition of RNA-Seq to microarray-based RNA characterization and increases in the quality and number of antibodies available for reverse-phase proteomic arrays (RPPA). A series of batch effects analyses have been carried out to assess systematic platform-specific biases. However, more work is needed to establish best practices for minimizing unwanted batch effects while preserving biological signals.

The kind and quality of clinical data available for the cancer types vary widely. The differences limit the ability to establish one-size-fits-all norms for demographic information, histopathologic characterization, behavioral context, and clinical outcomes. For example, our survival data are relatively robust for serous ovarian cancer because of its poor prognosis, but still immature for breast and endometrial cancers because (thankfully) most of the patients do better for longer. Certain data elements are routinely collected only when they are anticipated to be relevant (for example, the smoking history of lung, bladder and head-and-neck cancer patients). Clear viral etiologies have been identified in several solid tumour types, including head and neck cancer, cervical cancer, Kaposi's sarcoma and hepatocellular carcinoma. However, a Pan-Cancer analysis of the infectious etiologies of other cancers could not be conducted at present because infection status was recorded for only some tumours and tumour types (as an optional data element). Finally, tumour stage and grade are not easily comparable across different tumour types because, for good reason, each has its own system. This set of challenges to Pan-Cancer analysis highlights the fact that current clinical practice is largely conducted according to tissue or organ.

Statistically speaking, care must be taken to ensure that the increased sample size achieved by cross-cancer comparison does not lead to increased false-negative rates for discovery (e.g. by ‘diluting out’ an important mutation specific to one disease) or false-positive rates (e.g. by compounding the false positives known to result from current single-tumour investigations34).
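The ‘diluting out’ effect can be made concrete with a small worked example in Python. The sample counts are taken from the mutation column of Table 1 (99 BLCA samples, 3,247 samples in total); the number of mutated samples is a hypothetical value chosen purely for illustration.

```python
def mutation_frequency(n_mutated, n_samples):
    """Fraction of samples carrying the mutation."""
    return n_mutated / n_samples


BLCA_SAMPLES = 99       # Table 1, mutation column, BLCA row
POOLED_SAMPLES = 3247   # Table 1, mutation column, Total row
N_MUTATED = 10          # hypothetical: mutation seen in 10 BLCA samples only

within_type = mutation_frequency(N_MUTATED, BLCA_SAMPLES)
pooled = mutation_frequency(N_MUTATED, POOLED_SAMPLES)
dilution_factor = within_type / pooled

print(f"within BLCA: {within_type:.1%}")
print(f"pooled:      {pooled:.2%}")
print(f"dilution:    {dilution_factor:.1f}x")
```

Under this assumption, a mutation found in about one in ten tumours of a single type falls to well under 1% of the pooled cohort, which is why a naive pooled frequency threshold could miss it.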

Rare events must not be obscured by disease-associated events. Tumour lineage plays an important role in the observed patterns of co-aberrations and gene expression profiles that indicate different consequences of seemingly similar events, for example involving the same gene(s) or amplicon(s). Likewise, new methods for accurately probing cross-tumour trends will need to account explicitly for the differences across tissues in mutation rates, copy number changes at the focal and arm-level scales, and the prevalence of other co-occurring events in the genetic and epigenetic background.

Despite those challenges, this collection of Pan-Cancer publications represents a landmark in the continuing effort to understand the common and contrasting biology of cancers from a molecular perspective. Still, major questions amenable to further Pan-Cancer investigations remain (Box 3), and the techniques used to compare different tumours will undoubtedly improve with use, time and further collaborative efforts.

Future Directions

The Pan-Cancer project represents one of the first of what will surely be many efforts to coordinate analysis across the molecular landscape of cancer, especially as additional tumour types are investigated in large numbers. Further increasing the number of samples per tumour type and the variety of these tumour types will improve our ability to detect rare driver events in heterogeneous tumour samples. But the true power will come from a detailed analysis across types -- with links to high quality clinical outcomes and eventual experimental validation and clinical trials to test the hypotheses that emerge. Technologies such as laser capture microdissection and cell sorting will improve our ability to distinguish whether omic signals arise from malignant or stromal cells. Histone profiling, protein analysis based on mass spectrometry and de-convolution of tumour heterogeneity through single-cell sequencing are examples expected to add important new dimensions of information. Continued efforts to identify the progenitor cells of tumours will enable distinguishing parochial from universal properties. Clone-level and single-cell cross-tumour comparisons may reveal even further connections among tumour types. Longitudinal genomic studies on primary resected tumours paired with their local recurrences and/or metastases will be undertaken by large consortial efforts, which have heretofore been restricted to primary disease and have lacked information about response to treatment. The characteristics of primary tumours may change markedly when tumours metastasize to distant sites, particularly bone and brain. Pan-Cancer analyses of metastases will therefore be highly informative for mapping out the relationships of metastatic tumours to primaries and to normal tissues, establishing potential rules for invasion and homing.

The power of pan-cancer analysis will increase as technologies for monitoring individual tumour cells at high resolution come into play. Now that the price of genome sequencing has fallen, the next pan-cancer enterprise will be able to analyze large numbers of whole-genome sequences across tumour types. Whole-genome analysis will complement the current studies by shedding light on mutational processes in the non-coding parts of the genome, which are largely unexplored to date. That expanded analysis will bring focus to disruptions in promoter and enhancer sites and aberrations in non-coding RNAs, as well as genomic integration processes at work in tumour evolution that result from mobile endogenous and exogenous DNA elements such as retrotransposons and viruses. Whole-genome sequencing will create a backdrop against which genome-wide association studies can relate inherited predispositions to particular forms of cancer. Systems-oriented approaches, based on relevant pathways and networks, will add to the therapeutic opportunities that arise from the wealth of data. Experimental follow-up will be critical to assess the functional consequences and therapeutic liabilities of these new findings.

From Many Tumours to the Individual Patient

The hope is that cross-tumour investigations such as the Pan-Cancer project will ultimately inform clinical decision-making. We hope they will enable discovery of novel therapeutic agents that can be tested clinically -- perhaps in novel adaptive, biomarker-based clinical trials that cross tumour boundaries. Toward those ends, Pan-Cancer TCGA data sets have been made available publicly in one location. Although coordination remains a challenge, the data sets comprise an unequalled resource for integrative analysis of cancer in its many forms.

A key challenge is the development of clinical trial strategies for connecting subsets of tumours from different tissues in terms of molecular signatures. Recent analyses of pharmacological profiling experiments across a diverse panel of cancer cell lines have suggested that common genetic alterations predict response to therapy across multiple cell lineages40-43. Biomarker-based design of clinical trials can increase statistical power, greatly decreasing the size, expense, and duration of clinical trials.
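The claim that biomarker-based designs can shrink trials can be illustrated with a standard textbook sample-size approximation for comparing two response rates (two-sided test, normal approximation, equal allocation). This is a minimal sketch: the response rates and biomarker prevalence are hypothetical, and the formula is a generic approximation rather than a design used in any TCGA-related trial.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for comparing two response rates
    (two-sided test, normal approximation, equal allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_treated - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# All rates below are hypothetical, chosen only to illustrate the arithmetic.
P_CONTROL = 0.10            # response rate on standard therapy
P_TREATED_POSITIVE = 0.50   # response rate in biomarker-positive patients
PREVALENCE = 0.20           # fraction of patients who are biomarker-positive

# In an unselected trial, only the biomarker-positive fraction benefits,
# so the average treated response rate is diluted toward the control rate.
p_treated_unselected = (
    PREVALENCE * P_TREATED_POSITIVE + (1 - PREVALENCE) * P_CONTROL
)

n_unselected = samples_per_arm(P_CONTROL, p_treated_unselected)
n_selected = samples_per_arm(P_CONTROL, P_TREATED_POSITIVE)
print(n_unselected, n_selected)
```

With these assumed rates the biomarker-selected design needs far fewer patients per arm than the unselected one. The sketch ignores the cost of screening patients for the biomarker, which a real design would have to weigh.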

The number and size of omic datasets on cancer available to the research community for mining and exploring continue to expand rapidly, and computational tools to derive insights into the fundamental causes of cancer are becoming more powerful. It is important to note that the full potential of the enterprise will be realized only over time and with broader efforts. Still, the collection of TCGA Pan-Cancer publications represents a significant contribution to a new period of discovery in cancer research.

Box 1: TCGA: Mission and Strategy.

Important information about the biological relevance of the molecular changes in cancer can be obtained through combined analysis of multiple different types of data at the DNA, RNA, and protein levels.

For that reason, TCGA's principal aims are to generate, quality control, merge, analyze, and interpret molecular profiles at the DNA, RNA, protein and epigenetic levels for hundreds of clinical tumours from various tumour types and their subtypes. Cases that meet quality assurance specifications are characterized using technologies that assess the sequence of the exome, copy number variation (measured by single-nucleotide polymorphism arrays), DNA methylation, mRNA expression and sequence, miRNA expression, and transcript splice variation. Additional platforms applied to a subset of the tumours, including whole genome sequencing and reverse phase protein arrays, provide additional layers of data to complement the core genomic datasets and clinical/pathological data. By the end of 2015, the TCGA Network plans to have achieved the ambitious goal of analyzing the genomic, epigenomic, and gene expression profiles of more than 10,000 specimens from 25 different tumour types.

TCGA has other, complementary purposes as well: to promote the development and application of new technologies, to detect cancer-specific molecular alterations, to make the data and results freely available to the scientific community, to develop tools and standard operating procedures that can serve other large-scale profiling projects, and to build cadres of individuals (including experimentalists, computational biologists, statistical analysts, computer scientists, and administrative staff) with the expertise to carry out such large-scale team science projects. As of July 24, 2013, TCGA has mapped molecular patterns across 7,992 total cases representing 27 tumour types. The data, along with tools for exploring them, are publicly available at cancergenome.nih.gov. Eight ‘marker papers’ (i.e., comprehensive initial publications on each of the tumour types) have been published to date8-13,15,28.

Box 2: Coordination of Data and Results.

The first goal of the Pan-Cancer Analysis Working Group was to assemble data from the separate disease projects to build a well-coordinated joint data set spanning multiple tumour types. A data “freeze” (Dec 21, 2012) based on six different genomic and epigenomic characterization platforms was made available as the “pancan12” data set to all analysis groups. Twelve tumour types (GBM, OV, BRCA, LUSC, LUAD, COAD, READ, KIRC, UCEC, BLCA, HNSC and LAML) were selected based on data maturity, adequate sample size, and publication or submission for publication of the primary analyses. The pancan12 data set includes a total of 5,074 tumour samples, of which 93% (i.e., 4,705) had been assessed on at least one platform from each of the genomic, epigenomic, and gene expression data types (see Table 1 for sample counts by measurement platform). The essential purpose of such a joint data set is twofold: to increase the statistical power to detect functional genomic determinants of disease and to reveal both tissue-specific aspects of cancer and intrinsic molecular commonalities across tumour types.
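The coverage criterion described in this box (at least one platform from each of the genomic, epigenomic and gene expression data types) can be sketched as a simple set test. The assignment of platforms to the three data types and the sample identifiers are assumptions made for illustration; the box does not spell out the exact grouping.

```python
# Assumed grouping of platforms into the three data types named in Box 2.
CATEGORIES = {
    "genomic": {"copy_number", "mutation"},
    "epigenomic": {"dna_methylation"},
    "expression": {"mrna_expression", "mirna"},
}

# Hypothetical sample-to-platform map (invented identifiers, not TCGA barcodes).
SAMPLES = {
    "S1": {"copy_number", "dna_methylation", "mrna_expression"},
    "S2": {"mutation", "dna_methylation", "mirna", "rppa"},
    "S3": {"copy_number", "mrna_expression"},   # no epigenomic data
    "S4": {"dna_methylation", "rppa"},          # no genomic or expression data
}


def has_full_coverage(platforms):
    """True if the sample has at least one platform in every data type."""
    return all(platforms & members for members in CATEGORIES.values())


def coverage_fraction(samples):
    """Fraction of samples with at least one platform per data type."""
    covered = sum(has_full_coverage(p) for p in samples.values())
    return covered / len(samples)


toy_fraction = coverage_fraction(SAMPLES)

# The percentage quoted in the text follows from the two reported counts.
reported_percent = round(100 * 4705 / 5074)
print(toy_fraction, reported_percent)
```

The last calculation simply confirms that 4,705 of 5,074 samples rounds to the 93% quoted in the text.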

The Pan-Cancer analysis project started as an informal collaboration among members of the TCGA Network but then quickly expanded to include many other interested researchers. Ensuring standardization and consistency of the data and annotations across multiple platforms and clinical data elements was a necessity for the project. To coordinate analyses across this large group of researchers, formal pipelines were created to establish a coherent working base of data and results.

The process of TCGA data generation and Pan-Cancer analysis is as follows (Figure 2). First, tumour and germline samples are obtained from a large number of tissue source sites and processed by the Biospecimen Core Resource (with sample selection according to criteria established for each tumour type and with extensive quality controls) to generate purified DNA, RNA and protein preparations. The preparations are sent to Genome Characterization Centers (GCCs) and Genome Sequencing Centers (GSCs) for molecular profiling, and the resulting data are deposited in the TCGA Data Coordinating Center (DCC) to provide a primary source of data, at four levels of data processing. Seven Genome Data Analysis Centers (GDACs), along with analysts in the GCCs, GSCs, and in the external research community, share analysis and interpretation of the data, coordinating activities through face-to-face meetings and regular (usually weekly) teleconferences.

A “data freeze” was created by pulling higher levels of interpreted data (“Level 3”) from the DCC into a coordinating repository called Synapse created by Sage Bionetworks. To create a coherent dataset, a sample “white list” was created by synchronizing flagged samples with the DCC, based on annotations and criteria from the individual disease working groups. The Pan-Cancer project leverages the TCGA infrastructure for sample acquisition, sample processing and data generation on individual tumour types, as well as the production of derived data sets and a variety of analysis results assembled in the Broad Institute's Firehose system (citation). Assembled robust, self-consistent data sets across all 12 Pan-Cancer tumour types were deposited into Synapse. The Synapse system implements mechanisms for tracking provenance and metadata, stable digital object identifiers (DOIs) for data referencing, and flexible methods for data access, either through a wiki-like web-based environment or programmatically through application programming interfaces (APIs). The pancan12 datasets and selected results are available at https://www.synapse.org/#!Synapse:syn300013 (doi:10.7303/syn300013).
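The white-list step can be pictured as a filter applied to every data set before it enters the freeze. The following minimal sketch uses invented sample identifiers and a hypothetical record layout; it does not reproduce the actual DCC annotations or the Synapse API.

```python
def build_white_list(all_sample_ids, flagged_sample_ids):
    """Samples retained in the freeze: everything not flagged for exclusion."""
    return set(all_sample_ids) - set(flagged_sample_ids)


def apply_white_list(records, white_list):
    """Keep only records whose sample is on the white list."""
    return [r for r in records if r["sample_id"] in white_list]


# Hypothetical identifiers and records, for illustration only.
ALL_SAMPLES = ["SAMPLE-001", "SAMPLE-002", "SAMPLE-003"]
FLAGGED = ["SAMPLE-002"]  # e.g. excluded by a disease working group
RECORDS = [
    {"sample_id": "SAMPLE-001", "platform": "mutation"},
    {"sample_id": "SAMPLE-002", "platform": "mutation"},
    {"sample_id": "SAMPLE-003", "platform": "dna_methylation"},
    {"sample_id": "SAMPLE-003", "platform": "mutation"},
]

white_list = build_white_list(ALL_SAMPLES, FLAGGED)
frozen_records = apply_white_list(RECORDS, white_list)
print(sorted(white_list), len(frozen_records))
```

Applying one shared white list to every platform is what keeps the per-platform data sets consistent with one another: a sample excluded from one is excluded from all.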

Box 3. Examples of additional major questions amenable to further Pan-Cancer analyses.

Figure 2. Data coordination for the Pan-Cancer TCGA project.

Samples from 12 different tumour types were collected by the Biospecimen Core Resource (BCR) and characterized on six major platforms by the Genome Characterization and Sequencing Centers (GCC/GSC). Data sets are deposited into the TCGA Data Coordinating Center (DCC), from which they are then distributed to the Broad Institute's Firehose and Memorial Sloan Kettering Cancer Center's cBioPortal for various automated processing pipelines. Analysis working groups (AWG) conduct focused analyses on individual tumour types. Results from the DCC, Firehose, and AWGs were collected and stored in Sage Bionetworks’ Synapse system to create a “data freeze.” Genome Data Analysis Centers (GDACs) accessed and deposited both data and results through Synapse to coordinate distributed analyses.

Table 1. The data “freeze” used by the Pan-Cancer project defined on December 21, 2012.

Tabulated are the numbers of unique tumour samples available for each tumour type (rows) and each measurement platform (columns).

Tumour type   RPPA  DNA Methylation  Copy Number  Mutation  miRNA  Expression
LUSC           195              358          345       178    332         227
READ           130              162          164        69    143          71
GBM            214              405          578       290    501         495
LAML             -              194          198       197    187         179
HNSC           212              310          310       277    309         303
BLCA            54              126          126        99    121          96
KIRC           423              457          457       417    442         431
UCEC           200              512          511       248    497         333
LUAD           237              431          357       229    365         355
OV             332              592          577       316    454         581
BRCA           408              888          887       772    870         817
COAD           269              420          422       155    407         192
Total         2674             4855         4932      3247   4628        4080
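The Total row of Table 1 can be checked by summing each platform column. The sketch below transcribes the table into Python; it assumes that the one value missing from the LAML row belongs to the RPPA column, which is the only placement for which all six column sums match the reported totals.

```python
PLATFORMS = ["RPPA", "DNA Methylation", "Copy Number",
             "Mutation", "miRNA", "Expression"]

# Sample counts transcribed from Table 1. None marks the missing LAML value,
# assumed here to be the RPPA column.
TABLE = {
    "LUSC": [195, 358, 345, 178, 332, 227],
    "READ": [130, 162, 164, 69, 143, 71],
    "GBM":  [214, 405, 578, 290, 501, 495],
    "LAML": [None, 194, 198, 197, 187, 179],
    "HNSC": [212, 310, 310, 277, 309, 303],
    "BLCA": [54, 126, 126, 99, 121, 96],
    "KIRC": [423, 457, 457, 417, 442, 431],
    "UCEC": [200, 512, 511, 248, 497, 333],
    "LUAD": [237, 431, 357, 229, 365, 355],
    "OV":   [332, 592, 577, 316, 454, 581],
    "BRCA": [408, 888, 887, 772, 870, 817],
    "COAD": [269, 420, 422, 155, 407, 192],
}

REPORTED_TOTALS = [2674, 4855, 4932, 3247, 4628, 4080]


def column_totals(table):
    """Sum each platform column, treating missing values as zero."""
    totals = [0] * len(PLATFORMS)
    for counts in table.values():
        for i, value in enumerate(counts):
            totals[i] += value or 0
    return totals


for name, computed, reported in zip(PLATFORMS, column_totals(TABLE), REPORTED_TOTALS):
    status = "ok" if computed == reported else "MISMATCH"
    print(f"{name:16s} {computed:5d} {reported:5d} {status}")
```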

ACKNOWLEDGEMENTS

We thank J. Zhang for administrative coordination of TCGA Pan-Cancer Analysis Working Group activities, C. Perou and K. Hoadley for contributions to Figure 1, and D. Wheeler, M. Meyerson, and L. Ding for comments on early drafts of the manuscript. The study was funded by the National Cancer Institute and the National Human Genome Research Institute.

REFERENCES