TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data

doi: 10.1093/nar/gkv1507. Epub 2015 Dec 23.

Antonio Colaprico, Tiago C Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S Sabedot, Tathiane M Malta, Stefano M Pagnotta, Isabella Castiglioni, Michele Ceccarelli, Gianluca Bontempi, Houtan Noushmehr

Antonio Colaprico et al. Nucleic Acids Res. 2016.

Abstract

The Cancer Genome Atlas (TCGA) research network has made public a large collection of clinical and molecular phenotypes of more than 10 000 tumor patients across 33 different tumor types. Using this cohort, TCGA has published over 20 marker papers detailing the genomic and epigenomic alterations associated with these tumor types. Although many important discoveries have been made by TCGA's research network, opportunities still exist to implement novel methods, thereby elucidating new biological pathways and diagnostic markers. However, mining the TCGA data presents several bioinformatics challenges, such as data retrieval and integration with clinical data and other molecular data types (e.g. RNA and DNA methylation). We developed an R/Bioconductor package called TCGAbiolinks to address these challenges and offer bioinformatics solutions by using a guided workflow to allow users to query, download and perform integrative analyses of TCGA data. We combined methods from computer science and statistics into the pipeline and incorporated methodologies developed in previous TCGA marker studies and in our own group. Using four different TCGA tumor types (Kidney, Brain, Breast and Colon) as examples, we provide case studies to illustrate examples of reproducibility, integrative analysis and utilization of different Bioconductor packages to advance and accelerate novel discoveries.

© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

Figures

Figure 1.

TCGA data overview. (A) Bars represent the number of patients by disease; bubbles represent the available data size in TB by disease. (B) Number of samples by platform and by level, grouped by type: genomic, transcriptomic and epigenomic. (C) Barplot: number of citations for TCGA papers. Bubble plot: number of TCGA papers, with the number of papers published by the TCGA Research Network in parentheses. Source: Scopus search for 'TCGA', adding TCGA Research Network papers that were not found during this search.

Figure 2.

Overview of TCGAbiolinks functions. TCGAbiolinks is organized in three categories. The first category (Data) provides functions to query the TCGA database, download the data and prepare it. The second category (Analysis) contains functions that allow the user to carry out different types of analyses; these include clustering (TCGAanalyze_Clustering), differential expression analysis (TCGAanalyze_DEA) and enrichment analysis (TCGAanalyze_EA). Finally, the results can be visualized using the functions in the third category (Visualization): these include principal component analysis (TCGAvisualize_PCA), starburst plots (TCGAvisualize_starburst) and survival curves (TCGAvisualize_SurvivalCoxNET). The dependencies on other R/Bioconductor packages are specified in the last row of the figure.

Figure 3.

Integrative analysis of BRCA data using TCGA clinical data and subtypes. Case study n.1: Integrative (or downstream) analysis of gene expression and clinical data from BRCA, with univariate and multivariate survival analysis using the DNET package. (A–D) Top 20 GO terms for BP, CC and MF (Biological Process, Cellular Component, Molecular Function) and pathways enriched by DEGs, respectively. Gene annotation by DAVID's database. (E) Genes significant in univariate Kaplan-Meier and multivariate Cox regression analysis, shown in a net of five communities with the same P-values using the DNET package, with interactions among genes taken from STRING's database.

Figure 4.

Case study n.2: Integrative (or downstream) analysis of gene expression and clinical data from LGG, with unsupervised clustering and cross-referencing of the expression clusters with clinical and molecular information. (A) Heatmap of the 1187 most variable genes, clustered by cutting the tree at k = 4 into EC1, EC2, EC3 and EC4. (B) Kaplan-Meier survival plot for the EC clusters. (C and D) Distribution of the DNA methylation clusters and ATRX mutation within the EC clusters.

Figure 5.

Case study n.3 Integrative analysis of gene expression and DNA methylation data from COAD disease, comparing groups CIMP.L and CIMP.H. (A) Expression volcano plot: fold change of expression data versus significance. (B) DNA methylation volcano plot: difference of DNA methylation versus significance. (C) Starburst plot: DNA methylation significance versus gene expression significance.
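
The axes described in panels A-C can be illustrated with a minimal Python sketch. This is not the package's implementation (TCGAbiolinks is an R package); the function names, the sign convention (significance signed by the direction of change) and the 0.05 cutoff are all assumptions made for illustration.

```python
import math

def signed_log10(p_value, direction):
    """Return -log10(p), positive when direction > 0 and negative otherwise."""
    magnitude = -math.log10(p_value)
    return magnitude if direction > 0 else -magnitude

def starburst_class(meth_diff, meth_p, expr_log2fc, expr_p, p_cut=0.05):
    """Place one gene on a starburst-style plot and label its region.

    x axis: DNA methylation significance, signed by the methylation difference.
    y axis: gene expression significance, signed by the log2 fold change.
    The cutoff p_cut is a hypothetical default, not a value from the paper.
    """
    x = signed_log10(meth_p, meth_diff)
    y = signed_log10(expr_p, expr_log2fc)
    threshold = -math.log10(p_cut)
    # A gene is only labelled when both data types pass the cutoff.
    if abs(x) < threshold or abs(y) < threshold:
        return x, y, "not significant"
    meth = "hyper" if x > 0 else "hypo"
    expr = "up" if y > 0 else "down"
    return x, y, f"{meth}methylated and {expr}-regulated"
```

Under these assumptions, a gene with higher methylation and lower expression in one group, both significant, falls in the "hypermethylated and down-regulated" region of the plot.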

Figure 6.

Case study n.4: TCGAbiolinks integration: integrative analysis using ELMER. (A) Each scatter plot shows the average DNA methylation level of sites with the AP1 motif in all KIRC samples plotted against the expression of the transcription factors CEBPB and GFI1, respectively. (B) The schematic plot shows the probe colored in blue and the location of the 20 nearby genes; the genes significantly linked to the probe are in red. (C) The plot shows the odds ratio (OR, x axis) for the selected motifs with an OR above 1.1 and a lower boundary of the OR above 1.1. The range shows the 95% confidence interval for each odds ratio.
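
As a rough illustration of the selection rule in panel C, the sketch below computes an odds ratio and a 95% confidence interval from a 2x2 table of counts and applies the 1.1 cutoff to both the OR and its lower bound. The caption does not state how the interval is computed, so the Wald (log-normal) interval used here is an assumption, and the function names and count layout are hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: selected probes with the motif     b: selected probes without it
    c: background probes with the motif   d: background probes without it
    z = 1.96 gives an approximate 95% interval. All counts must be > 0.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

def motif_selected(a, b, c, d, cutoff=1.1):
    """Keep a motif only if both the OR and its lower bound exceed the cutoff."""
    odds_ratio, lower, _ = odds_ratio_ci(a, b, c, d)
    return odds_ratio > cutoff and lower > cutoff
```

Requiring the lower bound to clear the cutoff, and not just the point estimate, filters out motifs whose apparent enrichment rests on very few probes.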
