CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses

Vanja Haberle et al. Nucleic Acids Res. 2015.

Abstract

Cap analysis of gene expression (CAGE) is a high-throughput method for transcriptome analysis that provides a single base-pair resolution map of transcription start sites (TSS) and their relative usage. Despite their high resolution and functional significance, published CAGE data are still underused in promoter analysis due to the absence of tools that enable their efficient manipulation and integration with other genome data types. Here we present CAGEr, an R implementation of novel methods for the analysis of differential TSS usage and promoter dynamics, integrated with CAGE data processing and promoterome mining into a first comprehensive CAGE toolbox on a common analysis platform. Crucially, we provide collections of TSSs derived from most published CAGE datasets, as well as direct access to the FANTOM5 resource of TSSs for numerous human and mouse cell/tissue types from within R, greatly increasing the accessibility of precise context-specific TSS data for integrative analyses. The CAGEr package is freely available from Bioconductor at http://www.bioconductor.org/packages/release/bioc/html/CAGEr.html.

© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

Figures

Figure 1.

CAGEr workflow. (a) Schematic representation of CAGE data and explanation of key terms. (b) Flow chart of main steps in CAGEr. CTSS, CAGE detected TSS; TC, tag/TSS cluster.

Figure 2.

Power-law based normalization. (a) Reverse cumulative distribution of CAGE tag count per CTSS for eight mouse testis samples plotted with CAGEr. The slope of the power law fitted within the range marked by the dotted lines is shown for each sample in brackets next to the sample name. The suggested reference power-law distribution is shown as a dashed grey line, and the corresponding parameters for normalization are given in the lower left corner. alpha, absolute value of the reference slope on the log-log scale; T, total number of CAGE tags in the reference distribution. (b) Reverse cumulative distribution of CAGE signal per CTSS after normalization. E13–E17, embryonic day 13–17; N0–N30, neonate day 0–30.
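
The normalization described in this caption can be illustrated with a minimal Python sketch. It is not the CAGEr implementation (CAGEr is an R package), and every function name here is hypothetical. The sketch assumes that the reverse cumulative distribution follows r(t) = r0 * t^(-alpha) and that the reference amplitude is r0 = T / zeta(alpha), which follows from requiring that the reference distribution contain T tags in total. The default fitting range (10 to 1000 tags) and the defaults alpha = 1.25 and T = 10^6 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def reverse_cumulative(values):
    """Return (t, number of CTSSs with signal >= t) pairs, sorted by t."""
    freq = Counter(values)
    pairs, running = [], 0
    for t in sorted(freq, reverse=True):
        running += freq[t]
        pairs.append((t, running))
    return pairs[::-1]


def fit_power_law(values, lo=10, hi=1000):
    """Least-squares fit of log(count) = intercept + slope * log(t), lo <= t <= hi."""
    pts = [(math.log(t), math.log(n))
           for t, n in reverse_cumulative(values) if lo <= t <= hi]
    if len(pts) < 2:
        raise ValueError("need at least two distinct values in the fitting range")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def zeta(s, n=1000):
    """Riemann zeta for s > 1: partial sum plus an Euler-Maclaurin tail estimate."""
    head = sum(k ** -s for k in range(1, n))
    return head + n ** (1 - s) / (s - 1) + 0.5 * n ** -s


def normalize_to_reference(values, slope, intercept, alpha=1.25, total=1e6):
    """Map each value t to lam * t**beta so the fitted power law
    exp(intercept) * t**slope becomes the reference r0 * t**(-alpha)."""
    r0 = total / zeta(alpha)          # assumed reference amplitude
    beta = -slope / alpha             # rescales the exponent
    lam = (r0 / math.exp(intercept)) ** (1 / alpha)
    return [lam * t ** beta for t in values]
```

In use, `fit_power_law` would be applied to the raw tag counts of each sample, and the resulting slope and intercept passed to `normalize_to_reference`, so that all samples share the same reference distribution afterwards.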

Figure 3.

Promoter width. (a) Schematic representation of promoter width assessment using quantile positions of CAGE signal along the promoter. (b) Distribution of promoter width in adult mouse testis for three groups of promoters divided by expression (normalized CAGE tpm). The left panel shows the distribution of the full width from the most 5′ TSS to the most 3′ TSS in the promoter, and the right panel shows the interquantile width (distance between the positions of the 10th and the 90th percentile). Interquantile width accounts for the local level of noise and provides a more robust measure of promoter width, allowing separation of sharp and broad promoters (dashed line). (c) Distribution of match (%) to the TATA-box motif in the region −35 to −22 bp upstream of the dominant TSS in sharp and broad promoters. The P-value of a two-tailed Wilcoxon rank-sum test is shown. (d) Percentage of sharp and broad promoters that overlap CpG islands (CGI) and non-methylated islands (NMI; data from (24)).
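
The interquantile width in panel (b) can be sketched in Python as follows. This is an illustration rather than CAGEr code, and the function names are hypothetical. It assumes that a quantile position is the first TSS (moving 5′ to 3′) at which the cumulative signal reaches the given fraction of the total, and that width is counted inclusively in base pairs; both conventions are assumptions not stated in the caption.

```python
def quantile_position(positions, signal, q):
    """First position (5' to 3') where cumulative signal reaches q * total."""
    total = sum(signal)
    if total <= 0:
        raise ValueError("promoter has no CAGE signal")
    ordered = sorted(zip(positions, signal))
    running = 0.0
    for pos, s in ordered:
        running += s
        if running >= q * total:
            return pos
    return ordered[-1][0]  # guard against floating-point shortfall


def interquantile_width(positions, signal, q_low=0.1, q_high=0.9):
    """Inclusive distance (bp) between the q_low and q_high quantile positions."""
    low = quantile_position(positions, signal, q_low)
    high = quantile_position(positions, signal, q_high)
    return high - low + 1


def full_width(positions):
    """Inclusive distance (bp) from the most 5' TSS to the most 3' TSS."""
    return max(positions) - min(positions) + 1


# Hypothetical promoter: a tight core plus weak outlying TSSs.
positions = [100, 101, 102, 103, 150]
signal = [1, 40, 50, 8, 1]
print(full_width(positions), interquantile_width(positions, signal))
```

In this made-up example a single weak TSS at position 150 inflates the full width, while the interquantile width still reflects the narrow core, which is the robustness to noise that the caption describes.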

Figure 4.

Promoter-centred expression profiling. (a) Self-organizing map clustering of promoter expression across eight mouse testis samples. Each box represents one cluster and the number of contained promoters is denoted above the box. Individual beanplots show distribution of scaled normalized expression for those promoters in different samples denoted on the x-axis. Gene ontology terms significantly enriched in selected clusters are shown in corresponding colours. (b) Example of a constitutively expressed promoter that contains TSSs with distinct expression dynamics. First track shows the span of the cluster (promoter) and is coloured according to its expression class (0_2) as shown in panel (a). Second track shows individual TSS positions with signal above 5 tpm, which are coloured according to their own expression class as shown in Supplementary Figure S5b.

Figure 5.

Differential TSS usage. (a) Schematic of differential TSS usage assessment. The distribution of TSSs and the cumulative distribution of CAGE signal (F1 and F2) along a single promoter in two different samples are shown in cyan and orange, respectively. The grey line shows the difference between the two cumulatives. The shifting score is calculated as the ratio of the maximal difference between the two cumulatives to the total CAGE signal at that promoter in the sample with lower signal (left panel). The cumulatives are scaled to the range between 0 and 1, and the Kolmogorov–Smirnov (K–S) test is used to assess the significance of the difference between the resulting empirical distribution functions (F1′ and F2′). The value of the K–S statistic (D) is illustrated by an arrow (right panel). (b) Number of promoters with significant differential TSS usage (K–S test, FDR ≤ 0.01) for all pair-wise comparisons of eight mouse testis samples. (c) Example of a shifting promoter detected using the method shown in panel (a), which demonstrates differential TSS usage between mouse embryonic (E13) and adult testis. The shifting score and corrected K–S P-value are given.
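
The two quantities in panel (a) can be sketched in Python as below. This is a hedged illustration, not the CAGEr implementation, and the function names are hypothetical. The caption does not specify the sign of the difference or the direction of accumulation. The sketch assumes the difference is taken as lower-signal sample minus higher-signal sample, and that both 5′-to-3′ and 3′-to-5′ cumulatives are considered, so that a score of 1 corresponds to complete physical separation of the TSSs. Only the K–S statistic D is computed; the P-value and FDR correction are omitted.

```python
from itertools import accumulate


def _max_diff(low, high):
    """Maximum of (cumulative low - cumulative high) along the promoter."""
    return max(l - h for l, h in zip(accumulate(low), accumulate(high)))


def shifting_score(sig_a, sig_b):
    """Max cumulative difference divided by the total signal of the
    lower-signal sample (sign and direction handling are assumptions)."""
    sig_a, sig_b = list(sig_a), list(sig_b)
    if len(sig_a) != len(sig_b):
        raise ValueError("signals must cover the same positions")
    if min(sum(sig_a), sum(sig_b)) <= 0:
        raise ValueError("both samples need signal at this promoter")
    low, high = (sig_a, sig_b) if sum(sig_a) <= sum(sig_b) else (sig_b, sig_a)
    forward = _max_diff(low, high)
    reverse = _max_diff(low[::-1], high[::-1])
    return max(forward, reverse) / sum(low)


def ks_statistic(sig_a, sig_b):
    """K-S statistic D: largest gap between cumulatives scaled to [0, 1]."""
    tot_a, tot_b = sum(sig_a), sum(sig_b)
    f_a = [c / tot_a for c in accumulate(sig_a)]
    f_b = [c / tot_b for c in accumulate(sig_b)]
    return max(abs(x - y) for x, y in zip(f_a, f_b))


# Hypothetical per-position CAGE signal for one promoter in two samples.
sample_1 = [10, 0, 0]
sample_2 = [0, 0, 100]
print(shifting_score(sample_1, sample_2), ks_statistic(sample_1, sample_2))
```

Under these assumptions, two samples that use the same TSSs in the same proportions give D = 0 and a negative shifting score, while samples that use entirely separate TSSs give D = 1 and a score of 1.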

Figure 6.

Comparison between annotated TSS and CAGE. (a) Distance between annotated RefSeq TSS and dominant TSS of the closest CAGE tag cluster in adult mouse testis. Promoters have been separated into sharp and broad class based on their interquantile width as shown in Figure 3b. (b) Non-methylated DNA signal (data from (24)) at promoters sorted by interquantile width and centred at CAGE dominant TSS. Broad promoters are associated with broader non-methylated regions and the level of non-methylation increases with promoter width. (c) Frequency of AA/AT/TA/TT dinucleotides around sharp and broad promoters centred at CAGE dominant TSS (top) or RefSeq annotated TSS (bottom). Magnified view of the signal in the region 50–200 bp downstream of the TSS is shown in the inset and demonstrates the 10 bp periodicity linked to nucleosome positioning (32) in broad promoters. Unlike RefSeq annotation, CAGE allows separation of sharp and broad promoters (Figure 3b) and adds precision into promoter-centred analysis revealing subtle sequence patterns in different classes of promoters.

References

    1. Smale S.T., Kadonaga J.T. The RNA polymerase II core promoter. Annu. Rev. Biochem. 2003;72:449–479.
    2. Carninci P., Sandelin A., Lenhard B., Katayama S., Shimokawa K., Ponjavic J., Semple C.A.M., Taylor M.S., Engström P.G., Frith M.C., et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. 2006;38:626–635.
    3. Suzuki Y., Taira H., Tsunoda T., Mizushima-Sugano J., Sese J., Hata H., Ota T., Isogai T., Tanaka T., Morishita S., et al. Diverse transcriptional initiation revealed by fine, large-scale mapping of mRNA start sites. EMBO Rep. 2001;2:388–393.
    4. Shiraki T., Kondo S., Katayama S., Waki K., Kasukawa T., Kawaji H., Kodzius R., Watahiki A., Nakamura M., Arakawa T., et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl. Acad. Sci. U.S.A. 2003;100:15776–15781.
    5. de Hoon M., Hayashizaki Y. Deep cap analysis gene expression (CAGE): genome-wide identification of promoters, quantification of their expression, and network inference. Biotechniques. 2008;44:627–632.
