CAGEr: an R package for CAGE (Cap Analysis of Gene Expression) data analysis and promoterome mining (original) (raw)

Introduction

This document describes how to use CAGEr CAGEr, a Bioconductor package designed to process, analyse and visualise Cap Analysis of Gene Expression (CAGE) sequencing data. CAGE (Kodzius et al. 2006) is a high-throughput method for transcriptome analysis that utilizes_cap trapping_ (Carninci et al. 1996), a technique based on the biotinylation of the 7-methylguanosine cap of Pol II transcripts, to pulldown the 5′-complete cDNAs reversely transcribed from the captured transcripts. A linker sequence is ligated to the 5′ end of the cDNA and a specific restriction enzyme is used to cleave off a short fragment from the 5′ end. Resulting fragments are then amplified and sequenced using massive parallel high-throughput sequencing technology, which results in a large number of short sequenced tags that can be mapped back to the referent genome to infer the exact position of the transcription start sites (TSSs) used for transcription of captured RNAs (Figure 1). The number of CAGE tags supporting each TSS gives the information on the relative frequency of its usage and can be used as a measure of expression from that specific TSS. Thus, CAGE provides information on two aspects of capped transcriptome: genome-wide 1bp-resolution map of TSSs and transcript expression levels. This information can be used for various analyses, from 5′ centered expression profiling(Takahashi et al. 2012) to studying promoter architecture (Carninci et al. 2006).

Figure 1: Overview of CAGE experiment

CAGE samples derived from various organisms (genomes) can be analysed by CAGEr and the only limitation is the availability of the referent genome as a BSgenome package in case when raw mapped CAGE tags are processed. CAGEr provides a comprehensive workflow that starts from mapped CAGE tags and includes reconstruction of TSSs and promoters and their visualisation, as well as more specialized downstream analyses like promoter width, expression profiling and differential TSS usage. It can use both Binary Sequence Alignment Map (BAM) files of aligned CAGE tags or files with genomic locations of TSSs and number of supporting CAGE tags as input. If BAM files are provided_CAGEr_ constructs TSSs from aligned CAGE tags and counts the number of tags supporting each TSS, while allowing filtering out low-quality tags and removing technology-specific bias. It further performs normalization of raw CAGE tag count, clustering of TSSs into tag clusters (TC) and their aggregation across multiple CAGE experiments into promoters to construct the promoterome. Various methods for normalization and clustering of TSSs are supported. Exporting data into different types of track objects allows export and various visualisations of TSSs and clusters (promoters) in the UCSC Genome Browser, which facilitate generation of hypotheses. CAGEr manipulates multiple CAGE experiments at once and performs analyses across datasets, including expression profiling and detection of differential TSS usage (promoter shifting). Multicore option for parallel processing is supported on Unix-like platforms, which significantly reduces computing time.

Here are some of the functionalities provided in this package:

Several data packages are accompanying CAGEr package. They contain majority of the up-to-date publicly available CAGE data produced by major consortia including FANTOM and ENCODE. These include_FANTOM3and4CAGE_ package available from Bioconductor, as well as_ENCODEprojectCAGE_ and ZebrafishDevelopmentalCAGE packages available from http://promshift.genereg.net/CAGEr/. In addition, direct fetching of TSS data from FANTOM5 web resource (the largest collection of TSS data for human and mouse) from within CAGEr is also available. These are all valuable resources of genome-wide TSSs in various tissue/cell types for various model organisms that can be used directly in R. A separate vignette describes how these public datasets can be included into a workflow provided by CAGEr. For further information on the content of the data packages and the list of available CAGE datasets please refer to the vignette of the corresponding data package.

For further details on the implemented methods and for citing the CAGEr package in your work please refer to (Haberle et al. 2015).

Input data for CAGEr

CAGEr package supports three types of CAGE data input:

The type and the format of the input files is specified at the beginning of the workflow, when theCAGEset object is created (section 3.2). This is done by setting the inputFilesType argument, which accepts the following self-explanatory options referring to formats mentioned above:"bam", "bamPairedEnd", "bed", "ctss", "CTSStable".

In addition, the package provides a method for coercing a data.frame object containing single base-pair TSS information into a CAGEset object (as described in section 4.1), which can be further used in the workflow described below.

The CAGEr workflow

Getting started

We start the workflow by creating a CAGEexp object, which is a container for storing CAGE datasets and all the results that will be generated by applying specific functions. The CAGEexp objects are an extension of the_MultiAssayExperiment_ class, and therefore can use all their methods. The expression data is stored in CAGEexp using_SummarizedExperiment_ objects, and can also access their methods.

To load the CAGEr package and the other libraries into your R environment type:

library(CAGEr)

Creating a CAGEexp object

Specifying a genome assembly

In this tutorial we will be using data from zebrafish Danio rerio that was mapped to the danRer7 assembly of the genome. Therefore, the corresponding genome package BSgenome.Drerio.UCSC.danRer7 has to be installed. It will be automatically loaded by CAGEr commands when needed.

In case the data is mapped to a genome that is not readily available through_BSgenome_ package (not in the list returned by BSgenome::available.genomes()function), a custom BSgenome package can be build and installed first. (See the vignette within the BSgenome package for instructions on how to build a custom genome package). The genomeName argument can then be set to the name of the build genome package when creating a CAGEexp object (see the section_Creating CAGEexp object_ below). It can also be set to NULL as a last resort when no BSgenome package is available.

The BSgenome package is required by the CAGEr functions that need access to the genome sequence, for instance for G-correction. It is also used provideseqinfo information to the various Bioconductor objects produced by CAGEr. For this reason, CAGEr will discard alignments that are not on chromosomes named in the BSgenome package. If this is not desirable, set genomeNameto NULL.

Specifying input files

The subset of zebrafish (Danio rerio) developmental time-series CAGE data generated by (Nepal et al. 2013) will be used in the following demonstration of the_CAGEr_ workflow.

Files with genomic coordinates of TSSs detected by CAGE in 4 zebrafish developmental stages are included in this package in the extdata subdirectory. The files contain TSSs from a part of chromosome 17 (26,000,000-46,000,000), and there are two files for one of the developmental stages (two independent replicas). The data in files is organized in four tab-separated columns as described above in section 2.

inputFiles <- list.files( system.file("extdata", package = "CAGEr")
                        , "ctss$"
                        , full.names = TRUE)

Creating the object

The CAGEexp object is crated with the CAGEexp constructor, that requires information on file path and type, sample names and reference genome name.

ce <- CAGEexp( genomeName     = "BSgenome.Drerio.UCSC.danRer7"
             , inputFiles     = inputFiles
             , inputFilesType = "ctss"
             , sampleLabels   = sub( ".chr17.ctss", "", basename(inputFiles))
)

To display the created object type:

ce
## A CAGEexp object of 0 listed
##  experiments with no user-defined names and respective classes.
##  Containing an ExperimentList class object of length 0:
##  Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

The supplied information can be seen with the colData accessor, whereas all other slots are still empty, since no data has been read yet and no analysis conducted.

colData(ce)
## DataFrame with 5 rows and 3 columns
##                                 inputFiles inputFilesType        sampleLabels
##                                <character>    <character>         <character>
## Zf.30p.dome         /tmp/RtmpAAXmSN/Rins..           ctss         Zf.30p.dome
## Zf.high             /tmp/RtmpAAXmSN/Rins..           ctss             Zf.high
## Zf.prim6.rep1       /tmp/RtmpAAXmSN/Rins..           ctss       Zf.prim6.rep1
## Zf.prim6.rep2       /tmp/RtmpAAXmSN/Rins..           ctss       Zf.prim6.rep2
## Zf.unfertilized.egg /tmp/RtmpAAXmSN/Rins..           ctss Zf.unfertilized.egg

Reading in the data

In case when the CAGE / TSS data is to be read from input files, an empty CAGEexp object with information about the files is first created as described above in section 3.2. To actually read in the data into the object we use getCTSS() function, that will add an experiment called tagCountMatrix to the CAGEexp object.

ce <- getCTSS(ce)
ce
## A CAGEexp object of 1 listed
##  experiment with a user-defined name and respective class.
##  Containing an ExperimentList class object of length 1:
##  [1] tagCountMatrix: RangedSummarizedExperiment with 23343 rows and 5 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

This function reads the provided files in the order they were specified in theinputFiles argument. It creates a single set of all TSSs detected across all input datasets (union of TSSs) and a table with counts of CAGE tags supporting each TSS in every dataset. (Note that in case when a CAGEr object is created by coercion from an existing expression table there is no need to callgetCTSS()).

Genomic coordinates of all TSSs and numbers of supporting CAGE tags in every input sample can be retrieved using the CTSStagCountSE() function. CTSScoordinatesGR() accesses the CTSS coordinates and CTSStagCountDF() accesses the CTSS expression values.111 Data can also be accessed directly using the native methods of the MultiAssayExperiment andSummarizedExperiment classes, for example ce[["tagCountMatrix"]],rowRanges(ce[["tagCountMatrix"]]) and assay(ce[["tagCountMatrix"]]).

CTSStagCountSE(ce)
## class: RangedSummarizedExperiment 
## dim: 23343 5 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames(5): Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2
##   Zf.unfertilized.egg
## colData names(0):
CTSScoordinatesGR(ce)
## CTSS object with 23343 positions and 0 metadata columns:
##           seqnames       pos strand
##              <Rle> <integer>  <Rle>
##       [1]    chr17  26027430      +
##       [2]    chr17  26050540      +
##       [3]    chr17  26118088      +
##       [4]    chr17  26142853      +
##       [5]    chr17  26166954      +
##       ...      ...       ...    ...
##   [23339]    chr17  45975041      -
##   [23340]    chr17  45975540      -
##   [23341]    chr17  45975544      -
##   [23342]    chr17  45982697      -
##   [23343]    chr17  45999921      -
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome
##   BSgenome name: BSgenome.Drerio.UCSC.danRer7
CTSStagCountDF(ce)
## DataFrame with 23343 rows and 5 columns
##       Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2 Zf.unfertilized.egg
##             <Rle>   <Rle>         <Rle>         <Rle>               <Rle>
## 1               0       0             1             0                   0
## 2               0       0             0             0                   1
## 3               0       0             1             0                   0
## 4               0       0             0             1                   0
## 5               0       0             1             0                   0
## ...           ...     ...           ...           ...                 ...
## 23339           1       0             0             0                   0
## 23340           0       2             0             0                   0
## 23341           0       1             0             0                   0
## 23342           0       0             1             0                   0
## 23343           1       0             0             0                   0
CTSStagCountGR(ce, 1)  # GRanges for one sample with expression count.
## CTSS object with 7277 positions and 1 metadata column:
##          seqnames       pos strand | score
##             <Rle> <integer>  <Rle> | <Rle>
##      [1]    chr17  26222417      + |     1
##      [2]    chr17  26323229      + |     1
##      [3]    chr17  26453603      + |     2
##      [4]    chr17  26453615      + |     1
##      [5]    chr17  26453632      + |     3
##      ...      ...       ...    ... .   ...
##   [7273]    chr17  45901810      - |     1
##   [7274]    chr17  45901814      - |     1
##   [7275]    chr17  45901816      - |     1
##   [7276]    chr17  45975041      - |     1
##   [7277]    chr17  45999921      - |     1
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome
##   BSgenome name: BSgenome.Drerio.UCSC.danRer7

Note that the samples are ordered in the way they were supplied when creating the CAGEexp object and will be presented in that order in all the results and plots. To check sample labels and their ordering type:

sampleLabels(ce)
##               #FF0000               #CCFF00               #00FF66 
##         "Zf.30p.dome"             "Zf.high"       "Zf.prim6.rep1" 
##               #0066FF               #CC00FF 
##       "Zf.prim6.rep2" "Zf.unfertilized.egg"

In addition, a colour is assigned to each sample, which is consistently used to depict that sample in all the plots. By default a rainbow palette of colours is used and the hexadecimal format of the assigned colours can be seen as names attribute of sample labels shown above. The colours can be changed to taste at any point in the workflow using the setColors() function.

Quality controls and preliminary analyses

Genome annotations

By design, CAGE tags map transcription start sites and therefore detect promoters. Quantitatively, the proportion of tags that map to promoter regions will depend both on the quality of the libraries and the quality of the genome annotation, which may be incomplete. Nevertheless, strong variations between libraries prepared in the same experiment may be used for quality controls.

CAGEr can intersect CTSSes with reference transcript models and annotate them with the name(s) of the models, and the region categories promoter, exon, intron and_unknown_, by using the annotateCTSS function. The reference models can be GENCODE loaded with the import.gff function of the rtracklayer package, or any other input that has the same structure, see help("annotateCTSS") for details. In this example, we will use a sample annotation for zebrafish (see help("exampleZv9_annot")).

ce <- annotateCTSS(ce, exampleZv9_annot)

The annotation results are stored as tag counts in the sample metadata, and as new columns in the CTSS genomic ranges

colData(ce)[,c("librarySizes", "promoter", "exon", "intron", "unknown")]
## DataFrame with 5 rows and 5 columns
##                     librarySizes  promoter      exon    intron   unknown
##                        <integer> <integer> <integer> <integer> <integer>
## Zf.30p.dome                41814     37843      2352       594      1025
## Zf.high                    45910     41671      2848       419       972
## Zf.prim6.rep1              34053     29531      2714       937       871
## Zf.prim6.rep2              34947     30799      2320       834       994
## Zf.unfertilized.egg        56140     51114      2860       400      1766
CTSScoordinatesGR(ce)
## CTSS object with 23343 positions and 2 metadata columns:
##           seqnames       pos strand |  genes annotation
##              <Rle> <integer>  <Rle> |  <Rle>      <Rle>
##       [1]    chr17  26027430      + |           unknown
##       [2]    chr17  26050540      + | grid1a   promoter
##       [3]    chr17  26118088      + | grid1a       exon
##       [4]    chr17  26142853      + | grid1a     intron
##       [5]    chr17  26166954      + | grid1a       exon
##       ...      ...       ...    ... .    ...        ...
##   [23339]    chr17  45975041      - |           unknown
##   [23340]    chr17  45975540      - |           unknown
##   [23341]    chr17  45975544      - |           unknown
##   [23342]    chr17  45982697      - |           unknown
##   [23343]    chr17  45999921      - |           unknown
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome
##   BSgenome name: BSgenome.Drerio.UCSC.danRer7

A function plotAnnot is provided to plot the annotations as stacked bar plots. Here, all the CAGE libraries look very promoter-specific.

plotAnnot(ce, "counts")
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_segment()`).
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_point()`).

Correlation between samples

As part of the basic sanity checks, we can explore the data by looking at the correlation between the samples. The plotCorrelation2() function will plot pairwise scatter plots of expression scores per TSS or consensus cluster and calculate correlation coefficients between all possible pairs of samples222 Alternatively, the plotCorrelation() function does the same and colors the scatterplots according to point density, but is much slower.. A threshold can be set, so that only regions with an expression score (raw or normalized) above the threshold (either in one or both samples) are considered when calculating correlation. Three different correlation measures are supported: Pearson’s, Spearman’s and Kendall’s correlation coefficients. Note that while the scatterplots are on a logarithmic scale with pseudocount added to the zero values, the correlation coefficients are calculated on untransformed (but thresholded) data.

corr.m <- plotCorrelation2( ce, samples = "all"
                          , tagCountThreshold = 1, applyThresholdBoth = FALSE
                          , method = "pearson")

Figure 2: Correlation of raw CAGE tag counts per TSS

Sequence distribution at the transcription start site.

The presence of the core promoter motifs can be checked with the TSSlogo()function, provided that the CAGEexp object was built with a _BSgenome_package allowing access to the sequence flanking the transcription start sites.

TSSlogo(CTSScoordinatesGR(ce) |> subset(annotation == "promoter"), upstream = 35)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
##   Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Figure 3: Sequence logo at the transcription start site

The TSSlogo() function can also be used later. When passed tag clusters_or consensus clusters, it will automatically center the regions on their_dominant peak.

Merging of replicates

Based on calculated correlation we might want to merge and/or rearrange some of the datasets. To rearrange the samples in the temporal order of the zebrafish development (unfertilized egg -> high -> 30 percent dome -> prim6) and to merge the two replicas for the prim6 developmental stage we use the mergeSamples() function:

ce <- mergeSamples(ce, mergeIndex = c(3,2,4,4,1), 
                   mergedSampleLabels = c("Zf.unfertilized.egg", "Zf.high", "Zf.30p.dome", "Zf.prim6"))
ce <- annotateCTSS(ce, exampleZv9_annot)

The mergeIndex argument controls which samples will be merged and how the final dataset will be ordered. Samples labeled by the same number (in our case samples three and four) will be merged together by summing number of CAGE tags per TSS. The final set of samples will be ordered in the ascending order of values provided in mergeIndex and will be labeled by the labels provided in the mergedSampleLabels argument. Note that mergeSamples function resets all slots with results of downstream analyses, so in case there were any results in the CAGEexp object prior to merging, they will be removed. Thus, annotation has to be redone.

Normalization

Library sizes (number of total sequenced tags) of individual experiments differ, thus normalization is required to make them comparable. The librarySizes function returns the total number of CAGE tags in each sample:

librarySizes(ce)
## [1] 56140 45910 41814 69000

The CAGEr package supports both simple tags per million normalization and power-law based normalization. It has been shown that many CAGE datasets follow a power-law distribution(Balwierz et al. 2009). Plotting the number of CAGE tags (X-axis) against the number of TSSs that are supported by <= of that number of tags (Y-axis) results in a distribution that can be approximated by a power-law. On a log-log scale this reverse cumulative distribution will manifest as a monotonically decreasing linear function, which can be defined as

\[y = -1 * \alpha * x + \beta\]

and is fully determined by the slope \(\alpha\) and total number of tags T (which together with\(\alpha\) determines the value of \(\beta\)).

To check whether our CAGE datasets follow power-law distribution and in which range of values, we can use the plotReverseCumulatives function:

plotReverseCumulatives(ce, fitInRange = c(5, 1000))

Figure 4: Reverse cumulative distribution of CAGE tags

In addition to the reverse cumulative plots (Figure 4), a power-law distribution will be fitted to each reverse cumulative using values in the specified range (denoted with dashed lines in Figure 4) and the value of \(\alpha\)will be reported for each sample (shown in the brackets in the Figure 4legend). The plots can help in choosing the optimal parameters for power-law based normalization. We can see that the reverse cumulative distributions look similar and follow the power-law in the central part of the CAGE tag counts values with a slope between -1.1 and -1.3. Thus, we choose a range from 5 to 1000 tags to fit a power-law, and we normalize all samples to a referent power-law distribution with a total of 50,000 tags and slope of -1.2 (\(\alpha = 1.2\)).333 Note that since this example dataset contains only data from one part of chromosome 17 and the total number of tags is very small, we normalize to a referent distribution with a similarly small number of tags. When analyzing full datasets it is reasonable to set total number of tags for referent distribution to one million to get normalized tags per million values.

To perform normalization we pass these parameters to the normalizeTagCount function.

ce <- normalizeTagCount(ce, method = "powerLaw", fitInRange = c(5, 1000), alpha = 1.2, T = 5*10^4)
## Warning in dim(assays): The dim() method for DataFrameList objects is deprecated. Please use dims()
##   on these objects instead.
## Warning in nrow(x): The nrow() method for DataFrameList objects is deprecated. Please use nrows()
##   on these objects instead.
## Warning in ncol(x): The ncol() method for DataFrameList objects is deprecated. Please use ncols()
##   on these objects instead.
## Warning in dim(assays): The dim() method for DataFrameList objects is deprecated. Please use dims()
##   on these objects instead.
## Warning in nrow(x): The nrow() method for DataFrameList objects is deprecated. Please use nrows()
##   on these objects instead.
## Warning in ncol(x): The ncol() method for DataFrameList objects is deprecated. Please use ncols()
##   on these objects instead.
ce[["tagCountMatrix"]]
## class: RangedSummarizedExperiment 
## dim: 23343 4 
## metadata(0):
## assays(2): counts normalizedTpmMatrix
## rownames: NULL
## rowData names(2): genes annotation
## colnames(4): Zf.unfertilized.egg Zf.high Zf.30p.dome Zf.prim6
## colData names(0):

The normalization is performed as described in (Balwierz et al. 2009):

In addition to the two provided normalization methods, a pass-through option none can be set asmethod parameter to keep using raw tag counts in all downstream steps. Note thatnormalizeTagCount() has to be applied to CAGEr object before moving to next steps. Thus, in order to keep using raw tag counts run the function with method="none". In that case, all results and parameters in the further steps that would normally refer to normalized CAGE signal (denoted as tpm), will actually be raw tag counts.

CTSS flagging

Some CTSSes have a low expression score, and are found in only a few libraries. In non-PCR-amplified CAGE libraries, a CTSS found in a single library with an expression score of 1 tag represents the detection of a single mRNA molecule’s 5-prime end. But it could also represent the mismapping one molecule because of a sequencing error. To flag CTSSes that have poor reproducibility support so that other steps of the analysis can ignore them, the filterLowExpCTSSfunction is provided. It will add an internal flag in the filteredCTSSidxcolumn of the CTSS objects, set to FALSE where expression is lower than a given threshold in a given number of samples. This flag is later used by CTSS clustering and export functions.

Let’s flag low-fidelity TSSs supported by less than 1 normalized tag counts in all of the samples.

ce <- filterLowExpCTSS(ce, thresholdIsTpm = TRUE, nrPassThreshold = 1, threshold = 1)
CTSSnormalizedTpmGR(ce,1)
## CTSS object with 8395 positions and 4 metadata columns:
##          seqnames       pos strand |           genes annotation filteredCTSSidx
##             <Rle> <integer>  <Rle> |           <Rle>      <Rle>           <Rle>
##      [1]    chr17  26050540      + |          grid1a   promoter           FALSE
##      [2]    chr17  26391265      + | si:ch73-34j14.2       exon           FALSE
##      [3]    chr17  26446219      + |                    unknown           FALSE
##      [4]    chr17  26453605      + |                   promoter            TRUE
##      [5]    chr17  26453632      + |                   promoter            TRUE
##      ...      ...       ...    ... .             ...        ...             ...
##   [8391]    chr17  45901781      - |           arf6b   promoter           FALSE
##   [8392]    chr17  45901784      - |           arf6b   promoter            TRUE
##   [8393]    chr17  45901800      - |           arf6b   promoter           FALSE
##   [8394]    chr17  45901814      - |           arf6b   promoter            TRUE
##   [8395]    chr17  45901816      - |           arf6b   promoter            TRUE
##                      score
##                      <Rle>
##      [1] 0.658379223965292
##      [2] 0.658379223965292
##      [3] 0.658379223965292
##      [4]  1.30374775871268
##      [5]  1.30374775871268
##      ...               ...
##   [8391] 0.658379223965292
##   [8392] 0.658379223965292
##   [8393] 0.658379223965292
##   [8394]  1.30374775871268
##   [8395]  1.94429500559247
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome
##   BSgenome name: BSgenome.Drerio.UCSC.danRer7

CTSS clustering

Transcription start sites are found in the promoter region of a gene and reflect the transcriptional activity of that promoter (Figure 6). TSSs in the close proximity of each other give rise to a functionally equivalent set of transcripts and are likely regulated by the same promoter elements. Thus, TSSs can be spatially clustered into larger transcriptional units, called tag clusters (TCs) that correspond to individual promoters.CAGEr supports two methods for sample-specific spatial clustering of TSSs along the genome:

We will perform a simple distance-based clustering using 20 bp as a maximal allowed distance between two neighbouring TSSs.

ce <- distclu(ce, maxDist = 20, keepSingletonsAbove = 5)

Our final set of tag clusters will not include singletons (clusters with only one TSS), unless the normalized signal is above 5, it is a reasonably supported TSS. The CTSS clustering functions function creates a set of clusters for each sample separately; for each cluster it returns the genomic coordinates, counts the number of TSSs within the cluster, determines the (1-based) position of the most frequently used (dominant) TSS, calculates the total CAGE signal within the cluster and CAGE signal supporting the dominant TSS only. We can extract tag clusters for a desired sample fromCAGEexp object by calling the tagClustersGR function:

tagClustersGR(ce, sample = "Zf.unfertilized.egg")
## TagClusters object with 517 ranges and 3 metadata columns:
##       seqnames            ranges strand |            score    dominant_ctss
##          <Rle>         <IRanges>  <Rle> |            <Rle>           <CTSS>
##     1    chr17 26453632-26453708      + | 26.9709371501973 chr17:26453667:+
##     2    chr17 26564508-26564610      + | 128.637202208017 chr17:26564585:+
##     3    chr17 26595637-26595793      + | 216.999442534332 chr17:26595750:+
##     4    chr17 26596033-26596091      + | 10.4200035230486 chr17:26596070:+
##     5    chr17 26596118-26596127      + | 12.1994648486481 chr17:26596118:+
##   ...      ...               ...    ... .              ...              ...
##   513    chr17 45700182-45700187      - | 9.61820033171689 chr17:45700182:-
##   514    chr17 45901329-45901330      - | 1.96212698267798 chr17:45901329:-
##   515    chr17 45901698-45901710      - | 27.6544648890639 chr17:45901698:-
##   516    chr17 45901732-45901784      - |  119.96944736195 chr17:45901749:-
##   517    chr17 45901814-45901816      - | 3.24804276430515 chr17:45901816:-
##         nr_ctss
##       <integer>
##     1        12
##     2        24
##     3        35
##     4         9
##     5         4
##   ...       ...
##   513         3
##   514         2
##   515         4
##   516        15
##   517         2
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome

Track export for genome browsers

CAGE data can be visualized in the genomic context by converting raw or normalized CAGE tag counts to a track object and exporting it to a file format such as BED, bedGraph or BigWig for uploading (or linking) to a genome browser`444 Note that the ZENBU genome browsercan also display natively data from BAM or BED files as coverage tracks.. The (exportToTrack) function can take a variety of inputs representing_CTSS_, Tag Clusters or Consensus Clusters, with raw or normalised expression scores. When asked to export multiple samples it will return a list of tracks.

trk <- exportToTrack(CTSSnormalizedTpmGR(ce, "Zf.30p.dome"))
ce |> CTSSnormalizedTpmGR("all") |> exportToTrack(ce, oneTrack = FALSE)
## GRangesList object of length 4:
## [[1]]
## UCSC track 'Zf.unfertilized.egg (TC)'
## UCSCData object with 8395 ranges and 6 metadata columns:
##          seqnames    ranges strand |           genes annotation filteredCTSSidx
##             <Rle> <IRanges>  <Rle> |           <Rle>      <Rle>           <Rle>
##      [1]    chr17  26050540      + |          grid1a   promoter           FALSE
##      [2]    chr17  26391265      + | si:ch73-34j14.2       exon           FALSE
##      [3]    chr17  26446219      + |                    unknown           FALSE
##      [4]    chr17  26453605      + |                   promoter            TRUE
##      [5]    chr17  26453632      + |                   promoter            TRUE
##      ...      ...       ...    ... .             ...        ...             ...
##   [8391]    chr17  45901781      - |           arf6b   promoter           FALSE
##   [8392]    chr17  45901784      - |           arf6b   promoter            TRUE
##   [8393]    chr17  45901800      - |           arf6b   promoter           FALSE
##   [8394]    chr17  45901814      - |           arf6b   promoter            TRUE
##   [8395]    chr17  45901816      - |           arf6b   promoter            TRUE
##          cluster     score     itemRgb
##            <Rle> <numeric> <character>
##      [1]          0.658379      grey50
##      [2]          0.658379      grey50
##      [3]          0.658379      grey50
##      [4]          1.303748       black
##      [5]          1.303748       black
##      ...     ...       ...         ...
##   [8391]          0.658379      grey50
##   [8392]          0.658379       black
##   [8393]          0.658379      grey50
##   [8394]          1.303748       black
##   [8395]          1.944295       black
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome
## 
## ...
## <3 more elements>

Some track file format, for instance bigWig or bedGraph require the + and- strands to be separated, because they do not allow overlapping ranges. This can be done with the (split) function like in the following example555 The drop = TRUE option is needed to remove the * level..

split(trk, strand(trk), drop = TRUE)
## GRangesList object of length 2:
## $`+`
## UCSC track 'Zf.30p.dome (TC)'
## UCSCData object with 3778 ranges and 6 metadata columns:
##          seqnames    ranges strand | genes annotation filteredCTSSidx
##             <Rle> <IRanges>  <Rle> | <Rle>      <Rle>           <Rle>
##      [1]    chr17  26222417      + |          unknown           FALSE
##      [2]    chr17  26323229      + |          unknown           FALSE
##      [3]    chr17  26453603      + |         promoter            TRUE
##      [4]    chr17  26453615      + |         promoter           FALSE
##      [5]    chr17  26453632      + |         promoter            TRUE
##      ...      ...       ...    ... .   ...        ...             ...
##   [3774]    chr17  45975288      + |          unknown            TRUE
##   [3775]    chr17  45975289      + |          unknown            TRUE
##   [3776]    chr17  45975290      + |          unknown            TRUE
##   [3777]    chr17  45975292      + |          unknown            TRUE
##   [3778]    chr17  45975293      + |          unknown            TRUE
##                         cluster     score     itemRgb
##                           <Rle> <numeric> <character>
##      [1]                         0.680723      grey50
##      [2]                         0.680723      grey50
##      [3]                         1.394796       black
##      [4]                         0.680723      grey50
##      [5]                         2.122024       black
##      ...                    ...       ...         ...
##   [3774] chr17:45975252-45975..  10.44998       black
##   [3775] chr17:45975252-45975..   1.39480       black
##   [3776] chr17:45975252-45975..   4.34801       black
##   [3777] chr17:45975252-45975..   1.39480       black
##   [3778] chr17:45975252-45975..   2.85793       black
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome
## 
## $`-`
## UCSC track 'Zf.30p.dome (TC)'
## UCSCData object with 3499 ranges and 6 metadata columns:
##          seqnames    ranges strand | genes annotation filteredCTSSidx cluster
##             <Rle> <IRanges>  <Rle> | <Rle>      <Rle>           <Rle>   <Rle>
##      [1]    chr17  26068225      - |          unknown           FALSE        
##      [2]    chr17  26068227      - |          unknown           FALSE        
##      [3]    chr17  26068233      - |          unknown           FALSE        
##      [4]    chr17  26074127      - |          unknown            TRUE        
##      [5]    chr17  26113371      - |          unknown           FALSE        
##      ...      ...       ...    ... .   ...        ...             ...     ...
##   [3495]    chr17  45901810      - | arf6b   promoter           FALSE        
##   [3496]    chr17  45901814      - | arf6b   promoter            TRUE        
##   [3497]    chr17  45901816      - | arf6b   promoter            TRUE        
##   [3498]    chr17  45975041      - |          unknown           FALSE        
##   [3499]    chr17  45999921      - |          unknown           FALSE        
##              score     itemRgb
##          <numeric> <character>
##      [1]  0.680723      grey50
##      [2]  0.680723      grey50
##      [3]  0.680723      grey50
##      [4]  1.394796       black
##      [5]  0.680723      grey50
##      ...       ...         ...
##   [3495]  0.680723      grey50
##   [3496]  0.680723       black
##   [3497]  0.680723       black
##   [3498]  0.680723      grey50
##   [3499]  0.680723      grey50
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome

For bigWig export, the (rtracklayer::export.bw) needs to be run on each element of the list returned by the (split) command.

For bedGraph export, the (rtracklayer::export.bedGraph) command can take the list as input and will produce a single file containing the two tracks. (Figure 6) shows an example of _bedGraph_visualisation.

For BED export, the (rtracklayer::export.bed) can operate directly on the track object.

Other export format probably operate in a way similar to one of the cases above.

Figure 6: CAGE data bedGraph track visualized in the UCSC Genome Browser

Interquantile width can also be visualized in a gene-like representation in the genome browsers by passing quantile information during data conversion to the UCSCData format and then exporting it into a BED file:

iqtrack <- exportToTrack(ce, what = "tagClusters", qLow = 0.1, qUp = 0.9, oneTrack = FALSE)
iqtrack
## GRangesList object of length 4:
## $Zf.unfertilized.egg
## UCSC track 'TC'
## UCSCData object with 517 ranges and 8 metadata columns:
##         seqnames            ranges strand |     score   nr_ctss q_0.1 q_0.9
##            <Rle>         <IRanges>  <Rle> | <numeric> <integer> <Rle> <Rle>
##     [1]    chr17 26453632-26453708      + |   26.9709        12    36    72
##     [2]    chr17 26564508-26564610      + |  128.6372        24    17    81
##     [3]    chr17 26595637-26595793      + |  216.9994        35    37   114
##     [4]    chr17 26596033-26596091      + |   10.4200         9     1    50
##     [5]    chr17 26596118-26596127      + |   12.1995         4     1    10
##     ...      ...               ...    ... .       ...       ...   ...   ...
##   [513]    chr17 45700182-45700187      - |   9.61820         3     1     6
##   [514]    chr17 45901329-45901330      - |   1.96213         2     1     2
##   [515]    chr17 45901698-45901710      - |  27.65446         4     1     2
##   [516]    chr17 45901732-45901784      - | 119.96945        15     2    21
##   [517]    chr17 45901814-45901816      - |   3.24804         2     1     3
##         interquantile_width     thick      name        blocks
##                       <Rle> <IRanges> <logical> <IRangesList>
##     [1]                  37  26453667      <NA>    1,36-72,77
##     [2]                  65  26564585      <NA>   1,17-81,103
##     [3]                  78  26595750      <NA>  1,37-114,157
##     [4]                  50  26596070      <NA>       1-50,59
##     [5]                  10  26596118      <NA>          1-10
##     ...                 ...       ...       ...           ...
##   [513]                   6  45700182      <NA>           1-6
##   [514]                   2  45901329      <NA>           1-2
##   [515]                   2  45901698      <NA>        1-2,13
##   [516]                  20  45901749      <NA>     1,2-21,53
##   [517]                   3  45901816      <NA>           1-3
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome
## 
## ...
## <3 more elements>
#rtracklayer::export.bed(iqtrack, "outputFileName.bed")

In this gene-like representation (Figure 7), the oriented line shows the full span of the cluster, filled block marks the interquantile width and a single base-pair thick block denotes the position of the dominant TSS.

Figure 7: Tag clusters visualization in the genome browser

Expression profiling

The CAGE signal is a quantitative measure of promoter activity. In CAGEr, normalised expression scores of individual CTSSs or consensus clusters can be clustered in expression classes. Two unsupervised clustering algorithms are supported: kmeans and self-organizing maps (SOM). Both require to specify a number of clusters in advance. Results are stored in the exprClass metadata column of the CTSS or consensus clusters genomic ranges, and theexpressionClass accessor function is provided for convenience.

In the example below, we perform expression clustering at the level of entire promoters using SOM algorithm with 4 × 2 dimensions and applying it only to consensus clusters with a normalized CAGE signal of at least 10 TPM in at least one sample.

ce <- getExpressionProfiles(ce, what = "consensusClusters", tpmThreshold = 10, 
  nrPassThreshold = 1, method = "som", xDim = 4, yDim = 2)

consensusClustersGR(ce)$exprClass |> table(useNA = "always")
## 
##  0_0  0_1  1_0  1_1  2_0  2_1  3_0  3_1 <NA> 
##   46   49   25   19   15    5    6   45   75

Distribution of expression across samples for the 8 clusters returned by SOM can be visualized using the plotExpressionProfiles function as shown in Figure ??:

plotExpressionProfiles(ce, what = "consensusClusters")
## Warning in ggplot2::scale_x_log10(): log-10 transformation introduced infinite
## values.
## Warning: Removed 103 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

(#fig:ConsensusClustersExpressionProfiles_svg)Expression clusters

Each cluster is shown in different color and is marked by its label and the number of elements (promoters) in the cluster. We can extract promoters belonging to a specific cluster by typing commands like:

consensusClustersGR(ce) |> subset(consensusClustersGR(ce)$exprClass ==  "0_1")
## ConsensusClusters object with 49 ranges and 8 metadata columns:
##                             seqnames            ranges strand |
##                                <Rle>         <IRanges>  <Rle> |
##   chr17:26645157-26645514:+    chr17 26645157-26645514      + |
##   chr17:26651964-26652050:+    chr17 26651964-26652050      + |
##   chr17:28161574-28161757:+    chr17 28161574-28161757      + |
##   chr17:28670871-28670986:+    chr17 28670871-28670986      + |
##   chr17:28683436-28683585:+    chr17 28683436-28683585      + |
##                         ...      ...               ...    ... .
##   chr17:43639501-43639675:-    chr17 43639501-43639675      - |
##   chr17:43910083-43910371:-    chr17 43910083-43910371      - |
##   chr17:44487317-44487409:-    chr17 44487317-44487409      - |
##   chr17:45175977-45175990:-    chr17 45175977-45175990      - |
##   chr17:45545922-45545996:-    chr17 45545922-45545996      - |
##                                        score    dominant_ctss   nr_ctss
##                                        <Rle>           <CTSS> <integer>
##   chr17:26645157-26645514:+ 2818.40482968203 chr17:26645160:+        92
##   chr17:26651964-26652050:+ 72.8417558520884 chr17:26652002:+        13
##   chr17:28161574-28161757:+ 1305.40900274216 chr17:28161692:+        53
##   chr17:28670871-28670986:+ 3142.30648416071 chr17:28670882:+        52
##   chr17:28683436-28683585:+  822.70299120256 chr17:28683496:+        44
##                         ...              ...              ...       ...
##   chr17:43639501-43639675:- 297.617148143508 chr17:43639621:-        39
##   chr17:43910083-43910371:- 2488.62120758589 chr17:43910245:-        94
##   chr17:44487317-44487409:- 1619.09807160225 chr17:44487371:-        37
##   chr17:45175977-45175990:- 35.1668012378721 chr17:45175986:-         4
##   chr17:45545922-45545996:- 919.561599189015 chr17:45545991:-        32
##                             annotation    genes q_0.1 q_0.9 exprClass
##                                  <Rle>    <Rle> <Rle> <Rle>     <Rle>
##   chr17:26645157-26645514:+   promoter   pgrmc2     1   110       0_1
##   chr17:26651964-26652050:+       exon   pgrmc2    22    81       0_1
##   chr17:28161574-28161757:+   promoter  TMEM30B    64   181       0_1
##   chr17:28670871-28670986:+   promoter MIS18BP1     3    65       0_1
##   chr17:28683436-28683585:+   promoter  heatr5a    49    77       0_1
##                         ...        ...      ...   ...   ...       ...
##   chr17:43639501-43639675:-   promoter  zfyve28    90   164       0_1
##   chr17:43910083-43910371:-   promoter   ahsa1l   115   252       0_1
##   chr17:44487317-44487409:-   promoter    exoc5     1    70       0_1
##   chr17:45175977-45175990:-   promoter  fam161b    10    14       0_1
##   chr17:45545922-45545996:-   promoter  znf106a    27    70       0_1
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome

Consensus clusters and information on their expression profile can be exported to a BED file, which allows visualization of the promoters in the genome browser colored in the color of the expression cluster they belong to (Figure 8:

cc_iqtrack <- exportToTrack(ce, what = "consensusClusters", colorByExpressionProfile = TRUE)
cc_iqtrack
## UCSC track 'TC'
## UCSCData object with 285 ranges and 9 metadata columns:
##         seqnames            ranges strand |     score   nr_ctss annotation
##            <Rle>         <IRanges>  <Rle> | <numeric> <integer>      <Rle>
##     [1]    chr17 26453647-26453719      + |   174.694        22   promoter
##     [2]    chr17 26564524-26564591      + |   308.691        27   promoter
##     [3]    chr17 26595673-26595750      + |   891.068        34   promoter
##     [4]    chr17 26596033-26596339      + |   321.610        57   promoter
##     [5]    chr17 26645157-26645514      + |  2818.405        92   promoter
##     ...      ...               ...    ... .       ...       ...        ...
##   [281]    chr17 45534727-45534729      - |  176.5030         3     intron
##   [282]    chr17 45545922-45545996      - |  919.5616        32   promoter
##   [283]    chr17 45554314-45554345      - |   44.0059         6   promoter
##   [284]    chr17 45700092-45700187      - |   39.7368         8       exon
##   [285]    chr17 45901695-45901752      - |  846.5649        22   promoter
##           genes q_0.1 q_0.9 exprClass     thick      name
##           <Rle> <Rle> <Rle>     <Rle> <IRanges> <logical>
##     [1]   ttc7b    21    57       1_1  26453701      <NA>
##     [2]   nrde2     1    64       0_0  26564585      <NA>
##     [3]  larp1b    12    78       1_0  26595750      <NA>
##     [4]  larp1b    34   266       1_0  26596198      <NA>
##     [5]  pgrmc2     1   110       0_1  26645160      <NA>
##     ...     ...   ...   ...       ...       ...       ...
##   [281] znf106a     1     3       3_1  45534727      <NA>
##   [282] znf106a    27    70       0_1  45545991      <NA>
##   [283] tmem206     1    32       2_0  45554339      <NA>
##   [284]   susd4     4    93      <NA>  45700182      <NA>
##   [285]   arf6b     4    55       1_1  45901749      <NA>
##   -------
##   seqinfo: 26 sequences (1 circular) from danRer7 genome
#rtracklayer::export.bed(cc_iqtrack, "outputFileName.bed")

Figure 8: Consensus clusters colored by expression profile in the genome browser

Expression profiling of individual TSSs is done using the same procedure as described above for consensus clusters, only by setting wha = "CTSS" in all of the functions.

Differential expression analysis

The raw expression table for the consensus clusters can be exported to the _DESeq2_package for differential expression analysis. For this, the column data needs to contain factors that can group the samples. They can have any name.

ce$group <- factor(c("a", "a", "b", "b"))
dds <- consensusClustersDESeq2(ce, ~group)

Enhancers

The FANTOM5 project reported that “_enhancer activity can be detected through the presence of balanced bidirectional capped transcripts_” (Andersson et al. 2014). The CAGEr package is providing a wrapper to the CAGEfightR package’s function quickEnhancers so that it can run directly on CAGEexp objects. The wrapper returns a modified CAGEexp object in which the results are stored in its enhancers experiment slot.

ce <- quickEnhancers(ce)
ce[["enhancers"]]
## class: RangedSummarizedExperiment 
## dim: 33 4 
## metadata(0):
## assays(1): counts
## rownames(33): chr17:26690165-26690757 chr17:27120436-27120991 ...
##   chr17:45175861-45176390 chr17:45611150-45611574
## rowData names(4): score thick balance bidirectionality
## colnames(4): Zf.unfertilized.egg Zf.high Zf.30p.dome Zf.prim6
## colData names(12): inputFiles inputFilesType ... Name totalTags

Appendix

Creating a CAGEexp object by coercing a data frame

A CAGEexp object can also be created directly by coercing a data frame containing single base-pair TSS information. To be able to do the coercion into a CAGEexp, the data frame must conform with the following:

An example of such data frame is shown below:

TSS.df <- read.table(system.file( "extdata/Zf.unfertilized.egg.chr17.ctss"
                                , package = "CAGEr"))
# make sure the column names are as required
colnames(TSS.df) <- c("chr", "pos", "strand", "Zf.unfertilized.egg")
# make sure the column classes are as required
TSS.df$chr <- as.character(TSS.df$chr)
TSS.df$pos <- as.integer(TSS.df$pos)
TSS.df$strand <- as.character(TSS.df$strand)
TSS.df$Zf.unfertilized.egg <- as.integer(TSS.df$Zf.unfertilized.egg)
head(TSS.df)
##     chr      pos strand Zf.unfertilized.egg
## 1 chr17 26050540      +                   1
## 2 chr17 26074127      -                   2
## 3 chr17 26074129      -                   3
## 4 chr17 26222545      -                   1
## 5 chr17 26322780      -                   1
## 6 chr17 26322832      -                   2

This data.frame can now be coerced to a CAGEexp object, which will fill the corresponding slots of the object with provided TSS information:

ce.coerced <- as(TSS.df, "CAGEexp")
ce.coerced
## A CAGEexp object of 1 listed
##  experiment with a user-defined name and respective class.
##  Containing an ExperimentList class object of length 1:
##  [1] tagCountMatrix: RangedSummarizedExperiment with 8395 rows and 1 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

Summary of the CAGEr accessor functions

Originally there was one accessor per slot in CAGEset objects (the original_CAGEr_ format), but now that I added the CAGEexp class, that uses different underlying formats, the number of accessors increased because a) I provide the old ones for backwards compatibility and b) there multiple possible output formats.

Before releasing this CAGEr update in Bioconductor, I would like to be sure that the number of accessors and the name scheme are good enough.

Please let me know your opinion about the names and scope of the accessors below:

CTSS data

Cluster data

Gene data

Summary of the CAGEexp experiment slots and assays

A CAGEexp object can contain the following experiments.

Please let me know your opinion about the names

CAGEexp assays

Summary of the CAGEr classes

The CTSS, CTSS.chr, TagCluster and ConsensusClsuters are mostly used internally or type safety and preventing me (Charles) from mixing up inputs. They are visible from the outside. Should they be used more extensively ? Can they be replaced by more “core” Bioconductor classes ?

Paired-end CAGE read alignment with the nf-core/rnaseq pipeline

The modern CAGE protocols starting from nAnTi-CAGE (Murata et al. 2014) onward can be sequenced paired-end when they are random-primed. Many aligners can map the read pairs but it is important to pay attention to the way they encode the existence of unmapped extra G bases in their output (typically in BAM format).

CAGEr is able to read the BAM files of the HiSAT2 aligner produced by thenf-co.re/rnaseq pipeline. One of the benefits of using a full pipeline to produce the alignment files is that the results will include some quality controls that can be used to identify defects before investing more time in the CAGEr analysis. Optionally, the first 6 or 9 bases (depending on the protocol) of Read 2 may be clipped, as they originate from the random primer and not from the RNA. However, forgetting to do so has very little impact on the results.