Usage of the metaseqR2 package
Getting started
Installation
To install the metaseqR2 package, start R and enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("metaseqR2")
Introduction
library(metaseqR2)
Detailed instructions on how to run the metaseqr2 pipeline can be found under the main documentation of the metaseqR2 package.
Briefly, to run metaseqr2 you need:
- Input RNA-Seq data. These can come in three forms:
  - A text tab delimited file in a spreadsheet-like format containing at least unique gene identifiers. These must correspond to one of the annotation sources supported by metaseqR2 (Ensembl, UCSC, RefSeq) or, if you are using a custom annotation, to the identifiers in the respective GTF file. This case applies when a ready-made counts table is received from an external source, such as a sequencing facility or a public dataset.
  - A text tab delimited file in a spreadsheet-like format containing all the required annotation elements and additional columns with read counts. This solution is applicable only for gene analysis (transLevel = "gene" and countType = "gene"). Generally, it is not recommended to embed the annotation and this case is supported only for backwards compatibility.
  - A set of BAM files, aligned according to the mRNA sequencing protocol, usually with a spliced aligner like HiSat or STAR. This is the recommended analysis procedure and the BAM files are declared in a targets text file.
- A local annotation database. This is not required as all required annotation can be downloaded on the fly, but it is recommended for speed, if you have a lot of analyses to perform.
- A list of statistical contrasts for which you wish to check differential expression.
- An internet connection so that the interactive report can be properly rendered, as the required JavaScript libraries are not embedded in the package. This is required only once as the report is then self-contained.
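As a rough sketch of the BAM-based input, a targets file and a contrast could be prepared as shown below. The column names (samplename, filename, condition), the sample and file names, and the "_vs_" contrast syntax are assumptions made for illustration and should be checked against the man page of metaseqr2, which also explains which side of a contrast is treated as the reference.

```r
# Hypothetical targets table: one row per BAM file (all names are made up)
targets <- data.frame(
    samplename = c("A_1", "A_2", "B_1", "B_2"),
    filename = c("A_1.bam", "A_2.bam", "B_1.bam", "B_2.bam"),
    condition = c("condA", "condA", "condB", "condB"),
    stringsAsFactors = FALSE
)

# Write it as a tab-delimited text file
targetsFile <- file.path(tempdir(), "targets.txt")
write.table(targets, file = targetsFile, sep = "\t", quote = FALSE,
    row.names = FALSE)

# Read it back and derive a named list of samples per condition
readBack <- read.delim(targetsFile, stringsAsFactors = FALSE)
sampleList <- split(readBack$samplename, readBack$condition)

# A contrast written as condition names joined by "_vs_" (assumed syntax)
contrast <- "condA_vs_condB"
contrastConditions <- strsplit(contrast, "_vs_", fixed = TRUE)[[1]]

# Every condition named in the contrast must exist in the targets
stopifnot(all(contrastConditions %in% names(sampleList)))
```

Checking the contrast against the conditions before a long run is a cheap way to catch typos in condition names.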
For demonstration purposes, a very small dataset (with embedded annotation) is included with the package.
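A minimal sketch of how the demonstration data might be inspected is given below. The dataset name "mm9GeneData" is an assumption; the code lists whatever objects are loaded rather than relying on specific object names, and does nothing if metaseqR2 is not installed.

```r
# Load the demonstration data only if metaseqR2 is installed
hasMetaseqR2 <- requireNamespace("metaseqR2", quietly = TRUE)

if (hasMetaseqR2) {
    # "mm9GeneData" is the assumed name of the bundled dataset
    demoEnv <- new.env()
    utils::data("mm9GeneData", package = "metaseqR2", envir = demoEnv)
    # List what was loaded instead of assuming object names
    print(ls(demoEnv))
}
```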
Data filtering
The metaseqR2 pipeline has several options for filtering at the gene and exon levels. These filters span various areas including:
- The presence of a minimum number of reads in a fraction of the samples per condition or experiment-wise.
- The exclusion of specific biotypes (e.g. excluding pseudogenes).
- The filtering based on several expression attributes such as average read presence over n kbs or the exclusion of genes whose expression is below the expression of a set of genes known not to be expressed in the biological mechanism under investigation.
- Filters based on exon expression such as the minimum fraction of exons that should contain reads over a gene.
In addition, the metaseqR2 pipeline offers several analysis “presets” with respect to the filtering layers applied, the statistical analysis stringency and the amount of data exported.
All the aforementioned parameters are well-documented in the main manual of the package and the respective man pages.
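The filters and presets described in this section are passed to the pipeline as nested lists and a single string, respectively. The sketch below shows one possible way to write them. The field names, the values and the preset naming scheme are assumptions recalled from the package manual, not a definitive reference, so they should be verified in the man page of metaseqr2.

```r
# Gene-level filters as a nested list (field names and values are assumed)
geneFilters <- list(
    # exclude genes shorter than a given length in bp
    length = list(length = 500),
    # average read presence over a stretch of base pairs
    avgReads = list(averagePerBp = 100, quantile = 0.25),
    # minimum number of reads in a fraction of samples,
    # per condition or experiment-wise
    presence = list(frac = 0.25, minCount = 10, perCondition = FALSE)
)

# Exon-level filter: minimum fraction of exons that should contain reads
exonFilters <- list(
    minActiveExons = list(exonsPerGene = 5, minExons = 2, frac = 1/5)
)

# A preset name combining stringency and amount of exported data
# (assumed naming scheme)
stringency <- "medium"
exportLevel <- "normal"
preset <- paste(stringency, exportLevel, sep = "_")

# These objects would then be given to the geneFilters, exonFilters
# or preset arguments of metaseqr2
```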
The report
At the end of each metaseqr2 pipeline run, a detailed HTML report of the procedure and the findings is produced. Apart from a description of the process, all the input parameters and other data related to the differential expression analysis, the report contains many interactive graphs. Specifically, all the quality control and result inspection plots are interactive. This is achieved by making extensive use of the JavaScript libraries Highcharts, Plotly and jvenn to create more user-friendly and directly explorable plots. By default, metaseqr2 produces all available diagnostic plots, always according to the input. For example, if the biotype feature is not available in a case where annotation="embedded", plots like biodetection and countsbio will not be available. If not all diagnostic plots are required, a selection can be made with the qcPlots argument, possibly making the report “lighter” and less browser-demanding.
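As an illustration of such a selection, the sketch below drops the biotype-dependent plots from a requested set when no biotype information is available. The plot identifiers other than biodetection and countsbio, and the hasBiotype flag, are assumptions made for the example.

```r
# Plot names as they could be passed to qcPlots (identifiers are assumed)
requestedPlots <- c("mds", "biodetection", "countsbio", "boxplot", "volcano")

# Plots that need the biotype feature, as named in the text above
biotypePlots <- c("biodetection", "countsbio")

# Hypothetical flag: embedded annotation without a biotype column
hasBiotype <- FALSE

# Keep only the plots that can actually be drawn
qcPlots <- if (hasBiotype) {
    requestedPlots
} else {
    setdiff(requestedPlots, biotypePlots)
}

# qcPlots would then be given to the qcPlots argument of metaseqr2
```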
The HTML report is created with the packages rmarkdown and knitr. This means that Pandoc must be installed. A lot of details on this can be found on Pandoc’s website as well as in the knitr and rmarkdown websites and guides. Although this generic mechanism is more computationally demanding than standard HTML templating (e.g. using brew as in the previous metaseqR), the results are more standardized, cross-platform and fully reproducible.
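Whether a suitable Pandoc is visible to R can be checked before a long run. The sketch below uses rmarkdown::pandoc_available() with a minimum version of 2.0, the version suggested in the list of requirements of this vignette; it only reports the situation and does not attempt any installation.

```r
# Check for rmarkdown first, then for Pandoc >= 2.0
hasRmarkdown <- requireNamespace("rmarkdown", quietly = TRUE)
pandocOk <- hasRmarkdown && rmarkdown::pandoc_available("2.0")

if (!pandocOk) {
    message("Pandoc >= 2.0 was not found; the HTML report may not render.")
}
```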
During development, we found that knitr faces certain difficulties in our setting, that is, embedding a lot of predefined graphs in JSON format and all required libraries and data in a single HTML page. This situation led to crashes because of memory usage and, of course, very large HTML files. We resolved this by using (according to usage scenario and where the report is intended to be seen):
1. IndexedDB, accessed through the Dexie library
2. A report-specific SQLite database, queried with sql.js
Regarding case (1), IndexedDB is a modern technology for creating simple, in-browser object databases. It has several uses, but mainly it avoids the burden of synchronously loading large objects while the HTML page is being rendered. IndexedDB is supported by all modern browsers and is essentially a replacement for localStorage, which had space limitations. Dexie is a simple interface to IndexedDB. Thus, all the plot data are created and stored in Dexie for rendering when needed. This rendering method can be used when the report is viewed locally as a stand-alone document, without a web server or internet connection, as well as when it is served by a web server, and it is the default method.
Regarding case (2), all the predefined plot data are stored in a report-specific SQLite database which is then queried using sql.js. This method can be chosen when it is known that the report will be presented through a web server (e.g. Apache), since in any other case modern web browsers (except MS Edge) do not allow, by default, opening local files from an HTML page for security reasons. Also, sql.js is quite large as a library (although it is downloaded only once for recurring reports). This method produces slightly smaller files but is slightly slower. Using Dexie is the preferred and safest method for both scenarios.
In both cases, the serialized JSON used for Highcharts and jvenn plots is placed in data/reportdb.js when using Dexie or data/reportdb.sqlite when using sql.js. Experienced users can then open these files and tweak the plots as desired. The above paths are relative to the report’s location, as set by the exportWhere argument.
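The location of the plot data file can be derived from the export directory, as in the sketch below. The directory name is hypothetical, and the reportDb variable with the values "dexie" and "sqlite" is an assumption about how the back-end is selected; only the relative paths come from the text above.

```r
# Hypothetical export directory, as would be passed through exportWhere
exportWhere <- file.path(tempdir(), "metaseqr2_report")

# Assumed identifiers for the two storage back-ends
reportDb <- "dexie"

# File holding the serialized plot data, relative to exportWhere
plotDataFile <- switch(reportDb,
    dexie = file.path(exportWhere, "data", "reportdb.js"),
    sqlite = file.path(exportWhere, "data", "reportdb.sqlite"),
    stop("unknown report database type")
)

# The file exists only after a pipeline run has written the report
if (file.exists(plotDataFile)) {
    message("Plot data found in ", plotDataFile)
}
```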
The metaseqR2 report has the following sections, depending also on which diagnostic and exploration plots were requested in the run command. As plots are categorized, if no plot from a specific category is requested, that category will not appear. The categories are described below:
Summary
The Summary section is further categorized in several subsections. Specifically:
- Analysis summary: This section contains an auto-generated text that describes in detail the computational process followed and summarizes the results of each step. This text can be used as is or with slight modifications in a Methods section of an article.
- Input options: This section provides a list of the input arguments to the pipeline in a more human-readable format.
- Filtering: This section reports in detail the number of filtered genes decomposed according to the number of genes removed by each applied filter.
- Differential expression: This section reports in detail the number of differentially expressed genes for each contrast, both when using only a p-value cutoff and when using an FDR cutoff (numbers in parentheses), that is, genes passing the multiple testing correction procedure selected. These numbers are also calculated based on a simple fold change cutoff in log2 scale. Finally, when multiple algorithms are used with p-value combination, this section reports all the findings in detail per algorithm.
- Command: This section contains the command used to run the metaseqr2 pipeline for users that want to experiment.
- Run log: This section contains critical messages displayed within the R session running metaseqr2, displayed as a log.
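The counts reported in the Differential expression subsection can be illustrated with a small worked example. The p-values and fold changes below are made up, the variable names are hypothetical, and Benjamini-Hochberg is used only as one possible multiple testing correction; this is a sketch of the idea, not the pipeline's own code.

```r
# Made-up p-values and log2 fold changes for six genes (illustration only)
pValue <- c(0.001, 0.004, 0.03, 0.04, 0.20, 0.70)
log2FoldChange <- c(2.1, -1.6, 0.3, 1.2, -2.5, 0.1)

pcut <- 0.05    # significance cutoff
fcut <- 1       # absolute log2 fold change cutoff (hypothetical name)

# Multiple testing correction, here Benjamini-Hochberg as one choice
fdr <- p.adjust(pValue, method = "BH")

# Genes passing the fold change cutoff and each significance cutoff
passFold <- abs(log2FoldChange) >= fcut
nByPvalue <- sum(pValue < pcut & passFold)
nByFdr <- sum(fdr < pcut & passFold)

# Same layout as in the report: p-value count, FDR count in parentheses
cat(sprintf("%d (%d)\n", nByPvalue, nByFdr))
```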
Quality control
The Quality control section contains several interactive plots concerning the overall quality control of each sample provided as well as overall assessments. The quality control plots are the Multidimensional Scaling (MDS) plot, the Biotypes detection (Biodetection) plot, the Biotype abundance (Countsbio) plot, the Read saturation (Saturation) plot, the Read noise (ReadNoise) plot, the Correlation heatmap (Correlation), the Pairwise sample scatterplots (Pairwise) and the Filtered entities (Filtered) plot. Each plot is accompanied by a detailed description of what it depicts. Where multiple plots are available (e.g. one for each sample), a selection list at the top of the respective section allows the selection of the sample to be displayed.
Normalization
The Normalization section contains several interactive plots that can be used to inspect and assess the normalization procedure. Therefore, normalization plots are usually paired, showing the same data instance normalized and not normalized. The normalization plots are the Expression boxplots (Boxplots) plots, the GC content bias (GC bias) plots, the Gene length bias (Length bias) plots, the Within condition mean-difference (Mean-Difference) plots, the Mean-variance relationship (Mean-Variance) plot and the RNA composition (RNA composition) plot. Each plot is accompanied by a detailed description of what it depicts. Where multiple plots are available (e.g. one for each sample), a selection list at the top of the respective section allows the selection of the sample to be displayed.
Statistics
The Statistics section contains several interactive plots that can be used to inspect and explore the outcome of statistical testing procedures. The statistics plots are the Volcano plot (Volcano), the MA or Mean-Difference across conditions (MA) plot, the Expression heatmap (Heatmap) plot, the Chromosome and biotype distributions (Biodist) plot, the Venn diagram across statistical tests (StatVenn), the Venn diagram across contrasts (FoldVenn) and the Deregulogram. Each plot is accompanied by a detailed description of what it depicts. Please note that the heatmap plots show only the top percentage of differentially expressed genes, as controlled by the reportTop parameter of the pipeline. When multiple plots are available (e.g. one for each contrast), a selection list at the top of the respective section allows the selection of the plot to be displayed.
Results
The Results section contains a snapshot of the differentially expressed genes in table format with basic information about each gene and some links to external resources. Certain columns of the table are colored according to significance. Larger bars and more intense colors indicate higher significance. For example, the bar in the p_value column is larger if the gene has higher statistical significance and the fold change cell background is bright red if the gene is highly up-regulated. From the Results section, full gene lists can be downloaded in text tab-delimited format and viewed with a spreadsheet application like MS Excel. A selector at the top of the section, above the table, allows the display of different contrasts.
References
The References section contains bibliographical references regarding the algorithms used by the metaseqr2 pipeline and is adjusted according to the algorithms selected.
Genome browser tracks
metaseqR2 utilizes Bioconductor facilities to create normalized bigWig files. It also creates a link to open single stranded tracks in the genome browser and a track hub to display stranded tracks, in cases where a stranded RNA-Seq protocol has been applied. Just make sure that their output directory is served by a web server like Apache. See the main documentation for more details.
Please note that if requested, metaseqR2 will try to create tracks even with a custom organism. This is somewhat risky as
- the track generation may fail
- for heavily customized cases, you will also have to manually create .2bit files for visualization in e.g. the UCSC Genome Browser
Nevertheless, we have chosen to allow the track generation because a user often applies only slight modifications to e.g. the human genome annotation, where some elements may be manually curated or elements are added (e.g. non-annotated non-coding RNAs). Therefore, in case of custom organisms, a warning is thrown but the functionality is not turned off. Please turn it off manually if you are sure you do not want tracks. You may also use the createSignalTracks function directly.
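A possible way to express the track-related settings is sketched below. The argument names (createTracks, trackExportPath, trackInfo), the fields of the list and the URL are assumptions recalled from the package manual, so they should be verified in the man pages of metaseqr2 and createSignalTracks before use.

```r
# Hypothetical track settings; all names below are assumed
createTracks <- FALSE    # explicit opt-out, e.g. for a custom organism
trackExportPath <- file.path(tempdir(), "tracks")
trackInfo <- list(
    # TRUE when a stranded RNA-Seq protocol has been applied
    stranded = FALSE,
    # placeholder URL of the web server that serves the tracks
    urlBase = "http://www.example.org/tracks"
)

# These objects would then be given to the createTracks, trackExportPath
# and trackInfo arguments of metaseqr2
```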
List of required packages
Although this is not usually the content of a vignette, the complex nature of the package requires this list to be populated also here. Therefore, metaseqR2 would benefit from the existence of all the following packages:
- ABSSeq
- Biobase
- BiocGenerics
- BiocManager
- BiocParallel
- BiocStyle
- biomaRt
- Biostrings
- BSgenome
- corrplot
- DESeq
- DESeq2
- DSS
- DT
- EDASeq
- edgeR
- GenomeInfoDb
- GenomicAlignments
- GenomicFeatures
- GenomicRanges
- gplots
- graphics
- grDevices
- heatmaply
- htmltools
- httr
- IRanges
- jsonlite
- knitr
- limma
- log4r
- magrittr
- methods
- NBPSeq
- NOISeq
- pander
- parallel
- qvalue
- rmarkdown
- rmdformats
- RMySQL
- Rsamtools
- RSQLite
- rtracklayer
- RUnit
- S4Vectors
- stats
- stringr
- SummarizedExperiment
- survcomp
- TCC
- utils
- VennDiagram
- vsn
- zoo
A recent version of Pandoc is also required, ideally above 2.0.
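Which of the listed packages are already installed can be checked with a few lines of base R. The sketch below tests only a subset of the list for brevity; the vector can be extended with the remaining names.

```r
# A few packages from the list above; extend the vector as needed
required <- c("stats", "utils", "methods", "knitr", "rmarkdown",
    "DESeq2", "edgeR")

# system.file() returns an empty string for packages that are not installed
isInstalled <- vapply(required, function(p) {
    nzchar(system.file(package = p))
}, logical(1))

missingPkgs <- required[!isInstalled]
if (length(missingPkgs) > 0) {
    message("Missing packages: ", paste(missingPkgs, collapse = ", "))
    # They could then be installed with BiocManager::install(missingPkgs)
}
```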
Session Info
sessionInfo()
## R version 4.5.0 beta (2025-04-02 r88102)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] splines stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] pander_0.6.6 magrittr_2.0.3
## [3] htmltools_0.5.8.1 heatmaply_1.5.0
## [5] viridis_0.6.5 viridisLite_0.4.2
## [7] plotly_4.10.4 ggplot2_3.5.2
## [9] gplots_3.2.0 DT_0.33
## [11] rmdformats_1.0.4 knitr_1.50
## [13] BSgenome.Mmusculus.UCSC.mm10_1.4.3 BSgenome_1.77.0
## [15] rtracklayer_1.69.0 BiocIO_1.19.0
## [17] Biostrings_2.77.0 XVector_0.49.0
## [19] metaseqR2_1.21.0 locfit_1.5-9.12
## [21] limma_3.65.0 DESeq2_1.49.0
## [23] SummarizedExperiment_1.39.0 Biobase_2.69.0
## [25] MatrixGenerics_1.21.0 matrixStats_1.5.0
## [27] GenomicRanges_1.61.0 GenomeInfoDb_1.45.0
## [29] IRanges_2.43.0 S4Vectors_0.47.0
## [31] BiocGenerics_0.55.0 generics_0.1.3
## [33] BiocStyle_2.37.0
##
## loaded via a namespace (and not attached):
## [1] DSS_2.57.0 bitops_1.0-9 httr_1.4.7
## [4] webshot_0.5.5 RColorBrewer_1.1-3 tools_4.5.0
## [7] R6_2.6.1 HDF5Array_1.37.0 lazyeval_0.2.2
## [10] rhdf5filters_1.21.0 permute_0.9-7 withr_3.0.2
## [13] prettyunits_1.2.0 gridExtra_2.3 VennDiagram_1.7.3
## [16] preprocessCore_1.71.0 cli_3.6.4 formatR_1.14
## [19] TSP_1.2-4 labeling_0.4.3 sass_0.4.10
## [22] genefilter_1.91.0 Rsamtools_2.25.0 txdbmaker_1.5.0
## [25] FMStable_0.1-4 R.utils_2.13.0 parallelly_1.43.0
## [28] RSQLite_2.3.9 hwriter_1.3.2.1 crosstalk_1.2.1
## [31] gtools_3.9.5 dplyr_1.1.4 dendextend_1.19.0
## [34] survcomp_1.59.0 Matrix_1.7-3 interp_1.1-6
## [37] futile.logger_1.4.3 abind_1.4-8 R.methodsS3_1.8.2
## [40] lifecycle_1.0.4 yaml_2.3.10 edgeR_4.7.0
## [43] rhdf5_2.53.0 qvalue_2.41.0 SparseArray_1.9.0
## [46] BiocFileCache_2.17.0 grid_4.5.0 blob_1.2.4
## [49] crayon_1.5.3 pwalign_1.5.0 lattice_0.22-7
## [52] beachmat_2.25.0 GenomicFeatures_1.61.0 annotate_1.87.0
## [55] KEGGREST_1.49.0 EDASeq_2.43.0 pillar_1.10.2
## [58] rjson_0.2.23 log4r_0.4.4 future.apply_1.11.3
## [61] codetools_0.2-20 glue_1.8.0 ShortRead_1.67.0
## [64] data.table_1.17.0 vctrs_0.6.5 png_0.1-8
## [67] bootstrap_2019.6 gtable_0.3.6 assertthat_0.2.1
## [70] cachem_1.1.0 aroma.light_3.39.0 xfun_0.52
## [73] S4Arrays_1.9.0 prodlim_2024.06.25 survival_3.8-3
## [ reached 'max' / getOption("max.print") -- omitted 86 entries ]