Guidelines for RNA-Seq data analysis
Related papers
RseqFlow: workflows for RNA-Seq data analysis
Bioinformatics, 2011
We have developed an RNA-Seq analysis workflow for single-ended Illumina reads, termed RseqFlow. This workflow includes a set of analytic functions, such as quality control for sequencing data, signal tracks of mapped reads, calculation of expression levels, identification of differentially expressed genes and calling of coding SNPs. The workflow is formalized and managed by the Pegasus Workflow Management System, which maps the analysis modules onto available computational resources, automatically executes the steps in the appropriate order and supervises the whole running process. RseqFlow is available as a Virtual Machine with all the necessary software, which eliminates any complex configuration and installation steps.
Quality Control of RNA-Seq Experiments
Methods in Molecular Biology, 2014
Direct sequencing of complementary DNA (cDNA) using high-throughput sequencing technologies (RNA-seq) is widely used and allows for a more comprehensive understanding of the transcriptome than microarrays. In theory, RNA-seq should be able to precisely identify and quantify all RNA species, small or large, at low or high abundance. However, RNA-seq is a complicated, multistep process involving reverse transcription, amplification, fragmentation, purification, adaptor ligation, and sequencing. Improper operations at any of these steps could produce biased or even unusable data. Additionally, intrinsic RNA-seq biases (such as GC bias and nucleotide composition bias) and transcriptome complexity can also make data imperfect. Therefore, comprehensive quality assessment is the first and most critical step for all downstream analyses and interpretation of results. This chapter discusses the most widely used quality control metrics, including sequence quality, sequencing depth, read duplication rates (clonal reads), alignment quality, nucleotide composition bias, PCR bias, GC bias, rRNA and mitochondrial contamination, coverage uniformity, etc.
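Three of the read-level metrics named in the abstract above (sequence quality, GC content and read duplication rate) can be illustrated with a minimal sketch. It is not taken from the chapter: the function names, the Phred+33 quality encoding and the exact-match definition of a duplicate read are all assumptions made for illustration.

```python
from collections import Counter


def parse_fastq(lines):
    """Return (read_id, sequence, quality) tuples from 4-line FASTQ records."""
    rows = [line.strip() for line in lines if line.strip()]
    records = []
    for i in range(0, len(rows) - 3, 4):
        header, seq, _separator, qual = rows[i:i + 4]
        records.append((header[1:], seq, qual))
    return records


def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def duplication_rate(sequences):
    """Share of reads that are exact copies of an earlier read (assumed definition)."""
    if not sequences:
        return 0.0
    return 1 - len(Counter(sequences)) / len(sequences)


# Toy input: three reads, two of which share the same sequence.
fastq_lines = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "!!!!",
    "@read3", "GGCC", "+", "IIII",
]
reads = parse_fastq(fastq_lines)
for read_id, seq, qual in reads:
    print(read_id, gc_fraction(seq), mean_phred(qual))
print("duplication rate:", duplication_rate([seq for _, seq, _ in reads]))
```

Dedicated QC tools report these metrics per position and per library rather than per read. The sketch only shows what each number measures.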
Edward Oakeley, Jianying Li, Wenjun Bao, Simon Lin, Weimin Cai, Hong Fang, Gordon Smyth, Djork-arné Clevert, Liqing Wan, Jinhui Wang, Zhan Ye, Oliver Stegle, Roger Perkins, Elia Stupka, Paweł Łabaj, Danielle Thierry-mieg, Ana Conesa, Samir Lababidi, Peter Sykacek, Javier Santoyo
Nature Biotechnology, 2014
An Overview of RNA-seq Data Analysis
Journal of Biology and Life Science
Recent breakthroughs in high-throughput DNA sequencing have opened new arenas for transcriptome analysis, collectively named RNA-seq (RNA sequencing). RNA-seq reveals the presence and quantity of RNA in a biological sample at a specific time by using next-generation sequencing (NGS). In this review, we explore several methods applied in analyzing RNA-seq data and discuss its advantages over microarray data. Because several methods have already been established to analyze RNA-seq data, further analysis is essential to select the best one and avoid false-positive outcomes.
Transcriptome Analysis Throughout RNA-seq
Transcriptomics in Health and Disease, 2014
Differential gene expression profiling is a powerful tool to identify changes in cell or tissue transcriptomes, which allows the understanding of complex biological processes such as oncogenesis, cell differentiation and host immunological response to pathogens, among others. To date, the gold standard technique to compare gene expression profiles is microarray hybridization of an RNA preparation. In recent years, technological advances have led to a new generation of sequencing methods, which can be used to uncover the complete content of a cell transcriptome. Such deep sequencing of an RNA preparation, named RNA-seq, allows detection of virtually the complete RNA content, including low-abundance isoforms. The quantitative aspect of RNA-seq may be further exploited to detect differential gene expression based on a reference genome and gene model. In contrast to microarrays, RNA-seq may find a broader range of RNA isoforms as well as novel RNA molecules, and has been gradually replacing microarrays for differential gene expression profiling. In this chapter we describe how deep sequencing may be used to describe changes in the gene expression profile, along with its advantages and limitations.
High-throughput mRNA sample sequencing, known as RNA-seq, is a powerful approach to detect differentially expressed genes starting from millions of short sequence reads. Although several workflows have been proposed to analyze RNA-seq data, quality control of the experiment as a whole is not usually considered, thus potentially biasing the results and/or causing information loss. Experiment quality control refers to the analysis of the experiment as a whole, prior to any other analysis. It inspects not only the presence of technical effects, but also whether general biological assumptions are fulfilled. In this sense, multivariate approaches are crucial for this task. Here, a multivariate approach for quality control in RNA-seq experiments is proposed. This approach uses simple yet effective well-known statistical methodologies. In particular, Principal Component Analysis was successfully applied to real data to detect and remove outlier samples. In addition, traditional multivariate exploration tools were applied in order to assess several controls that can help to ensure the quality of the results. Based on differential expression and functional enrichment analysis, it is demonstrated that information retrieval is significantly enhanced through experiment quality control. Results show that the proposed multivariate approach increases the information obtained from RNA-seq data after removal of outlier samples.
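The idea of using Principal Component Analysis to spot outlier samples can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the log2(count + 1) transform, the power-iteration estimate of the first principal component, the median/MAD rule with a 3.5 cutoff, and the toy count matrix are all choices made here for illustration.

```python
import math
from statistics import median


def log_transform(counts):
    """Apply log2(count + 1) to each sample row (assumed transform)."""
    return [[math.log2(c + 1) for c in row] for row in counts]


def center_columns(matrix):
    """Subtract each gene's mean across samples."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def first_pc_scores(centered, iterations=200):
    """Sample scores on PC1, via power iteration on the sample Gram matrix."""
    n = len(centered)
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered] for ri in centered]
    vec = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            return [0.0] * n
        vec = [v / norm for v in nxt]
    eigenvalue = sum(v * sum(g * w for g, w in zip(row, vec)) for v, row in zip(vec, gram))
    # For a unit eigenvector u of X X^T with eigenvalue s, the PC1 scores are sqrt(s) * u.
    return [v * math.sqrt(max(eigenvalue, 0.0)) for v in vec]


def flag_outliers(scores, threshold=3.5):
    """Flag scores whose modified z-score (median/MAD based) exceeds the threshold."""
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    if mad == 0.0:
        return [False] * len(scores)
    return [0.6745 * abs(s - med) / mad > threshold for s in scores]


# Hypothetical toy matrix: rows are samples, columns are genes.
counts = [
    [100, 200, 50, 10, 500],
    [110, 190, 55, 12, 480],
    [95, 210, 48, 9, 510],
    [105, 205, 52, 11, 495],
    [98, 195, 51, 10, 505],
    [5, 3000, 900, 400, 2],  # deliberately aberrant sample
]
scores = first_pc_scores(center_columns(log_transform(counts)))
print(flag_outliers(scores))
```

With this toy matrix only the last sample should be flagged. In practice one would inspect several components and combine them with the other multivariate tools the abstract mentions, since a single-component rule can miss outliers that separate on later components.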
NASA GeneLab RNA-Seq Consensus Pipeline: Standardized Processing of Short-Read RNA-Seq Data
2020
Summary: With the development of transcriptomic technologies, we are able to quantify precise changes in gene expression profiles from astronauts and other organisms exposed to spaceflight. Members of NASA GeneLab and GeneLab-associated analysis working groups (AWGs) have developed a consensus pipeline for analyzing short-read RNA-sequencing data from spaceflight-associated experiments. The pipeline includes quality control, read trimming, mapping, and gene quantification steps, culminating in the detection of differentially expressed genes. This data analysis pipeline and the results of its execution using data submitted to GeneLab are now all publicly available through the GeneLab database. We present here the full details and rationale for the construction of this pipeline in order to promote transparency, reproducibility and reusability of pipeline data, to provide a template for data processing of future spaceflight-relevant datasets, and to encourage cross-analysis of data from ...
BISR-RNAseq: an efficient and scalable RNAseq analysis workflow with interactive report generation
BMC Bioinformatics, 2019
Background: RNA sequencing has become an increasingly affordable way to profile gene expression patterns. Here we introduce a workflow implementing several open-source software tools that can be run in a high-performance computing environment. Results: The workflow was developed as a tool by the Bioinformatics Shared Resource Group (BISR) at the Ohio State University. We applied the pipeline to a few publicly available RNAseq datasets downloaded from GEO in order to demonstrate the feasibility of this workflow. Source code is available here: workflow: https://code.bmi.osumc.edu/gadepalli.3/BISR-RNAseq-ICIBM2019 and shiny: https://code.bmi.osumc.edu/gadepalli.3/BISR_RNASeq_ICIBM19. An example dataset is demonstrated here: https://dataportal.bmi.osumc.edu/RNA_Seq/. Conclusion: The workflow allows for the analysis (alignment, QC, gene-wise counts generation) of raw RNAseq data and seamless integration of quality analysis and differential expression results into a configurable R shiny web application.
Genomics Data, 2014
Massively parallel DNA sequencing combined with chromatin immunoprecipitation and a large variety of DNA/RNA-enrichment methodologies is at the origin of data resources of major importance. Indeed these resources, available for multiple genomes, represent the most comprehensive catalogue of (i) cell, development and signal transduction-specified patterns of binding sites for transcription factors ('cistromes') and for transcription and chromatin modifying machineries and (ii) the patterns of specific local post-translational modifications of histones and DNA ('epigenome') or of regulatory chromatin binding factors. In addition, (iii) resources specifying chromatin structure alterations are emerging. Importantly, these types of "omics" datasets increasingly populate public repositories and provide highly valuable resources for the exploration of general principles of cell function in a multi-dimensional genome-transcriptome-epigenome-chromatin structure context. However, data mining is critically dependent on data quality, an issue that, surprisingly, is still largely ignored by scientists and well-financed consortia, data repositories and scientific journals. So what determines the quality of ChIP-seq experiments and the datasets generated from them, and what keeps scientists from associating quality criteria with their data? In this 'opinion' we trace the various parameters that influence the quality of this type of dataset, as well as the computational efforts that have been made so far to qualify them. Moreover, we describe a universal quality control (QC) certification approach that provides a quality rating for ChIP-seq and enrichment-related assays. The corresponding QC tool and a regularly updated database, from which at present the quality parameters of more than 8000 datasets can be retrieved, are freely accessible at www.ngs-qc.org.
Statistical Issues in the Analysis of ChIP-Seq and RNA-Seq Data
Genes, 2010
The recent arrival of ultra-high throughput, next generation sequencing (NGS) technologies has revolutionized the genetics and genomics fields by allowing rapid and inexpensive sequencing of billions of bases. The rapid deployment of NGS in a variety of sequencing-based experiments has resulted in fast accumulation of massive amounts of sequencing data. To process this new type of data, a torrent of increasingly sophisticated algorithms and software tools are emerging to help the analysis stage of the NGS applications. In this article, we strive to comprehensively identify the critical challenges that arise from all stages of NGS data analysis and provide an objective overview of what has been achieved in existing works. At the same time, we highlight selected areas that need much further research to improve our current capabilities to delineate the most information possible from NGS data. The article focuses on applications dealing with ChIP-Seq and RNA-Seq.