PRADA: pipeline for RNA sequencing data analysis

Journal Article

Wandaliz Torres-García†, Siyuan Zheng†, Andrey Sivachenko†, Rahulsimham Vegesna, Qianghu Wang, Rong Yao, Michael F. Berger, John N. Weinstein, Gad Getz* and Roel G.W. Verhaak*

1 Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, 2 The Eli and Edythe L. Broad Institute of Harvard University and MIT, Cambridge, MA 02142 and 3 Department of Pathology, Memorial Sloan-Kettering Cancer Center, New York, NY 10015, USA

*To whom correspondence should be addressed.

† The authors wish it to be known that, in their opinion, the first three authors should be regarded as Joint First Authors.

Associate Editor: Gunnar Ratsch

Revision received: 21 February 2014

Wandaliz Torres-García, Siyuan Zheng, Andrey Sivachenko, Rahulsimham Vegesna, Qianghu Wang, Rong Yao, Michael F. Berger, John N. Weinstein, Gad Getz, Roel G.W. Verhaak, PRADA: pipeline for RNA sequencing data analysis, Bioinformatics, Volume 30, Issue 15, August 2014, Pages 2224–2226, https://doi.org/10.1093/bioinformatics/btu169

Abstract

Summary: Technological advances in high-throughput sequencing necessitate improved computational tools for processing and analyzing large-scale datasets in a systematic, automated manner. For that purpose, we have developed PRADA (Pipeline for RNA-Sequencing Data Analysis), a flexible, modular and highly scalable software platform that performs multifaceted analysis starting from raw paired-end RNA-seq data and reports gene expression levels, quality metrics, unsupervised and supervised detection of fusion transcripts, detection of intragenic fusion variants, homology scores and fusion frame classification. PRADA uses a dual-mapping strategy that increases sensitivity and refines the analytical endpoints. PRADA has been used extensively and successfully in the glioblastoma and renal clear cell projects of The Cancer Genome Atlas program.

Availability and implementation: http://sourceforge.net/projects/prada/

Contact: gadgetz@broadinstitute.org or rverhaak@mdanderson.org

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Transcriptome sequencing provides insights into the quantity, structure and composition of RNA molecules in a biological sample. Tools for the analysis of RNA sequencing data are available (Kim and Salzberg, 2012; McPherson et al., 2011), but they generally focus on single end points, such as quantitation of expression levels or identification of fusion transcripts. As the technology becomes more accessible, there is an increased need for computational pipelines that can process large numbers of raw RNA-sequencing datasets quickly, accurately and comprehensively. For that purpose, we have developed PRADA (Pipeline for RNA-Sequencing Data Analysis). PRADA was designed to be modular in the functional sense that different modules output different types of information on the transcripts. It supports resource management systems such as LSF and PBS, allowing quick scale-up for processing of thousands of RNA-seq samples.

2 METHODS

PRADA was designed for processing paired-end sequencing data in FASTQ format, Sequence Alignment/Map (SAM) format or the compressed binary version of SAM (BAM) (Li et al., 2009). The processing module applies an alignment strategy in which reads are mapped to a combined genome and transcriptome reference, allowing reads to align to known transcript sequences, including exon junctions, and to unannotated mRNAs. The mapping strategy has previously been described in Berger et al. (2010). The appropriate reference files are available for download at http://bioinformatics.mdanderson.org/Software/PRADA/. This strategy retrieves all best alignments per read from the dual reference file using BWA (Li and Durbin, 2009). After initial mapping, the alignments of reads that map to multiple locations (both transcriptomic and genomic) are collapsed into single genomic coordinates, including reads that span exon junctions. Once mapped, reads are filtered out if their best placements map to multiple genomic coordinates. Quality scores are recalibrated using the Genome Analysis Toolkit (GATK) framework (McKenna et al., 2010), index files are generated using Samtools (Li et al., 2009) and duplicate reads are flagged using Picard (http://picard.sourceforge.net/).
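
The collapsing of transcriptomic and genomic placements into genomic coordinates can be illustrated with a minimal sketch. It is not taken from the PRADA source code: the `Exon` type and the `transcript_to_genome` and `collapse` helpers are hypothetical names, and the sketch assumes forward-strand transcripts described by 1-based, inclusive exon intervals.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Exon(NamedTuple):
    start: int  # 1-based genomic start, inclusive
    end: int    # 1-based genomic end, inclusive


def transcript_to_genome(exons: List[Exon], tpos: int) -> int:
    """Map a 1-based position in a spliced forward-strand transcript to the genome."""
    remaining = tpos
    for exon in exons:
        length = exon.end - exon.start + 1
        if remaining <= length:
            return exon.start + remaining - 1
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")


def collapse(
    placements: List[Tuple[str, int, Optional[List[Exon]]]]
) -> Set[Tuple[str, int]]:
    """Collapse the placements of one read into a set of genomic loci.

    Each placement is (chromosome, position, exons); exons is None when the
    placement is already expressed in genomic coordinates.
    """
    loci = set()
    for chrom, pos, exons in placements:
        genomic = pos if exons is None else transcript_to_genome(exons, pos)
        loci.add((chrom, genomic))
    return loci


# Toy transcript with two exons of 100 bases each.
exons = [Exon(100, 199), Exon(300, 399)]

# A genomic placement and a transcriptomic placement of the same read.
same_locus = collapse([("chr1", 150, None), ("chr1", 51, exons)])

# Two placements that remain distinct after collapsing.
two_loci = collapse([("chr1", 150, None), ("chr1", 120, exons)])

print(same_locus, len(two_loci))
```

In this sketch a read whose placements collapse to one locus would be retained, whereas a read that still occupies more than one genomic locus would be a candidate for filtering.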

For expression and quality control metrics, PRADA’s expression module calls the Java executable of RNA-SeQC (DeLuca et al., 2012). RNA-SeQC is a publicly available tool that produces data quality metrics of three types: mapped read counts, coverage and correlation. The read count metrics include total number of reads, duplicates, uniquely mapped reads and reads per kilobase per million mapped reads (RPKM). The coverage metrics include GC bias, 3′/5′ bias and mean number of bases per read. Expression correlation is reported when multiple samples are analyzed.
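
For reference, the standard definition of RPKM can be written out as a short sketch. The `rpkm` function below is an illustrative helper, not part of PRADA or RNA-SeQC, and it does not reproduce the read-counting rules that RNA-SeQC applies before normalization.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # Dividing by (length / 1e3) and by (total / 1e6) is the same as multiplying by 1e9.
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# 500 reads on a 2 kb gene in a library of 25 million mapped reads.
value = rpkm(500, 2000, 25_000_000)
print(value)  # 10.0
```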

The fusion module aims to detect chimeric transcripts through identification of discordant read pairs and fusion-spanning reads. Discordant read pairs are paired read-ends that map uniquely (i.e. mapping quality equal to 37) to different protein-coding genes with orientation consistent with formation of a sense–sense chimera. Mitochondrial genes and clone IDs are ignored. If a read maps to overlapping genes, the genes are split up as two different instances. Further evidence for transcript fusion is sought through evaluation of putative fusion junction spanning reads. They are detected in PRADA by the construction of a sequence database that holds all possible exon–exon junctions that match the 3′ end of one gene fused to the 5′ end of a second gene. All hypothetical exon junctions are created using the Ensembl transcriptome reference. Then, unmapped reads are aligned to the database of hypothetical exon junctions. Only reads of which the mate pair maps to either of the two fusion partner genes are considered. Each fusion is annotated by sample name, 5′ and 3′ gene name, chromosome location and blastn homology scores (see below).
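
The discordant read pair criteria described above can be sketched as follows. This is a simplified illustration rather than PRADA's implementation: the `ReadEnd` type and `is_discordant` helper are hypothetical, the orientation check for sense–sense chimeras is omitted, and mitochondrial genes are recognized by the `MT-` gene symbol prefix, which is an assumption of this sketch.

```python
from collections import Counter
from typing import NamedTuple, Optional

UNIQUE_MAPQ = 37  # mapping quality treated as unique in the text


class ReadEnd(NamedTuple):
    gene: Optional[str]  # protein-coding gene overlapped by this end, if any
    mapq: int            # mapping quality reported by the aligner


def is_discordant(end1: ReadEnd, end2: ReadEnd) -> bool:
    """True if both ends map uniquely to two different, non-mitochondrial genes."""
    if end1.gene is None or end2.gene is None:
        return False
    if end1.mapq != UNIQUE_MAPQ or end2.mapq != UNIQUE_MAPQ:
        return False
    if end1.gene.startswith("MT-") or end2.gene.startswith("MT-"):
        return False  # assumed naming convention for mitochondrial genes
    return end1.gene != end2.gene


pairs = [
    (ReadEnd("FGFR3", 37), ReadEnd("TACC3", 37)),   # discordant
    (ReadEnd("TACC3", 37), ReadEnd("FGFR3", 37)),   # discordant
    (ReadEnd("FGFR3", 37), ReadEnd("FGFR3", 37)),   # same gene
    (ReadEnd("EGFR", 20), ReadEnd("SEPT14", 37)),   # one end not unique
    (ReadEnd("MT-CO1", 37), ReadEnd("EGFR", 37)),   # mitochondrial gene
]

# Count supporting pairs per unordered gene pair.
support = Counter(
    tuple(sorted((a.gene, b.gene))) for a, b in pairs if is_discordant(a, b)
)
print(support)
```

Counting support per gene pair in this way gives the candidate list against which junction-spanning reads would then be sought.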

The supervised fusion screen module General User dEfined Supervised Search (GUESS) was developed to facilitate rapid detection of a single fusion, e.g. FGFR3-TACC3 in GBM (Singh et al., 2012). GUESS screens BAM files for the presence of discordant read pairs and fusion-spanning reads of specific genes defined by the user. We have developed two variants of GUESS, one that searches for fusion transcripts involving two given genes (GUESS-ft) and one that searches for intragenic fusions (GUESS-if), such as the EGFRvIII variant that deletes exons 2–7.
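
To make the idea of an intragenic junction search concrete, the sketch below builds hypothetical junction sequences between non-adjacent exons of one gene and tests whether a read spans one of them. It is an illustration under stated assumptions, not the GUESS-if code: the helper names, the exact-match search and the toy exon sequences are all invented for this example. For a variant such as EGFRvIII, the junction of interest would be the one joining exon 1 to exon 8.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def intragenic_junctions(exons: List[str], flank: int) -> Dict[Tuple[int, int], str]:
    """Junction sequences for every pair of non-adjacent exons.

    Keys are 1-based exon numbers (upstream, downstream); each value joins the
    last `flank` bases of the upstream exon to the first `flank` bases of the
    downstream exon.
    """
    junctions = {}
    for i, j in combinations(range(len(exons)), 2):
        if j - i > 1:  # adjacent exons form the normal splice junction
            junctions[(i + 1, j + 1)] = exons[i][-flank:] + exons[j][:flank]
    return junctions


def spans_junction(read: str, junction: str, flank: int, min_anchor: int) -> bool:
    """True if the read matches the junction exactly with enough bases on both sides."""
    start = junction.find(read)
    if start < 0:
        return False
    end = start + len(read)
    return flank - start >= min_anchor and end - flank >= min_anchor


# Toy exon sequences, invented for illustration only.
exons = ["AAAACCCC", "GGGGTTTT", "ACGTACGT", "TTGGCCAA"]
junctions = intragenic_junctions(exons, flank=4)

spanning = spans_junction("CCACG", junctions[(1, 3)], flank=4, min_anchor=2)
one_sided = spans_junction("CCCC", junctions[(1, 3)], flank=4, min_anchor=2)
print(sorted(junctions), spanning, one_sided)
```

A real search would use an aligner that tolerates mismatches rather than exact substring matching, and would also require the mate of each spanning read to map to the gene in question.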

To allow filtering of homology artifacts from the results of the fusion module and GUESS-ft, the similarity of two fusion partner genes is assessed using BlastN. Metrics provided are the bit score and its associated E-value; fusion partners with an E-value of >0.001 are considered to be non-homologous.
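
The E-value cutoff can be applied to BLAST results as in the following sketch. It assumes the default 12-column tabular output of BLAST+ (`-outfmt 6`), in which the E-value and bit score are the last two columns; whether PRADA parses this particular format is not stated here, and the gene names and scores in the example lines are made up.

```python
E_VALUE_CUTOFF = 0.001  # threshold quoted in the text


def parse_hit(line: str) -> dict:
    """Parse one line of BLAST+ tabular output (-outfmt 6, default columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated columns")
    return {
        "query": fields[0],
        "subject": fields[1],
        "evalue": float(fields[10]),
        "bitscore": float(fields[11]),
    }


def is_homologous(hit: dict, cutoff: float = E_VALUE_CUTOFF) -> bool:
    """Partners are treated as non-homologous when the E-value exceeds the cutoff."""
    return hit["evalue"] <= cutoff


# Made-up example lines for two hypothetical gene pairs.
lines = [
    "GENE_A\tGENE_B\t98.50\t200\t3\t0\t1\t200\t501\t700\t1e-95\t353",
    "GENE_A\tGENE_C\t100.00\t11\t0\t0\t40\t50\t9\t19\t4.2\t21.1",
]
hits = [parse_hit(line) for line in lines]
flags = {(h["query"], h["subject"]): is_homologous(h) for h in hits}
print(flags)
```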

The frame module predicts whether a fusion transcript is in frame and thereby capable of producing a functional protein, based on the combinations of transcripts annotated in the Ensembl database for the genes involved.
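
One simple way to reason about frame preservation is sketched below. It is an assumption-laden simplification, not the frame module itself: it considers only junctions inside the coding sequence of both partners, and the function names and classification labels are hypothetical. A fusion keeps the reading frame when the number of coding bases retained from the 5′ partner and the number of coding bases that precede the junction in the intact 3′ partner leave the same remainder modulo 3. Because each gene may have several annotated transcripts, every combination is evaluated.

```python
from itertools import product
from typing import List


def in_frame(cds_bases_5p: int, cds_offset_3p: int) -> bool:
    """Compare the codon phase on both sides of the junction.

    cds_bases_5p: coding bases of the 5' partner retained upstream of the junction.
    cds_offset_3p: coding bases of the intact 3' partner that precede the junction.
    """
    return cds_bases_5p % 3 == cds_offset_3p % 3


def classify(five_prime: List[int], three_prime: List[int]) -> str:
    """Summarize frame status over all transcript combinations of the two genes."""
    if not five_prime or not three_prime:
        raise ValueError("each gene needs at least one transcript")
    results = [in_frame(a, b) for a, b in product(five_prime, three_prime)]
    if all(results):
        return "in-frame"
    if any(results):
        return "in-frame for some transcript pairs"
    return "out-of-frame"


print(classify([300], [150]))
print(classify([300, 301], [150]))
print(classify([301], [150]))
```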

3 RESULTS

3.1 The Cancer Genome Atlas unsupervised fusion results

We used PRADA to process RNA-seq data from 416 clear cell renal cell carcinoma (ccRCC) samples and 164 glioblastoma multiforme (GBM) samples from The Cancer Genome Atlas (TCGA). Among 84 predicted gene fusions in 416 ccRCCs were 5 SFPQ-TFE3 transcripts, and the overall validation rate was 85% (Cancer Genome Atlas Research Network, 2013). Fusions found in 164 GBMs (n = 229) included recurrent rearrangements such as the previously reported FGFR3-TACC3 in 2 samples and EGFR-associated fusions in 11 samples (Zheng et al., 2013). Data from whole genome sequencing, available for a subset of the GBMs, validated 41 of 49 predicted fusions (84%). A TFG-GPR128 fusion was observed in both renal and GBM samples.

3.2 Supervised detection of TFG-GPR128

A germ line copy number variant involving TFG and GPR128 has been described in human population cohorts (Jakobsson et al., 2008). Using the GUESS-ft supervised fusion search module, we evaluated the presence of TFG-GPR128 fusions in 321 TCGA tumor-adjacent normal tissues from 11 cancer types (Supplementary Table S1). The TFG-GPR128 fusion was detected at low levels in 3 of the 321 normal samples (Supplementary Table S1). The matching tumor samples of two of the three TFG-GPR128-harboring normals also expressed this fusion, corroborating its germ line status.

3.3 Correlation of RPKM values with U133A microarray expression levels

We tested the RPKM functionality of PRADA's expression module in the context of subtype classification using 164 RNA-seq samples from GBM, comparing its subtype stratification with that based on U133A array data. The comparison showed a high (80.9%) concordance rate in subtype calls for expression data generated by the two platforms (Supplementary Table S2), similar to the percentage of samples previously reported as reliably classified (Verhaak et al., 2010).

3.4 Comparison of fusion transcript detection by PRADA, deFuse and TopHat-Fusion

To evaluate PRADA fusion detection accuracy, we obtained RNA-seq data and whole genome sequencing data of the U87 glioma cell line. PRADA detected 11 fusions, 6 of which were related to DNA structural rearrangements. TopHat-Fusion (Kim and Salzberg, 2012) predicted 42 fusions, of which 12 were validated in DNA, whereas deFuse (McPherson et al., 2011) found 51 fusions, of which 12 were related to DNA lesions (Supplementary Text and Supplementary Table S3).
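
As a worked illustration of these counts, the fraction of predicted fusions with supporting DNA evidence can be computed per tool. The snippet only restates the numbers given above; the variable names are arbitrary.

```python
# (predicted fusions, fusions supported by DNA evidence) in the U87 cell line
calls = {
    "PRADA": (11, 6),
    "TopHat-Fusion": (42, 12),
    "deFuse": (51, 12),
}

fraction = {tool: supported / predicted for tool, (predicted, supported) in calls.items()}

for tool, value in fraction.items():
    print(f"{tool}: {value:.1%}")
# PRADA: 54.5%
# TopHat-Fusion: 28.6%
# deFuse: 23.5%
```

On these numbers PRADA reports fewer candidates, of which a larger share has DNA support, while the other two tools each recover 12 DNA-supported events among a longer list of predictions.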

4 DISCUSSION

The power of PRADA is based on (i) its scalability, (ii) its mapping to both transcriptome and genome, a distinctive feature of PRADA in comparison with other RNA analysis tools such as TopHat-Fusion and deFuse, which rely on alignments of partial reads to identify gene fusions, (iii) its modularity and (iv) its comprehensive repertoire of output information from the incorporated modules. It enables the user to compute multiple analytical metrics using one software package and to do so for large numbers of samples at once in a fully automated fashion. It has been tested on thousands of RNA-seq samples from a wide variety of tumor types and normal tissues in TCGA. PRADA is designed to be run out of the box with little configuration, and is compatible with PBS and LSF compute clusters. A single PRADA tarball, including binaries of the packages it relies on, a comprehensive and detailed manual, and test FASTQ/BAM files, is freely available at http://sourceforge.net/projects/prada/ and through Galaxy at http://toolshed.g2.bx.psu.edu/view/siyuan/prada.

Funding: The content is solely the responsibility of the authors and does not necessarily represent NCI/NIH. Supported in part by NCI grant number CA143883/Chapman Foundation/Dell Foundation.

Conflict of Interest: none declared.

REFERENCES

Berger et al. Integrative analysis of the melanoma transcriptome, Genome Res., 2010, vol. 20 (pg. 413-427)

Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma, Nature, 2013, vol. 499 (pg. 43-49)

DeLuca et al. RNA-SeQC: RNA-seq metrics for quality control and process optimization, Bioinformatics, 2012, vol. 28 (pg. 1530-1532)

Jakobsson et al. Genotype, haplotype and copy-number variation in worldwide human populations, Nature, 2008, vol. 451 (pg. 998-1003)

Kim and Salzberg. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts, Genome Biol., 2012, vol. 12 pg. R72

Li and Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics, 2009, vol. 25 (pg. 1754-1760)

Li et al. The sequence alignment/map format and SAMtools, Bioinformatics, 2009, vol. 25 (pg. 2078-2079)

McKenna et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data, Genome Res., 2010, vol. 20 (pg. 1297-1303)

McPherson et al. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data, PLoS Comput. Biol., 2011, vol. 7 pg. e1001138

Singh et al. Transforming fusions of FGFR and TACC genes in human glioblastoma, Science, 2012, vol. 337 (pg. 1231-1235)

Verhaak et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1, Cancer Cell, 2010, vol. 17 (pg. 98-110)

Zheng et al. A survey of intragenic breakpoints in glioblastoma identifies a distinct subset associated with poor survival, Genes Dev., 2013, vol. 27 (pg. 1462-1472)

© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
