Cloud-scale RNA-sequencing differential expression analysis with Myrna - PubMed (original) (raw)

Cloud-scale RNA-sequencing differential expression analysis with Myrna

Ben Langmead et al. Genome Biol. 2010.

Abstract

As sequencing throughput approaches dozens of gigabases per day, there is a growing need for efficient software for analysis of transcriptome sequencing (RNA-Seq) data. Myrna is a cloud-computing pipeline for calculating differential gene expression in large RNA-Seq datasets. We apply Myrna to the analysis of publicly available data sets and assess the goodness of fit of standard statistical models. Myrna is available from http://bowtie-bio.sf.net/myrna.

PubMed Disclaimer

Figures

Figure 1

Figure 1

The Myrna pipeline. (a) Reads are aligned to the genome using a parallel version of Bowtie. (b) Reads are aggregated into counts for each genomic feature - for example, for each gene in the annotation files. (c) For each sample a normalization constant is calculated based on a summary of the count distribution. (d) Statistical models are used to calculate differential expression in the R programming language parallelized across multiple processors. (e) Significance summaries such as _P_-values and gene-specific counts are calculated and returned. (f) Myrna also returns publication ready coverage plots for differentially expressed genes.

Figure 2

Figure 2

Hapmap results. Histograms of _P_-values from six different analysis strategies applied to randomly labeled samples. In each case the _P_-values should be uniformly distributed (blue dotted line) since the labels are randomly assigned. (a) Poisson model, 75th percentile normalization. (b) Poisson model, 75th percentile included as term. (c) Gaussian model, 75th percentile normalization. (d) Gaussian model, 75th percentile included as term. (e) Permutation model, 75th percentile normalization. (f) Permutation model, 75th percentile included as term.

Figure 3

Figure 3

Hapmap _P_-values versus read depth. A plot of _P_-value versus the log base 10 of the average count for each gene using the six different analysis strategies applied to randomly labeled samples. In each case the _P_-values should be uniformly distributed between zero and one. (a) Poisson model, 75th percentile normalization. (b) Poisson model, 75th percentile included as term. (c) Gaussian model, 75th percentile normalization. (d) Gaussian model, 75th percentile included as term. (e) Permutation model, 75th percentile normalization. (f) Permutation model, 75th percentile included as term.

Figure 4

Figure 4

Scalability of Myrna. Number of worker CPU cores allocated from EC2 versus throughput measured in experiments per hour: that is, the reciprocal of the wall clock time required to conduct a whole-human experiment on the 1.1 billion read Pickrell et al. dataset [32]. The line labeled 'linear speedup' traces hypothetical linear speedup relative to the throughput for 80 processor cores.

Similar articles

Cited by

References

    1. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11:31–46. doi: 10.1038/nrg2626. - DOI - PubMed
    1. Pepke S, Wold B, Mortazavi A. Computation for ChIP-seq and RNA-seq studies. Nat Methods. 2009;6:S22–S32. doi: 10.1038/nmeth.1371. - DOI - PMC - PubMed
    1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63. doi: 10.1038/nrg2484. - DOI - PMC - PubMed
    1. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11:94. doi: 10.1186/1471-2105-11-94. - DOI - PMC - PubMed
    1. Birol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, Morin RD, Zhao Y, Hirst M, Schein JE, Horsman DE, Connors JM, Gascoyne RD, Marra MA, Jones SJ. De novo transcriptome assembly with ABySS. Bioinformatics. 2009;25:2872–2877. doi: 10.1093/bioinformatics/btp367. - DOI - PubMed

Publication types

MeSH terms

LinkOut - more resources