Normalization, testing, and false discovery rate estimation for RNA-sequencing data

Jun Li et al. Biostatistics. 2012 Jul.

Abstract

We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.

Figures

Fig 1.

Pipeline for a typical RNA-Seq experiment. First, mRNA is randomly fragmented into small pieces, which are reverse transcribed into a cDNA library by random priming. The cDNA library is then amplified via PCR and sequenced, producing a list of reads. These reads are mapped to a known transcriptome consisting of m genes, and the number of reads mapping to each gene is used as a measure of gene expression. Thus, the results of an RNA-Seq experiment are summarized by a vector of m counts.
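
The last step of this pipeline, turning mapped reads into a vector of m counts, can be illustrated with a minimal Python sketch. It assumes the mapping has already been done and that each read is represented simply by the name of the gene it mapped to; the function name and the toy gene names are hypothetical and do not come from the paper.

```python
from collections import Counter


def count_reads_per_gene(mapped_genes, transcriptome):
    """Return one count per gene, in the order the genes appear in `transcriptome`.

    mapped_genes  -- iterable of gene names, one entry per mapped read (assumed input format)
    transcriptome -- list of the m gene names that make up the known transcriptome
    """
    tallies = Counter(mapped_genes)
    # A Counter returns 0 for genes that received no reads.
    return [tallies[gene] for gene in transcriptome]


# Toy example: m = 3 genes, 5 mapped reads.
transcriptome = ["geneA", "geneB", "geneC"]
mapped_genes = ["geneA", "geneC", "geneA", "geneA", "geneC"]
counts = count_reads_per_gene(mapped_genes, transcriptome)
print(counts)  # [3, 0, 2]
```

Repeating this for each sequencing experiment gives one such vector per experiment, and the totals of these vectors can differ substantially, which is the normalization problem the abstract refers to.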

Fig 2.

Histograms of score statistics for simulated data with a two-class outcome. Here, we use the signed version of the score statistic for a clearer display. The permutation distribution of the non-null genes (middle) is much wider than the permutation distribution of the null genes (right), which is very similar to the true null distribution (left).

Fig 3.

FDR curves for simulated (top three panels) Poisson-distributed data and (bottom three panels) negative binomial-distributed data. The left, middle, and right panels show results from data with different types of outcome (averaged over 100 simulations). The solid curves show the true FDRs, which are the same for all three procedures; the broken curves are estimates. All curves are based on the score statistic (3.8 and 3.9) but use different methods to estimate FDRs: PoissonSeq, the usual permutation plug-in method (permutation), and the theoretical p-value method (theoretical p-value). In the Poisson case, both PoissonSeq and the theoretical p-value method give much more accurate FDR estimates than the usual permutation plug-in method. In the negative binomial case, the PoissonSeq estimate of FDR is much more accurate than the other two estimates.
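
For orientation, the "usual permutation plug-in method" that serves as a baseline in this figure can be sketched as follows. This is a generic textbook version, not the PoissonSeq estimator and not code from the paper: the function name is hypothetical, the statistics are made-up numbers, and the proportion of null genes is assumed to be 1 (no separate estimate of it is applied).

```python
def plugin_fdr(observed, permuted, threshold):
    """Permutation plug-in FDR estimate at a given threshold.

    observed  -- list of test statistics, one per gene
    permuted  -- list of lists; each inner list holds the statistics
                 recomputed after one permutation of the outcome labels
    threshold -- genes with a statistic >= threshold are called significant
    """
    called = sum(1 for t in observed if t >= threshold)
    if called == 0:
        return 0.0  # nothing is called, so there are no false discoveries
    # Average number of permuted statistics exceeding the threshold,
    # used as the estimated number of false discoveries.
    expected_false = sum(
        sum(1 for t in perm if t >= threshold) for perm in permuted
    ) / len(permuted)
    return min(1.0, expected_false / called)


# Toy data: 6 genes, 2 permutations (illustrative values only).
observed = [5.0, 4.2, 0.3, 1.1, 0.8, 3.9]
permuted = [
    [0.5, 1.2, 0.1, 4.0, 0.9, 0.4],
    [1.0, 0.2, 0.7, 0.3, 2.1, 0.6],
]
fdr = plugin_fdr(observed, permuted, threshold=3.5)
print(round(fdr, 3))  # 0.167
```

Because this estimate pools the permuted statistics of all genes, it is affected by the wide permutation distribution of the non-null genes shown in Fig 2, which is consistent with it being the least accurate estimate in the Poisson panels.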

Fig 4.

FDR curves for simulated (left) Poisson-distributed data and (right) negative binomial-distributed data. The solid curves show the true FDRs; the broken curves are estimates. These are results (averaged over 100 simulations) on data with a two-class outcome using different methods: our method (PoissonSeq), SAM applied to the square root of total-count normalized data (SAM), the method proposed by Marioni and others (2008) (LRT on Poisson), edgeR with the default total-count normalization (edgeR, total-count norm.), and edgeR with TMM normalization (edgeR, TMM norm.). In the Poisson case, only edgeR with TMM normalization and our PoissonSeq method yield accurate FDR estimates. PoissonSeq and edgeR with TMM normalization also yield much lower true FDRs than the other methods. In the negative binomial case, only the results using PoissonSeq and edgeR with TMM normalization are shown. The true FDR curves of the two methods are almost the same, while our estimate is more accurate.
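
The preprocessing behind the SAM comparison, the square root of total-count normalized data, can be sketched in a few lines of Python. This is an illustrative version only: the function name is hypothetical, rescaling every sample to the mean total count is an assumed convention that the caption does not specify, and SAM itself is not reproduced here.

```python
import math


def sqrt_total_count_normalize(counts):
    """Total-count normalize a genes-by-samples count table, then take square roots.

    counts -- list of rows (one per gene), each row a list of per-sample counts.
    Assumes every sample has at least one read, so no total is zero.
    """
    n_samples = len(counts[0])
    # Total number of reads in each sample (column sums).
    totals = [sum(row[i] for row in counts) for i in range(n_samples)]
    # Assumed convention: rescale each sample to the mean total count.
    mean_total = sum(totals) / n_samples
    return [
        [math.sqrt(row[i] * mean_total / totals[i]) for i in range(n_samples)]
        for row in counts
    ]


# Toy table: 3 genes, 2 samples; sample 2 was sequenced twice as deeply.
counts = [
    [10, 20],
    [30, 60],
    [60, 120],
]
normalized = sqrt_total_count_normalize(counts)
print([round(v, 3) for v in normalized[0]])  # [3.873, 3.873]
```

In this toy table the two samples differ only in sequencing depth, so the normalized values agree across samples. Total-count normalization is less reliable when a few highly expressed genes dominate the totals, which fits the caption's observation that the total-count-based methods give less accurate FDR estimates than PoissonSeq and edgeR with TMM normalization.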

Fig 5.

FDR curves estimated by PoissonSeq and edgeR with TMM normalization for two data sets: (left) the data set from Marioni and others (2008) and (right) the data set from 't Hoen and others (2008).

References

    1. Agresti A. Categorical Data Analysis. 2nd edition. New York: Wiley; 2002.
    2. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biology. 2010;11:R106.
    3. Baggerly KA, Deng L, Morris JS, Aldaz CM. Overdispersed logistic regression for SAGE: modelling multiple groups and covariates. BMC Bioinformatics. 2004;5:144.
    4. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B. 1995;57:289–300.
    5. Bloom JS, Khan Z, Kruglyak L, Singh M, Caudy AA. Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays. BMC Genomics. 2009;10:221.
