Synthetic spike-in standards for RNA-seq experiments - PubMed (original) (raw)

Synthetic spike-in standards for RNA-seq experiments

Lichun Jiang et al. Genome Res. 2011 Sep.

Abstract

High-throughput sequencing of cDNA (RNA-seq) is a widely deployed transcriptome profiling and annotation technique, but questions about the performance of different protocols and platforms remain. We used a newly developed pool of 96 synthetic RNAs with various lengths, and GC content covering a 2(20) concentration range as spike-in controls to measure sensitivity, accuracy, and biases in RNA-seq experiments as well as to derive standard curves for quantifying the abundance of transcripts. We observed linearity between read density and RNA input over the entire detection range and excellent agreement between replicates, but we observed significantly larger imprecision than expected under pure Poisson sampling errors. We use the control RNAs to directly measure reproducible protocol-dependent biases due to GC content and transcript length as well as stereotypic heterogeneity in coverage across transcripts correlated with position relative to RNA termini and priming sequence bias. These effects lead to biased quantification for short transcripts and individual exons, which is a serious problem for measurements of isoform abundances, but that can partially be corrected using appropriate models of bias. By using the control RNAs, we derive limits for the discovery and detection of rare transcripts in RNA-seq experiments. By using data collected as part of the model organism and human Encyclopedia of DNA Elements projects (ENCODE and modENCODE), we demonstrate that external RNA controls are a useful resource for evaluating sensitivity and accuracy of RNA-seq experiments for transcriptome discovery and quantification. These quality metrics facilitate comparable analysis across different samples, protocols, and platforms.

PubMed Disclaimer

Figures

Figure 1.

Figure 1.

Library characteristics, ERCC quantification, and coverage. Quality control plots for a stranded ENCODE RNA-seq library of K562 cell Poly-A+ RNA with ERCC spike-ins (library 7). (A) Mismatch rate along reads mapped to all ERRC RNAs. The first 6 bp correspond to the random reverse transcription hexamer-priming site. (B) Scatter plot for sense and antisense read counts per ERCC. (C,D) Scatter plots of read counts versus mass (concentration times length) per ERCC: (C) 100% ERCC library (library 6) and (D) pool of 44 2% ERCC spike-in H. sapiens libraries (libraries 7–50). ERCC-00073 showed aberrant abundance patterns in multiple RNA-seq experiments, as did ERCC-00144 in ERCC pool 14. They may have been inaccurately quantified in our ERCC test set due to errors during the complex mixing scheme used to generate the pools, as they are also suspect in RT-PCR and array experiments on these ERCC pools (M Salit, unpubl.). (E) Scatter plot of read counts in the 100% ERCC library versus a 1% ERCC spike-in D. melanogaster library (library 5). (F) Average sequencing depth and percentage of primary sequence covered for all ERCC transcripts for real data (black) and simulated data (gray).

Figure 2.

Figure 2.

Estimation of cellular transcript abundance in a S2 cell. (A) This plot shows results from a library (library 3) made of 100 ng S2 polyA+ RNA (mRNA yield for this extraction is 0.175 pg/cell) and 5 ng of pool 15 ERCC RNAs. A linear regression of abundance estimated from RNA-seq and the known input amounts. Dashed lines represent 95% confidence intervals for the regression fit. (B) Distribution of S2 transcript abundance estimated from RNA-seq.

Figure 3.

Figure 3.

Quantification errors and biases. (A) Scatter plot of read counts for each ERCC transcript in two different libraries of human RNA-seq with 2% ERCC spike-ins (K562 A+ Repl.1 and K562 A+ Repl.2, libraries 7, 8). (B) Scatter plot of fold deviation between replicates versus read counts for a given ERCC RNA. (C,D) Read counts for two example ERCCs relative to the total number of ERCC reads across 44 different libraries (libraries 7–50) with ERCC spike-in H. sapiens RNA samples (black line), the negative binomial distribution (solid gray), and random samples (n = 44) from the negative binomial distribution (dotted gray). The observed distribution fits a negative binomial model over a Poisson model (P < 2.2 × 10−16, likelihood ratio). (E_–_G) Scatter plots of the fold deviation between observed and expected read count for each ERCC in the 100% ERCC library (library 6) compared with read count (E), GC content (F), and ERCC RNA length (G).

Figure 4.

Figure 4.

Stereotypic read density heterogeneity in ERCC RNA-seq. (A) Traces of relative coverage along ERCC-0002 in two different ENCODE libraries (libraries 7, 8). The pattern is highly reproducible (Pearson's r = 0.96). (B) Average relative coverage along all control RNAs for ERCC spiked in the H. sapiens libraries (libraries 7–50). Dashed lines represent 1 SD around the average across different libraries.

Figure 5.

Figure 5.

Sequence patterns predictive of overrepresentation in RNA-seq. Patterns in the single-end 100% ERCC library (library 6) and ENCODE strand-specific pair end libraries (libraries 7–50) based on coefficients from the glm model (Li et al. 2010) (see Methods). (A,B) Regression coefficient for each base at positions around the beginning of reads mapped to the forward (A) and reverse (B) strands of ERCC-transcripts in the unstranded 100% ERCC library (library 6). (C,D) Regression coefficient for each type of nucleotide at different relative position to the upstream (C) or downstream (D) read of read pairs mapped to ERCC in the stranded ENCODE libraries. Adenosine is treated as base level in the regression model; i.e., the coefficient for “A” is always 0, while the other coefficients represent the predicted overrepresentation due to the presence of this nucleotide at this position, relative to an adenosine.

Figure 6.

Figure 6.

Smoothing read densities. (A,B) Local read heterogeneity of a single ERCC RNA in the 100% ERCC library (library 6). Smoothing read density using the GLM linear model (B), and the more complex MART model (C; see text). (D) Variance in read depth of randomly drawn 50-bp windows from all ERCC RNAs based on an unbiased simulation, raw data, and the smoothed coverage from the sequence specific models. (E) The effect of coverage on read depth variance in simulated data. For the most abundant quartile of transcripts (Q1, mean coverage >19.9), the ratio of the read depth of 50-bp windows to the average depth is between 0.2853 and 1.2360. For Q2, (mean coverage >1.5), inner quartile range for the ratio is between 0.1917 and 1.2540. For Q3, (mean coverage >0.09), it is 0 with moderate large outliers (>21). For Q4 (mean coverage <0.08), it is 0 with very large outliers (>400).

Similar articles

Cited by

References

    1. Agarwal A, Koppstein D, Rozowsky J, Sboner A, Habegger L, Hillier LW, Sasidharan R, Reinke V, Waterston RH, Gerstein M 2010. Comparison and calibration of transcriptome data from RNA-Seq and tiling arrays. BMC Genomics 11: 383. - PMC - PubMed
    1. Baker SC, Bauer SR, Beyer RP, Brenton JD, Bromley B, Burrill J, Causton H, Conley MP, Elespuru R, Fero M, et al. 2005. The External RNA Controls Consortium: a progress report. Nat Methods 2: 731–734 - PubMed
    1. Berezikov E, Robine N, Samsonova A, Westholm JO, Naqvi A, Hung JH, Okamura K, Dai Q, Bortolamiol-Becet D, Martin R, et al. 2011. Deep annotation of Drosophila melanogaster microRNAs yields insights into their processing, modification, and emergence. Genome Res 21: 203–215 - PMC - PubMed
    1. Bloom JS, Khan Z, Kruglyak L, Singh M, Caudy AA 2009. Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays. BMC Genomics 10: 221 doi: 10.1186/1471-2164-10-221 - PMC - PubMed
    1. Bogdanova EA, Shagin DA, Lukyanov SA 2008. Normalization of full-length enriched cDNA. Mol Biosyst 4: 205–212 - PubMed

Publication types

MeSH terms

Substances

LinkOut - more resources