Gene ontology analysis for RNA-seq: accounting for selection bias

Matthew D Young et al. Genome Biol. 2010.

Abstract

We present GOseq, an application for performing Gene Ontology (GO) analysis on RNA-seq data. GO analysis is widely used to reduce complexity and highlight biological processes in genome-wide expression studies, but standard methods give biased results on RNA-seq data due to over-detection of differential expression for long and highly expressed transcripts. Application of GOseq to a prostate cancer data set shows that GOseq dramatically changes the results, highlighting categories more consistent with the known biology.

Figures

Figure 1

Length distribution of genes in Gene Ontology categories. (a) The distribution of average gene lengths in GO categories on a log10 scale. The GO category gene length is given by the median length of the genes within the category. (b) _P_-values for the two-sided Mann-Whitney U test comparing the median length of genes in a GO category with the overall distribution of genes for 7,873 GO categories. The excess of low _P_-values shows that there are many GO categories that contain a set of significantly long or short genes.
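The test in panel (b) can be illustrated with a minimal sketch, assuming a normal approximation with tie correction and no continuity correction (the caption does not say which variant was used). The function name and inputs are hypothetical; `x` would hold the gene lengths in one GO category and `y` the lengths of the comparison set.

```python
from math import erfc, sqrt

def mann_whitney_two_sided(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, approximate two-sided p-value).
    Assumption: no continuity correction; ties get midranks.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a shift
    z = (u_x - mean_u) / sqrt(var_u)
    return u_x, erfc(abs(z) / sqrt(2.0))
```

Running this once per category gives one P-value per category; a histogram of those values is the kind of plot the caption describes.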

Figure 2

Differential expression as a function of gene length and read count. (a) The proportion of DE genes is plotted as a function of the transcript length. Each point represents the percentage of genes called DE in a bin of 300 genes plotted against the median gene length for the bin. The green line is the probability weighting function obtained by fitting a monotonic cubic spline with 6 knots to the binary data. A clear trend towards more detected differential expression at longer gene lengths can be seen. (b) The same, except instead of transcript length, the total number of reads for each gene was used. Again, a trend towards more DE for genes with more reads can be seen. Note the greater range of probabilities compared to (a).
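The binned points in this figure can be reproduced with a short sketch. It covers only the binning step (fraction of DE genes per bin against the bin's median length), not the monotonic spline fit. The function name and argument layout are assumptions made for illustration.

```python
from statistics import median

def binned_de_proportion(lengths, is_de, bin_size=300):
    """Return a list of (median length, fraction DE) pairs, one per bin.

    lengths: gene (or transcript) lengths, one per gene
    is_de:   0/1 flags marking genes called differentially expressed
    """
    # Sort gene indices by length so each bin holds genes of similar length.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    points = []
    for start in range(0, len(order), bin_size):
        idx = order[start:start + bin_size]  # the last bin may be smaller
        bin_median = median(lengths[i] for i in idx)
        frac_de = sum(is_de[i] for i in idx) / len(idx)
        points.append((bin_median, frac_de))
    return points
```

Passing total read counts instead of lengths gives the points for panel (b).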

Figure 3

Change in Gene Ontology category rank between the standard and GOseq methodologies. (a) Change in rank of GO categories going from the hypergeometric method to GOseq correcting for length bias plotted against the log of the average gene length of the category. (b) Change in rank of GO categories going from the hypergeometric method to GOseq correcting for total read count plotted against the log of the average number of counts of each gene in the category. A trend for the standard method to underestimate significance for GO categories containing short (or lowly expressed) genes and overestimate significance for GO categories containing long (or highly expressed) genes can be clearly seen.
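A rank change of this kind can be computed as in the following sketch. The sign convention (corrected rank minus standard rank, so a negative value means the category moved up after correction) is an assumption, as are the function names and the use of category name as a tie-breaker.

```python
def rank_categories(p_values):
    """Map each category to its rank (1 = smallest p-value)."""
    # Ties are broken by category name so the ranking is deterministic.
    ordered = sorted(p_values, key=lambda cat: (p_values[cat], cat))
    return {cat: rank for rank, cat in enumerate(ordered, start=1)}

def rank_change(p_standard, p_corrected):
    """Rank under the corrected method minus rank under the standard method."""
    r_std = rank_categories(p_standard)
    r_cor = rank_categories(p_corrected)
    # Only categories scored by both methods can be compared.
    return {cat: r_cor[cat] - r_std[cat] for cat in r_std if cat in r_cor}
```

Plotting these differences against the log of each category's average gene length (or average read count) gives the axes described in the caption.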

Figure 4

Comparison of GOseq and the standard hypergeometric methods. (a) The _P_-values generated with GOseq using the Wallenius approximation and the standard hypergeometric method are plotted against the _P_-values calculated with GOseq using random sampling (200,000 repeats). The Wallenius method (green crosses) shows good agreement with the high resolution random sampling. A large discrepancy in _P_-values is seen between GOseq and the hypergeometric method. (b) The number of discrepancies between lists is shown for a given list size. The black line compares GOseq using high resolution sampling with the hypergeometric method. The red line compares GOseq using high resolution sampling with the Wallenius approximation. Again, GOseq using the Wallenius method shows little difference from GOseq using the random sampling method, while the hypergeometric method shows a large number (approximately 20%) of discrepancies.
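To make the comparison concrete, the sketch below contrasts the standard hypergeometric tail probability with a sampling-based null. The sampling scheme (drawing as many genes as were called DE, without replacement, with per-gene weights) is an assumption about how such a null could be built, not a statement of the GOseq implementation. The function names, the default repeat count, and the seed are likewise hypothetical.

```python
import random
from math import comb

def hypergeom_pvalue(n_genes, n_in_cat, n_de, n_de_in_cat):
    """P(X >= n_de_in_cat) when n_de genes are drawn uniformly without replacement."""
    upper = min(n_in_cat, n_de)
    tail = sum(comb(n_in_cat, k) * comb(n_genes - n_in_cat, n_de - k)
               for k in range(n_de_in_cat, upper + 1))
    return tail / comb(n_genes, n_de)

def sampling_pvalue(weights, in_cat, n_de, n_de_in_cat, repeats=2000, seed=0):
    """Estimate the same tail probability when genes have unequal weights.

    weights: positive per-gene weights (e.g. a length-dependent probability)
    in_cat:  0/1 flags marking membership in the category under test
    """
    rng = random.Random(seed)
    n = len(weights)
    hits = 0
    for _ in range(repeats):
        # Efraimidis-Spirakis keys: taking the n_de largest values of u**(1/w)
        # is a weighted draw without replacement.
        keys = [rng.random() ** (1.0 / w) for w in weights]
        chosen = sorted(range(n), key=keys.__getitem__, reverse=True)[:n_de]
        if sum(in_cat[i] for i in chosen) >= n_de_in_cat:
            hits += 1
    return hits / repeats
```

With equal weights the two functions should agree up to sampling noise. Giving category genes larger weights raises the sampling-based P-value, which is the direction of correction expected for categories of long or highly expressed genes.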

Figure 5

A comparison of Gene Ontology analysis using RNA-seq and microarrays on the same samples. The fraction of GO categories identified from RNA-seq data that overlap with the microarray GO analysis is shown as a function of the number of categories selected. RNA-seq data have been analyzed using GOseq and hypergeometric methods. The GOseq categories have a consistently higher overlap with the microarray GO categories than the standard method.
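An overlap curve of this kind can be sketched as follows, assuming each analysis yields a list of categories ordered from most to least significant. The function names and the placeholder category labels in the usage note are hypothetical.

```python
def top_k_overlap(ranked_a, ranked_b, k):
    """Fraction of the top-k categories in ranked_a also in the top k of ranked_b."""
    top_a = set(ranked_a[:k])
    top_b = set(ranked_b[:k])
    return len(top_a & top_b) / k

def overlap_curve(ranked_a, ranked_b, max_k):
    """Overlap fraction for every list size from 1 to max_k."""
    return [top_k_overlap(ranked_a, ranked_b, k) for k in range(1, max_k + 1)]
```

For example, `overlap_curve(["c1", "c2", "c3"], ["c2", "c1", "c3"], 3)` evaluates the overlap at list sizes 1, 2 and 3. The number of discrepancies at a given list size is `k` minus the size of the overlap.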

Figure 6

Change in most significant categories when correcting for total read count bias or length bias versus the standard method. A plot of the number of discrepancies in the most significant GO categories generated using different methods. This plot compares the length bias correcting version of GOseq to the standard hypergeometric method (green line) and the total read count bias correcting version of GOseq to the standard hypergeometric method (black line).
