Assessing the performance of different high-density tiling microarray strategies for mapping transcribed regions of the human genome

Comparative Study

Olof Emanuelsson et al. Genome Res. 2007 Jun.

Abstract

Genomic tiling microarrays have become a popular tool for interrogating the transcriptional activity of large regions of the genome in an unbiased fashion. There are several key parameters associated with each tiling experiment (e.g., experimental protocols and genomic tiling density). Here, we assess the role of these parameters as they are manifest in different tiling-array platforms used for transcription mapping. First, we analyze how a number of published tiling-array experiments agree with established gene annotation on human chromosome 22. We observe that the transcription detected from high-density arrays correlates substantially better with annotation than that from other array types. Next, we analyze the transcription-mapping performance of the two main high-density oligonucleotide array platforms in the ENCODE regions of the human genome. We hybridize identical biological samples and develop several ways of scoring the arrays and segmenting the genome into transcribed and nontranscribed regions, with the aim of making the platforms most comparable to each other. Finally, we develop a platform comparison approach based on agreement with known annotation. Overall, we find that the performance improves with more data points per locus, coupled with statistical scoring approaches that properly take advantage of this, where this larger number of data points arises from higher genomic tiling density and the use of replicate arrays and mismatches. While we do find significant differences in the performance of the two high-density platforms, we also find that they complement each other to some extent. Finally, our experiments reveal a significant amount of novel transcription outside of known genes, and an appreciable sample of this was validated by independent experiments.

Figures

Figure 1.

Comparing human chromosome 22 transcription data sets with gene annotation. Transcription data sets were derived from previously published studies. They were generated from three different microarray platforms: PCR (red squares), MAS (blue diamond), and Affymetrix (green circles), in a total of 15 separate experiments (tissues or cell lines), each represented by a point in the figure. We used the RefSeq annotation as a benchmark to assess the quality of the data from each experiment. (_x_-axis) The fraction of exonic probes that were identified as transcribed in individual experiments (sensitivity). (_y_-axis) The fraction of transcribed probes overlapping with an exon (PPV). The PCR tiling-array data were from placenta, fibroblast, and B-cells (Rinn et al. 2003; White et al. 2004), the MAS data from liver (Bertone et al. 2004), and the Affymetrix sets were collected from Kapranov et al. (2002), representing 11 different cell lines. (Arrow) The Affymetrix data from the U87 cell line are not representative, since a long section of chromosome 22 is identified as transcriptionally silent, suggesting either that this particular experiment did not work or that something is unusual about U87.
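The two axes of Figure 1 can be illustrated with a minimal sketch. It assumes each probe carries two boolean flags (called transcribed in the experiment, overlapping an annotated exon); the function name and the toy data are hypothetical and are not taken from the paper.

```python
def sensitivity_and_ppv(transcribed, exonic):
    """Probe-level sensitivity and PPV against an exon annotation.

    transcribed, exonic: equal-length sequences of booleans, one per probe.
    (Hypothetical input layout, chosen for illustration only.)
    """
    if len(transcribed) != len(exonic):
        raise ValueError("both sequences must have one entry per probe")
    both = sum(1 for t, e in zip(transcribed, exonic) if t and e)
    n_exonic = sum(1 for e in exonic if e)
    n_transcribed = sum(1 for t in transcribed if t)
    # Sensitivity: fraction of exonic probes that were called transcribed.
    sensitivity = both / n_exonic if n_exonic else float("nan")
    # PPV: fraction of transcribed probes that overlap an exon.
    ppv = both / n_transcribed if n_transcribed else float("nan")
    return sensitivity, ppv


# Toy data: six probes, three exonic, three called transcribed.
transcribed = [True, True, False, True, False, False]
exonic = [True, False, False, True, True, False]
sens, ppv = sensitivity_and_ppv(transcribed, exonic)
print(sens, ppv)
```

Each experiment in the figure corresponds to one such (sensitivity, PPV) pair.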

Figure 2.

(A) Number of nucleotides in placental TARs as a function of segmentation threshold (percentiles). TARs were generated with the maxgap/minrun algorithm based on the scored hybridization intensity data using a genomic window and technical replicates: MAS-B scored with the standard sign test (green); MAS Fwd-Rev scoring using reverse strand as “mismatch” (orange), pseudomedian; Affymetrix scored using pseudomedian from PM–MM (blue). The data points corresponding to the data sets used in the Comparison section are circled: Thresholds are 90th percentile for Affy and 91st percentile for MAS-B (sign test scoring). (_x_-axis) The percentile score threshold for calling a probe “positive.” (_y_-axis) The number of nucleotides in TARs (in megabase pairs). The dashed line corresponds to the number of nucleotides in exons in the analyzed region (1,001,238 nt). (B) Positive predictive value (PPV) versus sensitivity for three different ways of scoring and segmenting the MAS-B data, varying the segmentation threshold from 70th percentile (to the right in the figure) to 99th percentile (to the left) for the MAS-B set scored with the standard sign test (green); scored using reverse strand as “mismatch” (Fwd-Rev scoring, orange); and the result from HMM segmentation (Viterbi decoding) of sign test-scored data (gray triangle). Sensitivity (_x_-axis), defined as the percentage of bases in GENCODE exonic regions that are covered by a TAR. PPV (_y_-axis), defined as the percentage of bases in the TARs that overlap with a GENCODE exonic region. (C) PPV versus sensitivity for two different ways of scoring the placenta Affy data, using three replicates (six array features) unless otherwise stated: Wilcoxon signed rank test (blue circles), and standard sign test (using PM–MM values: cyan triangles; using PM-only values: yellow squares). The result from reducing the genomic density of the Affy array to 50% (i.e., removing the data from every second probe) is also shown using PM–MM values (three replicates: cyan triangles, dashed line; single replicate only: gray triangles, dashed line). (D) PPV versus sensitivity for MAS-B and Affy placenta data, varying the segmentation threshold from the 70th percentile (right) to the 99th percentile (left). The average results of TARs generated from raw intensities from single arrays for Affy (PM only [blue squares], and PM–MM [blue triangles; solid line actual genomic density, dashed line 50% genomic density]) and MAS-B (green squares) are plotted, as well as scored results for Affy (blue circles) and MAS-B (green circles). Sensitivity (_x_-axis), and PPV (_y_-axis), defined as above. The data points corresponding to the data sets used in the Comparison section are circled. The hatched area marks where a sensitivity of 30% is achieved for the various sets. (E) PPV for placental TAR sets when choosing a segmentation threshold that yields ∼30% sensitivity (hatched area in B). Note that the actual sensitivity varies slightly between the sets.
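The scoring and segmentation steps named in this caption can be sketched as follows. This is an illustrative reading, not the authors' implementation: the pseudomedian is taken to be the Hodges-Lehmann estimate (median of pairwise averages), the percentile uses a nearest-rank rule, and the gap is measured from the end of one positive probe to the start of the next. All function names, parameter values, and toy scores are assumptions.

```python
import statistics


def pseudomedian(values):
    """Hodges-Lehmann estimate: median of all pairwise (Walsh) averages."""
    walsh = [(a + b) / 2 for i, a in enumerate(values) for b in values[i:]]
    return statistics.median(walsh)


def percentile_threshold(scores, pct):
    """Nearest-rank percentile of the scores; pct is an integer in 1..100."""
    ordered = sorted(scores)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def maxgap_minrun(positions, scores, threshold, maxgap, minrun, probe_len):
    """Join positive probes into transcriptionally active regions (TARs).

    positions: sorted probe start coordinates; scores: one score per probe.
    A probe is positive if score >= threshold. Positive probes are merged
    while the gap between them is <= maxgap nt; merged runs shorter than
    minrun nt are discarded. Returns half-open (start, end) intervals.
    """
    tars = []
    start = end = None
    for pos, score in zip(positions, scores):
        if score < threshold:
            continue
        if start is not None and pos - end <= maxgap:
            end = max(end, pos + probe_len)  # extend the current run
        else:
            if start is not None and end - start >= minrun:
                tars.append((start, end))
            start, end = pos, pos + probe_len  # open a new run
    if start is not None and end - start >= minrun:
        tars.append((start, end))
    return tars


# Toy data: ten 25-nt probes spaced 30 nt apart (hypothetical values).
positions = list(range(0, 300, 30))
scores = [5, 9, 8, 1, 1, 7, 1, 1, 9, 9]
threshold = percentile_threshold(scores, 60)
tars = maxgap_minrun(positions, scores, threshold,
                     maxgap=40, minrun=50, probe_len=25)
print(threshold, tars)
print(pseudomedian([1, 2, 6]))
```

Raising the percentile threshold leaves fewer positive probes and therefore fewer nucleotides in TARs, which is the trend panel A plots.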

Figure 3.

TAR set agreement. (A) Overlap of TAR sets, measured in number of overlapping nucleotides (kilobases). All three placenta TAR sets (MAS-B, MAS-N, Affy) and both NB4 TAR sets (MAS-N and Affy). R is a measure of the size of the overlap. R = |∩|/|U| (calculated pairwise for the three placenta TAR sets). (B) Size of TAR set overlap, expressed in R, for comparisons within biological samples but across different array platforms (black lines), and comparisons within array platforms but across the biological samples (brown lines). Values in the leftmost column of the graph are calculated with no further constraints. Second column, only TARs overlapping with conserved regions are included. Third column, only TARs overlapping with GENCODE exons are included. Fourth column, only TARs overlapping with both conserved and exon regions are included.
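The overlap measure R = |∩|/|U| can be computed per nucleotide, as in the minimal sketch below. It assumes TARs are given as half-open (start, end) intervals on one sequence; the set-based approach is chosen for clarity rather than speed, and the names and toy intervals are hypothetical.

```python
def covered_bases(intervals):
    """Return the set of base positions covered by half-open intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def overlap_ratio(tars_a, tars_b):
    """R = |intersection| / |union|, counted in nucleotides."""
    a, b = covered_bases(tars_a), covered_bases(tars_b)
    union = a | b
    return len(a & b) / len(union) if union else float("nan")


# Toy TAR sets: 60 shared nucleotides out of 250 covered by either set.
tars_a = [(0, 100), (200, 300)]
tars_b = [(50, 150), (250, 260)]
r = overlap_ratio(tars_a, tars_b)
print(r)
```

R is 1 when two TAR sets cover exactly the same nucleotides and 0 when they share none.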

Figure 4.

Distribution of GENCODE exon coverage by placenta TARs: all exons (MAS-B, green squares, and Affy, blue squares); 5′ exons (Affy, blue circles); 3′ exons (Affy, blue triangles). (_x_-axis) The fraction to which an exon is covered by a TAR; 0.0–1.0 split up in 10 bins. (_y_-axis) The percentage of exons covered by a TAR to the fraction represented on the _x_-axis.
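A distribution of this kind can be derived as in the sketch below, which computes the covered fraction of each exon and bins the fractions into ten equal-width bins. How bin boundaries are handled (here, a fraction of exactly 1.0 goes into the last bin) is an assumption, as are the function names and the toy coordinates.

```python
def exon_coverage_fractions(exons, tars):
    """Fraction of each exon's bases covered by any TAR.

    exons, tars: lists of half-open (start, end) intervals.
    """
    tar_bases = set()
    for start, end in tars:
        tar_bases.update(range(start, end))
    return [
        sum(1 for p in range(start, end) if p in tar_bases) / (end - start)
        for start, end in exons
    ]


def coverage_histogram(fractions, n_bins=10):
    """Percentage of exons falling into each equal-width coverage bin."""
    if not fractions:
        return [0.0] * n_bins
    counts = [0] * n_bins
    for f in fractions:
        # A fraction of exactly 1.0 is assigned to the last bin (assumption).
        counts[min(int(f * n_bins), n_bins - 1)] += 1
    return [100 * c / len(fractions) for c in counts]


# Toy data: four 100-nt exons covered 100%, 50%, 0%, and 15% by TARs.
exons = [(0, 100), (200, 300), (400, 500), (600, 700)]
tars = [(0, 100), (250, 300), (685, 720)]
fractions = exon_coverage_fractions(exons, tars)
hist = coverage_histogram(fractions)
print(fractions)
print(hist)
```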

References

    1. Ashurst J.L., Chen C.K., Gilbert J.G., Jekosch K., Keenan S., Meidl P., Searle S.M., Stalker J., Storey R., Trevanion S., et al. The vertebrate genome annotation (Vega) database. Nucleic Acids Res. 2005;33:D459–D465.
    2. Bertone P., Stolc V., Royce T.E., Rozowsky J.S., Urban A.E., Zhu X., Rinn J.L., Tongprasit W., Samanta M., Weissman S., et al. Global identification of human transcribed sequences with genome tiling arrays. Science. 2004;306:2242–2246.
    3. Bertone P., Trifonov V., Rozowsky J.S., Schubert F., Emanuelsson O., Karro J., Kao M.-Y., Snyder M., Gerstein M. Design optimization methods for genomic DNA tiling arrays. Genome Res. 2006;16:271–281.
    4. Blanchette M., Kent W.J., Riemer C., Elnitski L., Smit A.F.A., Roskin K.M., Baertsch R., Rosenbloom K., Clawson H., Green E.D., et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14:708–715.
    5. Brudno M., Do C., Cooper G., Kim M.F., Davydov E., Green E.D., Sidow A., Batzoglou S. LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003;13:721–731.
