Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application

C Li et al. Genome Biol. 2001.

Abstract

Background: A model-based analysis of oligonucleotide expression arrays that we developed previously uses a probe-sensitivity index to capture the response characteristic of a specific probe pair and calculates model-based expression indexes (MBEI). Each MBEI has a standard error attached to it as a measure of accuracy. Here we investigate the stability of the probe-sensitivity index across different tissue types, the reproducibility of results in replicate experiments, and the use of MBEI in perfect match (PM)-only arrays.

Results: Probe-sensitivity indexes are stable across tissue types. The target gene's presence in many arrays of an array set allows the probe-sensitivity index to be estimated accurately. We extended the model to obtain expression values for PM-only arrays, and found that the 20-probe PM-only model is comparable to the 10-probe PM/MM difference model in terms of expression correlations with the original 20-probe PM/MM difference model. The MBEI method is able to extend the reliable detection limit of expression to a lower mRNA concentration. The standard errors of MBEI can be used to construct confidence intervals of fold changes, and the lower confidence bound of the fold change is a better ranking statistic for filtering genes. We can assign reliability indexes to genes in a specific cluster of interest in hierarchical clustering by resampling clustering trees. A software package, dChip, implementing many of these analysis methods is made available.

Conclusions: The model-based approach reduces the variability of low expression estimates, and provides a natural method of calculating expression values for PM-only arrays. The standard errors attached to expression values can be used to assess the reliability of downstream analysis.
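The idea of turning standard errors into a confidence interval for a fold change can be illustrated with a short Python sketch. It assumes two independent, approximately normal expression estimates and uses a Fieller-type interval for their ratio; whether this matches the authors' exact construction is an assumption, and the function name, argument names and the 90% level (z = 1.645) are chosen here for illustration only.

```python
import math


def fold_change_interval(theta1, se1, theta2, se2, z=1.645):
    """Fieller-type confidence interval for the ratio theta1 / theta2.

    Solves (theta1 - r * theta2)^2 <= z^2 * (se1^2 + r^2 * se2^2) for r.
    Returns (lower, upper), or None when the denominator estimate is not
    clearly different from zero, in which case the interval is unbounded.
    """
    a = theta2 ** 2 - (z * se2) ** 2
    b = -2.0 * theta1 * theta2
    c = theta1 ** 2 - (z * se1) ** 2
    if a <= 0:
        return None
    # The discriminant is non-negative whenever a > 0; max() guards rounding.
    root = math.sqrt(max(b * b - 4.0 * a * c, 0.0))
    return ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a))


# Hypothetical values: expression 200 (SE 10) versus 100 (SE 5).
print(fold_change_interval(200.0, 10.0, 100.0, 5.0))
```

Under these assumptions, genes could be ranked by the lower bound of the interval rather than by the point estimate, so that a large fold change with large standard errors is ranked below a moderate but precisely estimated one.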

Figures

Figure 1

Probe-sensitivity index (φ) values for probe sets. φ values estimated for probe sets (a) 6457, (b) 1248, and (c) 6571 in six array sets (shown in panels 1–6 from left to right for each probe set). φ values (constrained so that their sum of squares equals the number of probes used in each array set) are on the _y_-axis, and probe pairs are labeled 1 to 20 on the _x_-axis. The title of each panel (for example, p = 0) indicates the proportion of arrays 'present' for the target gene in the array set. Large circles represent probe-outliers identified by negativity or large SE of φ.
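The constraint mentioned in this caption comes from the multiplicative model behind MBEI, in which the PM/MM difference for array i and probe j is modeled as θ_i φ_j plus noise, with the squared φ values summing to the number of probes. A minimal Python sketch of fitting such a model by alternating least squares follows. It omits the outlier detection and standard error calculations of the full method, and the function name and iteration count are illustrative assumptions.

```python
import math


def fit_rank_one(y, n_iter=50):
    """Fit y[i][j] ~ theta[i] * phi[j] with sum(phi^2) == number of probes.

    y is a list of rows (one row per array, one column per probe pair).
    Returns (theta, phi). No outlier handling is attempted here.
    """
    n_arrays, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes
    for _ in range(n_iter):
        # Least-squares update of theta with phi held fixed.
        sq_phi = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / sq_phi
                 for i in range(n_arrays)]
        sq_theta = sum(t * t for t in theta)
        if sq_theta == 0.0:
            raise ValueError("all expression estimates are zero")
        # Least-squares update of phi with theta held fixed.
        phi = [sum(y[i][j] * theta[i] for i in range(n_arrays)) / sq_theta
               for j in range(n_probes)]
        # Rescale so that the squared phi values sum to the probe count.
        scale = math.sqrt(n_probes / sum(p * p for p in phi))
        phi = [p * scale for p in phi]
    # Final theta update so that theta is consistent with the rescaled phi.
    theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / n_probes
             for i in range(n_arrays)]
    return theta, phi
```

With noise-free input built from known θ and φ values, the sketch recovers both up to rounding. On real data the estimates depend on how outlying probes and arrays are treated, which this sketch does not address.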

Figure 2

Boxplots of average pairwise correlations of φs between two array sets. They are stratified by average lower presence proportion in two array sets (the presence proportion of a probe set is the proportion of arrays in an array set where the target gene is called 'present' by GeneChip's algorithm). The average is taken over C(6, 2) = 15 pairwise comparisons of two array sets for each probe set, and the correlation is calculated using probes that are not identified as an outlier in both array sets. The ranges of the average lower presence proportion for the six boxplots are: (0, 0.17), (0.17, 0.34), (0.34, 0.51), (0.51, 0.68), (0.68, 0.85), (0.85, 1). The title of each boxplot is the number of probe sets classified into this boxplot. Eleven probe sets with too few non-outlier probes to calculate correlations for all 15 comparisons are not included in the boxplots. The average lower presence proportion and average pairwise correlation for probe sets in Figure 1 are (a) 1, 0.95; (b) 0.93, 0.94; and (c) 0, 0.86.

Figure 3

Histogram of correlations between model-based expression values estimated using the 20-probe difference model and those estimated using different models. (a) 10-probe difference model; (b) 20-probe PM-only model; (c) 20-probe MM-only model. All comparisons are across the 21 arrays in array set 1.

Figure 4

Boxplot of correlations between θ values estimated using the 20-probe difference model and θs estimated using different models, stratified by presence proportion. (a) 10-probe difference model; (b) 20-probe PM-only model; and (c) 20-probe MM-only model. The number of presence calls for a probe set in the 21 arrays and the subpopulation size for the six boxplots are: 0–3, 4,385; 4–7, 693; 8–11, 413; 12–15, 488; 16–19, 497; and 20–21, 323. Only 6,799 probe sets that have 20 probes are used.

Figure 5

Log (base 10) expression indexes of a pair of replicate arrays (array 1 and 2 of array set 5) for different statistical methods. (a) MBEI method; (b) AD method. Only 6,695 (a) and 4,696 (b) probe sets with positive values in both arrays are used. The center line is y = x, and the flanking lines indicate the difference of a factor of two.

Figure 6

Boxplots of average absolute log (base 10) ratios between replicate arrays stratified by presence proportion for different statistical methods. (a) MBEI method; (b) AD method. The number of presence calls for a probe set in the 58 arrays for the six boxplots are: 0–9, 10–19, 20–29, 30–39, 40–49, 50–58. The title of each boxplot is the number of probe sets used for the boxplot. The average is taken over 29 replicate pairs. Log ratios are not calculated for negative expression values or expression values identified as 'array-outliers' by the MBEI method in either array of a replicate pair, and are not used to calculate the average. 744 probe sets are not included as their average absolute log ratios cannot be calculated for all the 29 pairs using either method.
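The statistic summarized in these boxplots can be sketched in a few lines of Python. The sketch assumes that each probe set comes with a list of (replicate A, replicate B) expression values, skips pairs in which either value is not positive, and ignores the separate 'array-outlier' exclusion; the function name is hypothetical.

```python
import math


def mean_abs_log10_ratio(pairs):
    """Average |log10(a / b)| over replicate pairs with positive values.

    pairs is a list of (a, b) expression values for one probe set.
    Returns None when no pair has two positive values.
    """
    ratios = [abs(math.log10(a / b)) for a, b in pairs if a > 0 and b > 0]
    if not ratios:
        return None
    return sum(ratios) / len(ratios)


# Hypothetical values: the third pair is skipped because one value is negative.
print(mean_abs_log10_ratio([(100.0, 10.0), (10.0, 100.0), (-5.0, 3.0)]))
```

A value near 0 means the replicate arrays agree closely, and a value of about 0.3 corresponds to a typical twofold discrepancy.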

Figure 7

Similar plots as in Figure 6 for another set of 30 pairs of duplicated human U95A arrays. (a) MBEI method; (b) AD method. The number of presence calls for a probe set in the 60 arrays for the six boxplots are: 0–9, 10–19, 20–29, 30–39, 40–49, 50–60. The title of each boxplot is the number of probe sets used for the boxplot.

Figure 8

Gene clustering (a) 225 filtered genes are clustered based on their expression profiles across 20 samples. Each gene's expression values are standardized to have mean 0 and SD 1 across 20 samples. Dark blue represents low expression level and dark red high expression level. We might be particularly interested in the cluster colored in blue. (b) The clustering tree after a particular resampling. Although the original 'blue' genes are scattered to various places, we can still determine where the original cluster is, using the criteria described in the text. (c) After resampling 30 times, the reliability of the genes belonging to the original cluster is indicated by the vertical gray-scale bar on the left of the blue-red picture.

Figure 9

Normalization of gene expression levels between arrays. (a) The CEL intensities (see text) of a pair of replicate arrays (arrays 11 and 12 in array set 5) are plotted against each other. The baseline array 11 (shown on the _y_-axis) is not as bright as array 12 (shown on the _x_-axis). The smoothing spline (green curve) deviates from the diagonal line y = x (blue curve), indicating the need for normalization. (b) The same plot as (a) with superimposed circles representing the invariant set, on the basis of which a piecewise linear normalization relationship is determined (black dotted line, whose _y_-coordinate is the normalized value of array 12). The normalization curve is close to the smoothing spline curve in (a) because the two arrays are replicates and all probes should be invariant. (c) After normalization (_y_-axis is the baseline array 11, and _x_-axis the normalized value of array 12), the scatterplot centers around the diagonal line and array 12 is adjusted to have overall brightness similar to that of array 11. The smoothing spline curve is also close to the diagonal line. (d) The Q-Q plot of probe intensities of array 11 and normalized array 12 shows that the probes in the two arrays have almost the same distribution.
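The normalization described in this caption can be illustrated with a simplified Python sketch. It assumes a single-pass selection of the invariant set by rank difference and a piecewise linear curve through binned medians of the invariant probes. The rank-difference threshold, the number of bins and all function names are illustrative assumptions, and the procedure in dChip may differ (for example, by selecting the invariant set iteratively).

```python
from bisect import bisect_right
from statistics import median


def ranks(values):
    """Return the 0-based rank of each value (ties broken by position)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def invariant_set(base, other, max_rank_diff=0.01):
    """Indices of probes whose ranks in the two arrays nearly agree."""
    n = len(base)
    rb, ro = ranks(base), ranks(other)
    return [i for i in range(n) if abs(rb[i] - ro[i]) <= max_rank_diff * n]


def fit_curve(base, other, idx, n_bins=10):
    """Knots (other, base) of a piecewise linear curve from binned medians."""
    pts = sorted((other[i], base[i]) for i in idx)
    size = max(1, len(pts) // n_bins)
    knots = []
    for start in range(0, len(pts), size):
        chunk = pts[start:start + size]
        knots.append((median(p[0] for p in chunk),
                      median(p[1] for p in chunk)))
    return knots


def normalize(value, knots):
    """Map an intensity from the other array onto the baseline scale."""
    if len(knots) < 2:
        raise ValueError("need at least two knots")
    xs = [k[0] for k in knots]
    # Pick the enclosing segment; the end segments are used to extrapolate.
    j = min(max(bisect_right(xs, value) - 1, 0), len(knots) - 2)
    (x0, y0), (x1, y1) = knots[j], knots[j + 1]
    if x1 == x0:
        return y0
    return y0 + (value - x0) * (y1 - y0) / (x1 - x0)
```

Because only the invariant probes determine the curve, a few strongly differentially expressed probes should have little influence on it, which is the motivation for using an invariant set instead of smoothing over all probes.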

Figure 10

Similar plots as in Figure 9 for arrays hybridized to two different samples (array 24 and 36 of array set 5). (a) CEL intensities; (b) same plot as in (a) with superimposed circles representing the invariant set; (c) after renormalization; (d) Q-Q plot of normalized probe intensities. Note that the smoothing spline in (a) is affected by several points at the lower-right corner, which might belong to differentially expressed genes. The invariant set, on the other hand, does not include these points when determining the normalization curve, leading to a different normalization relationship at the high end.
