ExpressYourself: a modular platform for processing and visualizing microarray data

Abstract

DNA microarrays are widely used in biological research; by analyzing differential hybridization on a single microarray slide, one can detect changes in mRNA expression levels, increases in DNA copy numbers and the location of transcription factor binding sites on a genomic scale. Having performed the experiments, the major challenge is to process large, noisy datasets in order to identify the specific array elements that are significantly differentially hybridized. This normally requires aggregating different, often incompatible programs into a multi-step pipeline. Here we present ExpressYourself, a fully integrated platform for processing microarray data. In a completely automated fashion, it will correct the background array signal, normalize the Cy5 and Cy3 signals, score levels of differential hybridization, combine the results of replicate experiments, filter problematic regions of the array and assess the quality of individual and replicate experiments. ExpressYourself is designed with a highly modular architecture so various types of microarray analysis algorithms can readily be incorporated as they are developed; for example, the system currently implements several normalization methods, including those that simultaneously consider signal intensity and slide location. The processed data are presented using a web-based graphical interface to facilitate comparison with the original images of the array slides. In particular, ExpressYourself is able to regenerate images of the original microarray after applying various steps of processing, which greatly facilitates identification of position-specific artifacts. The program is freely available for use at http://bioinfo.mbb.yale.edu/expressyourself.

INTRODUCTION

Microarrays are widely employed, among other uses, to compare mRNA expression levels (1–5), DNA copy number (6–9) and transcription factor binding in biological samples (10–13). The concept underlying these experiments is straightforward: fluorescence-labeled nucleic acids in ‘test’ and ‘reference’ samples are probed simultaneously on a microarray slide, and their relative abundance is derived from the comparative fluorescence of the probe molecules hybridized to individual array elements. Though the technology is relatively new, several aspects of data analysis beyond the experimental stage are now well established; these include scanning the arrays to measure fluorescence intensity, quantifying the array images via densitometry algorithms (14,15), clustering similarly expressed genes (16–20) and integrating microarray data with genomic information (21–28). However, a topic still under much discussion is how to treat the raw numerical data immediately after scanning and quantifying the array images (27,29).

Data processing aims to fill this gap. In particular it serves three purposes: (i) to detect and minimize the level of noise associated with the experiments; (ii) to assess the quality of the data once the noise has been reduced; and (iii) to identify the array elements that are actually differentially hybridized.

Here we present ExpressYourself, an automated platform for processing microarray data that is freely available over the web (http://bioinfo.mbb.yale.edu/expressyourself). The software performs correction of the background array signal, normalization, scoring, combination of replicate experiments, filtering problematic regions of the array and quality assessment of hybridizations. We incorporate novel and published algorithms that are reasonable, understandable and make minimal assumptions about the data. The program can handle gene expression, chromatin immunoprecipitated DNA probings (ChIp-chip) and most comparative genomic hybridization (CGH) data. The results are clear and easy to understand, and the graphical interface allows users to compare each processing step with the original slide images.

DATA PROCESSING IN ExpressYourself

ExpressYourself processes the data in a sequential manner using the major steps shown in Figure 1A. The stages can be broadly grouped into: (i) noise reduction; (ii) quality control; and (iii) differential hybridization scoring. We demonstrate the use of ExpressYourself using the data from a ChIp-chip experiment of the HCM1 transcription factor (30).

Figure 1. (A) Flow chart of data processing. Schematic images of a microarray slide depicting different stages of data processing: (B) filtering flawed array regions, (C) the effects of background correction, and (D) outcome of normalization.

Input data

The data input to ExpressYourself comprise text files generated by image analysis software. Currently, the program recognizes files from Axon GenePix versions 2.0–4.0 (http://www.axon.com/GN_GenePixSoftware.html), Scanalyze version 2.0 (http://rana.lbl.gov/EisenSoftware.htm) and UCSF SPOT version 2.0 (15) and produces the best results if input files are left intact (i.e. no data is deleted). Multiple files are accepted and may represent information for replicate experiments. The processing steps to be applied to the data may be changed by altering the parameters at this stage.

We interpret the Cy5 and Cy3 signals of an array element as the median foreground minus background intensities for each dye ($S = I_{\text{foreground}} - I_{\text{background}}$). The foreground intensity is the fluorescence of a spot within a defined area, usually described by a circle enclosing the spot, and the background is that of the immediate area surrounding the spot, usually described by a bounding box. The level of differential hybridization at each array element is determined as the relative signal between the two dyes.
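The signal definition above can be sketched in a few lines of Python. This is only an illustration, not the program's own code (ExpressYourself is written in C and Perl): the function names are hypothetical, the pixel values are invented, and we assume the background is summarized by its median in the same way as the foreground.

```python
from statistics import median

def spot_signal(foreground_pixels, background_pixels):
    """Signal for one dye: median foreground minus median background.

    Assumption: the background is summarized by its median, like the foreground.
    """
    return median(foreground_pixels) - median(background_pixels)

def hybridization_ratio(cy5_signal, cy3_signal):
    """Relative signal between the two dyes; None if the Cy3 signal is not positive."""
    if cy3_signal <= 0:
        return None
    return cy5_signal / cy3_signal

# Invented pixel intensities for a single array element.
cy5 = spot_signal([820, 900, 860], [100, 120, 110])  # 860 - 110
cy3 = spot_signal([400, 410, 390], [90, 100, 110])   # 400 - 100
ratio = hybridization_ratio(cy5, cy3)
```

Returning None for a non-positive Cy3 signal is one possible convention; such spots could equally be flagged and handled by the filtering step.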

Noise reduction

Individual spot and regional filtering.

Technological limitations in array production and experimental techniques mean that microarray slides are often imperfect. Nearly every experiment contains individual array elements of poor quality, comprising spots that are small compared to the rest of the array, have unusual morphology (i.e. non-round), exhibit uneven hybridization (i.e. doughnut or crescent-moon patterns) or have saturated signal intensity. Most image analysis software permits users to flag such array elements manually. But with up to 40 000 spots per slide, this is very time-consuming and difficult to perform in a consistent manner. ExpressYourself automatically flags and, if necessary, removes poor quality spots. Manual flagging is therefore unnecessary, although the program will consider such flags if instructed by the user. Imperfections on the array can also extend beyond individual spots. Large dust particles, printing inconsistencies and scratches sometimes render entire regions of the array unusable. ExpressYourself detects and removes these flaws automatically (Fig. 1B).

Background correction.

Although we remove obvious defects before further processing, it is important, if possible, to correct minor imperfections confined to small areas so that we preserve the maximum amount of usable data (14,15,31,32). As mentioned above, the background signal is commonly defined as the average intensity of the immediate area surrounding each array element. Minor imperfections (very small specks, dust and scratches limited to the vicinity of a spot) often distort background signals, making them extremely variable even between adjacent spots. Therefore the aim of background correction is to reduce the local background distortions that are restricted to single array elements, while maintaining the overall variability represented by gradual changes between bright and dark regions across the slide. We overcome this problem by calculating the average background signal from a wider area, typically spanning 3×3 to 5×5 spots (31). In doing so, we minimize the contribution of minor flaws to the background signal, and we remove most of the local distortions. Figure 1C displays the effects of correcting the Cy5 background intensity. Many regions of local variability are removed, but overall variation in array intensity remains.
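A minimal sketch of the wider-area averaging, assuming the per-spot background values are laid out as a rectangular grid matching the array. The function name and the example grid are hypothetical, and the treatment of array edges (truncating the window) is our assumption rather than something the text specifies.

```python
def smooth_background(background, window=3):
    """Replace each spot's background by the mean over a window x window
    block of neighbouring spots, truncated at the edges of the array."""
    half = window // 2
    rows, cols = len(background), len(background[0])
    smoothed = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                background[i][j]
                for i in range(max(0, r - half), min(rows, r + half + 1))
                for j in range(max(0, c - half), min(cols, c + half + 1))
            ]
            smoothed[r][c] = sum(neighbours) / len(neighbours)
    return smoothed

# A dust speck inflates the background of the centre spot only.
grid = [
    [100, 100, 100],
    [100, 550, 100],
    [100, 100, 100],
]
corrected = smooth_background(grid, window=3)
```

In this toy grid the distorted centre value of 550 is pulled down to 150, at the cost of slightly raising its neighbours. On a real array the windows are larger relative to the flaw, so the spread is smaller.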

Cy5/Cy3 normalization.

Once the signal intensities have been calculated using the corrected background, we can compare the relative contributions of the Cy5 and Cy3 signals. Ideally, the signals of the two dyes should be equal for nucleic acid probes that have equal concentration in the test and reference samples (i.e. the ratio, $R = S_1/S_2$, of the two signals should approach 1 for probes hybridizing to an equal degree in both fluorescence channels). In practice, the signals can be quite different. Dyes have different molecular characteristics, hybridization to the arrays can be non-specific or incomplete, and there is spatial heterogeneity in the probing conditions across the slide. Normalization aims to compensate for these effects by applying a scale factor such that signals of probes with unchanged concentration are equal (29,31,33–39). The signals of the remaining array elements are scaled relative to the baseline set for the constant probes. Figure 1D shows a schematic of the example array before and after it has been normalized.

A major issue in microarray normalization lies in defining the set of constant probes and this is reflected in the many approaches that have been published, including the use of house-keeping genes, spiked controls and total nucleic acid concentrations (see 31 for an overview). We prefer to use the ‘constant majority’ method, which assumes that the majority of probes do not change in concentration. The method is generally applicable to many experiments as it is valid even in cases where up to 50% of probes have altered concentrations, does not require prior knowledge of which probes remain constant and allows for intensity and spatial considerations (see below).

At its simplest, the method calculates the scale factor from the robust mean of all $S_1/S_2$ ratios, i.e. the distribution of all ratios is transformed so that it centers about 1. However, two particular issues must be addressed: signal intensity and array position. First, because the two dyes differ in fluorescent properties, the bias in ratios often depends on the signal intensity (29,37–40). Therefore, different scale factors must be used for array elements at different intensities. Second, the positional issue is due to differences in hybridization conditions across the slide (31,35,40), and it is common to observe array images in which hybridization of entire regions is dominated by one dye. Thus different scale factors must be used for different regions of the physical slide. To determine scale factors in each situation we employ local regression to determine a ‘best fit’ for the data, using the LOWESS and LOESS packages (41–43). For the intensity dependence, we calculate the local mean intensity ratio as it varies over a range of signals in two dimensions (Cy5 versus Cy3). For the positional dependence, we determine the mean ratio as it varies across the surface of the microarray slide by fitting a three-dimensional curve to the data points.
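The simplest form of the ‘constant majority’ method, a single global scale factor, might look like the sketch below. It is an illustration under stated assumptions: the median of the log ratios stands in for the ‘robust mean’ (the text does not say which robust estimator is used), the function names are hypothetical and the signals are invented. The intensity-dependent and position-dependent variants would replace the single factor with one obtained from a local regression fit, which is not shown here.

```python
import math
from statistics import median

def global_scale_factor(cy5, cy3):
    """Scale factor that centres the Cy5/Cy3 ratio distribution on 1.

    Assumption: the median of log2 ratios is used as the robust centre.
    """
    log_ratios = [math.log2(a / b) for a, b in zip(cy5, cy3) if a > 0 and b > 0]
    return 2 ** median(log_ratios)

def normalize(cy5, cy3):
    """Divide the Cy5 signals by the scale factor; Cy3 is left unchanged."""
    k = global_scale_factor(cy5, cy3)
    return [a / k for a in cy5], list(cy3)

# Four probes with a constant two-fold dye bias and one genuinely changed probe.
cy5 = [200.0, 400.0, 800.0, 1600.0, 6400.0]
cy3 = [100.0, 200.0, 400.0, 800.0, 400.0]
norm_cy5, norm_cy3 = normalize(cy5, cy3)
```

After scaling, the four constant probes have a ratio of 1 and the changed probe retains an eight-fold difference relative to that baseline. Because the median is used, the changed probe does not influence the scale factor.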

Replicate array scaling.

Many array experiments are conducted in replicates; however, differences in sample concentrations, probing conditions and scanner settings mean that the range of signal intensities can be quite variable. Prior to combining replicate experiments, we calculate the robust standard deviations of signals in each experiment and scale each so the widths of signal distributions are equal.
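As a rough sketch of this scaling step, the code below equalizes the widths of two replicate distributions. Several details are assumptions on our part: the robust standard deviation is taken as the median absolute deviation multiplied by 1.4826, values are rescaled about each replicate's median, and the common target width is the mean of the individual widths. All names and numbers are hypothetical.

```python
from statistics import median

def robust_sd(values):
    """Median absolute deviation scaled by 1.4826, which matches the
    standard deviation for normally distributed data."""
    centre = median(values)
    return 1.4826 * median([abs(v - centre) for v in values])

def scale_replicates(replicates):
    """Rescale each replicate about its median so that all replicates
    share the same robust width (here, the mean of the original widths)."""
    widths = [robust_sd(values) for values in replicates]
    if min(widths) <= 0:
        raise ValueError("a replicate has zero robust width")
    target = sum(widths) / len(widths)
    scaled = []
    for values, width in zip(replicates, widths):
        centre = median(values)
        scaled.append([centre + (v - centre) * target / width for v in values])
    return scaled

# Two replicates measuring the same spots, the second with twice the spread.
rep_a = [-1.0, -0.5, 0.0, 0.5, 1.0]
rep_b = [-2.0, -1.0, 0.0, 1.0, 2.0]
scaled_a, scaled_b = scale_replicates([rep_a, rep_b])
```

After scaling, both replicates span the same range, so neither dominates when the replicates are combined.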

Quality control

It is useful to have an objective measure of data quality (Fig. 1A) (31,33,39,44,45). Firstly, it allows the user to see how well the experiment has performed as this is not usually obvious from visual inspection of the slide images. Secondly, it assesses the degree to which the noise reduction steps have improved the data. Finally, by identifying the most serious problems, the user can modify future experiments to improve results. Here we introduce some of the data quality measures that we have incorporated into ExpressYourself to date.

Percentage of good quality array elements.

The simplest quality metric is a basic calculation of the proportion of array elements and regions the filtering process has removed; the larger the proportion, the poorer the quality of experiment. By breaking down the numbers according to error type (e.g. spot diameter, homogeneity, saturation), we can determine the defective properties that are most problematic for a given array.
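The bookkeeping behind this metric is straightforward. The sketch below assumes each spot carries either no flag or a single error-type label; the labels and function name are hypothetical and do not reflect the program's actual output format.

```python
from collections import Counter

def quality_summary(spot_flags):
    """Percentage of good spots and percentage removed per error type.

    spot_flags holds one entry per spot: None for a good spot,
    otherwise a label naming the reason it was filtered.
    """
    total = len(spot_flags)
    errors = Counter(flag for flag in spot_flags if flag is not None)
    good_pct = 100.0 * (total - sum(errors.values())) / total
    by_type = {name: 100.0 * count / total for name, count in errors.items()}
    return good_pct, by_type

# Ten spots, three of which were filtered.
flags = [None, None, "saturation", None, "diameter",
         None, None, "saturation", None, None]
good_pct, by_type = quality_summary(flags)
```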

Intra-array hybridization quality.

Many microarrays are designed with spots printed in duplicate, side-by-side. We gauge the consistency of hybridizations within the array by measuring the difference in signals between these duplicates [e.g. $D = (R_{\text{dup1}} - R_{\text{dup2}})/(R_{\text{dup1}} + R_{\text{dup2}})$]. The mean of $D^2$, $\langle D^2 \rangle$, then summarizes the consistency of hybridization within the array. Since we expect $\langle D \rangle = 0$, $\mathrm{Var}(D) = \sum_i (D_i - 0)^2 / N = \langle D^2 \rangle$, so the consistency of hybridization can also be visualized as the width of the distribution of $D$.
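The duplicate-spot score can be computed directly from pairs of ratios, as in the sketch below. The function names and example ratios are hypothetical, and the pairing of duplicates is assumed to be known from the array layout.

```python
def duplicate_differences(pairs):
    """D = (R1 - R2) / (R1 + R2) for each pair of duplicate spots."""
    return [(r1 - r2) / (r1 + r2) for r1, r2 in pairs]

def mean_squared_difference(pairs):
    """Mean of D squared; equals Var(D) when the mean of D is taken as 0."""
    d = duplicate_differences(pairs)
    return sum(x * x for x in d) / len(d)

# Ratios for three duplicated array elements: one consistent pair
# and two pairs that differ three-fold in opposite directions.
pairs = [(1.0, 1.0), (3.0, 1.0), (1.0, 3.0)]
d_values = duplicate_differences(pairs)
score = mean_squared_difference(pairs)
```

A score near 0 indicates that duplicates agree closely; larger values indicate a wider distribution of D and hence less consistent hybridization. The same calculation applies to equivalent spots on replicate slides if each pair holds the ratios from two slides.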

Replicate array hybridization quality.

We extend this measure to determine the consistency of replicate experiments, by calculating the difference in signals between equivalent spots across multiple slides. We construct an analogous quality score $D'_i = (R_{\alpha,i} - R_{\beta,i})/(R_{\alpha,i} + R_{\beta,i})$ for spot $i$ on slides $\alpha$ and $\beta$. Again, the width of the distribution of $D'$ measures the quality of an experiment with respect to others and allows users to decide whether the entire experiment should be removed from the dataset. Values for $D'$ can also be used to identify regions of a slide that are of particularly poor quality.

Scoring differentially hybridized array elements

The final step is to identify array elements that exhibit differential hybridization (Fig. 1A). These ultimately correspond to those genes that have altered expression levels, chromosomal regions that have changed copy number or the locations of transcription factor binding sites, depending on the nature of the experiment. The major issue is to single out spots whose relative Cy5-Cy3 signals stand out from the experimental noise at sufficient statistical significance.

ExpressYourself currently incorporates three scoring methods. The simplest and most widely used approach is to define a ratio cut-off and identify the probes that exhibit fold changes greater than this threshold (3,46–48). Another popular approach is to use variations of Student's paired $t$-test to compare all signals from the test and reference samples (49–51). Differentially hybridized spots are identified as those exhibiting a $p$-value less than a user-specified cut-off. We also include a novel method for scoring differential hybridization (Fig. 2B; manuscript in preparation). We standardize each spot's ratio by dividing it by a local standard deviation; this deviation is determined as a function of the spot's total intensity ($S_1 + S_2$). The standardized ratios are fit to a distribution and outliers at a user-defined $p$-value are identified as being differentially hybridized. The outliers are removed from the dataset and the entire process is repeated with the new, smaller set of spots. The iteration continues until no new outliers are detected.
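The iterative scheme can be illustrated with the sketch below. It is not the authors' method, whose details are not given here, but one plausible reading under explicit assumptions: ratios are log-transformed and already normalized to centre on 0, the local standard deviation is a robust estimate over a window of spots with similar total intensity, and the standardized ratios are compared against a normal distribution. The function names, window size and cut-off are all hypothetical.

```python
import math
from statistics import median

def robust_sd(values):
    """Median absolute deviation scaled to match a normal standard deviation."""
    centre = median(values)
    return 1.4826 * median([abs(v - centre) for v in values])

def score_outliers(log_ratios, intensities, p_cutoff=0.001, window=25):
    """Return the sorted indices of spots called differentially hybridized."""
    order = sorted(range(len(log_ratios)), key=lambda i: intensities[i])
    outliers = set()
    while True:
        # Spots still in the dataset, ordered by total intensity.
        kept = [i for i in order if i not in outliers]
        new = set()
        for pos, i in enumerate(kept):
            lo = max(0, pos - window)
            hi = min(len(kept), pos + window + 1)
            sd = robust_sd([log_ratios[j] for j in kept[lo:hi]])
            if sd == 0:
                continue
            z = log_ratios[i] / sd                # standardized ratio
            p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
            if p < p_cutoff:
                new.add(i)
        if not new:
            return sorted(outliers)
        outliers |= new                           # remove and repeat

# Sixty spots of low-level noise with two strongly changed spots.
noise = [-0.2, -0.1, 0.0, 0.1, 0.2]
log_ratios = [noise[i % 5] for i in range(60)]
log_ratios[10] = 3.0
log_ratios[40] = -2.5
intensities = list(range(60))
hits = score_outliers(log_ratios, intensities)
```

Removing the outliers before re-estimating the local deviation keeps strongly changed spots from inflating the noise estimate for their neighbours, which is the motivation for iterating.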

Figure 2. Screenshot of: (A) the main, and (B) scoring pages of ExpressYourself.

THE USER INTERFACE FOR ExpressYourself

ExpressYourself is accessed using a web browser, and Figure 2 displays elements of the user interface. The toolbar allows users to view the data at different stages of processing and the corresponding output is presented in the main area of the web page (Fig. 2A). In the centre of the display, we recreate the slide image using values from the input file, and it is updated through each processing step. Specific regions of the slide can be viewed in detail by clicking on the area of interest. Selecting an individual spot displays the data associated with that array element (e.g. name of array element, diameter, signal intensities and data quality flagging). Distributions of the Cy5 and Cy3 signals are displayed at the right side of the page. The scoring page lists the differentially hybridized spots that are considered statistically significant and also displays them as graphical plots (Fig. 2B). The user can download the results in a text file for further analysis. The aim of the graphical interface is to enable users to visualize the data in the context of a microarray slide and statistical distributions. It facilitates comparisons of the processed data with the original slide images and allows users to track changes to spots of interest. In additional data quality pages, the schematics are particularly useful for uncovering position-specific artifacts on the microarray slide.

DATA DOWNLOAD

Processed data can be downloaded by the user as text files; these include array signal intensities after each processing step, a list of array elements that are differentially hybridized along with significance scores, and the results of data quality analyses including flagging (−50 for poor quality array elements and 0 for good quality elements).

CONCLUSIONS

Summary

Here we presented ExpressYourself, a web-based program for processing microarray data. We have incorporated novel and published algorithms to reduce the experimental noise, assess the quality of the data and identify differentially hybridized array elements. The program can process data from most gene expression, ChIp-chip and CGH experiments. The results are clear and the graphical interface allows immediate identification of the most important features of the experiment.

Future improvements

ExpressYourself is continually updated as better processing methods are developed both within and outside our laboratory. Immediate plans include addition of alternative normalization methods, clustering and a visual tool linking array images to genomic features, given a corresponding microarray designed to map chromosomal loci. We also have future plans for improved scoring schemes and more advanced methods for combining the data from replicate experiments.

AVAILABILITY

ExpressYourself is freely accessible for use at http://bioinfo.mbb.yale.edu/expressyourself. The program is written in C and Perl and may be installed on any web server for local use. Enquiries can be made to nicholas.luscombe@yale.edu.

ExpressYourself currently accepts input files in GenePix Pro versions 2.0–4.0, Scanalyze version 2.0, or UCSF SPOT version 2.0 format. The processing steps to be applied to the data may be changed by altering the parameters at the input stage. The program and its outputs are accessible using any modern web browser (Explorer 6.0, Netscape 7.0 or Mozilla 1.3) and text-based results can be downloaded for further analysis.

ACKNOWLEDGEMENTS

We thank the Snyder and Stern laboratories for sample datasets and their assistance in testing the software, and Paul Lizardi and Michelle Lacey for useful discussions. N.M.L. is sponsored by the Anna Fuller Fund and M.G. acknowledges support from the Keck Foundation. P.B., M.S. and M.G. are supported in part by NIH grants P50 HG02357 and R01 CA77808.

REFERENCES