An efficient pseudomedian filter for tiling microarrays

Thomas E Royce et al. BMC Bioinformatics. 2007.

Abstract

Background: Tiling microarrays are becoming an essential technology in the functional genomics toolbox. They have been applied to novel transcript identification, elucidation of transcription factor binding sites, detection of methylated DNA, and several other applications in several model organisms. These experiments are being conducted at increasingly fine resolutions as the microarray technology attains ever greater feature densities. The increased densities naturally lead to increased data analysis requirements. Specifically, the most widely employed algorithm for tiling array analysis involves smoothing observed signals by computing pseudomedians within sliding windows, an O(n² log n) calculation in each window. This poor time complexity is an issue for tiling array analysis and could prove to be a real bottleneck as tiling microarray experiments become grander in scope and finer in resolution.

Results: We therefore implemented Monahan's HLQEST algorithm, which reduces the runtime complexity for computing the pseudomedian of n numbers from O(n² log n) to O(n log n). For a representative tiling microarray dataset, this modification reduced the smoothing procedure's runtime by nearly 90%. We then leveraged the fact that elements within sliding windows remain largely unchanged in overlapping windows (as one slides across genomic space) to reduce computation by an additional 43%. This was achieved by applying skip lists to maintain a sorted list of values from window to window; the sorted list can be maintained with simple O(log n) inserts and deletes. We illustrate the favorable scaling properties of our algorithms with both time complexity analysis and benchmarking on synthetic datasets.
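
The window-maintenance idea described above can be sketched in a few lines of Python. This is an illustrative re-implementation, not the authors' C code: a plain list kept sorted with `bisect` stands in for the skip list (so deletion here is O(n) rather than the O(log n) the paper achieves), and the pseudomedian is computed naively from its definition within each window.

```python
import bisect
from statistics import median

def pseudomedian(sorted_vals):
    """Median of all pairwise averages (x_i + x_j)/2, i <= j,
    computed naively from the definition (O(n^2 log n))."""
    n = len(sorted_vals)
    avgs = [(sorted_vals[i] + sorted_vals[j]) / 2
            for i in range(n) for j in range(i, n)]
    return median(avgs)

def smooth(positions, signals, bandwidth):
    """Sliding-window pseudomedian smoothing.  The window's contents
    are kept sorted incrementally as the window slides, mirroring the
    paper's skip-list idea; here a bisect-maintained Python list
    stands in for the skip list."""
    out = []
    window = []      # sorted signal values currently in the window
    lo = hi = 0      # indices bounding the probes inside the window
    for center in positions:
        left, right = center - bandwidth / 2, center + bandwidth / 2
        # insert probes that entered the window on the right
        while hi < len(positions) and positions[hi] <= right:
            bisect.insort(window, signals[hi])
            hi += 1
        # delete probes that fell off the window on the left
        while lo < hi and positions[lo] < left:
            idx = bisect.bisect_left(window, signals[lo])
            del window[idx]      # O(n) here; O(log n) with a skip list
            lo += 1
        out.append(pseudomedian(window))
    return out
```

For example, `smooth([0, 10, 20, 30], [1, 2, 3, 4], 25)` smooths four probes with a 25 bp bandwidth, each window covering the probe itself plus its neighbors within 12.5 bp.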

Conclusion: Tiling microarray analyses that rely upon a sliding window pseudomedian calculation can require many hours of computation. We have eased this requirement significantly by implementing efficient algorithms that scale well with genomic feature density. This result not only speeds the current standard analyses, but also makes possible analyses in which many iterations of the filter are needed, such as in a bootstrap or parameter-estimation setting. Source code and executables are available at http://tiling.gersteinlab.org/pseudomedian/.

Figures

Figure 1

Number of features per Affymetrix brand microarray. Example studies' [1; 13–17] feature counts are plotted as a function of their publication year.

Figure 2

Computation of a sliding window pseudomedian utilizing a 115 bp bandwidth. Raw signals are depicted in (A) along with their genomic coordinates (B). Pseudomedians are computed in a sliding window (C). As an example, (D) tabulates all 28 pairwise averages in the window centered at position 1,250. The averages are sorted in (E), from which their median can be computed (F). The median of pairwise averages is also called the pseudomedian.
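
The definition illustrated in panels (D)–(F) translates directly to code. Below is a minimal Python sketch of the naive computation (the figure's actual signal values are not reproduced here; the seven values are hypothetical stand-ins chosen only to match the window size of the example):

```python
from statistics import median

def pseudomedian(values):
    """Pseudomedian (Hodges-Lehmann estimator): the median of all
    pairwise averages (x_i + x_j)/2 with i <= j.  For n values there
    are n*(n+1)/2 such averages -- 28 when n = 7, as in the window
    centered at position 1,250 in the figure."""
    n = len(values)
    avgs = sorted((values[i] + values[j]) / 2
                  for i in range(n) for j in range(i, n))
    return median(avgs)

window = [3, 5, 4, 6, 8, 7, 9]   # hypothetical signals for a 7-probe window
print(pseudomedian(window))       # median of the 28 sorted pairwise averages
```

Sorting all n(n+1)/2 averages is what makes this O(n² log n) per window; Monahan's algorithm avoids materializing and sorting the full set.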

Figure 3

Pseudocode for the pseudomedian algorithm. For illustration purposes, this pseudocode assumes no ties in X.

Figure 4

Partitioning the pairwise means in linear time. A partition is indicated by a dotted line such that all elements above and to the left of the line are strictly less than six. The pairwise averages that must be computed to determine the partition are indicated by the path. Specifically, suppose the current best guess at the pseudomedian is six; we now want to divide S0 into those averages less than six and those greater than or equal to six. We start at the top right of the matrix and encounter the value 5. This is less than six, so we move down one row. Again we encounter a value, 5.5, that is less than six, so we move down one more row. Note that we now know every element in these first two rows is less than the partitioning element, and that we determined this by computing just two pairwise averages. Returning to the partitioning, we next encounter the value 6.5, which is greater than six, so we scan the row leftward until we find the first element, 5.5, that is less than the partitioning element. When this occurs, we move down a row, encounter 6.5, move left, find the value six, and reach the diagonal, which completes the partitioning. Importantly, we reached the diagonal by computing only seven pairwise averages. The facilitating requirement is that the input data be sorted. This is the partitioning technique implemented in Monahan's algorithm.
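
The staircase walk described in the caption can be sketched as a counting routine. This is an illustrative Python version (not the paper's implementation): given sorted input, it counts how many pairwise averages fall strictly below a candidate partitioning element in O(n) time, stepping down or left through the implicit matrix of averages exactly as in the figure.

```python
def count_below(x, pivot):
    """Count pairwise averages (x[i] + x[j]) / 2, i <= j, that are
    strictly less than `pivot`, in O(n) time given sorted x.

    Because x is sorted, each row of the implicit matrix of averages
    increases left to right and each column increases top to bottom,
    so a single staircase walk from the top-right corner suffices."""
    n = len(x)
    count = 0
    j = n - 1                        # start in the rightmost column
    for i in range(n):               # walk down the rows
        while j >= i and (x[i] + x[j]) / 2 >= pivot:
            j -= 1                   # step left past entries >= pivot
        if j < i:
            break                    # nothing below pivot in later rows
        count += j - i + 1           # row i is < pivot up through column j
    return count
```

The total work is O(n) because the column index `j` only ever decreases; Monahan's algorithm repeats this partitioning with successively refined pivots to home in on the median of the averages without enumerating them all.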

Figure 5

Observed runtime differences between computing the pseudomedian from its definition and by following Monahan's algorithm. The x-axis gives the number of randomly generated values for which the pseudomedian was computed; the y-axis gives the ratio of the definition's runtime to Monahan's runtime on the same set of values. The definition requires an O(n² log n) computation whereas Monahan's algorithm needs just O(n log n) computing time; we therefore expect a linear increase in their runtime ratio as n increases, and this is what we observe.

Figure 6

Linked list (A) and skip list (B). The skip list contains nodes with a variable number of forward pointers; a node's level equals its number of forward pointers. Insertion into and deletion from the linked list require a simple traversal of the list, taking O(n) time. Insertion and deletion in the skip list can make use of short-cuts provided by the structure. For example, to insert '30' into (B), one would start at the node holding '1', skip ahead to the node holding '9', then to '13' and '29', before encountering '41', which is greater than the value to insert. The traversal would then drop down a level, visit the node holding '37', and finally insert between '29' and '37'. Such an operation can be shown to take O(log n) expected time.
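
A minimal skip list with the insert and remove operations the filter needs can be sketched as follows. This is an illustrative Python version, not the authors' implementation; node values in the usage example match those shown in the figure.

```python
import random

class _Node:
    __slots__ = ("value", "next")
    def __init__(self, value, level):
        self.value = value
        self.next = [None] * level   # one forward pointer per level

class SkipList:
    """Minimal skip list with insert/remove of numeric values.  Each
    node draws a geometric random level; a search starts at the top
    level and drops down whenever the next node overshoots, giving
    O(log n) expected time per operation."""
    MAX_LEVEL = 16

    def __init__(self):
        self.head = _Node(float("-inf"), self.MAX_LEVEL)

    def _predecessors(self, value):
        """Last node strictly before `value` on every level."""
        update = [None] * self.MAX_LEVEL
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.next[lvl] is not None and node.next[lvl].value < value:
                node = node.next[lvl]
            update[lvl] = node
        return update

    def insert(self, value):
        update = self._predecessors(value)
        level = 1
        while random.random() < 0.5 and level < self.MAX_LEVEL:
            level += 1
        node = _Node(value, level)
        for lvl in range(level):
            node.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = node

    def remove(self, value):
        update = self._predecessors(value)
        target = update[0].next[0]
        if target is None or target.value != value:
            return False
        for lvl in range(len(target.next)):
            if update[lvl].next[lvl] is target:
                update[lvl].next[lvl] = target.next[lvl]
        return True

    def __iter__(self):              # bottom level holds every value, sorted
        node = self.head.next[0]
        while node is not None:
            yield node.value
            node = node.next[0]
```

For example, inserting 1, 9, 13, 29, 37, and 41 and then 30 reproduces the figure's scenario; iterating the bottom level yields the window's values in sorted order, ready for the pseudomedian computation.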

Figure 7

Runtimes of the pseudomedian smoothing algorithms over a one million element dataset. Times are plotted for the definition-derived implementation and for the Monahan modification as a function of feature count within the sliding window.

Figure 8

Fractional runtimes for the Monahan-modified pseudomedian filter. 'Sorting' refers to time spent sorting values within each window, 'Partitioning' refers to time spent searching for the pseudomedian, and 'Overhead' is the remainder of runtime within the filter.

Figure 9

Runtimes for pseudomedian smoothing using the Monahan modification with and without the skip list.

References

    1. Selinger DW, Cheung KJ, Mei R, Johansson EM, Richmond CS, Blattner FR, Lockhart DJ, Church GM. RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nat Biotechnol. 2000;18:1262–1268. doi: 10.1038/82367.
    2. Bertone P, Trifonov V, Rozowsky JS, Schubert F, Emanuelsson O, Karro J, Kao M, Snyder M, Gerstein M. Design optimization methods for genomic DNA tiling arrays. Genome Res. 2006;16:271–281. doi: 10.1101/gr.4452906.
    3. Wheelan SJ, Martinez-Murillo F, Irizarry RA, Boeke JD. Stacking the deck: double-tiled DNA microarrays. Nat Methods. 2006;3:903–907. doi: 10.1038/nmeth951.
    4. ENCODE Consortium. The ENCODE (ENCyclopedia Of DNA Elements) project. Science. 2004;306:636–640. doi: 10.1126/science.1105136.
    5. Royce TE, Rozowsky JS, Bertone P, Samanta M, Stolc V, Weissman S, Snyder M, Gerstein M. Issues in the analysis of oligonucleotide tiling microarrays for transcript mapping. Trends Genet. 2005;21:466–475. doi: 10.1016/j.tig.2005.06.007.
