Missing value estimation for DNA microarray gene expression data: local least squares imputation (original) (raw)

Abstract

Motivation: Gene expression data often contain missing expression values. Effective missing value estimation methods are needed since many algorithms for gene expression data analysis require a complete matrix of gene array values. In this paper, imputation methods based on the least squares formulation are proposed to estimate missing values in the gene expression data, which exploit local similarity structures in the data as well as least squares optimization process.

Results: The proposed local least squares imputation method (LLSimpute) represents a target gene that has missing values as a linear combination of similar genes. The similar genes are chosen by _k_-nearest neighbors or k coherent genes that have large absolute values of Pearson correlation coefficients. Non-parametric missing values estimation method of LLSimpute are designed by introducing an automatic _k_-value estimator. In our experiments, the proposed LLSimpute method shows competitive results when compared with other imputation methods for missing value estimation on various datasets and percentages of missing values in the data.

Availability: The software is available at http://www.cs.umn.edu/~hskim/tools.html

Contact: hpark@cs.umn.edu

1 INTRODUCTION

Microarray data analysis has been successfully applied in a number of studies over a broad range of biological disciplines including cancer classification by class discovery and prediction (Golub et al., 1999), identification of the unknown effects of a specific therapy (Perou et al., 2000), identification of genes relevant to a certain diagnosis or therapy (Cho et al., 2003) and cancer prognosis (Shipp et al., 2002; van't Veer et al., 2002). Since multivariate supervised classification methods such as support vector machines (SVMs) (Vapnik, 1995), and multivariate statistical analysis methods such as principal component analysis (PCA), singular value decomposition (SVD) (Golub and van Loan, 1996; Alter et al., 2000) and generalized SVD (GSVD) (Golub and van Loan, 1996; Alter et al., 2003) cannot be applied to data with missing values, the missing value estimation is an important preprocessing step. Gene expression data sets often contain missing values due to various reasons, e.g. insufficient resolution, image corruption, dust or scratches on the slides or experimental error during the laboratory process. Since it is often very costly or time consuming to repeat the experiment, many algorithms have been developed to recover the missing values (Troyanskaya et al., 2001; Oba et al., 2003; Friedland et al., 2003). Moreover, estimating unknown elements in the given matrix has many potential applications in the other fields. There are several approaches for estimating the missing values. Recently, for missing value estimation, the SVD-based method (SVDimpute) and weighted _k_-nearest neighbors imputation (KNNimpute) have been introduced (Troyanskaya et al., 2001). It has been shown that KNNimpute performs better on non-time series data or noisy time series data, while SVDimpute works well on time series data with low noise levels. Overall, the weighted _k_-nearest neighbor based imputation provides a more robust method for missing value estimation than the SVD-based method (Troyanskaya et al., 2001).

Throughout the paper, we will use G ∈ ℝ_m_×n to denote a gene expression data matrix with m genes and n experiments, and assume m ≫ n. In the matrix G, a row

\({\hbox{ g }}_{i}^{\hbox{ T }}\in {\mathbb{R}}^{1\times n}\)

represents expressions of the _i_-th gene in n experiments:

\[G=\left(\begin{array}{c}{\hbox{ g }}_{1}^{\hbox{ T }}\\ \vdots \\ {\hbox{ g }}_{m}^{\hbox{ T }}\end{array}\right)\in {\mathbb{R}}^{m\times n}.\]

A missing value in the _l_-th location of the _i_-th gene is denoted as α, i.e.

\[G\left(i,l\right)={\hbox{ g }}_{i}\left(l\right)=\alpha .\]

For simplicity of algorithm description, all missing value estimation algorithms mentioned in this paper are described first assuming there is a missing value in the first position of the first gene, i.e.

\[G\left(1,1\right)={\hbox{ g }}_{1}\left(1\right)=\alpha ,\]

then the general algorithms for our proposed missing value estimation methods for DNA microarray expression data are introduced.

The KNNimpute method (Troyanskaya et al., 2001) finds k (k < m) other genes with expressions most similar to that of g1 and with the values in their first positions not missing. The missing value of g1 is estimated by the weighted average of values in the first positions of these k closest genes. For the weighted average, the contribution of each gene is weighted by the similarity of its expression to that of g1. In the SVDimpute method (Troyanskaya et al., 2001), the SVD of the matrix _G_′, which is obtained after all missing values of the G are substituted by zero or row averages, is computed. Then, using the t most significant eigengenes (Alter et al., 2000) of _G_′, where the specific value of t is either predetermined or determined based on datasets, a missing value α in g1 is estimated by regressing this gene against the t most significant eigengenes. Using the coefficients of the regression, the missing value is estimated as a linear combination of the values in the first position of t eigengenes. When determining these regression coefficients, the missing value g1(1) of g1 and the first values of the t eigengenes are not used. The above procedure is repeated until the total change of the matrix becomes insignificant. The computational complexity of SVDimpute is O(n_2_mj), where j is the number of iterations performed before the threshold value is reached. SVDimpute is useful for time series data with low noise level. Recently, Bayesian PCA (BPCA), which simultaneously estimates a probabilistic model and latent variables within the framework of Bayesian inference, has been successfully applied to missing value estimation problems (Oba et al., 2003). Also, a fixed rank approximation algorithm (FRAA) (Friedland et al., 2003) using the SVD has been proposed. However, FRAA could not outperform KNNimpute even though it is more accurate than replacing missing values with 0's or with row means. More recently, Bø et al. (2004) has introduced a missing value estimation method based on the least squares principle, which utilizes correlations between both genes and arrays, referred to as LSimpute.

In this paper, we introduce novel least squares based imputation methods, where a target gene that has missing values is represented as a linear combination of similar genes. Rather than using all available genes in the data, since only similar genes based on a similarity measure are used, our method is referred to as local least squares imputation (LLSimpute). As similarity measures, both _L_2-norm and Pearson correlation coefficients are investigated for comparison. We evaluate all proposed imputation methods on the five datasets and compare them with KNNimpute and an estimation method based on BPCA for various percentages of missing values.

2 METHODS

There are two steps in the local least squares imputation. The first step is to select k genes by the _L_2-norm or by Pearson correlation coefficients. The second step is regression and estimation, regardless of how the k genes are selected. A heuristic k parameter selection method is described in the Results and discussion section.

2.1 Selecting genes

To recover a missing value α in the first location g1(1) of g1 in G ∈ ℝ_m_×n, the _k_-nearest neighbor gene vectors for g1,

\[{\hbox{ g }}_{{s}_{i}}^{\hbox{ T }}\in {\mathbb{R}}^{1\times n},1\le i\le k,\]

are found for LLSimpute based on the _L_2-norm (LLSimpute/L2). In this process of finding the similar genes, the first component of each gene is ignored following the fact that g1(1) is missing.

The LLSimpute based on the Pearson correlation coefficient (Pearson, 1894), referred to as LLSimpute/PC, takes advantage of the coherent genes. When there is a missing value in the first location of g1, the Pearson correlation coefficient r_1_j between two vectors g′1 = (g_12,…,g_1_n)T and g′_j = (g _j_2,…,g jn)T is defined as

\[{r}_{1j}=\frac{1}{\left(n-1\right)}{\displaystyle \sum _{k=2}^{n}}\left(\frac{{g}_{1k}-{\overline{g}}_{1}}{{\sigma }_{1}}\right)\left(\frac{{g}_{jk}-{\overline{g}}_{j}}{{\sigma }_{j}}\right),\left(1\right)\]

where

\({\overline{g}}_{j}\)

is the average of values in g′j and σ_j_ is the SD of these values. The components of g1 that correspond to missing values are not considered in computing the coefficients. We used the absolute values of the Pearson correlation coefficients since the highly correlated but opposite signed components of the genes, i.e r ≃ −1.0, are also helpful in estimating missing values. In LLSimpute/PC, missing values in the target genes are estimated by local least squares where highly correlated genes in the microarray data are selected based on the Pearson correlation coefficients. First, all Pearson correlation coefficients between g1 and the other genes are computed. Then, to recover a missing value in the first location of g1, G(1,1) = g1(1) = α, the k genes with the largest Pearson correlation coefficients in magnitude,

\[{\hbox{ g }}_{{s}_{i}}^{\hbox{ T }}\in {\mathbb{R}}^{1\times n},1\le i\le k,\]

are found for LLSimpute/PC.

2.2 Gene-wise formulation of local least squares imputation

As imputation can be performed regardless of how the _k_-genes are selected, we present only the imputation based on _L_2-norm for simplicity. Based on these _k_-neighboring gene vectors, the matrix A ∈ ℝ_k_×(_n_−1) and the two vectors b ∈ ℝ_k_×1 and w ∈ ℝ(_n_−1)×1 are formed. The k rows of the matrix A consist of the _k_-nearest neighbor genes

\({\hbox{ g }}_{{s}_{i}}^{\hbox{ T }}\in {\mathbb{R}}^{1\times n},1\le i\le k\)

⁠, with their first values deleted, the elements of the vector b consists of the first components of the k vectors

\({\hbox{ g }}_{{s}_{i}}^{\hbox{ T }}\)

⁠, and the elements of the vector w are the n − 1 elements of the gene vector g1 whose missing first item is deleted. After the matrix A, and the vectors b and w are formed, the least squares problem is formulated as

\[\underset{\hbox{ x }}{min}{\Vert {A}^{\hbox{ T }}\hbox{ x }-\hbox{ w }\Vert }_{2}.\left(2\right)\]

Then, the missing value α is estimated as a linear combination of first values of genes

\[\alpha ={\hbox{ b }}^{\hbox{ T }}{\hbox{ x }=\hbox{ b }}^{\hbox{ T }}{\left({A}^{\hbox{ T }}\right)}^{\dagger}\hbox{ w },\left(3\right)\]

where (_A_T)† is the pseudoinverse of _A_T.

For example, assume that the target gene g1 has a missing value in the first position among the total of six experiments. If the missing value is to be estimated by the k similar genes, the matrix A, and vectors b and w are constructed as

\[\begin{array}{c}\left(\begin{array}{c}{\hbox{ g }}_{1}^{\hbox{ T }}\\ {\hbox{ g }}_{{s}_{1}}^{\hbox{ T }}\\ \vdots \\ {\hbox{ g }}_{{s}_{k}}^{\hbox{ T }}\end{array}\right)=\left(\begin{array}{cc}\alpha & {\hbox{ w }}^{\hbox{ T }}\\ \hbox{ b }& A\end{array}\right)\\ =\left(\begin{array}{cccccc}\alpha & {\hbox{ w }}_{1}& {\hbox{ w }}_{2}& {\hbox{ w }}_{3}& {\hbox{ w }}_{4}& {\hbox{ w }}_{5}\\ {\hbox{ b }}_{1}& {A}_{1,1}& {A}_{1,2}& {A}_{1,3}& {A}_{1,4}& {A}_{1,5}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {\hbox{ b }}_{k}& {A}_{k,1}& {A}_{k,2}& {A}_{k,3}& {A}_{k,4}& {A}_{k,5}\end{array}\right),\end{array}\]

where α is the missing value and

\({\hbox{ g }}_{{s}_{1}}^{\hbox{ T }},\dots ,{\hbox{ g }}_{{s}_{k}}^{\hbox{ T }}\)

are genes similar to

\({\hbox{ g }}_{1}^{\hbox{ T }}\)

⁠. From the second to the last components of the neighbor genes,

\({\hbox{ a }}_{i}^{\hbox{ T }}\)

⁠, 1 ≤ i ≤ k, form the _i_-th row vector of the matrix A. The vector w of the known elements of target gene g1 can be represented as a linear combination

\[\hbox{ w }\simeq {\hbox{ x }}_{1}{\hbox{ a }}_{1}+{\hbox{ x }}_{2}{\hbox{ a }}_{2}+\dots +{\hbox{ x }}_{k}{\hbox{ a }}_{k},\]

where xi are the coefficients of the linear combination, found from the least squares formulation (2). Accordingly, the missing value α in g1 can be estimated by

\[\alpha ={\hbox{ b }}^{\hbox{ T }}{\hbox{ x }=\hbox{ b }}_{1}{\hbox{ x }}_{1}+{\hbox{ b }}_{2}{\hbox{ x }}_{2}+\dots +{\hbox{ b }}_{k}{\hbox{ x }}_{k}.\]

Now, we deal with the case in which there are more than one missing values in a gene vector. In this case, to recover the total of q missing values in any locations of the gene g1, first, the _k_-nearest neighbor gene vectors for g1,

\[{\hbox{ g }}_{{s}_{i}}^{\hbox{ T }}\in {\mathbb{R}}^{1\times n},1\le i\le k,\]

are found. In this process of finding the similar genes, the q components of each gene at the q locations of missing values in g1 are ignored. Then, based on these k neighboring gene vectors, a matrix A ∈ ℝ_k_×(n_−_q) a matrix B ∈ ℝ_k_×q and a vector w ∈ ℝ(n_−_q)×1 are formed. The _i_-th row vector

\({\hbox{ a }}_{i}^{\hbox{ T }}\)

of the matrix A consists of the _i_-th nearest neighbor genes

\({\hbox{ g }}_{{s}_{i}}^{\hbox{ T }}\in {\mathbb{R}}^{1\times n},1\le i\le k\)

⁠, with its elements at the q missing locations of missing values of g1 excluded. Each column vector of the matrix B consists of the values of the _j_-th location of the missing values (1 ≤ j ≤ q) of the k vectors

\({\hbox{ g }}_{{s}_{i}}^{\hbox{ T }}\)

⁠. The elements of the vector w are the n − q elements of the gene vector g whose missing items are deleted. After the matrices A and B and a vector w are formed, the least squares problem is formulated as

\[\underset{\hbox{ x }}{min}{\Vert {A}^{\hbox{ T }}\hbox{ x }-\hbox{ w }\Vert }_{2}.\left(4\right)\]

Then, the vector u = (α1,α2, …, α_q_)T of q missing values can be estimated as

\[\hbox{ u }=\left(\begin{array}{c}{\alpha }_{1}\\ \vdots \\ {\alpha }_{q}\end{array}\right)={B}^{\hbox{ T }}\hbox{ x }={B}^{\hbox{ T }}{\left({A}^{\hbox{ T }}\right)}^{\dagger}\hbox{ w },\left(5\right)\]

where (_A_T)† is the pseudoinverse of _A_T.

For example, assume that the target gene g1 has two missing values in the 1st and the 6th positions among the total six experiments. If the missing value is to be estimated by the k similar genes, each element of the matrix A and B, and a vector w are constructed as

\[\left(\begin{array}{c}{\hbox{ g }}_{1}^{\hbox{ T }}\\ {\hbox{ g }}_{{\hbox{ s }}_{1}}^{\hbox{ T }}\\ \vdots \\ {\hbox{ g }}_{{\hbox{ s }}_{\hbox{ k }}}^{\hbox{ T }}\end{array}\right)=\left(\begin{array}{cccccc}{\alpha }_{1}& {\hbox{ w }}_{1}& {\hbox{ w }}_{2}& {\hbox{ w }}_{3}& {\hbox{ w }}_{4}& {\alpha }_{2}\\ {B}_{1,1}& {A}_{1,1}& {A}_{1,2}& {A}_{1,3}& {A}_{1,4}& {B}_{1,2}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {B}_{k,1}& {A}_{k,1}& {A}_{k,2}& {A}_{k,3}& {A}_{k,4}& {B}_{k,2}\end{array}\right),\]

where α1 and α2 are the missing values and

\({\hbox{ g }}_{{s}_{1}}^{\hbox{ T }},\dots ,{\hbox{ g }}_{{s}_{k}}^{\hbox{ T }}\)

are the k genes that are most similar to g1. The known elements of w can be represented by

\[\hbox{ w }\simeq {\hbox{ x }}_{1}{\hbox{ a }}_{1}+{\hbox{ x }}_{2}{\hbox{ a }}_{2}+\dots +{\hbox{ x }}_{k}{\hbox{ a }}_{k},\]

where xi are the coefficients of the linear combination, found from the least squares formulation (4). And, the missing values in g1 can be estimated by

\[\begin{array}{c}{\alpha }_{1}={B}_{1,1}{\hbox{ x }}_{1}+{B}_{2,1}{\hbox{ x }}_{2}+\dots +{B}_{k,1}{\hbox{ x }}_{\hbox{ k }},\\ {\alpha }_{2}={B}_{1,2}{\hbox{ x }}_{1}+{B}_{2,2}{\hbox{ x }}_{2}+\dots +{B}_{k,2}{\hbox{ x }}_{\hbox{ k }},\end{array}\]

where α1 and α2 are the first and second missing values in the target gene.

2.3 Experiment-wise formulation of local least squares imputation

In this subsection, we introduce another possible least squares formulation and illustrate relationship between these two least squares formulations. Based on the matrix A and the vector b presented in the previous subsection, the following least squares problem for missing value estimation can also be formulated:

\[\underset{\hbox{ y }}{min}{\Vert A\hbox{ y }-\hbox{ b }\Vert }_{2}.\left(6\right)\]

Then, the vector b of the known elements of the experiment that has the missing value α as its first component can be represented as a linear combination of other experiments. Accordingly, the missing value α can be estimated as a linear combination of values of the first components of the experiments by

\[\alpha ={\hbox{ w }}^{\hbox{ T }}{\hbox{ y }=\hbox{ w }}^{\hbox{ T }}{A}^{\dagger}\hbox{ b },\left(7\right)\]

where _A_† is the pseudoinverse of A. The pseudoinverse _A_† of A can be computed by

\[\begin{array}{c}{A}^{\dagger}=\left[{V}_{1}{V}_{2}\right]\left[\begin{array}{cc}{\Sigma }_{1}^{-1}& 0\\ 0& 0\end{array}\right]{\left[{U}_{1}{U}_{2}\right]}^{\hbox{ T }}\\ ={V}_{1}{\Sigma }_{1}^{-1}{U}_{1}^{\hbox{ T }}\end{array}\]

from the SVD of A,

\[A=\left[{U}_{1}{U}_{2}\right]\left[\begin{array}{cc}{\Sigma }_{1}& 0\\ 0& 0\end{array}\right]{\left[{V}_{1}{V}_{2}\right]}^{\hbox{ T }}={U}_{1}{\Sigma }_{1}{V}_{1}^{\hbox{ T }},\]

where U_1 ∈ ℝ_k_×_r, Σ1 ∈ ℝ_r_×r, V_1 ∈ ℝ_n_−1×_r and r = rank(Σ1) = rank(A).

Although the gene-wise formulation and the experiment-wise formulation may seem to represent the missing value in two different ways, the solutions are in fact the same due to the following relations:

\[{\hbox{ b }}^{\hbox{ T }}{\hbox{ x }=\hbox{ b }}^{\hbox{ T }}{\left({A}^{\hbox{ T }}\right)}^{\dagger}\hbox{ w }={\left({\hbox{ w }}^{\hbox{ T }}{A}^{\dagger}\hbox{ b }\right)}^{\hbox{ T }}={\hbox{ w }}^{\hbox{ T }}\hbox{ y }.\left(8\right)\]

Therefore, we choose an imputation formulation of Equation (2) which represents a gene by a linear combination of the other similar genes.

For the case in which there are more than one missing values in a gene vector, the formulation for multiple missing value estimation analogous to Equation (6) would be

\[\underset{Y}{min}{\Vert AY-B\Vert }_{F},\left(9\right)\]

where F denotes the Frobenius norm. Then, the missing values u = (α1,α2,…,α_q_)T can be estimated as a linear combination of values of w, i.e.

\[{\hbox{ u }}^{\hbox{ T }}={\hbox{ w }}^{\hbox{ T }}Y={\hbox{ w }}^{\hbox{ T }}{A}^{\dagger}B,\left(10\right)\]

where _A_† is the pseudoinverse of A. This vector u is identical to that in Equation (5) which is obtained based on gene-wise formulation since

\[{B}^{\hbox{ T }}\hbox{ x }={B}^{\hbox{ T }}{\left({A}^{\hbox{ T }}\right)}^{\dagger}\hbox{ w }={\left({\hbox{ w }}^{\hbox{ T }}{A}^{\dagger}B\right)}^{\hbox{ T }}={Y}^{\hbox{ T }}\hbox{ w }.\left(11\right)\]

Therefore, we chose an imputation formulation of Equation (4) which represents a gene by a linear combination of the similar genes.

For estimating each missing value, we need to build the matrices A and B and a vector w, and solve the least squares problem of Equation (4). To take advantage of non-missing entries of neighbor genes which have missing values, each missing value is initially estimated by the gene-wise average. If number of missing entries is much smaller than the number of genes, the neighbor genes that contain missing value are excluded when building the least squares systems. In this case, the matrices A and B do not contain any estimated entries. This process is helpful in achieving more accurate estimation result since it circumvents possible errors generated from pre-estimation by the row-averages.

3 RESULTS AND DISCUSSION

3.1 Datasets

Five microarray datasets have been used in our experiments. The first dataset was obtained from α-factor block release that was studied for the identification of cell-cycle regulated genes in yeast Saccharomyces cerevisiae (Spellman et al., 1998). We built a complete data matrix of 4304 genes and 18 experiments (SP.ALPHA) that does not have any missing value to assess missing value estimation methods. The second dataset of a complete matrix of 4304 genes and 14 experiments (SP.ELU) is based on an elutriation dataset (Spellman et al., 1998). The 4304 genes originally had no missing values in the α-factor block release set and the elutriation dataset. The third dataset was from 784 cell-cycle-regulated genes, which were classified by Spellman et al. (1998) into five classes, for the same 14 experiments as the second data set. After removing all gene rows that have missing values, we built the third data set of 474 genes and 14 experiments (SP.CYCLE). The fourth dataset is from a study of response to environmental changes in yeast (Gasch et al., 2001). It contains 6361 genes and 156 experiments that have time-series of specific treatments. A complete matrix of 2641 genes and 44 experiments was formed after removing experimental columns that have >8% missing values and then selecting gene rows that do not have any missing value (GA.ENV). The fifth dataset is the cDNA microarray data relevant to human colorectal cancer (CRC) (Takemasa et al., 2001). This dataset contains 758 genes and 205 primary CRCs that include 127 non-metastatic primary CRCs, 54 metastatic primary CRCs to the liver and 24 metastatic primary CRCs to distant organs exclusive of the liver, and 12 normal colonic epithelia (TA.CRC) (Oba et al., 2003). This is a challenging dataset with multiple experiments with no time course relationships. The SP.ALPHA, SP.ELU and TA.CRC are the same datasets that were used in the study of BPCA (Oba et al., 2003). The SP.CYCLE dataset was designed to test how much an imputing method can take advantage of strongly correlated genes in estimating missing values.

Given an original expression data matrix G o with m _o_-genes × n_-experiments from the Stanford Microarray Database (SMD) (Sherlock et al., 2001), we prepared the initial full matrix G i ∈ ℝ_m i × n where m i genes have no missing values (m i ≤ m o). If there is no missing value in the original matrix, then the initial matrix G i is set to G o. Given an initial expression data matrix G i, certain percentage of the data elements of G i are randomly selected and regarded as missing values. The performance of the missing value estimation is evaluated by normalized root mean squared error (NRMSE):

\[\hbox{ NRMSE }=\sqrt{\hbox{ mean }\left[{\left({\hbox{ y }}_{\hbox{ guess }}-{\hbox{ y }}_{\hbox{ ans }}\right)}^{2}\right]}/\hbox{ std }\left[{\hbox{ y }}_{\hbox{ ans }}\right]\left(12\right)\]

where yguess and yans are vectors whose elements are the estimated values and the known answer values, respectively, for all missing entries. The mean and the SD are calculated over missing entries in the entire matrix. In KNNimpute, a weighted average of the _k_-nearest neighbors is used as an estimate for each missing value in the target gene. The similarity between two genes is defined by the reciprocal of the Euclidian distance calculated for non-missing components. Then, the missing entry is estimated as an average weighted by the similarity values.

3.2 A method for selection of the model parameter k

The value for k in KNNimpute and the reduced rank in SVDimpute are important model parameters to choose for obtaining high performance. However, there is no theoretical result for determining these parameters optimally. Similarly, we need to determine the number of nearest neighbors for LLSimpute/L2 and the number of coherent genes for LLSimpute/PC. In our experiments, the following heuristic algorithm for estimating parameter k is used.

Assuming that there are q missing values in the target gene g, all the gene vectors are sorted according to their similarity to g,

\[{\tilde{\hbox{ g }}}_{{s}_{i}}^{\hbox{ T }}\in {\mathbb{R}}^{1\times n},1\le i\le m-1,\]

where the gene

\({\tilde{\hbox{ g }}}_{{s}_{1}}\)

is the most similar gene to g. Then we estimate an artificial missing value in g using different _k_-values. The vector w is defined based on the n − q elements of the target gene vector g whose missing items are deleted and the matrix B is formed by the values of _j_-th location of the missing values (1 ≤ j ≤ q) of k similar genes

\({\tilde{\hbox{ g }}}_{{s}_{i}}\)

⁠, 1 ≤ i ≤ k. Now, assume that the first position of w has a missing value and it needs to be estimated by various numbers of similar genes. Let αtrue denote w1 which is in fact known. For the least squares formulation, the rows of the matrix A consist of ãi(2 : n − q)T for 1 ≤ i ≤ k and a vector w is set to be w(2:n − q).

For example, assume that the target gene g has two missing values in the 1st and 6th positions among the total six experiments. Considering w1 as a missing value which is in fact known, the matrix A and vectors w = (w2w3w4)T and b = (b1 … bk)T are constructed from

\[\left(\begin{array}{c}{\hbox{ g }}_{1}^{\hbox{ T }}\\ {\hbox{ g }}_{{s}_{1}}^{\hbox{ T }}\\ \vdots \\ {\hbox{ g }}_{{s}_{k}}^{\hbox{ T }}\end{array}\right)=\left(\begin{array}{cccccc}\hbox{ miss }& \alpha & {\hbox{ w }}_{2}& {\hbox{ w }}_{3}& {\hbox{ w }}_{4}& \hbox{ miss }\\ {B}_{1,1}& {\hbox{ b }}_{1}& {A}_{1,2}& {A}_{1,3}& {A}_{1,4}& {B}_{1,2}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {B}_{k,1}& {\hbox{ b }}_{k}& {A}_{k,2}& {A}_{k,3}& {A}_{k,4}& {B}_{k,2}\end{array}\right),\left(13\right)\]

where

\({\hbox{ g }}_{{s}_{1}}^{\hbox{ T }},\dots ,{\hbox{ g }}_{{s}_{k}}^{\hbox{ T }}\)

are the k genes similar to g1. Then,

\({\hbox{ a }}_{i}^{\hbox{ T }}\)

is the _i_-th row vector of the matrix A, for 1 ≤ i ≤ k, and the known elements of w can be represented by

\[\hbox{ w }\simeq {\hbox{ x }}_{1}{\hbox{ a }}_{1}+{\hbox{ x }}_{2}{\hbox{ a }}_{2}+\dots +{\hbox{ x }}_{k}{\hbox{ a }}_{k},\]

where xi are the coefficients, found from the least squares formulation

\[\underset{\hbox{ x }}{min}{\Vert {A}^{\hbox{ T }}\hbox{ x }-\hbox{ w }\Vert }_{2}.\]

Finally, the value w1 = α in g1 can be estimated by

\[\alpha ={\hbox{ b }}^{\hbox{ T }}{\hbox{ x }=\hbox{ b }}_{1}{\hbox{ x }}_{1}+{\hbox{ b }}_{2}{\hbox{ x }}_{2}+\dots +{\hbox{ b }}_{k}{\hbox{ x }}_{k},\]

and compared to the actual value w1. In the above, the k most similar genes are used for estimating entire missing values in the given matrix. Repeating these estimations using several _k_-values, a _k_-value that produces the best estimation ability for the artificial missing values can be found. If there are many missing entries per each row, the above process can be performed considering more than one non-missing positions as missing values in order to obtain more reliable _k_-value.

This procedure decides a number of similar genes that show good performance for estimating missing values using non-missing elements. The _k_-value depends on the characteristic of the given data matrix. The motivation of this procedures is that the _k_-value that shows the best performance using known elements of the matrix can be near an optimal _k_-value. The gene that has missing values is represented as a linear combination of the k similar genes in the least squares formulations. Hence, an optimal _k_-value is predicted by using the known values in the same gene that has missing values.

We introduce two extended missing value estimation methods, LLSk/L2 and LLSk/PC, which perform LLSimpute/L2 and LLSimpute/PC, respectively, after predicting _k_-value using the proposed model selection algorithm. Since LLSk/L2 and LLSk/PC automatically determine the only necessary parameter k, they can be classified as non-parametric missing value estimation methods like BPCA.

3.3 Experimental results

In Figure 1, we compared NRMSE of Equation (12) of the missing value estimation methods discussed in this paper. The SP.ALPHA and SP.ELU sets are the same datasets used in the study of BPCA (Oba et al., 2003) and we obtained the same NRMSE values for KNNimpute and BPCA as those presented by Oba et al. (2003). The missing value estimation based on BPCA showed good performance on the SP.ELU dataset. However, LLSimpute/L2 and LLSimpute/PC outperformed BPCA as well as KNNimpute when k is large.

In Figure 2, overall, LLSimpute shows better performance as the number of genes increases for estimating missing values on the SP.CYCLE dataset. The NRMSE values of LLSimpute/ L2 and BPCA were 0.594 and 0.771, respectively. The SP.CYCLE dataset has significant cluster structures. In this case, LLSimpute showed the best performance among all methods compared, while BPCA showed less accurate results than KNNimpute. The NRMSE values of LLSimpute and BPCA on the GA.ENV dataset were 0.534 and 0.603, respectively. From Figures 1 and 2, we confirmed that the _k_-value estimation algorithm is producing appropriate _k_-values. In Figures 1 and 2, as k increase, the NRMSE of LLSimpute/LS first drops, then rise (but the top portions of the plots are truncated), then drops again and stabilizes. The first drop and rise can be explained by Equation (2). Getting more information from a larger number of genes is responsible for the first drop. As k increase, the matrix A of Equation (2) is getting to be square, i.e. the number of rows of A is the same as the number of columns. Then, the effect of the genes less similar to the target gene become involved in the solution and the performance may become worse. As k keeps increasing, the square structure of A is getting changed to the least squares structure, i.e. the number of rows of A is larger than the number of columns. Then, the minimum norm solution of Equation (2) is the same as the least squares solution of Equation (6) [see Equation (8)]. The second big drop and stabilization can be more easily explained by Equation (6). It shows that considering a sufficient number of genes is helpful in the least squares problem.

To show the dependency of the performance with respect to the number of experiments, the smaller datasets were prepared by clipping a certain number of experiments from TA.CRC dataset. Figure 3 shows the missing value estimation ability for various experiment sizes: 20, 50, 100, 150 and 200. For these experiments, the 5% missing entries were randomly generated from the smaller datasets of TA.CRC. As the number of samples increased the information useful for the imputation increased. The results of KNNimpute were obtained by choosing a _k_-value that provided the best performance of KNNimpute in each test. The results for LLSk/L2 and LLSk/PC were obtained from _k_-value estimation algorithm. As reported in the study of BPCA (Oba et al., 2003), the performance of KNNimpute did not improve much as the number of samples increased. When the number of samples was small, KNNimpute exhibited better performance than BPCA. The advantage of KNNimpute for smaller numbers of samples is in using local similarity. The advantage of BPCA for larger numbers of samples is its ability to capture useful information by a Bayesian optimization process. LLSimpute showed the best performance even when the number of sample was small since it uses the local similarity structures and optimization process by the least squares.

Figure 4 shows the results for various percentage (1, 5, 10, 15 and 20%) of missing entries on SP.CYCLE and GA.ENV datasets. For KNNimpute, the number of genes (k) was chosen to be the one that exhibited the best performance in each test. For various percentages of missing entries, LLSimpute showed the best performance consistently. In KNNimpute, the Euclidean distance seems to be an accurate norm since the log-transformation of the data reduces the effect of outliers on gene similarity determination (Troyanskaya et al., 2001). In Figures 3 and 4, we observed the similar result that the Euclidean distance norm is slightly better than the Pearson correlation coefficient for computing gene similarity in the local least squares imputation methods.

To show how the methods respond to higher noise levels, six noisy datasets were prepared based on SP.CYCLE by adding random noise of various levels, with normal distribution. After building matrices of random numbers with the normal distribution of mean μ = 0 and various SD (σ = 0.01, 0.05, 0.1, 0.15, 0.2 and 0.25), each noise matrix was added to the data matrix of SP.CYCLE with the 5% of missing values in order to build the six noisy datasets. Figure 5 shows that the performance of BPCA varies relatively largely dependent on the noise level. The performance results of LLSk/L2 and LLSk/PC were less sensitive to the noise level. KNNimpute showed robustness against the noise when the SD was less <0.15.

3.4 Comparison to other methods

In KNNimpute (Troyanskaya et al., 2001), after obtaining the k genes gs i which are most similar to the target gene g, it estimates the gene vector g* by

\[{\hbox{ g }}^{*}=\frac{{\omega }_{1}{\hbox{ g }}_{{s}_{1}}+{\omega }_{2}{\hbox{ g }}_{{s}_{2}}+\dots +{\omega }_{k}{\hbox{ g }}_{{s}_{k}}}{{\omega }_{1}+\dots +{\omega }_{k}},\]

where ω_i_ is the similarity between g and gs i. In implementing of KNNimpute, we used

\[{\omega }_{i}=1/{\Vert \hbox{ w }-{\hbox{ a }}_{i}\Vert }_{2},\left(14\right)\]

where w and ai are the sub-vectors of g and gs i, respectively, where the missing components of g are deleted. If w = ai, then a missing value in the target gene g is estimated in KNNimpute as the value in the corresponding location of the vector gs i. In LLSimpute, the coefficients of the linear combination of non-missing part of the similar genes ai, 1 ≤ i ≤ k, are optimized by the least squares solution instead of using the similarity measure of Equation (14).

It should be noted that LLSimpute and LSimpute (Bø et al., 2004) use different approaches for imputation, even though both use least squares. According to Figures 1 and 5, the Pearson correlation based method is no better than the _L_2-norm based method. The LSimpute (Bø et al., 2004) uses only the Pearson correlation in selecting genes and arrays. Moreover, Bø et al. focused on testing only small k values (k = 5, 10, 15, 20 and 25) without _k_-value estimator in order to compare with KNNimpute by using the fact that Troyanskaya et al. (2001) reported the best results for k between 10 and 20. However, our experiments indicate that the optimal _k_-value can be larger than 25 in our least squares formulation.

The BPCA method (Oba et al., 2003) consists of three components: (1) principal component (PC) regression, (2) Bayesian estimation and (3) iterations based on expectation-maximization (EM). In PC regression, the missing part u in a gene expression vector g is estimated from the observed part w by using the principal axis vectors. Let s denote the number of the principal axis vectors. Then, the known elements can be represented by

\[\hbox{ w }\simeq {\varsigma }_{1}{\hbox{ p }}_{1}^{\hbox{ obs }}+{\varsigma }_{2}{\hbox{ p }}_{2}^{\hbox{ obs }}+\dots +{\varsigma }_{s}{\hbox{ p }}_{s}^{\hbox{ obs }},\]

where ς_i_ are the coefficients of the linear combination and

\({\hbox{ p }}_{i}^{\hbox{ obs }}\)

are the observed parts of principal axis vectors. After obtaining the coefficients by the least squares, the missing part is estimated as

\[\hbox{ u }={\varsigma }_{1}{\hbox{ p }}_{1}^{\hbox{ miss }}+{\varsigma }_{2}{\hbox{ p }}_{2}^{\hbox{ miss }}+\dots +{\varsigma }_{s}{\hbox{ p }}_{s}^{\hbox{ miss }},\]

where

\({\hbox{ p }}_{i}^{\hbox{ miss }}\)

are the missing parts of principal axis vectors. This is the conceptual process of BPCA even though it takes advantage of the sophisticated Bayesian estimation and EM-line repetitive algorithm. The SVDimpute and BPCA showed similar results when s is small, since they employ the same PC regression process, while BPCA showed better performance than SVDimpute when s is larger since BPCA automatically reduces the redundant principal axes (Oba et al., 2003).

The major difference between BPCA and LLSimpute is that LLSimpute is an optimization process based on local similar structure while BPCA is an optimization method based on PCs. The BPCA achieves an improvement over SVDimpute by incorporating Bayesian optimization and LLSimpute achieves an improvement over KNNimpute by incorporating the least squares. In the study of Troyanskaya et al. (2001), it was shown that KNNimpute is more robust and accurate than SVDimpute. The SVDimpute has several weaknesses. The SVDimpute solution relies on entire genes and experiments in the dataset and does not consider local structure. In addition, for non-time series data, a clear expression pattern may not exist. For noisy data, expression patterns for smaller groups of genes may not be represented well by the dominant eigengenes. Based on these observations, it is possible to expect that LLSimpute can exhibit highly competitive performance, which is corroborated in our experiments.

4 CONCLUSION

We have successfully developed local least squares imputation methods for the missing value estimation of DNA microarray gene expression data. Once the genes similar to the target gene with missing values are identified based on Euclidean distance or Pearson correlation coefficient, missing values can be estimated by representing the target gene as a linear combination of the similar genes or by representing the target experiment that has missing values as a linear combination of related experiments. Non-parametric missing values estimation methods of LLSk/L2 and LLSk/PC are designed by introducing automatic _k_-value estimator. The proposed missing value estimation methods can be applied to various biological and chemical experiment data.

Even though BPCA showed better performances than KNNimpute for all datasets tested in the study of BPCA (Oba et al., 2003), when genes have dominant local similarity structures, BPCA may be less accurate than KNNimpute (Oba et al., 2003). However, our local least squares imputation methods take advantage of the local similarity structures in addition to the optimization process by the least squares, which is one of the most important advance of LLSimpute. Although we cannot guarantee that LLSimpute will always show better performance than BPCA and KNNimpute, our experiments suggest that the LLSimpute is a robust and accurate missing value estimation method.

Comparison of the NRMSEs of various methods and effect of the number of genes for estimating missing values on SP.ALPHA dataset of 4304 genes and 18 experiments and SP.ELU dataset of 4304 genes and 14 experiments. The 5% entries of each dataset were missing. The results of the methods that do not depend on the number of genes are shown on the y-axis.

Fig. 1

Comparison of the NRMSEs of various methods and effect of the number of genes for estimating missing values on SP.CYCLE dataset of 474 genes and 14 experiments and GA.ENV dataset of 2641 genes and 44 experiments. The 5% entries of each dataset were missing. The results of the methods that do not depend on the number of genes are shown on the y-axis.

Fig. 2

Comparison of the NRMSEs against number of samples for four methods (KNNimpute, BPCA, LLSk/L2 and LLSk/PC) on TA.CRC dataset.

Fig. 3

Comparison of the NRMSEs against number of samples for four methods (KNNimpute, BPCA, LLSk/L2 and LLSk/PC) on TA.CRC dataset.

Comparison of the NRMSEs against percentage of missing entries for four methods (KNNimpute, BPCA, LLSk/L2 and LLSk/PC) on SP.CYCLE and GA.ENV datasets.

Fig. 4

Comparison of the NRMSEs against percentage of missing entries for four methods (KNNimpute, BPCA, LLSk/L2 and LLSk/PC) on SP.CYCLE and GA.ENV datasets.

Comparison of the NRMSEs with respect to noise levels. We added artificial noise with normal distribution of a mean μ = 0 and various SD (σ = 0.01, 0.05, 0.1, 0.15, 0.2 and 0.25) to SP.CYCLE dataset.

Fig. 5

The authors would like to thank the University of Minnesota Supercomputing Institute (MSI) for providing the computing facilities. We also thank Dr Shigeyuki Oba for providing datasets and helpful discussions. The work of H.P. has been performed while at the NSF and was partly supported by IR/D from the National Science Foundation (NSF). This material is based upon the work supported in part by the National Science Foundation Grants CCR-0204109 and ACI-0305543. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

Alter, O., Brown, P.O., Botstein, D.

2000

Singular value decomposition for genome-wide expression data processing and modeling.

Proc. Natl Acad. Sci. USA

10101

–10106

Alter, O., Brown, P.O., Botstein, D.

2003

Generalized singular value decomposition for comparative analysis of genome-scale expression datasets of two different organisms.

Proc. Natl Acad. Sci. USA

100

3351

–3356

Bø, T.H., Dysvik, B., Jonassen, I.

2004

LSimpute: accurate estimation of missing values in microarray data with least squares methods.

Nucleic Acids Res.

e34

Cho, J.H., Lee, D., Park, J.H., Lee, I.B.

2003

New gene selection method for classification of cancer subtypes considering within-class variation.

FEBS Lett.

551

–7

Institute for Mathematics and its Applications Preprint Series. Friedland, S., Niknejad, A., Chihara, L.

2003

A simultaneous reconstruction of missing data in DNA microarrays. No. 1948

Gasch, A.P., Huang, M., Metzner, S., Botstein, D., Elledge, S.J., Brown, P.O.

2001

Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p.

Mol. Biol. Cell

2987

–3003

Golub, G.H. and van Loan, C.F.

Matrix Computations

1996

3rd edn , Baltimore, CA Johns Hopkins University Press

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.

1999

Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.

Science

286

, pp.

531

–537

Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K., Ishii, S.

2003

A Bayesian missing value estimation method for gene expression profile data.

Bioinformatics

2088

–2096

Pearson, K.

1894

Contributions to the mathematical theory of evolution.

Phil. Trans. R. Soc. London

185

–110

Perou, C.M., Sorlie, T., Eisen, M.B., van de Rijn, M., Jeffrey, S.S., Rees, C.A., Pollack, J.R., Ross, D.T., Johnsen, H., Akslen, L.A., et al.

2000

Molecular portraits of human breast tumors.

Nature

406

747

–752

Sherlock, G., Hernandez-Boussard, T., Kasarskis, A., Binkley, G., Matese, J.C., Dwight, S.S., Kaloper, M., Weng, S., Jin, H., Ball, C.A., et al.

2001

The stanford microarray database.

Nucleic Acids Res.

152

–155

Shipp, M.A., Ross, K.N., Tamayo, P., Weng, A.P., Kutok, J.L., Aguiar, R.C., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G.S., et al.

2002

Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning.

Nat. Med.

–74

Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B.

1998

Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.

Mol. Biol. Cell

3273

–3297

Takemasa, I., Higuchi, H., Yamamoto, H., Sekimoto, M., Tomita, N., Nakamori, S., Matoba, R., Monden, M., Matsubara, K.

2001

Construction of preferential cDNA microarray specialized for human colorectal carcinoma: molecular sketch of colorectal cancer.

Biochem. Biophys. Res. Commun.

285

1244

–1249

Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B.

2001

Missing value estimation methods for DNA microarray.

Bioinformatics

520

–525

van't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., et al.

2002

Gene expression profiling predicts clinical outcome of breast cancer.

Nature

415

530

–536

Vapnik, V.

The Nature of Statistical Learning Theory

1995

, New York Springer-Verlag