flowPeaks: a fast unsupervised clustering for flow cytometry data via K-means and density peak finding

Journal Article

Yongchao Ge* and Stuart C. Sealfon

Department of Neurology and Center of Translational System Biology, Mount Sinai School of Medicine, New York, NY 10029, USA

* To whom correspondence should be addressed.

Revision received: 27 April 2012

Abstract

Motivation: For flow cytometry data, there are two common approaches to the unsupervised clustering problem: one is based on the finite mixture model and the other on spatial exploration of the histograms. The former is computationally slow and has difficulty identifying clusters of irregular shapes. The latter cannot be applied directly to high-dimensional data, as the computational time and memory become unmanageable and the estimated histogram is unreliable. An algorithm without these two problems would be very useful.

Results: In this article, we combine ideas from the finite mixture model and histogram spatial exploration. The new algorithm, which we call flowPeaks, can be applied directly to high-dimensional data and can identify irregularly shaped clusters. The algorithm first uses the _K_-means algorithm with a large K to partition the cell population into many small clusters. The partitioned data allow the generation of a smoothed density function via the finite mixture model. All local peaks are exhaustively found by exploring the density function, and the cells are clustered by their associated local peak. The algorithm flowPeaks is automatic, fast, reliable and robust to cluster shape and outliers. It has been applied to flow cytometry data and compared with state-of-the-art algorithms, including Misty Mountain, FLOCK, flowMeans, flowMerge and FLAME.

Availability: The R package flowPeaks is available at https://github.com/yongchao/flowPeaks.

Contact: yongchao.ge@mssm.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

In analyzing flow cytometry data, one fundamental question is how to divide the cells into distinct subsets whose phenotypes are defined by the fluorescent intensities of cell surface or intracellular markers. Unsupervised clustering of flow cytometry data is traditionally done by manual gating, where cells are sequentially clustered (gated) in one dimension (1D) or 2D with the aid of 2D contour plots and 1D histograms. Manual gating has two problems: it is (i) highly subjective, depending on the user's expertise, the sequence of markers used to draw the gates and where the gates are drawn; and (ii) tedious, since for data consisting of n channels the user may need to check and draw gates on n(n−1)/2 pairs of 2D contour plots. Automatic gating of the cells, called unsupervised clustering in machine learning, has become an active research area over the past several years.

There are currently two common approaches to the unsupervised clustering problem: one is based on the finite mixture model (Aghaeepour et al., 2011; Chan et al., 2008; Finak et al., 2009; Lo et al., 2008; Pyne et al., 2009) and the other on spatial exploration of the histograms (Naumann et al., 2010; Qian et al., 2010; Sugar and Sealfon, 2010). Both approaches have their own weaknesses. The finite mixture model assumes that the data are generated by a mixture of Gaussian distributions, Student's _t_-distributions or skewed _t_-distributions; some of these methods require data transformation to reduce asymmetry. Two issues face the finite mixture model: (i) how many components are needed; and (ii) the cluster shape is not necessarily the one the model assumes. Most authors resort to the Bayesian information criterion (BIC) or some variant to determine the optimal number of components (Finak et al., 2009; Lo et al., 2008; Pyne et al., 2009), which still leaves ambiguity, as competing finite mixtures can give similar BIC values with completely different partitions of the data. The BIC approach is also computationally burdensome, since it must compute the clustering for all possible K before determining the best K. If the cluster shape is not convex or is very asymmetrical, these algorithms are likely to split a single cluster into several small ones.

New-generation algorithms such as Misty Mountain (Sugar and Sealfon, 2010) and FLOCK (Qian et al., 2010) try to find irregular shapes without relying on K. They are fast and find data-dependent cluster shapes. However, they cannot be applied directly to high-dimensional data: Misty Mountain needs to first apply principal component analysis to reduce the dimension, and FLOCK needs to search for a 3D subspace that is optimal for a particular cluster. These dimension-reduction techniques may result in information loss. In this article, our goal is to combine the two approaches, allowing us to quickly detect data-dependent cluster shapes with an algorithm that can be applied directly to high-dimensional data.

2 METHODS

2.1 What is a cluster

As noted in Jain (2010), there is inherent vagueness in the definition of a cluster. We illustrate what a cluster is with a toy example. Figure 1 shows the density function of a mixture of two Gaussian distributions as the mean of the first distribution varies. In Figure 2, the means are fixed and the proportion of the first Gaussian distribution varies. Most panels show two distinct peaks. However, the data should be considered a single cluster in Figures 1C and 2C, because there is only a single peak. An ideal cluster is one whose probability density function has a unique peak (mode), such that every point can move to the peak along a monotonically non-decreasing path of the density. In this article, we use _K_-means as a building block to estimate the probability density function (see Sections 2.2 and 2.3), which is then used to partition the clusters based on the above consideration (see Section 2.4).


Fig. 1.

The overall density with two Gaussian mixture components for different choices of the mean of the first component. The _y_-axis of the black curve is the probability density f(x) at the value x given on the _x_-axis. The overall density function is $f(x) = w_1\phi(x; \mu_1, \sigma_1^2) + (1 - w_1)\phi(x; \mu_2, \sigma_2^2)$, where $w_1 = 0.7$, $\mu_2 = 4$, $\sigma_1 = 2$, $\sigma_2 = 1.5$, $\phi(x; \mu, \sigma^2)$ is the density function of the Gaussian distribution with mean $\mu$ and variance $\sigma^2$, and $\mu_1$ takes the values −3, −1 and 1 in the three panels. The two components $w_k\phi(x; \mu_k, \sigma_k^2)$ ($k = 1, 2$) are given by the red and green curves, respectively. (A color version of this figure is available as Supplementary Material.)


Fig. 2.

The overall density with two Gaussian mixture components for different choices of $w_1$. We set $\mu_1 = -3$, $\mu_2 = 4$, $\sigma_1 = 2$, $\sigma_2 = 1.5$. The presentation is the same as in Figure 1, except that $\mu_1$ is fixed at −3 and $w_1$ takes the values 0.7, 0.95 and 0.99 in the three panels. (A color version of this figure is available as Supplementary Material.)
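To make the toy example concrete, the following short R sketch (base R only; the grid range and resolution are arbitrary choices) evaluates the two-component mixture density of Figures 1 and 2 and counts its modes:

```r
# Two-component Gaussian mixture from Figures 1 and 2:
# f(x) = w1*phi(x; mu1, sigma1^2) + (1 - w1)*phi(x; mu2, sigma2^2)
mix_density <- function(x, w1 = 0.7, mu1 = -3, mu2 = 4,
                        sigma1 = 2, sigma2 = 1.5) {
  w1 * dnorm(x, mu1, sigma1) + (1 - w1) * dnorm(x, mu2, sigma2)
}

# Count the local maxima (modes) of f on a fine grid: an interior grid point
# is a mode when its density exceeds that of both of its neighbors.
count_modes <- function(f, from = -10, to = 10, n = 2000, ...) {
  x <- seq(from, to, length.out = n)
  y <- f(x, ...)
  sum(y[2:(n - 1)] > y[1:(n - 2)] & y[2:(n - 1)] > y[3:n])
}

count_modes(mix_density, mu1 = -3)  # 2 modes: two clusters
count_modes(mix_density, mu1 =  1)  # 1 mode: a single cluster (cf. Fig. 1C)
```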

2.2 _K_-means

The _K_-means algorithm has traditionally been used in unsupervised clustering, and was applied to flow cytometry data as early as Murphy (1985) and as recently as Aghaeepour et al. (2011). In fact, _K_-means is a special case of the Gaussian finite mixture model in which the variance matrix of each cluster is restricted to be the identity matrix. We use _K_-means not for the final clustering, but for a first partition of the cells, from which we can compute a smoothed density function. In the literature, the most popular _K_-means implementation is based on Lloyd's algorithm (Lloyd, 1982). Since there are many local minima, the final clustering depends critically on the initial seeds. We use the seed-generation step of the K-means++ algorithm (Arthur and Vassilvitskii, 2007). Let $x_i = (x_{i1}, \ldots, x_{id})$ be the _d_-dimensional vector of measurements for cell $i$ and $c_h$ be the seed vector for cluster $h$. Initially, a random cell is picked and assigned to $c_1$. To sequentially determine the seed for cluster $k$ ($k = 2, \ldots, K$), we first compute the minimum Euclidean distance from each cell to the previous $k-1$ seeds by

$$d_i = \min_{1 \le h \le k-1} \lVert x_i - c_h \rVert, \qquad i = 1, \ldots, n.$$

A cell $x_i$ is selected to be the seed $c_k$ of the _k_-th cluster with probability $d_i^2 / \sum_{j=1}^{n} d_j^2$. After the seeds for all K clusters are assigned, Lloyd's algorithm (Lloyd, 1982) iterates the following two steps: assign each data point a cluster label according to the smallest distance to the K seeds (cluster membership assignment step), and then recompute the center vector of all data points assigned the same cluster label (center update step). The updated center vectors become the seed vectors for the cluster membership assignment step of the next iteration. We use a _k_-d tree representation of the cells (Kanungo et al., 2002) to speed up the implementation of Lloyd's algorithm. After Lloyd's algorithm converges, we further apply Hartigan and Wong's (1979) algorithm to recompute the cluster centers and cluster memberships so as to decrease the objective function $\sum_{i=1}^{n} \lVert x_i - c_{L_i} \rVert^2$, where $L_i \in \{1, \ldots, K\}$ is the cluster label of $x_i$ and $c_k$ is the center vector of cluster $k \in \{1, \ldots, K\}$. We could have applied Hartigan and Wong's algorithm directly to the seeds, but the computation would be too slow.
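As an illustration of the seeding step, here is a minimal R sketch of K-means++ seeding (Arthur and Vassilvitskii, 2007); kmeanspp_seeds is a hypothetical helper, not a function of the flowPeaks package:

```r
# K-means++ seeding: the first seed is a random cell; each later seed is a
# cell drawn with probability proportional to its squared distance d_i^2 to
# the nearest seed chosen so far. Assumes K >= 2 and x is an n x d matrix.
kmeanspp_seeds <- function(x, K) {
  n <- nrow(x)
  centers <- matrix(NA_real_, K, ncol(x))
  centers[1, ] <- x[sample.int(n, 1), ]
  d2 <- rep(Inf, n)
  for (k in 2:K) {
    # update d_i^2 = min over previous seeds of ||x_i - c_h||^2
    d2 <- pmin(d2, rowSums(sweep(x, 2, centers[k - 1, ])^2))
    centers[k, ] <- x[sample.int(n, 1, prob = d2 / sum(d2)), ]
  }
  centers
}

# Lloyd's algorithm can then be run from these seeds, e.g.
# kmeans(x, centers = kmeanspp_seeds(x, K), algorithm = "Lloyd")
```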

In general clustering, it is important to specify a good K for the _K_-means algorithm. For our purpose, a very accurate specification of K is not necessary. However, it is still important that K yields a smooth density whose peaks can reveal the clustering structure. This specification of K is similar to determining the number of bins when drawing a histogram. We adopt the formula of Freedman and Diaconis (1981):

$$K_j = \frac{x_{(n)j} - x_{(1)j}}{2\,\mathrm{IQR}(x_{\cdot j})\, n^{-1/3}} \qquad (1)$$

where $x_{(1)j}$ and $x_{(n)j}$ are, respectively, the minimum and maximum of the _j_-th dimension of the data $x_{\cdot j} = (x_{1j}, x_{2j}, \ldots, x_{nj})$, and IQR(·) is the interquartile range, defined as the difference between the 75th and the 25th percentiles. Our K is then defined as the ceiling of the median of the $K_j$'s, i.e.

$$K = \left\lceil \operatorname{median}(K_1, K_2, \ldots, K_d) \right\rceil \qquad (2)$$

where ⌈·⌉ is the ceiling function, which maps a real number to the smallest integer greater than or equal to it.
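In R, Equations (1) and (2) amount to a few lines; this sketch assumes the data are the rows of a numeric matrix x:

```r
# Per-dimension Freedman-Diaconis counts K_j (Equation 1), then the ceiling
# of their median (Equation 2).
choose_K <- function(x) {
  n <- nrow(x)
  Kj <- apply(x, 2, function(xj) {
    (max(xj) - min(xj)) / (2 * IQR(xj) * n^(-1/3))
  })
  ceiling(median(Kj))
}
```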

2.3 Gaussian finite mixture to model the density function

After _K_-means, we may approximate the density function f(x) by a Gaussian finite mixture model,

$$f(x) = \sum_{k=1}^{K} w_k\, \phi(x;\, \mu_k, \Sigma_k),$$

where the proportion $w_k$ of the _k_-th component satisfies $0 \le w_k \le 1$ and $\sum_{k=1}^{K} w_k = 1$, and $\phi(x; \mu_k, \Sigma_k)$ is the probability density function of the multivariate normal distribution with mean $\mu_k$ and variance matrix $\Sigma_k$. After applying the _K_-means algorithm of Section 2.2, we have already partitioned the data into K clusters, and for the _k_-th cluster we can compute the sample proportion $w_k$, sample mean $\mu_k$ and sample variance matrix $\Sigma_k$ (a rigorous notation would require hats, which are omitted for simplicity). However, the estimate $\Sigma_k$ may be too noisy, and we therefore smooth the variance matrix by

$$\widetilde{\Sigma}_k = h\left(\lambda_k\, \Sigma_k + (1 - \lambda_k)\, h_0\, \Sigma_0\right),$$

where h and $h_0$ are tuning parameters that make the density function smoother or rougher. The default settings in the software are $h = 1.5$ and $h_0 = 1$. Here, $\lambda_k = n w_k / (K + n w_k)$, so that a greater $w_k$ results in a $\lambda_k$ closer to 1; $\Sigma_0$ is the variance matrix assuming the data are uniformly distributed over the whole data range, a diagonal matrix with (j, j) element $\Sigma_{0,jj} = \{(x_{(n)j} - x_{(1)j}) / K^{1/d}\}^2$ for $j = 1, \ldots, d$.
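In code, the smoothing is a one-liner. Here is an R sketch of the reconstruction above, with Sigma_k and w_k taken from the _K_-means partition and the default tuning $h = 1.5$, $h_0 = 1$:

```r
# Smoothed covariance of cluster k (Section 2.3): a convex combination of the
# sample covariance Sigma_k and the "uniform" covariance Sigma0, scaled by h;
# larger clusters (bigger w_k) keep more of their own covariance.
smooth_sigma <- function(Sigma_k, w_k, n, K, Sigma0, h = 1.5, h0 = 1) {
  lambda_k <- n * w_k / (K + n * w_k)  # closer to 1 for larger clusters
  h * (lambda_k * Sigma_k + (1 - lambda_k) * h0 * Sigma0)
}
```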

2.4 Peak search and merging

According to our definition, a cluster is defined by its local peak. For each cell, we can use the greatest-gradient search (hill climbing) to find which local peak the cell reaches; this rules out any global optimization strategy such as the conjugate gradient algorithm. Searching the local maxima of the density function for all cells would be computationally very time consuming. Since the cells are pre-grouped by _K_-means, we only need to search the local peaks for the centers of the K _K_-means clusters. The hill-climbing method searches along the greatest gradient of the density function. If we take the negative of the density function as the objective, the hill climbing can be carried out by the steepest descent algorithm, as implemented in the GSL library (http://www.gnu.org/software/gsl/). We also need to restrict the step size in case a step goes too far and jumps to another local peak. When the search moves from one _K_-means cluster into another, we can speed it up by moving directly to the center of the other cluster. When two peaks are relatively close, they should be joined together and considered a single peak: we search for the two peaks with the closest Euclidean distance and check whether the two clusters are not too different from a single cluster. The details of the local peak search and peak merging are described in the Appendix. Algorithm 1 summarizes the steps of _K_-means and density peak finding used to cluster flow cytometry data as implemented in the software flowPeaks. In the end, we obtain $N_G$ (≤ K) merged clusters, each of which consists of one or more _K_-means clusters.

[Algorithm 1: summary of the flowPeaks steps (K selection, _K_-means partition, density smoothing, peak search and merging); presented as an image in the original.]
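Since Algorithm 1 is only available as an image here, the following runnable R toy (not the package implementation; the spherical covariances, fixed step size and merging threshold are all simplifications) shows the overall flow on simulated 2D data:

```r
# Toy sketch of the flowPeaks idea: K-means partition, mixture density built
# from the partition, hill climbing from each center, and grouping of the
# centers by the peak they reach.
set.seed(1)
x <- rbind(matrix(rnorm(600, 0, 1), ncol = 2),   # two well-separated
           matrix(rnorm(600, 5, 1), ncol = 2))   # Gaussian blobs
km <- kmeans(x, centers = 20, nstart = 5)
w  <- km$size / nrow(x)
dens <- function(p, s = 1) {  # spherical-Gaussian mixture at the centers
  sum(w * exp(-rowSums(sweep(km$centers, 2, p)^2) / (2 * s^2)))
}
climb <- function(p, step = 0.05, iters = 400) {
  for (i in seq_len(iters)) {
    g <- sapply(1:2, function(j) {       # numerical gradient of dens at p
      e <- c(0, 0); e[j] <- 1e-4
      (dens(p + e) - dens(p - e)) / 2e-4
    })
    if (sqrt(sum(g^2)) < 1e-12) break
    p <- p + step * g / sqrt(sum(g^2))   # bounded move along the gradient
  }
  p
}
peaks <- t(apply(km$centers, 1, climb))
# centers whose climbs end close together share one final (merged) cluster
final <- cutree(hclust(dist(peaks)), h = 0.5)
table(final[km$cluster])                 # cell counts per merged cluster
```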

2.5 Cluster tightening

By default, the flowPeaks algorithm does not identify outliers. Some data points, however, may lie far from a cluster center or cannot be unambiguously assigned to a specific cluster. We determine whether a data point is an outlier with the following strategy. Let ℓ(x) be the final merged-cluster label of data point x. Let $\omega_i$ and $f_i(x)$ be, respectively, the proportion and the probability density function of the _i_-th final merged cluster. The proportion $\omega_i$ is the sum of the $w_k$'s of the _K_-means clusters that form the _i_-th final merged cluster. The density $f_i(x)$ is itself a Gaussian finite mixture over the _K_-means clusters merged into the _i_-th final cluster, while the overall density f(x) is based on all K _K_-means clusters (see Section 2.3), so that $f(x) = \sum_{i=1}^{N_G} \omega_i f_i(x)$. A point x is declared an outlier if

$$f_{\ell(x)}(x) < 0.01\, \max_{z} f_{\ell(x)}(z)$$

or

$$\frac{\omega_{\ell(x)}\, f_{\ell(x)}(x)}{f(x)} < 0.8.$$

The thresholds 0.01 and 0.8 can be adjusted in the software settings.
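A small R sketch of this tightening rule as reconstructed above; f_i, omega, peak_dens and lab are illustrative inputs the caller must supply, not flowPeaks package objects:

```r
# x is a data point; lab(x) gives its merged-cluster label; f_i is a list of
# merged-cluster density functions; omega their proportions; peak_dens[i] is
# the maximum of f_i (its value at the i-th peak).
is_outlier <- function(x, lab, f_i, omega, peak_dens, eps = 0.01, tau = 0.8) {
  i  <- lab(x)
  fx <- sapply(seq_along(f_i), function(j) f_i[[j]](x))  # f_j(x) for all j
  f_i[[i]](x) < eps * peak_dens[i] ||                # far from the center
    omega[i] * f_i[[i]](x) / sum(omega * fx) < tau   # ambiguous assignment
}
```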

3 RESULTS

3.1 Datasets

Barcode data: the data were generated in a barcoding experiment (Krutzik and Nolan, 2006) with varying concentrations of fluorophores (APC and Pacific Blue). The flow cytometry data comprise 180 912 cells and three channels, with an additional channel for Alexa. The manual gates for the 20 clusters, used for assessing clustering performance, were created with flowJo (www.flowjo.com).

Simulated concave data: the data were simulated with two distinctive concave shapes, based on an idea from the supplementary material of Pyne et al. (2009); they comprise 2729 rows and two columns. Both the barcode data and the simulated concave data, along with their gold-standard cluster labels, are available in the flowPeaks package.

GvHD dataset: the graft-versus-host disease dataset and its manual gates were obtained from Aghaeepour et al. (2011). This dataset contains 12 samples, with cells stained for four markers: CD4, CD8b, CD3 and CD8. In addition, two channels, FS and SS, were also measured. These data are analyzed on the four markers unless specified otherwise. The numbers of cells in the 12 samples range from 12 000 to 32 000.

Rituximab data: flow cytometry data obtained from the flowClust package (Lo et al., 2009), with 1545 cells and two channels of interest; the data were originally produced by Gasparetto et al. (2004). The barcode, simulated concave and GvHD datasets have gold-standard cluster labels (from simulation or manual gating) with which to assess performance; the rituximab data are used for exploration. Figure 3 displays all four datasets.


Fig. 3.

Four example datasets: (A) barcode data with the 20 clusters. (B) Concave data with two clusters. (C) One of the 12 samples in the GvHD dataset; the plot shows only the first two of the four channels, where different colors indicate different manually gated clusters and black points are outliers. (D) Rituximab data. (A color version of this figure is available as Supplementary Material.)

3.2 Different metrics to assess the cluster algorithm performance

The most widely used metric for assessing how a candidate clustering algorithm compares with a gold standard, for which the correct cluster memberships are known, is the adjusted Rand index (Hubert and Arabie, 1985; Rand, 1971). The Rand index (Rand, 1971) is based on the percentage of agreement between two clustering methods. Assume that n data points are labeled by two different clustering methods, say Method A and Method B, with $K_A$ and $K_B$ clusters, respectively. Let $A_i$ and $B_i$, $i = 1, \ldots, n$, be the cluster labels under the two methods. The Rand index is defined as

$$R = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} I\!\left( I(A_i = A_j) = I(B_i = B_j) \right),$$

where I(·) is the indicator function. The adjusted Rand index corrects for chance; its general form is

$$\text{adjusted index} = \frac{\text{index} - \text{expected index}}{\text{maximum index} - \text{expected index}}.$$

In order to compute the adjusted Rand index, we first define the contingency table counts

$$n_{a,b} = \sum_{i=1}^{n} I(A_i = a)\, I(B_i = b)$$

for $a = 1, \ldots, K_A$ and $b = 1, \ldots, K_B$. The marginal sums of the contingency table are then defined as

$$n_{a,+} = \sum_{b=1}^{K_B} n_{a,b}, \qquad n_{+,b} = \sum_{a=1}^{K_A} n_{a,b}.$$

Note that $n = \sum_{a=1}^{K_A} n_{a,+} = \sum_{b=1}^{K_B} n_{+,b}$. The adjusted Rand index can be quickly computed with the following formula (Hubert and Arabie, 1985):

$$\text{Adj-Rand} = \frac{\displaystyle\sum_{a,b} \binom{n_{a,b}}{2} - \left[\sum_{a} \binom{n_{a,+}}{2} \sum_{b} \binom{n_{+,b}}{2}\right] \Big/ \binom{n}{2}}{\displaystyle\frac{1}{2}\left[\sum_{a} \binom{n_{a,+}}{2} + \sum_{b} \binom{n_{+,b}}{2}\right] - \left[\sum_{a} \binom{n_{a,+}}{2} \sum_{b} \binom{n_{+,b}}{2}\right] \Big/ \binom{n}{2}}$$
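This formula translates directly into a few lines of R; a sketch taking two cluster-label vectors:

```r
# Adjusted Rand index from the contingency table of labels A and B
# (Hubert and Arabie, 1985).
adjusted_rand <- function(A, B) {
  tab <- table(A, B)
  sum_ab <- sum(choose(tab, 2))             # sum over cells of C(n_ab, 2)
  sum_a  <- sum(choose(rowSums(tab), 2))    # row marginals
  sum_b  <- sum(choose(colSums(tab), 2))    # column marginals
  expected <- sum_a * sum_b / choose(length(A), 2)
  (sum_ab - expected) / ((sum_a + sum_b) / 2 - expected)
}
```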

Rosenberg and Hirschberg (2007) proposed the _V_-measure for evaluating clustering algorithms. This measure uses entropy to assess how much a second clustering provides extra information about the first. For clustering Method A, the entropy is

$$H(A) = -\sum_{a=1}^{K_A} \frac{n_{a,+}}{n} \log \frac{n_{a,+}}{n},$$

and the conditional entropy

$$H(A \mid B) = -\sum_{a=1}^{K_A} \sum_{b=1}^{K_B} \frac{n_{a,b}}{n} \log \frac{n_{a,b}}{n_{+,b}}.$$

The conditional entropy H(A|B) is never greater than the entropy H(A). The extra information provided by Method B about Method A is the entropy reduction H(A) − H(A|B). After normalization, we can define

$$h = 1 - \frac{H(A \mid B)}{H(A)}.$$

In the above equation, h = 1 by definition if H(A) = 0. If we reverse the roles of A and B, we can define

$$c = 1 - \frac{H(B \mid A)}{H(B)}.$$

If Method B is the candidate clustering to be compared with the gold-standard clustering A, then h evaluates the homogeneity of Method B's clustering, while c evaluates its completeness. Homogeneity ensures that the gold-standard labels (A labels) of all data points within a candidate cluster (B cluster) are unique. Completeness ensures that the data points of each gold-standard cluster (A cluster) are all assigned to a single candidate cluster (B cluster). Details can be found in Rosenberg and Hirschberg (2007). The _V_-measure is a weighted harmonic mean of h and c: $V_\beta = (1 + \beta)hc/(\beta h + c)$. In this article, we fix β at 1.
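A compact R sketch of h, c and the _V_-measure from two label vectors, using the identity H(A|B) = H(A, B) − H(B):

```r
v_measure <- function(A, B, beta = 1) {
  p  <- table(A, B) / length(A)            # joint proportions n_{a,b}/n
  pa <- rowSums(p); pb <- colSums(p)       # marginal proportions
  H  <- function(q) { q <- q[q > 0]; -sum(q * log(q)) }
  hom  <- if (H(pa) == 0) 1 else 1 - (H(p) - H(pb)) / H(pa)  # h: homogeneity
  comp <- if (H(pb) == 0) 1 else 1 - (H(p) - H(pa)) / H(pb)  # c: completeness
  (1 + beta) * hom * comp / (beta * hom + comp)
}
```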

3.3 Application

Table 1 reports the running times of all the algorithms applied to the concave and barcode datasets described in Section 3.1. The algorithms flowPeaks, Misty Mountain (Sugar and Sealfon, 2010), FLOCK (Qian et al., 2010) and flowMeans (Aghaeepour et al., 2011) fall into a category where the computational time is under several minutes, so that they can compete with manual gating, while FLAME (Pyne et al., 2009) and flowMerge (Finak et al., 2009) take too much computational time to be practically useful. Among the first four algorithms, the good seeding strategy and the _k_-d tree implementation make flowPeaks slightly faster than the others.

Table 1. Comparison of the running time (in seconds) of different flow cytometry clustering algorithms

            flowPeaks   Misty Mountain   FLOCK   flowMeans   FLAME    flowMerge
Concave     0.13        0.59             6       6.2         1434     3202
Barcode     2.3         24.7             14      82.3        80952    132446

When applying the three metrics of Section 3.2 to assess the different algorithms, we removed the outliers according to the gold standard. Tables 2 and 3 give the performance of the different algorithms against the gold standard. We see that flowPeaks does quite well on both the barcode data and the concave data. Owing to the slow speed of flowMerge and FLAME, and the difficulty of batch-running FLOCK and FLAME, which are only available through a web interface, for the performance comparison on the 12 samples of the GvHD dataset we selected only flowPeaks, Misty Mountain and flowMeans, the three best algorithms according to Tables 2 and 3. Table 4 shows that flowPeaks performs better than the other two algorithms on the GvHD dataset. The flowPeaks results for the four datasets are displayed in Figures 4A and 5A-C. Since the rituximab data have no gold standard, we rely on the visual display, which shows that flowPeaks does a good job of revealing the cluster structure of the data. Figure 5D displays the application of flowPeaks to the GvHD data when the FSC and SSC channels are included. The 6D clustering agrees closely with the 4D clustering, with only 0.59% of the points classified differently.


Fig. 4.

Application of flowPeaks to the barcode data. (A) The bold boundaries display the clusters output by flowPeaks, with their centers (⊕); the dotted lines are the boundaries of the underlying _K_-means clusters, with their centers (○). The local peaks are indicated by ▵. (B) The same as (A), except that the outliers are identified as black points, the other secondary information is not displayed, and the clusters are labeled according to their proportions ($w_k$). (A color version of this figure is available as Supplementary Material.)


Fig. 5.

The application of flowPeaks to different datasets. See Figure 4A for an explanation of the legends. (A) Concave data. (B) One sample of the GvHD dataset, shown in Figure 3C; the boundaries between flowPeaks clusters and between _K_-means clusters are not drawn, as two non-overlapping clusters generated in 4D may overlap in a 2D projection. (C) Rituximab dataset. (D) The same sample as in (B) with the FSC and SSC channels included. Owing to the long running time required for the heatmap, 4000 data points were randomly selected to generate the cluster tree and the heatmap. The three rows fp4D.lab, fp6D.lab and GS.lab display, respectively, the class labels of flowPeaks on 4D, flowPeaks on 6D and the gold standard, where different colors indicate different clusters. The signal intensities of all six channels are displayed in the heatmap, with the key at the bottom. (A color version of this figure is available as Supplementary Material.)

Table 2. The performance of different algorithms on the barcode data

            flowPeaks   Misty Mountain   FLOCK   flowMeans   FLAME   flowMerge
Adj-Rand    0.998       0.971            0.258   0.998       0.859   0.801
F-measure   0.993       0.984            0.341   0.993       0.868   0.887
V-measure   0.996       0.967            0.567   0.995       0.946   0.952

Adj-Rand, adjusted Rand index.

Table 3. Performance of different algorithms on the concave data

            flowPeaks   Misty Mountain   FLOCK   flowMeans   FLAME   flowMerge
Adj-Rand    1.000       1.000            0.501   0.723       0.232   0.952
F-measure   1.000       1.000            0.683   0.884       0.438   0.987
V-measure   1.000       1.000            0.667   0.713       0.480   0.932

Adj-Rand, adjusted Rand index.

Table 4. Comparison of the performance on the 12 samples of the GvHD dataset

            flowPeaks       Misty Mountain   flowMeans
Adj-Rand    0.807 (0.175)   0.675 (0.287)    0.573 (0.292)
F-measure   0.924 (0.075)   0.859 (0.146)    0.848 (0.120)
V-measure   0.816 (0.135)   0.664 (0.205)    0.639 (0.199)

Adj-Rand, adjusted Rand index. Each entry lists the mean (standard deviation) across the 12 samples.

4 SOFTWARE

We have implemented the algorithm in C++ and wrapped it into an R package named 'flowPeaks'. The following example illustrates how to use the basic functions of this R package.

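The original listing is rendered as an image in this version; the following is a minimal sketch of the documented interface (the dataset name and column selection follow our reading of the package vignette and should be verified there):

```r
library(flowPeaks)
data(barcode)                        # barcode dataset shipped with the package
fp <- flowPeaks(barcode[, c(1, 3)])  # cluster two channels of the barcode data
plot(fp)                             # a display along the lines of Figure 4A
```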

A script along these lines displays Figure 4A. In order to identify the outliers and obtain Figure 4B, we can proceed further with the following step.

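The outlier script is likewise an image; a sketch assuming the package's assign.flowPeaks() function and the plot arguments as described in the package documentation (names to be verified against the package help):

```r
# Re-assign each cell under the tightening rule of Section 2.5; cells that
# fail the rule are flagged as outliers in the returned labels.
fpc <- assign.flowPeaks(fp, fp$x)
plot(fp, classlab = fpc, drawboundary = FALSE,
     drawvor = FALSE, drawkmeans = FALSE, drawlab = TRUE)
```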

For further use of the flowPeaks software, one can consult the package's vignette PDF and help documents.

5 DISCUSSION AND FUTURE WORK

In this article, we described the algorithm flowPeaks, which combines _K_-means and density-function peak finding to partition flow cytometry data into distinct clusters. We compared our algorithm with other state-of-the-art algorithms on real and simulated datasets. Our algorithm is fast and able to detect non-convex cluster shapes. We should point out that the goal of flowPeaks is to find the overall density shape and search for global structure. It will not be able to uncover overlapping clusters, as in Figure 1C, or the rare cluster shown in Figure 2C. The flowPeaks algorithm is based on the geometric shape of the density function. Prior to applying flowPeaks, data transformation may be necessary to reveal the structure, and irrelevant channels need to be discarded first to avoid the curse of dimensionality. If the data dimension is too high and the number of cells too low for the density function to be reliably estimated by flowPeaks, users should instead use a heatmap to visualize the data.

As commented in Jain (2010), there is no single clustering algorithm suitable for all datasets. This is probably true for flow cytometry clustering. There is no good collection of flow cytometry data with gold-standard gates, which makes algorithm comparison very challenging; the comparison in Section 3.3 should therefore not be taken literally. We tend to agree with Naumann et al. (2010) that 'it is too early for extensive comparisons of automated gating procedures'. The current approach of using manual gating as a gold standard for comparing automatic gating algorithms is very subjective. We participated in the 2011 flowCAP II challenge (http://flowcap.flowsite.org/summit2011.html) with flowPeaks and a support vector machine algorithm. Our algorithm gave 100% prediction accuracy on the clinical flow cytometry data, placing it among the best algorithms. We have released our datasets, with their gold-standard gates, in the flowPeaks package so that anyone can test their favorite algorithm against them. The source code and Windows binary build of the R package flowPeaks are available at https://github.com/yongchao/flowPeaks. The package is in the process of being permanently hosted on Bioconductor (Gentleman et al., 2004; Ihaka and Gentleman, 1996), with open source code for algorithm developers and batch processing.

APPENDIX

Mathematical notation

For the sake of clarity, we use the following notation. Assume the data consist of n points in d dimensions.

Let the underlying clusters, obtained by _K_-means, be labeled 1, …, K. The density function generated by the finite mixture model is

$$f(x) = \sum_{k=1}^{K} w_k\, \phi(x;\, \mu_k, \widetilde{\Sigma}_k),$$

where $w_k$, $\mu_k$ and $\widetilde{\Sigma}_k$ are, respectively, the weight, mean and smoothed variance matrix of cluster k, for k = 1, …, K. The derivative of the density function at x is

$$\nabla f(x) = \sum_{k=1}^{K} w_k\, \phi(x;\, \mu_k, \widetilde{\Sigma}_k)\, \widetilde{\Sigma}_k^{-1} (\mu_k - x).$$

According to the _K_-means algorithm, the cluster label of x can be defined as

$$L(x) = \arg\min_{1 \le k \le K} \lVert x - \mu_k \rVert.$$

Searching the local peak starting from a point x

As we do not want to jump over the local peak, when a point falls into cluster k we define the maximum step size, which bounds how far a single move may travel:

$$s_k = \tfrac{1}{2} \min_{k' \ne k} \lVert \mu_k - \mu_{k'} \rVert.$$

The detailed computations for the local peak search are described in Algorithm A1. We initially set a small step size β (Step 0) and try to find a step size such that the density function f improves (Steps 2 and 3). If the same step size gives an improvement twice in a row ($N_{suc}$ denotes the number of consecutive improvements), we double the step size; otherwise we halve it. If the point falls into a new cluster, we check whether we can jump directly to the new center (Step 6).
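Algorithm A1 itself is an image in the original; the following R sketch captures the adaptive step-size scheme just described (the initial β and the convergence tolerance are illustrative choices):

```r
# Hill climbing with the doubling/halving rule: f is the mixture density,
# grad_f its gradient, and step_max(x) the per-cluster bound defined above.
climb_adaptive <- function(x, f, grad_f, step_max, beta = 1e-3, tol = 1e-8) {
  n_suc <- 0                                  # consecutive successful steps
  repeat {
    g  <- grad_f(x)
    gn <- sqrt(sum(g^2))
    if (gn < tol) return(x)                   # gradient ~ 0: at a local peak
    step <- min(beta, step_max(x)) * g / gn   # bounded step along gradient
    if (f(x + step) > f(x)) {                 # accept: density improved
      x <- x + step
      n_suc <- n_suc + 1
      if (n_suc == 2) { beta <- 2 * beta; n_suc <- 0 }  # two in a row: double
    } else {
      beta <- beta / 2                        # reject: halve the step size
      n_suc <- 0
      if (beta < tol) return(x)               # step negligibly small: stop
    }
  }
}
```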

The algorithm on merging local peaks

When two peaks are close and the density function between them is relatively flat, the two peaks should be combined into one. For each underlying _K_-means cluster, we define the nearest-neighbor cluster distance by

$$S_k = \min_{k' \ne k} \lVert \mu_k - \mu_{k'} \rVert.$$

For an arbitrary position x, we can similarly define S(x) = $S_{L(x)}$. Let x and y be two points. We define the tolerance, which describes how well the density function along the line segment connecting x and y is approximated by a straight line, as

$$\mathrm{tol}(x, y) = \max_{0 \le t \le 1} \frac{\hat{f}(z_t) - f(z_t)}{\hat{f}(z_t)} + \cdots,$$

where $z_t = x + t(y - x)$ and $\hat{f}(z_t) = (1 - t)\, f(x) + t\, f(y)$ is the fitted density at the position $z_t$, obtained by connecting the two points (x, f(x)) and (y, f(y)) with a straight line. The second term in defining tol (indicated by ⋯) corrects for cluster sample sizes.

Many _K_-means centers may reach the same local peak. A local peak can then be represented by a subset $P_j$ of {1, …, K}, with its location denoted by $\nu_j$, where $j = 1, \ldots, N_P$ and $N_P$ is the number of distinct local peaks. In other words, for each k in $P_j$, $\mu_k$ moves to the same $\nu_j$ under our local peak algorithm. Initially, set $G_g = \{g\}$, $g = 1, \ldots, N_P$, i.e. each peak set contains just a single peak (Step 0). Two peak sets can be merged only if the two peaks are relatively close and the density function between them is relatively flat (Step 1). The sets $G_g$ are merged hierarchically (Step 2). The details are given in Algorithm A2. After the algorithm completes, $N_G$ is the number of final clusters (see Section 2.4).

[Algorithm A2: hierarchical merging of the local peaks; presented as an image in the original.]
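As one concrete reading of the flatness criterion in Step 1, the following R helper checks whether the density between two peaks stays close to the straight-line interpolation (the 20-point grid and the 0.1 threshold are illustrative, not package defaults):

```r
# Flatness test between two peaks p1 and p2: evaluate the mixture density f
# along the segment z_t = p1 + t*(p2 - p1) and compare it with the chord
# fhat(z_t) = (1 - t)*f(p1) + t*f(p2); a small maximum relative dip below
# the chord suggests the two peaks belong to a single cluster.
flat_between <- function(p1, p2, f, tol = 0.1, m = 20) {
  t    <- seq(0, 1, length.out = m)
  fz   <- sapply(t, function(tt) f(p1 + tt * (p2 - p1)))
  fhat <- (1 - t) * f(p1) + t * f(p2)
  max((fhat - fz) / fhat) < tol
}
```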

ACKNOWLEDGEMENTS

We thank Fernand Hayot, Istvan Sugar and German Nudleman for valuable comments and discussions. We thank Ryan Brinkman and Nima Aghaeepour for providing the GvHD dataset and the associated manual gates. We appreciate the reviewers' insightful comments, resulting in a much improved article.

Funding: National Institute of Allergy and Infectious Diseases [contract HHSN 266200500021C].

Conflict of Interest: none declared.

REFERENCES

Aghaeepour,N. et al. (2011) Rapid cell population identification in flow cytometry data. Cytometry A, 79, 6-13.

Arthur,D. and Vassilvitskii,S. (2007) k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, New Orleans, pp. 1027-1035.

Chan,C. et al. (2008) Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry A, 73, 693-701.

Finak,G. et al. (2009) Merging mixture components for cell population identification in flow cytometry. Adv. Bioinformatics, 2009, 247646.

Freedman,D. and Diaconis,P. (1981) On the histogram as a density estimator: L2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 57, 453-476.

Fung,B.C.M. et al. (2003) Hierarchical document clustering using frequent itemsets. In Proceedings of the Third SIAM International Conference on Data Mining (SDM). SIAM, San Francisco, CA, pp. 59-70.

Gasparetto,M. et al. (2004) Identification of compounds that enhance the anti-lymphoma activity of rituximab using flow cytometric high-content screening. J. Immunol. Methods, 292, 59-71.

Gentleman,R.C. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5, R80.

Hartigan,J.A. and Wong,M.A. (1979) A K-means clustering algorithm. Appl. Stat., 28, 100-108.

Hubert,L. and Arabie,P. (1985) Comparing partitions. J. Classif., 2, 193-218.

Ihaka,R. and Gentleman,R. (1996) R: a language for data analysis and graphics. J. Comput. Graph. Stat., 5, 299-314.

Jain,A.K. (2010) Data clustering: 50 years beyond K-means. Pattern Recogn. Lett., 31, 651-666.

Kanungo,T. et al. (2002) An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell., 24, 881-892.

Krutzik,P.O. and Nolan,G.P. (2006) Fluorescent cell barcoding in flow cytometry allows high-throughput drug screening and signaling profiling. Nat. Methods, 3, 361-368.

Lloyd,S.P. (1982) Least squares quantization in PCM. IEEE Trans. Inform. Theory, IT-28, 129-139.

Lo,K. et al. (2008) Automated gating of flow cytometry data via robust model-based clustering. Cytometry A, 73, 321-332.

Lo,K. et al. (2009) flowClust: a Bioconductor package for automated gating of flow cytometry data. BMC Bioinformatics, 10, 145.

Murphy,R.F. (1985) Automated identification of subpopulations in flow cytometric list mode data using cluster analysis. Cytometry, 6, 302-309.

Naumann,U. et al. (2010) The curvHDR method for gating flow cytometry samples. BMC Bioinformatics, 11, 44.

Pyne,S. et al. (2009) Automated high-dimensional flow cytometric data analysis. Proc. Natl Acad. Sci. USA, 106, 8519-8524.

Qian,Y. et al. (2010) Elucidation of seventeen human peripheral blood B-cell subsets and quantification of the tetanus response using a density-based method for the automated identification of cell populations in multidimensional flow cytometry data. Cytometry B, 78 (Suppl. 1), S69-S82.

Rand,W.M. (1971) Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc., 66, 846-850.

Rosenberg,A. and Hirschberg,J. (2007) V-measure: a conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Association for Computational Linguistics, Prague, pp. 410-420.

Sugar,I.P. and Sealfon,S.C. (2010) Misty Mountain clustering: application to fast unsupervised flow cytometry gating. BMC Bioinformatics, 11, 502.

Author notes

Associate Editor: Jonathan Wren

© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
