Time-variant clustering model for understanding cell fate decisions

Wei Huang et al. Proc Natl Acad Sci U S A. 2014.

Abstract

Both spatial characteristics and temporal features are often the subjects of concern in physical, social, and biological studies. This work tackles the clustering problems for time course data in which the cluster number and clustering structure change with respect to time, dubbed time-variant clustering. We developed a hierarchical model that simultaneously clusters the objects at every time point and describes the relationships of the clusters between time points. The hidden layer of this model is a generalized form of branching processes. A reversible-jump Markov Chain Monte Carlo method was implemented for model inference, and a feature selection procedure was developed. We applied this method to explore an open question in preimplantation embryonic development. Our analyses using single-cell gene expression data suggested that the earliest cell fate decision could start at the 4-cell stage in mice, earlier than the commonly thought 8- to 16-cell stage. These results together with independent experimental data from single-cell RNA-seq provided support against a prevailing hypothesis in mammalian development.

Keywords: branching process; cell fate; clustering; embryonic development; time.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.

A toy example of time-variant clustering. (A) Cross-sectional data at three time points. Each data point (×) comes from a 2D space. (B) A desirable clustering outcome. There were two clusters (red, green) at the first time point. The red cluster persisted, and the green cluster split into two (green, blue), generating a total of three clusters at the second time point. The red cluster diminished at the third time point, whereas the other two clusters persisted.
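The clustering outcome described in panel B can be written down as a small forest. The sketch below is only an illustrative encoding, not the authors' data structure: the list-of-dicts layout and the helper names `clusters_per_time`, `count_trees`, and `events` are assumptions made for this example.

```python
# Hypothetical encoding of the toy example in Fig. 1B.
# Each time point maps a cluster name to its parent cluster at the previous
# time point; None marks a root, i.e. the start of a tree.
forest = [
    {"red": None, "green": None},                       # time point 1
    {"red": "red", "green": "green", "blue": "green"},  # time point 2
    {"green": "green", "blue": "blue"},                 # time point 3
]


def clusters_per_time(forest):
    """Number of clusters at each time point."""
    return [len(level) for level in forest]


def count_trees(forest):
    """Number of trees in the forest (clusters without a parent)."""
    return sum(parent is None for level in forest for parent in level.values())


def events(forest):
    """Label each cluster at time t by what happens to it at time t + 1."""
    out = []
    for t in range(len(forest) - 1):
        children = {}
        for child, parent in forest[t + 1].items():
            children.setdefault(parent, []).append(child)
        for name in forest[t]:
            n = len(children.get(name, []))
            kind = "ends" if n == 0 else "persists" if n == 1 else "splits"
            out.append((t + 1, name, kind))
    return out


print(clusters_per_time(forest))  # cluster counts at the three time points
print(count_trees(forest))
for time_point, name, kind in events(forest):
    print(f"time {time_point}: {name} {kind}")
```

Under this encoding the cluster counts are 2, 3, and 2, the green cluster is labeled as splitting after the first time point, and the red cluster is labeled as ending after the second.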

Fig. 2.

Data structure of a single-cell gene expression experiment. (A) Experimental design. Mouse embryos at several time points (one-cell stage, two-cell stage, and so on) were retrieved. Each embryo was separated into single cells. The expression levels of 48 genes were assayed in each single cell. (B) Representation of the data matrix produced by this experiment.

Fig. 3.

The hidden branching process model. The hidden layer is a branching stochastic process. Each realization of this process is a forest, which contains the information of the number of clusters at each time point. For any given time point, the observed data are generated by a finite mixture model, with the number of components being the number of branches in the hidden layer.
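The two-layer structure in this caption can be illustrated with a small generative simulation. This is a minimal sketch under strong simplifying assumptions and should not be read as the model specified in the paper: the split and end probabilities (`p_split`, `p_end`), the equal mixture weights, the Gaussian components, and the random drift of child cluster means are all hypothetical choices made for illustration. The only property taken from the caption is that the number of mixture components at each time point equals the number of branches in the hidden layer.

```python
import random


def simulate_forest(n_times, n_roots=2, p_split=0.3, p_end=0.1, rng=None):
    """Simulate a hidden forest as a list of levels.

    Each level is a list with one entry per branch; the entry is the index of
    the parent branch in the previous level (None for roots at time 0).
    """
    rng = rng or random.Random(0)
    levels = [[None] * n_roots]
    for _ in range(n_times - 1):
        nxt = []
        for idx in range(len(levels[-1])):
            u = rng.random()
            if u < p_end:
                continue  # branch ends: no child at the next time point
            n_children = 2 if u < p_end + p_split else 1  # split or persist
            nxt.extend([idx] * n_children)
        if not nxt:
            nxt = [0]  # simplification: keep at least one branch alive
        levels.append(nxt)
    return levels


def simulate_data(levels, n_points=50, dim=2, spread=0.5, rng=None):
    """Draw observations at each time point from a finite Gaussian mixture
    whose number of components equals the number of branches at that time."""
    rng = rng or random.Random(1)
    means_prev, data = [], []
    for parents in levels:
        means = []
        for parent in parents:
            if parent is None:
                mean = [rng.uniform(-5, 5) for _ in range(dim)]
            else:
                # Assumed: a child cluster starts near its parent's mean.
                mean = [m + rng.gauss(0, 1) for m in means_prev[parent]]
            means.append(mean)
        points = []
        for _ in range(n_points):
            z = rng.randrange(len(means))  # equal mixture weights (assumed)
            points.append((z, [rng.gauss(m, spread) for m in means[z]]))
        data.append(points)
        means_prev = means
    return data


levels = simulate_forest(n_times=4)
data = simulate_data(levels)
for t, (parents, points) in enumerate(zip(levels, data), start=1):
    print(f"time {t}: {len(parents)} branches, {len(points)} observations")
```

Inference runs in the opposite direction: only the observations are seen, and the forest together with the cluster assignments has to be recovered, which the abstract says is done with reversible-jump Markov chain Monte Carlo.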

Fig. 4.

Changing forest structure with basic moves. (A) An example of changing a forest into another forest using basic moves. (B) The allowed splits and merges. Node 1 at the second time point can split, whereas node 2 at the second time point cannot. Branches 2 and 3 can merge, whereas branches 1 and 2 cannot.

Fig. 5.

An outline of the iterative algorithm for model inference.

Fig. 6.

Simulation studies. Rows: data and results from every simulation. Column 1: clustering structure and the number of data points in each cluster. Column 2: the number of trees (y axis) reported for each iteration (x axis). Column 3: the number of branches at the last time point, reported for each iteration. The density histogram summarizes the distribution of the number of branches across all iterations. Column 4: the weights of one randomly selected cluster, with respect to the number of iterations. (t, c) indicates the time point and the cluster number at this time point. (Right) Histogram of the weights across all iterations. Column 5: the number of selected features with respect to the number of iterations.

Fig. 7.

Analysis of single-cell gene expression data. The number of selected genes (A), trees (B), and branches (C) with respect to the number of iterations. The density histogram in C summarizes the distribution of the number of branches across all iterations. (D) The resulting clustering configuration with respect to time. The number of cells assigned to each cluster is given on each cluster node. The average expression levels of the five selected genes in each cluster are shown as a pie chart on top of the cluster, where the radius of each pie is proportional to the average expression level of a gene. (E) The relative sizes of each cluster with respect to time. (F) The cluster assignments of each cell in every four-cell stage embryo. Yellow and red represent clusters 1 and 2, respectively (see D and E). The cell in gray in embryo 2 is a missing data point. (G) Model of a mutual inhibitory feedback loop between Pou5f1 and Pdgfa. The darker arm is supported by independent experimental data. (H) P values from the dip test of unimodality. For each gene, the between-embryo normalized FPKMs of each blastomere were used for the test.
