\documentclass[12pt]{article}
\usepackage{epsfig}

\oddsidemargin 0.0in
\textwidth 6.5in
\topmargin -0.5in
\textheight 8.0in
\footskip 0.75in
\headheight 0.75in
\headsep 0.25in

\newenvironment{vlist}{\begin{list}{}
{\setlength{\leftmargin}{2em}
\setlength{\itemindent}{-2em}}}{\end{list}}

\newcommand{\ls}[1]
  {\dimen0=\fontdimen6\the\font
   \lineskip=#1\dimen0
   \advance\lineskip.5\fontdimen5\the\font
   \advance\lineskip-\dimen0
   \lineskiplimit=.9\lineskip
   \baselineskip=\lineskip
   \advance\baselineskip\dimen0
   \normallineskip\lineskip
   \normallineskiplimit\lineskiplimit
   \normalbaselineskip\baselineskip
   \ignorespaces}

\def\bx{\mbox{\boldmath $x$}}
\def\by{\mbox{\boldmath $y$}}
\def\btheta{\mbox{\boldmath $\theta$}}
\def\bTheta{\mbox{\boldmath $\Theta$}}
\def\bmu{\mbox{\boldmath $\mu$}}
\def\bdelta{\mbox{\boldmath $\delta$}}
\def\bepsilon{\mbox{\boldmath $\epsilon$}}
\def\boldeta{\mbox{\boldmath $\eta$}}
\def\boldnu{\mbox{\boldmath $\nu$}}

\pagestyle{myheadings}
\markright{Jordan and Jacobs: Modular and Hierarchical Systems}

\begin{document}

%In press: M. Arbib (Ed.), {\em The Handbook
%of Brain Theory and Neural Networks.} Cambridge, MA: MIT Press.

\begin{center}
{\LARGE \bf Modular and hierarchical learning systems}\\
\vskip16pt
{\large Michael I. Jordan\\
{\tt jordan@cs.berkeley.edu}\\
Division of Computer Science and Department of Statistics\\
University of California, Berkeley
\vskip12pt
Robert A. Jacobs\\
{\tt jacobs@bcs.rochester.edu}\\
Department of Brain and Cognitive Sciences\\
University of Rochester}
\end{center}

\vskip11pt
\vskip11pt

\noindent
RUNNING HEAD: Modular and Hierarchical Systems

\vspace{\fill}
\footnoterule

\noindent Correspondence:\\
\noindent Michael I. Jordan\\
\noindent EECS Computer Science Division\\
387 Soda Hall \# 1776\\
Berkeley, CA 94720-1776\\
\noindent Phone: (510) 642-3806\\
\noindent Fax: (510) 642-5775\\
\noindent email: jordan@cs.berkeley.edu

\newpage

\section{Introduction}

In this article we discuss the problem of learning in modular and
hierarchical systems.  Modular and hierarchical systems allow complex
learning problems to be solved by dividing the problem into a set of
sub-problems, each of which may be simpler to solve than the original
problem.  Within the context of supervised learning---our focus in this
article---modular architectures arise when we assume that the data can
be well described by a collection of functions, each of which is
defined over a relatively local region of the input space.  A modular
architecture can model such data by allocating different modules to
different regions of the space.  Hierarchical architectures arise when
we assume that the data are well described by a multi-resolution
model---a model in which regions are divided recursively into
sub-regions.

Modular and hierarchical systems present an interesting credit
assignment problem---it is generally the case that the learner is not
provided with prior knowledge of the partitioning of the input space.
Knowledge of the partition would correspond to being given ``labels''
specifying how to allocate modules to data points.  The assumption we
make is that such labels are absent.
The situation is reminiscent of the unsupervised clustering problem, in
which a classification rule must be inferred from a data set in which
the class labels are absent; indeed, the connection to clustering has
played an important role in the development of the supervised learning
algorithms that we present here.  The learning algorithms that we
describe below solve the credit assignment problem by computing a set
of values---posterior probabilities---that can be thought of as
estimates of the missing ``labels.''  These posterior probabilities are
based on a probabilistic model associated with each of the network
modules.  This approach to learning in modular systems was developed by
Jacobs, Jordan, Nowlan, and Hinton (1991).  Jordan and Jacobs (1994)
extended the modular system to a hierarchical system, made links to the
statistical literature on classification and regression trees (Breiman,
Friedman, Olshen, \& Stone, 1984), and developed an
Expectation-Maximization (EM) algorithm for the architecture.  We
describe these developments in the remainder of the article,
emphasizing the probabilistic framework.

\section{The Mixture of Experts (ME) Architecture}

The modular architecture that we consider is shown in
Figure~\ref{fig:modular}.
\begin{figure}[tb]
\centerline{\epsfig{figure=ps/modular.ps}}
\caption{A mixture of experts architecture.  The output $\protect\bmu$
is the conditional mean of ${\bf y}$ given ${\bf x}$ (see text).}
\label{fig:modular}
\end{figure}
The architecture is composed of $N$ modules referred to as {\em expert
networks}, each of which implements a parameterized function
$\bmu_i = f({\bf x}, \btheta_i)$ from inputs ${\bf x}$ to outputs
$\bmu_i$, where $\btheta_i$ is a parameter vector.  We attach a
probabilistic interpretation to each of the expert networks by assuming
that the experts generate outputs $\by$ with probability
$P(\by | \bx, \btheta_i)$, where $\bmu_i$ is the mean of the conditional
density $P$.

Because we assume that different expert networks are appropriate in
different regions of the input space, the architecture requires a
mechanism that identifies, for any given input ${\bf x}$, the expert or
blend of experts that is most likely to produce the correct output.
This is accomplished via an auxiliary network, known as a {\em gating
network}, that produces as output a set of scalar coefficients $g_{i}$
that serve to weight the contributions of the various experts.  These
coefficients are not fixed constants but vary as a function of the
input ${\bf x}$.  The probabilistic interpretation of the gating
network is as a {\em classifier}, a system that maps an input $\bx$
into the probabilities that the various experts will be able to
generate the desired output (based on knowledge of $\bx$ alone).  These
probabilities (the $g_i$) are constrained to be nonnegative and to sum
to one (for each $\bx$).

There are many ways to enforce the probabilistic constraints on the
$g_i$.  One approach is to utilize the {\em softmax\/} function,
defined as follows.
Let $\xi_i$ denote an intermediate set of variables that are
parameterized functions of the input ${\bf x}$:
\begin{equation}
\xi_i = \xi_i({\bf x}, \boldeta),
\label{eq:xi}
\end{equation}
where $\boldeta$ is a parameter vector, and define the outputs $g_i$ in
terms of the $\xi_i$ as follows:
\begin{equation}
g_i = \frac{e^{\xi_i}}{\sum_j e^{\xi_j}}.
\label{eq:softmax}
\end{equation}
It is readily verified that the $g_{i}$ are nonnegative and sum to one
for each ${\bf x}$.  This approach has the virtue of having a simple
probabilistic interpretation:  the $\xi_i$ can be viewed as
discriminant surfaces for a classification problem in which the
class-conditional densities are members of the exponential family of
probability distributions (Jordan \& Jacobs, 1994).

\subsection{The Mixture Model}

Let us now specify the probabilistic model underlying the mixture of
experts architecture more precisely.  We assume that the training set
${\cal X} = \{({\bf x}^{(l)}, {\bf y}^{(l)})\}_{l=1}^{L}$ is generated
in the following way.  Given the choice of an input ${\bf x}$, a label
$i$ is chosen with probability $P(i | {\bf x}, \boldeta^0)$ (where the
superscript `0' denotes the putative true values of the parameters).
Given the choice of the label and given the input, the target output
${\bf y}$ is assumed to be generated with probability
$P({\bf y} | {\bf x}, \btheta_i^0)$.  Each such data point is assumed to
be generated independently in this manner.

Note that a given output ${\bf y}$ can be generated in $N$ different
ways, corresponding to the $N$ different choices of the label $i$.
Thus the total probability of generating ${\bf y}$ from ${\bf x}$ is
given by the sum over $i$:
\begin{equation}
P({\bf y} | {\bf x}, \bTheta^0) = \sum_i P(i | {\bf x}, \boldeta^0)
P({\bf y} | {\bf x}, \btheta_i^0),
\label{eq:mixture}
\end{equation}
where $\bTheta^0$ denotes the vector of all of the parameters
($\bTheta^0 = [\btheta_1^0, \btheta_2^0, \ldots, \btheta_N^0,
\boldeta^0]^T$).  The density in Equation~\ref{eq:mixture} is known as
a {\em mixture density}.  It is a mixture density in which the mixing
proportions, $P(i | {\bf x}, \boldeta^0)$, are conditional on the input
${\bf x}$.  It is the task of the gating network to model the
probabilities $P(i | {\bf x}, \boldeta^0)$, which can be construed as
class probabilities in a multiway classification of the input $\bx$.
We parameterize these probabilities via Equations~\ref{eq:xi} and
\ref{eq:softmax}, identifying the gating network outputs $g_i$ with
$P(i | {\bf x}, \boldeta)$.

It is straightforward to compute moments of the mixture density.  For
example, the conditional mean $\bmu = E(\by | \bx, \bTheta)$ is readily
obtained by taking the expected value of Equation~\ref{eq:mixture}:
\[
\bmu = \sum_i g_i \bmu_i,
\]
where $\bmu_i$ is the conditional mean associated with the probability
distribution $P({\bf y} | {\bf x}, \btheta_i^0)$.
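To make these definitions concrete, the following Python/NumPy sketch
(our own illustration, not code from the references; it assumes linear
experts and a linear gating network, and the function and variable
names are hypothetical) computes the forward pass of
Equations~\ref{eq:xi}, \ref{eq:softmax}, and \ref{eq:mixture}: the
gating probabilities $g_i$ and the conditional mean
$\bmu = \sum_i g_i \bmu_i$.
\begin{verbatim}
# Illustrative forward pass for a linear mixture of experts (sketch).
import numpy as np

def me_forward(x, expert_W, gate_V):
    """x: input (d,); expert_W: (N, p, d) linear expert parameters;
    gate_V: (N, d) linear gating parameters."""
    xi = gate_V @ x                              # intermediate variables xi_i
    g = np.exp(xi - xi.max())                    # softmax numerator (stabilized)
    g = g / g.sum()                              # gating probabilities g_i
    mu_i = np.einsum('npd,d->np', expert_W, x)   # expert means mu_i
    mu = g @ mu_i                                # conditional mean, sum_i g_i mu_i
    return mu, g, mu_i
\end{verbatim}
Subtracting the maximum $\xi_i$ before exponentiating leaves the
softmax unchanged but avoids numerical overflow.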
The conditional mean is quite commonly used as the output of supervised
learning systems, and this is reasonable in the mixture of experts
setting as well, but only when no more than one value $g_i$ is
significantly different from zero for a given input ${\bf x}$.  When
more than one expert has a large value of $g_i$, however, the
conditional distribution of ${\bf y}$ given ${\bf x}$ is multi-modal,
and it is important to make fuller use of the entire mixture density in
such cases.

\subsection{A Gradient-Based Learning Algorithm}

To develop an algorithm for estimating the parameters of a mixture of
experts architecture, we make use of the maximum likelihood (ML)
principle.  That is, we choose parameters for which the probability of
the training set given the parameters (a function known as the {\em
likelihood}) is largest.  Taking the logarithm of the product of $L$
densities of the form of Equation~\ref{eq:mixture} yields the following
log likelihood:
\begin{equation}
l({\cal X}, \bTheta) = \sum_l \log \sum_i
P(i | {\bf x}^{(l)}, \boldeta)
P({\bf y}^{(l)} | {\bf x}^{(l)}, \btheta_i),
\label{eq:loglikelihood}
\end{equation}
a function which we wish to maximize with respect to $\bTheta$.

One approach to maximizing the log likelihood is to use gradient ascent
(a better approach is to use the EM algorithm; see below).  Computing
the gradient of $l$ with respect to $\bmu_i$ and $\xi_i$ yields:
\begin{equation}
\frac{\partial l}{\partial \bmu_i} = \sum_l h_i^{(l)}
\frac{\partial}{\partial \bmu_i}
\log P({\bf y}^{(l)} | {\bf x}^{(l)}, \btheta_i)
\label{eq:grad1}
\end{equation}
and
\begin{equation}
\frac{\partial l}{\partial \xi_i} = \sum_l (h_i^{(l)} - g_i^{(l)}),
\label{eq:grad2}
\end{equation}
where $h_i^{(l)}$ is defined as $P(i | {\bf x}^{(l)}, {\bf y}^{(l)})$.
In deriving this result, we have used Bayes' rule:
\[
P(i | {\bf x}^{(l)}, {\bf y}^{(l)}) =
\frac{P(i | {\bf x}^{(l)}) P({\bf y}^{(l)} | {\bf x}^{(l)}, i)}
{\sum_j P(j | {\bf x}^{(l)}) P({\bf y}^{(l)} | {\bf x}^{(l)}, j)},
\]
where we have omitted the parameters to simplify the notation.  This
suggests that we interpret $h_i^{(l)}$ as the {\em posterior
probability\/} of the $i^{th}$ label, conditional on the input
${\bf x}^{(l)}$ and the output ${\bf y}^{(l)}$.  Similarly, the
probability $g_i^{(l)}$ can be interpreted as the {\em prior
probability\/} $P(i | {\bf x}^{(l)})$, the probability of the $i^{th}$
label given only the input ${\bf x}^{(l)}$.  Given these definitions,
Equation~\ref{eq:grad2} has the natural interpretation of moving the
prior probabilities toward the posterior probabilities.

An interesting special case is an architecture in which the expert
networks and the gating network are linear and the probability density
associated with the experts is a Gaussian with identity covariance
matrix.  In this case, Equations~\ref{eq:grad1} and \ref{eq:grad2}
yield the following on-line learning algorithm (``on-line'' meaning
that we have dropped the summation across $l$):
\begin{equation}
\Delta\btheta_i = \rho h_i^{(l)}({\bf y}^{(l)} - \bmu_i^{(l)})
{{\bf x}^{(l)}}^T
\label{eq:expert-update}
\end{equation}
and
\begin{equation}
\Delta\boldeta_i = \rho (h_i^{(l)} - g_i^{(l)}) {{\bf x}^{(l)}}^T,
\label{eq:gating-update}
\end{equation}
where $\rho$ is a learning rate.
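As an illustration of Equations~\ref{eq:expert-update} and
\ref{eq:gating-update}, the following sketch (again our own
illustrative Python/NumPy code under the linear/Gaussian assumptions
just stated; the names and shapes are hypothetical) performs one
on-line update for a single pair $({\bf x}^{(l)}, {\bf y}^{(l)})$,
computing the posterior $h_i^{(l)}$ from Bayes' rule with unit-variance
Gaussian densities.
\begin{verbatim}
# One on-line update for the linear/Gaussian mixture of experts (sketch).
import numpy as np

def me_online_update(x, y, expert_W, gate_V, rho=0.1):
    """In-place LMS-like update of expert_W (N, p, d) and gate_V (N, d)
    for a single input/output pair (x, y)."""
    xi = gate_V @ x
    g = np.exp(xi - xi.max()); g /= g.sum()        # prior probabilities g_i
    mu = np.einsum('npd,d->np', expert_W, x)       # expert means mu_i
    resid = y - mu                                 # residuals y - mu_i
    lik = np.exp(-0.5 * np.sum(resid**2, axis=1))  # unit-variance Gaussian densities
    h = g * lik; h /= h.sum()                      # posterior probabilities h_i
    expert_W += rho * h[:, None, None] * resid[:, :, None] * x[None, None, :]
    gate_V += rho * (h - g)[:, None] * x[None, :]
    return h
\end{verbatim}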
Note that both Equations~\ref{eq:expert-update} and
\ref{eq:gating-update} have the form of the classical LMS rule, with
the updates for the experts in Equation~\ref{eq:expert-update} being
modulated by their posterior probabilities.

It is also of interest to examine the expression for the posterior
probability in the Gaussian case:
\begin{equation}
h_i^{(l)} = \frac{g_i^{(l)}
e^{-\frac{1}{2} ({\bf y}^{(l)} - \bmu_i^{(l)})^T
({\bf y}^{(l)} - \bmu_i^{(l)})}}
{\sum_j g_j^{(l)}
e^{-\frac{1}{2} ({\bf y}^{(l)} - \bmu_j^{(l)})^T
({\bf y}^{(l)} - \bmu_j^{(l)})}}.
\label{eq:posteriors}
\end{equation}
This is a normalized distance measure that reflects the relative
magnitudes of the residuals ${\bf y}^{(l)} - \bmu_i^{(l)}$.  If the
residuals for expert $i$ are small relative to those of the other
experts, then $h_{i}^{(l)}$ is large; otherwise $h_{i}^{(l)}$ is small.
Note moreover that the $h_{i}^{(l)}$ are positive and sum to one for
each ${\bf x}^{(l)}$; this implies that credit is distributed to the
experts in a competitive manner.  It is straightforward to utilize
other members of the exponential family of densities as component
densities for the experts, to allow dispersion (e.g., covariance)
parameters to be incorporated in the model, and to estimate the
dispersion parameters via the learning algorithm (Jordan \& Jacobs,
1994; Jordan \& Xu, 1995).

\section{The Hierarchical Mixture of Experts (HME) Architecture}

The ME architecture solves complex function approximation problems by
allocating different modules to different regions of the input space.
This approach can have advantages for problems in which the modules are
simpler than the large network that would be required to solve the
problem as a whole.  If we now inquire about the internal structure of
a module, however, we see that the same argument can be repeated.
Perhaps it is better to split a module into simpler sub-modules rather
than to use a single module to fit the data in a region.  This suggests
a thoroughgoing divide-and-conquer approach to supervised learning in
which a tree-structured architecture is used to perform multiple,
nested splits of the input space (see Figure~\ref{fig:hierarchy}).
\begin{figure}[tbh]
\centerline{\epsfig{figure=ps/hierarchy.ps,height=4in}}
\caption{A two-level binary hierarchical architecture.  The top-level
gating network produces coefficients $g_i$ that effectively split the
input space into regions, and the lower-level gating networks produce
coefficients $g_{j|i}$ that effectively split these regions into
sub-regions.  The expert networks fit surfaces within these nested
regions.  Deeper trees are formed by expanding the expert networks
recursively into additional gating networks and sub-experts.}
\label{fig:hierarchy}
\end{figure}
The splitting process terminates in a set of expert networks at the
leaves of the tree, which, because they are defined over relatively
small regions of the input space, can fit simple (e.g., linear)
functions to the data.  This hierarchical architecture, suggested by
Jordan and Jacobs (1994), has close ties to the classification and
regression tree models in statistics and machine learning (e.g.,
Breiman et al., 1984).  Indeed, the architecture can be viewed as a
probabilistic variant of such models.
The mathematical framework underlying the HME architecture is
essentially the same as that underlying the ME architecture.  We simply
extend the probability model to allow nested sequences of labels to be
chosen, corresponding to the nested sequence of regions needed to
specify a leaf of the tree.  The probability model for a two-level tree
is as follows:
\begin{equation}
P({\bf y} | {\bf x}, \bTheta) = \sum_i P(i | {\bf x}, \boldeta)
\sum_j P(j | i, {\bf x}, \boldnu_i)
P({\bf y} | {\bf x}, \btheta_{ij}),
\label{eq:hmixture}
\end{equation}
which corresponds to a choice of label $i$ with probability
$P(i | {\bf x}, \boldeta)$, followed by a conditional choice of label
$j$ with probability $P(j | i, {\bf x}, \boldnu_i)$.  This probability
model yields the following log likelihood function:
\begin{equation}
l({\cal X}, \bTheta) = \sum_l \log \sum_i
P(i | {\bf x}^{(l)}, \boldeta)
\sum_j P(j | i, {\bf x}^{(l)}, \boldnu_i)
P({\bf y}^{(l)} | {\bf x}^{(l)}, \btheta_{ij}),
\end{equation}
where, as in the one-level case, the prior probabilities
$g_i^{(l)} = P(i | {\bf x}^{(l)}, \boldeta)$ and
$g_{j|i}^{(l)} = P(j | i, {\bf x}^{(l)}, \boldnu_i)$ are defined in
terms of underlying variables $\xi_i$ and $\xi_{ij}$ using the softmax
function (cf.\ Equation~\ref{eq:softmax}).  We also use Bayes' rule to
define posterior probabilities in the obvious way:
\begin{equation}
h_i^{(l)} = \frac{g_i^{(l)} \sum_j g_{j|i}^{(l)}
P({\bf y}^{(l)} | {\bf x}^{(l)}, \btheta_{ij})}
{\sum_j g_j^{(l)} \sum_k g_{k|j}^{(l)}
P({\bf y}^{(l)} | {\bf x}^{(l)}, \btheta_{jk})}
\label{eq:hposterior1}
\end{equation}
and
\begin{equation}
h_{j|i}^{(l)} = \frac{g_{j|i}^{(l)}
P({\bf y}^{(l)} | {\bf x}^{(l)}, \btheta_{ij})}
{\sum_j g_{j|i}^{(l)}
P({\bf y}^{(l)} | {\bf x}^{(l)}, \btheta_{ij})}.
\label{eq:hposterior2}
\end{equation}
The posterior probability $h_i^{(l)}$ can be viewed as the credit
assigned to the $i^{th}$ nonterminal in the tree, and the posterior
probability $h_{j|i}^{(l)}$ is the credit assigned to the branches
below the nonterminals.  The product $h_i^{(l)} h_{j|i}^{(l)}$ is
therefore the credit assigned to expert $(i,j)$.  A recursive
relationship is available to compute the posterior probabilities
efficiently in deep trees.  The recursion proceeds upward in the tree,
passing the denominator of the conditional posterior upward,
multiplying by the priors, and normalizing (cf.\ the computation of
$h_i$ from $h_{j|i}$ in Equations~\ref{eq:hposterior1} and
\ref{eq:hposterior2}).

We obtain a gradient ascent learning algorithm by computing the partial
derivatives of $l$.  If, as in the previous section, we assume linear
experts and linear gating networks, as well as Gaussian probabilities
for the experts, we obtain the following LMS-like learning algorithm:
\begin{equation}
\Delta\btheta_{ij} = \rho h_i^{(l)} h_{j|i}^{(l)}
({\bf y}^{(l)} - \bmu_{ij}^{(l)}) {{\bf x}^{(l)}}^T,
\label{eq:hgradient1}
\end{equation}
\begin{equation}
\Delta\boldeta_i = \rho (h_i^{(l)} - g_i^{(l)}) {{\bf x}^{(l)}}^T,
\label{eq:hgradient2}
\end{equation}
and
\begin{equation}
\Delta\boldnu_{ij} = \rho h_i^{(l)} (h_{j|i}^{(l)} - g_{j|i}^{(l)})
{{\bf x}^{(l)}}^T.
\label{eq:hgradient3}
\end{equation}
Each of these partial derivatives has a natural interpretation in terms
of credit assignment.
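For concreteness, the following Python/NumPy sketch (our own
illustration, not code from Jordan and Jacobs, 1994; it assumes a
two-level tree with linear experts, linear gating networks, and
unit-variance Gaussian expert densities, and the names and shapes are
hypothetical) computes the posteriors of
Equations~\ref{eq:hposterior1} and \ref{eq:hposterior2} and the credit
$h_i^{(l)} h_{j|i}^{(l)}$ assigned to each expert for a single data
point.
\begin{verbatim}
# Posterior (credit) computation for a two-level HME (sketch).
import numpy as np

def softmax(v):
    v = v - v.max()
    return np.exp(v) / np.exp(v).sum()

def hme_posteriors(x, y, expert_W, gate_top_V, gate_low_V):
    """expert_W: (I, J, p, d) linear experts; gate_top_V: (I, d);
    gate_low_V: (I, J, d).  Returns (h_i, h_{j|i}, credit)."""
    g_i = softmax(gate_top_V @ x)                          # top-level priors g_i
    g_ji = np.stack([softmax(V @ x) for V in gate_low_V])  # conditional priors g_{j|i}
    mu = np.einsum('ijpd,d->ijp', expert_W, x)             # expert means mu_ij
    lik = np.exp(-0.5 * np.sum((y - mu) ** 2, axis=-1))    # Gaussian densities (I, J)
    joint = g_i[:, None] * g_ji * lik                      # g_i g_{j|i} P(y|x,theta_ij)
    h_i = joint.sum(axis=1) / joint.sum()                  # posterior at nonterminal i
    h_ji = joint / joint.sum(axis=1, keepdims=True)        # conditional posterior
    credit = h_i[:, None] * h_ji                           # credit for expert (i, j)
    return h_i, h_ji, credit
\end{verbatim}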
Credit is assigned to an expert by taking the product of the posterior
probabilities along the path from the root of the tree to the expert
(cf.\ Equation~\ref{eq:hgradient1}).  The updates for the gating
networks move the prior probabilities at a nonterminal toward the
corresponding posterior probabilities, weighting these updates by the
product of the posterior probabilities along the path from the root of
the tree to the nonterminal in question (cf.\
Equation~\ref{eq:hgradient3}).

\section{An EM Algorithm}

Jordan and Jacobs (1994) have derived an Expectation-Maximization (EM)
algorithm for estimating the parameters of the ME and HME
architectures.  (See McLachlan \& Krishnan, 1997, for a general
treatment of the EM algorithm.)  This algorithm, an alternative to
gradient methods, is particularly useful for models in which the expert
networks and gating networks have simple parametric forms.  Each
iteration of the algorithm consists of two phases: (1) a recursive
propagation upward and downward in the tree to compute posterior
probabilities (the ``E step''), and (2) solution of a set of local
weighted maximum likelihood problems at the nonterminals and terminals
of the tree (the ``M step'').  Jordan and Jacobs (1994) tested this
algorithm on a nonlinear system identification problem (the forward
dynamics of a four-degree-of-freedom robot arm) and reported that it
converges rapidly, converging nearly two orders of magnitude faster
than backpropagation in a comparable multilayer perceptron network.

\section{Discussion}

Mixtures of experts should be compared to ensemble methods such as
bagging and boosting (see ENSEMBLE LEARNING), which provide another
general approach to building a supervised learning architecture out of
collections of simple learners.  In bagging and boosting, the overall
input-output mapping is a convex combination of the members of the
ensemble, much as in the case of mixtures of experts (in which the
conditional mean is a convex combination of the outputs of the
experts).  However, the weights in this convex combination are
constants in the case of bagging and boosting, whereas they are
functions of the input for mixtures of experts.  Moreover, in the
mixture of experts, the weights have an interpretation as prior
probabilities under a probabilistic mixture model and are explicitly
parameterized as such.  In particular, under this model, a single
expert is assumed to be associated with each data point, an assumption
which is not made for the ensemble methods.  In essence, the mixture of
experts approaches the supervised learning problem via a
divide-and-conquer methodology reminiscent of unsupervised clustering,
whereas ensemble methods approach the problem via superposition and
averaging.

We conclude with a brief list of pointers to additional papers.  The
problem of model selection for HME architectures has been addressed by
a number of authors.  Several papers propose the use of greedy search
procedures combined with some form of penalization for model complexity
(Fritsch, Finke, and Waibel, 1997; Ramamurti and Ghosh, 1996; Saito and
Nakano, 1996).  Bayesian approaches for model selection or model
averaging have also been presented, including methods based on Gibbs
sampling (Jacobs, Peng, and Tanner, 1997) and variational approximation
(Waterhouse, MacKay, and Robinson, 1994).
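Before turning to theoretical results, we give a schematic companion to
the EM algorithm described above.  The following Python/NumPy sketch is
our own illustration, not code from Jordan and Jacobs (1994); it
assumes the one-level linear/Gaussian ME for simplicity, and the names
and shapes are hypothetical.  The E step computes the posterior
probabilities for all data points, and the M step solves a weighted
least-squares problem for each expert; the gating parameters, which in
practice are fit by weighted multinomial-logistic (IRLS) updates, are
shown here as a single gradient step.
\begin{verbatim}
# One EM iteration for a one-level linear/Gaussian mixture of experts (sketch).
import numpy as np

def em_iteration(X, Y, expert_W, gate_V, rho=0.5):
    """X: (L, d) inputs; Y: (L, p) targets; expert_W: (N, p, d); gate_V: (N, d)."""
    # E step: posterior probabilities H (L, N) for every data point.
    Xi = X @ gate_V.T
    G = np.exp(Xi - Xi.max(axis=1, keepdims=True))
    G /= G.sum(axis=1, keepdims=True)                  # priors g_i(x)
    Mu = np.einsum('npd,ld->lnp', expert_W, X)         # expert means (L, N, p)
    Lik = np.exp(-0.5 * np.sum((Y[:, None, :] - Mu) ** 2, axis=-1))
    H = G * Lik
    H /= H.sum(axis=1, keepdims=True)                  # posteriors h_i
    # M step: weighted least squares for each expert ...
    for i in range(expert_W.shape[0]):
        w = H[:, i]
        A = X.T @ (w[:, None] * X)                     # weighted normal equations
        B = X.T @ (w[:, None] * Y)
        expert_W[i] = np.linalg.solve(A, B).T
    # ... and a gradient step on the gating parameters (IRLS in practice).
    gate_V += rho * (H - G).T @ X / X.shape[0]
    return H
\end{verbatim}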
Theoretical analyses of the approximation and estimation rates for the
ME (Zeevi et al., 1998) and the HME (Jiang \& Tanner, 1998) are
available; a basic result is that the HME achieves an approximation
rate of $O(m^{-2/s})$, where $m$ is the number of experts and $s$ is
the dimensionality of the input vector.  Jordan and Xu (1995) present
an analysis of the convergence rate of the EM algorithm for the HME.
Kang and Oh (1997) provide an analysis of the HME using tools from
statistical physics; in particular, they show that successive
partitions in the HME can be analyzed as phase transitions.  There is a
broad literature on engineering applications of the HME architecture,
including applications to state-space filtering, optimization, control,
vision, speech recognition, speaker identification, and time series
analysis.  Recent biological applications of the ME architecture have
been presented by Haruno, Wolpert, and Kawato (2001), who describe an
architecture for human motor control based on a mixture of experts in
which each expert is a paired forward-inverse model, and by Erickson
and Kruschke (1998), who used a mixture of experts to build a model of
human category learning that combines rule-based and exemplar-based
representations.

\subsection*{References}

\begin{vlist}

\item Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J.,
1984.  \underline{Classification and Regression Trees}, Belmont, CA:
Wadsworth International Group.

\item Erickson, M. A., and Kruschke, J. K., 1998.  Rules and exemplars
in category learning.  \underline{Journal of Experimental Psychology:
General}, 127:107--140.

\item Fritsch, J., Finke, M., and Waibel, A., 1997.  Adaptively growing
hierarchical mixtures of experts, in \underline{Advances in Neural
Information Processing Systems} 9, (M. Mozer, M. Jordan, and
T. Petsche, Eds.), Cambridge, MA: MIT Press.

\item Haruno, M., Wolpert, D. M., and Kawato, M., 2001.  MOSAIC model
for sensorimotor learning and control.  \underline{Neural Computation},
13:2201--2220.

\item * Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E.,
1991.  Adaptive mixtures of local experts.  \underline{Neural
Computation}, 3:79--87.

\item Jacobs, R. A., Peng, F., and Tanner, M. A., 1997.  A Bayesian
approach to model selection in hierarchical mixtures-of-experts
architectures.  \underline{Neural Networks}, 10:231--241.

\item Jiang, W., and Tanner, M. T., 1998.  Hierarchical
mixtures-of-experts for exponential family regression models:
Approximation and maximum likelihood estimation.  \underline{Annals of
Statistics}, 27:987--1011.

\item * Jordan, M. I., and Jacobs, R. A., 1994.  Hierarchical mixtures
of experts and the EM algorithm.  \underline{Neural Computation},
6:181--214.

\item Jordan, M. I., and Xu, L., 1995.  Convergence properties of the
EM approach to learning in mixture-of-experts architectures.
\underline{Neural Networks}, 8:1409--1431.

\item Kang, K., and Oh, J.-H., 1997.  Statistical mechanics of the
mixture of experts, in \underline{Advances in Neural Information
Processing Systems} 9, (M. Mozer, M. Jordan, and T. Petsche, Eds.),
Cambridge, MA: MIT Press.

\item McLachlan, G. J., and Krishnan, T., 1997.  \underline{The EM
Algorithm and Extensions}, NY: Wiley-Interscience.
\item Ramamurti, V., and Ghosh, J., 1996.  Structural adaptation in
mixture of experts, in \underline{Proceedings of the 13th International
Conference on Pattern Recognition}.  Los Alamitos, CA: IEEE Computer
Society Press.

\item Saito, K., and Nakano, R., 1996.  A constructive learning
algorithm for an HME, in \underline{Proceedings of the IEEE
International Conference on Neural Networks}.

\item Waterhouse, S., MacKay, D., and Robinson, T., 1994.  Bayesian
methods for mixtures of experts, in \underline{Advances in Neural
Information Processing Systems} 8, (D. S. Touretzky, M. C. Mozer, and
M. E. Hasselmo, Eds.), Cambridge, MA: MIT Press.

\item Zeevi, A., Meir, R., and Maiorov, V., 1998.  Error bounds for
functional approximation and estimation using mixtures of experts.
\underline{IEEE Transactions on Information Theory}, 44:1010--1025.

\end{vlist}

\end{document}