%\VignetteIndexEntry{R packages: LaTeX vignettes}
%\VignetteEngine{R.rsp::tex}
%\VignetteKeyword{R}
%\VignetteKeyword{package}
%\VignetteKeyword{vignette}
%\VignetteKeyword{ANOVA}
\documentclass[12pt]{article}
\parindent 0cm
\usepackage{amsmath}
\usepackage[UKenglish]{isodate}
\usepackage[round]{natbib}
\newcommand{\btheta}{\mbox{$\boldsymbol{\theta}$}}
\newcommand{\bphi}{\mbox{$\boldsymbol{\phi}$}}
\newcommand{\bpsi}{\mbox{$\boldsymbol{\psi}$}}
\newcommand{\bsigma}{\mbox{$\boldsymbol{\sigma}$}}
\newcommand{\bdelta}{\mbox{$\boldsymbol{\delta}$}}
\newcommand{\bPhi}{\mbox{$\boldsymbol{\Phi}$}}
\newcommand{\brho}{\mbox{$\boldsymbol{\rho}$}}
\newcommand{\bx}{\mbox{$\boldsymbol{x}$}}
\newcommand{\bzero}{\mbox{$\boldsymbol{0}$}}
\newcommand{\ntop}{\mbox{$n_{\textup{\textrm{top}}}$}}
\newcommand{\nbot}{\mbox{$n_{\textup{\textrm{bot}}}$}}
\begin{document}
\cleanlookdateon
\title{\textbf{Extended Generalised Linear Hidden Markov Models}}
\author{Rolf Turner}
\date{\today}
\maketitle

\section{Introduction}
\label{sec:intro}

The \texttt{eglhmm} package provides means of fitting hidden Markov
models \citep{Rabiner1989} in contexts in which the data conform to
generalised linear models, or to slightly extended versions thereof.
The package accommodates models in which the observations
(``emissions'') are assumed to arise from one of a number of
distributions: Gaussian, Poisson, Binomial, Db (discretised beta,
\citealt{Turner2021}), and Multinom.  In the Poisson and Binomial
cases the models are generalised linear models.  In the Gaussian and
Db cases the models are ``something like, but not exactly''
generalised linear models.  In the case of the Multinom (or
``discnp'' --- discrete non-parametric) distribution the model in
question bears some relationship to a generalised linear model but is
of a substantially different form.  We shall use the expression
``extended generalised linear hidden Markov models'' to describe the
collection of all models under consideration, including those based
on the Gaussian, Db and Multinom distributions.

The package fits the models in question by several different methods,
namely the EM algorithm \citep{DempsterEtAl1977}, the
Levenberg-Marquardt algorithm \citep{Turner2008}, and ``brute
force'', which uses either the \texttt{optim()} or the \texttt{nlm()}
function to optimise the log likelihood.  The Levenberg-Marquardt
algorithm, and in certain circumstances the ``brute force''
procedure, require the analytic calculation of the gradient and
Hessian of the log likelihood.  This calculation is intricate in the
hidden Markov model context.  (In fact simply calculating the log
likelihood is intricate.)  Most of this vignette is devoted to the
calculation of the first and second derivatives of the log
likelihood.

\section{Recursive calculations}
\label{sec:recurse}

The likelihood of a hidden Markov model may feasibly be calculated in
terms of the ``forward'' probabilities developed by Baum et al.\ (see
\citealt{BaumEtAl1970}).  These probabilities are calculated by means
of a recursive procedure which of course depends on the likelihoods
of individual observations.  These likelihoods, which may be
expressed in the form $f(y,\btheta)$, may be either probability
density functions or probability mass functions.  The symbol $y$
represents an observation (emission) and $\btheta$ represents a
vector of parameters upon which the distribution in question depends.
These parameters depend in turn on the underlying state of the hidden
Markov chain and in general upon other predictors (in addition to
``state'').  The dependence of $\btheta$ upon the predictors will
involve further parameters.  The derivatives of the log likelihood of
the model must therefore, in turn, be calculated via recursive
procedures.
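To fix ideas, the following is a minimal sketch, in R, of the scaled
forward recursion for the log likelihood of a single observation
sequence.  It is purely illustrative and is \emph{not} the
implementation used internally by \texttt{eglhmm}; the names
\texttt{fy}, \texttt{tpm} and \texttt{ispd} are assumptions of the
sketch.  The derivative recursions discussed below have an analogous
structure.
\begin{verbatim}
# Scaled forward recursion of Baum et al. for the log likelihood
# of one observation sequence.  Assumed inputs:
#   fy   -- nT x K matrix, fy[t,i] = f(y_t, theta_i)
#   tpm  -- K x K transition probability matrix of the hidden chain
#   ispd -- initial state probability distribution (length K)
hmmLogLik <- function(fy, tpm, ispd) {
    a  <- ispd * fy[1, ]        # forward probabilities at time 1
    ll <- log(sum(a))
    a  <- a/sum(a)              # rescale to avoid underflow
    if (nrow(fy) > 1) {
        for (tt in 2:nrow(fy)) {
            a  <- as.vector(a %*% tpm) * fy[tt, ]
            ll <- ll + log(sum(a))
            a  <- a/sum(a)
        }
    }
    ll
}
\end{verbatim}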
In order to effect these procedures, we need to calculate the first
and second derivatives, with respect to all of the parameters that
are involved, of the single observation likelihoods $f(y,\btheta)$.

In the case of the Gaussian distribution
$\btheta = (\mu,\sigma)^{\top}$, where $\mu$ is the mean and $\sigma$
is the standard deviation of the distribution.  In the cases of the
Poisson and Binomial distributions $\btheta$ is actually a scalar
(which we consequently write simply as $\theta$).  For the Poisson
distribution $\theta$ is equal to $\lambda$, the Poisson mean, and
for the Binomial distribution $\theta$ is equal to $p$, the binomial
success probability.  In the case of the Db distribution, $\btheta$
is equal to $(\alpha,\beta)^{\top}$, the vector of ``shape''
parameters of the distribution.  In the case of the Multinom
distribution, the model (as indicated above) has a rather different
structure.

Except in the Gaussian case we assume that $\btheta$ is completely
determined by a vector $\bx$ of predictor variables and a vector
$\bphi$ of predictor coefficients.  We need to determine the first
and second derivatives, of the likelihood of a single observation,
with respect to the entries of $\bphi$.  In the case of the Gaussian
distribution $\btheta$ also includes the values of $\sigma$
corresponding to the different states.  In the current implementation
of the package these $\sigma$ values are not obtained from the
predictor coefficients $\bphi$.

\section{Derivatives specific to each of the distributions}
\label{sec:derivs}

We now provide the details of the calculation of these derivatives
for each of the five distributions in question.

\subsection{The Gaussian distribution}
\label{sec:gauss}

We denote the vector of standard deviations by
$\bsigma = (\sigma_1, \ldots, \sigma_K)^{\top}$ (where $K$ is the
number of states).  In the current development we assume that
$\sigma_i$ depends only on the state $i$ of the underlying hidden
Markov chain (and not on any other predictors included in $\bx$).  It
is thus convenient to make explicit the dependence of the probability
density functions upon the underlying state.  We write the
probability density function corresponding to state $i$ as
\[
f_i(y) = \frac{1}{\sqrt{2\pi} \sigma_i}
         \exp \left ( \frac{-(y-\mu)^2}{2\sigma_i^2} \right ) \; .
\]
We model $\mu$ as $\mu = \bx^{\top}\bphi$.  Note that consequently
$\mu$ depends, in general, upon the state $i$ (via $\bx$), although
this dependence is not made explicit in the foregoing expression for
$f_i(y)$.  We need to differentiate $f_i(y)$ with respect to $\bphi$
and $\bsigma$.
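Here, and in the subsections that follow, ``logarithmic
differentiation'' refers to the use of the elementary identity
\[
\frac{\partial f}{\partial \theta} =
    f \, \frac{\partial \log f}{\partial \theta} \; ;
\]
in the Gaussian case it is applied to
\[
\log f_i(y) = -\log \sqrt{2\pi} - \log \sigma_i -
    \frac{(y-\mu)^2}{2\sigma_i^2} \; .
\]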
It is straightforward, using logarithmic differentiation, to
determine that:
\begin{equation}
\begin{split}
\frac{\partial f_i(y)}{\partial \mu} &= f_i(y)\left(
    \frac{y-\mu}{\sigma_i^2} \right ) \\
\frac{\partial f_i(y)}{\partial \sigma_j} &= \left \{
    \begin{array}{ll}
    f_i(y)\left( \frac{(y-\mu)^2}{\sigma_i^2} - 1 \right)/\sigma_i &
        \mbox{~if~} j = i\\
    0 & \mbox{~if~} j \neq i
    \end{array} \right . \\
\frac{\partial^2 f_i(y)}{\partial \mu^2} &= f_i(y) \left(
    \frac{(y-\mu)^2}{\sigma_i^2} - 1 \right )/\sigma_i^2 \\
\frac{\partial^2 f_i(y)}{\partial \sigma_i \partial \sigma_j} &=
    \left \{ \begin{array}{ll}
    f_i(y) \left( \left ( \frac{(y-\mu)^2}{\sigma_i^2} - 1 \right )^2
        + 1 - \frac{3(y-\mu)^2}{\sigma_i^2} \right )/\sigma_i^2 &
        \mbox{~if~} j = i\\
    0 & \mbox{~if~} j \neq i
    \end{array} \right . \\
\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_j} &= \left \{
    \begin{array}{ll}
    f_i(y) \left (\frac{(y-\mu)^2}{\sigma_i^3} -
        \frac{3}{\sigma_i} \right )(y-\mu)/\sigma_i^2 &
        \mbox{~if~} j = i\\
    0 & \mbox{~if~} j \neq i
    \end{array} \right . \; .
\end{split}
\label{eq:muSigPartials}
\end{equation}
Recalling that $\mu = \bx^{\top} \bphi$ we see that
\[
\frac{\partial \mu}{\partial \bphi} = \bx \; .
\]
An application of the chain rule then gives
\[
\frac{\partial f_i(y)}{\partial \bphi} =
    \frac{\partial f_i(y)}{\partial \mu} \bx \; .
\]
The second derivatives of $f_i(y)$ with respect to $\bphi$ are given
by
\begin{align*}
\frac{\partial^2 f_i(y)}{\partial \bphi^{\top} \partial \bphi}
&= \frac{\partial}{\partial \bphi^{\top}}
   \left ( \frac{\partial f_i(y)}{\partial \mu} \bx \right ) \\
&= \bx \left(\frac{\partial^2 f_i(y)}{\partial \mu^2}
   \frac{\partial \mu}{\partial \bphi^{\top}} +
   \frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_i}
   \frac{\partial \sigma_i}{\partial \bphi^{\top}} \right ) \\
&= \left(\frac{\partial^2 f_i(y)}{\partial \mu^2} \right)
   \bx \bx^{\top}
\end{align*}
since $\partial \sigma_i/\partial \bphi^{\top} = \bzero$.  The second
derivatives of $f_i(y)$ with respect to $\bphi$ and $\bsigma$ are
given by
\begin{align*}
\frac{\partial^2 f_i(y)}{\partial \bphi^{\top} \partial \sigma_j} &=
\left \{ \begin{array}{cl}
\left(\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_j}
    \right) \bx^{\top} & \mbox{~if~} j = i \\[0.25cm]
\bzero^{\top} & \mbox{~if~} j \neq i
\end{array} \right . \\
\frac{\partial^2 f_i(y)}{\partial \sigma_j \partial \bphi} &=
\left \{ \begin{array}{cl}
\left(\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_j}
    \right) \bx & \mbox{~if~} j = i \\[0.25cm]
\bzero & \mbox{~if~} j \neq i
\end{array} \right . \; .
\end{align*}
Note that
\[
\frac{\partial^2 f_i(y)}{\partial \sigma_i \partial \sigma_j}
\]
is provided in \eqref{eq:muSigPartials}.
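The expressions in \eqref{eq:muSigPartials} are easily checked
numerically.  The following R sketch (illustrative only; it plays no
role in the package) compares the analytic first derivatives with
central finite differences:
\begin{verbatim}
# Numerical check of the first derivatives in the Gaussian case.
y <- 1.7; mu <- 0.5; sigma <- 1.3; eps <- 1e-6
f <- function(mu, sigma) dnorm(y, mean = mu, sd = sigma)

# Analytic derivatives, as given in the text.
dfdmu    <- f(mu, sigma) * (y - mu)/sigma^2
dfdsigma <- f(mu, sigma) * ((y - mu)^2/sigma^2 - 1)/sigma

# Central finite differences.
ndfdmu    <- (f(mu + eps, sigma) - f(mu - eps, sigma))/(2*eps)
ndfdsigma <- (f(mu, sigma + eps) - f(mu, sigma - eps))/(2*eps)

all.equal(dfdmu, ndfdmu, tolerance = 1e-6)        # TRUE
all.equal(dfdsigma, ndfdsigma, tolerance = 1e-6)  # TRUE
\end{verbatim}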
The structure of the first and second derivatives of $f_i(y)$ with
respect to $\bphi$ and $\bsigma$ can be expressed concisely by
letting
\[
\bpsi = \left [ \begin{array}{l} \bsigma\\ \bphi \end{array} \right ]
\]
and then writing
\begin{align*}
\frac{\partial f_i(y)}{\partial \bpsi} &= \left[ \begin{array}{l}
\frac{\partial f_i(y)}{\partial \bsigma} \\[0.1cm]
\frac{\partial f_i(y)}{\partial \bphi}
\end{array} \right] \\[0.25cm]
&=\left[ \begin{array}{l}
\frac{\partial f_i(y)}{\partial \sigma_i} \bdelta_i\\[0.1cm]
\frac{\partial f_i(y)}{\partial \mu} \bx
\end{array} \right]
\end{align*}
where $\bdelta_i$ is a vector of dimension $K$ whose $i$th entry is 1
and whose other entries are all 0, and
\begin{align*}
\frac{\partial^2 f_i(y)}{\partial \bpsi^{\top} \partial \bpsi} &=
\left[ \begin{array}{ll}
\frac{\partial^2 f_i(y)}{\partial \bsigma^{\top} \partial \bsigma} &
\frac{\partial^2 f_i(y)}{\partial \bsigma^{\top} \partial \bphi} \\
\frac{\partial^2 f_i(y)}{\partial \bphi^{\top} \partial \bsigma} &
\frac{\partial^2 f_i(y)}{\partial \bphi^{\top} \partial \bphi}
\end{array} \right] \\[0.25cm]
& = \left[ \begin{array}{ll}
\frac{\partial^2 f_i(y)}{\partial \sigma_i^2}
    \bdelta_i \bdelta_i^{\top} &
\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_i}
    \bdelta_i \bx^{\top} \\
\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_i}
    \bx \bdelta_i^{\top} &
\frac{\partial^2 f_i(y)}{\partial \mu^2} \bx \bx^{\top}
\end{array} \right] \; .
\end{align*}
Note that the first and second partial derivatives of $f_i(y)$ with
respect to $\mu$ and $\sigma_i$ are provided in
\eqref{eq:muSigPartials}.

\subsection{The Poisson distribution}
\label{sec:pois}

The likelihood is the probability mass function
\[
f(y) = e^{-\lambda} \frac{\lambda^y}{y!} \; ,
\]
$y = 0, 1, 2, \ldots$.  Here $\btheta$ is a scalar,
$\theta = \lambda$, and we model $\lambda$ via
$\lambda = \exp(\bx^{\top} \bphi)$, where $\bx$ is a vector of
predictors and $\bphi$ is a vector of predictor coefficients.  The
first and second derivatives of $f(y)$ with respect to $\lambda$ are
\begin{align*}
\frac{\partial f(y)}{\partial \lambda} &=
    f(y) \left (\frac{y}{\lambda} - 1 \right ) \\
\frac{\partial^2 f(y)}{\partial \lambda^2} &=
    f(y) \left( \left(\frac{y}{\lambda} - 1 \right)^2 -
    \frac{y}{\lambda^2} \right) \; .
\end{align*}
Since $\lambda = \exp(\bx^{\top} \bphi)$ it follows readily that the
first and second derivatives of $\lambda$ with respect to $\bphi$ are
$\lambda \bx$ and $\lambda \bx \bx^{\top}$, respectively.  Applying
the chain rule we get
\begin{align*}
\frac{\partial f(y)}{\partial \bphi} &=
    \frac{\partial f(y)}{\partial \lambda} \lambda \bx \\
\frac{\partial^2 f(y)}{\partial \bphi^{\top} \partial \bphi} &=
    \left(\frac{\partial f(y)}{\partial \lambda} \lambda +
    \frac{\partial^2 f(y)}{\partial \lambda^2} \lambda^2 \right)
    \bx \bx^{\top} \; .
\end{align*}
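These Poisson expressions admit the same sort of numerical sanity
check as the Gaussian ones (again a purely illustrative sketch):
\begin{verbatim}
# Numerical check of the Poisson chain-rule gradient.
# lambda = exp(x' phi); f(y) is the Poisson probability mass function.
y <- 3; x <- c(1, 0.4); phi <- c(0.2, -0.7); eps <- 1e-6
f <- function(phi) dpois(y, lambda = exp(sum(x*phi)))

lambda <- exp(sum(x*phi))
grad   <- f(phi) * (y/lambda - 1) * lambda * x   # analytic gradient

# Finite-difference gradient, one coordinate at a time.
ngrad <- sapply(seq_along(phi), function(j) {
    e <- replace(numeric(length(phi)), j, eps)
    (f(phi + e) - f(phi - e))/(2*eps)
})
all.equal(grad, ngrad, tolerance = 1e-6)  # TRUE
\end{verbatim}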
\subsection{The Binomial distribution}
\label{sec:bino}

The likelihood is the probability mass function
\[
f(y) = \binom{n}{y} p^y (1-p)^{n-y} \; ,
\]
$y = 0, 1, 2, \ldots, n$, where $n$ is the number of independent
binomial trials on which the success count $y$ is based, and $p$ is
the probability of success.
Here $\btheta$ is a scalar, $\theta = p$, and we model $p$ via
$p = h(u)$ where $u = \bx^{\top} \bphi$, where $\bx$ is a vector of
predictors, $\bphi$ is a vector of predictor coefficients, and $h(u)$
is the logistic (inverse logit) function
$h(u) = (1 + e^{-u})^{-1}$.  In what follows we will need the first
and second derivatives of the logistic function.  These are given by
\begin{equation}
\begin{split}
h'(u) &= \frac{e^{-u}}{(1+e^{-u})^2} \mbox{~and} \\
h''(u) &= \frac{e^{-u}(e^{-u} - 1)}{(1+e^{-u})^3} \; .
\end{split}
\label{eq:logitDerivs}
\end{equation}
The first and second derivatives of $f(y)$ with respect to $p$ are
\begin{align*}
\frac{\partial f(y)}{\partial p} &= f(y) \left ( \frac{y}{p} -
    \frac{n-y}{1-p} \right )\\
\frac{\partial^2 f(y)}{\partial p^2} &= f(y) \left (
    \left (\frac{y}{p} - \frac{n-y}{1-p} \right)^2 -
    \frac{y}{p^2} - \frac{n-y}{(1-p)^2} \right ) \; .
\end{align*}
Since $p = h(\bx^{\top} \bphi)$ we see that
\begin{align*}
\frac{\partial p}{\partial \bphi} &= h'(\bx^{\top} \bphi) \bx
    \mbox{~ and}\\
\frac{\partial^2 p}{\partial \bphi^{\top} \partial \bphi} &=
    h''(\bx^{\top} \bphi) \bx \bx^{\top} \; .
\end{align*}
Applying the chain rule we see that
\begin{align*}
\frac{\partial f(y)}{\partial \bphi} &=
    \frac{\partial f(y)}{\partial p} h'(\bx^{\top} \bphi) \bx
    \mbox{~ and}\\
\frac{\partial^2 f(y)}{\partial \bphi^{\top} \partial \bphi} &=
    \left ( \frac{\partial f(y)}{\partial p} h''(\bx^{\top} \bphi) +
    \frac{\partial^2 f(y)}{\partial p^2}
    \left(h'(\bx^{\top} \bphi)\right)^2 \right) \bx \bx^{\top} \; .
\end{align*}
Recall that expressions for $h'(\cdot)$ and $h''(\cdot)$ are given by
\eqref{eq:logitDerivs}.
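The corresponding check for the Binomial case, using the fact that
the base-R function \texttt{plogis()} computes the logistic function
$h$ (again illustrative only):
\begin{verbatim}
# Numerical check of the Binomial chain-rule gradient.
y <- 4; n <- 10; x <- c(1, -0.3); phi <- c(0.5, 1.1); eps <- 1e-6
f <- function(phi) dbinom(y, size = n, prob = plogis(sum(x*phi)))

u    <- sum(x*phi)
p    <- plogis(u)                 # h(u)
hp   <- exp(-u)/(1 + exp(-u))^2   # h'(u), as in the text
grad <- f(phi) * (y/p - (n - y)/(1 - p)) * hp * x

# Finite-difference gradient, one coordinate at a time.
ngrad <- sapply(seq_along(phi), function(j) {
    e <- replace(numeric(length(phi)), j, eps)
    (f(phi + e) - f(phi - e))/(2*eps)
})
all.equal(grad, ngrad, tolerance = 1e-6)  # TRUE
\end{verbatim}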
\subsection{The Db distribution}
\label{sec:dbd}

The likelihood is the probability mass function, which depends on a
vector of parameters $\btheta = (\alpha,\beta)^{\top}$ and is
somewhat complicated to write down.  In order to obtain an expression
for this probability mass function we need to define
\begin{align*}
h_0(y) &= (y(1-y))^{-1} \\
h(y)   &= h_0((y - \nbot + 1)/(\ntop - \nbot + 2)) \\
T_1(y) &= \log((y - \nbot + 1)/(\ntop - \nbot + 2)) \\
T_2(y) &= \log((\ntop - y + 1)/(\ntop - \nbot + 2)) \\
A(\alpha,\beta) &= \log \left( \sum_{i=\nbot}^{\ntop} h(i)
    \exp\{\alpha T_1(i) + \beta T_2(i) \} \right) \; .
\end{align*}
Given these definitions, the probability mass function of the Db
distribution can be written as
\[
f(y,\alpha,\beta) = \Pr(X=y \mid \alpha, \beta) =
    h(y) \exp\{\alpha T_1(y) + \beta T_2(y) - A(\alpha,\beta)\} \; .
\]
We model $\alpha$ and $\beta$ via
\begin{align*}
\alpha &= \bx^{\top} \bphi_1 \\
\beta  &= \bx^{\top} \bphi_2
\end{align*}
where $\bx$ is a vector of predictors and $\bphi_1$ and $\bphi_2$ are
vectors of predictor coefficients.  The vector $\bphi$, with respect
to which we seek to differentiate the likelihood, is the catenation
of $\bphi_1$ and $\bphi_2$.

The first derivative of the likelihood with respect to $\bphi$ is
\begin{align*}
\frac{\partial f}{\partial \bphi} &=
    \frac{\partial f}{\partial \alpha}
    \frac{\partial \alpha}{\partial \bphi} +
    \frac{\partial f}{\partial \beta}
    \frac{\partial \beta}{\partial \bphi} \\
&= \frac{\partial f}{\partial \alpha} \left [ \begin{array}{c}
   \frac{\partial \alpha}{\partial \bphi_1} \\ \bzero
   \end{array} \right ] +
   \frac{\partial f}{\partial \beta} \left [ \begin{array}{c}
   \bzero \\ \frac{\partial \beta}{\partial \bphi_2}
   \end{array} \right ] \\
&= \frac{\partial f}{\partial \alpha} \left [ \begin{array}{c}
   \bx \\ \bzero \end{array} \right ] +
   \frac{\partial f}{\partial \beta} \left [ \begin{array}{c}
   \bzero \\ \bx \end{array} \right ] \\
&= \left [ \begin{array}{c}
   \frac{\partial f}{\partial \alpha} \bx \\
   \frac{\partial f}{\partial \beta} \bx
   \end{array} \right ] \; .
\end{align*}
The second derivative is calculated as
\[
\frac{\partial^2 f}{\partial \bphi^{\top} \partial \bphi} =
\left [ \begin{array}{c}
\frac{\partial}{\partial \bphi^{\top}}
    \left ( \frac{\partial f}{\partial \alpha} \bx \right) \\
\frac{\partial}{\partial \bphi^{\top}}
    \left ( \frac{\partial f}{\partial \beta} \bx \right)
\end{array} \right ] \; .
\]
Taking this expression one row at a time we see that
\begin{align*}
\frac{\partial}{\partial \bphi^{\top}}
    \left ( \frac{\partial f}{\partial \alpha} \right)
&= \left [ \begin{array}{lr}
   \frac{\partial}{\partial \bphi_1^{\top}}
       \left ( \frac{\partial f}{\partial \alpha} \right) &
   \frac{\partial}{\partial \bphi_2^{\top}}
       \left ( \frac{\partial f}{\partial \alpha} \right)
   \end{array} \right ] \\
&= \left [ \begin{array}{lr}
   \frac{\partial^2 f}{\partial \alpha^2}
       \frac{\partial \alpha}{\partial \bphi_1^{\top}} &
   \frac{\partial^2 f}{\partial \beta \partial \alpha}
       \frac{\partial \beta}{\partial \bphi_2^{\top}}
   \end{array} \right ] \\
&= \left [ \begin{array}{lr}
   \frac{\partial^2 f}{\partial \alpha^2} \bx^{\top} &
   \frac{\partial^2 f}{\partial \beta \partial \alpha} \bx^{\top}
   \end{array} \right ] \mbox{~and likewise}\\
\frac{\partial}{\partial \bphi^{\top}}
    \left ( \frac{\partial f}{\partial \beta} \right)
& = \left [ \begin{array}{lr}
    \frac{\partial^2 f}{\partial \beta \partial \alpha} \bx^{\top} &
    \frac{\partial^2 f}{\partial \beta^2} \bx^{\top}
    \end{array} \right ] \; .
\end{align*}
Combining the foregoing we get
\[
\frac{\partial^2 f}{\partial \bphi^{\top} \partial \bphi} =
\left [ \begin{array}{lr}
\frac{\partial^2 f}{\partial \alpha^2} \bx \bx^{\top} &
\frac{\partial^2 f}{\partial \beta \partial \alpha}
    \bx \bx^{\top} \\[0.25cm]
\frac{\partial^2 f}{\partial \beta \partial \alpha} \bx \bx^{\top} &
\frac{\partial^2 f}{\partial \beta^2} \bx \bx^{\top}
\end{array} \right ] \; .
\]
As was the case for the three distributions dealt with previously, it
is expedient to express the partial derivatives of
$f(y,\alpha,\beta)$, with respect to the parameters of the
distribution, in terms of $f(y,\alpha,\beta)$.  The required
expressions are as follows:
\begin{align*}
\frac{\partial f}{\partial \alpha} &= f(y,\alpha,\beta)
    \left ( T_1(y) - \frac{\partial A}{\partial \alpha} \right )\\
\frac{\partial f}{\partial \beta} &= f(y,\alpha,\beta)
    \left ( T_2(y) - \frac{\partial A}{\partial \beta} \right )\\
\frac{\partial^2 f}{\partial \alpha^2} &= f(y,\alpha,\beta)
    \left [ \left ( T_1(y) - \frac{\partial A}{\partial \alpha}
    \right )^2 - \frac{\partial^2 A}{\partial \alpha^2} \right ]\\
\frac{\partial^2 f}{\partial \alpha \partial \beta} &=
    f(y,\alpha,\beta) \left [
    \left ( T_1(y) - \frac{\partial A}{\partial \alpha} \right )
    \left ( T_2(y) - \frac{\partial A}{\partial \beta} \right ) -
    \frac{\partial^2 A}{\partial \alpha \partial \beta} \right ] \\
\frac{\partial^2 f}{\partial \beta^2} &= f(y,\alpha,\beta)
    \left [ \left ( T_2(y) - \frac{\partial A}{\partial \beta}
    \right )^2 - \frac{\partial^2 A}{\partial \beta^2} \right ] \; .
\end{align*}
\newpage
It remains to provide expressions for the partial derivatives of $A$
with respect to $\alpha$ and $\beta$.  Let
\[
E = \exp(A) = \sum_{i=\nbot}^{\ntop} h(i)
    \exp\{\alpha T_1(i) + \beta T_2(i) \} \; .
\]
Clearly
\begin{align*}
\frac{\partial A}{\partial \alpha} &= \frac{1}{E}
    \frac{\partial E}{\partial \alpha} \\
\frac{\partial A}{\partial \beta} &= \frac{1}{E}
    \frac{\partial E}{\partial \beta} \\
\frac{\partial^2 A}{\partial \alpha^2} &= \frac{1}{E}
    \frac{\partial^2 E}{\partial \alpha^2} - \frac{1}{E^2}
    \left (\frac{\partial E}{\partial \alpha} \right )^2 \\
\frac{\partial^2 A}{\partial \alpha \partial \beta} &= \frac{1}{E}
    \frac{\partial^2 E}{\partial \alpha \partial \beta} -
    \frac{1}{E^2} \left (\frac{\partial E}{\partial \alpha}
    \frac{\partial E}{\partial \beta} \right ) \\
\frac{\partial^2 A}{\partial \beta^2} &= \frac{1}{E}
    \frac{\partial^2 E}{\partial \beta^2} - \frac{1}{E^2}
    \left (\frac{\partial E}{\partial \beta} \right )^2 \; .
\end{align*}
\enlargethispage{1\baselineskip}
Finally, the relevant partial derivatives of $E$ are:
\begin{align*}
\frac{\partial E}{\partial \alpha} &= \sum_{i=\nbot}^{\ntop} h(i)
    T_1(i) \exp(\alpha T_1(i) + \beta T_2(i)) \\
\frac{\partial E}{\partial \beta} &= \sum_{i=\nbot}^{\ntop} h(i)
    T_2(i) \exp(\alpha T_1(i) + \beta T_2(i))\\
\frac{\partial^2 E}{\partial \alpha^2} &= \sum_{i=\nbot}^{\ntop}
    h(i) T_1(i)^2 \exp(\alpha T_1(i) + \beta T_2(i)) \\
\frac{\partial^2 E}{\partial \alpha \partial \beta} &=
    \sum_{i=\nbot}^{\ntop} h(i) T_1(i) T_2(i)
    \exp(\alpha T_1(i) + \beta T_2(i)) \\
\frac{\partial^2 E}{\partial \beta^2} &= \sum_{i=\nbot}^{\ntop}
    h(i) T_2(i)^2 \exp(\alpha T_1(i) + \beta T_2(i)) \; .
\end{align*}
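The foregoing formulas are straightforward to transcribe and check
numerically.  In the following illustrative R sketch (not part of the
package), \texttt{nbot} and \texttt{ntop} stand for $\nbot$ and
$\ntop$, and \texttt{hh}, \texttt{T1}, \texttt{T2} and \texttt{A}
mirror $h$, $T_1$, $T_2$ and $A$ above:
\begin{verbatim}
# The Db probability mass function, transcribed from the definitions
# above; nbot and ntop are the support limits.
nbot <- 1; ntop <- 10
h0 <- function(y) 1/(y*(1 - y))
hh <- function(y) h0((y - nbot + 1)/(ntop - nbot + 2))
T1 <- function(y) log((y - nbot + 1)/(ntop - nbot + 2))
T2 <- function(y) log((ntop - y + 1)/(ntop - nbot + 2))
A  <- function(alpha, beta) {
    i <- nbot:ntop
    log(sum(hh(i) * exp(alpha*T1(i) + beta*T2(i))))
}
f  <- function(y, alpha, beta)
    hh(y) * exp(alpha*T1(y) + beta*T2(y) - A(alpha, beta))

sum(f(nbot:ntop, 1.2, 0.8))   # 1, as it must be

# Check df/dalpha = f*(T1(y) - dA/dalpha) against finite differences.
y <- 4; alpha <- 1.2; beta <- 0.8; eps <- 1e-6
dAda <- (A(alpha + eps, beta) - A(alpha - eps, beta))/(2*eps)
f(y, alpha, beta) * (T1(y) - dAda)                          # analytic
(f(y, alpha + eps, beta) - f(y, alpha - eps, beta))/(2*eps) # numeric
\end{verbatim}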
\subsection{The Multinom distribution}
\label{sec:multinom}

This distribution is very different from those with which we have
previously dealt.  It is defined effectively in terms of
\emph{tables}.  In the hidden Markov model context, these tables take
the form
\[
\Pr(Y = y_i \mid S = k) = \rho_{ik}
\]
where $Y$ is the emissions variate, its possible values or ``levels''
are $y_1, y_2, \ldots, y_m$, and $S$ denotes ``state'', which
(without loss of generality) takes values $1, 2, \ldots, K$.  Of
course $\rho_{\cdot k} = \sum_{i=1}^{m} \rho_{ik} = 1$ for all $k$.
We shall denote $\Pr(Y = y \mid S = k)$ by $f_k(y)$.

The maximisation of the likelihood with respect to the $\rho_{ik}$ is
awkward, due to the ``sum-to-1'' constraints that they must satisfy,
and it is better to impose these constraints ``smoothly'' via a
logistic parameterisation.  See \citet{Turner2008}.  Such a
parameterisation allows us to express the dependence of the emissions
probabilities, upon ``state'', in terms of linear predictors.  This
in turn opens up the possibility of including other predictors, in
addition to those determined by ``state'', in the model.

To this end we define vectors of parameters $\bphi_i$,
$i = 1, \ldots, m$, corresponding to each of the possible values of
$Y$.  For identifiability we take $\bphi_m$ to be identically 0.
Each $\bphi_i$ is a vector of length $np$, say, where $np$ is the
number of predictors.  If, in a $K$ state model, there are no
predictors other than those determined by state, then $np = K$.  In
this case there are $K \times (m-1)$ ``free'' parameters, just as
there should be (and just as there are in the original
parameterisation in terms of the $\rho_{ik}$).

Let the $k$th entry of $\bphi_i$ be $\phi_{ik}$,
$k = 1,\ldots,np$.  Let $\bphi$ be the vector consisting of the
catenation of all of the $\phi_{ij}$, excluding the entries of
$\bphi_m$ which are all 0:
\[
\bphi = (\phi_{11}, \phi_{12}, \ldots, \phi_{1,np},
         \phi_{21}, \phi_{22}, \ldots, \phi_{2,np},
         \ldots\ , \ldots\ ,
         \phi_{m-1,1}, \phi_{m-1,2}, \ldots,
         \phi_{m-1,np})^{\top} \; .
\]
Let $\bx$ be a vector of predictors.  In terms of the foregoing
notation, $f_k(y)$ can be written as
\[
f_k(y) = \frac{e^{\bx^{\top} \bphi_y}}{Z}
\]
where in turn
\[
Z = \sum_{\ell = 1}^{m} e^{\bx^{\top} \bphi_{\ell}} \; .
\]
(The sum defining $Z$ runs over all $m$ levels of $Y$; since
$\bphi_m$ is identically 0, the last summand is equal to 1.)  The
dependence of $f_k(y)$ upon the state $k$ is implicit in the
predictor vector $\bx$, which includes predictors indicating state.
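To illustrate the parameterisation, here is a hypothetical R sketch
(the function \texttt{multinomProbs()} and all of the values used are
invented for the example) which computes the emission probabilities
$f_k(y_1), \ldots, f_k(y_m)$ at a given predictor vector; by
construction they sum to 1:
\begin{verbatim}
# Emission probabilities under the logistic parameterisation.
# Phi is an np x (m-1) matrix whose columns are phi_1, ..., phi_{m-1};
# phi_m is identically zero; x is the predictor vector.
multinomProbs <- function(x, Phi) {
    eta <- c(as.vector(x %*% Phi), 0)   # x'phi_i, with x'phi_m = 0
    exp(eta)/sum(exp(eta))              # f_k(y_1), ..., f_k(y_m)
}

# Hypothetical example: K = 2 states, m = 3 levels, state-indicator
# predictors only, so that np = K = 2.
Phi <- cbind(c(0.3, -0.5), c(1.0, 0.2))
x   <- c(1, 0)                          # indicates state 1
p   <- multinomProbs(x, Phi)
sum(p)                                  # 1
\end{verbatim}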
We now calculate the partial derivatives of $f_k(y)$ with respect to
$\bphi$.  First note that $\frac{\partial f}{\partial \bphi}$ can be
written as
\[
\left [ \begin{array}{c}
\frac{\partial f_k}{\partial \bphi_1} \\[0.25cm]
\frac{\partial f_k}{\partial \bphi_2} \\
\vdots \\
\frac{\partial f_k}{\partial \bphi_{m-1}}
\end{array} \right ] \;.
\]
Next we calculate
\[
\frac{\partial f_k(y)}{\partial \bphi_i}, \mbox{~~}
    i = 1, \ldots, m-1 \; .
\]
Using logarithmic differentiation (and writing $\delta_{yi}$ for the
indicator which equals 1 if $y = y_i$ and 0 otherwise) we see that
\[
\frac{1}{f_k(y)} \frac{\partial f_k(y)}{\partial \bphi_i} =
    \delta_{yi} \bx - \frac{1}{Z} e^{\bx^{\top} \bphi_i} \bx
\]
so that
\[
\frac{\partial f_k(y)}{\partial \bphi_i} = f_k(y)
    \left ( \delta_{yi} - \frac{e^{\bx^{\top} \bphi_i}}{Z}
    \right ) \bx
\]
which can be written as $f_k(y)(\delta_{yi} - f_k(i)) \bx$.

In summary we have
\[
\frac{\partial f}{\partial \bphi} = f_k(y)
\left [ \begin{array}{l}
( \delta_{y1} - f_k(1) ) \bx \\
( \delta_{y2} - f_k(2) ) \bx \\
\multicolumn{1}{c}{\vdots} \\
( \delta_{y,m-1} - f_k(m-1) ) \bx
\end{array} \right ] \; .
\]
The second derivatives of $f_k(y)$ with respect to $\bphi$ are given
by
\[
\frac{\partial^2 f}{\partial \bphi \partial \bphi^{\top}} =
\left [ \begin{array}{llcl}
\frac{\partial^2 f}{\partial \bphi_1 \partial \bphi_1^{\top}} &
\frac{\partial^2 f}{\partial \bphi_1 \partial \bphi_2^{\top}} &
\ldots &
\frac{\partial^2 f}{\partial \bphi_1 \partial \bphi_{m-1}^{\top}}
\\[0.5cm]
\frac{\partial^2 f}{\partial \bphi_2 \partial \bphi_1^{\top}} &
\frac{\partial^2 f}{\partial \bphi_2 \partial \bphi_2^{\top}} &
\ldots &
\frac{\partial^2 f}{\partial \bphi_2 \partial \bphi_{m-1}^{\top}} \\
\multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} &
\multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} \\
\frac{\partial^2 f}{\partial \bphi_{m-1} \partial \bphi_1^{\top}} &
\frac{\partial^2 f}{\partial \bphi_{m-1} \partial \bphi_2^{\top}} &
\ldots &
\frac{\partial^2 f}{\partial \bphi_{m-1} \partial \bphi_{m-1}^{\top}}
\end{array} \right] \; .
\]
The $(i,j)$th entry of
$\frac{\partial^2 f}{\partial \bphi \partial \bphi^{\top}}$, i.e.\
$\frac{\partial^2 f}{\partial \bphi_i \partial \bphi_j^{\top}}$, is
given by
\begin{align*}
\frac{\partial}{\partial \bphi_i}
    \left ( \frac{\partial f}{\partial \bphi_j^{\top}} \right )
&= \frac{\partial}{\partial \bphi_i}
   \left (f_k(y)(\delta_{yj} - f_k(j))\bx^{\top} \right) \\
&= f_k(y)\left(0 - f_k(j)(\delta_{ij} - f_k(i))\right)
   \bx\bx^{\top} +
   f_k(y)(\delta_{yi} - f_k(i))\bx \,
   (\delta_{yj}- f_k(j))\bx^{\top} \\
&= f_k(y)\left(-f_k(j)(\delta_{ij} - f_k(i)) +
   (\delta_{yi} - f_k(i))(\delta_{yj} - f_k(j))\right)
   \bx \bx^{\top} \\
&= f_k(y)\left(f_k(i)f_k(j) - \delta_{ij}f_k(j) +
   (\delta_{yi} - f_k(i)) (\delta_{yj} - f_k(j))\right)
   \bx \bx^{\top} \; .
\end{align*}
At first glance this expression seems to be anomalously asymmetric in
$i$ and $j$, but the asymmetry is illusory.  Note that when
$i \neq j$, $\delta_{ij}f_k(j)$ is 0, and when $i = j$,
$\delta_{ij}f_k(j) = f_k(j) = f_k(i)$.

In summary we see that
\[
\frac{\partial^2 f}{\partial \bphi \partial \bphi^{\top}} =
\left [ \begin{array}{llcl}
a_{11} \bx \bx^{\top} & a_{12} \bx \bx^{\top} & \ldots &
    a_{1,m-1} \bx \bx^{\top} \\
a_{21} \bx \bx^{\top} & a_{22} \bx \bx^{\top} & \ldots &
    a_{2,m-1} \bx \bx^{\top} \\
\multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} &
\multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} \\
a_{m-1,1} \bx \bx^{\top} & a_{m-1,2} \bx \bx^{\top} & \ldots &
    a_{m-1,m-1} \bx \bx^{\top}
\end{array} \right ]
\]
where
$a_{ij} = f_k(y)\left(f_k(i)f_k(j) - \delta_{ij}f_k(j) +
(\delta_{yi} - f_k(i))(\delta_{yj} - f_k(j))\right)$,
$i, j = 1, \ldots, m - 1$.
\newpage
\bibliography{eglhmm}
\bibliographystyle{plainnat}
\end{document}