Variational message passing

Approximate inference technique in Bayesian networks

Variational message passing (VMP) is an approximate inference technique for continuous- or discrete-valued Bayesian networks with conjugate-exponential parents, developed by John Winn. VMP was developed as a means of generalizing the approximate variational methods used by techniques such as latent Dirichlet allocation, and works by updating an approximate posterior distribution at each node using messages passed within the node's Markov blanket.

Likelihood lower bound

Given some set of hidden variables $H$ and observed variables $V$, the goal of approximate inference is to maximize a lower bound on the log probability $\ln P(V)$ that the graphical model assigns to the observed configuration $V$. Over some probability distribution $Q$ (to be defined later),

$$\ln P(V) = \sum_{H} Q(H)\,\ln \frac{P(H,V)}{P(H\mid V)} = \sum_{H} Q(H)\left[\ln \frac{P(H,V)}{Q(H)} - \ln \frac{P(H\mid V)}{Q(H)}\right].$$

So, if we define our lower bound to be

$$L(Q) = \sum_{H} Q(H)\,\ln \frac{P(H,V)}{Q(H)},$$

then the log likelihood is simply this bound plus the relative entropy between $Q(H)$ and the posterior $P(H\mid V)$. Because the relative entropy is non-negative, the function $L$ defined above is indeed a lower bound on the log likelihood of the observation $V$. The distribution $Q$ is chosen to have a simpler form than $P$, because marginalizing over $P$ is intractable for all but the simplest graphical models. In particular, VMP uses a factorized distribution

$$Q(H) = \prod_{i} Q_{i}(H_{i}),$$

where each $H_{i}$ is a disjoint subset of the hidden variables.
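
For illustration, the following minimal sketch (not part of the original presentation; the toy model and all names are assumptions chosen for the example) checks the decomposition above numerically for a single discrete hidden variable: the log evidence $\ln P(V)$ equals the bound $L(Q)$ plus the relative entropy between $Q(H)$ and the posterior $P(H\mid V)$, for any choice of $Q$.

```python
import numpy as np

# Toy model: hidden H in {0, 1, 2}, observed V fixed to one value.
p_H = np.array([0.5, 0.3, 0.2])           # prior P(H)
p_V_given_H = np.array([0.7, 0.2, 0.4])   # likelihood P(V = v_obs | H)

p_joint = p_H * p_V_given_H               # P(H, V = v_obs)
p_V = p_joint.sum()                       # evidence P(V = v_obs)
p_post = p_joint / p_V                    # posterior P(H | V = v_obs)

Q = np.array([0.6, 0.3, 0.1])             # an arbitrary approximating distribution

L = np.sum(Q * np.log(p_joint / Q))       # lower bound L(Q)
kl = np.sum(Q * np.log(Q / p_post))       # KL(Q || P(H | V)) >= 0

# ln P(V) = L(Q) + KL(Q || P(H|V)) holds exactly, so L(Q) <= ln P(V).
assert np.isclose(np.log(p_V), L + kl)
```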

Determining the update rule

The lower bound needs to be made as large as possible; because it is a lower bound, moving it closer to $\ln P(V)$ improves the approximation of the log likelihood. Substituting in the factorized version of $Q$, the bound $L(Q)$, viewed as a function of a single factor $Q_{j}$ over the hidden nodes $H_{j}$, is simply the negative relative entropy between $Q_{j}$ and $Q_{j}^{*}$ plus terms independent of $Q_{j}$, provided $Q_{j}^{*}$ is defined as

$$Q_{j}^{*}(H_{j}) = \frac{1}{Z}\, e^{\mathbb{E}_{-j}\{\ln P(H,V)\}},$$

where $\mathbb{E}_{-j}\{\ln P(H,V)\}$ denotes the expectation of $\ln P(H,V)$ over all distributions $Q_{i}$ except $Q_{j}$, and $Z$ is a normalizing constant. Thus, setting $Q_{j}$ equal to $Q_{j}^{*}$ maximizes the bound $L$ with respect to that factor.
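
As a minimal sketch of this update (an illustration with an arbitrary joint table, not a general implementation), consider two discrete hidden variables $H_{1}$ and $H_{2}$ with a fixed observation folded into the log joint. Each factor is set proportional to the exponential of the expected log joint, with the expectation taken under the other factor, and the two updates are alternated:

```python
import numpy as np

rng = np.random.default_rng(0)

# ln P(H1, H2, V = v_obs): an arbitrary 3 x 4 table of log joint values.
log_joint = np.log(rng.dirichlet(np.ones(12)).reshape(3, 4))

q1 = np.full(3, 1.0 / 3.0)   # Q1(H1), initialized uniform
q2 = np.full(4, 1.0 / 4.0)   # Q2(H2), initialized uniform

def exp_normalize(log_unnorm):
    """Return exp(log_unnorm) normalized to sum to one (computed stably)."""
    w = np.exp(log_unnorm - log_unnorm.max())
    return w / w.sum()

for _ in range(50):
    # Q1*(H1) ∝ exp( E_{Q2}[ ln P(H1, H2, V) ] )
    q1 = exp_normalize(log_joint @ q2)
    # Q2*(H2) ∝ exp( E_{Q1}[ ln P(H1, H2, V) ] )
    q2 = exp_normalize(log_joint.T @ q1)
```

Each such update can only increase the bound $L(Q)$, so the iteration converges to a local maximum of the bound.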

Messages in variational message passing

Parents send their children the expectations of their sufficient statistics, while children send their parents their natural parameters; computing the child-to-parent messages also requires messages from the co-parents of the node.
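
To see where these message forms come from, consider the standard example (not specific to this article's derivation) of a Gaussian node $x$ with a mean parent $\mu$ and a precision parent $\gamma$. The conditional log density can be read as linear in the sufficient statistics $(x, x^{2})$ of the child, or equally as linear in the sufficient statistics $(\mu, \mu^{2})$ of the mean parent:

$$
\ln p(x \mid \mu, \gamma)
= \begin{pmatrix} x \\ x^{2} \end{pmatrix}^{\top}
  \begin{pmatrix} \gamma\mu \\ -\gamma/2 \end{pmatrix}
  + \tfrac{1}{2}\bigl(\ln\gamma - \gamma\mu^{2} - \ln 2\pi\bigr)
= \begin{pmatrix} \mu \\ \mu^{2} \end{pmatrix}^{\top}
  \begin{pmatrix} \gamma x \\ -\gamma/2 \end{pmatrix}
  + \tfrac{1}{2}\bigl(\ln\gamma - \gamma x^{2} - \ln 2\pi\bigr).
$$

The message from $\mu$ to $x$ is the expectation $\langle(\mu, \mu^{2})\rangle$ of the parent's sufficient statistics, while the message from $x$ back to $\mu$ is the expected coefficient vector $(\langle\gamma\rangle\langle x\rangle,\; -\langle\gamma\rangle/2)$, a contribution to the parent's natural parameters; evaluating it requires the expected sufficient statistics of the co-parent $\gamma$.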

Relationship to exponential families

Because all nodes in VMP belong to exponential families and every parent node is conjugate to its child nodes, the expectation of a node's sufficient statistics can be computed from its normalization factor (by differentiating the log-normalizer with respect to the natural parameters).
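
Concretely (a standard exponential-family identity, stated here for completeness): if a node has density $p(x \mid \eta) = h(x)\exp\{\eta^{\top}u(x) - A(\eta)\}$ with natural parameters $\eta$, sufficient statistics $u(x)$, and log-normalizer $A(\eta)$, then

$$\langle u(x) \rangle = \nabla_{\eta} A(\eta).$$

For example, for a gamma node with shape $a$ and rate $b$, whose sufficient statistics are $(\ln x, x)$, this identity gives $\langle \ln x \rangle = \psi(a) - \ln b$ and $\langle x \rangle = a/b$, where $\psi$ is the digamma function.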

The algorithm begins by initializing the expected values of the sufficient statistics of the hidden nodes. Then, until the lower bound on the likelihood converges to a stable value (usually by setting a small threshold and iterating until the increase per iteration falls below it), do the following at each node (a minimal sketch of this loop for a simple Gaussian model follows the list):

  1. Get all messages from parents.
  2. Get all messages from children (this might require the children to get messages from their co-parents).
  3. Compute the expected value of the node's sufficient statistics and update its factor accordingly.
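
As a minimal sketch of this loop (an illustration under standard assumptions, not the authors' reference implementation; the model, priors, and all names are chosen for the example), the following code infers the mean and precision of Gaussian observations, with a Gaussian parent for the mean and a gamma parent for the precision. Parent-to-child messages are expected sufficient statistics ($\langle\mu\rangle$, $\langle\mu^{2}\rangle$, $\langle\gamma\rangle$), and child-to-parent messages are natural-parameter contributions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: N observations from a Gaussian with unknown mean and precision.
true_mean, true_precision = 2.0, 4.0
x = rng.normal(true_mean, 1.0 / np.sqrt(true_precision), size=100)
N = len(x)

# Priors (hypothetical choices): mu ~ N(m0, 1/beta0), gamma ~ Gamma(shape=a0, rate=b0).
m0, beta0 = 0.0, 1e-3
a0, b0 = 1e-3, 1e-3

# Initialize the expected sufficient statistics of each hidden node.
E_gamma = 1.0                  # <gamma>
E_mu, E_mu2 = 0.0, 1.0         # <mu>, <mu^2>

for _ in range(100):
    # --- Update Q(mu) = N(m, 1/beta) ---
    # Each observed child x_n sends the natural-parameter message
    # (<gamma> * x_n, -<gamma> / 2), which needs the co-parent message <gamma>.
    nat1 = beta0 * m0 + E_gamma * x.sum()      # coefficient of mu
    nat2 = -0.5 * beta0 - 0.5 * E_gamma * N    # coefficient of mu^2
    beta = -2.0 * nat2
    m = nat1 / beta
    E_mu, E_mu2 = m, m * m + 1.0 / beta        # outgoing message <(mu, mu^2)>

    # --- Update Q(gamma) = Gamma(shape=a, rate=b) ---
    # Each observed child x_n sends the natural-parameter message
    # (-(x_n^2 - 2 x_n <mu> + <mu^2>) / 2, 1/2), using the co-parent message <(mu, mu^2)>.
    a = a0 + 0.5 * N
    b = b0 + 0.5 * np.sum(x * x - 2.0 * x * E_mu + E_mu2)
    E_gamma = a / b                            # outgoing message <gamma>

print("posterior mean of mu:", m, "  posterior mean of gamma:", E_gamma)
```

In practice the loop would terminate once the increase in the lower bound falls below the chosen threshold, rather than after a fixed number of iterations.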

Because every parent–child relationship in the model must be conjugate, the types of distributions that can be used in the model are limited. For example, the parents of a Gaussian node must be a Gaussian distribution (corresponding to the mean) and a gamma distribution (corresponding to the precision, i.e. the reciprocal of the variance $\sigma^{2}$ in more common parameterizations). Discrete variables can have Dirichlet parents, and Poisson and exponential nodes must have gamma parents. More recently, VMP has been extended to handle models that violate this conditional conjugacy constraint.[1]
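
The gamma/precision pairing, for instance, is conjugate because the Gaussian log density, viewed as a function of the precision, is linear in the gamma distribution's sufficient statistics $(\gamma, \ln\gamma)$:

$$
\ln p(x \mid \mu, \gamma) = \tfrac{1}{2}\ln\gamma - \tfrac{\gamma}{2}(x-\mu)^{2} - \tfrac{1}{2}\ln 2\pi,
\qquad
\ln \operatorname{Gamma}(\gamma \mid a, b) = (a-1)\ln\gamma - b\gamma + a\ln b - \ln\Gamma(a),
$$

so multiplying the prior by the likelihood again yields a gamma-form posterior, with $a$ increased by $1/2$ and $b$ increased by $(x-\mu)^{2}/2$ for each observation (with expectations over $\mu$ taken under $Q$ in the VMP update).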

  1. ^ Knowles, David A.; Minka, Thomas P. (2011). "Non-conjugate Variational Message Passing for Multinomial and Binary Regression" (PDF). NeurIPS.