# Variational Bayes Inference

A tutorial on VBI라는 강의자료를 토대로 정리한 내용.

http://www.orchid.ac.uk/eprints/40/1/fox_vbtut.pdf

1_ introduction

Variational Bayes is a particular variational method which aims :

• to find some approximate joint distribution Q(x;θ)
• over hidden variables x
• to approximate the true joint P(x),
• and defines ‘closeness’ as the KL divergence KL[Q(x;θ)||P(x)].

The mean-field form of VB assumes :

– Q factorises into single-variable factors, Q(x) =  iQi(xi|θi).

The asymmetric KL[Q||P] is chosen in VB principally

to yield useful computational simplifications,

but can be viewed as preferring approximation where

areas of high Q are accurate, rather than areas of high P.

-> This is often useful because if we were to draw samples from, or integrate over, Q, then the areas used will be largely accurate (though they may well miss out areas of high P).

1.1 Problem statement

We wish to find a set of distributions {Qi(xi;θi)} to minimise the KL-divergence:

D : observed data

x : unobserved variables

We sometimes omit the notational dependence of Qi on θi for clarity.

As the Qi are approximate beliefs, they are subject to the normalisation constraints:

(각 Qi가 probability function이기 때문에 당연.)

1.2 VB approximates joints, not marginals

Q approximates the joint, but the individual Qi(xi) are poor approximations to the true marginals Pi(xi).

The Qi(xi) components should not be expected to resemble – even remotely – the true marginals,

for example when inspecting the computational states of network nodes.

This makes VB considerably harder to debug than algorithms whose node states do have some local interpretation : the optimal VB components can be highly counter-intuitive and make sense only in the global context.

Suppose a factory produces screws of unknown diameter μ. We know the machines are very accurate so the precision γ = 1/σ2 of the diameters is high.

However no one has told us what the mean diameter x is, except for a drunken engineer in a pub who said is was 5mm.

We might have a prior belief π that μ is say 5±4mm (the 4mm deviation reflecting our low confidence in the report.)

Suppose we are given a sealed box containing one screw.

What is our belief in its diameter x? The exact marginal belief in x should be almost identical to our belief in μ, i.e. 5±4mm.

Fig. 1(b) shows one standard deviation of the true Gaussian joint P (μ, x) and the best mean-field Gaussian approximate joint Q(μ, x).

As discussed above, this Q is chosen to minimise error in its own domain so it appears as a tight Gaussian.

Importantly, the marginal Qx(x) is now very small, and not at all equal to the marginal P(x). We emphasize such cases because we found them to be the major cause of debugging time during development.

We also want to find the model likelihood, P(D|M) for model comparison.

This quantity will appear naturally as we try to minimise the KL distance.
There are many ways to think about the following derivations, which we will see involves a balance of three terms called

• lower bound,
• energy and
• entropy.

The derivation presented here begins by aiming to minimise the above KL distance.

For example, other equivalent derivations may begin by aiming to maximise the lower bound.

1.3 Rewriting KL optimisation as an easier problem

We will rewrite the KL equation in terms that are more tractable.

First we flip the numerator and denominator, and flip the sign:

Next, replace the conditional P(x|D with a joint P(x,D) and a prior P(D).

The reason for making this rewrite is that for Bayesian networks with exponential family nodes, the logP(x,D) term will be a be a very simple sum of node energy terms, whereas log P (x|D) is more complicated.

The log P (D) term does not involve Q so we can ignore it for the purposes of our minimisation.

So to minimize the KL divergence, we must maximize L.(lower bound)

-> subject to constraints :

L[Q(x)] are lower bounds on the model log-likelihood, P(D) = P(D|M)

(where we generally drop the M notation as we are working with a single model only).

The best bound is thus achieved when L[Q(x)] is maximised over Q.

The reason for L being a lower bound is seen by rearranging as:

Thus when the KL-divergence is zero (a perfect fit), L is equal to the model log- likelihood.

When the fit is not perfect, the KL-divergence is always positive and so L[Q(x)] < ln P (D).

또다르게 식을 정리해보자. :

-> the KL divergence is the error between L and ln P (D).

1.4 Solution of free energy optimisation

We now wish to find Q to maximise the lower bound, subject to normalisation constraints,

where we define energy as E = lnP and entropy as H[Q(x)] = −  dxQ(x)lnQ(x).

For exponential models,

this will become a convenient sum of linear functions.

By the mean field assumption:

Consider the entropy (rightmost) term.

We can bring out the sum:

Consider the partitions x = {xi, x ̄i} where x ̄i = xxi.

Substituting this into the right term of equation 1:

Now look at and rearrange the energy (left) term of equation 2,

again separating out one variable:

where we have defined

and Z normalises Q∗(x ).

Substituting this new form of the energy back into equation 2 yields :

Separate out the entropy for Hi = H[Qi(xi)] from the rest of the entropy sum:

Consider the terms in the brackets:

What a lucky co-incidence!

Though we started by trying to minimise the KL- divergence between large joint distributions (which is hard),

-> we have converted the problem to that of minimising KL-divergences between individual 1D distributions (which is easier).

Write:

Thus L depends on each individual Qi only through the KL term.

We wish to maximise L with respect to each Qi,

subject to the constraint that all Qi are normalized to unity.

This could be achieved by Lagrange multipliers and functional differentiation:

A long algebraic derivation would then eventually lead to a Gibbs distribution.

However, thanks to the KL form rearrangement we do not need to perform any of this,

because we can see immediately that L will be maximised when the KL divergence is zero, hence when

(The normalisation constraint on Qi is satisfied thanks to the inclusion of Z in theprevious definition of Q∗i).

Expanding back the definition gives the optimal Qi to be

where E(xi, x ̄i, D) = log P (xi, x ̄i, D) is theenergy.

참고할 내용>

Mean-field variational Bayesian approximation to inference in graphical models.

In physics and probability theorymean field theory (MFT also known as self-consistent field theory) studies the behavior of large and complex stochastic models by studying a simpler model.

Such models consider a large number of small individual components which interact with each other.

The effect of all the other individuals on any given individual is approximated by a single averaged effect, thus reducing a many-body problem to a one-body problem.

흠 self-consistent field theory라는 이름도 멋있네.