Thursday, July 2, 2009

A Majorization-Minimization Algorithm for (Multiple) Hyperparameters Learning


First we must understand what the majorization-minimization (MM) algorithm is. It is in fact the auxilliary function method, which might be regarded as a generalization of EM algorithm. To minimize (maximize) a function L(x), we find an upper (lower) bound for the objective which is more easier to minimize (maximize), Q(x; x') where the x' is the parameter for the upper (lower) bounding function. This can be seen as a generalization of the idea of CCCP, since in CCCP, we use linear functions to bound the convex (concave) functions and the parameter is simply the variable introduced by Legendre-Fenchel transform. And we require L(x) - Q(x; x') reaches its minimum (maximum) at x = x'. As we can see here, the following inequality holds
L(x) \leq Q(x; x') \qquad \text{or} \qquad L(x) \geq Q(x; x')
The idea is instead of optimizing L(x) directly due to its intractability, we may find suitable algorithm to optimize Q(x; x') where x' = x^{(n)}. Therefore for the minimization case we have the following inequality
L(x^{(n)}) = Q(x^{(n)}; x^{(n-1)}) + L(x^{(n)} - Q(x^{(n)}; x^{(n-1)})) \leq Q(x^{(n-1)}; x^{(n-1)}) + L(x^{(n-1)}) - Q(x^{(n-1)}; x^{(n-1)}) = L(x^{(n-1)})
where the inequality uses two optima, Q(x^{(n)}; x^{(n-1)}) is minimum of Q(x; x^{(n-1)}) and L(x) - Q(x; x^{(n-1)}) reaches its maximum at x = x^{(n-1)}.

Now let's come back to the Bayes learning problem. We know to select a proper hyperparameter, either we use cross validation, or we rely on maximization of Type II likelihood (EM solves it :-p). Here the author simply introduce another prior for the hyperparameter and integrate it out. E.g. for a Gaussian prior for the parameter, usually the hyperparameter is the precision, whose prior is then set to a Gamma distribution. But they kind of stealthly switch the concept. In Bayes framework, the joint distribution is then \Pr(\mathcal{D} \mid w) \Pr(w \mid \alpha) \Pr(\alpha). They first integrate out the hyperparameter \alpha, which leaves us \Pr(\mathcal{D} \mid w) \Pr(w). The we find the MAP estimation for simplicity, which results in
\min_w -\log \Pr(\mathcal{D} \mid w) - \log \Pr(w).
Now we may find the second term is of the form of - C\cdot\log (\| w \|^2 + \beta ), which is a different from the norm regularizer.

We solve this problem with MM optimization technique. We construct an upper bound for the second term
\log x \leq \log y + \frac{x - y}{y}
which is a linear function of x and therefore the optimization problem turns back to the original ``loss + norm regularizer'' form.

Don't see any thing special from this point of view.

0 comments: