First we must understand what the majorization-minimization (MM) algorithm is. It is in fact the auxilliary function method, which might be regarded as a generalization of EM algorithm. To minimize (maximize) a function
L(x), we find an upper (lower) bound for the objective which is more easier to minimize (maximize), Q(x; x') where the x' is the parameter for the upper (lower) bounding function. This can be seen as a generalization of the idea of CCCP, since in CCCP, we use linear functions to bound the convex (concave) functions and the parameter is simply the variable introduced by Legendre-Fenchel transform. And we require L(x) - Q(x; x') reaches its minimum (maximum) at x = x'. As we can see here, the following inequality holdsL(x) \leq Q(x; x') \qquad \text{or} \qquad L(x) \geq Q(x; x')The idea is instead of optimizing L(x) directly due to its intractability, we may find suitable algorithm to optimize Q(x; x') where x' = x^{(n)}. Therefore for the minimization case we have the following inequalityL(x^{(n)}) = Q(x^{(n)}; x^{(n-1)}) + L(x^{(n)} - Q(x^{(n)}; x^{(n-1)})) \leq Q(x^{(n-1)}; x^{(n-1)}) + L(x^{(n-1)}) - Q(x^{(n-1)}; x^{(n-1)}) = L(x^{(n-1)})where the inequality uses two optima, Q(x^{(n)}; x^{(n-1)}) is minimum of Q(x; x^{(n-1)}) and L(x) - Q(x; x^{(n-1)}) reaches its maximum at x = x^{(n-1)}.Now let's come back to the Bayes learning problem. We know to select a proper hyperparameter, either we use cross validation, or we rely on maximization of Type II likelihood (EM solves it :-p). Here the author simply introduce another prior for the hyperparameter and integrate it out. E.g. for a Gaussian prior for the parameter, usually the hyperparameter is the precision, whose prior is then set to a Gamma distribution. But they kind of stealthly switch the concept. In Bayes framework, the joint distribution is then
\Pr(\mathcal{D} \mid w) \Pr(w \mid \alpha) \Pr(\alpha). They first integrate out the hyperparameter \alpha, which leaves us \Pr(\mathcal{D} \mid w) \Pr(w). The we find the MAP estimation for simplicity, which results in\min_w -\log \Pr(\mathcal{D} \mid w) - \log \Pr(w).Now we may find the second term is of the form of - C\cdot\log (\| w \|^2 + \beta ), which is a different from the norm regularizer.We solve this problem with MM optimization technique. We construct an upper bound for the second term
\log x \leq \log y + \frac{x - y}{y}which is a linear function of x and therefore the optimization problem turns back to the original ``loss + norm regularizer'' form.Don't see any thing special from this point of view.
0 comments:
Post a Comment