Tuesday, February 24, 2009

Approximations for Binary Gaussian Process Classification


This paper compares several GPC algorithms for binary classification (multi-class algorithms are more difficult to design, I guess). I have read about some of them in PRML and in Rasmussen's book on GPs, so I might not cover all of them here. The comparison shows that EP might be the most desirable algorithm for training GPC.

First, let's see the idea of using a GP for classification. It comes from probit regression: just as in GP regression, we use the GP to replace the parametric form <w, x>. The rest of the game is the link function. The common choices are the logistic function and the CDF of a standard Gaussian; the latter is sometimes more convenient for computation (even though, in the parametric setting, logistic regression is easier to handle than probit regression). In this literature the label y usually takes the values +1 and -1 (unlike in logistic regression, where 0 and 1 are more common), which makes the formulation harder to extend to multi-class cases. Letting f be the Gaussian process, the likelihood is usually Pr(y | f) = sig( y f ), where sig(.) is the link function.
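As a minimal sketch (in Python; the function names are mine, not the paper's), the two common link functions applied to z = y f are:

import numpy as np
from scipy.stats import norm

# Minimal sketch of the two common link functions, evaluated at z = y * f,
# where y is a label in {-1, +1} and f is the latent GP value at the input.
def logit_link(z):
    return 1.0 / (1.0 + np.exp(-z))   # logistic sigmoid

def probit_link(z):
    return norm.cdf(z)                # CDF of a standard Gaussian

# Either way, Pr(y | f) = sig(y * f).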

The difficulty in GPC is that the posterior probability can't be computed easily (because of the link function), so we seek suitable approximations for it. Usually we simply find a Gaussian to approximate it, and basically we can do this globally or locally.
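Concretely, assuming a zero-mean GP prior with covariance (Gram) matrix K over the latent values f at the training inputs, the posterior and its normalizer are (in LaTeX notation)

p(f \mid X, y) = \frac{1}{Z}\, N(f \mid 0, K) \prod_{i=1}^{n} \mathrm{sig}(y_i f_i),
\qquad
Z = p(y \mid X) = \int N(f \mid 0, K) \prod_{i=1}^{n} \mathrm{sig}(y_i f_i)\, df,

and the non-Gaussian product is exactly what makes Z and the posterior intractable.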

The simplest idea is the Laplace approximation. We take the logarithm of the posterior probability (without the normalization factor) and expand it to second order around its mode: the mode gives the mean, and the inverse of the negative Hessian gives the covariance matrix of the approximate Gaussian (the mode can be found by the Newton-Raphson method). This is obviously global and works poorly most of the time.
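A minimal sketch of the Newton iteration, assuming the probit link and a zero-mean prior with Gram matrix K (the function name and stopping rule are mine, not the paper's):

import numpy as np
from scipy.stats import norm

def laplace_gpc(K, y, max_iter=50, tol=1e-8):
    # Sketch of the Laplace approximation for binary GPC with a probit link.
    # K: (n, n) prior covariance (Gram) matrix; y: labels in {-1, +1}.
    # Returns the posterior mode f_hat and the diagonal of W = -Hessian of log p(y|f).
    n = len(y)
    f = np.zeros(n)
    for _ in range(max_iter):
        z = y * f
        pdf, cdf = norm.pdf(z), norm.cdf(z)
        grad = y * pdf / cdf                    # d log p(y|f) / df_i
        W = (pdf / cdf) * (z + pdf / cdf)       # -d^2 log p(y|f) / df_i^2 (diagonal)
        # Newton step on the unnormalized log posterior log p(y|f) - 0.5 f^T K^{-1} f,
        # rearranged so K never has to be inverted explicitly:
        b = W * f + grad
        f_new = np.linalg.solve(np.eye(n) + K * W, K @ b)   # (I + K diag(W))^{-1} K b
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    return f, W   # approximate posterior: N(f_hat, (K^{-1} + diag(W))^{-1})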

Expectation propagation (EP) originates from ADF (assumed-density filtering). It is a little complicated to fully explain its idea here, but it is worth mentioning that we don't know whether it really decreases the free energy or whether it converges at all, yet it yields the best results compared with the other methods. It is local.
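The local flavor comes from the way EP refines one likelihood ("site") term at a time by matching moments against the tilted distribution. For the probit link those moments are available in closed form; a small sketch, with (mu_cav, var_cav) denoting the cavity distribution obtained by leaving site i out of the current approximation:

import numpy as np
from scipy.stats import norm

def probit_tilted_moments(y_i, mu_cav, var_cav):
    # Moments of the tilted distribution  Phi(y_i * f) * N(f | mu_cav, var_cav),
    # which EP matches when it refines site i (standard closed-form results
    # for the probit likelihood).
    s = np.sqrt(1.0 + var_cav)
    z = y_i * mu_cav / s
    Z = norm.cdf(z)                     # normalizer of the tilted distribution
    r = norm.pdf(z) / Z
    mu_hat = mu_cav + y_i * var_cav * r / s
    var_hat = var_cav - var_cav**2 * r * (z + r) / (1.0 + var_cav)
    return Z, mu_hat, var_hat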

The KL divergence minimization (KL) method minimizes the KL divergence between the proposed Gaussian and the true posterior, which amounts to maximizing a lower bound on the marginal likelihood (with f as the latent variable). We simply take derivatives w.r.t. the proposed parameters. It is global.
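To spell that out (notation mine): with q(f) = N(f | m, V) as the proposed Gaussian, minimizing KL(q || posterior) is equivalent to maximizing a global lower bound on the log marginal likelihood,

\log Z \;\ge\; \sum_{i=1}^{n} \mathbb{E}_{q}\!\left[\log \mathrm{sig}(y_i f_i)\right] \;-\; \mathrm{KL}\!\left(N(m, V)\,\|\,N(0, K)\right),

since the gap between the two sides is exactly KL(q || posterior); taking derivatives of the right-hand side w.r.t. m and V gives the updates.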

The variational bound (VB) method is something like the local variational method: it seeks a lower bound on the difficult parts of the posterior (the individual likelihood terms). Maximizing this bound tightens a lower bound on the marginal likelihood. It is local, since for each sample we find the best (log-quadratic) lower bound separately.
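A standard example of such a per-sample log-quadratic bound is the Jaakkola-Jordan bound for the logit link (I am not sure this is the exact bound used in the paper): with a per-sample variational parameter xi,

\log \sigma(x) \;\ge\; \log \sigma(\xi) + \frac{x - \xi}{2} - \lambda(\xi)\,(x^{2} - \xi^{2}),
\qquad
\lambda(\xi) = \frac{1}{4\xi}\tanh\!\left(\frac{\xi}{2}\right),

which is quadratic in x, so plugging it into the Gaussian prior keeps everything Gaussian, and each xi_i can be optimized separately.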

The factorial variational method (FV, related to the global variational method in PRML) assumes the posterior factorizes. By minimizing the KL divergence between the proposed distribution and the true posterior, the variational calculation yields a functional form for each factor, and by updating the factors one by one we may find the most appropriate configuration. (It's dubious to say this one is local.)
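For reference, the generic mean-field result (notation mine): writing q(f) = prod_i q_i(f_i), the KL-optimal factors satisfy

q_i(f_i) \;\propto\; \exp\!\Big( \mathbb{E}_{q_{\setminus i}}\!\big[\log p(y, f)\big] \Big)
\;\propto\; \mathrm{sig}(y_i f_i)\; N\!\big(f_i \mid m_i,\; (K^{-1})_{ii}^{-1}\big),

where m_i collects the means of the other factors through the precision matrix K^{-1}; cycling through these coordinate updates is the "updating the parameters one by one" step.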

Label regression (LR) is even simpler: it just treats classification as regression on the +1/-1 labels.
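A minimal sketch, assuming GP regression with Gaussian noise on the +1/-1 labels followed by thresholding the predictive mean (variable names and the noise level are mine):

import numpy as np

def label_regression_predict(K, K_star, y, noise=0.1):
    # Treat the labels y in {-1, +1} as real-valued regression targets,
    # do standard GP regression, and threshold the predictive mean.
    # K: (n, n) training Gram matrix; K_star: (m, n) test-vs-train covariances;
    # noise: assumed label-noise variance (hyperparameter).
    n = len(y)
    alpha = np.linalg.solve(K + noise * np.eye(n), y)
    f_star = K_star @ alpha            # GP predictive mean at the test points
    return np.sign(f_star)             # predicted labels in {-1, +1}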

They also implement MCMC to obtain "accurate" reference results. I am very interested in their implementation and might use it in later work.
