Sunday, January 11, 2009

Manifold Learning: The Price of Normalization


Although many manifold learning algorithms have been proposed since around 2000, how to evaluate them is still seldom discussed. We may have a vague idea of what we want when we design an algorithm, but little further checking of the results is carried out. This paper focuses on that tough part.

The main idea of the paper is that several algorithms that rely on normalization (the usual, somewhat feeble justification is to avoid a collapsed projection) can fail precisely because of that normalization. These algorithms include LLE, Laplacian Eigenmaps (LEM), LTSA, HLLE and DFM. As is known, MVU does not suffer from this problem even though it is also a spectral algorithm (it seems that the algorithms solving for the largest eigenvalues are immune, since they do not need normalization). An intuitive picture of the problem: in order to satisfy the normalization constraints, the embedding may have to fold so that the data spread out equally in every direction.

To analyze the properties of these algorithms, we need a uniform framework covering these seemingly unrelated methods. The authors associate with each sample xi a weight matrix Wi; the embedding is then formulated as minimizing the sum of || Wi yi || (more precisely, a quadratic form in the embedded neighborhood of each xi). Some of the weight matrices are simple vectors (e.g. LLE), while others are a bit more complicated. The optimization can be written as minimizing Phi( Y ) subject to the normalization constraints. This cost function is the key to understanding the paper.
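
In my own rough notation (the paper's exact formulation differs per algorithm), the shared optimization looks like

    \min_Y \; \Phi(Y) = \sum_{i=1}^{n} \| W_i \, Y_{N(i)} \|^2
    \quad \text{s.t.} \quad \frac{1}{n}\sum_i y_i = 0, \qquad \frac{1}{n}\sum_i y_i y_i^\top = I_d ,

where Y_{N(i)} stacks the embedded coordinates of xi and its neighbors; for LEM and DFM the covariance constraint is replaced by a degree-weighted version.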

The embedding quality must be quantified. The authors note that the embedding usually does not preserve neighborhoods exactly, even though such preservation is exactly what we desire, so the criterion has to be relaxed accordingly: an affine transformation A is allowed to act on the ideal embedding Y. The algorithm is said to fail if there exists another embedding Z that satisfies the normalization constraints and yet Phi( AY ) > Phi( Z ), i.e. the algorithm prefers Z to the affine image of the ideal embedding. This definition is quantifiable and computable.

With this tool they analyze all five algorithms and derive several necessary conditions for these algorithms to succeed, both for finite samples and in the asymptotic case. They conclude that applying these algorithms to real data may be problematic. It seems that this way of searching for a good embedding is not quite what practitioners actually want.

Friday, January 9, 2009

Local Distance Preservation in the GP-LVM through Back Constraints


This is a piece of work following GP-LVM. The idea is so simple that I suddenly felt I had been cheated in a way; it is not a very elusive idea. The GP-LVM maximizes the likelihood by solving a non-linear optimization whose solution is the set of coordinates in the low-dimensional space. Now, in order to impose ``local information'' from the high-dimensional space, it is very natural to assume there exists a mapping (parametric or non-parametric) such that the low-dimensional coordinates are the image of the high-dimensional ones under that mapping.

Then we maximize the likelihood in this constrained form, optimizing over the parameters of the mapping instead of over the latent coordinates directly.
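
A minimal sketch of this flavor, assuming an RBF kernel both for the GP and for the back-constraint mapping; the function names, the toy data and the use of numerical gradients are mine (a real implementation would use analytic gradients):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.spatial.distance import cdist

    def gplvm_nll(Z, X, sigma2=0.1):
        # negative log marginal likelihood of a GP-LVM with an RBF kernel on the latent Z
        n, d = X.shape
        K = np.exp(-0.5 * cdist(Z, Z, 'sqeuclidean')) + sigma2 * np.eye(n)
        sign, logdet = np.linalg.slogdet(K)
        return 0.5 * d * logdet + 0.5 * np.trace(np.linalg.solve(K, X @ X.T))

    def back_constrained_nll(a_flat, X, Kx, q):
        # back constraint: the latent coordinates are a kernel mapping of the data, Z = Kx A,
        # so we optimize the mapping weights A instead of Z itself
        A = a_flat.reshape(X.shape[0], q)
        return gplvm_nll(Kx @ A, X)

    X = np.random.randn(40, 6); X -= X.mean(0)       # toy high-dimensional data
    Kx = np.exp(-0.5 * cdist(X, X, 'sqeuclidean'))   # kernel on the *observed* space
    q = 2
    A0 = 0.01 * np.random.randn(X.shape[0], q)
    res = minimize(back_constrained_nll, A0.ravel(), args=(X, Kx, q),
                   method='CG', options={'maxiter': 30})
    Z = Kx @ res.x.reshape(-1, q)                    # nearby x's get nearby z's by construction

The point of the construction is exactly the local-distance claim: since Z is a smooth function of the data, points that are close in the high-dimensional space cannot be mapped far apart.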

I think the application of GP-LVM is more attractive than this paper -,-b

Fast Sparse Gaussian Process Methods: The Informative Vector Machine


This NIPS paper does not cover as much as the accompanying technical report. The authors explore a sample-selection method for Gaussian processes, studying regression, classification and ordinal regression problems. The idea originates from ADF (which looks like EP). It is a greedy algorithm that selects the d samples maximizing a certain criterion, the differential entropy score.
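
A minimal sketch of the regression case as I understand it (the helper names and the rank-one bookkeeping are mine; the real IVM maintains ADF site parameters and also handles classification and ordinal regression): each step includes the point whose current posterior variance gives the largest entropy reduction 0.5 log(1 + var_i / noise).

    import numpy as np
    from scipy.spatial.distance import cdist

    def ivm_select(K, noise, d):
        # Greedy active-set selection for GP regression in the spirit of IVM.
        n = K.shape[0]
        var = np.diag(K).copy()                  # current posterior marginal variances
        M = np.zeros((0, n))                     # rows m_j so that posterior cov = K - M^T M
        active = []
        for _ in range(d):
            score = 0.5 * np.log1p(var / noise)  # differential entropy reduction per point
            score[active] = -np.inf
            i = int(np.argmax(score))
            k_i = K[:, i] - M.T @ M[:, i]        # current posterior covariance column for i
            m_new = k_i / np.sqrt(var[i] + noise)
            var = var - m_new ** 2               # rank-one downdate after absorbing point i
            M = np.vstack([M, m_new])
            active.append(i)
        return active

    # usage on a toy RBF kernel
    X = np.random.randn(200, 3)
    K = np.exp(-0.5 * cdist(X, X, 'sqeuclidean'))
    active = ivm_select(K, noise=0.1, d=20)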

I keep thinking about the Nystrom approximation. The two methods have (perhaps?) different intentions: IVM aims at a sparse representation (a bit like what SVM achieves through its optimization, though IVM is somewhat brute-force), whereas the Nystrom method is a more general approximation applicable to all kernel methods. The former is supervised while the latter is unsupervised.
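
For comparison, a minimal sketch of the Nystrom idea (my notation): pick a subset of columns and approximate the full Gram matrix from them, with no reference to labels at all.

    import numpy as np

    def nystrom(K, idx, jitter=1e-8):
        # Low-rank approximation K ~= C W^{-1} C^T built from the column subset idx
        C = K[:, idx]
        W = K[np.ix_(idx, idx)] + jitter * np.eye(len(idx))
        return C @ np.linalg.solve(W, C.T)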

But is IVM extensible for other kernel-based classifiers and regressors?

Sunday, January 4, 2009

Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models

by Neil Lawrence

This might be the first article that covers most aspects of GP-LVM, which is of great interest to me since Fei Sha mentioned last time that this model is related to some former research of mine. The idea is not brand new, but to me it is something fresh.

The basic model comes from Probabilistic PCA (PPCA). PPCA is a latent variable model that interprets PCA in a probabilistic framework. The latent variable z generates our observation through a linear transformation x = Wz + b + e, where e is isotropic noise. The bias b can be estimated independently by the sample mean. In PPCA, z has a standard Gaussian prior. The estimation of W and of the noise variance can be done with the EM algorithm. As you might see, the EM algorithm actually maximizes the marginal distribution of x, with z marginalized out.
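
Written out (in my notation), the model and the marginal that EM maximizes are

    z \sim N(0, I), \qquad x \mid z \sim N(Wz + b, \; \sigma^2 I)
    \;\;\Longrightarrow\;\; x \sim N(b, \; W W^\top + \sigma^2 I),

and Tipping and Bishop showed that the maximum-likelihood W spans the principal subspace of the data.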

In a linear model it is easy to obtain both the reduction function and the reconstruction function. For a nonlinear model, however, we usually only get a one-way mapping; the inverse mapping is difficult to obtain, and GP-LVM is no exception. We may think of GP-LVM like this. The starting point is to marginalize W instead of z in the PPCA model: we endow W with a similar Gaussian prior, and then z can be obtained from the eigendecomposition of X X^T; the solution resembles PCA and classical MDS. In GP-LVM, unlike kernel PCA, which plugs a kernelized Gram matrix of the observed data into that eigendecomposition (so the solution is still an eigendecomposition), we have to find the best configuration of all the z, whose kernel matrix best approximates X X^T. So although the optimization problem looks quite similar, it is much harder to solve because of the nonlinearity of the kernel function, and a gradient-based search such as CG must be applied.
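
Roughly, in my notation: marginalizing W with independent standard normal entries gives a Gaussian over each observed dimension, and GP-LVM then swaps the linear kernel Z Z^T for a nonlinear one,

    p(X \mid Z) = \prod_{j=1}^{D} N\!\left(x_{:,j} \mid 0, \; K\right),
    \qquad
    \log p(X \mid Z) = -\frac{D}{2}\log|K| - \frac{1}{2}\mathrm{tr}\!\left(K^{-1} X X^\top\right) + \text{const},

with K = Z Z^\top + \sigma^2 I in the linear case and K = K_{rbf}(Z, Z) + \sigma^2 I in GP-LVM; the latter is maximized over Z by conjugate gradients.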

But as we might foresee, the computational cost is intolerable for practical problems. The paper therefore focuses on designing an efficient algorithm for GP-LVM. The most important technique they employ is the informative vector machine, which selects a comparatively small active set for the kernel computations (much like the Nystrom method).

Friday, January 2, 2009

Exploiting Generative Models in Discriminative Classifiers


This is a classic semi-supervised style of learning. First a generative model is trained on all samples, ignoring labels. The model has parameters theta, and the Fisher information matrix I can be computed (the covariance of the score, or equivalently the negative expected Hessian of the log-likelihood). With this, a kernel can be introduced as a quadratic form of the score Ux = grad_theta log P( x | theta ):
k( x, z ) = Ux^T I^{-1} Uz
The kernel can then be exploited by discriminative models such as GLIMs and SVMs.
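
A tiny sketch for the simplest model I can think of, a Gaussian with unknown mean (the model choice and the names are my own illustration, not the paper's): the score is Ux = Sigma^{-1}(x - mu), the Fisher information is Sigma^{-1}, and the kernel reduces to a Mahalanobis inner product.

    import numpy as np

    def fisher_kernel_gaussian(x, z, mu, Sigma):
        # score of N(mu, Sigma) w.r.t. the mean parameter: U_x = Sigma^{-1} (x - mu)
        Si = np.linalg.inv(Sigma)
        Ux, Uz = Si @ (x - mu), Si @ (z - mu)
        I_fisher = Si                                # Fisher information for the mean
        return Ux @ np.linalg.solve(I_fisher, Uz)    # U_x^T I^{-1} U_z

    # "train" the generative model on unlabeled data, then evaluate the kernel
    Xu = np.random.randn(500, 3) * np.array([1.0, 2.0, 0.5])
    mu, Sigma = Xu.mean(axis=0), np.cov(Xu.T)
    print(fisher_kernel_gaussian(Xu[0], Xu[1], mu, Sigma))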

It can easily be shown that the proposed kernel has several nice properties:
  • It is a valid (positive semi-definite) kernel.
  • Like Jeffreys's rule, it is invariant under continuously differentiable reparametrizations of the model.
  • A kernel classifier employing this kernel, derived from a model in which the latent variable is the class label, is asymptotically as good as the MAP labelling.

Structured Ranking Learning using Cumulative Distribution Networks


This NIPS paper further elucidates how to use the idea of CDNs to tackle ranking problems. The authors introduce a probability distribution for the preference variables given the ordered relationships. A natural ranking scheme is therefore to maximize the likelihood, or equivalently to minimize the negative log-likelihood as a loss function. Once the ranking function is parameterized, the optimization problem can be solved; the paper employs a nonparametric model. The CDN here plays the role of defining the CDF of the observed preference variables.
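
Just to record the flavor with a toy I made up (a pairwise version with a linear scoring function and Gaussian CDFs, far simpler than the paper's structured CDN model): the observed preferences define a likelihood, and learning minimizes its negative log.

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize

    def ranking_nll(w, X, prefs, sigma=1.0):
        # prefs is a list of pairs (i, j) meaning "item i is preferred to item j";
        # P(i > j) is modeled as a Gaussian CDF of the score difference
        s = X @ w
        diffs = np.array([s[i] - s[j] for i, j in prefs])
        return -np.sum(norm.logcdf(diffs / sigma))

    # toy data: 20 items, 4 features, preferences generated by a hidden scorer
    X = np.random.randn(20, 4)
    s_true = X @ np.array([1.0, -2.0, 0.5, 0.0])
    prefs = [(i, j) for i in range(20) for j in range(20) if s_true[i] > s_true[j] + 0.5]
    w_hat = minimize(ranking_nll, np.zeros(4), args=(X, prefs)).x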

With this formulation, several earlier algorithms, such as RankNet, ListNet and ListMLE, can be regarded as special cases. I will take a closer look at those ranking algorithms later.

Cumulative Distribution Networks and the Derivative-Sum-Product Algorithm


The paper first proposes a novel probabilistic graphical model, the CDN. The model differs from MRFs and Bayesian belief networks in several respects. The basic structure resembles the factor graph of an MRF: a factor graph has two kinds of vertices, the original variable vertices of the MRF and one vertex per clique. The factorization theorem for MRFs (Hammersley-Clifford) tells us that each clique contributes a nonnegative potential function, i.e. an unnormalized PDF, which is why the clique vertex is often called a function vertex. In a CDN, however, each function vertex corresponds to a CDF-like function (a cumulative function, in the paper's terminology). As long as these cumulative functions satisfy a few simple properties (which are easy to verify, as shown in the paper), their product is a valid CDF.
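
In symbols (my shorthand, with x_{s(c)} the variables attached to function vertex c): the CDN defines the joint CDF as a product of cumulative functions, and the joint density is its mixed derivative, which is what the DSP algorithm below computes by message passing,

    F(x_1, \dots, x_n) = \prod_{c} F_c\big(x_{s(c)}\big),
    \qquad
    P(x_1, \dots, x_n) = \frac{\partial^{\,n} F}{\partial x_1 \cdots \partial x_n}.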

To do inference in a CDN, the authors propose an algorithm that resembles the sum-product algorithm on factor graphs (i.e. belief propagation on polytrees). Since the derivative of the CDF is the PDF, it is natural that their inference procedure is called the Derivative-Sum-Product (DSP) algorithm. I will read the details of the algorithm later.

This framework is applied to a ranking problem.

In the paper's multiplayer ranking example, there are three teams, (X1, X2), (X3, X4) and (X5, X6, X7). The function si(xi) is the skill of player i. The performance of each team is the sum of its players' skills, a function of the xi, which yields R1, R2 and R3. We also observe the rank of the teams, which comes as an ordered graph whose relationships are represented by the functions h. Prediction is carried out with the skill functions learned from previous results. In this model every function vertex is a Gaussian CDF, so DSP can be computed easily. I think the details are worth a closer look.

In the end, several applications are pointed out: webpage ranking, collaborative filtering, etc. The authors also intend to focus on the learning part of the model in later research; the DSP algorithm, like belief propagation, is only applicable to trees.

Several related topics:
Convolutional factor graphs: Convolutional Factor Graphs as Probabilistic Models, UAI 2004.

Clustering by Passing Messages Between Data Points


It's an interesting paper about clustering. The technique employed is affinity propagation, which is believed to be an instance of belief propagation. It has several important features: it identifies exemplars among the samples instead of directly finding a mean/center; when the similarity matrix is sparse, the algorithm is fast, since its cost is linear in the number of similarity values s(i, k); and instead of choosing the number of exemplars explicitly, the user only sets the self-similarity values s(k, k), which influence the number of exemplars (the larger these values, the more exemplars).

The iterative algorithm is very simple. Two matrices are introduced that measure how well sample k serves as the exemplar of sample i, from the point of view of sample i or of sample k: the responsibility r(i, k) and the availability a(i, k). The update rule for the responsibility is
r( i, k ) <- s( i, k ) - max { a( i, j ) + s( i, j ) | j <> k }.
The availabilities are initialized to 0 (and later stay nonpositive). The maximum of availability plus similarity picks out the competing sample that is similar to i while considering itself most available as i's exemplar. The difference between s( i, k ) and this maximum is therefore the evidence, from sample i's point of view, that sample k is its exemplar, based on the observed availabilities and similarities. The update rule for the availability is
a( i, k ) <- min { 0, r( k, k ) + sum { max{ 0, r( j, k ) } | j <> i, k } }.
As you can see, the availability is nonpositive (because of the min). The sum only counts samples with positive responsibility, namely those that still regard sample k as a possible exemplar. The sum together with r( k, k ) thus indicates how available sample k is for sample i: it is negative when r( k, k ) is negative and few samples take sample k as their exemplar. In other words, it is the evidence sample k holds, based on its observation of the other samples.

After affinity propagation converges, there are two ways of determining whether a sample is an exemplar: one chooses the samples with r( k, k ) + a( k, k ) > 0; the other chooses those with r( k, k ) + a( k, k ) > r( k, j ) + a( k, j ) for all j <> k.

One important thing about the algorithm is damping. I haven't explored this part carefully, but I do encounter the technique quite frequently; similar tricks appear in training SRBMs and in backpropagation with momentum. Each update is a convex combination of the newly proposed values and the old ones from the previous iteration, which damps oscillations and helps convergence (why?).
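
A minimal vectorized sketch of the updates as I read them (my own implementation, including the diagonal availability update a(k, k) <- sum of positive responsibilities, which the prose above glosses over, and damping as a convex combination with the factor given by damping):

    import numpy as np

    def affinity_propagation(S, damping=0.5, iters=200):
        # S: similarity matrix; the diagonal S[k, k] is the preference controlling
        # how many exemplars emerge
        n = S.shape[0]
        R = np.zeros((n, n))   # responsibilities r(i, k)
        A = np.zeros((n, n))   # availabilities  a(i, k)
        for _ in range(iters):
            # r(i, k) <- s(i, k) - max_{j != k} [ a(i, j) + s(i, j) ]
            AS = A + S
            idx = np.argmax(AS, axis=1)
            first = AS[np.arange(n), idx]
            AS[np.arange(n), idx] = -np.inf
            second = np.max(AS, axis=1)
            R_new = S - first[:, None]
            R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
            R = damping * R + (1 - damping) * R_new
            # a(i, k) <- min{0, r(k, k) + sum_{j not in {i, k}} max(0, r(j, k))}
            # a(k, k) <- sum_{j != k} max(0, r(j, k))
            Rp = np.maximum(R, 0)
            np.fill_diagonal(Rp, 0)
            col = Rp.sum(axis=0)
            A_new = np.minimum(0, np.diag(R)[None, :] + col[None, :] - Rp)
            np.fill_diagonal(A_new, col)
            A = damping * A + (1 - damping) * A_new
        exemplars = np.where(np.diag(R) + np.diag(A) > 0)[0]
        labels = np.argmax(S[:, exemplars], axis=1) if len(exemplars) else None
        return exemplars, labels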

The Science paper includes three experiments, on facial images, bioinformatics data and airline-travel data, demonstrating the effectiveness of the algorithm. It's really amazing how well it works. How did they discover such a simple but useful technique? The algorithm is derived from (loopy) belief propagation on an elaborately constructed factor graph. I am not familiar with this and have to explore it ASAP.

Here is the important reference site from the PSI group.