Wednesday, February 13, 2008

Permutation Invariant SVMs


First I had a wrong image of this paper, mistaking the permutation for robustness. It turned out the permutation means something other than robustness. One unpolished idea for tackling the problem of permuted features is to design a suitable kernel that tolerates permutations. This paper instead relies on the Kuhn-Munkres (Hungarian) algorithm, which efficiently finds the permutation that minimizes the total matching cost between two feature sets.
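
For instance (a minimal sketch of mine, not taken from the paper), the SciPy implementation of Kuhn-Munkres finds the permutation that minimizes the total matching cost between two point sets:

    # Hypothetical example: match two sets of 70 pixel positions so that the
    # total pairwise distance is minimized (Kuhn-Munkres / Hungarian algorithm).
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    a = np.random.rand(70, 2)    # positions extracted from image A
    b = np.random.rand(70, 2)    # positions extracted from image B

    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    row, col = linear_sum_assignment(cost)   # b[col] is the optimal permutation of b
    print(cost[row, col].sum())              # minimal total matching cost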

With this algorithm, the so-called π-SVM is trained roughly as follows.

The algorithm trains an SVM with conventional techniques and then tries to permute each sample so as to obtain a larger margin, iterating these two steps. I don't see a proof of convergence for this algorithm. The testing procedure also needs the Kuhn-Munkres algorithm: it finds the largest margin under permutations against both the positive and the negative class and chooses the larger one.
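
Here is my rough reconstruction of that alternating scheme (not the authors' code): it assumes a linear SVM and samples whose features come in interchangeable blocks (e.g. a set of pixel positions, as in the application below), and it uses scikit-learn plus the SciPy Hungarian solver.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.svm import LinearSVC

    def best_permutation(x_blocks, w_blocks, y):
        # cost[i, j] = -y * <w block i, x block j>; Kuhn-Munkres minimizes the
        # total cost, i.e. maximizes the signed margin y * <w, pi(x)>.
        cost = -y * (w_blocks @ x_blocks.T)
        _, cols = linear_sum_assignment(cost)
        return x_blocks[cols]                # x blocks reordered to match w's blocks

    def pi_svm_train(X, y, n_blocks, n_iter=10, C=1.0):
        n, dim = X.shape
        b = dim // n_blocks                  # size of each feature block
        svm = LinearSVC(C=C).fit(X, y)       # step 1: conventional SVM training
        for _ in range(n_iter):
            w_blocks = svm.coef_.reshape(n_blocks, b)
            # step 2: re-permute every sample towards a larger margin
            X = np.stack([best_permutation(X[i].reshape(n_blocks, b), w_blocks, y[i]).ravel()
                          for i in range(n)])
            svm = LinearSVC(C=C).fit(X, y)   # retrain on the re-permuted samples
        return svm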

To see why such an SVM is designed, we must know the application setting. For each image they select the positions of 70 randomly chosen white pixels as its features. Hence a feature vector is really a set of positions, which is where the permutation ambiguity comes from. It is interesting, though a little experiment-oriented.
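
Something like the following (hypothetical code of mine, only to make the setting concrete) would produce such a feature set:

    import numpy as np

    def white_pixel_features(img, n=70, seed=0):
        # Take the coordinates of n randomly chosen white pixels as an
        # (unordered) set of positions describing the image.
        rng = np.random.default_rng(seed)
        ys, xs = np.nonzero(img > 0)                 # coordinates of white pixels
        idx = rng.choice(len(xs), size=n, replace=len(xs) < n)
        return np.stack([xs[idx], ys[idx]], axis=1)  # shape (n, 2)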

Tuesday, February 12, 2008

Deterministic Annealing for Semi-supervised Kernel Machines


I am surprised that another similar paper on S3VM appeared at the same conference (cf. the continuation method). The homotopy method does the same thing as continuation: find a simple, easy-to-optimize function to start with and gradually deform it back into the original complex, full-of-local-minima target function. In the continuation setting, convolution with a Gaussian of diminishing variance is employed to mollify the target function.
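
Roughly (my own notation, not a formula taken from either paper), the mollified family looks like

    f_\gamma(w) = \int f(w - t)\, \mathcal{N}(t \mid 0, \gamma^2 I)\, dt, \qquad \gamma_0 > \gamma_1 > \cdots \to 0,

so one starts by minimizing the heavily smoothed f_{\gamma_0} and tracks the minimizer as \gamma shrinks back towards the original objective f.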

Here, the key idea is that the hard optimization over the unknown labels of the unlabelled samples might be solved by annealing, in the spirit of simulated annealing / MCMC. They then reformulate the S3VM target function by introducing, for each unlabelled sample, a probability of its predicted label.
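
Writing from memory rather than copying the paper's notation, the annealed objective is roughly of the form

    \min_{f,\; p \in [0,1]^u} \; \lambda\, \Omega(f)
      + \sum_{i=1}^{l} L\big(y_i f(x_i)\big)
      + \lambda' \sum_{j=1}^{u} \Big[ p_j\, L\big(f(x'_j)\big) + (1 - p_j)\, L\big(-f(x'_j)\big) \Big]
      - T \sum_{j=1}^{u} H(p_j),

where p_j is the probability that the j-th unlabelled point is positive, H(p) = -p\log p - (1-p)\log(1-p) is the binary entropy, and T is the temperature; if I remember correctly, a class-balance constraint on the p_j is added as well.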

Then we have to deal with the optimization at a fixed T (the temperature in the annealing) and a fixed p, which is a convex optimization problem.
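
A rough sketch of the resulting annealing loop, in my own notation and with an assumed hinge loss rather than the paper's exact loss: the convex subproblem is solved by duplicating each unlabelled point with both labels and weighting the two copies by p and 1 - p, and p is then updated in closed form for the fixed model.

    import numpy as np
    from sklearn.svm import SVC

    def solve_weighted_svm(X_lab, y_lab, X_unl, p, C=1.0, C_unl=1.0):
        # Fixed p (and T): a convex, sample-weighted SVM.
        X = np.vstack([X_lab, X_unl, X_unl])
        y = np.concatenate([y_lab, np.ones(len(X_unl)), -np.ones(len(X_unl))])
        w = np.concatenate([np.ones(len(X_lab)), C_unl * p, C_unl * (1 - p)])
        return SVC(kernel='linear', C=C).fit(X, y, sample_weight=w)

    def da_s3vm(X_lab, y_lab, X_unl, T0=10.0, rate=1.5, T_min=1e-2):
        p = np.full(len(X_unl), 0.5)         # start undecided about the unlabelled points
        T = T0
        while T > T_min:
            model = solve_weighted_svm(X_lab, y_lab, X_unl, p)
            f = model.decision_function(X_unl)
            # Closed-form p for the fixed model: the two hinge losses, softened
            # into a Gibbs distribution by the temperature T.
            loss_pos = np.maximum(0.0, 1.0 - f)
            loss_neg = np.maximum(0.0, 1.0 + f)
            p = 1.0 / (1.0 + np.exp(np.clip((loss_pos - loss_neg) / T, -50.0, 50.0)))
            T /= rate                         # cool down and re-solve
        return model, p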

They also analyze the corresponding effective loss function for different values of T: at high temperature the loss is convex, and as the temperature goes down it is deformed into the non-convex one.
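
One way to see this (a small derivation of my own, consistent with the objective above): for a fixed decision function, minimizing over each p_j in closed form leaves an effective per-point loss

    \ell_T(f) = \min_{p \in [0,1]} \Big[ p\, L(f) + (1 - p)\, L(-f) - T\, H(p) \Big]
              = -T \log\!\Big( e^{-L(f)/T} + e^{-L(-f)/T} \Big),

which for large T is close to the convex average (L(f) + L(-f))/2 (up to a constant), and as T \to 0 tends to \min(L(f), L(-f)), the familiar non-convex effective loss of S3VMs.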

Tuesday, February 5, 2008

An Investigation of Computational and Informational Limits in Gaussian Mixture Clustering


I thought it was about the ability of GMMs in density estimation. Somehow it wasn't. I am not familiar with the literature on the clustering ability of GMMs, though I know they are trained with the EM algorithm, which is usually the textbook example of the general EM framework.

They put the whole analysis under very restricted conditions where a theoretical treatment is feasible. They assume each Gaussian has equal mixing probability and that the Gaussians share a single spherical (scalar diagonal) covariance matrix. They list several important results on the sample size needed for the training of a GMM to lead to a correct clustering. There are several factors: the sample size N, the separation s, the number of clusters k, and the dimension d. Here they further assume the covariance matrix is the identity matrix, and hence only s needs to be considered instead of s/σ.
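
Concretely, the restricted model is (in my own notation, with s taken to be the separation between the means)

    p(x) = \frac{1}{k} \sum_{j=1}^{k} \mathcal{N}\big(x \mid \mu_j,\, \sigma^2 I_d\big),
    \qquad s = \min_{i \neq j} \|\mu_i - \mu_j\|,

and with \sigma fixed to 1 only the separation s matters rather than the ratio s/\sigma.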

They mention several tips for training with EM:
  • PCA helps clustering with GMMs. The idea is to project the data into a (k-1)-dimensional subspace, apply EM there to obtain, for each sample, its probability of belonging to each cluster, and then substitute these values into the EM training in the original space. This kind of cascaded EM improves clustering results (a rough sketch follows this list).
  • They use a pruning strategy: start with a larger number of clusters k', on the order of O(k log k), and then merge nearby clusters to obtain the final clustering.
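
A rough sketch of the cascaded-EM tip as I read it (the concrete functions and parameters are my own choices, using scikit-learn): run EM in the (k-1)-dimensional PCA subspace first, then use the resulting per-sample cluster probabilities to initialize EM in the original space, as an approximation of the "substitute the E-step values" idea.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.mixture import GaussianMixture

    def cascaded_em(X, k):
        # EM in the (k-1)-D PCA subspace...
        Z = PCA(n_components=k - 1).fit_transform(X)
        gm_low = GaussianMixture(n_components=k, covariance_type='spherical').fit(Z)
        resp = gm_low.predict_proba(Z)               # per-sample cluster probabilities
        # ...whose responsibilities seed EM back in the original space.
        nk = resp.sum(axis=0)
        means_init = (resp.T @ X) / nk[:, None]
        gm_full = GaussianMixture(n_components=k, covariance_type='spherical',
                                  weights_init=nk / nk.sum(),
                                  means_init=means_init).fit(X)
        return gm_full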

Their main result is that they set up a procedure illustrating three phases of GMM training w.r.t. the sample size N:
  • Given too little data, the MLE does not yield a good clustering (it gives a wrong one), and EM does not find the MLE either, resulting in an unstable model. This is the Random Phase.
  • Given more data, EM still fails to find the MLE, but the MLE does correspond to a good clustering and hence is what we desire. This is the Gap Phase.
  • Given enough data, EM finds the MLE, which is the clustering we desire. This is the Success Phase.

They develop approximations of the sample sizes at which the phase transitions take place. With these they can suggest when the gap phase exists.

The difficulty is how to judge whether we are in the gap phase for a real data set. And their results would have to be extended to more general cases. I suspect that most of the time we are dealing with the random phase or the gap phase :-(