Saturday, October 3, 2009

Maximum Margin Clustering


This paper is interesting in formulating a clustering problem as a convex optimization problem. Its framework is not brand new but the idea is very enlightening for a current project. referring to maximum margin method, everyone knows how SVM is modelled with a quadratic programming problem with linear constraints. In clustering, we are not interested in the parametrized separation boundary but in the partition of samples that maximizes the boundary. Then the problem turns out to be an integer optimization problem, which is often NP-hard.

The key idea is to use the out product, e.g. the kernel matrix of the assignment matrixM = y y^\top in the dual formulation of soft-margin SVM. And another interesting finding is that instead of using non-convex constraint \mathrm{rank}(M) = 1, they use three linear constraints:
  • M encodes equivalent class information, i.e. transitive M_{i, k} \geq M_{i, j} + M_{j, k} - 1, reflexive M_{i,i} = 1 and symmetric M_{i, j} = M_{j, i};
  • M has at most 2 equivalent classes, i.e. M_{j, k} \geq - M_{i, j} - M_{i, k} - 1;
  • M has at least 2 equivalent classes, \sum_i M_{i, j} \leq N - 2.

Then the problem becomes a SDP. It might not be easy to extend the idea to multi-clustering.

Saturday, September 5, 2009

On the Influence of the Kernel on the Consistency of Support Vector Machines


This is kind of math paper. I haven't really delved into some mathematical stuffs for a long time.This paper might be the first to explore the consistency of SVM (that is the asymptotic behavior of the classifier compared with the optimal, Bayes decision error). The main result might be as follows:
For universal kernels we have consistency result for L2 and L1 soft margin classifiers.
The universal kernels are those whose induced RKHS is dense in continuous function space. Gaussian and Laplacian kernels are both universal. The author derived the corresponding consistency.
At first glance I thought they find some tricks in choosing regularization parameters. But I didn't find anything truely usable (e.g. corollary 18).

The editor is Scholkopf... I guess only huge bulls like him understands the key points. For now I just have quite a faint idea.

Wednesday, July 29, 2009

Efficient Euclidean Projection in Linear Time


I am really surprised why this paper got published. The previously scanned paper in ICML 2008 noticed the linear algorithm already. Not sure...

On Sampling-based Approximate Spectral Decomposition


This paper proposed another approximation of kernel methods based on former Nystrom method and column sampling. The prime disadvantage of the earlier method is that they must compute the whole Gram matrix while the proposed adaptive method does not need to. They prove several results (not that interesting though): column sampling is best for rank 1 approximation (no one will use rank 1 approximation I guess...); for a given form, neither is optimal approximation (that is why they are not good enough?); under some cases (looks not useful in practice), Nystrom's recovery is exact.

The improvement of approximation is not salient though, but I guess the efficiency might be improved.

Tuesday, July 28, 2009

Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations


This can be viewed as a deep version of convolutioal neural network (I am not quite familiar with this). In each layer (in a sense), there are two sublayers, one for detection, convolving the input with several filters, the other for max-pooling, shrinking (resize the image) the convolved sublayer. Therefore on a whole the layer's input is an image and the output is several convolved and downsampled images. They layer this kind of nets and get the deep version.

Let's see what we have to get. For each convolving sublayer, we have to train a convolution kernel and the corresponding biases (as in RBM). The change is the max-pooling layer. Each neuron in the max pooling layer only connects to a fixed size (a small patch, e.g. 2x2) of neurons in the convolutional sublayer. Since the neurons in the convolutional sublayer could be 0-1 and the max-pooling means only if none of the input neurons fires the output is 0, from the outside it is like a big neuron which can take multiple values (e.g. 2x2+1 = 5). The we may do as RBM, writing down the energy function, converting it to probability, formulating the likelihood and using CD learning with a sparsity penalty.

The good idea behind the structure is that we first get some useful filters, then parts of the object and later the whole objects. The features learned with the model gives good results on several data sets.

Thursday, July 23, 2009

Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks


This paper tells us about a method to evaluating two ranking reseults based on clicks of users. We know the ranking affects exposures of the links, i.e. the higher rank one item has, the more attension human beings pay to. Therefore it is more reasonable to use a discounted relavance called discounted cumulative gain (DCG)
\mathrm{DCG}_l = \mathrm{rel}_1 + \sum_{i = 2}^l \frac{\mathrm{rel}_i}{\log_2 i}.
That is, the rank 1 item has no discount while rank i item has a discount ratio of 1/\log_2 i.

So if we want to compare two ranking results, we have to calculate the relevance score given by given the query. This is not always known. People might have been asked to score the documents with discrete values (on a scale of 5, e.g.) but not all documents will be marked. We model this scale with a multinomial distribution and simulated the scoring procedure and compare the two DCG values. If for 95% cases, one is higher than the other, we might assert it is better.

The author proposed quite a simple model to model \Pr(X_i \mid q, c, ). It is an ordinal logistic regression model with linear and quadratic features. So when we trained the model, we can simulate the DCG and compare two rankings.

Herding Dynamical Weights to Learn


This paper talks a novel way of learning MRF such as Hopfield networks (is that also a Boltzmann machine). For these models, the primal-dual structure of maximum entrpy learning and maximum likelihood estimator tells the possible learning algorithms, e.g. gradient-based, since though the optimization is constrained in the former form, it doesn't in the dual form.

The gradient-based algorithms require solving the following inference problem
w_{\alpha}^{(t+1)} = w_\alpha^{(t)} + \eta( \bar{g}_\alpha - \mathbb{E}[g_\alpha]_P)
where the expectation is computed in the given model. The common ways of computing the expectation include a Gibbs sampler (MCMC) or mean field approximation (usually not good). For certain models, such as RBM, we might think about Hinton's contrastive divergence, but still we need non-deterministic algorithms.

The herding algorithm the author proposed here is a deterministic algorithm. It takes the limit of the annealing version of the negated log-likelihood function and it results in a tipi function. He then formulated the herding algorithm as first maximizing (due to the limit) and resulting in some pseudo samples and then calculating the gradient based on these pseudo samples.

I am not familiar with the dynamic system and the intrinsic idea behind this stuff though.