Thursday, December 25, 2008

Multi-Instance Multi-Label Learning with Application to Scene Classification


To solve MIML problems, the authors proposed two solutions: one via MIL and the other via MLL.

The first alternative, solving MIML via MIL, is quite simple since we are actually working on several independent MIL problems, so any MIL algorithm could be applied. The paper cites an algorithm called MIBoosting. Though it is written in the paper in a unified form, I think we are actually training an independent classifier for each label; only the weights are adjusted simultaneously in the boosting loop. This builds on earlier work (c.f. [1]) that derived logistic regression and boosting models for MIL. [1] is quite simple: for logistic regression, two flavors are proposed, one predicting the bag probability as the mean of the per-instance probabilities and the other as the mean of the per-instance log odds (note the difference this makes in training); for boosting (boosting an MI learner is otherwise trivial), the difference between ordinary boosting and MIBoosting is that MIBoosting adjusts the weights on bags, and a classifier counts as correct on a bag if it labels more of the bag's instances correctly than incorrectly. A sketch of both ideas follows this paragraph.
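Here is a minimal sketch of those two ideas as I read them (not the authors' code; the function names and the instance-level classifier interface are my own): the "mean" flavor of MI logistic regression averages per-instance probabilities to get a bag probability, and the MIBoosting-style bag check counts a weak classifier as correct on a bag if it gets more instances right than wrong.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_probability(bag, w, b):
    """bag: (n_instances, n_features) array; w, b: instance-level logistic model.
    Bag probability = mean of per-instance probabilities ("mean" flavor of [1])."""
    instance_probs = sigmoid(bag @ w + b)
    return instance_probs.mean()

def bag_is_correct(bag, label, instance_classifier):
    """MIBoosting-style check: the classifier is counted as correct on a bag if it
    labels more instances correctly than incorrectly. instance_classifier is any
    callable returning +1/-1 per instance (a placeholder here)."""
    preds = instance_classifier(bag)
    return np.sum(preds == label) > len(preds) / 2
```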

The second alternative, solving MIML via MLL, takes into account the interactions between labels. We may notice that certain labels are highly correlated, and exploiting this helps build a better classifier. The problem then becomes how to convert an MIL problem into a single-instance one. The standard method is a constructive transformation (something like a Lipschitz embedding, if the term is correct): select several medoids from the data and represent each bag as a fixed-length feature vector whose entries are the distances (here the Hausdorff distance) from the bag to the medoids; a sketch of this transformation is given after this paragraph. Then we can apply an MLL algorithm, such as MLSVM [2], to the problem. In fact, MLL is comparatively harder to evaluate (there might be missing labels or incorrect labels). [2] compares several strategies for training SVMs on MLL problems and several criteria for evaluation. Strategies: MODEL-s labels each sample with only its most significant label; MODEL-i discards samples with multiple labels; MODEL-n adds a new class for each combination of labels (i.e., exactly the same label set implies the same class); MODEL-x uses a multi-labeled sample multiple times, as a positive sample for each of its labels. The criteria for assigning labels: the P-criterion assigns a sample the labels whose SVMs give a positive value; the T-criterion acts differently when all SVMs yield negative values (where the P-criterion leaves the sample unknown) and assigns the label whose SVM gives the least negative score; the C-criterion assigns the labels whose SVM outputs are close enough to the top one. [2] finds that MODEL-x with the C-criterion works better than the other combinations.
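A rough sketch of the bag-to-vector transformation (assumptions: plain Euclidean instance distance and the max-Hausdorff variant; the paper may use a different Hausdorff variant, and the medoid bags are assumed to be selected beforehand): each bag becomes a fixed-length vector of distances to the medoid bags, after which any single-instance MLL learner such as one SVM per label applies.

```python
import numpy as np

def hausdorff(bag_a, bag_b):
    """(Max-)Hausdorff distance between two bags of instances."""
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def bag_to_vector(bag, medoid_bags):
    """Fixed-length representation: distance of this bag to every medoid bag."""
    return np.array([hausdorff(bag, m) for m in medoid_bags])
```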

References:

Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification


It looks like a problem similar to CBIR (content-based image retrieval). This application might also be cast as MIL (multiple-instance learning). Here the basic idea is distance learning. First, several patches are extracted from the images (with SIFT or geometric blur features). The distance between a pair of images is a weighted sum of the distances of their patches. This image-to-image "distance" is actually not a metric (no symmetry is assured). We only compute the distances from a focal image k to images i and j and compare the two to decide which is more likely a match. Then we only need to decide the weights.

Here it exploits the maximum margin concept with a strategy similar to the SVM: constrain each training triplet by an inequality of the form d(i, far) - d(i, near) >= 1 (the learned distance to a dissimilar image must exceed the distance to a similar image by a unit margin) and minimize the norm of the weights; a small sketch is below. The thing I wonder about this strategy is whether this is really a margin -,- I am always confused by it. Maybe it is reasonable but not geometrically interpretable?
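A small sketch of the setup as I read it (the function names are mine, and I assume the per-patch distances to the best-matching patches of the other image are precomputed): the focal image's distance to another image is a weighted sum of its patch distances, and the hinge loss below measures violation of the unit-margin triplet constraint.

```python
import numpy as np

def image_distance(weights_i, patch_dists_ij):
    """weights_i: per-patch weights of focal image i;
    patch_dists_ij: distance from each patch of i to its best match in image j."""
    return float(weights_i @ patch_dists_ij)

def margin_violation(weights_i, patch_dists_to_far, patch_dists_to_near):
    """Hinge-style violation of d(i, far) - d(i, near) >= 1."""
    gap = image_distance(weights_i, patch_dists_to_far) - \
          image_distance(weights_i, patch_dists_to_near)
    return max(0.0, 1.0 - gap)
```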

Wednesday, December 24, 2008

Modeling Image Patches with a Directed Hierarchy of Markov Random Fields


As I have read previously, the RBM is the building block of the DBN. Here a semi-restricted Boltzmann machine (SRBM) is proposed to model image patches. An SRBM allows lateral connections between the visible units.

In this case, the hidden neurons are still conditionally independent given the visible layer, and it is easily seen that the model is still a PoE and can be trained with contrastive divergence. The problematic point is that after we sample the hidden states, it is not as easy as in the RBM to sample the visible states. In an RBM a simple mean-field step is enough, but the SRBM has connections between visible neurons, which means we can't sample each visible unit independently of the others. Here a damped mean-field approximation is used: each update mixes the new mean-field estimate with the previous value, with the initial value set to the training sample; a sketch is below.
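A minimal sketch of that damped mean-field update in my own notation (the damping factor lam and iteration count are assumptions, not values from the paper): because of the lateral weights L the visibles are no longer conditionally independent, so we iterate a mean-field fixed point, mixing each new estimate with the previous one, starting from the data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def damped_mean_field(v_data, h, W, L, b, lam=0.2, n_iters=10):
    """v_data: initial visible probabilities (the training sample);
    h: sampled hidden states; W: (n_visible, n_hidden) weights;
    L: symmetric lateral visible-visible weights with zero diagonal; b: visible biases."""
    mu = v_data.copy()
    for _ in range(n_iters):
        new_mu = sigmoid(b + h @ W.T + mu @ L.T)   # mean-field estimate given h and current mu
        mu = lam * new_mu + (1.0 - lam) * mu       # damped update toward the new estimate
    return mu
```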

With the SRBM we can model patches from natural images easily, as with the RBM. The point of using an SRBM instead of an RBM is that the correlations between pixels are not modeled well by the RBM's hidden units alone. Their experiments show this point.

Regression on Manifold Using Kernel Dimension Reduction


This paper describes a regression problem where the samples come from a manifold embedded in a high-dimensional space and the response is a function defined on the manifold. Three techniques are combined: sufficient dimension reduction (SDR), kernel dimension reduction (KDR) and manifold learning.

The basic idea comes from KDR, in which we seek a subspace B (the central subspace) such that the projections onto it are sufficient for regressing the responses. This idea (in the linear case, namely SDR) requires solving an optimization problem of roughly the form

minimize over B:  Tr[ KYc (KZc + n*eps*I)^(-1) ],

where KZc is the centered Gram matrix of the projected samples Z = B^T X and KYc is the centered Gram matrix of the responses. Here, with manifold learning in the picture, the projected sample B^T x_i feeding the RKHS is approximated by an affine transformation of the coordinates obtained by a manifold learning algorithm (e.g. the Laplacian eigenmap), so the optimization keeps the same trace objective, with Z computed from the manifold-learning coordinates instead of the raw inputs.

After solving the problem, the regression can be done in the learned subspace. The difficult part is solving this nonlinear, nonconvex problem; it is handled with a projected gradient descent method. A rough sketch of evaluating the objective is given below.
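Here is a hedged sketch of evaluating that trace objective (the Gaussian kernel, the centering, and the n*eps regularizer follow the standard KDR formulation, not necessarily the paper's code; the function names are mine): given manifold coordinates tau and a candidate projection B, build the centered Gram matrices and compute Tr[ KYc (KZc + n*eps*I)^(-1) ].

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gaussian kernel Gram matrix of the rows of X."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def center(K):
    """Double-center a Gram matrix: H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kdr_objective(tau, Y, B, eps=1e-3, sigma=1.0):
    """tau: (n, d) manifold-learning coordinates; Y: (n, 1) responses; B: (d, k) projection."""
    n = tau.shape[0]
    Kz = center(rbf_gram(tau @ B, sigma))
    Ky = center(rbf_gram(Y, sigma))
    return np.trace(Ky @ np.linalg.inv(Kz + n * eps * np.eye(n)))
```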

But it seems hard to extend to new data. It is argued that by computing the latent B, it is possible to extend the regressor to new data, but it doesn't say how. I guess it must be much tougher :-(

Tuesday, December 23, 2008

Multiplicative Updates for Nonnegative Quadratic Programming

by Fei Sha, Yuanqing Lin, Lawrence K. Saul and Daniel D. Lee

This paper addresses a method for quadratic programming with nonnegativity constraints. We know there are several basic techniques for dealing with nonnegativity constraints: one is to work with the logarithm of the variables, in which case a gradient step becomes a multiplicative update, usually referred to as EG (exponentiated gradient).

In this paper, the following optimization problem is explored:
minimize F(x) = x^T A x / 2 + b^T x subject to x >= 0.
The proposed update rule is simple: let A = A^+ - A^-, where A^+ and A^- collect the positive elements and the magnitudes of the negative elements of A, and let a_i = (A^+ x)_i, c_i = (A^- x)_i. The update factor is a root of a quadratic equation:
x_i <- x_i * (-b_i + sqrt(b_i^2 + 4 a_i c_i)) / (2 a_i)
The rest is the proof.
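A direct transcription of the update as a sketch (the iteration count and the small epsilon guard in the denominator are my own additions; x0 should be entrywise positive):

```python
import numpy as np

def nqp_multiplicative(A, b, x0, n_iters=500):
    """Minimize F(x) = x^T A x / 2 + b^T x over x >= 0 by the multiplicative update.
    A: symmetric matrix; b: vector; x0: strictly positive starting point."""
    A_pos = np.clip(A, 0, None)     # A+ : positive part of A
    A_neg = np.clip(-A, 0, None)    # A- : magnitudes of the negative part
    x = x0.copy()
    for _ in range(n_iters):
        a = A_pos @ x               # a_i = (A+ x)_i
        c = A_neg @ x               # c_i = (A- x)_i
        x = x * (-b + np.sqrt(b ** 2 + 4 * a * c)) / (2 * a + 1e-12)
        x = np.clip(x, 0, None)     # guard against tiny numerical negatives
    return x
```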

Here I want to mention a couple of easy observations about the algorithm: the gradient of the objective is a_i + b_i - c_i; when it is positive the update factor is less than one, so x_i decreases, and when it is negative the factor is greater than one, so x_i increases, so either way the move is downhill on the objective. The fixed-point analysis shows that the fixed-point condition coincides with the KKT conditions (a quick algebra check is below).
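My own derivation of that fixed-point check, assuming a_i > 0 for a coordinate that stays strictly positive:

```latex
% If x_i > 0 is unchanged by the update, the multiplicative factor must equal 1:
\begin{aligned}
\frac{-b_i + \sqrt{b_i^2 + 4 a_i c_i}}{2 a_i} = 1
 &\;\Longrightarrow\; b_i^2 + 4 a_i c_i = (2 a_i + b_i)^2 \\
 &\;\Longrightarrow\; c_i = a_i + b_i \\
 &\;\Longrightarrow\; (\nabla F(x))_i = a_i + b_i - c_i = 0 ,
\end{aligned}
% which is exactly the KKT stationarity condition for a positive coordinate;
% coordinates held at zero are allowed a nonnegative gradient, also as KKT requires.
```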

There are several applications of this optimization technique, e.g. the dual problem of the SVM. But it seems that specialized algorithms such as SMO will do better than this multiplicative update rule, so maybe we have to check the convergence rate and run some comparison experiments.

Monday, December 22, 2008

Large Margin Hidden Markov Models for Automatic Speech Recognition


This work is the topic of Dr. Fei Sha's Ph.D. dissertation. Two parts of it look familiar, but I still have some points not clearly comprehended.

The first part is about large-margin GMMs for multiway classification. First the separable case (each class with one Gaussian) is considered: the class with the highest probability wins, which can be written as a comparison of quadratic forms; the separable samples can be scaled so that an SVM-like unit-margin constraint must be met; and maximizing the margin turns out to be equivalent to minimizing the trace of the precision matrices of the Gaussians. If we allow samples to fall inside the margin, slack variables can be introduced as usual. Finally, if we add more Gaussians per class, the number of constraints grows quickly, so they propose a stronger inequality instead: taking the softmax (softmin, really) of several terms instead of the minimum yields a single constraint with a log-sum term; a sketch is below. I don't know why the trace relates to the margin -,- From his Ph.D. thesis, it seems the trace can be regarded as a norm for positive semi-definite matrices, but it is used here mainly because it keeps the problem convex (specifically an SDP). Please note that the GMM is a generative model, yet it is trained with maximum margin instead of maximum likelihood.
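A rough sketch of that softmin constraint in my own notation (the per-component matrices Phi and the augmented vector z = [x, 1] follow the quadratic-form trick described above; function names are mine): each competing class must score at least 1 worse than the correct class, where a class's score is -log sum_m exp(-z^T Phi_cm z) over its mixture components.

```python
import numpy as np

def class_score(z, Phis_c):
    """Softmin over the mixture components of one class (lower = better match)."""
    quad = np.array([z @ Phi @ z for Phi in Phis_c])
    return -np.log(np.sum(np.exp(-quad)))

def margin_violations(x, y, Phis):
    """Phis: dict class -> list of (d+1, d+1) PSD matrices.
    Returns hinge losses of score_c(z) - score_y(z) >= 1 for all classes c != y."""
    z = np.append(x, 1.0)
    s_y = class_score(z, Phis[y])
    return {c: max(0.0, 1.0 + s_y - class_score(z, Phis_c))
            for c, Phis_c in Phis.items() if c != y}
```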

The second part is about the HMM, where the GMM is used as the emission probability and we can write down the joint PDF easily. The HMM here does not need EM for training, since in the given task the labels are known; a toy sketch of this point is below. The nice thing is that the training formulation has the same form as before and hence is still a convex optimization problem that can be solved as an SDP.
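To make the "no EM needed" point concrete, here is a toy sketch in my own notation (log_pi, log_T and gmm_log_lik are hypothetical placeholders for the initial, transition and GMM emission models): with the state sequence known, the joint log-probability is just a sum of terms, so there is nothing to marginalize.

```python
def joint_log_prob(x_seq, s_seq, log_pi, log_T, gmm_log_lik):
    """x_seq: observations; s_seq: known state labels;
    gmm_log_lik(x, s): emission log-likelihood of x under the GMM of state s."""
    lp = log_pi[s_seq[0]] + gmm_log_lik(x_seq[0], s_seq[0])
    for t in range(1, len(s_seq)):
        lp += log_T[s_seq[t - 1], s_seq[t]] + gmm_log_lik(x_seq[t], s_seq[t])
    return lp
```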

I guess I will spend some time on his Ph.D. thesis later.

Sunday, December 21, 2008

DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification


I decided to read this paper since LDA has come up several times recently. This LDA does not stand for Linear Discriminant Analysis, which refers to a linear classifier or sometimes the famous Fisher discriminant criterion. The term might originate from the JMLR 2003 paper (see below; I haven't verified this) and refers to Latent Dirichlet Allocation, which was designed for NLP modeling. It's a Bayesian generative model.

As we can see in the figure, each word w depends on its topic z, a multinomially distributed r.v. The prior for z is naturally a Dirichlet-distributed r.v. theta, whose hyperparameter is alpha. The parameter beta is simply a table of word distributions, one per topic. After integrating out the latent r.v.s z and theta, the parameters alpha and beta can be estimated by maximizing the marginal likelihood, i.e. the empirical Bayes approach (with a standard EM-style algorithm). To get the posterior distribution of z, we have to use a variational approximation. A minimal sketch of the generative process is below.
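A minimal sketch of the LDA generative story just described (the vocabulary size, topic count and document length are made up for illustration): draw topic proportions theta from a Dirichlet, then for each word draw a topic z and then a word from beta[z].

```python
import numpy as np

def generate_document(alpha, beta, doc_length, rng=np.random.default_rng(0)):
    """alpha: (K,) Dirichlet hyperparameter; beta: (K, V) per-topic word distributions."""
    theta = rng.dirichlet(alpha)                  # topic proportions for this document
    words = []
    for _ in range(doc_length):
        z = rng.choice(len(alpha), p=theta)       # topic assignment for this word
        w = rng.choice(beta.shape[1], p=beta[z])  # word drawn from topic z
        words.append(w)
    return words
```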

The LDA model could be used in unsupervised learning where the topic provides a dimensionality reduction/semantic hashing function. It could also be used in classification, where z is the corresponding label.

This paper adds side information on top of LDA: each document's class label serves as supervised side information that shapes the topics.

As you can see in the figure, the label y enters the model and inference becomes harder, so it is handled with MCMC methods. The improvement of the model demonstrates how the side information helps build a better model. I'm not sure about the core part, but it looks worth trying...

Several References
David M. Blei, Andrew Y. Ng and Michael I. Jordan: Latent Dirichlet Allocation, JMLR 2003

Thursday, December 18, 2008

Graph Laplacian Regularization for Large-Scale Semidefinite Programming


The first time I read Weinberger's work was when I was working on manifold learning. At that time, his idea of maximum variance unfolding (MVU) was quite enlightening: manifold learning, in his interpretation, should preserve local distances while maximizing the variance of the embedding. In optimization terms, the problem is an SDP (sketched below).
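For reference, the MVU SDP as I remember it (a sketch in my own notation; the optimization is over the Gram matrix K of the embedding):

```latex
\begin{aligned}
\max_{K \succeq 0} \quad & \operatorname{tr}(K) \\
\text{s.t.} \quad & \textstyle\sum_{ij} K_{ij} = 0 \\
 & K_{ii} - 2K_{ij} + K_{jj} = \|x_i - x_j\|^2 \quad \text{for all neighbor pairs } (i, j).
\end{aligned}
% Maximizing the trace maximizes the embedding variance, the centering constraint
% removes the translation freedom, and the equalities preserve local distances.
```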

This paper further develops the MVU idea, since the SDP does not scale well as the number of samples grows. The so-called Laplacian regularization technique has been quite popular since its proposal in manifold regularization (maybe; I am not sure about its first appearance). The idea is that the bottom eigenvectors of the graph Laplacian correspond to smooth eigenfunctions over the graph, so it is reasonable to look for an approximate solution in that subspace. Re-formulating the original problem there gives one of much reduced complexity; a sketch is below.
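A sketch of the approximation step as I understand it (the graph construction details and function names are my assumptions): take the bottom m eigenvectors Q of a kNN-graph Laplacian and restrict the big Gram matrix of the SDP to the form K = Q Y Q^T with a small m x m variable Y, so the SDP scales with m rather than with the number of samples.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from scipy.linalg import eigh

def bottom_eigenvectors(adjacency, m):
    """adjacency: dense symmetric kNN adjacency matrix; returns the (n, m) matrix Q
    of the m bottom eigenvectors of the graph Laplacian (smooth graph eigenfunctions)."""
    L = laplacian(adjacency, normed=False)
    _, vecs = eigh(L)            # eigenvalues in ascending order
    return vecs[:, :m]

def expand_gram(Q, Y_small):
    """Reconstruct the full Gram matrix from the reduced SDP variable Y_small."""
    return Q @ Y_small @ Q.T
```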

The other fine-tuning technique is that the approximation offers a good initial value if we try to solve the nonlinear optimization problem directly, without the SDP. Using a conjugate gradient algorithm, the result can be further improved.

I think I'd better get familiar with SDP ASAP. I am really in urgent need of comprehension of these optimization tools.