Thursday, December 25, 2008

Multi-Instance Multi-Label Learning with Application to Scene Classification


To solve MIML problems, the authors proposed two solutions: one via MIL and the other via MLL.

The first alternative, solving MIML via MIL, is quite simple since we are actually working on several independent MIL problems, so any MIL algorithm could be applied here. This paper uses an algorithm called MIBoosting. Though it is written in the paper in a unified form, I think we are actually training several independent classifiers, one per label; only the weights are adjusted simultaneously in the boosting loop. This work builds on earlier work (cf. [1]) that derived logistic regression and boosting models for MIL. [1] is very simple: for logistic regression, they proposed two flavors, one predicting with the mean over the instances of each bag and the other with the mean of the log odds (please note the training); for boosting (boosting an MI learner is trivial), the difference between the original boosting and MIBoosting is that MIBoosting adjusts the weights on bags, and a weak classifier is accepted if it yields more correct labels than incorrect ones on a bag.

The second alternative, solving MIML via MLL, takes into account the interactions between labels. We often notice that certain labels are highly correlated, and exploiting this helps build a better classifier. So the problem becomes how to convert an MIL problem into a single-instance one. The standard method is something like a Lipschitz embedding (if the term is correct): select several medoids from the data, and formulate each bag as a fixed-length feature vector whose features are the distances (here the Hausdorff distance) between the bag and the medoids. Then we can apply an MLL algorithm, such as MLSVM [2], to the problem. In fact, MLL is comparatively more difficult to evaluate (there might be missing labels or incorrect labels). [2] compares several strategies for training SVMs on MLL problems and several criteria for evaluation. Strategies: MODEL-s labels each data point with its most significant label; MODEL-i doesn't use samples with multiple labels; MODEL-n adds a new class for each multi-label combination (i.e., exactly the same label set implies the same class); MODEL-x uses a multi-labeled sample multiple times, as a positive sample for each of its labels. The criteria for assigning labels: the P-criterion assigns a sample the labels for which the SVM gives a positive value; the T-criterion acts differently when all SVMs yield negative values (the P-criterion leaves it unknown) and assigns the label whose SVM gives the least negative score; the C-criterion assigns all labels whose SVM outputs are close enough. [2] finds that MODEL-x with the C-criterion works better than the other combinations.
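To make the conversion concrete, here is a small sketch of the medoid-plus-Hausdorff recipe described above (my own code; the medoid selection, distance variant and all names are illustrative assumptions, not the paper's exact procedure):

import numpy as np

def hausdorff(bag_a, bag_b):
    """(Maximal) Hausdorff distance between two bags of instances, shapes (n_a, d) and (n_b, d)."""
    # pairwise Euclidean distances between instances of the two bags
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def bags_to_vectors(bags, medoid_bags):
    """Represent each bag as the vector of its Hausdorff distances to a set of medoid bags."""
    return np.array([[hausdorff(b, m) for m in medoid_bags] for b in bags])

# toy usage: 3 bags, 2 medoids picked from them (e.g. by k-medoids on the Hausdorff distances)
rng = np.random.default_rng(0)
bags = [rng.normal(size=(rng.integers(2, 6), 4)) for _ in range(3)]
medoids = [bags[0], bags[2]]
X = bags_to_vectors(bags, medoids)   # shape (3, 2); feed this to any MLL learner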

References:

Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification


It looks like a problem similar to CBIR (content-based image retrieval). This application might also be solved with MIL (multiple-instance learning). Here the basic idea is distance learning. First, several patches are extracted from the images (with SIFT or geometric blur). The distance between a pair of images is a weighted sum of distances between patches. Strictly speaking, this is not a true distance (symmetry is not guaranteed). We only compute the distances from images i and j to a focal image k and compare the two to decide which is more likely a match. Then we only need to determine the weights.

Here it exploits the maximum-margin concept with a strategy similar to SVM: constrain the training triplets by an inequality of the form d(k, i) - d(k, j) >= 1 (the distance to a dissimilar image minus the distance to a similar one) and minimize the norm of the weights. The thing about this strategy is whether this really is a margin -,- I am always confused by it. Maybe it is reasonable but not geometrically interpretable?
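To make the constraint concrete, here is a generic sketch of learning non-negative patch-distance weights with such margin constraints turned into a hinge loss (my own toy projected-subgradient version, not the authors' actual solver; all names and settings are made up):

import numpy as np

def learn_weights(D_sim, D_dis, C=1.0, lr=0.01, epochs=200, seed=0):
    """
    D_sim[t], D_dis[t]: per-patch distance vectors (length = #patches) from a focal
    image to a similar / dissimilar image for triplet t.  We want
        w . D_dis[t] - w . D_sim[t] >= 1   with w >= 0,
    and minimize ||w||^2 / 2 + C * sum of hinge violations (projected subgradient).
    """
    rng = np.random.default_rng(seed)
    w = np.ones(D_sim.shape[1])
    for _ in range(epochs):
        for t in rng.permutation(len(D_sim)):
            margin = w @ D_dis[t] - w @ D_sim[t]
            grad = w.copy()                     # from the ||w||^2 / 2 term
            if margin < 1:                      # hinge is active
                grad += C * (D_sim[t] - D_dis[t])
            w = np.maximum(w - lr * grad, 0.0)  # project back onto w >= 0
    return w

# toy usage with random patch-distance features
rng = np.random.default_rng(1)
D_sim = rng.uniform(0, 1, size=(50, 10))
D_dis = D_sim + rng.uniform(0.2, 1.0, size=(50, 10))   # dissimilar pairs are farther on average
w = learn_weights(D_sim, D_dis)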

Wednesday, December 24, 2008

Modeling Image Patches with a Directed Hierarchy of Markov Random Fields


As I have read previously, the RBM is the building block of the DBM. Here a semi-restricted Boltzmann machine (SRBM) is proposed to model image patches. An SRBM allows lateral connections between the visible units.

In this case, the hidden neurons are still conditionally independent given the visible layer. And it is easy to see that the model is still a product of experts (PoE) and can be trained with contrastive divergence. The problematic point is that after we sample the hidden states, sampling the visible states is not as easy as in the RBM. In an RBM a simple mean-field approximation is enough, but the SRBM has connections between visible neurons, which means we can't sample each state independently of the others. Here a damped mean-field approximation is used: each update mixes the new mean-field value with the value from the previous step (the initial probabilities are the training sample).
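A small sketch of how I understand the damped mean-field update for the visible units (my reconstruction; the parameter names and the damping factor of 0.5 are assumptions, not taken from the paper):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def damped_mean_field_visible(v0, h, W, L, b_vis, damping=0.5, n_steps=10):
    """
    Damped mean-field estimate of p(v | h) in a semi-restricted Boltzmann machine.
    W: hidden-to-visible weights, shape (n_hid, n_vis); L: symmetric lateral
    visible-visible weights with zero diagonal; b_vis: visible biases.
    mu is initialized at the data v0.
    """
    mu = v0.copy()
    for _ in range(n_steps):
        fresh = sigmoid(b_vis + h @ W + mu @ L)       # new mean-field value
        mu = damping * mu + (1.0 - damping) * fresh   # mix with the previous estimate
    return mu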

With an SRBM we can model patches from natural images easily, as with an RBM. The point of using an SRBM instead of an RBM is that the correlations between pixels can't be modeled well by an RBM alone. Their experiments show this point.

Regression on Manifold Using Kernel Dimension Reduction


This paper describes a regression problem: the samples come from a manifold embedded in a high-dimensional space and the response is a function defined on the manifold. Three techniques are combined here: sufficient dimension reduction (SDR), kernel dimension reduction (KDR) and manifold learning.

The basic idea is from KDR, in which we seek a subspace B (the central subspace) such that the projections onto the subspace are sufficient for regressing the responses. This idea (in the linear case, namely SDR) requires solving the following optimization problem,

where K_{Z^c} is the centered Gram matrix of the projected data. Here, with manifold learning, the projected sample Bx_i in the RKHS is approximated by an affine transformation of the coordinates obtained by a manifold learning algorithm (e.g. Laplacian Eigenmaps). Therefore the optimization problem could be formulated as

After solving the problem, the regression can be done in the corresponding subspace. The difficult part is how to solve this nonlinear, nonconvex problem; it is handled with a projected gradient descent method.
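For reference, the KDR objective as I remember it from Fukumizu et al.'s kernel dimension reduction work (my recollection of the general form, not necessarily the exact formulation used in this paper), in LaTeX notation:

\min_{B:\; B^\top B = I} \operatorname{Tr}\!\left[ K_Y^{c} \left( K_Z^{c} + n \varepsilon_n I_n \right)^{-1} \right], \qquad Z = B^\top X,

where K^{c} denotes a centered Gram matrix.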

But it seems hard to extend to new data. It is argued that by computing the latent B, it is possible to extend the regressor to new data, but the paper doesn't say how. I guess it must be much tougher :-(

Tuesday, December 23, 2008

Multiplicative Updates for Nonnegative Quadratic Programming

by Fei Sha, Yuanqing Lin, Lawrence K. Saul and Daniel D. Lee

This paper addresses a method for quadratic programming with non-negativity constraints. We know there are several basic techniques for dealing with non-negativity constraints: one is to work with the logarithm of the variables, so that a gradient step becomes a multiplicative update, which is usually referred to as EG (exponentiated gradient).

In this paper, the following optimization problem,
minimize F(x) = x^T A x / 2 + b^T x subject to x >= 0
is explored. The proposed updating rule is simple: let A = A+ - A-, where A+ and A- are the positive and negative parts of A, and let ai = (A+ x)i, ci = (A- x)i. The updating scale factor resembles a root of a quadratic equation:
xi <- xi * (-bi + sqrt(bi^2 + 4 ai ci)) / (2 ai)
The rest is the proof.
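A minimal sketch of the update in code (my own implementation of the formula above, with a small epsilon added for numerical safety; not the authors' code):

import numpy as np

def nqp_multiplicative(A, b, x0, n_iter=500):
    """Minimize F(x) = x^T A x / 2 + b^T x subject to x >= 0 via the multiplicative update."""
    Ap = np.maximum(A, 0.0)    # A+ : positive part of A
    Am = np.maximum(-A, 0.0)   # A- : negative part, so A = A+ - A-
    x = x0.copy()
    for _ in range(n_iter):
        a = Ap @ x
        c = Am @ x
        x = x * (-b + np.sqrt(b * b + 4.0 * a * c)) / (2.0 * a + 1e-12)
    return x

# toy usage on a random positive-definite problem
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)
b = rng.normal(size=5)
x = nqp_multiplicative(A, b, x0=np.ones(5))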

Here I want to mention several easy facts about the algorithm: the gradient of the objective function is ai + bi - ci; when it is positive the update decreases xi and when it is negative the update increases xi, so each coordinate always moves in the descent direction; and the fixed-point analysis shows that the fixed-point condition coincides with the KKT conditions.

There are several applications of this optimization technique, e.g. the dual problem of SVM. But it seems that specialized algorithms, such as SMO, will do better than this multiplicative update rule. So maybe we have to check the convergence rate and do some experiments for comparison.

Monday, December 22, 2008

Large Margin Hidden Markov Models for Automatic Speech Recognition


This work is the topic of Dr. Fei Sha's Ph.D. dissertation. Two parts of it look familiar, but there are still a few things I have not fully understood.

The first part is about large-margin GMMs for multiway classification. First the separable case (each class with one Gaussian) is considered: the class with the highest probability wins, which can be formulated with a quadratic form; the separable samples can be scaled such that an SVM-like constraint is met; and the maximal margin is equivalent to the minimal trace of the precision matrices of the Gaussians. If we allow samples that come into the margin, slack variables can be introduced as in SVM. Finally, if we use more Gaussians per class, the number of constraints grows quickly. They propose to use a stronger inequality instead: taking the softmax (softmin, really) of several terms instead of the minimum results in an inequality with a log-sum term. I don't know why the trace relates to the margin -,- From his Ph.D. thesis, it seems the trace could be regarded as a norm for positive semi-definite matrices, but it is used here mainly because it keeps the problem a convex optimization (specifically an SDP). Please note that the GMM is a generative model, yet here it is trained with maximum margin instead of maximum likelihood.

The second part is about the HMM. The GMM is used as the emission probability, so we can write down the joint PDF easily. The HMM does not need EM for training since in the given task the labels are known. The good thing is that the training formulation has the same form as the previous one and hence is still a convex optimization problem that can be solved as an SDP.

I guess I will spend some time on his Ph.D. thesis later.

Sunday, December 21, 2008

DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification


I decided to read this paper since LDA has come up several times recently. This LDA does not stand for Linear Discriminant Analysis, which refers to a linear classifier or sometimes the famous Fisher discriminant criterion. The term might originate from the paper published in JMLR 2003 (see below, I haven't verified this) and refers to Latent Dirichlet Allocation, which was designed for NLP modeling. It's a Bayesian generative model.

As we can see in the figure, the word w depends on the topic z, which is a multinomially distributed r.v. The prior for z is naturally a Dirichlet-distributed r.v. theta, whose hyperparameter is alpha. The parameter beta is simply a table of word probabilities for each topic. After we integrate out the latent r.v.s z and theta, the parameters alpha and beta can be estimated by maximizing the marginal likelihood, i.e. the empirical Bayes method (with a standard EM algorithm). To get the posterior distribution of z, we have to use a variational approximation.
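To keep the generative story straight, here is a tiny sketch of the plain LDA generative process for a single document (standard LDA, nothing DiscLDA-specific; names are mine):

import numpy as np

def lda_generate(alpha, beta, doc_length, rng=np.random.default_rng(0)):
    """
    Generate one document from LDA.
    alpha: Dirichlet hyperparameter, shape (K,); beta: topic-word table, shape (K, V),
    each row summing to 1.  Returns the topic assignments z and the words w.
    """
    theta = rng.dirichlet(alpha)                            # per-document topic proportions
    z = rng.choice(len(alpha), size=doc_length, p=theta)    # topic for each word position
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])  # word given its topic
    return z, w

# toy usage: 3 topics over a vocabulary of 8 words
rng = np.random.default_rng(1)
beta = rng.dirichlet(np.ones(8), size=3)
z, w = lda_generate(alpha=np.ones(3) * 0.5, beta=beta, doc_length=20, rng=rng)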

The LDA model could be used in unsupervised learning where the topic provides a dimensionality reduction/semantic hashing function. It could also be used in classification, where z is the corresponding label.

This paper provides additional information for the topics: the class labels serve as supervised side information.

As you might see in the figure, the label y comes into the model and the model becomes tougher, so the resulting inference is solved with MCMC methods. The improvement of the model demonstrates how the side information can help build a better model. I'm not sure about the core part, but it looks worth trying...

Several References
David M. Blei, Andrew Y. Ng and Michael I. Jordan: Latent Dirichlet Allocation, JMLR 2003

Thursday, December 18, 2008

Graph Laplacian Regularization for Large-Scale Semidefinite Programming


The first time I read Weinberger's work was when I was working on manifold learning. At that time, his idea of maximum variance unfolding was quite enlightening. The manifold learning task, in his interpretation, should preserve local distances while maximizing the embedding variance. The formulation of the problem in optimization terms is an SDP.

This paper further develops the MVU idea, since the SDP doesn't scale well as the number of samples grows. The so-called Laplacian regularization technique has been quite popular since its proposal in manifold regularization (maybe; I am not sure about its first appearance). The idea is that the bottom eigenvectors of the graph Laplacian correspond to smooth eigenfunctions over the graph, so it is reasonable to look for an approximation of the original problem in this subspace. By reformulating the original problem we obtain one with reduced complexity.

The other fine-tuning technique is that the approximation offers a good initial value if we try to solve the non-linear optimization problem directly without SDP. Using a CG algorithm, the result can be further improved.

I think I'd better get familiar with SDP ASAP. I am really in urgent need of comprehension of these optimization tools.

Wednesday, September 24, 2008

Spectral Clustering with Inconsistent Advice


This paper is about semi-supervised clustering. The two main algorithms are spectral clustering and 2-correlation clustering problem (2CC). The main technique used here is SDP relaxation, which is very interesting.

The basic idea of SDP relaxation: usually a combinatorial optimization problem results in the minimization of a quadratic form where the variables are 0-1. The relaxation allows the variables to take real values. Then the optimization can be turned into an SDP by exploiting the trace identity tr(X^T S X) = tr(S G), where G = X X^T. With this formulation G substitutes for X; we derive a new optimization in SDP form and later transform its solution back to the original problem, which can be handled efficiently by eigendecomposition.

It is well known that spectral clustering does exactly this kind of relaxation, simplifying an NP-hard problem into an eigenvalue problem. Given consistent advice, we can solve the semi-supervised problem in a subspace that accords with the advice. But for inconsistent advice, that technique is infeasible.

Several solutions are proposed in this paper. One is simple enough: 2CC can resolve the inconsistencies, which re-enables the use of the subspace method with spectral clustering.

The other two ideas exploit SDP relaxation. The 2CC optimization can be relaxed in the same way as spectral clustering, resulting in an SDP. Combining the two SDPs, optimizing over the clustering while putting 2CC in the constraints via an upper bound, we get the SDP formulation of the problem. The imposed constraint, in spectral form, turns out to be a subspace constraint for each block of the advice graph. Compared with the original subspace method it is similar (but it is blockwise and uses the eigenvectors of another matrix). However, this method doesn't yield a flexible model in which the importance of the advice can be varied. By multiplying by a scalar (> 1) and searching in a larger space (spanned by the eigenvectors whose eigenvalues fall within the smallest eigenvalue times that scalar), the relaxation is done.

I think SDP might really be something useful in spectrum-related algorithms. Graph-related algorithms and similar ideas might all benefit from it.

Thursday, August 28, 2008

Using the Nyström Method to Speed Up Kernel Machines


This is an old paper, which I had to scan for my recent research. The Nyström method originates from the numerical solution of integral equations. In kernel methods we have to eigendecompose a large Gram matrix (of size N, the number of samples), which requires O(N^3) operations.

The eigendecomposition yields the eigenvalues and the basis functions evaluated at the sample points, which can then be used to approximate the basis functions at an arbitrary position. This is the so-called Nyström method. What this paper proposes is to randomly select a subset of the samples, from which the elements required by the Nyström method are derived. As written in the paper, only a comparatively small amount of computation is needed.
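A rough sketch of the Nyström recipe as I understand it (my own code and notation, following the standard Williams-Seeger formulas; treat it as an illustration rather than the paper's exact procedure):

import numpy as np

def nystrom(K_full_cols, idx, n_components):
    """
    Approximate the eigendecomposition of an N x N Gram matrix from a subset of m columns.
    K_full_cols: the N x m block of kernel values between all points and the m chosen points;
    idx: indices of the m chosen points, so K_full_cols[idx] is the small m x m Gram matrix W.
    Returns approximate top eigenvalues/eigenvectors of the full matrix
    (assumes the kept eigenvalues of W are positive).
    """
    C = K_full_cols
    W = C[idx]                                   # m x m submatrix
    N, m = C.shape
    evals_w, evecs_w = np.linalg.eigh(W)
    order = np.argsort(evals_w)[::-1][:n_components]   # keep the largest eigenpairs of W
    evals_w, evecs_w = evals_w[order], evecs_w[:, order]
    evals = (N / m) * evals_w                                 # scaled eigenvalues
    evecs = np.sqrt(m / N) * C @ evecs_w / evals_w            # Nystrom-extended eigenvectors
    return evals, evecs

The full Gram matrix can then be approximated as evecs @ np.diag(evals) @ evecs.T without ever forming it explicitly.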

This technique is verified on Gaussian processes in the experiments, yielding a good approximation.

Tuesday, July 1, 2008

Understanding camera trade-offs through a Bayesian analysis of light field projections

In the last few years computational photography has become a hot direction, with unconventional cameras capturing not only a traditional image but also the structure of the scene. Yet this might be the first paper to state a unified framework for such cameras.

The atomic elements that interact with camera sensors are light rays, which can be encoded by the notion of the light field. With this notion, an image captured by a computational camera can be formulated as a linear projection of the light field. Taking sensor noise into account, reconstruction of the light field can be addressed as solving a linear inverse problem in a Bayesian manner.

A number of recent optical designs were examined in this framework for an empirical comparison, including pinhole, lens, wavefront coding, coded aperture, stereo and plenoptic cameras. It was found that a good depth prior, e.g. a mixture of oriented Gaussians, is critical for computational imaging tasks. It was also found that the wavefront coding design is optimal for capturing single-view scenes while a stereo configuration is best for capturing the full light field. Well, it's ironic to note that both configurations are quite "ancient": even older than the notion of the light field itself...

This paper might be helpful for vision people to design new cameras, and for setting up a more analytical study of existing designs. I personally think this could be a cornerstone of the field.

This is an ECCV 2008 paper and can be downloaded from:
http://people.csail.mit.edu/alevin/

Sunday, May 18, 2008

Two-Dimensional Solution Path for Support Vector Regression


In this paper a method for tuning the parameters of SVR is proposed. There are two parameters, ε (the allowable error) and λ (the regularization parameter). Usually they are set according to experimental results.

In a previous paper in NIPS 2005, the path of λ was studied. The idea is that starting from λ = infinity and decreasing its value to 0, we can trace a path for each Lagrange multiplier and the bias, which are proved to be piecewise-linear functions of λ. The paths help us determine the regularization parameter, since the number of samples on the elbow implies a ``degree of freedom'', which is employed for selecting the parameter with a GCV-like criterion, SE/(N - DF), instead of MSE. The reason is still not quite clear to me.

The ICML 06 paper aims at ε. Basically they use the same analysis and the result is similar: the Lagrange multipliers and the bias are piecewise-linear functions of ε. This removes a limitation of the NIPS paper, which requires ε to be known and set a priori; this paper emphasizes that we can choose a proper ε along the path.

Monday, May 12, 2008

Trace Ratio vs. Ratio Trace for Dimensionality Reduction

by Huan Wang, Shuicheng Yan, Dong Xu, Xiaoou Tang, Thomas Huang

This is a CVPR 2007 paper. I read Yan's CVPR 2005 paper, so I am quite familiar with this direction. Basically I agree with their argument for using the trace ratio, esp. for the graph embedding algorithms: it is more sound to directly maximize the ratio of between-class scatter to within-class compactness. The optimization becomes difficult when the trace ratio is the objective instead of the originally used formulations, e.g. the iteratively defined optimization, the ratio-trace formulation and the determinant formulation.

The constraints of their proposed optimization ensure orthogonality of the projection. The algorithm, also an approximation in the principal subspace of St, iteratively updates the projection V such that
V* = arg max tr( V^T (S_p^u - λ_n S_t^u) V )  s.t.  V^T V = I
The iteration increases the objective function monotonically and ensures a global optimum (the main contribution). For other constraints, such as the one arising in kernelization, an eigendecomposition is applied first.
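A sketch of that iteration (my own code; Sp and St stand for the projected between-class and total scatter matrices):

import numpy as np

def trace_ratio(Sp, St, dim, n_iter=50):
    """
    Iteratively maximize tr(V^T Sp V) / tr(V^T St V) subject to V^T V = I.
    Each step solves V = argmax tr(V^T (Sp - lam * St) V), i.e. takes the top
    eigenvectors of Sp - lam * St, then updates lam to the current trace ratio.
    """
    lam = 0.0
    for _ in range(n_iter):
        evals, evecs = np.linalg.eigh(Sp - lam * St)
        V = evecs[:, np.argsort(evals)[::-1][:dim]]   # top-dim eigenvectors
        lam = np.trace(V.T @ Sp @ V) / np.trace(V.T @ St @ V)
    return V, lam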

Hmm... another work as an extension to this (vector -> tensor) says a global optimum might not be obtainable, which I might check later.

Adaptive Online Gradient Descent


In this paper we are looking at a game-like optimization problem. We have to pick an x_t against an unknown objective f_t(x), which might be a loss the adversary imposes upon us. To evaluate the final result, the accumulated loss over t = 1, ..., T is compared with that of the best fixed action x which minimizes the total loss. The difference of the two is called the regret (why not wait until the last f_t is given?):
R_T = sum_{t=1..T} f_t(x_t) - min_x sum_{t=1..T} f_t(x)

There are several known results for this problem:
  • Zinkevich showed that for linear functions the regret grows as O(sqrt(T)) using his online gradient descent algorithm (see the sketch after this list).
  • Hazan et al. showed that under the assumption of strong convexity the regret grows as O(log(T)).
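As promised above, a bare-bones sketch of Zinkevich's online gradient descent baseline (my own toy version with a ball projection and step size 1/sqrt(t+1); this is not the adaptive algorithm of this paper):

import numpy as np

def online_gradient_descent(grad_fns, dim, radius=1.0):
    """
    grad_fns[t](x) returns a (sub)gradient of the adversary's loss f_t at x.
    Plays x_t, then steps with eta_t = 1/sqrt(t+1) and projects onto {||x|| <= radius}.
    Returns the sequence of plays; for convex f_t the regret grows as O(sqrt(T)).
    """
    x = np.zeros(dim)
    plays = []
    for t, grad in enumerate(grad_fns):
        plays.append(x.copy())
        x = x - grad(x) / np.sqrt(t + 1.0)
        norm = np.linalg.norm(x)
        if norm > radius:                 # project back onto the feasible ball
            x *= radius / norm
    return plays

# toy usage: linear losses f_t(x) = <c_t, x> chosen at random
rng = np.random.default_rng(0)
costs = rng.normal(size=(100, 3))
plays = online_gradient_descent([(lambda x, c=c: c) for c in costs], dim=3)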

This paper has several results:
  • With a regularization technique, they propose an adaptive online gradient descent algorithm whose regret is within a factor of 1-6 of the regret obtained when the coefficients of the regularizer are tuned offline for the lowest regret.
  • It can be proved that for linear functions and strongly convex functions the corresponding rates, O(sqrt(T)) and O(log(T)), are likewise achieved.
  • There exists a similar algorithm for general Lp norms, generalized from the L2 case with the help of the Bregman divergence, D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>.

Thursday, May 8, 2008

Bayesian Regression with Input Noise for High Dimensional Data


Usually in regression models no noise is assumed on the input side. This paper deals with the case where that assumption doesn't hold. The basic observation is that when input noise exists, the original model tends to underestimate the parameters (para. 3, sec. 2). I am interested in why this is (no reference or proof is given).
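A quick simulation of my own (not from the paper) illustrates the underestimation: with additive Gaussian noise on the inputs, the ordinary least-squares slope shrinks roughly by the factor var(x) / (var(x) + var(noise)), the classical attenuation effect.

import numpy as np

rng = np.random.default_rng(0)
n, true_slope = 100000, 2.0
x = rng.normal(0.0, 1.0, size=n)            # clean inputs, variance 1
y = true_slope * x + rng.normal(0.0, 0.1, size=n)
x_noisy = x + rng.normal(0.0, 1.0, size=n)  # observed inputs with unit-variance noise

slope_clean = np.polyfit(x, y, 1)[0]        # ~ 2.0
slope_noisy = np.polyfit(x_noisy, y, 1)[0]  # ~ 2.0 * 1 / (1 + 1) = 1.0, underestimated
print(slope_clean, slope_noisy)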

To filter the noise, a probabilistic model that resembles JFA is proposed. I can't help recalling the models in Jordan's unpublished book; now I can understand them better. The training of the model relies on EM, which is commonly used when hidden variables are at hand. The inference part is done by marginalizing all hidden variables and conditioning the output on the input, as in conventional regression. The following figure shows the evolution of their model,

I've decided to review Jordan's book; it's really worth reading. Hopefully I'll derive several famous models for later seminars. Now let's get closer to the model. The hidden state t is assumed to be Gaussian. The observed x and y (I don't think z is necessary at all, so just forget about it) are conditioned on t and another parameter, and they are Gaussians too. The only non-Gaussian hidden variable α is a Gamma-distributed r.v. It is interpreted as the precision of the two parameters, hence the only hyperparameter of the model. Why Gamma? Presumably because the Gamma distribution is the conjugate prior for a Gaussian precision.

About the results... The figure shows that when the test cases are noise-free, the proposed model yields the lowest errors. But when the test cases are also contaminated by noise, all methods perform equally badly. It's a little difficult to judge, since the test cases themselves are inaccurate and we have no ground truth.

Friday, March 28, 2008

Null Space versus Orthogonal Linear Discriminant Analysis


In this paper, several LDA algorithms are addressed. The main result says that OLDA and NLDA yield the same result under certain conditions.

I seldom consider the details of the LDA algorithms, but after I read the journal version about the OLDA, I think it's necessary to get a summary for their current work. I am thinking about the possibility of reusing their techniques in the graph-based algorithms. But for the current paper, it's too far.

One insightful observation is the simultaneous diagonalization of Sw, Sb and St. The optimization is then generalized from the ``inverse'' of St to its pseudo-inverse. Actually, different LDA algorithms impose different constraints: ULDA requires orthogonality w.r.t. St while OLDA simply requires orthogonality w.r.t. the identity matrix. The so-called NLDA maximizes the projected between-class scatter in the null space of Sw.

The condition they find for the equivalence is a rank equality. They test their algorithm on several high-dimensional data sets. I guess it's easy to follow their motivation.

Though lots of graph-based dimensionality reduction algorithms extensively use ideas from LDA, the simultaneous diagonalization is seldom feasible. Maybe it's worth trying to get a more general solution for that rank-deficient generalized eigenvalue problem.

Wednesday, February 13, 2008

Permutation Invariant SVMs


At first I had a wrong image of this paper, mistaking the permutation for robustness. It turned out the permutation means something other than robustness. One unpolished idea to tackle the problem of feature permutation is to design a suitable kernel that tolerates permutations. This paper instead relies on the Kuhn-Munkres algorithm, which efficiently finds the optimal permutation minimizing a matching cost such as the following:

With this algorithm, the so-called π-SVM is trained as follows:

The algorithm trains an SVM with conventional techniques and tries to permute each sample so as to get a larger margin. I don't see a proof of convergence for this algorithm. The testing procedure uses the Kuhn-Munkres algorithm to find the largest margin under permutations w.r.t. both the positive and negative classes and chooses the larger one.
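The matching step itself is easy to try out; here is a toy sketch (mine, with a made-up squared-distance cost rather than the paper's exact matching cost) using the Hungarian/Kuhn-Munkres solver in SciPy:

import numpy as np
from scipy.optimize import linear_sum_assignment

def best_permutation(pos_a, pos_b):
    """Return the permutation of pos_b's rows that best aligns it with pos_a (min total squared distance)."""
    cost = ((pos_a[:, None, :] - pos_b[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    rows, cols = linear_sum_assignment(cost)     # Kuhn-Munkres / Hungarian algorithm
    return cols, cost[rows, cols].sum()

# toy usage: two sets of 2-D "white pixel" positions, one a noisy permutation of the other
rng = np.random.default_rng(0)
a = rng.uniform(0, 28, size=(70, 2))
b = a[rng.permutation(70)] + rng.normal(0, 0.1, size=(70, 2))
perm, total_cost = best_permutation(a, b)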

To see why such an SVM is designed, we must know the application setting. For each image they select 70 randomly chosen pixel positions where the image is white as its features. Hence each feature is a set of positions, over which permutation naturally arises. It is interesting, though a little experiment-oriented.

Tuesday, February 12, 2008

Deterministic Annealing for Semi-supervised Kernel Machines


I am surprised to see another similar paper on S3VM in the same conference (cf. the continuation method). The homotopy method does the same thing as continuation: find a simple, easy-to-optimize function to start with and end up with the original complex, local-minima-ridden target function. In the continuation setting, convolution with a Gaussian of diminishing variance is employed to mollify the target function.

Here, the key idea is the following optimization

might be solved by annealing, as in MCMC methods. Then we reformulate the S3VM target function in this way by introducing a probability distribution over labels of the unlabelled samples.

Then we have to deal with the optimization with a fixed T (the temperature in the annealing) and p, which is a convex optimization problem.

They also analyze the corresponding loss function for different Ts. As we can see, the initial loss functions are convex and as the temperature goes down, they are deformed into non-convex ones.

Tuesday, February 5, 2008

An Investigation of Computational and Informational Limits in Gaussian Mixture Clustering


I thought it was about the GMM's ability in density estimation; somehow it wasn't. I am not familiar with the literature on the clustering ability of GMMs, though I know they are trained with the EM algorithm, which is usually the textbook example of EM.

They put the whole thing under very restricted conditions where theoretical analysis is feasible. They assume each Gaussian has equal mixing probability and that the Gaussians share a single scalar diagonal covariance matrix. They list several important results on the sample size needed for the training of a GMM to lead to correct clustering results. There are several factors: the sample size N, the separation s, the number of clusters k and the dimension d. Here they further assume the covariance matrix is the identity matrix and hence only s needs to be considered instead of s/σ.

They mention several tips for training with EM:
  • PCA helps clustering with GMMs. The algorithm projects the data into a (k-1)-D subspace where EM is applied to train a model that gives, for each sample, its probability of belonging to each cluster; these are substituted for the corresponding values in the EM training in the original space. This kind of cascaded EM improves clustering results.
  • They use a pruning strategy, which starts with a possibly larger k' of order O(k log k); then by merging nearby clusters we get the final clustering result.

Their main result is a procedure that illustrates three phases of GMM training w.r.t. the sample size N:
  • Given too little data, the MLE won't yield a good clustering (it gives a wrong one). EM will not find the MLE and results in a model that is unstable. This is the Random Phase.
  • Given more data, although EM still does not find the MLE, the MLE does correspond to a good clustering and hence is what we desire. This is the Gap Phase.
  • Given enough data, EM finds the MLE, which is the one we desire. This is the Success Phase.

They develop an approximation of the sample sizes at which the phase transitions take place. With these they can suggest when the gap exists.

The difficulty is how to judge the gap phase for a real data set. And we would have to extend their results to more general cases. I suspect that most of the time we are dealing with the random phase and the gap phase :-(

Wednesday, January 30, 2008

Clustering Graphs by Weighted Substructures Mining


The second author, Taku Kudo, has written several useful packages for NLP. I'd love to do something like that, esp. making useful tools for later research.

In this paper, the data are graphs. The features of these graphs are so-called patterns, which can be observed or not in a certain graph (hence binary features). Then the representation of a sample is a high-dimensional binary vector.

There are two things in this paper, feature selection and clustering.

The feature selection is tackled with a weighted substructure mining strategy with the help of a DFS code tree (a data structure I am not familiar with). Their idea is simple: starting from an empty graph, expand it by adding an edge (hence we get a tree-like search structure). We know that if a pattern is contained in another, it appears at least as frequently as the latter, so in the search structure the frequency decreases as the depth increases. The frequent patterns can therefore be found by pruning the search tree.

The clustering problem is dealt with by a binomial mixture model. We know the GMM can be applied to clustering and is trained with EM; likewise, the BMM is trained in the same way. One difference is a regularization technique they use, which I guess resembles the term in NMF.

The experiments are carried out on biological databases, e.g. RNA. After all, it is a somewhat application-oriented paper.

Wednesday, January 23, 2008

Local Fisher Discriminant Analysis for Supervised Dimensionality Reduction


I first learned of this paper from afliboy, and was astonished since I had been scanning ICML without finding it first for my current research project. I am experimenting with it. However, up to now I still don't understand why it is not as good as I expected on my data.

The idea is very simple; everyone with experience in FDC will see it immediately. In FDC (or shall we say FDA, in accordance with the author), we have two ``variance'' matrices, one for between-class scatter (Sb) and the other for within-class scatter (Sw). The reduction projects samples onto the eigenvectors of the generalized eigenvalue problem of (Sb, Sw), which is claimed to minimize the within-class variance and maximize the between-class variance simultaneously, for which I hold the belief that it is only one strategy for selecting a Pareto optimum (of course, DNE uses another). The proposed idea reformulates the scatter as

which allows us to think about the scatter as a pairwise relationship, as on a graph, which I did several months ago without furthering the idea. If I had, I would have been more disappointed by now :-( So locality can be introduced by modifying the weights (those A's).

Here the A's are the affinity (adjacency) matrices. And this algorithm is then easy to kernelize, like FDC.
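For the record, the pairwise form I have in mind (written from memory in LaTeX notation; it should match the LFDA paper up to constants) is

S^{(w)} = \frac{1}{2} \sum_{i,j} A^{(w)}_{ij} (x_i - x_j)(x_i - x_j)^\top, \qquad
S^{(b)} = \frac{1}{2} \sum_{i,j} A^{(b)}_{ij} (x_i - x_j)(x_i - x_j)^\top,

and LFDA simply multiplies these weights by a local affinity so that only nearby pairs contribute.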

With several other papers on graph embedding around, it might be quite difficult to get new results in this area. But anyway, let me try first.

Label Propagation Through Linear Neighborhoods


This is a graph-based semi-supervised learning algorithm. First they use the idea from LLE to determine the weights of the graph (they also use the traditional way of selecting the neighborhood), but the weights have one more constraint: in LLE the weights are those that best reconstruct each sample and sum to 1; here they additionally have to be non-negative, which yields a best convex-combination approximation. Second they use the transductive learning idea (very similar to the ICML 06 paper on ranking graphs). The iterative procedure is paraphrased as label propagation:

To prove the convergence of the procedure, it is only necessary to analyze the eigenvalues of W (they claim the Perron-Frobenius theorem applies here for the non-negative matrix).
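A sketch of the propagation loop as I understand it (the usual F <- alpha * W * F + (1 - alpha) * Y form; the parameter names and the value of alpha are mine, not taken from the paper):

import numpy as np

def label_propagation(W, Y, alpha=0.99, n_iter=1000):
    """
    W: row-stochastic non-negative weight matrix (e.g. the LLE-style convex-combination
    weights restricted to each point's neighborhood); Y: (n, n_classes) matrix with one-hot
    rows for labeled points and zero rows for unlabeled ones.
    Iterates F <- alpha * W @ F + (1 - alpha) * Y and returns the predicted labels.
    """
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (W @ F) + (1.0 - alpha) * Y
    return F.argmax(axis=1)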

The analysis leads to another connection of their LNP with Laplacian-regularized Least Square Regression:

I must verify my intuition today.

To tackle out-of-sample data, which turns transduction into induction, they use an idea similar to the out-of-sample extension of LLE, of course with the additional constraint.

I am not familiar with this transductive result. But I don't think their algorithm will work any better than Laplacian-regularized least-squares regression. OK, let me check it.

Tuesday, January 22, 2008

Inference with the Universum


Universum is a new concept to me. In the settings with universum, we have another data set (universum) besides the training set and testing set. Now let's have a look at the logic of the whole thing.

With the training set, we have finitely many possible separations. With each separation we correspondingly have an equivalence class of functions from the hypothesis space. In the SVM setting, the regularizer for the loss is the margin of the function. Why do we control the margin? Because it is the ``inverse'' of the capacity, which controls the generalization ability, while the capacity is the complexity of the hypothesis space measured over the whole feature space.

However, real problems are different in that the samples come from only a fraction of the whole feature space. We need to measure the complexity of the function on the problem at hand. Usually in Bayesian theory, the prior distribution is considered to encapsulate the information held a priori; however, it is difficult to design such priors.

By designing universum samples carefully so as to incorporate this information (hence universum samples are not in any labeled class, but are relevant yet uninteresting samples), we might figure out a way to measure the complexity. The following optimization is proposed to implement the idea:

The last term is the universum regularizer. The U() function is an epsilon-insensitive loss. The resulting function should have small values on the universum samples and hence is unstable (w.r.t. the sign of its values) under small perturbations (the perturbed function is most likely in the same equivalence class). So the resulting function comes from an equivalence class that has many contradictions on the universum samples, which indicates high complexity on the universum and therefore comparatively low complexity on the samples we are interested in, given the limited total capacity.

Well, this is really interesting: adding ``useless'' samples and deriving a regularizer from them yields better results :-)

Learning Random Walks to Rank Nodes in Graphs


Well, forget about the other Agarwal who published the ICML 06 paper (I mistook them for the same person). Sorry about my carelessness, and thanks to the anonymous reader who kindly pointed it out.

In this paper, they explore the relationship between two strategies for ranking on graphs: one from NetRank, which seeks to minimize a KL divergence, and one proposed in the paper mentioned above, which tackles an SVM-like optimization and is called the Laplacian regularizer. Their main results include (in theorems):
  • The KL divergence bound suggests a bound for the Laplacian regularizer (which indicates a possibly acceptable smoothness of the ranking function).
  • The generalization capability can be bounded (usually in the form R <= R_emp + something),
which I don't fully understand.

I think maybe I might read something about the PageRank and NetRank stuffs when I have time.

Thursday, January 10, 2008

Uncovering Shared Structures in Multiclass Classification


In this paper we have Srebro, one of the authors of the previous papers on MMMF. Now we are looking for a linear projection, a matrix W. To get a good projection, they define a loss function:

which is regularized by the norm of W. Thus we have,

But they argue that the Frobenius norm should be replaced by the trace norm, resulting in the following problem, which is formally the same as MMMF:


After analyzing the dual of the problem above, it is found that the problem can be kernelized as


Then to solve these problems, SDP or PR-CG can be used.

Fast Maximum Margin Matrix Factorization for Collaborative Prediction


This paper was published just a little after the one in NIPS, the latter aiming at introducing MMMF. This paper shows how to compute MMMF faster.

There is something strange about the two. The first one, in NIPS, claims the trace-norm regularizer is good since it provides a globally optimal solution (a convex optimization solved by SDP). The second one, in ICML, claims the following is better for computation:

For this non-convex optimization, they use PR-CG. OK... I am illiterate in optimization techniques.

Maximum-Margin Matrix Factorization


Actually I didn't intend to read this piece. But I am wondering why the trace norm is chosen in another paper I am reading. And here is the homepage of the first author, where you can find related code and slides.

We know there are several matrix factorization problems. One is SVD, which finds the best low-rank approximation of a matrix under the Frobenius norm. Later, in NMF, for a non-negative matrix we hope to find a set of non-negative vectors such that the matrix is best approximated in the cone they span, under the Frobenius norm or a KL-like divergence.

Here we mainly deal with a sparse, partially observed matrix whose observed elements might be binary. Usually a hinge-like loss is applied to each observed element of UV', and then a regularizer is added. The trace norm of X = UV' is adopted because it is the greatest lower bound of (||U||_F^2 + ||V||_F^2)/2 over all factorizations, and regularizing X directly avoids the non-convexity caused by U and V.
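The identity behind that choice, written from memory in LaTeX notation (a standard fact about the trace norm):

\|X\|_{\mathrm{tr}} = \min_{X = U V^\top} \tfrac{1}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right),

so penalizing the trace norm of X = UV' is a convex surrogate for penalizing the (non-convex) factored form.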


The optimization is solved with SDP. I guess I must get something working with the latest optimization techniques, such as SDP, CCCP, etc. To get a better understanding of this work, I will read the Ph.D. thesis of the first author:
Learning with Matrix Factorizations.

A Duality View of Spectral Methods for Dimensionality Reduction


One important trick in this paper is to apply basic optimization theory to the proposed optimization problem. They choose the MVU algorithm proposed by Weinberger. As for the theory, I seldom noticed its importance, although I did try it several times. I guess I must pay attention to its theoretical importance later.

The optimization of MVU is as follows,
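(from memory, and possibly differing in detail from the paper's exact statement, the MVU primal is roughly, in LaTeX notation,

\max_{K \succeq 0} \operatorname{Tr}(K) \quad \text{s.t.} \quad \sum_{i,j} K_{ij} = 0, \qquad K_{ii} - 2 K_{ij} + K_{jj} = \|x_i - x_j\|^2 \;\; \text{for neighbors } (i, j),

where K is the Gram matrix of the embedding)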

then we get the Lagrangian,

And last the dual problem

By analyzing the KKT conditions and weak/strong duality, we gain insight into the relationship between the sparse Laplacian matrix and the dense Gram matrix. This is really an interesting thing. Later the authors analyze ISOMAP, LLE and Laplacian Eigenmaps.

It is worth redoing their analysis to enhance one's own ability.

Tuesday, January 8, 2008

Null Space versus Orthogonal Linear Discriminant Analysis


This article tells me one important thing: LDA, or the Fisher discriminant criterion, is still being studied by lots of people. Last time I read one by Manli Zhu, which proposes a criterion different from the traditional ones when the covariance matrix is rank-deficient.

One of the simplest ideas is to apply a ridge perturbation to the singular within-class covariance matrix, which is usually called regularization. The other is to find an approximation in the principal subspace of Sw. Zhu's idea is rooted in the latter, in which all principal directions are kept for the later generalized eigenvalue problem; Zhu rejects those principal directions that are almost perpendicular to the subspace spanned by Sb.

In this paper, the author mentions several other strategies for dealing with the singularity of Sw, two of which they focus on: Orthogonal LDA (OLDA) and Null Space LDA (NLDA). Their main result is that the two yield the same result under certain conditions. The condition is mild when dealing with high-dimensional data (the number of dimensions is much higher than the number of samples).

I guess I will revisit this article after I check those LDA variants, whose usability I personally doubt. Here is the literature:
  • A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognition 33 (2000), by Chen et al.
  • Solving the small sample size problem of LDA, ICPR 2002, by Huang et al.
  • Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems, JMLR 2005.
  • Penalized discriminant analysis, Annals of Statistics 23 (1995), by Hastie et al.

Semi-supervised Nonlinear Dimensionality Reduction


I guess those working on semi-supervised manifold learning are far from practical applications. In this paper, the so-called SS-LLE and SS-LTSA are so obvious that they are hardly worth mentioning. Just consider the original algorithms: since the optimization yields the required coordinates and now some of them are known, the blockwise formulation immediately shows that the semi-supervised problem is an even easier one to solve.
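Concretely (in my own notation): write the cost as Tr(Y M Y^T) with M the usual LLE/LTSA alignment matrix, partition the points into labeled (l) and unlabeled (u) blocks, and the unknown coordinates come from a plain linear system instead of an eigenproblem:

\min_{Y_u} \operatorname{Tr}\!\left( \begin{bmatrix} Y_l & Y_u \end{bmatrix} \begin{bmatrix} M_{ll} & M_{lu} \\ M_{ul} & M_{uu} \end{bmatrix} \begin{bmatrix} Y_l & Y_u \end{bmatrix}^{\!\top} \right) \;\Longrightarrow\; M_{uu} Y_u^\top = -\, M_{ul} Y_l^\top .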


Something a little more difficult is how to make the traditional ISOMAP semi-supervised. In LLE and LTSA the term to be minimized has the meaning of a reconstruction error in the corresponding setting; however, as a spectral embedding, the optimization in ISOMAP doesn't carry a similar meaning. They use a rather awkward way to incorporate their idea in the case of ISOMAP:

As you can see, A is the Gram matrix obtained after the double-centering step in ISOMAP. With the shifting of the eigenvalues, the desired coordinates can now be obtained in terms of an M similar to the one in LLE and LTSA. But somehow you can sense the awkwardness. Their paper reports a consistent result, noting that SS-ISOMAP is not as good as the other two.

Maybe the latter part is more interesting, analysing the numerical stability of their algorithms. But somehow it is not a mathematical paper and I don't like their style of machine learning.

Statistical Debugging: Simultaneous Identification of Multiple Bugs


Look at the authors' affiliations, CMU, Berkeley, Wisconsin-Madison and Stanford, and consider who Jordan is. I am really curious how statistical learning is applied to such a delicate area as debugging. In my personal experience, debugging is really time-consuming and confidence-consuming :-p

However, I don't think the paper gives me a basic idea of how it works. Perhaps I have to know what they did previously. Here are several papers from its references:
  • Statistical debugging of sampled programs, NIPS 2004, the same authors
  • The first author's home page. I find she did her Ph.D. with Jordan and is now a post-doc at CMU. Astonishing.

Parallel Analysis

This technique is used to decide the dimension for PCA. I know little about its theory and I am searching for some related materials.

The current algorithm I am using in my experiments is very simple: it only requires doing PCA once more. Let X be the data matrix whose column vectors are the samples. Each row of X (i.e., each feature) is randomly permuted before the second PCA is applied. Then compare the eigenvalues (sorted in descending order) of the two. The dimension for PCA is the number of eigenvalues from the second PCA that are smaller than the corresponding eigenvalues from the first.
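A sketch of this procedure in code (my own, directly following the description above):

import numpy as np

def parallel_analysis_dim(X, seed=0):
    """
    X: data matrix with samples as columns (rows are features).
    Compare the PCA eigenvalues of X with those of a copy whose rows are independently
    permuted, and return how many original eigenvalues exceed their permuted counterparts.
    """
    rng = np.random.default_rng(seed)
    ev = np.sort(np.linalg.eigvalsh(np.cov(X)))[::-1]

    Xp = np.array([rng.permutation(row) for row in X])   # permute each feature independently
    ev_perm = np.sort(np.linalg.eigvalsh(np.cov(Xp)))[::-1]

    return int(np.sum(ev > ev_perm))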

According to the experimental results, with the dimension determined by parallel analysis, the original data and the projected data yield very similar results with linear algorithms.