Tuesday, February 27, 2007

A DC-Programming Algorithm for Kernel Selection


There are several points in this paper:
One single kernel might not be desirable: if we know nothing about the data, the kernel has to be selected empirically from a set of known kernels.
Given a set of parameterized kernels, how can we select a convex combination of them? The authors mention a finite representation in Learning convex combinations of continuously parameterized basic kernels (in COLT 2005).

The procedure for the kernel selection is the following two-step algorithm:

The first step is carried out by a convex optimization algorithm and the second is a DC programming problem. The stopping criterion ensures the procedure keeps improving the objective function.
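Here is a minimal sketch of the alternation as I understand it, with kernel ridge regression standing in for the convex Step 1 and a direct search over Gaussian kernel widths standing in for the DC program of Step 2 (both stand-ins are mine, not the paper's):

```python
import numpy as np

def gaussian_kernel(X, sigma):
    # Gram matrix of a Gaussian kernel with width sigma
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def convex_step(K, y, lam):
    # Step 1: with the kernel fixed, kernel ridge regression is convex
    n = len(y)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    obj = lam * alpha @ K @ alpha + np.mean((K @ alpha - y) ** 2)
    return alpha, obj

def select_kernel(X, y, sigmas, lam=0.1, tol=1e-8, max_iter=50):
    sigma, prev = sigmas[0], np.inf
    for _ in range(max_iter):
        _, obj = convex_step(gaussian_kernel(X, sigma), y, lam)
        if prev - obj < tol:          # stop once the objective stops improving
            break
        prev = obj
        # Step 2 stand-in: re-pick the kernel parameter by direct search
        # (the paper instead solves a DC program over the continuous parameter)
        sigma = min(sigmas, key=lambda s: convex_step(gaussian_kernel(X, s), y, lam)[1])
    return sigma
```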

But it is a little strange why Step 2 is a DC programming problem. The authors cite a result from On functions representable as a difference of convex functions by Hartman, which implies:
  • Every twice continuously differentiable function on Omega is DC;
  • Every continuous function on Omega is the limit of a sequence of DC functions that converges uniformly on Omega.
I have to check these things. For the first claim, the standard construction (if I remember it right) is f = g - h with g(x) = f(x) + (rho/2)|x|^2 and h(x) = (rho/2)|x|^2, where rho is chosen large enough that g is convex on the compact set Omega.

Monday, February 26, 2007

A New Approach to Data Driven Clustering


Well, seeing Ghahramani's name, I guess it's no easy job. At first I was curious because their starting point closely resembles that of the diffusion map paper, but soon I saw I was wrong. The first author is a Ph.D. student whose advisor is the second author. I wonder when I can publish a paper like he did in the top conferences. Sigh~ Maybe never.

Like diffusion maps, they set up a Markov random walk on a graph of the samples. They then notice that the similarity between the distributions started at each sample can help clustering, and they propose an algorithm analogous to K-means (the proposed version minimizes KL divergence instead of the L2 norm). The optimization problem can be solved just like K-means, yielding a local minimum. I thought they might stop here, but they didn't.
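A minimal sketch of this KL-divergence K-means, assuming each row of P is the t-step random-walk distribution started at one sample (the function and variable names are mine):

```python
import numpy as np

def kl_kmeans(P, K, n_iter=100, eps=1e-12, seed=0):
    # P: (n, m) array, each row a probability distribution (a row of P^t)
    rng = np.random.default_rng(seed)
    centers = P[rng.choice(len(P), K, replace=False)]
    for _ in range(n_iter):
        # assignment step: KL(p_i || c_k) for every pair; eps guards log(0)
        kl = (P[:, None, :] * (np.log(P[:, None, :] + eps)
                               - np.log(centers[None, :, :] + eps))).sum(-1)
        labels = kl.argmin(axis=1)
        # update step: the average of the member distributions minimizes the
        # within-cluster sum of KL(p_i || c), just as the mean does for L2
        new = np.array([P[labels == k].mean(axis=0) if (labels == k).any()
                        else centers[k] for k in range(K)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```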

Careful readers will notice that the number of clusters K and the random-walk time t have not been determined yet, so the following part deals with these two parameters. Given K, learning t is an easy problem: if the matrix of distributions is to be maximally preserved, a low-rank approximation keeps the leading eigenvalues, and we abandon the rest at the point where the gap between them is maximized.

Then how can we obtain K itself from the data? Well, it's not that difficult: with a similar idea, scan the candidates and take the K at which the gap is maximized.
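A rough sketch of the gap criterion under my reading (I haven't verified the paper's exact formula; note the eigenvalues of P^t are those of P raised to the power t):

```python
import numpy as np

def eigengap(W, t, K):
    # row-normalize the similarity matrix into a transition matrix P
    P = W / W.sum(axis=1, keepdims=True)
    # eigenvalues can be complex for non-symmetric P; compare magnitudes
    lam = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return lam[K - 1] ** t - lam[K] ** t

# given K, pick the t that maximizes eigengap(W, t, K);
# for K itself, scan (K, t) pairs and keep the K whose best gap is largest
```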

But then, is it possible to formulate a hierarchical clustering algorithm with this random walk? Yes: use the gap size as the plausibility of a partition, choose the smallest number of plausible clusters (big gap and few clusters), do the partition, and carry out the algorithm recursively on each part.

A piece of fresh work needs few references.

Sunday, February 25, 2007

A Regularization Framework for Multiple-Instance Learning


God... I see the name again... James T. Kwok 郭天佑. Although he's an associate professor, I have heard he is very active in the field of machine learning, and last November I saw him at the conference held in Nanjing for native experts in this area. As for the first author's name... I had mistaken him for Tak-ming Chan :-(

In this paper, I find more information on multiple-instance learning, esp. a formal formulation of the problem. Thanks to Prof. Zhou's talk for us, I can follow the authors' idea without much trouble. However, as this paper reviews, many SVM-based algorithms for MIL were proposed in the past few years, and I am not very familiar with them, so I listed several papers from the references at the end for future study.

The authors emphasize that they add not only bag regularizers but also instance regularizers to the original SVM objective function. They then prove the representer theorem for this setting so as to formulate the dual form, where it is easily seen that the optimization is not convex: there is a DC-style constraint, so C3P can be applied. Later they examine the regression problem as well.

I've decided to stop scanning for a while (a week or so). I have to peruse SVM theory and the related material; then I will review all these papers and see whether I understand them better.

Here are several references:

Trading Convexity for Scalability


The key message the authors want to show us is that a non-convex objective function need not be forbidden. The paper demonstrates how to solve a non-convex problem, when the objective is a difference of convex functions, by the Concave-Convex Procedure (CCCP, or C3P). The idea, illustrated by a picture in the paper, is to repeatedly linearize the concave part at the current point and minimize the resulting convex upper bound.
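In code the idea looks like this 1-D toy (g and h are my own choices, not the paper's):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# toy DC objective f = g - h, with both g and h convex
g = lambda x: (x - 1.0) ** 2
h = lambda x: abs(x)
dh = lambda x: np.sign(x)        # a subgradient of h

def cccp(x0, n_iter=50, tol=1e-8):
    x = x0
    for _ in range(n_iter):
        s = dh(x)
        # convex surrogate: g minus the linearization of h at the current x
        z = minimize_scalar(lambda v: g(v) - s * v).x
        if abs(z - x) < tol:
            break
        x = z
    return x

print(cccp(-2.0))   # -> 1.5, a minimum of f(x) = (x-1)^2 - |x|
```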

Then the SVM with the Hinge loss is criticized, since the number of SVs increases linearly as the training set grows larger, eventually hindering both training and recognition. Analyzing the objective function with Fermat's lemma, the misclassified samples become SVs because the derivative of the Hinge loss is not zero there, so by the KKT conditions they are SVs.

So the loss function is replaced with the Ramp loss, which is the difference between the Hinge loss and another (shifted, still convex) function.
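Concretely, with H_a(z) = max(0, a - z) and a margin parameter s < 1, the Ramp loss is R_s(z) = H_1(z) - H_s(z); a tiny sketch:

```python
def hinge(z, a=1.0):
    return max(0.0, a - z)

def ramp(z, s=-1.0):
    # flat for z <= s: badly misclassified points stop contributing
    # gradient, and so stop becoming support vectors
    return hinge(z, 1.0) - hinge(z, s)
```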

Then the objective of the SVM can be regarded as a difference of convex functions and solved with C3P.

This technique also applies to the TSVM, because the difficulty of the TSVM is exactly the non-convexity of its objective, so the algorithm carries over nearly untouched. The original penalty on unlabelled samples, the symmetric Hinge loss, is replaced by an extended version built from the Ramp loss, which is again a difference of convex functions.

Here are some related works to be checked:

Saturday, February 24, 2007

Locally Adaptive Classification Piloted by Uncertainty


It is known that regions mixed with samples from different classes require special attention in a classification task. So how should we model these regions? The authors propose a criterion for this property, uncertainty, which is simply the similarity to samples from other classes divided by the total similarity over the whole region (the nearest L neighbors), where the similarity is analogous to that of the Laplacian eigenmap (a heat kernel). Each sample is thus associated with an uncertainty factor. A GMM is adopted to describe the distribution of the uncertainty; the only difference between the original GMM and the one here, denoted RMM, is that the log-likelihood of the latter is weighted by the uncertainty factors.
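My reading of the uncertainty factor, as a sketch (the names and the kernel width are my assumptions):

```python
import numpy as np

def uncertainty_factors(X, y, L=10, sigma=1.0):
    # u_i = (heat-kernel similarity to other-class points among the L
    # nearest neighbors) / (total similarity to those L neighbors)
    # X: (n, d) sample matrix, y: (n,) integer label array
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    u = np.empty(len(y))
    for i in range(len(y)):
        nbrs = np.argsort(d2[i])[1:L + 1]   # L nearest, excluding self
        w = W[i, nbrs]
        u[i] = w[y[nbrs] != y[i]].sum() / w.sum()
    return u
```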

But the model can still be trained with the EM algorithm, as with a GMM. The RMM tells us the probability of each sample's membership in each Gaussian (cluster).

Then for each cluster, it is desired to find a linear transform (a reduction) such that

the sum of the correctly-classified probability, weighted by the probability of membership in the corresponding cluster (c.f. the equation in the paper)

is maximized. This objective function is not necessarily convex, so gradient descent with simulated annealing is applied.

To classify a test sample, first find the probability of its membership in each cluster; then the predicted class is the one that maximizes the correctly-classified probability weighted by the membership probability of the corresponding cluster.

This work is related to NCA (neighborhood components analysis) when the number of Gaussians is 1. And we might read something on LocBoost (Localized boosting, by Meir et al.).

ps: The equations referred to here are copied from the original paper, and thus copyrighted by the authors above.

Thursday, February 22, 2007

An Empirical Comparison of Supervised Learning Algorithms


I think the most important thing about this paper is that it presents a lot of information that I didn't know but really need. Here is some quoted information:
We evaluate the performance of SVMs, neural nets, logistic regression, naive bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps on eleven binary classification problems using a variety of performance metrics: accuracy, F-score, Lift, ROC Area, average precision, precision/recall break-even point, squared error, and cross-entropy... we compare the performance of each algorithm both before and after calibrating its predictions with Platt Scaling and Isotonic Regression.
Here, we need to know more about those algorithms and evaluation methods.
With excellent performance on all eight metrics, calibrated boosted trees were the best learning algorithm overall. Random forests are close second, followed by uncalibrated bagged trees, calibrated SVMs, and uncalibrated neural nets. The models that performed poorest were naive bayes, logistic regression, decision trees, and boosted stumps. Although some methods clearly perform better or worse than other methods on average, there is significant variability across the problems and metrics. Even the best models sometimes perform poorly, and models with poor average performance occasionally perform exceptionally well.
So it is really something surprising :-p
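Of the two calibration methods quoted above, Platt Scaling fits a sigmoid p(y=1|f) = 1 / (1 + exp(A*f + B)) to the classifier's raw scores by maximum likelihood on held-out data; a minimal sketch (omitting Platt's regularized targets):

```python
import numpy as np
from scipy.optimize import minimize

def platt_scale(scores, labels):
    # labels in {0, 1}; returns a function mapping raw scores to probabilities
    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        eps = 1e-12
        return -np.mean(labels * np.log(p + eps)
                        + (1 - labels) * np.log(1 - p + eps))
    A, B = minimize(nll, x0=[-1.0, 0.0]).x
    return lambda f: 1.0 / (1.0 + np.exp(A * f + B))
```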

Nightmare at Test Time: Robust Learning by Feature Deletion


The setting of this paper originates from a real application, spam filtering. If the classifier is trained as usual, it might put too much weight on a single feature, and spam senders can then change their style of writing junk emails, e.g. abandoning that heavily-weighted feature, so the classifier will suffer at test time. Hence an idea from game theory helps in this case. It is a trade-off between accuracy and robustness: allow a few more errors now, and do better if deletion happens.

The method is not complicated: a modification of the original SVM objective function. The misclassification penalty is altered so that the worst possible influence of feature deletion is taken into account. The adjustment is carried out per sample, which might look strange; however, if it were done in a uniform way, the optimization would, surprisingly, be more difficult, because the objective function would then be non-convex. One can still formulate the dual of this optimization problem, but unfortunately the form is not as promising as the original SVM's, which can take advantage of kernel methods.
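A sketch of the worst case as I understand it (names mine): for sample (x_i, y_i) an adversary zeroes up to K features, and its most damaging choice is the features with the largest positive contribution y_i * w_j * x_ij to the margin.

```python
import numpy as np

def worst_case_margin(w, X, y, K):
    # margin after an adversary zeroes the K most helpful features per sample
    contrib = y[:, None] * (X * w[None, :])            # y_i * w_j * x_ij
    drop = np.sort(np.maximum(contrib, 0.0), axis=1)[:, -K:].sum(axis=1)
    return y * (X @ w) - drop

def robust_hinge(w, X, y, K):
    # the per-sample adjustment is this inner maximization inside the hinge
    return np.maximum(0.0, 1.0 - worst_case_margin(w, X, y, K)).sum()
```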

A comparison between this method and feature selection is made. If we apply the same technique to formulate feature extraction, in either the per-sample style or the uniform way, we are confronted with a non-convex optimization problem, too tough to be solved in a snap.

The important thing about this paper, for me, is that I have to become familiar with the derivation of optimization problems and learn which kinds of problems are tractable and how they are solved. This reminds me to read my books on optimization. Now I have two of them, and hope I can finish them this semester.

Wednesday, February 14, 2007

A Continuation Method for Semi-Supervised SVMs


A fantastic thing: the second author is now a teacher in our department, a pretty young lady :-) Yes, last time I saw her we talked about semi-supervised learning. So it's not strange that she knew little about Belkin's work in this area; their starting points differ.

The basic idea of the proposed S3VM algorithm originates from Joachims' seminal work on the transductive SVM, in which a brute-force method is adopted to search for a separating plane through a low-density area. Chapelle finds an alternative by adding a regularizer to the original objective function. This regularizer penalizes unlabelled samples that lie near the separating plane, simply reusing the loss function of the original objective. The objective function now looks like
J(w, b) = |w|^2 / 2 + C * sumover labelled i loss( yi * (w' xi + b) ) + C* * sumover unlabelled j loss( |w' xj + b| )
But this optimization tends to put all the unlabelled samples into one class, so a constraint is added to force the separating plane to yield a balanced cut:
b = ( sumover labelled i yi ) / n, assuming the unlabelled samples are centered
Still, the optimization is difficult due to its non-convexity. Hence a global optimization technique is applied, whose name is continuation.

In continuation, a sequence of mollifiers (usu. indexed by a parameter, e.g. the pdf of a normal distribution with variance parameter sigma^2) is convolved with the objective function. The first function of the resulting sequence is the easiest to optimize and yields an initial value for the next one; the last of the sequence is the original objective function. As can be seen, convolving with a Gaussian mollifies the multimodal objective function.
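A toy sketch of continuation on a 1-D multimodal function (everything here is my own stand-in for the S3VM objective): smooth heavily, optimize, shrink the smoothing, and warm-start each stage from the last.

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: np.sin(3 * x) + 0.1 * x ** 2    # multimodal stand-in objective

def smoothed(x, sigma, n=500, seed=0):
    # Monte Carlo estimate of the convolution E_z[f(x + sigma * z)], z ~ N(0,1)
    z = np.random.default_rng(seed).standard_normal(n)
    return f(x + sigma * z).mean()

x = 4.0                                       # a deliberately bad start
for sigma in [3.0, 1.0, 0.3, 0.0]:            # anneal the smoothing away
    res = minimize(lambda v, s=sigma: smoothed(v[0], s) if s > 0 else f(v[0]),
                   x0=[x], method='Nelder-Mead')
    x = res.x[0]
print(x)   # ends near the global minimizer of f
```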

There are several related works on this topic:
Semi-Supervised Classification by Low Density Separation, by Chapelle and Zien.
Transductive Inference for Text Classification Using Support Vector Machines, by Joachims, ICML 1999.
Transductive Learning via Spectral Graph Partitioning, by Joachims, ICML 2003.
And about continuation:
The Effective Energy Transformation Scheme as a Special Continuation Approach to Global Optimization with Application to Molecular Conformation.

Collaborative Prediction Using Ensembles of Maximum Margin Matrix Factorizations


There are two key techniques in this article. One is Maximum Margin Matrix Factorization (MMMF), for which I should consult Fast Maximum Margin Matrix Factorization for Collaborative Prediction from ICML 2005; the other is the ensemble method used for the final predictions.

The setting of MMMF is similar to that of NMF. There is a huge, sparse matrix Y representing the ratings from the users, each column of which stands for an item and each row for a user. Since it's a rating matrix, each observed element is a discrete value from 1 to R. We'd like to find matrices U, V and threshold parameters theta_ir such that the following objective is minimized:
J(U, V, theta) = lambda * ( |U|_F^2 + |V|_F^2 ) / 2 + sumover r sumover observed ij h( T_ij^r * ( theta_ir - U_i V_j' ) )
where T_ij^r is 1 if r >= Y_ij and -1 if r < Y_ij, and h is a smoothed Hinge loss, which is differentiable, so the optimization can be carried out with gradient-based methods. There is another version of MMMF based on SDP (c.f. Maximum Margin Matrix Factorization, NIPS 2005). The MMMF stated here can only be solved to a local minimum and is susceptible to the initialization.
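A direct transcription of this objective into numpy, as I read it (the smoothed Hinge form and the shapes are my assumptions: Y uses 0 for unobserved entries, and theta has R-1 thresholds per user):

```python
import numpy as np

def smooth_hinge(z):
    # differentiable Hinge: linear for z <= 0, quadratic on (0, 1), zero after
    return np.where(z <= 0, 0.5 - z, np.where(z < 1, 0.5 * (1 - z) ** 2, 0.0))

def mmmf_objective(U, V, theta, Y, lam):
    I, J = np.nonzero(Y)                      # observed entries only
    Xhat = U @ V.T
    obj = 0.5 * lam * ((U ** 2).sum() + (V ** 2).sum())
    R = theta.shape[1] + 1
    for r in range(1, R):                     # thresholds r = 1 .. R-1
        T = np.where(r >= Y[I, J], 1.0, -1.0)
        obj += smooth_hinge(T * (theta[I, r - 1] - Xhat[I, J])).sum()
    return obj
```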

As for the ensemble part, there is nothing very new: simply several voting methods, with the ensemble built from different random initializations or from bagging.

Notice that the objective function contains nothing for non-rated movies. Here is the setting of the weak test of generalization: hold out one column, i.e. the ratings of a single movie; compute the MMMF of the remaining matrix, so that from the computed U, V and theta we can predict the ratings of the held-out movie. The strong test of generalization differs: a subset of users is held out first and MMMF is done with the remaining users' ratings; then, given a held-out user's ratings of all movies except one, predict the unknown rating. This is done by fixing V and tuning U.

The regularization parameter lambda is chosen with a validation set: the validation error rate suggests a proper lambda for later use.

Tuesday, February 13, 2007

The Relationship Between Precision-Recall and ROC Curves


At first I was ignorant about these two curves, though I had heard of them in all kinds of papers. Now I have the chance to explore what they have in common and where they differ.

There are four terms in two-class problems: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). Positive/negative is the label output by the classifier; the modifier true/false says whether that output is correct. A classifier won't necessarily be good even if everything it declares positive is indeed positive, because it may still mistake half of the positives for negatives, and a single miss is sometimes unacceptable. So one such ratio alone is not sufficient to tell how good a classifier is.

There are three ratios in common use. The recall, or true positive rate (TPR), is TP over (TP + FN); the higher (closer to 1) it is, the fewer positives it misses (i.e. need to be "recalled"). Precision is TP over (TP + FP), the fraction of samples declared positive that truly are; the higher it is, the fewer negatives it mistakes. The false positive rate (FPR) is FP over (FP + TN).
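For my own reference, the three ratios straight from a confusion matrix:

```python
def rates(tp, fp, tn, fn):
    recall = tp / (tp + fn)       # TPR: how few positives are missed
    precision = tp / (tp + fp)    # how few declared positives are wrong
    fpr = fp / (fp + tn)          # false positive rate
    return recall, precision, fpr
```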

The so-called ROC (Receiver Operating Characteristic) curve is the plot of TPR against FPR, while the PR (Precision-Recall) curve plots precision against recall. They suit different settings: it is said that on a highly skewed sample set the PR curve is preferred, since the large number of negatives blunts the FPR.

Usually we construct a ROC curve from several confusion matrices obtained in experiments. Each confusion matrix gives a point in the ROC diagram, and the desired curve is the convex hull of these points. A corresponding PR curve exists, but constructing it the same way (by linear interpolation) leads to an overly optimistic view.

The main theoretical result of this paper is a handful of theorems, all for a fixed sample set: if the recall is not zero, a PR point and its confusion matrix determine each other; a ROC point and its confusion matrix determine each other; and one classifier's ROC curve dominates another's iff its PR curve dominates the other's as well.

The authors provide an interpolation trick: interpolate linearly in ROC space, where it is legal, and convert the result into the PR diagram. In a word, a ROC curve can be transformed into an achievable PR curve, whose points come from this correct, non-linear interpolation.
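My reading of the trick in code: between two confusion matrices, interpolate TP and FP linearly (which is what linear ROC interpolation amounts to) and convert each intermediate matrix to a PR point; the resulting PR segment is curved, not a straight line.

```python
def pr_between(tp_a, fp_a, tp_b, fp_b, n_pos, steps=10):
    # achievable PR points between two confusion matrices;
    # n_pos = total number of positive samples
    pts = []
    for k in range(steps + 1):
        tp = tp_a + (tp_b - tp_a) * k / steps
        fp = fp_a + (fp_b - fp_a) * k / steps
        pts.append((tp / n_pos, tp / (tp + fp)))   # (recall, precision)
    return pts
```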

ROC and its analogues are useful for model selection, for they provide a means of comparison. This should be explored later; a given reference is A Support Vector Method for Multivariate Performance Measures, ICML 2005.

Monday, February 12, 2007

Learning User Preferences for Sets of Objects


In the setting of this article, we only have positive samples. The difficulty of so-called preference learning lies in the portfolio effect: putting the best items together won't necessarily produce the best set. The authors propose a language, DD-PREF, that takes this effect into consideration.

The language is a tuple of four factors, P = (q, d, w, alpha). q is called depth, namely the degree to which the user prefers particular values of each feature; it can be estimated with kernel density estimators (KDEs) on the positive samples. The total depth of a given set s is the weighted sum of the average feature depths:
Vdep(s|P) = sumover f wf * ( sumover x in s qf(xf) ) / |s|
And there is another factor called diversity: d is used to specify the desired diversity, and in real computation it is simply the average diversity over the positive sets. The total diversity is:
Vdiv(s|P) = sumover f wf * ( 1 - ( df - divf(s) )^2 )
And the final preference is ranked according to the weighted (by alpha) sum of Vdep and Vdiv.
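Putting the two formulas together as a sketch (the shapes are my assumptions: q[f] is a per-feature depth function such as a fitted KDE, div(S, f) is the observed diversity of feature f, and the direction of the alpha weighting is my guess):

```python
import numpy as np

def v_dep(S, w, q):
    # weighted sum over features of the average depth of the set's members
    return sum(w[f] * np.mean([q[f](x[f]) for x in S]) for f in range(len(w)))

def v_div(S, w, d, div):
    # penalize the squared deviation from the desired diversity d[f]
    return sum(w[f] * (1.0 - (d[f] - div(S, f)) ** 2) for f in range(len(w)))

def preference(S, w, q, d, div, alpha):
    # final score: weighted (by alpha) combination of depth and diversity
    return alpha * v_dep(S, w, q) + (1 - alpha) * v_div(S, w, d, div)
```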

The learning procedure then estimates q, d and w, while alpha is set according to the application. Finding w requires an optimization technique called BFGS.

Here we have to learn more about the following techniques:
KDE: two references, one is Duda's Pattern Classification, the other is Representing Data Distributions with Kernel Density Estimates.
BFGS: I think it is covered in Numerical Optimization.

My scanning of ICML 2006 starts~

Right now, I find blogger.com suitable for taking notes on a specific topic. It differs from those blog providers that offer the alternative of creating categories. Although I am not sure which is better, I enjoy them both for now~

The first paper list I have to scan is the International Conference on Machine Learning 2006 (ICML 2006). The papers are free to download here. And now I will make use of this blog to keep my reading notes.

And perhaps the next paper list comes from NIPS 2006 :-)