Friday, July 20, 2007

Simpler Knowledge-based Support Vector Machines


This paper proposes another version of knowledge-based SVM. The former versions can be regarded as adding constraints to the original SVM. To keep the optimization solvable as a quadratic program, the polyhedral constraints might introduce non-separability into otherwise separable cases.

The authors of this paper introduce a transformation of the classifier f. The transformation enforces constraints in several settings, e.g. fixing the labels of several samples in binary classification, setting upper and lower bounds in regression, imposing monotonicity, excluding some labels in multi-class classification, or requiring the function to be even or odd. It seems a more general method than the former versions. I don't quite catch the analysis part of this paper; their result seemingly relies upon the transformation adopted. Sometimes they are lucky enough to obtain a convex optimization problem, while otherwise they have to resort to C3P and other analogous techniques for the non-convex cases.

Basically, to further understand this paper, I have to explore the following:
  • G. Fung, O. Mangasarian and J. Shavlik. Knowledge-based support vector machine classifiers. NIPS 2002.
  • O. L. Mangasarian, J. W. Shavlik and E. W. Wild. Knowledge-based kernel approximation. JMLR 2004.
  • Rademacher complexity

Saturday, July 7, 2007

Tangent Distance Kernels for Support Vector Machines

by Bernard Haasdonk and Daniel Keysers

This is a CVPR 2002 paper.

This paper does not explore the relationship between kernels and tangent distance (TD), which is difficult. Instead, the authors propose two ways of combining them:
  • Run an SVM to find the support vectors, then compute the TD between SVs for classification.
  • Choose an arbitrary RBF kernel, such as the Gaussian, and replace the Euclidean distance with TD (see the sketch below).
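To make the second idea concrete, here is a minimal sketch, assuming a tangent_distance(x, y) function is available (the name is mine, not from the paper):

```python
import numpy as np

def gaussian_td_kernel(x, y, tangent_distance, gamma=1.0):
    # Gaussian RBF kernel with the Euclidean distance swapped out for
    # tangent distance. Note: the result is not guaranteed to be a
    # positive definite kernel, since the TD space is not a Hilbert space.
    return np.exp(-gamma * tangent_distance(x, y) ** 2)
```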

It would be desirable to find a kernel representing the TD space; but since the TD space is not a Hilbert space, this is theoretically infeasible. Is there any other way out?

Friday, May 25, 2007

I have to catch up with the papers!

Now ICML 2007 has released its online proceedings. Each time there are about 150 papers waiting to be read, and I still haven't finished ICML 2006; perhaps 70+ remain.
So I have to speed up to catch up with the papers!

Wednesday, May 16, 2007

Optimal Dimensionality of Metric Space for Classification

by Wei Zhang, Xiangyang Xue, Zichen Sun, Yue-Fei Guo and Hong Lu

This is an ICML 2007 paper. After reading it, I found nothing really astonishing; however, it gives another direction of research. The idea is based on one of the papers I read long ago, LPP. The technique of LPP builds on the Laplacian eigenmap by finding a linear transform that minimizes a similar objective function. However, both are unsupervised learning techniques.

In this paper, however, when the Laplacian matrix is constructed, supervised information is incorporated by setting a positive value for samples of the same class, zero for samples that are too far away, and a negative value for samples of different classes.

Then the so-called optimal projection uses the eigenvectors corresponding to the negative eigenvalues. Though the idea is simple and direct, it works, and the experiments and analysis are both detailed and convincing.
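To pin the idea down for myself, here is a rough sketch; in particular, eigendecomposing X^T L X is my own assumption about where the negative eigenvalues come from, so it should be checked against the paper:

```python
import numpy as np

def supervised_projection(X, y, k=10, pos=1.0, neg=-1.0):
    # Signed, supervised affinity matrix: positive weight for same-class
    # neighbors, negative for different-class neighbors, and 0 for pairs
    # that are too far away (not among the k nearest neighbors).
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:k + 1]:       # k nearest, skip self
            W[i, j] = W[j, i] = pos if y[i] == y[j] else neg
    L = np.diag(W.sum(axis=1)) - W                 # signed graph Laplacian
    evals, evecs = np.linalg.eigh(X.T @ L @ X)     # assumed objective matrix
    return evecs[:, evals < 0]                     # negative-eigenvalue directions
```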

Tuesday, May 15, 2007

Ranking on Graph Data


The ranking problem asks for a function which gives us an ordering of samples. This requirement appears in search engines, which should give the user a sequence of search results in the user's preferred order. I don't have prior experience with ranking problems, so I was just curious about this article. After scanning it, it turns out to be highly related to SVM theory.

There is a ranking risk matrix that describes the risk of a ranking error.

So the empirical ranking risk can be formulated as a functional of the ranking function, which is what we are looking for. On a graph, the function degenerates to a vector. As in SVM, a regularizer is chosen to smooth the function by penalizing the divergence of values on nearby vertices; the Laplacian regularizer is a commonly used one (just think of the Laplacian eigenmap and the value it optimizes).
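In symbols (my own shorthand, not copied from the paper): writing the ranking function on the graph as a vector f, the regularizer is

\[ \Omega(f) = f^\top L f = \frac{1}{2} \sum_{i,j} w_{ij} \, (f_i - f_j)^2, \qquad L = D - W, \]

which is exactly the quantity the Laplacian eigenmap minimizes.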

Directed graphs can be treated likewise. The paper later suggests a connection to RKHS theory and an application in bioinformatics; indeed, multiple alignment requires a ranking result.

Since the relaxed loss function in this paper is still the Hinge loss, the idea of Trading Convexity for Scalability should be applicable to it.

Graph Model Selection using Maximum Likelihood


This might be the first paper I have read on random graphs. Although I have learnt a bit of graph theory so far, the theoretical preparation still impedes a quick understanding. There are several commonly cited facts in random graph theory:
  • Power-law degree distribution: in some graphs, the degree distribution follows a power law, P(k) ∝ k^(-γ), with a heavier tail than an exponential/geometric distribution.
  • Small world: the average shortest path length between two vertices is not as long as we might imagine.

This paper examines how well several random-graph generating strategies can describe a real-world network (graph). In order to compare them, the authors suggest comparing their likelihood values. Each generating strategy has several parameters that control the procedure, and the parameters are estimated via MLE. Finally, by comparing the likelihood values, the authors state the suitability of each model.

The four strategies exemplified in the paper are:
  • PA model: the graph is grown from a single vertex; new vertices are added by connecting to existing ones, preferentially to those of large degree (many neighbors).
  • PRG model: first generate the in-degree/out-degree of each node, then match them up.
  • SW model: the nodes are placed on a grid that already has edges connecting neighbors; more edges are added according to the Manhattan distance of every pair of vertices.
  • ER model: each potential edge exists independently with probability p.

It can be seen that the MLE of the ER model can be obtained easily. For the others it is somewhat difficult, and MCMC is adopted to overcome this. I decided to examine the technical details later.
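Still, the easy case is worth writing down. My own one-line derivation for the ER model: for an undirected graph with n vertices and m edges, the likelihood is p^m (1 - p)^(N - m) over the N potential edges, so

\[ \hat{p} = \arg\max_p \; p^{m} (1 - p)^{N - m} = \frac{m}{N}, \qquad N = \binom{n}{2}. \]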

A possible problem of this paper is that it doesn't take into account the complexity of the generating strategies. Naturally, the more complex a model is, the better it fits real-world data; but a better fit won't necessarily yield a better explanation, since a more complicated procedure might fit better than any of the four models. So the conclusion of this paper is dubious.

Wednesday, March 7, 2007

Some thoughts on Parzen windows and semi-supervised learning

Here is an idea that originates from the following references:
Pattern Classification, by R. Duda et al.: It presents the basic form of the Parzen window, and I think it is quite natural to use other distributions instead of a uniform distribution on a hypercube (see the sketch after the list below). So I wondered where I could find more information on this topic. PageRank then suggested several useful links, and mnauce gave several more, with comments.
  • Emanuel Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33(3):1065-1076, 1962. <available from JSTOR>
  • B. Silverman. Density estimation for statistics and data analysis. Chapman and Hall, London, 1986. <books.google>
  • Asymptotic Statistics, by A. W. van der Vaart <in library>
  • A framework for probability density estimation, by John Shawe-Taylor<PASCAL>
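For my own reference, a minimal Parzen window estimator with a Gaussian window in place of the uniform hypercube (a sketch of the textbook idea, not code from any of the references above):

```python
import numpy as np

def parzen_density(x, samples, h=0.5):
    # Density estimate at point x: the average of Gaussian bumps of
    # bandwidth h centered at the training samples.
    samples = np.asarray(samples, dtype=float)
    n, d = samples.shape
    u = (samples - x) / h                                   # shape (n, d)
    k = np.exp(-0.5 * (u ** 2).sum(axis=1)) / (2 * np.pi) ** (d / 2)
    return k.mean() / h ** d
```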

Learning with Kernels, by Bernhard Schölkopf and Alex Smola: There is a simple mean classifier in it, which indicates that the kernel framework can convert many algorithms into superior ones. It then points out that this classifier can be taken as a difference of Parzen windows. That made me think about SVM: the solution of an SVM tells us which samples are necessary to establish a discriminant classifier and what their weights are.

Manifold Parzen Windows, by Vincent and Bengio (in NIPS 2003): They suggest a method to incorporate manifold-learning ideas into density estimation. The final version is not far from what I had thought about the problem: local PCA, just as MFA does. I suddenly realized that the neighborhood selection method used by Zhenyue Zhang is also feasible here. They did extend this version in their later work.

Non-local Manifold Parzen Windows, by Vincent and Bengio (in NIPS 2005): However, here they still stayed within manifold learning; they haven't progressed as fast as I imagined. Though I am still confused about the value of this work (I am silly), I guess there could be something here if we put the problem into the semi-supervised learning paradigm.

Saturday, March 3, 2007

Efficient Co-Regularized Least Squares Regression


In semi-supervised learning, there is an important technique called co-learning. The basic idea is to train two independent classifiers and let them interact through their training errors so that they eventually agree on the unlabelled examples. Two key phrases: independent classifiers, and agreement on unlabelled data.

Now let's focus on what the authors show us. As is known, the SVM framework already gives a regularization technique for supervised learning. So, in order to incorporate unlabelled data, an "agreement regularizer" is added.
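Schematically (my notation, not necessarily the paper's), with one function f_v per view, n labelled and m unlabelled samples:

\[ \min_{f_1, \dots, f_M} \; \sum_{v=1}^{M} \left( \sum_{i=1}^{n} \big( y_i - f_v(x_i) \big)^2 + \nu \, \|f_v\|^2 \right) + \mu \sum_{v < w} \sum_{j=n+1}^{n+m} \big( f_v(x_j) - f_w(x_j) \big)^2, \]

where the last term is the agreement regularizer over the unlabelled data.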

Then, by the representer theorem, solving the optimization requires O(M^3 (m+n)^3) time, where M is the number of different views, m is the number of unlabelled samples and n the number of labelled samples.

However, m will usually be huge due to the easy accessibility of unlabelled data, and we will suffer with a cubic algorithm. An alternative simply uses a subspace of the original RKHS, namely the one spanned by the labelled data only. The restricted optimization can then be solved in O(M^3 n^3 + M^2 m).

Another result given in this paper is a distributed algorithm for the optimization problems mentioned above. The algorithm, called block coordinate descent, is analogous to steepest gradient descent, except that it only moves along the block of coordinates associated with the messages sent by the classifier at another site.


The paper also lists several references on co-learning for my future study, plus another survey.

Tuesday, February 27, 2007

A DC-Programming Algorithm for Kernel Selection


There are several points in this paper:
A single kernel might not be desirable: if we know nothing about the data, the kernel can only be selected empirically from a set of known kernels.
Given a set of parameterized kernels, how is it possible to select a convex combination of them? The authors mention a finite representation from Learning convex combinations of continuously parameterized basic kernels (COLT 2005).

The kernel selection procedure consists of two steps, iterated until a stopping criterion is met.

The first step is carried out with a convex optimization algorithm, and the second is a DC programming problem. The stopping criterion ensures the procedure keeps increasing the objective function.

But it is a little strange why Step 2 is a DC programming problem. The authors cite a result from On functions representable as a difference of convex functions by Hartman, which implies:
  • every twice continuously differentiable function on Omega is DC;
  • every continuous function on Omega is the limit of a sequence of DC functions that converges uniformly on Omega.
I have to check these things.

Monday, February 26, 2007

A New Approach to Data Driven Clustering


Well, on seeing Ghahramani, I guessed it would be no easy job @@ At first I was curious why their starting point resembles that of the diffusion map paper so much, but soon I knew I was wrong. The first author is a Ph.D. student whose advisor is the second author. I wonder when I will be able to publish in the top conferences like he did. Sigh~ Maybe never.

Like the diffusion map, they show us a Markov random walk model on a graph of samples. They then notice that the similarity of the distributions starting at each sample can help clustering, and they propose a clustering algorithm analogous to K-means (the proposed version minimizes KL divergence instead of the L2 norm). The optimization problem can be solved like K-means, yielding a local minimum. I thought they might stop here, but they didn't.

Careful readers will notice that the number of clusters K and the time t of the random walk are not determined yet, so the following part deals with these two parameters. Given K, learning t is an easy problem: if the distribution matrix is to be maximally preserved, a low-rank approximation takes the leading eigenvalues, and we abandon the rest where the gap between them is maximized.

Then how can we obtain K itself from the data? Well, it's not that difficult: with the same idea, choose the K for which the corresponding gap is maximized.

But then, is it possible to formulate a hierarchical clustering algorithm with this random walk? Yes: use the gap size as the plausibility of a partition, choose the smallest number of plausible clusters (big gap and few clusters), do the partition, and carry out the algorithm recursively.

A piece of fresh work needs few references.

Sunday, February 25, 2007

A Regularization Framework for Multiple-Instance Learning


God... I see the name again... James T. Kwok 郭天佑. Although he is an associate professor, I have heard that he is very active in the field of machine learning; last November I saw him at the conference held in Nanjing for native experts in this area. As for the first author's name... I had mistaken him for Tak-ming Chan :-(

In this paper I found more information on multiple-instance learning (MIL), esp. a formal formulation of the problem. Thanks to Prof. Zhou's talk, I could follow the authors' ideas without much trouble. However, as introduced in this paper, many SVM-based MIL algorithms were proposed in the past few years, which I am not very familiar with, so I listed several references from the paper at the end for future study.

The authors emphasize that they add not only bag regularizers but also instance regularizers to the original SVM objective function. They then prove a representer theorem for this case so as to formulate the dual form, in which it is easily seen that the optimization is not convex: there is a DC-style constraint, so C3P can be applied. Later they examine the regression problem as well.

I have decided to stop scanning for a while (a week or so). I have to peruse SVM theory and the related material. I guess I will then review all these papers and see whether I can understand them better.

Here are several references:

Trading Convexity for Scalability


The main message the authors want to show us is that a non-convex objective function need not be forbidding. The paper demonstrates how to solve the non-convex case where the objective is a difference of convex functions, via ConCave-Convex Programming (CCCP, C3P): at each step, the concave part is linearized at the current iterate and the resulting convex surrogate is minimized.
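A generic sketch of C3P, just to pin the idea down (the names and the inner solver are mine):

```python
import numpy as np
from scipy.optimize import minimize

def cccp(u, grad_v, x0, tol=1e-6, max_iter=50):
    # Minimize u(x) - v(x), with u and v both convex: repeatedly replace
    # v by its tangent at the current iterate and solve the convex
    # surrogate  min_x u(x) - grad_v(x_t) . x  until the iterates settle.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_v(x)
        x_new = minimize(lambda z: u(z) - g @ z, x).x
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```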

Then SVM with the Hinge loss is criticized: the number of SVs increases linearly as the training set grows larger, eventually hindering both training and recognition. Analyzing the objective function via Fermat's lemma and the KKT conditions, misclassified samples become SVs because the derivative of the Hinge loss is nonzero there.

So the loss function is replaced with the Ramp loss, which is the difference of the Hinge loss and another convex function.
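Concretely, if I recall the paper right: with H_s(z) = max(0, s - z), the Ramp loss is

\[ R_s(z) = H_1(z) - H_s(z), \qquad s < 1, \]

which agrees with the Hinge loss for z > s but stays flat (constant 1 - s) for z < s, so badly misclassified points stop becoming SVs.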

The objective of the SVM can then be regarded as a difference of convex functions and solved with C3P.

This technique also applies to TSVM, because the difficulty of TSVM is exactly the non-convexity of its objective, so the algorithm can be applied almost untouched. The original penalty for unlabelled samples, the symmetric Hinge loss, is now replaced by an extended version: a difference of Ramp losses.

Here are some related work to be checked:

Saturday, February 24, 2007

Locally Adaptive Classification Piloted by Uncertainty


It is known that regions mixed with samples from different classes require special attention in a classification task. So how should we model these regions? The authors propose a criterion for this property, uncertainty, which is simply the similarity to samples from other classes divided by the total similarity within the region (the nearest L neighbors), where the similarity is analogous to that of the Laplacian eigenmap (heat kernel). Each sample is thus associated with an uncertainty factor. A GMM is adopted to describe the distribution of the data; the only difference between the original GMM and the variant here, denoted RMM, is that the log-likelihood of the latter is weighted by the uncertainty factors.
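In code, the criterion reads roughly as follows (my own sketch; the neighborhood size L and the heat-kernel width t are the knobs):

```python
import numpy as np

def uncertainty_factors(X, y, L=10, t=1.0):
    # For each sample: similarity mass of its other-class neighbors
    # divided by the total similarity mass of its L nearest neighbors,
    # with heat-kernel similarity exp(-||xi - xj||^2 / t).
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / t)
    u = np.empty(len(X))
    for i in range(len(X)):
        nbrs = np.argsort(d2[i])[1:L + 1]       # L nearest, excluding self
        u[i] = S[i, nbrs[y[nbrs] != y[i]]].sum() / S[i, nbrs].sum()
    return u
```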

The model can still be trained with the EM algorithm, as with a GMM. The RMM tells us the probability of each sample's membership in each Gaussian (cluster).

Then, for each cluster, it is desired to find a linear transform (a reduction) such that the sum of the correctly-classified probabilities, weighted by the probability of membership in the corresponding cluster, is maximized. This objective function is not necessarily convex, so gradient descent with simulated annealing is applied.

To classify a test sample, first find the probability of its membership in each cluster. The predicted class is then the one that maximizes the correctly-classified probability weighted by the membership probabilities of the corresponding clusters.

This work reduces to NCA (neighborhood component analysis) when the number of Gaussians is 1. We might also read something on LocBoost (Localized boosting, by R. Meir et al.).


Thursday, February 22, 2007

An Empirical Comparison of Supervised Learning Algorithms


I think the most important thing about this paper is that it contains lots of information that I don't know but indeed need. Here is some quoted information:
We evaluate the performance of SVMs, neural nets, logistic regression, naive bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps on eleven binary classification problems using a variety of performance metrics: accuracy, F-score, Lift, ROC Area, average precision, precision/recall break-even point, squared error, and cross-entropy... we compare the performance of each algorithm both before and after calibrating its predictions with Platt Scaling and Isotonic Regression.
Here, we need to know more about those algorithms and evaluation methods.
With excellent performance on all eight metrics, calibrated boosted trees were the best learning algorithm overall. Random forests are close second, followed by uncalibrated bagged trees, calibrated SVMs, and uncalibrated neural nets. The models that performed poorest were naive bayes, logistic regression, decision trees, and boosted stumps. Although some methods clearly perform better or worse than other methods on average, there is significant variability across the problems and metrics. Even the best models sometimes perform poorly, and models with poor average performance occasionally perform exceptionally well.
So it is really sth surprising :-p

Nightmare at Test Time: Robust Learning by Feature Deletion


The setting of this paper originates from a real application: spam filtering. If the classifier is trained as usual, it might put too much weight on a single feature, and spam senders can then change their style of writing junk emails, e.g. abandoning that heavily weighted feature, so the classifier will suffer at test time. Hence an idea from game theory helps in this case. But it is only a trade-off between precision and robustness: allow a few more errors now, and do better if deletion happens.

The method is not complicated: a modification of the objective function of the original SVM. The misclassification penalty is altered so that the maximal possible influence of feature deletion is taken into consideration. The adjustment is carried out per sample, which might look strange; however, if it were carried out uniformly, the optimization would, to everyone's surprise, be more difficult, because the objective function would then be non-convex. We can still formulate the dual of this optimization problem, but unfortunately its form is not as promising as the original SVM's, which can take advantage of kernel methods.
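My reconstruction from memory of the shape of the modified objective, with at most K features deletable per sample (so take it with a grain of salt):

\[ \min_w \; \frac{1}{2} \|w\|^2 + C \sum_i \left[ 1 - y_i w^\top x_i + \max_{|S| \le K} \sum_{j \in S} y_i w_j x_{ij} \right]_+ , \]

i.e. each sample's margin is measured after an adversary deletes the K features that hurt that sample most.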

A comparison between this method and feature selection is made. If we apply the same technique to formulate feature selection, in either the per-sample or the uniform way, we are confronted with a non-convex optimization problem too tough to be solved in a snap.

The important lesson of this paper is that I have to become familiar with the derivation of optimization problems and learn which kinds of problems are tractable and what their solutions look like. This reminds me to read my books on optimization; I have two of them now and hope to finish them this semester.

Wednesday, February 14, 2007

A Continuation Method for Semi-Supervised SVMs


A fantastic thing: the second author is now a teacher in our department, a pretty young lady :-) Yes, last time I saw her we talked about semi-supervised learning, so it's no surprise that she knew little about Belkin's work in this area; their starting points differ.

The basic idea of the proposed S3VM algorithm originates from Joachims' seminal work on transductive SVM, in which a brute-force method is adopted to search for a separating plane through a low-density area. Chapelle instead adds a regularizer to the original objective function; this regularizer penalizes unlabelled samples that lie near the separating plane, simply reusing the loss function of the original objective. The objective function now looks like:
quadratic form of w + penalty for misclassification + penalty for unlabelled samples near the plane
But this optimization tends to put all unlabelled samples into one class, so a constraint is added to force the separating plane to yield a balanced cut:

\[ b = \frac{1}{n} \sum_{i \in \text{labelled}} y_i \]
Still, the optimization is difficult due to its non-convexity; hence a global optimization technique called continuation is applied.

In continuation, a series of mollifiers (usually adjusted by a parameter, e.g. the pdf of a normal distribution with variance parameter sigma^2) are convolved with the objective function. The first function in the resulting sequence is the easiest to optimize and yields an initial value for the next one; the last of the sequence is the original objective function. As can be seen, convolving with a Gaussian mollifies the multimodal objective function.
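In symbols (my shorthand): the smoothed objective is the convolution

\[ f_\sigma(w) = (f * g_\sigma)(w) = \int f(w - t) \, g_\sigma(t) \, dt, \qquad g_\sigma = \mathcal{N}(0, \sigma^2 I), \]

optimized for a decreasing sequence sigma_1 > sigma_2 > ... -> 0, each solution warm-starting the next.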

There is some related work on this topic:
  • Semi-supervised classification by low density separation, by Chapelle and Zien
  • Transductive inference for text classification using support vector machines, by Joachims, ICML 1999
  • Transductive learning via spectral graph partitioning, by Joachims, ICML 2003
And about continuation:
  • The effective energy transformation scheme as a special continuation approach to global optimization with application to molecular conformation.

Collaborative Prediction Using Ensembles of Maximum Margin Matrix Factorizations


There are two key techniques in this article. One is Maximum Margin Matrix Factorization (MMMF), for which I should consult Fast maximum margin matrix factorization for collaborative prediction (ICML 2005); the other is the ensemble method used for the later predictions.

The setting of MMMF is similar to that of NMF. There is a huge, sparse matrix Y of ratings from users, each column standing for an item and each row for a user. Since it is a rating matrix, the elements take discrete values from 0 to R. We'd like to find matrices U, V and discretization thresholds theta_ir such that the following objective is minimized:
\[ J(U, V, \theta) = \frac{\lambda}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right) + \sum_{r} \sum_{ij} h\!\left( T_{ij}^{r} \left( \theta_{ir} - U_i V_j^\top \right) \right) \]
where T_ij^r is +1 if r >= Y_ij and -1 if r < Y_ij, and h is a (smoothed, differentiable) Hinge loss, so the optimization can be carried out with gradient-based methods. There is another version of MMMF based on SDP (c.f. Maximum margin matrix factorization, NIPS 2005). The MMMF stated here can only be solved to a local minimum and is susceptible to the choice of initial values.
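As I understand it (worth double-checking against the fast MMMF paper), the differentiable variant is the quadratically smoothed hinge:

```python
def smooth_hinge(z):
    # Agrees with the hinge's slope for z <= 0 and with zero for z >= 1,
    # joined by a quadratic piece, so the loss is C^1 everywhere.
    if z >= 1.0:
        return 0.0
    if z <= 0.0:
        return 0.5 - z
    return 0.5 * (1.0 - z) ** 2
```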

As for the ensemble part, there are no very new techniques: simply several voting methods, and ensembles built from different initial values or by bagging.

Notice that the objective function contains nothing for non-rated movies. Here is the setting of the weak test of generalization: hold out one column, i.e. the ratings of a single movie; compute the MMMF of the remaining matrix, so that from the computed U, V and theta we can predict the ratings of the held-out movie. The strong test of generalization differs: a subset of users is held out first, and MMMF is run on the remaining users' ratings; then a held-out user's ratings of all movies except one are given, and the unknown rating is asked for. This is done by fixing V and tuning U.

The regularization parameter lambda is chosen with a validation set: the error rate suggests a proper lambda for later use.

Tuesday, February 13, 2007

The Relationship Between Precision-Recall and ROC Curves


At first I was ignorant about these two curves, though I had heard of them in all kinds of papers. Now I have the chance to explore what they have in common and where they differ.

There are four terms in two-class problems: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). Positive and negative refer to the output of the classifier; the modifier true/false says whether that output is correct. A classifier won't necessarily be a good one even if everything it declares positive is indeed positive, because it may still mistake half of the positives for negatives, and a single miss is sometimes unacceptable. So one such ratio alone is not sufficient to tell how good a classifier is.

There are three commonly used ratios. The recall, or true positive rate (TPR), is TP / (TP + FN): the higher (-> 1) it is, the fewer positives it misses (i.e. need to be recalled). The precision is TP / (TP + FP), the fraction of samples declared positive that are truly positive: the higher it is, the fewer negatives it mistakes. The false positive rate (FPR) is FP / (FP + TN).
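In code, the three ratios from a single confusion matrix:

```python
def rates(tp, fp, tn, fn):
    recall = tp / (tp + fn)       # TPR: fraction of actual positives caught
    precision = tp / (tp + fp)    # fraction of declared positives that are correct
    fpr = fp / (fp + tn)          # fraction of negatives wrongly declared positive
    return recall, precision, fpr
```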

The so-called ROC (Receiver Operating Characteristic) curve plots TPR against FPR, while the PR (Precision-Recall) curve plots precision against recall. They suit different settings: it is said that on a highly skewed sample set the PR curve is preferred, since a large number of negatives blunts the response of the FPR.

Usually we construct a ROC curve from several confusion matrices obtained in experiments: each confusion matrix gives a point in the ROC diagram, and taking the convex hull of these points yields the desired curve. Although each point also has a counterpart in PR space, constructing the PR curve in the same way leads to an overly-optimistic view.

The main theoretical result of this paper is a set of theorems, all for a fixed sample set: if the recall is not zero, PR points and confusion matrices mutually determine each other; ROC points and confusion matrices mutually determine each other; and the ROC curve of one classifier dominates another's iff its PR curve also dominates the other's.

The authors provide an interpolation trick: use the ROC curve with linear interpolation and convert it into the PR diagram. In a word, a ROC convex hull can be transformed into an achievable PR curve built from the same underlying points.

ROC and its analogues are useful for model selection, for they provide a means of comparison. This should be explored later; a given reference is A support vector method for multivariate performance measures, ICML 2005.

Monday, February 12, 2007

Learning User Preferences for Sets of Objects


In the setting of this article, we only have positive samples. The difficulty of so-called set-preference learning lies in the portfolio effect: putting the best items together won't necessarily produce the best set. The authors propose a language, DD-PREF, which takes this effect into consideration.

The language is a tuple of four factors, P = (q, d, w, alpha). q is called depth, namely the degree to which the user prefers particular values of each feature; it can be estimated with kernel density estimators (KDEs) on the positive samples. The total depth of a given set s is the weighted sum of the average feature depths:
\[ V_{dep}(s \mid P) = \sum_f w_f \cdot \frac{1}{|s|} \sum_{x \in s} q_f(x_f) \]
The other factor is diversity: d specifies the desired diversity, and in real computation it is simply the average diversity over the positive sets. The total diversity is:
\[ V_{div}(s \mid P) = \sum_f w_f \left( 1 - \left( d_f - div_f(s) \right)^2 \right) \]
The final preference ranking is given by the alpha-weighted sum of V_dep and V_div.

The learning procedure then estimates q, d and w, while alpha is set according to the application. Finding w requires an optimization technique called BFGS.

Here we have to learn more about the following techniques:
  • KDE — two references: Duda's Pattern Classification, and Representing data distributions with kernel density estimates.
  • BFGS — I think it is covered in Numerical Optimization.

Scanning procedure of ICML 2006 starts~

Right now, I find blogger.com suitable for taking notes on a specific topic. It differs from those blog providers that offer the alternative of creating categories. Although I am not sure which is better, I enjoy them both for now~

The first paper list I have to scan is from the International Conference on Machine Learning 2006 (ICML 2006). The papers are free to download here. I will make use of this blog to keep my reading notes.

And perhaps the next paper list comes from NIPS 2006 :-)