Wednesday, January 30, 2008

Clustering Graphs by Weighted Substructures Mining


The second author, Taku Kudo, has written several useful packages for NLP. I'd love to do something like that, especially building useful tools for later research.

In this paper, the data are graphs. The features of these graphs are so-called patterns, each of which may or may not occur in a given graph (hence binary features). A sample is therefore represented as a high-dimensional binary vector.
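Just to make the representation concrete for myself, here is a tiny Python illustration; I simplify both graphs and patterns to sets of labeled edges, which is my own toy encoding, not the paper's.

    # Toy illustration: each graph becomes a binary vector indexed by the patterns.
    patterns = [frozenset({"A-B"}), frozenset({"A-B", "B-C"}), frozenset({"C-D"})]
    graph = frozenset({"A-B", "B-C"})
    features = [1 if p <= graph else 0 for p in patterns]  # subset test = "pattern occurs"
    print(features)  # [1, 1, 0]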

There are two main components in this paper: feature selection and clustering.

The feature selection is tackled with a weighted substructure mining strategy built on a DFS code tree (a data structure I am not familiar with). The idea is simple: start from an empty graph and expand it by adding one edge at a time (hence we get a tree-like search structure). Since a pattern contained in another pattern appears at least as frequently as the larger one, the frequency is non-increasing as we go deeper in the search tree. The frequent patterns can therefore be found by pruning the search tree.
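To fix the idea for myself, here is a toy Python sketch of the pruning. I again reduce ``pattern occurs in graph'' to edge-set inclusion; the real algorithm uses DFS codes to handle genuine subgraph isomorphism and to avoid enumerating duplicates, so this is only the anti-monotone skeleton.

    # Toy sketch of frequency-pruned pattern growth (not the actual DFS-code search).
    def support(pattern, graph_db):
        return sum(1 for g in graph_db if pattern <= g)

    def mine_frequent(graph_db, min_support):
        all_edges = set().union(*graph_db)
        frequent, seen = [], set()
        stack = [frozenset()]                      # start from the empty pattern
        while stack:
            pattern = stack.pop()
            if pattern in seen:
                continue
            seen.add(pattern)
            if support(pattern, graph_db) < min_support:
                continue                           # prune: no superset can be frequent
            frequent.append(pattern)
            for e in all_edges - pattern:          # expand by adding one edge
                stack.append(pattern | {e})
        return frequent

    db = [frozenset({"A-B", "B-C"}), frozenset({"A-B", "C-D"}), frozenset({"A-B", "B-C", "C-D"})]
    print(mine_frequent(db, min_support=2))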

The clustering problem is handled with a binomial mixture model (BMM). Just as a GMM can be applied to clustering and trained with EM, the BMM is fit in the same way. One difference is a regularization technique they use, which I guess resembles the corresponding term in NMF.
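For my own reference, the standard EM updates for a mixture of multivariate Bernoullis (leaving out whatever regularizer the authors add) should look roughly like

    r_{nk} = \frac{\pi_k \prod_d \mu_{kd}^{x_{nd}} (1-\mu_{kd})^{1-x_{nd}}}
                  {\sum_{k'} \pi_{k'} \prod_d \mu_{k'd}^{x_{nd}} (1-\mu_{k'd})^{1-x_{nd}}},
    \qquad
    \pi_k = \frac{1}{N}\sum_n r_{nk},
    \qquad
    \mu_{kd} = \frac{\sum_n r_{nk} x_{nd}}{\sum_n r_{nk}}.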

The experiments are carried out on biological databases, e.g. RNA data. All in all, it is a somewhat application-oriented paper.

Wednesday, January 23, 2008

Local Fisher Discriminant Analysis for Supervised Dimensionality Reduction


I first learned of this paper from afliboy and was astonished, since I had been scanning ICML for my current research project without spotting it first. I am experimenting with it; however, up to now I still don't understand why it is not as good as I expected on my data.

The idea is simple; anyone with experience in FDC will see it immediately. In FDC (or FDA, to follow the author), we have two ``variance'' matrices, one for the between-class scatter (Sb) and one for the within-class scatter (Sw). The reduction projects samples onto the eigenvectors of the generalized eigenvalue problem for (Sb, Sw), which is claimed to minimize the within-class variance and maximize the between-class variance simultaneously; I still hold that this is just one strategy for picking a Pareto-optimal solution (of course, DNE uses another). The proposed idea reformulates the two scatter matrices in a pairwise form (I sketch my reconstruction after the next paragraph).

This lets us think of the variance as a pairwise relationship, as on a graph, something I did several months ago without pushing the idea further. If I had, I would be even more disappointed now :-( Locality can then be introduced by modifying the weights (the A's).
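As far as I can recall the LFDA formulation (this is my reconstruction, not copied from the paper), the pairwise form is

    S^{(w)} = \frac{1}{2}\sum_{i,j} A^{(w)}_{ij}\,(x_i - x_j)(x_i - x_j)^{\top},
    \qquad
    S^{(b)} = \frac{1}{2}\sum_{i,j} A^{(b)}_{ij}\,(x_i - x_j)(x_i - x_j)^{\top},

where plain FDA corresponds to constant weights for same-class and different-class pairs, and LFDA multiplies them by a local affinity between x_i and x_j.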

Here the A's are the affinity (adjacency) matrices. The algorithm is then easy to kernelize, just like FDC.

With several other papers on graph embedding already out, it might be quite difficult to get new results in this area. But anyway, let me try first.

Label Propagation Through Linear Neighborhoods


This is a graph-based semi-supervised learning algorithm. First, the authors use the idea of LLE to determine the edge weights of the graph (the neighborhood itself is selected in the traditional way), but with one extra constraint on the weights: in LLE the weights are those that best reconstruct each sample and sum to one, while here they must additionally be non-negative, which yields the best convex-combination approximation. Second, they use the transductive learning idea (very similar to the ICML 06 paper on ranking graphs). The iterative procedure is interpreted as label propagation.
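If I write the propagation step down from memory (the notation is mine), it should be the usual update

    F^{(t+1)} = \alpha\, W F^{(t)} + (1-\alpha)\, Y, \qquad 0 < \alpha < 1,

where W holds the non-negative reconstruction weights (each row summing to one) and Y stacks the initial labels.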

To prove convergence of the procedure, it is only necessary to analyze the eigenvalues of W (they invoke the Perron-Frobenius theorem, since the matrix is non-negative).
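Assuming the update above, my back-of-the-envelope version of the argument: the rows of W are non-negative and sum to one, so its spectral radius is 1; hence \alpha W has spectral radius strictly below 1 and the iteration converges to

    F^{\ast} = (1-\alpha)\,(I - \alpha W)^{-1} Y.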

The analysis leads to another connection between their LNP and Laplacian-regularized least-squares regression.
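I have not checked the exact statement, but I would expect the connection to be with an objective of the form

    \min_F \ \sum_{i \in \text{labeled}} \|f_i - y_i\|^2 + \gamma\, \operatorname{tr}\!\big(F^{\top} (I - W)\, F\big),

where I - W plays the role of a (normalized) graph Laplacian.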

I must verify my intuition today.

To handle out-of-sample data, which turns transduction into induction, they use an idea similar to the standard way of extending LLE, of course with the additional non-negativity constraint.
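If I understand the extension correctly, a new point x gets the value

    f(x) = \sum_{j \in N(x)} w_j\, f(x_j), \qquad w_j \ge 0, \ \ \sum_j w_j = 1,

where the w_j are the convex reconstruction weights of x computed from its neighbors among the training points (my paraphrase, not the paper's notation).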

I am not familiar with this kind of transductive result. But I don't think their algorithm will work any better than Laplacian-regularized least-squares regression. OK, let me check.

Tuesday, January 22, 2008

Inference with the Universum


The Universum is a new concept to me. In this setting, we have another data set (the universum) besides the training and test sets. Now let's look at the logic of the whole thing.

With the training set, there are only finitely many ways to separate the samples. Each separation corresponds to an equivalence class of functions in the hypothesis space. In the SVM setting, the regularizer added to the loss is the margin of the function. Why control the margin? Because it is the ``inverse'' of capacity, which governs generalization ability, and capacity is the complexity of the hypothesis space measured over the whole feature space.

However, real problems are different: the samples come from only a fraction of the whole feature space, so we need to measure the complexity of the function on the problem at hand. In Bayesian theory the prior distribution is supposed to encapsulate the information held beforehand, but it is difficult to design priors.

By designing universum samples carefully so as to incorporate this information (universum samples belong to no labeled class; they are relevant but uninteresting samples), we might find a way to measure that complexity. The following optimization is proposed to realize the idea.
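My reconstruction of the objective (the notation is mine, and the paper's version may differ in details):

    \min_{w,b}\ \frac{1}{2}\|w\|^2
      + C \sum_{i=1}^{\ell} H\big(y_i\,[\langle w, x_i\rangle + b]\big)
      + C_U \sum_{j=1}^{u} U\big(\langle w, x^{\ast}_j\rangle + b\big),

with H the hinge loss on the labeled data and U an epsilon-insensitive loss on the universum points x^{\ast}_j.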

The last term is the universum regularizer; the U() function is an epsilon-insensitive loss. The resulting function should take small values on the universum samples and is therefore unstable (w.r.t. the sign of those values) under a small perturbation (the perturbed function most probably stays in the same equivalence class). So the resulting function comes from an equivalence class with many contradictions on the universum samples, which indicates high complexity on the universum and, given a limited total complexity, comparably low complexity on the samples we actually care about.

Well, this is really interesting: adding ``useless'' samples and building a regularizer from them gives better results :-)

Learning Random Walks to Rank Nodes in Graphs


Well, forget about the other Agarwal who published the ICML 06 paper (I mistook the two for the same person). Sorry for my carelessness, and thanks to the anonymous reader who kindly pointed it out.

In this paper, they explore the relationship between two strategies for ranking nodes in graphs: one from NetRank, which minimizes a KL divergence, and one proposed in the paper mentioned above, which solves an SVM-like optimization and uses what is called a Laplacian regularizer. Their main results (stated as theorems) include:
  • A bound on the KL divergence implies a bound on the Laplacian regularizer (which indicates an acceptable degree of smoothness for the ranking function).
  • The generalization error can be bounded (usually in the form R <= Remp + something).
which I don't fully understand.

I think I might read something about PageRank and NetRank when I have time.

Thursday, January 10, 2008

Uncovering Shared Structures in Multiclass Classification


This paper has Srebro among its authors, one of the authors of the earlier papers on MMMF. The goal now is to find a linear projection, a matrix W. To get a good projection, they define a loss function on the projected data.

The loss is regularized by a norm of W, which gives the learning problem.

But they argue that the Frobenius norm can be replaced by the trace norm, resulting in a problem that is formally the same as MMMF.
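Roughly, as I understand it (my sketch, not the paper's exact notation), the two variants are

    \min_W \ \sum_i \operatorname{loss}\big(W x_i;\, y_i\big) + \frac{\lambda}{2}\,\|W\|_F^2
    \qquad \text{versus} \qquad
    \min_W \ \sum_i \operatorname{loss}\big(W x_i;\, y_i\big) + \lambda\,\|W\|_{\Sigma},

where the trace norm \|W\|_{\Sigma} is the sum of the singular values of W; it encourages low rank, i.e., structure shared across classes.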


After analyzing the dual of the problem above, they find that it can be kernelized.


These problems can then be solved with SDP or PR-CG.

Fast Maximum Margin Matrix Factorization for Collaborative Prediction


This paper was published just a little after the NIPS one, which introduced MMMF; this one shows how to compute MMMF faster.

There is something a little odd between the two. The NIPS paper claims the trace-norm regularizer is good because it yields a globally optimal solution (a convex problem solved by SDP). The ICML paper claims the following factored form is better computationally.
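My sketch of the factored form (leaving out the threshold machinery for ordinal ratings):

    \min_{U,V}\ \sum_{(i,j)\in \Omega} h\big(Y_{ij}\,(U V^{\top})_{ij}\big)
      + \frac{\lambda}{2}\big(\|U\|_F^2 + \|V\|_F^2\big),

where \Omega is the set of observed entries and h a hinge-like loss.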

For this non-convex optimization, they use PR-CG. OK... I am still illiterate when it comes to optimization techniques.

Maximum-Margin Matrix Factorization


Actually I didn't intend to read this one, but I was wondering why the trace norm is chosen in another paper I am reading. The first author's homepage has the related code and slides.

We know several matrix factorization problems. One is the SVD, which finds the best low-rank approximation of a matrix under the Frobenius norm. Later came NMF: for a non-negative matrix, we look for a set of non-negative vectors such that the matrix is best approximated within the cone they span, under the Frobenius norm or a KL-like divergence.

Here we mainly deal with a sparsely observed matrix whose observed entries might be binary. Usually a hinge-like loss is applied to each observed element of UV', and a regularizer is added to the loss. The trace norm is adopted because it is the minimum, over all factorizations UV', of (||U||_F^2 + ||V||_F^2)/2, which sidesteps the non-convexity caused by optimizing U and V directly.
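For my own notes, the variational characterization I believe is being used, and the resulting convex problem, are

    \|X\|_{\Sigma} = \min_{X = U V^{\top}} \tfrac{1}{2}\big(\|U\|_F^2 + \|V\|_F^2\big),
    \qquad
    \min_X \ \sum_{(i,j)\in \Omega} h\big(Y_{ij} X_{ij}\big) + \lambda\, \|X\|_{\Sigma},

with \Omega the observed entries (my notation).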


The optimization is solved with SDP. I guess I should get comfortable with the recent optimization techniques, such as SDP, CCCP, etc. To get a better understanding of this work, I will read the Ph.D. thesis of the first author:
Learning with Matrix Factorizations.

A Duality View of Spectral Methods for Dimensionality Reduction


One important trick in this paper is to apply basic (convex) optimization theory to the proposed optimization problem. They work with the MVU algorithm proposed by Weinberger. I have seldom paid attention to the importance of this theory, although I did try it a few times; I should take its theoretical importance more seriously from now on.

The optimization of MVU maximizes the trace of the Gram matrix subject to centering and local-isometry constraints; from it one writes down the Lagrange function and then the dual problem.
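My reconstruction of the primal (from memory of Weinberger's MVU, so treat it as a sketch):

    \max_{K \succeq 0}\ \operatorname{tr}(K)
    \quad \text{s.t.} \quad \sum_{i,j} K_{ij} = 0,
    \quad K_{ii} - 2K_{ij} + K_{jj} = \|x_i - x_j\|^2 \ \text{ for all neighboring pairs } (i,j).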

By analyzing the KKT conditions and weak/strong duality, we gain insight into the relationship between the sparse Laplacian matrix and the dense Gram matrix, which is really interesting. The authors then analyze ISOMAP, LLE and Laplacian Eigenmaps in the same way.

It would be worth redoing their analysis to sharpen my own skills.

Tuesday, January 8, 2008

Null Space versus Orthogonal Linear Discriminant Analysis


This article tells me one important thing: LDA, or the Fisher discriminant criterion, is still being studied by lots of people. Last time I read one by Manli Zhu, which proposes a criterion different from the traditional ones for the case where the covariance matrix is rank-deficient.

One of the simplest ideas is to apply a ridge perturbation to the singular within-class covariance matrix, which is usually called regularization. Another is to work with an approximation in the principal subspace of Sw. Zhu's idea is rooted in the latter, where normally all principal directions are kept for the subsequent generalized eigenvalue problem; Zhu instead rejects those principal directions that are almost perpendicular to the subspace spanned by Sb.
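For my notes, the ridge-regularized variant amounts to solving something like

    S_b\, w = \lambda\,(S_w + \epsilon I)\, w,

i.e., maximizing the Rayleigh quotient w^T S_b w / w^T (S_w + \epsilon I) w, where \epsilon > 0 makes the within-class matrix nonsingular.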

In this paper, the authors mention several other strategies for dealing with the singularity of Sw and focus on two of them: Orthogonal LDA (OLDA) and Null Space LDA (NLDA). Their main result is that the two yield the same solution under a certain condition, and the condition is mild for high-dimensional data (the number of dimensions much larger than the number of samples).

I guess I will revisit this article after I check those LDA variants, whose usefulness I personally doubt. Here are the references:
  • A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognition 33 (2000), by Chen et al.
  • Solving the small sample size problem of LDA, ICPR 2002, by Huang et al.
  • Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems, JMLR 2005.
  • Penalized discriminant analysis, Annals of Statistics 23 (1995), by Hastie et al.

Semi-supervised Nonlinear Dimensionality Reduction


I guess those working on semi-supervised manifold learning are still far from practical applications. In this paper, the so-called SS-LLE and SS-LTSA are so obvious that they are hardly worth mentioning. Just consider the original algorithms: the optimization yields the required coordinates, and now some of them are known, so a blockwise formulation immediately shows that the semi-supervised problem is even easier to solve.
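A sketch of what I mean by the blockwise formulation (my notation): write the embedding cost as tr(Y^T M Y) with M the reconstruction/alignment matrix, order the points so that the labeled ones come first, and partition

    M = \begin{pmatrix} M_{ll} & M_{lu} \\ M_{ul} & M_{uu} \end{pmatrix},
    \qquad
    Y = \begin{pmatrix} Y_l \\ Y_u \end{pmatrix}.

With Y_l fixed, setting the gradient with respect to Y_u to zero gives the linear system M_{uu} Y_u = -M_{ul} Y_l.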


Something a little more difficult is making the traditional ISOMAP semi-supervised. In LLE and LTSA the term being minimized is a reconstruction error in the corresponding setting, while the optimization in ISOMAP, being a spectral embedding, has no similar interpretation. They use a rather ugly trick, based on shifting eigenvalues, to incorporate their idea into ISOMAP.

Here A is the Gram matrix obtained after the double-centering step of ISOMAP. By shifting its eigenvalues, the desired coordinates can be obtained from a matrix M similar to the one in LLE and LTSA. But you can somehow sense how forced this is; their own paper reports a similar finding, noting that SS-ISOMAP is not as good as the other two.

Maybe the latter part, analyzing the numerical stability of their algorithms, is more interesting. But it is not really a mathematical paper, and I don't like this style of machine learning.

Statistical Debugging: Simultaneous Identification of Multiple Bugs


Look at the authors' affiliations, CMU, Berkeley, Wisconsin-Madison and Stanford, and consider who Jordan is. I am really curious how statistical learning is applied to such a delicate area as debugging. In my personal experience, debugging is really time-consuming and confidence-consuming :-p

However, I don't think the paper gives me a basic idea of how it actually works; perhaps I need to know what they did previously. Here are a few papers from its references:
  • Statistical debugging of sampled programs, NIPS 2004, by the same authors.
  • The first author's home page. She turns out to be a Ph.D. student of Jordan and is now a post-doc at CMU. Astonishing.

Parallel Analysis

This technique is used to decide the dimension for PCA. I know little about its theory and I am searching for some related materials.

The variant I am applying in my experiments is very simple; it only requires doing PCA once more. Let X be the data matrix whose columns are samples (so each row is one feature). Randomly permute each row of X and apply PCA to the permuted data. Then compare the eigenvalues of the two runs, sorted in descending order. The dimension kept for PCA is the number of positions at which the eigenvalue from the permuted data is smaller than the corresponding eigenvalue from the original data.
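A minimal Python sketch of this permutation version (my own implementation; a more careful one would average over many permutations):

    import numpy as np

    def parallel_analysis_dim(X, seed=None):
        """X: features in rows, samples in columns."""
        rng = np.random.default_rng(seed)
        orig = np.sort(np.linalg.eigvalsh(np.cov(X)))[::-1]   # eigenvalues, descending
        Xp = np.array([rng.permutation(row) for row in X])    # permute each feature independently
        perm = np.sort(np.linalg.eigvalsh(np.cov(Xp)))[::-1]
        return int(np.sum(orig > perm))                       # components beating the permuted baseline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 200))                  # 10 features, 200 samples
    X[1] = X[0] + 0.1 * rng.normal(size=200)        # inject one correlated direction
    print(parallel_analysis_dim(X, seed=1))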

According to my experimental results, with the dimension determined by parallel analysis, the original data and the projected data yield very similar results under linear algorithms.