Monday, February 26, 2007

A New Approach to Data Driven Clustering


Well, on seeing Ghahramani's name, I guessed it would be no easy read. At first I was curious why their starting point resembles the diffusion map paper so much, but soon I found out I was wrong. The first author is a Ph.D. student whose advisor is the second author. I wonder when I'll be able to publish a paper like his in the top conferences. Sigh~ Maybe never.

Like diffusion map, they set up a Markov random walk on a graph over the samples. They then notice that the similarity between the walk distributions starting at each sample can help with clustering, and propose a clustering algorithm analogous to K-means (the proposed version minimizes KL divergence instead of the L2 norm). The optimization problem can be solved just like K-means, yielding a local minimum. I thought they might stop here, but they didn't.
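Here is how I understand the flat algorithm, as a minimal NumPy sketch. Everything below (the function name, taking a raw affinity matrix as input, the default walk length t=3, the empty-cluster restart) is my own reconstruction for illustration, not the authors' code:

```python
import numpy as np

def kl_kmeans(A, K, t=3, n_iter=50, seed=0):
    # Row-normalize the affinity matrix into a Markov transition matrix,
    # then take t steps: row i is the distribution of a walk started at sample i.
    T = A / A.sum(axis=1, keepdims=True)
    Pt = np.linalg.matrix_power(T, t)
    n, eps = Pt.shape[0], 1e-12
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=n)
    for _ in range(n_iter):
        # Centroid of a cluster = mean of its member distributions;
        # an empty cluster is re-seeded from a random row.
        cents = np.stack([Pt[labels == k].mean(axis=0) if np.any(labels == k)
                          else Pt[rng.integers(n)] for k in range(K)])
        # Reassign each sample to the centroid with the smallest KL(P_i || c_k).
        kl = (Pt[:, None, :] * (np.log(Pt[:, None, :] + eps)
                                - np.log(cents[None, :, :] + eps))).sum(axis=-1)
        new = kl.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```

On an affinity matrix with, say, two strongly connected blocks and weak cross-block links, this should recover the blocks.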

Careful readers will notice that the number of clusters K and the time t of the random walk have not been determined yet. So the following part deals with these two parameters. Given K, learning t is an easy problem: if the matrix of walk distributions is to be maximally preserved, a low-rank approximation keeps the leading eigenvalues and abandons the rest, and we pick the t at which the gap between the kept and discarded eigenvalues is maximized.
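Concretely, since the eigenvalues of the t-step transition matrix are just the t-th powers of the one-step eigenvalues, one eigendecomposition suffices to scan t and watch the gap between the K-th and (K+1)-th eigenvalue. A hypothetical sketch (the function name and the t_max bound are mine):

```python
import numpy as np

def best_t_for_K(A, K, t_max=100):
    T = A / A.sum(axis=1, keepdims=True)
    # Eigenvalues of T^t are the t-th powers of the eigenvalues of T,
    # so we only decompose once and power the spectrum.
    lam = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    gaps = [lam[K - 1] ** t - lam[K] ** t for t in range(1, t_max + 1)]
    t_star = int(np.argmax(gaps)) + 1
    return t_star, float(max(gaps))
```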

Then how can we obtain K itself from the data? Well, it's not that difficult: with a similar idea, we also scan over candidate values of K and take the one whose eigen gap is maximized, since beforehand we don't know which K is right.
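One way to read this is a joint scan over K and t, keeping the pair with the largest gap. Again a guess at the procedure, with made-up bounds K_max and t_max:

```python
import numpy as np

def choose_K_and_t(A, K_max=6, t_max=100):
    T = A / A.sum(axis=1, keepdims=True)
    lam = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    best_gap, best_K, best_t = -1.0, None, None
    for K in range(2, min(K_max, len(lam) - 1) + 1):
        for t in range(1, t_max + 1):
            # Gap between the K kept eigenvalues and the discarded ones
            # after t steps of the walk.
            gap = lam[K - 1] ** t - lam[K] ** t
            if gap > best_gap:
                best_gap, best_K, best_t = gap, K, t
    return best_K, best_t, float(best_gap)
```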

But then, is it possible to formulate a hierarchical clustering algorithm with this random walk? Well, use the gap size as the plausibility of a partition: choose the smallest number of clusters that is still plausible (a big gap with few clusters), perform the partition, and recursively carry out the algorithm on each part.
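To make that concrete, here is a self-contained toy version of such a recursion. It re-implements the gap test and the KL split inline so it runs on its own; the function names and the min_gap/K_max knobs are all my own invention, not the paper's:

```python
import numpy as np

def _eigen_gap(A, K, t_max=100):
    # Gap between the K-th and (K+1)-th eigenvalues over the best t-step walk.
    T = A / A.sum(axis=1, keepdims=True)
    lam = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    gaps = [lam[K - 1] ** t - lam[K] ** t for t in range(1, t_max + 1)]
    t = int(np.argmax(gaps)) + 1
    return gaps[t - 1], t

def _kl_split(A, K, t, n_iter=30, seed=0):
    # Compact KL-divergence K-means on the t-step walk distributions.
    rng = np.random.default_rng(seed)
    T = A / A.sum(axis=1, keepdims=True)
    Pt = np.linalg.matrix_power(T, t)
    n, eps = Pt.shape[0], 1e-12
    labels = rng.integers(K, size=n)
    for _ in range(n_iter):
        cents = np.stack([Pt[labels == k].mean(axis=0) if np.any(labels == k)
                          else Pt[rng.integers(n)] for k in range(K)])
        kl = (Pt[:, None] * (np.log(Pt[:, None] + eps)
                             - np.log(cents[None] + eps))).sum(axis=-1)
        new = kl.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

def hier_cluster(A, idx=None, min_gap=0.2, K_max=4):
    # Recurse: take the smallest plausible K (big gap, few clusters),
    # partition, and carry on inside each part. Returns a nested list of
    # original sample indices; a flat list of ints is a leaf.
    if idx is None:
        idx = np.arange(A.shape[0])
    if len(idx) <= 2:
        return [int(i) for i in idx]
    for K in range(2, min(K_max, len(idx) - 1) + 1):
        gap, t = _eigen_gap(A, K)
        if gap >= min_gap:  # partition is plausible at this K
            labels = _kl_split(A, K, t)
            children = []
            for k in range(K):
                m = np.flatnonzero(labels == k)
                if len(m) > 0:
                    children.append(hier_cluster(A[np.ix_(m, m)], idx[m],
                                                 min_gap, K_max))
            return children
    return [int(i) for i in idx]  # no plausible split: stop here
```

On the two-block toy affinity from before, this should split into the two blocks at the top level and then stop, since the uniform within-block walks show no further gap.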

A piece of fresh work needs few references.
