Wednesday, January 30, 2008

Clustering Graphs by Weighted Substructures Mining


The second author Taku Kudo has written several useful packages in NLP. I'd love to do sth such as programming, esp. making useful tools for later research.

In this paper, the data are graphs. The features of these graphs are so-called patterns, which can be observed or not in a certain graph (hence binary features). Then the representation of a sample is a high-dimensional binary vector.

There are two things in this paper, feature selection and clustering.

The feature selection is tackled with a weighted substructure mining strategy with the help of DFS code tree (a data structure I am not familiar with). Their idea is simple, starting from an empty graph, expanding it by adding an edge (hence we get a tree-like structure for searching). Since we know if a patten is contained in another, it appears more frequent than the latter. So in the search structure, the frequency will decrease as the depth goes up. The frequent patterns can be found by pruning the search tree.

The clustering problem is dealt with binomial mixture model. We know GMM which can be applied to clustering and the model is trained with EM. Likewise, BMM is done in the same way. One thing different is a regularization technique they use. I guess it resembles the term in NMF.

The experiment is carried out on biological databases, e.g. RNA. After all, it is a little application-oriented paper.

No comments: