Showing posts with label information theory. Show all posts

Wednesday, July 8, 2009

Geometry-aware Metric Learning


This is an extension of the ICML 07 paper that introduces graph regularization. The idea is to find a pair of kernel matrices, one from a task-dependent kernel set and the other from a set of parametrized data-dependent kernels. The two sets are quite different: the former is used in the subsequent classification (or related) task, and the authors choose the Mahalanobis kernel; the latter encodes locality information and is built from graph kernels (a subspace spanned by the eigenvectors corresponding to the smallest eigenvalues of the graph Laplacian). The Bregman divergence is employed to measure the discrepancy between the two kernels. The difference from the optimization in the ICML 07 paper is that the second matrix is no longer fixed; the rough idea is to update the two alternately (just as in alternating least squares for tensors).
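A minimal sketch of the data-dependent piece described above: building a "graph kernel" as the projection onto the subspace spanned by the eigenvectors of the smallest Laplacian eigenvalues. The function name and the choice of the unnormalized Laplacian are my own assumptions, not taken from the paper.

```python
import numpy as np

def graph_kernel(W, r):
    """Sketch (illustrative, not the paper's exact construction):
    W -- symmetric adjacency matrix of the data graph
    r -- number of smallest-eigenvalue eigenvectors to keep."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                 # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)     # eigh sorts eigenvalues ascending
    U = vecs[:, :r]                    # bottom-r eigenvector subspace
    return U @ U.T                     # projection kernel onto that subspace

# Toy 4-node path graph.
W = np.zeros((4, 4))
for i in range(3):
    W[i, i + 1] = W[i + 1, i] = 1.0
K = graph_kernel(W, 2)
```

Since K is a projection onto an r-dimensional subspace, it is symmetric, positive semidefinite, and has trace r.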

We may change the data-dependent kernel for different tasks (e.g. graph kernels for unsupervised tasks, label-based kernels for supervised tasks). There are connections with regularization theory: the solution can be represented as a combination of the graph kernel and another term that can be interpreted as a regularizer. Another possible connection is to GPs: a GP could be regarded as a special case of the proposed framework. I find this topic very interesting and will come back to it when I have time.

Thursday, July 2, 2009

Information Theoretic Measures for Clustering Comparisons: Is a Correction for Chance Necessary?


One problem in clustering is how to compare two clustering results. E.g., given two clustering results, which one is closer to the true clustering?

There are many analytical measures, including the Rand Index (RI) and the Adjusted Rand Index (ARI), both based on the contingency table. Although the RI is appealing since it equals 1 iff the two clusterings are identical and 0 when no pair of samples is grouped together in both, it cannot distinguish a random partition from a ``good clustering''.

To correct for this (hence the name ``correction for chance''), the ARI was proposed. Another class of measures comes from information theory: they use the mutual information between the cluster proportions of the two clusterings (each clustering's proportions form a distribution, so the mutual information captures the information shared by the two clusterings). As we know, we can build a distance from entropy, D(p, q) = H(p, q) - I(p, q) = H(p) + H(q) - 2 I(p, q), or use its normalized version \mathrm{NMI}(p, q) = \frac{I(p, q)}{\sqrt{H(p) H(q)}}.
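The distance D (the variation of information) and the NMI above can be computed directly from two label vectors. A small pure-Python sketch, with function names of my own choosing:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy H of the cluster-size distribution of one clustering."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_info(u, v):
    """Mutual information I(u, v) from the empirical joint distribution."""
    n = len(u)
    joint = Counter(zip(u, v))
    pu, pv = Counter(u), Counter(v)
    return sum((c / n) * math.log(c * n / (pu[a] * pv[b]))
               for (a, b), c in joint.items())

def variation_of_information(u, v):
    """D(u, v) = H(u) + H(v) - 2 I(u, v)."""
    return entropy(u) + entropy(v) - 2 * mutual_info(u, v)

def nmi(u, v):
    """NMI(u, v) = I(u, v) / sqrt(H(u) H(v))."""
    return mutual_info(u, v) / math.sqrt(entropy(u) * entropy(v))

u = [0, 0, 1, 1]
v = [0, 0, 1, 1]
```

For identical clusterings I(u, v) = H(u), so the distance is 0 and the NMI is 1, matching the identities above.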

The authors propose a corrected-for-chance version of the information-theoretic measures, and their experiments suggest it works pretty well. Their criteria are
\mathrm{AMI}(U, V) = \frac{I(U, V) - \mathbb{E}( I(M) \mid a, b)}{\sqrt{H(U) H(V)} - \mathbb{E}(I(M) \mid a, b)}
or
\mathrm{AVI} = \frac{2 I(U, V) - 2 \mathbb{E}(I(M) \mid a, b)}{H(U) + H(V) - 2 \mathbb{E}(I(M) \mid a, b)}
where \mathbb{E}(I(M) \mid a, b) is the expected mutual information computed from the contingency table, with row sums a and column sums b.
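A sketch of \mathbb{E}(I(M) \mid a, b) under the hypergeometric model of randomness, plus the resulting AMI; the helper names are mine, and log-gamma is used in place of factorials for numerical stability:

```python
import math
from collections import Counter

def expected_mutual_info(a, b, n):
    """E(I(M) | a, b): expectation of the mutual information over
    contingency tables M with fixed row sums a and column sums b,
    under the hypergeometric model."""
    lg = math.lgamma          # lgamma(x + 1) = log(x!)
    emi = 0.0
    for ai in a:
        for bj in b:
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # hypergeometric probability of cell count nij
                logp = (lg(ai + 1) + lg(bj + 1) + lg(n - ai + 1)
                        + lg(n - bj + 1) - lg(n + 1) - lg(nij + 1)
                        - lg(ai - nij + 1) - lg(bj - nij + 1)
                        - lg(n - ai - bj + nij + 1))
                emi += (nij / n) * math.log(n * nij / (ai * bj)) * math.exp(logp)
    return emi

def ami(u, v):
    """AMI(U, V) = (I - E(I)) / (sqrt(H(U) H(V)) - E(I))."""
    n = len(u)
    pu, pv = Counter(u), Counter(v)
    a, b = list(pu.values()), list(pv.values())
    mi = sum((c / n) * math.log(c * n / (pu[x] * pv[y]))
             for (x, y), c in Counter(zip(u, v)).items())
    hu = -sum((c / n) * math.log(c / n) for c in a)
    hv = -sum((c / n) * math.log(c / n) for c in b)
    emi = expected_mutual_info(a, b, n)
    return (mi - emi) / (math.sqrt(hu * hv) - emi)
```

For identical clusterings I(U, V) = H(U) = H(V), so the AMI is exactly 1; for the trivial one-cluster partition the expected mutual information vanishes.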

The authors provide examples where the correction is necessary, especially when the number of samples in each cluster is small.