Thursday, July 2, 2009

Information Theoretic Measures for Clustering Comparisons: Is a Correction for Chance Necessary?


One problem in clustering is how to measure the agreement between two clusterings. E.g., given two clustering results, which one is closer to the true clustering?

There are many such measures, including the Rand Index (RI) and the Adjusted Rand Index (ARI), both based on the contingency table of the two clusterings. Although RI is appealing, since it equals 1 iff the two clusterings are identical and 0 when no pair of samples is grouped consistently, it cannot distinguish a random partition from a ``good clustering'': its expected value between two independent random partitions is not a fixed baseline and is typically well above 0, so even a random partition can score fairly high.
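For concreteness, here is a minimal sketch (my own, not the paper's code) of RI and ARI computed from a contingency table n[i][j], the number of samples placed in cluster i by clustering U and in cluster j by clustering V:

```python
from math import comb

def rand_indices(n):
    """RI and ARI from a contingency table n (list of lists of counts)."""
    a = [sum(row) for row in n]            # marginals of clustering U
    b = [sum(col) for col in zip(*n)]      # marginals of clustering V
    N = sum(a)
    pairs = comb(N, 2)
    same_same = sum(comb(nij, 2) for row in n for nij in row)
    sum_a = sum(comb(ai, 2) for ai in a)
    sum_b = sum(comb(bj, 2) for bj in b)
    # RI: fraction of sample pairs on which the two clusterings agree.
    ri = (pairs + 2 * same_same - sum_a - sum_b) / pairs
    # ARI: shift and rescale RI so a random partition scores 0 in expectation.
    expected = sum_a * sum_b / pairs
    ari = (same_same - expected) / (0.5 * (sum_a + sum_b) - expected)
    return ri, ari
```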

To correct this (hence the name ``correction for chance''), the ARI was proposed. Another class of measures comes from information theory. They use the mutual information between the label proportions of the two clusterings (each clustering's label proportions form a distribution, so the mutual information is the information shared by the two clusterings). As is well known, entropy gives a distance D(p, q) = H(p, q) - I(p, q) = H(p) + H(q) - 2 I(p, q) (the variation of information), or its normalized version \mathrm{NMI}(p, q) = \frac{I(p, q)}{\sqrt{H(p) H(q)}}.
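Using the same contingency-table notation as above (marginals a, b over N samples), these quantities can be computed as in the following sketch; the function and variable names are mine:

```python
from math import log, sqrt

def info_measures(n):
    """H(U), H(V), I(U,V), VI and NMI from a contingency table n."""
    a = [sum(row) for row in n]
    b = [sum(col) for col in zip(*n)]
    N = sum(a)
    h_u = -sum(ai / N * log(ai / N) for ai in a if ai > 0)
    h_v = -sum(bj / N * log(bj / N) for bj in b if bj > 0)
    mi = sum(nij / N * log(N * nij / (ai * bj))
             for ai, row in zip(a, n)
             for bj, nij in zip(b, row) if nij > 0)
    vi = h_u + h_v - 2 * mi                 # the distance D above
    nmi = mi / sqrt(h_u * h_v)              # the normalized version
    return h_u, h_v, mi, vi, nmi
```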

The authors propose a corrected-for-chance version of the information-theoretic measures, and their experiments suggest it works quite well. Their criteria are
\mathrm{AMI}(U, V) = \frac{I(U, V) - \mathbb{E}( I(M) \mid a, b)}{\sqrt{H(U) H(V)} - \mathbb{E}(I(M) \mid a, b)}
or
\mathrm{AVI} = \frac{2 I(U, V) - 2 \mathbb{E}(I(M) \mid a, b)}{H(U) + H(V) - 2 \mathbb{E}(I(M) \mid a, b)}
where \mathbb{E}(I(M) \mid a, b) is the expected mutual information over random contingency tables M with the same marginals a and b, computed in closed form from the hypergeometric model.
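For illustration, here is a sketch of the correction term and of AMI, assuming the hypergeometric model with fixed marginals described above (it reuses info_measures from the previous sketch and is not the authors' code):

```python
from math import comb, log, sqrt

def expected_mi(a, b):
    """E(I(M) | a, b): expected MI over random tables with marginals a, b."""
    N = sum(a)
    emi = 0.0
    for ai in a:
        for bj in b:
            # n_ij ranges over the values allowed by the fixed marginals;
            # the k = 0 term contributes nothing and is skipped.
            for k in range(max(1, ai + bj - N), min(ai, bj) + 1):
                p = comb(ai, k) * comb(N - ai, bj - k) / comb(N, bj)
                emi += p * (k / N) * log(N * k / (ai * bj))
    return emi

def adjusted_mi(n):
    """AMI(U, V) from a contingency table n, following the formula above."""
    a = [sum(row) for row in n]
    b = [sum(col) for col in zip(*n)]
    h_u, h_v, mi, _, _ = info_measures(n)
    emi = expected_mi(a, b)
    return (mi - emi) / (sqrt(h_u * h_v) - emi)
```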

The authors provide examples where the correction is necessary, especially when the number of samples per cluster is small.
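A quick way to see this effect (using scikit-learn's implementations of NMI and AMI rather than the paper's code) is to score two independent random labelings with only a few samples per cluster: NMI tends to stay well above 0 while AMI stays near 0.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
n_samples, n_clusters = 50, 10                  # only ~5 samples per cluster
u = rng.integers(0, n_clusters, n_samples)
v = rng.integers(0, n_clusters, n_samples)      # unrelated to u

print("NMI:", normalized_mutual_info_score(u, v, average_method="geometric"))
print("AMI:", adjusted_mutual_info_score(u, v))
```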
