Thursday, July 2, 2009

Information Theoretic Measures for Clustering Comparisons: Is a Correction for Chance Necessary?


One problem in clustering is how to measure the agreement between two clusterings. E.g., given two clustering results, which one is closer to the true clustering?

There are many such measures, including the Rand Index (RI) and the Adjusted Rand Index (ARI), both based on the contingency table of the two clusterings. Although RI is appealing, since it equals 1 iff the two clusterings are identical and 0 when no pair of samples is grouped consistently, it cannot distinguish a random partition from a ``good clustering'': its expected value between two independent random partitions is not a fixed baseline and is typically well above 0, so even a random partition can score fairly high.
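For concreteness, here is a minimal sketch (my own, not the paper's code) of RI and ARI computed from a contingency table n[i][j], the number of samples placed in cluster i by clustering U and in cluster j by clustering V:

```python
from math import comb

def rand_indices(n):
    """RI and ARI from a contingency table n (list of lists of counts)."""
    a = [sum(row) for row in n]            # marginals of clustering U
    b = [sum(col) for col in zip(*n)]      # marginals of clustering V
    N = sum(a)
    pairs = comb(N, 2)
    same_same = sum(comb(nij, 2) for row in n for nij in row)
    sum_a = sum(comb(ai, 2) for ai in a)
    sum_b = sum(comb(bj, 2) for bj in b)
    # RI: fraction of sample pairs on which the two clusterings agree.
    ri = (pairs + 2 * same_same - sum_a - sum_b) / pairs
    # ARI: shift and rescale RI so a random partition scores 0 in expectation.
    expected = sum_a * sum_b / pairs
    ari = (same_same - expected) / (0.5 * (sum_a + sum_b) - expected)
    return ri, ari
```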

To correct this (hence the name ``correction for chance''), the ARI was proposed. Another class of measures comes from information theory. They use the mutual information between the label proportions of the two clusterings (each clustering's label proportions form a distribution, so the mutual information is the information shared by the two clusterings). As is well known, entropy gives a distance D(p, q) = H(p, q) - I(p, q) = H(p) + H(q) - 2 I(p, q) (the variation of information), or its normalized version \mathrm{NMI}(p, q) = \frac{I(p, q)}{\sqrt{H(p) H(q)}}.
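Using the same contingency-table notation as above (marginals a, b over N samples), these quantities can be computed as in the following sketch; the function and variable names are mine:

```python
from math import log, sqrt

def info_measures(n):
    """H(U), H(V), I(U,V), VI and NMI from a contingency table n."""
    a = [sum(row) for row in n]
    b = [sum(col) for col in zip(*n)]
    N = sum(a)
    h_u = -sum(ai / N * log(ai / N) for ai in a if ai > 0)
    h_v = -sum(bj / N * log(bj / N) for bj in b if bj > 0)
    mi = sum(nij / N * log(N * nij / (ai * bj))
             for ai, row in zip(a, n)
             for bj, nij in zip(b, row) if nij > 0)
    vi = h_u + h_v - 2 * mi                 # the distance D above
    nmi = mi / sqrt(h_u * h_v)              # the normalized version
    return h_u, h_v, mi, vi, nmi
```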

The authors propose a corrected-for-chance version of the information-theoretic measures, and their experiments suggest it works quite well. Their criteria are
\mathrm{AMI}(U, V) = \frac{I(U, V) - \mathbb{E}( I(M) \mid a, b)}{\sqrt{H(U) H(V)} - \mathbb{E}(I(M) \mid a, b)}
or
\mathrm{AVI} = \frac{2 I(U, V) - 2 \mathbb{E}(I(M) \mid a, b)}{H(U) + H(V) - 2 \mathbb{E}(I(M) \mid a, b)}
where \mathbb{E}(I(M) \mid a, b) is the expected mutual information over random contingency tables M with the same marginals a and b, computed in closed form from the hypergeometric model.
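For illustration, here is a sketch of the correction term and of AMI, assuming the hypergeometric model with fixed marginals described above (it reuses info_measures from the previous sketch and is not the authors' code):

```python
from math import comb, log, sqrt

def expected_mi(a, b):
    """E(I(M) | a, b): expected MI over random tables with marginals a, b."""
    N = sum(a)
    emi = 0.0
    for ai in a:
        for bj in b:
            # n_ij ranges over the values allowed by the fixed marginals;
            # the k = 0 term contributes nothing and is skipped.
            for k in range(max(1, ai + bj - N), min(ai, bj) + 1):
                p = comb(ai, k) * comb(N - ai, bj - k) / comb(N, bj)
                emi += p * (k / N) * log(N * k / (ai * bj))
    return emi

def adjusted_mi(n):
    """AMI(U, V) from a contingency table n, following the formula above."""
    a = [sum(row) for row in n]
    b = [sum(col) for col in zip(*n)]
    h_u, h_v, mi, _, _ = info_measures(n)
    emi = expected_mi(a, b)
    return (mi - emi) / (sqrt(h_u * h_v) - emi)
```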

The authors provide examples where the correction is necessary, especially when the number of samples per cluster is small.
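A quick way to see this effect (using scikit-learn's implementations of NMI and AMI rather than the paper's code) is to score two independent random labelings with only a few samples per cluster: NMI tends to stay well above 0 while AMI stays near 0.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
n_samples, n_clusters = 50, 10                  # only ~5 samples per cluster
u = rng.integers(0, n_clusters, n_samples)
v = rng.integers(0, n_clusters, n_samples)      # unrelated to u

print("NMI:", normalized_mutual_info_score(u, v, average_method="geometric"))
print("AMI:", adjusted_mutual_info_score(u, v))
```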
